What Is ChatGPT Doing … and Why Does It Work?

Complete illustrated analysis with expandable commentary on every paragraph

👤 Stephen Wolfram 📅 February 2023 🖼️ 80+ Original Images
175B Parameters · 12,288 Embedding Dims · 96 Attention Heads · ~300B Training Words · ~50K Vocabulary

The Complete Illustrated Essay

All original images from Wolfram's essay with expandable commentary on every paragraph

Chapter 1

It's Just Adding One Word at a Time

ChatGPT text continuation concept
Figure 1.1: The fundamental concept—ChatGPT generates text by continuing what came before, one word at a time.
That ChatGPT can automatically generate something that reads even superficially like human-written text is remarkable, and unexpected. But how does it do it? And why does it work? My purpose here is to give a rough outline of what's going on inside ChatGPT—and then to explore why it is that it can do so well in producing what we might consider to be meaningful text.
💡 Analysis

Why This Opening Matters

Wolfram begins by acknowledging genuine surprise—even AI researchers didn't expect language models to work this well. The word "unexpected" is crucial: this wasn't a foregone conclusion but a discovery.

Key Questions Posed

  • How: What are the mechanisms?
  • Why: What makes it possible at all?
The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a "reasonable continuation" of whatever text it's got so far, where by "reasonable" we mean "what one might expect someone to write after seeing what people have written on billions of webpages, etc."
💡 Analysis

The Core Mechanism

This single sentence captures ChatGPT's entire purpose: produce reasonable continuations. Not "understand," not "think"—simply continue text in a statistically plausible way based on patterns from training data.

Important Distinction

"Reasonable" = statistically likely based on what humans have written. This is fundamentally different from human intentional writing, yet produces remarkably similar outputs.

So let's say we've got the text "The best thing about AI is its ability to". Imagine scanning billions of pages of human-written text (say on the web and in digitized books) and finding all instances of this text—then seeing what word comes next what fraction of the time.
💡 Analysis

The Mental Model

This simplified mental model helps build intuition. Imagine literally counting what follows this phrase across all text ever written. That's not exactly how ChatGPT works, but it's the right intuition for understanding probability-based generation.
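
To make the counting intuition concrete, here is a minimal Python sketch that does the literal counting on a tiny invented corpus (the corpus and prompt are made up for illustration; ChatGPT does not actually work by lookup):

from collections import Counter

# Toy corpus standing in for "billions of pages" (illustrative only).
corpus = (
    "the best thing about ai is its ability to learn . "
    "the best thing about ai is its ability to predict . "
    "the best thing about ai is its ability to learn ."
).split()

prompt = "its ability to".split()

# Find every occurrence of the prompt and count the word that follows it.
next_words = Counter(
    corpus[i + len(prompt)]
    for i in range(len(corpus) - len(prompt))
    if corpus[i:i + len(prompt)] == prompt
)

total = sum(next_words.values())
for word, count in next_words.most_common():
    print(f"{word!r}: {count / total:.2f}")   # 'learn': 0.67, 'predict': 0.33

On this toy corpus, "learn" follows the prompt twice and "predict" once, giving estimated probabilities of 0.67 and 0.33.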

Word probability rankings
Figure 1.2: Probability rankings for words following "The best thing about AI is its ability to"—words like "learn" and "predict" rank highest.
ChatGPT effectively does something like this, except that (as I'll explain) it doesn't look at literal text; it looks for things that in a certain sense "match in meaning". But the end result is that it produces a ranked list of words that might follow, together with "probabilities".
💡 Analysis

"Match in Meaning"

This hints at embeddings—numerical representations where semantically similar text has similar numbers. ChatGPT doesn't do literal string matching but operates in a "meaning space."

The Output

For any context, ChatGPT outputs ~50,000 probabilities (one per token in vocabulary). These sum to 1.0, forming a complete probability distribution.
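
In code terms, "sum to 1.0" is the softmax step. A hedged sketch with a made-up five-word vocabulary (real models emit roughly 50,000 scores per step):

import math

# Made-up raw scores ("logits") for a tiny vocabulary (illustrative only).
logits = {"learn": 4.5, "predict": 3.5, "make": 3.2, "understand": 3.1, "do": 2.9}

# Softmax: exponentiate and normalize so the values form a probability distribution.
exps = {w: math.exp(v) for w, v in logits.items()}
total = sum(exps.values())
probs = {w: e / total for w, e in exps.items()}

print(probs)                 # each word's probability
print(sum(probs.values()))   # 1.0, up to floating-point rounding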

GPT-2 model retrieval
Figure 1.3: Retrieving the GPT-2 model in Wolfram Language
Top words query
Figure 1.4: Querying top 5 probable next words
Dataset formatting
Figure 1.5: Formatting results into a structured dataset showing words and their probabilities
And the remarkable thing is that when ChatGPT does something like write an essay what it's essentially doing is just asking over and over again "given the text so far, what should the next word be?"—and each time adding a word.
💡 Analysis

The Iterative Process

This reveals the autoregressive nature: each word depends only on previous words. There's no "planning ahead" or "thinking about the whole essay"—just repeated next-word prediction.

Why It's Remarkable

This simple loop produces coherent essays, code, poetry, and more. Complexity emerges from 175 billion parameters, not sophisticated reasoning.
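
A sketch of that loop, with a tiny hand-written probability table standing in for the 175-billion-parameter network (the table and words are invented; a real model conditions on the entire context, not just the last word):

import random

# Toy stand-in for the model: fixed next-word probabilities per previous word.
TOY_MODEL = {
    "the": {"cat": 0.5, "dog": 0.3, "best": 0.2},
    "cat": {"sat": 0.6, "slept": 0.4},
    "dog": {"ran": 0.7, "sat": 0.3},
    "sat": {"quietly": 1.0},
    "ran": {"away": 1.0},
    "slept": {"soundly": 1.0},
}

def next_word_distribution(words):
    # ChatGPT would compute this from the whole context with a neural net.
    return TOY_MODEL.get(words[-1], {"the": 1.0})

def generate(prompt, n_words):
    words = prompt.split()
    for _ in range(n_words):
        probs = next_word_distribution(words)
        word = random.choices(list(probs), weights=list(probs.values()))[0]
        words.append(word)                      # add one word at a time
    return " ".join(words)

print(generate("the", 5))   # e.g. "the cat sat quietly the cat"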

Iterative generation
Figure 1.6: Iterative text generation using highest-probability words at each step
But, OK, at each step it gets a list of words with probabilities. But which one should it actually pick to add to the essay (or whatever) that it's writing? One might think it should be the "highest-ranked" word (i.e. the one to which the highest "probability" was assigned). But this is where a bit of voodoo begins to creep in.
💡 Analysis

The Selection Problem

Given probabilities, how do we choose? The naive answer (always pick the highest) turns out to be wrong. "Voodoo" signals that this solution lacks rigorous theoretical justification—it simply works empirically.

Zero temperature output
Figure 1.7: The problem with always selecting highest-probability words: repetitive, confused text that goes in circles.
Because it turns out that if we always pick the highest-ranked word, we'll typically get a very "flat" essay, that never seems to "show any creativity" (and even sometimes repeats word for word). But if sometimes (at random) we pick lower-ranked words, we get a "more interesting" essay.
💡 Analysis

The Creativity Paradox

Counterintuitively, adding randomness makes text better. This introduces variety, surprise, and the appearance of creativity.

The Balance

Too deterministic = boring/repetitive
Too random = nonsensical
Sweet spot = "creative" and coherent

Temperature sampling code
Figure 1.8: Code for temperature-based sampling at T=0.8
Multiple outputs
Figure 1.9: Five different outputs with temperature sampling—much more varied!
There's a particular so-called "temperature" parameter that determines how often lower-ranked words will be used, and for essay generation, it turns out that a "temperature" of 0.8 seems best.
💡 Analysis

Temperature Explained

  • T=0: Always pick highest probability (deterministic)
  • T=0.8: Slight randomness, good for essays
  • T=1: Sample directly from distribution
  • T>1: More random, more "creative"

Why 0.8?

Empirically determined—it "works." Different tasks may need different temperatures (code often uses lower).
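
A minimal sketch of temperature sampling, assuming a next-word distribution is already in hand (the probabilities below are made up). Reweighting each probability by the exponent 1/T is equivalent to dividing the logits by T before the softmax:

import random

def sample_with_temperature(probs, temperature):
    # T = 0: deterministic argmax. Otherwise reweight by 1/T and renormalize,
    # so low T sharpens the distribution and high T flattens it.
    if temperature == 0:
        return max(probs, key=probs.get)
    words = list(probs)
    weights = [probs[w] ** (1.0 / temperature) for w in words]
    return random.choices(words, weights=weights)[0]

# Made-up next-word probabilities for illustration.
probs = {"learn": 0.40, "predict": 0.25, "make": 0.20, "understand": 0.15}
print(sample_with_temperature(probs, 0))     # always "learn"
print(sample_with_temperature(probs, 0.8))   # usually "learn", sometimes others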

Probability distribution
Figure 1.10: Log-log plot showing power-law distribution of word probabilities—a few words are very likely, most are rare.
GPT-2 extended output
Figure 1.11: GPT-2 (smaller model) extended generation—somewhat coherent but tends to wander
GPT-3 zero temp
Figure 1.12: GPT-3 with temperature=0 (deterministic)
GPT-3 temp 0.8
Figure 1.13: GPT-3 with temperature=0.8—much more natural!
Key Concept
ChatGPT operates by repeatedly asking "what should the next word be?" and sampling from probability distributions. Temperature controls the randomness—0.8 works well for creative text.
Chapter 2

Where Do the Probabilities Come From?

OK, so ChatGPT always picks its next word based on probabilities. But where do those probabilities come from? Let's start with a simpler problem. Let's consider generating English text one letter (rather than word) at a time.
💡 Analysis

Pedagogical Strategy

Wolfram starts with letters (26 options) instead of words (~50,000 tokens) to build intuition. The principles transfer directly but are easier to visualize at letter scale.

Letter frequency cats
Figure 2.1: Letter frequencies from Wikipedia's "cats" article
Letter frequency dogs
Figure 2.2: Letter frequencies from Wikipedia's "dogs" article
Aggregated frequencies
Figure 2.3: Aggregated letter frequencies from large English text sample—'e' is most common
And here's a sample of "random text" we get by just picking each letter independently with the same probability that it appears in the Wikipedia article on cats: "tletoramsleraunsouemrctacosyfmtsalrceapmsyaefpnte..."
💡 Analysis

The Result: Gibberish

This has correct letter frequencies but is unreadable. English isn't just about individual letter frequencies—it's about how letters combine. This demonstrates that context matters.

Random letters
Figure 2.4: Random letters by frequency—gibberish
With spaces
Figure 2.5: Adding spaces—still gibberish
Word length distribution
Figure 2.6: With realistic word lengths—fake words but still not English
We can add a bit more "Englishness" by considering not just how probable each individual letter is on its own, but how probable pairs of letters ("2-grams") are. We know, for example, that if we have a "q", the next letter basically has to be "u".
💡 Analysis

Introducing N-grams

  • 1-gram: Individual letter frequencies
  • 2-gram: Pairs like "th", "qu", "er"
  • 3-gram: Triples like "the", "ing"

In English, P(u|q) ≈ 0.99. This single rule dramatically improves generation!
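
A small sketch of a letter-level 2-gram generator, using one short made-up sentence in place of a large corpus (illustrative only):

import random
from collections import Counter, defaultdict

# Tiny sample standing in for a large English corpus.
text = "the cat chased the other cat through the quiet garden and the cat quit"
letters = [c for c in text if c.isalpha()]

# Count 2-grams: how often each letter follows each other letter.
counts = defaultdict(Counter)
for a, b in zip(letters, letters[1:]):
    counts[a][b] += 1           # note: 'q' is only ever followed by 'u' here

def next_letter(prev):
    options = counts[prev]
    return random.choices(list(options), weights=list(options.values()))[0]

# Generate 40 letters, each drawn from the distribution after the previous one.
out = ["t"]
for _ in range(40):
    out.append(next_letter(out[-1]))
print("".join(out))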

Single letter probabilities
Figure 2.7: Single letter probability distribution
2-gram matrix
Figure 2.8: 2-gram probability matrix—note 'q' column only has 'u'
2-gram text
Figure 2.9: Text generated with 2-grams—some real words start appearing!
Progressive n-grams
Figure 2.10: Progression from 2-grams to higher n-grams—increasingly realistic
Let's go back to words now. English has around 40,000 "reasonably commonly used" words. And by looking at a large enough corpus of English text (say a few million books, or a few billion webpages), we can get fairly good estimates of how common each word is.
💡 Analysis

Scale Jump

Moving from 26 letters to 40,000+ words dramatically increases complexity. Word frequencies follow Zipf's Law: a few words are very common ("the" ~7%), most are rare.

Random word sequence
Figure 2.11: Random words by frequency—grammatically nonsensical
Word 2-grams
Figure 2.12: Word-level 2-grams starting with "cat"—slightly more coherent
But how about "2-grams" for words? In principle there are about 40,000×40,000 ≈ 1.6 billion possible 2-grams. And of these, only on the order of a million or so actually appear in reasonable English text.
💡 Analysis

The Combinatorial Explosion

N-gram    Possible combinations       Observed in real text
2-gram    ~1.6 billion                ~1 million
3-gram    ~64 trillion                ~a few million
4-gram    ~2.6 quintillion (2.6×10^18)  ~tens of millions

Most combinations never occur—we need models that generalize!

Key Concept
Direct probability counting fails because there are more possible word sequences than we could ever observe. This motivates neural networks: models that learn patterns and generalize to unseen combinations.
Chapter 3

What Is a Model?

Say we want to know (as Galileo did back in the late 1500s) how long it takes a cannon ball to fall from each floor of the Tower of Pisa. Well, we could just measure it in each case and make a table of the results.
💡 Analysis

The Galileo Analogy

Two approaches to knowledge:

  • Empirical: Measure every case, store in table
  • Theoretical: Find formula that predicts all cases

N-gram counting = measuring each floor
Neural networks = finding the formula

Fall time data
Figure 3.1: Hypothetical fall time data from different heights
Linear fit
Figure 3.2: Simple linear model—doesn't fit well
Quadratic fit
Figure 3.3: Quadratic model—much better fit!
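
The "find a formula" approach can be sketched as a least-squares polynomial fit. The measurements below are synthetic, generated from t = sqrt(2h/g) plus noise rather than taken from Wolfram's figures:

import numpy as np

# Synthetic fall-time measurements (seconds) for various drop heights (meters).
heights = np.array([5.0, 10.0, 20.0, 30.0, 40.0, 50.0])
times = np.sqrt(2 * heights / 9.81) + np.random.normal(0, 0.02, heights.size)

# Two candidate structures with "knobs to turn": a line and a quadratic.
linear = np.polyfit(heights, times, 1)
quadratic = np.polyfit(heights, times, 2)

for name, coeffs in [("linear", linear), ("quadratic", quadratic)]:
    mse = np.mean((np.polyval(coeffs, heights) - times) ** 2)
    print(f"{name}: mean squared error = {mse:.5f}")   # quadratic fits better

The point is not the particular polynomial; it is that choosing a structure and then fitting its parameters is what "modeling" means here.
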
There's never a "model-less model". Any model you use has some particular underlying structure—then a certain set of "knobs you can turn" (i.e. parameters you can set) to fit your data.
💡 Analysis

Model = Structure + Parameters

  • Structure: Architecture (linear, polynomial, neural net)
  • Parameters: Adjustable values (weights, biases)

ChatGPT: Transformer structure + 175 billion parameters

Poor model fit
Figure 3.4: Wrong structure = poor fit regardless of parameters. Model choice matters!
Key Concept
A model provides a procedure for computing answers rather than storing every case. ChatGPT's 175 billion parameters encode patterns that let it generalize to text it's never seen.
Chapter 4

Models for Human-Like Tasks

So far our examples of tasks have involved fairly simple, numerical data. But what about tasks that we humans consider, well, "human tasks"—like recognizing images, or understanding text?
💡 Analysis

The Leap to Human Tasks

Physics has equations like F=ma. What's the equation for "this image contains a cat"? There isn't one we can write simply. Human-like tasks require learning patterns from examples.

Handwritten digits
Figure 4.1: Handwritten digits 0-9—humans recognize these instantly, but how?
Single digit examples
Figure 4.2: Many examples of a single digit—enormous variation in handwriting
Distorted digits
Figure 4.3: The same digit with distortions, rotations, modifications—still recognizable!
Digit recognition network
Figure 4.4: How a neural network processes digit images to produce classification
Blurred digits
Figure 4.5: Progressive blurring—at what point does recognition fail? Where does certainty end?
Key Concept
Neural networks learn to extract hierarchical features from data—edges become shapes, shapes become parts, parts become concepts. No explicit programming, just learning from examples.
Chapter 5

Neural Nets

OK, so how do neural nets actually work? At their core they're based on simple idealizations of how brains seem to work. In a human brain there are about 100 billion neurons, each capable of producing an electrical pulse up to perhaps a thousand times a second.
💡 Analysis

Biological Inspiration

  • Brain: 100 billion neurons, ~1000 activations/sec
  • ChatGPT: 175 billion parameters, billions ops/sec

Modern neural nets are inspired by but not faithful to biology.

Neural network diagram
Figure 5.1: Architecture of a digit-recognition network—11 layers transforming pixels to probabilities
1s and 2s
Figure 5.2: The classification task: distinguish 1s from 2s
Voronoi diagram
Figure 5.3: Voronoi diagram showing "attractor basins"—each region flows to one point
Three points
Figure 5.4: Simple nearest-point classification task
Simple network
Figure 5.5: Simple layered neural network architecture
Data flow
Figure 5.6: How values propagate through layers—each neuron computes weighted sum + activation
ReLU activation
Figure 5.7: The ReLU activation function: f(x) = max(0, x)—simple but powerful
Neuron functions
Figure 5.8: Different neurons compute different functions depending on their weights
Classification boundary
Figure 5.9: Trained network's classification boundary—dividing space into regions
Network size comparison
Figure 5.10: Larger networks produce smoother, more accurate boundaries
Digit separation
Figure 5.11: How the network separates different handwritten digits
Cat dog classification
Figure 5.12: Cat vs dog classification with probability scores
Cat image
Figure 5.13: Input cat image for feature analysis
First layer features
Figure 5.14: First layer features—edges and basic patterns
Tenth layer features
Figure 5.15: Tenth layer—abstract, harder to interpret
Single Neuron Computation
output = f(w · x + b)

Where:
  x = input vector
  w = weights (learned)
  b = bias (learned)
  f = activation (e.g., ReLU)

ReLU: f(x) = max(0, x)
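
The same computation written out as runnable Python, with arbitrary example weights and inputs (not values from any trained network):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def neuron(x, w, b):
    # output = f(w · x + b)
    return relu(np.dot(w, x) + b)

def layer(x, W, b):
    # A layer is many neurons sharing the same input: f(W @ x + b).
    return relu(W @ x + b)

x = np.array([0.5, -1.2, 3.0])              # input vector
w = np.array([0.4, 0.1, -0.2])              # "learned" weights (example values)
b = 0.05                                    # "learned" bias (example value)
print(neuron(x, w, b))                      # single neuron output

W = np.array([[0.4, 0.1, -0.2],
              [-0.3, 0.8, 0.5]])            # a layer of 2 neurons, 3 inputs each
print(layer(x, W, np.array([0.05, -0.1])))  # outputs of the 2-neuron layer
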
Chapter 6

Machine Learning, and the Training of Neural Nets

We've seen that neural nets can do some remarkable things. But how do we get them to do what we want? The basic idea of "machine learning" is to have a procedure that progressively adjusts the parameters ("weights") of the neural net to make it do better.
💡 Analysis

Learning = Parameter Adjustment

Network structure is fixed. Learning means finding the right weights. For ChatGPT: find 175 billion numbers that make it good at predicting text.

Target function
Figure 6.1: Target function the network should learn
Simple network
Figure 6.2: Simple network with one input and output
Random weights
Figure 6.3: Random weights produce random functions—nothing like target
Training progress
Figure 6.4: Progressive training—network gradually learns the target function
Loss curve
Figure 6.5: Loss decreasing during training—the classic learning curve
At the core of machine learning is the idea of "gradient descent". One imagines one's parameters as defining a position on a "landscape"—defined by the loss function. Then the idea is to progressively follow the path of steepest descent down this landscape to its minimum.
💡 Analysis

The Landscape Metaphor

  • High loss = mountain peak (bad)
  • Low loss = valley bottom (good)
  • Gradient descent = rolling downhill

With 175B parameters, this "landscape" has 175B dimensions!
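
A minimal sketch of gradient descent on a two-parameter toy "landscape" (the bowl-shaped loss below is invented for illustration; ChatGPT's actual loss lives in a 175-billion-dimensional space):

# Gradient descent on a simple two-parameter loss surface.
def loss(w1, w2):
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2     # minimum at (3, -1)

def gradient(w1, w2):
    return 2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)          # partial derivatives

w1, w2 = 0.0, 0.0                # starting point on the landscape
learning_rate = 0.1
for step in range(50):
    g1, g2 = gradient(w1, w2)
    w1 -= learning_rate * g1     # step downhill along the steepest slope
    w2 -= learning_rate * g2

print(w1, w2, loss(w1, w2))      # approaches (3, -1) with loss near 0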

2D loss landscape
Figure 6.6: 2D loss landscape—contours show equal loss
Gradient descent path
Figure 6.7: Gradient descent path following steepest descent
Multiple solutions
Figure 6.8: Different starting points → different solutions
Extrapolation
Figure 6.9: Different solutions extrapolate differently
Key Concept
Backpropagation uses the chain rule to compute how each weight affects the loss, enabling efficient gradient computation through hundreds of layers. This makes training deep networks feasible.
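
A tiny worked example of that chain rule, for a one-neuron "network" y = ReLU(w·x + b) with squared-error loss (all numbers arbitrary):

# Forward pass
x, target = 2.0, 1.0
w, b = 0.5, 0.1
z = w * x + b                    # pre-activation: 1.1
y = max(0.0, z)                  # ReLU output: 1.1
loss = (y - target) ** 2         # squared error: 0.01

# Backward pass (chain rule): dloss/dw = dloss/dy * dy/dz * dz/dw
dloss_dy = 2.0 * (y - target)    # 0.2
dy_dz = 1.0 if z > 0 else 0.0    # ReLU derivative
dloss_dw = dloss_dy * dy_dz * x  # 0.4
dloss_db = dloss_dy * dy_dz      # 0.2
print(loss, dloss_dw, dloss_db)

Backpropagation applies exactly this bookkeeping, layer by layer, to every weight in the network.
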
Chapter 7

The Practice and Lore of Neural Net Training

There's an awful lot of "lore" about neural net training. Fundamentally much of it is about what architecture of neural net to use, how to set up training, and what data to train on.
💡 Analysis

Art vs. Science

"Lore" = knowledge passed through practice, not theory. Many decisions work empirically but lack theoretical justification. Architecture, hyperparameters, data—all involve craft knowledge.

Network architecture details
Figure 7.1: Detailed layer-by-layer architecture showing feature transformation
Small network limits
Figure 7.2: Small networks have limited capacity—can't fit complex functions
Training monitor
Figure 7.3: Training interface showing loss reduction over time
Chapter 8

"Surely a Network That's Big Enough Can Do Anything!"

There's a notion that if we just had a "big enough" neural net it would be able to do anything. But this isn't true. The fundamental issue is the phenomenon of "computational irreducibility".
💡 Analysis

Fundamental Limits

Some computations can't be shortcut—they require step-by-step execution. No neural network, regardless of size, can bypass computational irreducibility.

Cellular automaton
Figure 8.1: Cellular automaton evolution—an example of computational irreducibility. You can't predict the future without computing each step.
Key Insight
ChatGPT's success reveals that essay-writing is "computationally shallower" than we thought. Language follows patterns that can be learned—it doesn't require solving computationally irreducible problems.
Chapter 9

The Concept of Embeddings

Neural nets—at least as they're set up today—are fundamentally based on numbers. So if we're going to deal with something like text, we need some way to represent it in terms of numbers.
💡 Analysis

Words as Vectors

Each word becomes a point in high-dimensional space. Similar meanings → similar vectors. "King" and "queen" are close; "king" and "banana" are far.

Word embeddings 2D
Figure 9.1: Word embeddings projected to 2D—semantically similar words cluster together
Digit network layers
Figure 9.2: Network architecture showing transformation from pixels to embeddings to outputs
Softmax output
Figure 9.3: Final softmax output—near-certainty for digit "4"
Pre-softmax
Figure 9.4: Pre-softmax layer—the raw embedding
Multiple embeddings
Figure 9.5: Embeddings for different instances of 4s and 8s—similar digits cluster
3D embedding space
Figure 9.6: 3D projection of digit embeddings—clear clustering by digit identity
GPT-2 embeddings
Figure 9.7: Raw GPT-2 embedding vectors—768 numbers per word
Key Concept
Embeddings convert words to vectors where semantic similarity = geometric proximity. The famous example: king - man + woman ≈ queen. Meaning becomes geometry.
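
A hedged sketch of "meaning becomes geometry" with invented 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and these particular numbers come from no trained model):

import numpy as np

# Made-up 3-D "embeddings" for a handful of words.
emb = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "banana": np.array([-0.7, 0.2, -0.5]),
}

def cosine(a, b):
    # Cosine similarity: 1 = same direction, 0 = unrelated, negative = opposed.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["king"], emb["queen"]))    # relatively high
print(cosine(emb["king"], emb["banana"]))   # low (negative)

# In this toy space, king - man + woman lands nearest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
print(max(emb, key=lambda w: cosine(emb[w], target)))   # "queen"
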
Chapter 10

Inside ChatGPT

OK, so we're finally ready to discuss what's inside ChatGPT. And, yes, ultimately, it's a giant neural net—currently a version of the so-called GPT-3 network with 175 billion weights.
💡 Analysis

The Scale

175 billion parameters × 2 bytes = 350 GB just for the weights. Running it requires multiple high-end GPUs. It was among the largest publicly known models of its era.

Embedding module
Figure 10.1: The embedding module—combines token embeddings with positional embeddings
Hello bye embeddings
Figure 10.2: Token and position embeddings visualized for "hello" and "bye" sequences
The most important thing about transformer neural nets like the ones used in ChatGPT is a piece called an "attention block". The idea of attention is that it provides a way for the sequence of tokens being processed to "pay attention to" (and draw information from) tokens that preceded it in the sequence.
💡 Analysis

The Key Innovation

Attention solves long-range dependencies. In "The cat sat on the mat because it was tired," the word "it" needs to attend to "cat"—attention enables this connection across many tokens.
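
A sketch of the scaled dot-product attention a single head computes, with random matrices standing in for the learned query/key/value projections; a real transformer block runs many such heads in parallel and wraps them with residual connections, layer normalization, and an MLP:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Each token builds a query, scores it against every token's key,
    # and takes a probability-weighted mix of the corresponding values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)     # causal mask: no attending to the future
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 tokens, 8-dim embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)         # (5, 8): one new vector per token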

Attention block
Figure 10.3: Single attention block with multiple attention heads—the core of the transformer
Attention patterns
Figure 10.4: Attention weight patterns for 12 heads—each learns different relationships
Weight matrix
Figure 10.5: 768×768 weight matrix in fully-connected layer
Smoothed weights
Figure 10.6: Smoothed view revealing structure in weights
Attention across blocks
Figure 10.7: Attention patterns evolve across multiple transformer blocks
FC matrices across blocks
Figure 10.8: Fully-connected weight matrices from different layers—each encodes different features
Weight distributions
Figure 10.9: Weight magnitude distributions vary across layers

ChatGPT Architecture Summary

Parameters                175,000,000,000
Transformer Blocks        96 layers
Attention Heads/Block     96
Embedding Dimensions      12,288
Vocabulary Size           ~50,257 tokens
Context Length            4,096 tokens
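
As a back-of-the-envelope check that these numbers hang together, the standard rough count of ~12·d² weights per transformer block (an approximation, not an exact accounting of GPT-3's parameter layout) gives:

d_model, n_layers, vocab = 12_288, 96, 50_257

# Per block: ~4·d² for the attention projections plus ~8·d² for the MLP.
per_block = 12 * d_model ** 2
embeddings = vocab * d_model            # token embedding matrix

total = n_layers * per_block + embeddings
print(f"{total:,}")                     # ~174,600,000,000, close to the quoted 175B
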
Chapter 11

The Training of ChatGPT

ChatGPT's training corpus was essentially "all of the web" (i.e. a few billion pages of text, with a trillion or so words), together with a few million books, and other sources.
💡 Analysis

Training Data Scale

Source                      Volume
Web pages                   ~1 trillion words
Books                       ~100 billion words
Used in training (subset)   ~300 billion tokens
Key Fact
ChatGPT was trained on roughly 300 billion tokens. The number of parameters (~175B) is comparable to the training token count—a rough correspondence Wolfram notes, and one that worked well in practice for GPT-3.
Chapter 12

Beyond Basic Training

The raw GPT-3 model was trained just to "complete text". But ChatGPT was further trained using something called RLHF—"Reinforcement Learning from Human Feedback"—which effectively taught it to produce outputs that humans rate as "good".
💡 Analysis

RLHF Process

  1. Generate multiple responses
  2. Humans rank them
  3. Train reward model on preferences
  4. Use RL to optimize for reward

This makes ChatGPT helpful and safe, not just good at predicting text.
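
Step 3 is commonly implemented with a pairwise preference loss of roughly the following form; this is a generic formulation with made-up scores, not OpenAI's actual training code:

import math

def preference_loss(reward_chosen, reward_rejected):
    # Push the human-preferred response's reward above the rejected one's.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up reward-model scores for one human-ranked pair of responses.
print(preference_loss(2.1, 0.4))   # small loss: reward model agrees with the human
print(preference_loss(0.4, 2.1))   # large loss: reward model disagrees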

Chapter 13

What Really Lets ChatGPT Work?

Human language—and the thought processes behind it—have always seemed to us to be somehow very special. And so it's seemed like something "AI-complete" to be able to produce human language. But now ChatGPT can do these things. So what's going on?
💡 Analysis

The Deep Question

If a statistical model produces convincing language, maybe language itself is more statistical than we thought. This is a discovery about language, not just about AI.

Key Insight
ChatGPT's success suggests producing human-like text requires less computational sophistication than assumed. Language follows learnable patterns—a scientific discovery about language itself.
Chapter 14

Meaning Space and Semantic Laws of Motion

We've talked about ChatGPT working with embeddings. And we can think of these embeddings as defining a kind of "meaning space" in which words, sentences and larger pieces of text get placed.
💡 Analysis

Meaning as Geometry

Reasoning = trajectories through meaning space. Creativity = novel paths. Coherence = smooth, connected paths. Language generation follows "semantic laws of motion."

Chapter 15

Semantic Grammar and the Power of Computational Language

We've been talking so far about the impressive ability of ChatGPT to deal with human natural language. But Wolfram|Alpha uses a different kind of language: computational language.
💡 Analysis

Two Languages

  • Natural: Ambiguous, contextual, human
  • Computational: Precise, formal, executable

Combining both could give the best of both worlds.

Chapter 16

So ... What Is ChatGPT Doing, and Why Does It Work?

The basic answer is that what ChatGPT is doing is generating text by successively adding one token at a time, each time choosing its next token by sampling from a probability distribution that's been "learned" by training on a large corpus of text.
💡 Analysis

The Summary

After 15 chapters, it's simple: predict next token, sample, repeat. The magic is in 175 billion parameters capturing patterns from 300 billion training tokens.

Final Answer
ChatGPT generates text by repeatedly predicting next-token probabilities and sampling. It works because language follows learnable statistical patterns, and 175 billion parameters can capture enough of them. This is a scientific discovery about language itself.

Product Requirements Document

Educational product specification based on Wolfram's essay

Executive Summary

Transform Wolfram's comprehensive essay into accessible, multi-format educational resources for diverse audiences—from executives needing strategic understanding to engineers wanting technical depth.

Target Audiences

Audience       Level          Primary Need              Format
Executives     Beginner       Strategic understanding   2-page summary
Developers     Intermediate   Implementation details    Code examples
ML Engineers   Advanced       Technical depth           Full mathematics
Students       Progressive    Learning path             Interactive course

Technical Specifications

ChatGPT Key Numbers

Parameters        175,000,000,000
Embedding Dims    12,288
Attention Heads   96 per block
Layers            96 blocks
Training Data     ~300B tokens
Vocabulary        ~50,257
Context           4,096 tokens
Temperature       0.8 typical

Glossary

Token
Basic text unit—word or subword piece
Embedding
Vector representation where similar meanings = similar vectors
Attention
Mechanism for tokens to "look at" relevant preceding tokens
Transformer
Architecture using attention—basis of GPT models
Temperature
Controls randomness: 0=deterministic, higher=more random
Loss Function
Measures prediction error—training minimizes this
Gradient Descent
Optimization by following steepest downhill path
Backpropagation
Computing gradients through layers via chain rule
RLHF
Reinforcement Learning from Human Feedback
Computational Irreducibility
Problems requiring step-by-step computation—can't be shortcut