What Is ChatGPT Doing … and Why Does It Work?

A comprehensive analysis with paragraph-by-paragraph commentary

👤 Stephen Wolfram 📅 February 2023 📖 16 Chapters
175B Parameters · 12,288 Embedding Dims · 96 Attention Heads · ~300B Training Words · ~50K Vocabulary

Part 1: The Original Essay with Commentary

Each paragraph of the essay is followed by detailed explanations, context, and technical insights

Chapter 1

It's Just Adding One Word at a Time

That ChatGPT can automatically generate something that reads even superficially like human-written text is remarkable, and unexpected. But how does it do it? And why does it work? My purpose here is to give a rough outline of what's going on inside ChatGPT—and then to explore why it is that it can do so well in producing what we might consider to be meaningful text.
💡
Commentary & Analysis

Why This Opening Matters

Wolfram begins by acknowledging what many felt in early 2023: genuine surprise that a machine could produce coherent, meaningful text. This sets up the essay's central question—not just "how" but "why" it works.

Key Insight

The word "unexpected" is crucial. Even AI researchers were surprised by how well large language models performed. This wasn't a foregone conclusion but a genuine discovery about the nature of language.

What to Watch For

  • Wolfram will argue that ChatGPT's success reveals something fundamental about language itself
  • The "rough outline" approach means accessible explanations over mathematical rigor
The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a "reasonable continuation" of whatever text it's got so far, where by "reasonable" we mean "what one might expect someone to write after seeing what people have written on billions of webpages, etc."
💡
Commentary & Analysis

The Core Mechanism Revealed

This single sentence captures ChatGPT's entire purpose: produce reasonable continuations. Not "understand," not "think," not "reason"—simply continue text in a statistically plausible way.

What "Reasonable" Means

Reasonable = statistically likely based on training data. If billions of web pages show certain patterns after certain phrases, ChatGPT learns these patterns.

Important Distinction

This is fundamentally different from how humans write. We have intentions, knowledge, and goals. ChatGPT has statistical patterns. Yet the outputs can be remarkably similar.

So let's say we've got the text "The best thing about AI is its ability to". Imagine scanning billions of pages of human-written text (say on the web and in digitized books) and finding all instances of this text—then seeing what word comes next what fraction of the time.
💡
Commentary & Analysis

The Conceptual Model

Wolfram introduces a simplified mental model: imagine literally counting what words follow this exact phrase in all text ever written. This isn't exactly how ChatGPT works, but it's the right intuition.

Why This Example?

The phrase "The best thing about AI is its ability to" sets up natural continuations like:

  • "learn" (very common)
  • "adapt" (common)
  • "process" (common)
  • "dance" (very rare)

Scale Matters

"Billions of pages" emphasizes the massive scale. ChatGPT was trained on roughly 300 billion words—more than any human could read in thousands of lifetimes.

Key Concept
ChatGPT operates by repeatedly asking: "Given all the text so far, what should the next word be?" It answers this by computing probabilities based on patterns learned from training data.
ChatGPT effectively does something like this, except that (as I'll explain) it doesn't look at literal text; it looks for things that in a certain sense "match in meaning". But the end result is that it produces a ranked list of words that might follow, together with "probabilities".
💡
Commentary & Analysis

"Match in Meaning"

This is a crucial distinction. ChatGPT doesn't do literal string matching. Instead, it uses embeddings—numerical representations where semantically similar text has similar numbers.

The Output: Probabilities

For any context, ChatGPT outputs roughly 50,000 probabilities (one for each token in its vocabulary). These probabilities sum to 1.0, forming a complete probability distribution.

Example Output

For "The best thing about AI is its ability to":

  • "learn" → 0.15 (15%)
  • "adapt" → 0.08 (8%)
  • "process" → 0.06 (6%)
  • ... thousands more options with smaller probabilities
📊
Probability Distribution Visualization
The original article shows a bar chart of word probabilities for the next token. Words like "learn," "help," and "solve" have higher bars, while unusual continuations have tiny probabilities.
And the remarkable thing is that when ChatGPT does something like write an essay what it's essentially doing is just asking over and over again "given the text so far, what should the next word be?"—and each time adding a word.
💡
Commentary & Analysis

The Iterative Process

This reveals the autoregressive nature of language models. Each word depends only on previous words. There's no "planning ahead" or "thinking about the whole essay."

Step by Step

  1. Start with prompt: "Write an essay about..."
  2. Compute probabilities for next word
  3. Select a word (we'll see how)
  4. Add it to the context
  5. Repeat from step 2

Why This Is Remarkable

This simple loop produces coherent essays, code, poetry, and more. The complexity emerges from the 175 billion parameters, not from sophisticated reasoning algorithms.
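
Sketch: The Generation Loop
The five steps above fit in a few lines of code. This is a minimal sketch, with a hypothetical model_probabilities function standing in for the real neural network described in later chapters:

import numpy as np

def generate(model_probabilities, prompt_tokens, n_new_tokens=50):
    # model_probabilities is a hypothetical callable: given the tokens so far,
    # it returns one probability per vocabulary token (summing to 1).
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        probs = model_probabilities(tokens)                 # step 2: next-token probabilities
        next_token = np.random.choice(len(probs), p=probs)  # step 3: pick a token
        tokens.append(next_token)                           # step 4: extend the context
    return tokens                                           # step 5: loop back to step 2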

But, OK, at each step it gets a list of words with probabilities. But which one should it actually pick to add to the essay (or whatever) that it's writing? One might think it should be the "highest-ranked" word (i.e. the one to which the highest "probability" was assigned). But this is where a bit of voodoo begins to creep in.
💡
Commentary & Analysis

The Selection Problem

Here Wolfram identifies a key design decision: given probabilities, how do we choose? The naive answer (always pick the highest) turns out to be wrong.

"Voodoo"

Wolfram's use of "voodoo" signals that the solution lacks rigorous theoretical justification. It works empirically, but we don't fully understand why.

Why Not Always Pick the Top?

If you always pick the highest probability word, you get:

  • Repetitive text ("the the the...")
  • Boring, predictable outputs
  • Getting stuck in loops
Because it turns out that if we always pick the highest-ranked word, we'll typically get a very "flat" essay, that never seems to "show any creativity" (and even sometimes repeats word for word). But if sometimes (at random) we pick lower-ranked words, we get a "more interesting" essay.
💡
Commentary & Analysis

The Creativity Paradox

This is counterintuitive: adding randomness makes text better. Randomness introduces variety, surprise, and the appearance of creativity.

Human Parallel

Humans don't always choose the most obvious word either. Sometimes we use unexpected vocabulary, make creative leaps, or introduce variety for its own sake.

The Balance

Too deterministic = boring and repetitive
Too random = nonsensical and incoherent
The sweet spot = "creative" and "interesting"

There's a particular so-called "temperature" parameter that determines how often lower-ranked words will be used, and for essay generation, it turns out that a "temperature" of 0.8 seems best.
💡
Commentary & Analysis

Temperature Explained

Temperature is a hyperparameter that controls the "sharpness" of the probability distribution:

  • T = 0: Always pick the highest probability (deterministic)
  • T = 1: Sample directly from the distribution
  • T > 1: Flatten the distribution (more random)
  • T = 0.8: Slight preference for high-probability, with some randomness

Mathematical Definition

adjusted_probability_i = p_i^(1/T) / Σ_j p_j^(1/T). Equivalently, divide each logit by T before applying the softmax.

Why 0.8?

Empirically determined through experimentation. Different tasks may benefit from different temperatures (code generation often uses lower temperatures for accuracy).

Technical Detail: Temperature Sampling
# Simplified temperature sampling (NumPy sketch)
import numpy as np

def sample_with_temperature(logits, temperature=0.8):
    # Adjust logits by temperature: T < 1 sharpens, T > 1 flattens the distribution
    adjusted = np.asarray(logits, dtype=np.float64) / temperature

    # Convert to probabilities with a numerically stable softmax
    exp = np.exp(adjusted - np.max(adjusted))
    probs = exp / exp.sum()

    # Sample and return the index of the chosen token in the vocabulary
    return np.random.choice(len(probs), p=probs)
Chapter 2

Where Do the Probabilities Come From?

OK, so ChatGPT always picks its next word based on probabilities. But where do those probabilities come from? Let's start with a simpler problem. Let's consider generating English text one letter (rather than word) at a time.
💡
Commentary & Analysis

Pedagogical Strategy

Wolfram starts with letters instead of words to build intuition. Letters are simpler: only 26 options vs. ~50,000 tokens. The principles transfer directly.

Why This Matters

Understanding how letter-level models fail helps us appreciate why word-level models with billions of parameters are necessary.

How can we work out what the probability for each letter should be? A very minimal thing we could do is just take a sample of English text, and calculate how often different letters occur in it. So, for example, this is a histogram of letter frequencies in the Wikipedia article on "cats":
💡
Commentary & Analysis

Frequency Analysis

The simplest possible language model: count letter frequencies. In English:

  • 'e' appears ~12% of the time
  • 't' appears ~9% of the time
  • 'z' appears ~0.1% of the time

Practical Example

The "cats" Wikipedia article provides real data. Different articles have slightly different distributions, but large samples converge to consistent English letter frequencies.

📊
Letter Frequency Histogram
The original shows a bar chart with 'e', 'a', 't', 'o', 'i' having the tallest bars, while 'z', 'q', 'x' have very short bars. This reflects standard English letter frequencies.
And here's a sample of "random text" we get by just picking each letter independently with the same probability that it appears in the Wikipedia article on cats: "tletoramsleraunsouemrctacosyfmtsalrceapmsyaefpnte..."
💡
Commentary & Analysis

The Result: Gibberish

This random string has the right letter frequencies but is completely unreadable. Why? Because English isn't just about individual letter frequencies—it's about how letters combine.

What's Missing

  • No word boundaries
  • No common letter combinations ("th", "er", "ing")
  • No prohibition of impossible sequences ("qx", "zz" at start)

This demonstrates that context matters.
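
Sketch: Letter-by-Letter Sampling
Here is a minimal sketch of this zeroth-order generator. The frequencies are approximate standard English values, not the exact counts from the "cats" article:

import numpy as np

# Approximate English letter frequencies (percent); illustrative values only
letters = list("etaoinshrdlcumwfgypbvkjxqz")
freqs = np.array([12.7, 9.1, 8.2, 7.5, 7.0, 6.7, 6.3, 6.1, 6.0, 4.3, 4.0, 2.8, 2.8,
                  2.4, 2.4, 2.2, 2.0, 2.0, 1.9, 1.5, 1.0, 0.8, 0.15, 0.15, 0.10, 0.07])
probs = freqs / freqs.sum()

# Picking each letter independently gives the right frequencies but no structure
print("".join(np.random.choice(letters, size=60, p=probs)))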

We can add a bit more "Englishness" by considering not just how probable each individual letter is on its own, but how probable pairs of letters ("2-grams") are. We know, for example, that if we have a "q", the next letter basically has to be "u".
💡
Commentary & Analysis

Introducing N-grams

N-grams capture sequential patterns:

  • 1-gram (unigram): Individual letter frequencies
  • 2-gram (bigram): Pairs like "th", "qu", "er"
  • 3-gram (trigram): Triples like "the", "ing", "tion"

The "qu" Example

In English, 'q' is almost always followed by 'u'. A bigram model captures this: P(u|q) ≈ 0.99. This single rule dramatically improves text generation.

Key Concept
N-gram models improve by considering context, but face a fundamental problem: the number of possible n-grams grows exponentially. With 40,000 words, there are 1.6 billion possible 2-grams and 60 trillion possible 3-grams—far more than we can observe in any corpus.
If we take a large enough corpus of English text we can get a pretty good estimate of the probability for any pair of letters. And with those estimates we can start generating text by picking pairs of letters according to their "2-gram probabilities".
💡
Commentary & Analysis

Bigram Text Generation

Process:

  1. Start with a random letter (weighted by frequency)
  2. Look up probabilities for next letter given current letter
  3. Sample from those probabilities
  4. Repeat

Improvement

Bigram-generated text looks more English-like: "theres" might appear, common patterns like "th" and "er" emerge naturally.
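
Sketch: Bigram Generation
Here is a sketch of the same bigram procedure in code, run on a toy corpus (any large body of English text could be substituted):

import numpy as np
from collections import defaultdict

def bigram_counts(text):
    # Count how often each character follows each other character
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start="t", length=60):
    out = [start]
    for _ in range(length):
        followers = counts[out[-1]]
        if not followers:                      # no observed successors: stop
            break
        chars = list(followers.keys())
        weights = np.array(list(followers.values()), dtype=float)
        out.append(np.random.choice(chars, p=weights / weights.sum()))
    return "".join(out)

corpus = "the cat sat on the mat and the rat ate the oats " * 50   # toy corpus
print(generate(bigram_counts(corpus)))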

Well, we can get more and more "Englishy" by using longer and longer n-grams. And with sufficiently long n-grams we can start generating text that looks quite convincingly like English.
💡
Commentary & Analysis

The Pattern

Longer context = better predictions:

  • 3-grams: Recognizable fragments emerge
  • 5-grams: Some real words appear
  • 10-grams: Sentences start forming

But There's a Catch...

The exponential explosion makes long n-grams impractical with direct counting. This motivates the need for models that can generalize rather than memorize.

Let's go back to words now. English has around 40,000 "reasonably commonly used" words. And by looking at a large enough corpus of English text (say a few million books, or a few billion webpages), we can get fairly good estimates of how common each word is.
💡
Commentary & Analysis

Word-Level Statistics

Moving from letters to words dramatically increases vocabulary size: 26 letters → 40,000+ words.

Word Frequencies

Common words in English:

  • "the" ~7% of all words
  • "be" ~4%
  • "to" ~3%
  • Most words < 0.01%

Zipf's Law

Word frequencies follow a power law: a few words are very common, most words are rare. This pattern appears in all human languages.

But how about "2-grams" for words? In principle there are about 40,000×40,000 ≈ 1.6 billion possible 2-grams. And of these, there actually appear in reasonable English text the order of a million or so. But even so, we can get good estimates for their relative frequencies.
💡
Commentary & Analysis

The Combinatorial Explosion

This is the critical insight: 1.6 billion possible word pairs, but only about a million commonly appear. This means:

  • Most combinations never occur in training data
  • Direct counting fails for rare combinations
  • We need models that generalize

Practical Implication

Even with billions of pages of text, most possible word sequences have never been observed. Yet ChatGPT must assign probabilities to all of them.

But how about "3-grams" for words? There are about 60 trillion of these that can be formed, and again only a tiny fraction appear in actual texts. And for 4-grams, or 5-grams, and more, the situation is vastly worse. So we've run out of data to get "empirical" estimates.
💡
Commentary & Analysis

The Data Crisis

N-gram | Possible Combinations (40,000-word vocabulary) | Observed in real text
2-gram | 1.6 billion | ~1 million
3-gram | ~60 trillion | ~a few million
4-gram | ~2.6 quintillion | ~tens of millions
5-gram | ~10^23 | vanishingly few

The Solution

We can't count our way to language understanding. We need models that generalize—neural networks that learn underlying patterns rather than memorizing specific sequences.
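
Sketch: The Combinatorial Explosion
The numbers in the table are just powers of the assumed 40,000-word vocabulary:

vocab = 40_000
for n in range(2, 6):
    print(f"{n}-grams: {vocab**n:.2e} possible combinations")
# 2-grams: 1.60e+09   3-grams: 6.40e+13   4-grams: 2.56e+18   5-grams: 1.02e+23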

The idea is to make a model that lets us estimate the probabilities of sequences we've never explicitly seen. And at the core of ChatGPT is precisely a so-called "large language model" (LLM) that's been built to do a good job of estimating these probabilities.
💡
Commentary & Analysis

The LLM Solution

Large Language Models solve the data problem by learning patterns rather than instances. Key capabilities:

  • Generalize from seen to unseen sequences
  • Capture long-range dependencies
  • Learn hierarchical representations of language

How It Works

Instead of storing "P(word|context)" directly, LLMs learn functions that compute these probabilities from learned representations of words and contexts.

Chapter 3

What Is a Model?

Say we want to know (as Galileo did back in the late 1500s) how long it takes a cannon ball to fall from each floor of the Tower of Pisa. Well, we could just measure it in each case and make a table of the results.
💡
Commentary & Analysis

The Galileo Analogy

Wolfram uses this historical example to contrast two approaches:

  • Empirical: Measure every case, store in a table
  • Theoretical: Find a formula that predicts all cases

The Parallel to Language

N-gram counting = measuring each floor
Neural networks = finding the underlying formula

Or we can do what is the essence of theoretical science: make a model that gives us a procedure for computing the answer, rather than just measuring and remembering each case.
💡
Commentary & Analysis

The Power of Models

Models compress knowledge. Instead of storing millions of measurements, we store a few parameters and a procedure. For falling objects: t = √(2h/g)

Neural Networks as Models

ChatGPT's 175 billion parameters encode a "procedure" for computing word probabilities. It's astronomically more complex than t = √(2h/g), but the principle is the same: compress patterns into computable functions.
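
Sketch: A Formula Instead of a Table
As a concrete contrast between "table of measurements" and "model", here is the falling-ball formula applied to a few illustrative floor heights (not actual Tower of Pisa measurements):

import math

g = 9.81   # gravitational acceleration, m/s^2
# Illustrative floor heights in meters
for h in [10, 20, 30, 40, 50]:
    t = math.sqrt(2 * h / g)   # the whole "table of measurements" compressed into one formula
    print(f"h = {h:2d} m  ->  t = {t:.2f} s")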

There's never a "model-less model". Any model you use has some particular underlying structure—then a certain set of "knobs you can turn" (i.e. parameters you can set) to fit your data.
💡
Commentary & Analysis

Model = Structure + Parameters

Every model has two components:

  1. Structure: The architecture (linear, polynomial, neural network)
  2. Parameters: The adjustable values (weights, biases)

ChatGPT's Structure

Structure: Transformer architecture with attention mechanisms
Parameters: 175 billion weights learned from training

Key Concept
A model provides a procedure for computing answers rather than storing every case. ChatGPT's 175 billion parameters encode patterns from training data, allowing it to generalize to sequences it has never seen.
Chapter 4

Models for Human-Like Tasks

So far our examples of tasks have involved fairly simple, numerical data. But what about tasks that we humans consider, well, "human tasks"—like recognizing images, or understanding text? The key idea is that for such tasks, there typically isn't any obvious set of simple underlying equations.
💡
Commentary & Analysis

The Leap to Human Tasks

Physics has equations like F=ma. What's the equation for "this image contains a cat"? There isn't one we can write down simply.

Why This Matters

For centuries, science progressed by finding simple equations. Human-like tasks require a different approach: learn the pattern from examples rather than deriving it from first principles.

But it turns out that a crucial fact—that is at the core of the success of neural nets—is that neural nets are somehow successful at implicitly doing this kind of thing. And with "enough examples" they can be made to produce what would seem like useful "human-like" outputs.
💡
Commentary & Analysis

The Empirical Discovery

Neural networks weren't proven to work theoretically first—they were discovered to work empirically. Given enough data and parameters, they produce human-like outputs.

"Somehow"

Wolfram's use of "somehow" acknowledges our incomplete understanding. We know neural nets work, but explaining why they work as well as they do remains an open research question.

🖌
Handwritten Digit Recognition
The original shows examples of handwritten digits (0-9) in various styles. A neural network must recognize "2" whether written in cursive, block letters, or with flourishes—capturing the essence of "two-ness" rather than matching exact pixels.
But wait: are we sure the neural net can learn to recognize all the 2s we might care about? The answer is basically no. We don't have a mathematical theorem that tells us this. All we can do is try it out—and it seems to work.
💡
Commentary & Analysis

No Guarantees

This is a crucial admission: there's no mathematical proof that neural networks will work for any particular task. We rely on:

  • Empirical testing
  • Historical success on similar tasks
  • Faith in the generalization ability of large models

Practical Implications

This explains why AI systems can fail unexpectedly on edge cases—there's no theoretical guarantee they handle all inputs correctly.

Chapter 5

Neural Nets

OK, so how do neural nets actually work? At their core they're based on simple idealizations of how brains seem to work. In a human brain there are about 100 billion neurons, each capable of producing an electrical pulse up to perhaps a thousand times a second.
💡
Commentary & Analysis

The Biological Inspiration

Key parallels between brains and neural networks:

  • Brain: 100 billion neurons, ~1000 activations/second
  • ChatGPT: 175 billion parameters, billions of operations/second

Important Caveat

Modern neural networks are inspired by but not faithful to biological neurons. Real neurons are far more complex, with dendritic computation, timing effects, and biochemical processes that artificial neurons don't model.

The neurons are connected in a complicated net, with each neuron effectively getting input from thousands of others—with the outputs of the neurons affected by different "weights".
💡
Commentary & Analysis

Connection Weights

The "weights" are the learnable parameters:

  • Higher weight = stronger connection
  • Negative weight = inhibitory connection
  • Learning = adjusting weights

Scale

In ChatGPT, 175 billion weights encode all learned knowledge. Each weight is a single number (typically 16 or 32 bits).

A typical neural net might have perhaps just a few layers, and perhaps just a few thousand neurons. But ChatGPT has about 400 (core) layers, with a total of about 175 billion connections, and therefore 175 billion weights.
💡
Commentary & Analysis

Scale Comparison

Model | Layers | Parameters
Simple classifier | 2-3 | thousands
ResNet-50 (images) | 50 | 25 million
GPT-2 | ~50 | 1.5 billion
GPT-3/ChatGPT | ~400 (96 transformer blocks) | 175 billion

Why So Deep?

Deeper networks can learn more abstract representations. Layer 1 might detect basic patterns; layer 100 might represent concepts; layer 400 might encode complex reasoning patterns.

Single Neuron Computation
output = f(w · x + b)

Where:
  x = input vector (from previous layer)
  w = weights (learned parameters)
  b = bias term (learned threshold)
  f = activation function (e.g., ReLU)

Example with ReLU:
  inputs: [0.5, -0.3, 0.8]
  weights: [0.2, 0.5, -0.1]
  bias: 0.1

  weighted_sum = (0.5×0.2) + (-0.3×0.5) + (0.8×-0.1) + 0.1
               = 0.1 - 0.15 - 0.08 + 0.1 = -0.03

  output = max(0, -0.03) = 0  (ReLU sets negatives to 0)
The function f is usually the same for all neurons. In earlier days it was typically a sigmoid or tanh function. But nowadays it's more often a ReLU—which just takes the input and sets any negative value to zero.
💡
Commentary & Analysis

Activation Functions Evolution

  • Sigmoid: σ(x) = 1/(1+e^-x), outputs in (0,1)
  • Tanh: Similar shape, outputs in (-1,1)
  • ReLU: f(x) = max(0,x), simple and fast

Why ReLU Won

  • Computationally simple (just a comparison)
  • No "vanishing gradient" problem
  • Sparse activation (many zeros = efficient)
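
Sketch: A Single Neuron in NumPy
For readers who want to run the single-neuron example from earlier in this chapter, here is the same computation in NumPy with a ReLU activation:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b):
    # output = f(w · x + b) with f = ReLU
    return relu(np.dot(w, x) + b)

x = np.array([0.5, -0.3, 0.8])   # inputs from the worked example
w = np.array([0.2, 0.5, -0.1])   # weights
b = 0.1                          # bias
print(neuron(x, w, b))           # -> 0.0 (the weighted sum -0.03 is clipped by ReLU)
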
Chapter 6

Machine Learning, and the Training of Neural Nets

We've seen that neural nets can do some remarkable things. But how do we get them to do what we want? The basic idea of "machine learning" is to have a procedure that progressively adjusts the parameters ("weights") of the neural net to make it do better.
💡
Commentary & Analysis

Learning = Parameter Adjustment

The network structure is fixed. Learning means finding the right weights. For ChatGPT, this means finding 175 billion numbers that make it good at predicting text.

The Training Loop

  1. Make a prediction
  2. Compare to correct answer
  3. Adjust weights to reduce error
  4. Repeat billions of times
At the core of machine learning is the idea of "gradient descent". One imagines one's parameters as defining a position on a "landscape"—defined by the loss function. Then the idea is to progressively follow the path of steepest descent down this landscape to its minimum.
💡
Commentary & Analysis

The Landscape Metaphor

Imagine the loss as height on a mountain range:

  • High loss = mountain peak (bad)
  • Low loss = valley bottom (good)
  • Gradient descent = rolling downhill

Mathematical Reality

With 175 billion parameters, this "landscape" has 175 billion dimensions—impossible to visualize but mathematically tractable using calculus.

📈
Learning Curve
The original shows a typical training curve: loss starts high and decreases rapidly at first, then more slowly, eventually plateauing. This curve shows training progress over time.
Key Concept
Backpropagation uses the chain rule of calculus to compute how each weight affects the loss, allowing efficient gradient computation through hundreds of layers. This makes training deep networks computationally feasible.
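
Sketch: Gradient Descent on a Toy Loss
A toy illustration of the gradient descent mechanics, using a made-up one-parameter quadratic loss (ChatGPT's real loss is defined over 175 billion parameters):

def loss(w):
    return (w - 3.0) ** 2          # toy loss whose minimum sits at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)         # derivative of the loss

w = 0.0                            # arbitrary starting point
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * gradient(w)   # take a step downhill
print(w, loss(w))                      # w ends up very close to 3, loss close to 0
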
Chapter 7

The Practice and Lore of Neural Net Training

There's an awful lot of "lore" about neural net training. Fundamentally much of it is about what architecture of neural net to use, how to set up training, and what data to train on.
💡
Commentary & Analysis

Art vs. Science

"Lore" suggests knowledge passed down through practice rather than derived from theory. Neural network training involves many decisions that work empirically but lack theoretical justification.

Key Decisions

  • Architecture: How many layers? What types?
  • Hyperparameters: Learning rate, batch size, etc.
  • Data: How much? What preprocessing?
Chapter 8

"Surely a Network That's Big Enough Can Do Anything!"

There's a notion that if we just had a "big enough" neural net it would be able to do anything. But this isn't true. The fundamental issue is the phenomenon of "computational irreducibility".
💡
Commentary & Analysis

The Limits of Learning

No matter how big, neural networks can't:

  • Solve problems requiring step-by-step computation
  • Prove mathematical theorems reliably
  • Predict computationally irreducible systems

Computational Irreducibility

Some processes can't be predicted without running them step-by-step. No shortcut exists—not even for infinitely large neural networks.

Key Concept
ChatGPT's success reveals that essay-writing is "computationally shallower" than we thought—it doesn't require solving computationally irreducible problems, just pattern matching and continuation.
Chapter 9

The Concept of Embeddings

Neural nets—at least as they're set up today—are fundamentally based on numbers. So if we're going to deal with something like text, we need some way to represent it in terms of numbers. The idea of embeddings is to represent something—say, a word—by an array of numbers in such a way that "nearby things" correspond to nearby arrays of numbers.
💡
Commentary & Analysis

Words as Vectors

Each word becomes a point in high-dimensional space:

  • "king" → [0.2, -0.5, 0.8, ...] (12,288 numbers)
  • "queen" → [0.21, -0.48, 0.79, ...] (nearby!)
  • "banana" → [-0.7, 0.3, -0.1, ...] (far away)

Why This Works

Words appearing in similar contexts get similar embeddings. "King" and "queen" appear in similar sentences, so their vectors cluster together.

But what's remarkable is that somehow in this embedding space we can often do things like find that "king" - "man" + "woman" = "queen"—presumably reflecting the meaning that a queen is to women what a king is to men.
💡
Commentary & Analysis

Vector Arithmetic on Meaning

This famous example shows embeddings capture relationships:

  • vector(king) - vector(man) = "royalty" direction
  • vector(woman) + "royalty" ≈ vector(queen)

Other Examples

  • Paris - France + Italy ≈ Rome
  • bigger - big + small ≈ smaller

Limitations

This doesn't always work perfectly—embeddings capture statistical co-occurrence, not true understanding of meaning.
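
Sketch: Vector Arithmetic on Toy Embeddings
A toy version of this vector arithmetic; the three-dimensional vectors are invented for illustration, whereas real GPT-3 embeddings have 12,288 learned dimensions:

import numpy as np

# Made-up 3-dimensional "embeddings", chosen so the analogy works
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.9, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = emb["king"] - emb["man"] + emb["woman"]
closest = max(emb, key=lambda word: cosine(emb[word], target))
print(closest)   # -> "queen" with these toy vectors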

Chapter 10

Inside ChatGPT

OK, so we're finally ready to discuss what's inside ChatGPT. And, yes, ultimately, it's a giant neural net—currently a version of the so-called GPT-3 network with 175 billion weights.
💡
Commentary & Analysis

The Scale

175 billion weights means 175 billion learned numbers. If each is 2 bytes (float16), that's 350 GB just for weights—requiring multiple high-end GPUs to run.

Historical Context

GPT-3 (2020) was 100x larger than GPT-2 (2019). ChatGPT fine-tuned GPT-3 for conversation.

In many ways the most remarkable thing about ChatGPT is that all it needs to do is to generate text—and to do that it just has to compute successive probabilities for "what token to add next". At its core it's a neural net that's been set up to produce that.
💡
Commentary & Analysis

Remarkable Simplicity

The task is deceptively simple: just predict the next token. Yet from this simple objective emerges:

  • Coherent paragraphs
  • Logical reasoning (sometimes)
  • Creative writing
  • Code generation
  • Question answering

Emergence

Complex capabilities emerge from the simple task of next-token prediction at sufficient scale.

The most important thing about transformer neural nets like the ones used in ChatGPT is a piece called an "attention block". The idea of attention is that it provides a way for the sequence of tokens being processed to "pay attention to" (and draw information from) tokens that preceded it in the sequence.
💡
Commentary & Analysis

The Key Innovation

Attention solves the "long-range dependency" problem. Without attention, early tokens would be "forgotten" by the time later tokens are processed.

How Attention Works

  1. Each token creates Query, Key, and Value vectors
  2. Query asks "what am I looking for?"
  3. Keys answer "what do I contain?"
  4. Matching Query-Key pairs retrieve Values

Example

In "The cat sat on the mat because it was tired", the word "it" needs to attend to "cat" to resolve the reference—attention enables this.

ChatGPT Architecture Summary
┌─────────────────────────────────────────────────┐
│                 ChatGPT (GPT-3)                  │
├─────────────────────────────────────────────────┤
│ Parameters:           175,000,000,000            │
│ Layers:               96 transformer blocks      │
│ Attention Heads:      96 per block               │
│ Embedding Dimensions: 12,288                     │
│ Context Length:       4,096 tokens (original)    │
│ Vocabulary Size:      ~50,257 tokens             │
├─────────────────────────────────────────────────┤
│ Processing Flow:                                 │
│   Input Tokens                                   │
│        ↓                                         │
│   Token Embedding + Position Embedding           │
│        ↓                                         │
│   Attention Block 1 (96 attention heads)         │
│        ↓                                         │
│   ... Attention Blocks 2-95 ...                  │
│        ↓                                         │
│   Attention Block 96                             │
│        ↓                                         │
│   Output: 50,257 probability values              │
└─────────────────────────────────────────────────┘
Chapter 11

The Training of ChatGPT

ChatGPT's training corpus was essentially "all of the web" (i.e. a few billion pages of text, with a trillion or so words), together with a few million books, and other sources.
💡
Commentary & Analysis

Training Data Scale

Source | Volume
Web pages | ~1 trillion words (raw crawl)
Books | ~100 billion words
Other sources | additional billions of words
Used in training | ~300 billion tokens

Data Quality

Not all web text is equal. Training includes filtering for quality, though the exact criteria aren't public.

Chapter 12

Beyond Basic Training

The raw GPT-3 model was trained just to "complete text". But ChatGPT was further trained using something called RLHF—"Reinforcement Learning from Human Feedback"—which effectively taught it to produce outputs that humans rate as "good".
💡
Commentary & Analysis

The RLHF Process

  1. Supervised Fine-tuning: Train on human-written responses
  2. Reward Model Training: Humans rank model outputs
  3. Policy Optimization: Use RL to maximize reward model score

Why RLHF Matters

Raw GPT-3 might complete "How do I make a bomb?" with actual instructions. RLHF teaches the model to refuse harmful requests while being helpful for legitimate ones.

Chapter 13

What Really Lets ChatGPT Work?

Human language—and the thought processes behind it—have always seemed to us to be somehow very special. And so it's seemed like something "AI-complete" to be able to produce human language and have human-like conversations. But now ChatGPT can do these things. So what's going on? Is this telling us that human language—and thought—are in some sense less special than we believed?
💡
Commentary & Analysis

The Deep Question

ChatGPT's success forces us to reconsider: what makes human language special? If a statistical model can produce convincing language, perhaps language itself is more statistical than we thought.

Two Interpretations

  • Optimistic: We've achieved AI that truly understands language
  • Skeptical: Language is simpler than we thought; ChatGPT exploits this without true understanding
Key Insight
ChatGPT's success suggests that producing human-like text requires less computational sophistication than we assumed. The patterns in human language are learnable by statistical methods—a discovery about language itself, not just about AI.
Chapter 14

Meaning Space and Semantic Laws of Motion

We've talked about ChatGPT working with embeddings. And we can think of these embeddings as defining a kind of "meaning space" in which words, sentences and larger pieces of text get placed.
💡
Commentary & Analysis

The Geometric View

Meaning becomes geometry. Related concepts cluster; transitions between ideas become "movements" through this space.

Implications

  • Reasoning = trajectories through meaning space
  • Creativity = novel paths through the space
  • Coherence = smooth, connected paths
Chapter 15

Semantic Grammar and the Power of Computational Language

We've been talking so far about the impressive ability of ChatGPT to deal with human natural language. But Wolfram|Alpha uses a different kind of language: computational language, specifically the Wolfram Language.
💡
Commentary & Analysis

Two Types of Language

  • Natural Language: Ambiguous, contextual, human
  • Computational Language: Precise, formal, executable

The Synthesis

Wolfram proposes combining ChatGPT's natural language abilities with Wolfram Language's computational precision—getting the best of both worlds.

Chapter 16

So ... What Is ChatGPT Doing, and Why Does It Work?

The basic answer is that what ChatGPT is doing is generating text by successively adding one token at a time, each time choosing its next token by sampling from a probability distribution that's been "learned" by training on a large corpus of text.
💡
Commentary & Analysis

The Summary

After 15 chapters, the answer is simple:

  1. Predict next token probabilities
  2. Sample from those probabilities
  3. Repeat

Why It Works

Because language has learnable patterns, and 175 billion parameters can capture enough of them to produce convincingly human-like text.

Final Insight
ChatGPT's success is a scientific discovery about the nature of human language: producing coherent text is computationally shallower than we assumed. The patterns underlying human language can be learned through statistical methods from sufficient examples.

Part 2: Product Requirements Document

Transforming Wolfram's essay into an educational product

Executive Summary

This PRD defines the requirements for converting Stephen Wolfram's comprehensive essay into a multi-format educational resource. The goal is to make these insights accessible to diverse audiences while preserving technical accuracy.

Target Audiences

Audience | Technical Level | Primary Need | Format Preference
Executives | Beginner | Strategic understanding | 2-page summary
Developers | Intermediate | Implementation details | Code examples, diagrams
ML Engineers | Advanced | Technical depth | Mathematical detail
Students | Beginner-Intermediate | Learning path | Interactive course
Educators | All levels | Teaching materials | Modular content

Content Modules

Module 1: Foundations

Chapters 1-3

"How ChatGPT Generates Text"

  • Token-by-token generation
  • Temperature and sampling
  • N-gram limitations
  • Model fundamentals
Module 2: Neural Networks

Chapters 4-7

"Neural Networks Explained"

  • Neurons and layers
  • Activation functions
  • Training and backprop
  • Practice and lore
Module 3: Architecture

Chapters 8-12

"Inside the Transformer"

  • Embeddings
  • Attention mechanism
  • GPT architecture
  • Training at scale
Module 4: Theory

Chapters 13-16

"Why It Works"

  • Computational limits
  • Meaning space
  • Semantic grammar
  • Conclusions

Technical Specifications

ChatGPT Key Specifications

Parameters: 175,000,000,000
Embedding Dimensions: 12,288
Attention Heads: 96 per block
Transformer Blocks: 96
Training Data: ~300 billion tokens
Vocabulary Size: ~50,257 tokens
Context Length: 4,096 tokens (original)
Temperature (typical): 0.8

Glossary

Token
The basic unit of text processing—a word, subword, or character that the model operates on.
Embedding
A numerical vector representation of a token where similar meanings have similar vectors.
Attention
A mechanism that allows tokens to "look at" and draw information from other tokens in the sequence.
Transformer
The neural network architecture using attention mechanisms, forming the basis of GPT models.
Temperature
A parameter controlling randomness in token selection—lower is more deterministic, higher is more creative.
Loss Function
A measure of how wrong the model's predictions are—training minimizes this value.
Gradient Descent
The optimization algorithm that adjusts weights to minimize loss by following the steepest downhill direction.
Backpropagation
The algorithm for computing gradients through the network using the chain rule of calculus.
RLHF
Reinforcement Learning from Human Feedback—a technique for fine-tuning models based on human preferences.
Computational Irreducibility
The property of processes that cannot be predicted without running them step-by-step—a fundamental limit on what neural networks can learn.