What Is ChatGPT Doing … and Why Does It Work?
A comprehensive analysis with paragraph-by-paragraph commentary
Part 1: The Original Essay with Commentary
It's Just Adding One Word at a Time
The Core Mechanism Revealed
This single sentence captures ChatGPT's entire purpose: produce reasonable continuations. Not "understand," not "think," not "reason"—simply continue text in a statistically plausible way.
What "Reasonable" Means
Reasonable = statistically likely based on training data. If billions of web pages show certain patterns after certain phrases, ChatGPT learns these patterns.
Important Distinction
This is fundamentally different from how humans write. We have intentions, knowledge, and goals. ChatGPT has statistical patterns. Yet the outputs can be remarkably similar.
The Conceptual Model
Wolfram introduces a simplified mental model: imagine literally counting what words follow this exact phrase in all text ever written. This isn't exactly how ChatGPT works, but it's the right intuition.
Why This Example?
The phrase "The best thing about AI is its ability to" sets up natural continuations like:
- "learn" (very common)
- "adapt" (common)
- "process" (common)
- "dance" (very rare)
Scale Matters
"Billions of pages" emphasizes the massive scale. ChatGPT was trained on roughly 300 billion words—more than any human could read in thousands of lifetimes.
"Match in Meaning"
This is a crucial distinction. ChatGPT doesn't do literal string matching. Instead, it uses embeddings—numerical representations where semantically similar text has similar numbers.
The Output: Probabilities
For any context, ChatGPT outputs roughly 50,000 probabilities (one for each token in its vocabulary). These probabilities sum to 1.0, forming a complete probability distribution.
Example Output
For "The best thing about AI is its ability to":
- "learn" → 0.15 (15%)
- "adapt" → 0.08 (8%)
- "process" → 0.06 (6%)
- ... thousands more options with smaller probabilities
The Iterative Process
This reveals the autoregressive nature of language models. Each word depends only on previous words. There's no "planning ahead" or "thinking about the whole essay."
Step by Step
1. Start with a prompt: "Write an essay about..."
2. Compute probabilities for the next word
3. Select a word (we'll see how)
4. Add it to the context
5. Repeat from step 2 (a runnable toy version of this loop is sketched below)
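Because the steps are so mechanical, the whole loop fits in a few lines. Here is a minimal, runnable Python sketch; the probability table is a hand-made toy standing in for ChatGPT's actual network, and every number in it is invented for illustration.
import random

# Toy "model": a hand-written table of next-word probabilities
# (invented numbers; ChatGPT computes these with a 175-billion-parameter network)
toy_probs = {
    "the": {"best": 0.6, "cat": 0.4},
    "best": {"thing": 1.0},
    "thing": {"about": 1.0},
    "about": {"AI": 1.0},
    "AI": {"is": 1.0},
    "is": {"its": 1.0},
    "its": {"ability": 1.0},
    "ability": {"to": 1.0},
    "to": {"learn": 0.6, "adapt": 0.3, "dance": 0.1},
}

def generate(prompt, max_new_words=10):
    words = prompt.split()
    for _ in range(max_new_words):
        options = toy_probs.get(words[-1])
        if not options:                      # no known continuation: stop
            break
        choices, weights = zip(*options.items())
        words.append(random.choices(choices, weights=weights)[0])  # sample the next word
    return " ".join(words)

print(generate("the best thing"))  # e.g. "the best thing about AI is its ability to learn"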
Why This Is Remarkable
This simple loop produces coherent essays, code, poetry, and more. The complexity emerges from the 175 billion parameters, not from sophisticated reasoning algorithms.
The Selection Problem
Here Wolfram identifies a key design decision: given probabilities, how do we choose? The naive answer (always pick the highest) turns out to be wrong.
"Voodoo"
Wolfram's use of "voodoo" signals that the solution lacks rigorous theoretical justification. It works empirically, but we don't fully understand why.
Why Not Always Pick the Top?
If you always pick the highest-probability word, you tend to get:
- Flat, repetitive text that keeps circling back to the same phrasings
- Predictable outputs that carry little information
- Loops in which the same fragment recurs again and again
The Creativity Paradox
This is counterintuitive: adding randomness makes text better. Randomness introduces variety, surprise, and the appearance of creativity.
Human Parallel
Humans don't always choose the most obvious word either. Sometimes we use unexpected vocabulary, make creative leaps, or introduce variety for its own sake.
The Balance
Too deterministic = boring and repetitive
Too random = nonsensical and incoherent
The sweet spot = "creative" and "interesting"
Temperature Explained
Temperature is a hyperparameter that controls the "sharpness" of the probability distribution:
- T = 0: Always pick the highest probability (deterministic)
- T = 1: Sample directly from the distribution
- T > 1: Flatten the distribution (more random)
- T = 0.8: Slight preference for high-probability, with some randomness
Mathematical Definition
Adjusted probability: p_i(T) = p_i^(1/T) / Σ_j p_j^(1/T). Equivalently, divide each logit by T before applying softmax, which is how it is usually implemented in practice.
Why 0.8?
Empirically determined through experimentation. Different tasks may benefit from different temperatures (code generation often uses lower temperatures for accuracy).
# Simplified temperature sampling (a runnable sketch using NumPy)
import numpy as np

def sample_with_temperature(logits, vocabulary, temperature=0.8):
    # Scale the logits by temperature: lower T sharpens, higher T flattens
    adjusted = np.asarray(logits, dtype=float) / temperature
    # Convert to probabilities with a numerically stable softmax
    exps = np.exp(adjusted - np.max(adjusted))
    probs = exps / exps.sum()
    # Sample one token from the resulting distribution
    return np.random.choice(vocabulary, p=probs)
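For example, with an invented three-token vocabulary and made-up logits, the function above behaves like this:
vocab = ["learn", "adapt", "dance"]
logits = [2.0, 1.0, -1.5]
for t in (0.2, 0.8, 2.0):
    print(t, sample_with_temperature(logits, vocab, temperature=t))
# Low temperature almost always returns "learn"; higher temperatures mix in the others.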
Where Do the Probabilities Come From?
Pedagogical Strategy
Wolfram starts with letters instead of words to build intuition. Letters are simpler: only 26 options vs. ~50,000 tokens. The principles transfer directly.
Why This Matters
Understanding how letter-level models fail helps us appreciate why word-level models with billions of parameters are necessary.
Frequency Analysis
The simplest possible language model: count letter frequencies. In English:
- 'e' appears ~12% of the time
- 't' appears ~9% of the time
- 'z' appears ~0.1% of the time
Practical Example
The "cats" Wikipedia article provides real data. Different articles have slightly different distributions, but large samples converge to consistent English letter frequencies.
The Result: Gibberish
This random string has the right letter frequencies but is completely unreadable. Why? Because English isn't just about individual letter frequencies—it's about how letters combine.
What's Missing
- No word boundaries
- No common letter combinations ("th", "er", "ing")
- No prohibition of impossible sequences ("qx", "zz" at start)
This demonstrates that context matters.
Introducing N-grams
N-grams capture sequential patterns:
- 1-gram (unigram): Individual letter frequencies
- 2-gram (bigram): Pairs like "th", "qu", "er"
- 3-gram (trigram): Triples like "the", "ing", "tion"
The "qu" Example
In English, 'q' is almost always followed by 'u'. A bigram model captures this: P(u|q) ≈ 0.99. This single rule dramatically improves text generation.
Bigram Text Generation
Process:
- Start with a random letter (weighted by frequency)
- Look up probabilities for next letter given current letter
- Sample from those probabilities
- Repeat
Improvement
Bigram-generated text looks more English-like: "theres" might appear, common patterns like "th" and "er" emerge naturally.
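Here is a runnable sketch of that procedure: count letter pairs in a small sample string, then generate new text by repeatedly sampling the next letter given the current one. The sample text is arbitrary; a real experiment would count pairs across something like the Wikipedia article mentioned above.
import random
from collections import Counter, defaultdict

# Count letter-pair (bigram) frequencies in a small sample of English text
sample = "the cat sat on the mat and the dog ate the cat food"
pair_counts = defaultdict(Counter)
for current, following in zip(sample, sample[1:]):
    pair_counts[current][following] += 1

def generate(start="t", length=40):
    text = start
    for _ in range(length):
        options = pair_counts[text[-1]]
        if not options:                      # last character never seen as a pair start
            break
        letters, weights = zip(*options.items())
        text += random.choices(letters, weights=weights)[0]  # sample next letter given current
    return text

print(generate())  # gibberish, but with English-like pairs such as "th" and "at"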
The Pattern
Longer context = better predictions:
- 3-grams: Recognizable fragments emerge
- 5-grams: Some real words appear
- 10-grams: Sentences start forming
But There's a Catch...
The exponential explosion makes long n-grams impractical with direct counting. This motivates the need for models that can generalize rather than memorize.
Word-Level Statistics
Moving from letters to words dramatically increases vocabulary size: 26 letters → 40,000+ words.
Word Frequencies
Common words in English:
- "the" ~7% of all words
- "be" ~4%
- "to" ~3%
- Most words < 0.01%
Zipf's Law
Word frequencies follow a power law: a few words are very common, most words are rare. This pattern appears in all human languages.
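A few lines are enough to compute this rank-frequency table for yourself; the sample string here is a stand-in, and on a large corpus the counts fall off roughly in proportion to 1/rank.
from collections import Counter

# Rank words by frequency (replace the sample with any large text you have)
words = "the cat sat on the mat and the dog sat on the cat".lower().split()
total = len(words)
for rank, (word, count) in enumerate(Counter(words).most_common(5), start=1):
    print(rank, word, count, round(count / total, 3))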
The Combinatorial Explosion
This is the critical insight: 1.6 billion possible word pairs, but only about a million commonly appear. This means:
- Most combinations never occur in training data
- Direct counting fails for rare combinations
- We need models that generalize
Practical Implication
Even with billions of pages of text, most possible word sequences have never been observed. Yet ChatGPT must assign probabilities to all of them.
The Data Crisis
| N-gram | Possible combinations (40,000-word vocabulary) | Actually observed in text |
|---|---|---|
| 2-gram | 1.6 × 10^9 (1.6 billion) | ~1 million |
| 3-gram | 6.4 × 10^13 (64 trillion) | a few million |
| 4-gram | 2.6 × 10^18 | tens of millions |
| 5-gram | 1.0 × 10^23 | a vanishingly small fraction |
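The "possible combinations" column is just the vocabulary size raised to the n-th power; a few lines reproduce it, assuming the ~40,000-word vocabulary used above.
# Possible n-gram combinations for a 40,000-word vocabulary
vocab_size = 40_000
for n in range(2, 6):
    print(f"{n}-gram: {vocab_size ** n:.1e} possible combinations")
# 2-gram: 1.6e+09, 3-gram: 6.4e+13, 4-gram: 2.6e+18, 5-gram: 1.0e+23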
The Solution
We can't count our way to language understanding. We need models that generalize—neural networks that learn underlying patterns rather than memorizing specific sequences.
The LLM Solution
Large Language Models solve the data problem by learning patterns rather than instances. Key capabilities:
- Generalize from seen to unseen sequences
- Capture long-range dependencies
- Learn hierarchical representations of language
How It Works
Instead of storing "P(word|context)" directly, LLMs learn functions that compute these probabilities from learned representations of words and contexts.
What Is a Model?
The Galileo Analogy
Wolfram uses this historical example to contrast two approaches:
- Empirical: Measure every case, store in a table
- Theoretical: Find a formula that predicts all cases
The Parallel to Language
N-gram counting = measuring each floor
Neural networks = finding the underlying formula
The Power of Models
Models compress knowledge. Instead of storing millions of measurements, we store a few parameters and a procedure. For falling objects: t = √(2h/g)
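For example, the fall time from a 5-meter drop follows directly from the formula, assuming g ≈ 9.8 m/s² and ignoring air resistance:
import math

# Fall time from height h, using t = sqrt(2h / g)
g = 9.8   # m/s^2
h = 5.0   # meters
print(math.sqrt(2 * h / g))  # ≈ 1.01 seconds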
Neural Networks as Models
ChatGPT's 175 billion parameters encode a "procedure" for computing word probabilities. It's astronomically more complex than t = √(2h/g), but the principle is the same: compress patterns into computable functions.
Model = Structure + Parameters
Every model has two components:
- Structure: The architecture (linear, polynomial, neural network)
- Parameters: The adjustable values (weights, biases)
ChatGPT's Structure
Structure: Transformer architecture with attention mechanisms
Parameters: 175 billion weights learned from training
Models for Human-Like Tasks
The Leap to Human Tasks
Physics has equations like F=ma. What's the equation for "this image contains a cat"? There isn't one we can write down simply.
Why This Matters
For centuries, science progressed by finding simple equations. Human-like tasks require a different approach: learn the pattern from examples rather than deriving it from first principles.
The Empirical Discovery
Neural networks weren't proven to work theoretically first—they were discovered to work empirically. Given enough data and parameters, they produce human-like outputs.
"Somehow"
Wolfram's use of "somehow" acknowledges our incomplete understanding. We know neural nets work, but explaining why they work as well as they do remains an open research question.
No Guarantees
This is a crucial admission: there's no mathematical proof that neural networks will work for any particular task. We rely on:
- Empirical testing
- Historical success on similar tasks
- Faith in the generalization ability of large models
Practical Implications
This explains why AI systems can fail unexpectedly on edge cases—there's no theoretical guarantee they handle all inputs correctly.
Neural Nets
The Biological Inspiration
Key parallels between brains and neural networks:
- Brain: 100 billion neurons, ~1000 activations/second
- ChatGPT: 175 billion parameters, billions of operations/second
Important Caveat
Modern neural networks are inspired by but not faithful to biological neurons. Real neurons are far more complex, with dendritic computation, timing effects, and biochemical processes that artificial neurons don't model.
Connection Weights
The "weights" are the learnable parameters:
- Higher weight = stronger connection
- Negative weight = inhibitory connection
- Learning = adjusting weights
Scale
In ChatGPT, 175 billion weights encode all learned knowledge. Each weight is a single number (typically 16 or 32 bits).
Scale Comparison
| Model | Layers | Parameters |
|---|---|---|
| Simple classifier | 2-3 | thousands |
| ResNet-50 (images) | 50 | 25 million |
| GPT-2 | ~50 | 1.5 billion |
| GPT-3/ChatGPT | 96 transformer blocks (~400 "core layers") | 175 billion |
Why So Deep?
Deeper networks can learn more abstract representations: early layers tend to pick up basic patterns, middle layers represent higher-level concepts, and the deepest layers can encode complex relationships between them.
output = f(w · x + b)
Where:
x = input vector (from previous layer)
w = weights (learned parameters)
b = bias term (learned threshold)
f = activation function (e.g., ReLU)
Example with ReLU:
inputs: [0.5, -0.3, 0.8]
weights: [0.2, 0.5, -0.1]
bias: 0.1
weighted_sum = (0.5×0.2) + (-0.3×0.5) + (0.8×-0.1) + 0.1
= 0.1 - 0.15 - 0.08 + 0.1 = -0.03
output = max(0, -0.03) = 0 (ReLU sets negatives to 0)
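The same arithmetic as a few lines of Python, reproducing the numbers above:
def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus bias, passed through ReLU
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, weighted_sum)

print(neuron([0.5, -0.3, 0.8], [0.2, 0.5, -0.1], 0.1))  # weighted sum is -0.03, so ReLU outputs 0.0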
Activation Functions Evolution
- Sigmoid: σ(x) = 1/(1+e^-x), outputs in (0,1)
- Tanh: Similar shape, outputs in (-1,1)
- ReLU: f(x) = max(0,x), simple and fast
Why ReLU Won
- Computationally simple (just a comparison)
- No "vanishing gradient" problem
- Sparse activation (many zeros = efficient)
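For reference, here is a small sketch of the three activation functions side by side, using NumPy:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes values into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # zero for negatives, identity for positives

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(name, np.round(f(x), 3))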
Machine Learning, and the Training of Neural Nets
Learning = Parameter Adjustment
The network structure is fixed. Learning means finding the right weights. For ChatGPT, this means finding 175 billion numbers that make it good at predicting text.
The Training Loop
- Make a prediction
- Compare to correct answer
- Adjust weights to reduce error
- Repeat billions of times
The Landscape Metaphor
Imagine the loss as height on a mountain range:
- High loss = mountain peak (bad)
- Low loss = valley bottom (good)
- Gradient descent = rolling downhill
Mathematical Reality
With 175 billion parameters, this "landscape" has 175 billion dimensions—impossible to visualize but mathematically tractable using calculus.
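A one-parameter toy makes the "rolling downhill" idea concrete. The loss function below is invented for illustration; real training does the same thing over 175 billion parameters at once, with gradients computed by backpropagation.
# Gradient descent on a toy one-parameter loss: L(w) = (w - 3)^2
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)            # derivative of the loss with respect to w

w = 0.0                               # start somewhere on the "landscape"
learning_rate = 0.1
for step in range(25):
    w -= learning_rate * gradient(w)  # take a small step downhill
print(w, loss(w))                     # w ends up close to 3, where the loss is lowest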
The Practice and Lore of Neural Net Training
Art vs. Science
"Lore" suggests knowledge passed down through practice rather than derived from theory. Neural network training involves many decisions that work empirically but lack theoretical justification.
Key Decisions
- Architecture: How many layers? What types?
- Hyperparameters: Learning rate, batch size, etc.
- Data: How much? What preprocessing?
"Surely a Network That's Big Enough Can Do Anything!"
The Limits of Learning
No matter how big, neural networks can't:
- Solve problems requiring step-by-step computation
- Prove mathematical theorems reliably
- Predict computationally irreducible systems
Computational Irreducibility
Some processes can't be predicted without running them step-by-step. No shortcut exists—not even for infinitely large neural networks.
The Concept of Embeddings
Words as Vectors
Each word becomes a point in high-dimensional space:
- "king" → [0.2, -0.5, 0.8, ...] (12,288 numbers)
- "queen" → [0.21, -0.48, 0.79, ...] (nearby!)
- "banana" → [-0.7, 0.3, -0.1, ...] (far away)
Why This Works
Words appearing in similar contexts get similar embeddings. "King" and "queen" appear in similar sentences, so their vectors cluster together.
Vector Arithmetic on Meaning
This famous example shows embeddings capture relationships:
- vector(king) - vector(man) = "royalty" direction
- vector(woman) + "royalty" ≈ vector(queen)
Other Examples
- Paris - France + Italy ≈ Rome
- bigger - big + small ≈ smaller
Limitations
This doesn't always work perfectly—embeddings capture statistical co-occurrence, not true understanding of meaning.
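A toy illustration of both ideas, using made-up 3-dimensional vectors (real embeddings have 12,288 dimensions and are learned from data, not written by hand):
import numpy as np

# Hand-made toy embeddings; every number is invented for illustration only
emb = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.2, 0.1]),
    "man":    np.array([0.1, 0.8, 0.0]),
    "woman":  np.array([0.1, 0.2, 0.0]),
    "banana": np.array([-0.7, 0.3, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, near 0 or negative for unrelated ones
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["king"], emb["queen"]))    # high: similar words, nearby vectors
print(cosine(emb["king"], emb["banana"]))   # low: unrelated words, distant vectors

# "king" - "man" + "woman" lands nearest to "queen" in this toy space
target = emb["king"] - emb["man"] + emb["woman"]
print(max(emb, key=lambda w: cosine(emb[w], target)))  # "queen"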
Inside ChatGPT
The Scale
175 billion weights means 175 billion learned numbers. If each is 2 bytes (float16), that's 350 GB just for weights—requiring multiple high-end GPUs to run.
Historical Context
GPT-3 (2020) was 100x larger than GPT-2 (2019). ChatGPT fine-tuned GPT-3 for conversation.
Remarkable Simplicity
The task is deceptively simple: just predict the next token. Yet from this simple objective emerges:
- Coherent paragraphs
- Logical reasoning (sometimes)
- Creative writing
- Code generation
- Question answering
Emergence
Complex capabilities emerge from the simple task of next-token prediction at sufficient scale.
The Key Innovation
Attention solves the "long-range dependency" problem. Without attention, early tokens would be "forgotten" by the time later tokens are processed.
How Attention Works
- Each token creates Query, Key, and Value vectors
- Query asks "what am I looking for?"
- Keys answer "what do I contain?"
- Matching Query-Key pairs retrieve Values
Example
In "The cat sat on the mat because it was tired", the word "it" needs to attend to "cat" to resolve the reference—attention enables this.
ChatGPT (GPT-3) at a glance:
- Parameters: 175,000,000,000
- Layers: 96 transformer blocks
- Attention Heads: 96 per block
- Embedding Dimensions: 12,288
- Context Length: 4,096 tokens (original)
- Vocabulary Size: ~50,257 tokens
Processing flow: Input Tokens → Token Embedding + Position Embedding → Attention Block 1 (96 attention heads) → ... Attention Blocks 2-95 ... → Attention Block 96 → Output: 50,257 probability values
The Training of ChatGPT
Training Data Scale
| Source | Volume |
|---|---|
| Web pages | ~1 trillion words available (filtered before use) |
| Books | ~100 billion words |
| Other sources | Additional billions of words |
| Actually used in training | ~300 billion tokens |
Data Quality
Not all web text is equal. Training includes filtering for quality, though the exact criteria aren't public.
Beyond Basic Training
The RLHF Process
- Supervised Fine-tuning: Train on human-written responses
- Reward Model Training: Humans rank model outputs
- Policy Optimization: Use RL to maximize reward model score
Why RLHF Matters
Raw GPT-3 might complete "How do I make a bomb?" with actual instructions. RLHF teaches the model to refuse harmful requests while being helpful for legitimate ones.
What Really Lets ChatGPT Work?
The Deep Question
ChatGPT's success forces us to reconsider: what makes human language special? If a statistical model can produce convincing language, perhaps language itself is more statistical than we thought.
Two Interpretations
- Optimistic: We've achieved AI that truly understands language
- Skeptical: Language is simpler than we thought; ChatGPT exploits this without true understanding
Meaning Space and Semantic Laws of Motion
The Geometric View
Meaning becomes geometry. Related concepts cluster; transitions between ideas become "movements" through this space.
Implications
- Reasoning = trajectories through meaning space
- Creativity = novel paths through the space
- Coherence = smooth, connected paths
Semantic Grammar and the Power of Computational Language
Two Types of Language
- Natural Language: Ambiguous, contextual, human
- Computational Language: Precise, formal, executable
The Synthesis
Wolfram proposes combining ChatGPT's natural language abilities with Wolfram Language's computational precision—getting the best of both worlds.
So ... What Is ChatGPT Doing, and Why Does It Work?
The Summary
After 15 chapters, the answer is simple:
- Predict next token probabilities
- Sample from those probabilities
- Repeat
Why It Works
Because language has learnable patterns, and 175 billion parameters can capture enough of them to produce convincingly human-like text.
Part 2: Product Requirements Document
Transforming Wolfram's essay into an educational product
Executive Summary
This PRD defines the requirements for converting Stephen Wolfram's comprehensive essay into a multi-format educational resource. The goal is to make these insights accessible to diverse audiences while preserving technical accuracy.
Target Audiences
| Audience | Technical Level | Primary Need | Format Preference |
|---|---|---|---|
| Executives | Beginner | Strategic understanding | 2-page summary |
| Developers | Intermediate | Implementation details | Code examples, diagrams |
| ML Engineers | Advanced | Technical depth | Mathematical detail |
| Students | Beginner-Intermediate | Learning path | Interactive course |
| Educators | All levels | Teaching materials | Modular content |
Content Modules
Chapters 1-3
"How ChatGPT Generates Text"
- Token-by-token generation
- Temperature and sampling
- N-gram limitations
- Model fundamentals
Chapters 4-7
"Neural Networks Explained"
- Neurons and layers
- Activation functions
- Training and backprop
- Practice and lore
Chapters 8-12
"Inside the Transformer"
- Embeddings
- Attention mechanism
- GPT architecture
- Training at scale
Chapters 13-16
"Why It Works"
- Computational limits
- Meaning space
- Semantic grammar
- Conclusions
Why This Opening Matters
Wolfram begins by acknowledging what many felt in early 2023: genuine surprise that a machine could produce coherent, meaningful text. This sets up the essay's central question—not just "how" but "why" it works.
Key Insight
The word "unexpected" is crucial. Even AI researchers were surprised by how well large language models performed. This wasn't a foregone conclusion but a genuine discovery about the nature of language.
What to Watch For