What Is ChatGPT Doing … and Why Does It Work?

Complete illustrated analysis with expandable commentary on every paragraph

👤 Stephen Wolfram 📅 February 2023 🖼️ 80+ Original Images
175B Parameters · 12,288 Embedding Dims · 96 Attention Heads · ~300B Training Words · ~50K Vocabulary

The Complete Illustrated Essay

All original images from Wolfram's essay with expandable commentary on every paragraph

Chapter 1

It's Just Adding One Word at a Time

ChatGPT text continuation concept
Figure 1.1: The fundamental concept—ChatGPT generates text by continuing what came before, one word at a time.
That ChatGPT can automatically generate something that reads even superficially like human-written text is remarkable, and unexpected. But how does it do it? And why does it work? My purpose here is to give a rough outline of what's going on inside ChatGPT—and then to explore why it is that it can do so well in producing what we might consider to be meaningful text.
💡 Analysis

Why This Opening Matters

Wolfram begins by acknowledging genuine surprise—even AI researchers didn't expect language models to work this well. The word "unexpected" is crucial: this wasn't a foregone conclusion but a discovery.

Key Questions Posed

  • How: What are the mechanisms?
  • Why: What makes it possible at all?
The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a "reasonable continuation" of whatever text it's got so far, where by "reasonable" we mean "what one might expect someone to write after seeing what people have written on billions of webpages, etc."
💡 Analysis

The Core Mechanism

This single sentence captures ChatGPT's entire purpose: produce reasonable continuations. Not "understand," not "think"—simply continue text in a statistically plausible way based on patterns from training data.

Important Distinction

"Reasonable" = statistically likely based on what humans have written. This is fundamentally different from human intentional writing, yet produces remarkably similar outputs.

So let's say we've got the text "The best thing about AI is its ability to". Imagine scanning billions of pages of human-written text (say on the web and in digitized books) and finding all instances of this text—then seeing what word comes next what fraction of the time.
💡 Analysis

The Mental Model

This simplified mental model helps build intuition. Imagine literally counting what follows this phrase across all text ever written. That's not exactly how ChatGPT works, but it's the right intuition for understanding probability-based generation.
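
To make the counting intuition concrete, here is a minimal Python sketch that does the literal counting on a tiny invented corpus (the corpus and prompt are made up for illustration; ChatGPT does not actually work by lookup):

from collections import Counter

# Toy corpus standing in for "billions of pages" (illustrative only).
corpus = (
    "the best thing about ai is its ability to learn . "
    "the best thing about ai is its ability to predict . "
    "the best thing about ai is its ability to learn ."
).split()

prompt = "its ability to".split()

# Find every occurrence of the prompt and count the word that follows it.
next_words = Counter(
    corpus[i + len(prompt)]
    for i in range(len(corpus) - len(prompt))
    if corpus[i:i + len(prompt)] == prompt
)

total = sum(next_words.values())
for word, count in next_words.most_common():
    print(f"{word!r}: {count / total:.2f}")   # 'learn': 0.67, 'predict': 0.33

On this toy corpus, "learn" follows the prompt twice and "predict" once, giving estimated probabilities of 0.67 and 0.33.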

Word probability rankings
Figure 1.2: Probability rankings for words following "The best thing about AI is its ability to"—words like "learn" and "predict" rank highest.
ChatGPT effectively does something like this, except that (as I'll explain) it doesn't look at literal text; it looks for things that in a certain sense "match in meaning". But the end result is that it produces a ranked list of words that might follow, together with "probabilities".
💡 Analysis

"Match in Meaning"

This hints at embeddings—numerical representations where semantically similar text has similar numbers. ChatGPT doesn't do literal string matching but operates in a "meaning space."

The Output

For any context, ChatGPT outputs ~50,000 probabilities (one per token in vocabulary). These sum to 1.0, forming a complete probability distribution.
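
In code terms, "sum to 1.0" is the softmax step. A hedged sketch with a made-up five-word vocabulary (real models emit roughly 50,000 scores per step):

import math

# Made-up raw scores ("logits") for a tiny vocabulary (illustrative only).
logits = {"learn": 4.5, "predict": 3.5, "make": 3.2, "understand": 3.1, "do": 2.9}

# Softmax: exponentiate and normalize so the values form a probability distribution.
exps = {w: math.exp(v) for w, v in logits.items()}
total = sum(exps.values())
probs = {w: e / total for w, e in exps.items()}

print(probs)                 # each word's probability
print(sum(probs.values()))   # 1.0, up to floating-point rounding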

GPT-2 model retrieval
Figure 1.3: Retrieving the GPT-2 model in Wolfram Language
Top words query
Figure 1.4: Querying top 5 probable next words
Dataset formatting
Figure 1.5: Formatting results into a structured dataset showing words and their probabilities
And the remarkable thing is that when ChatGPT does something like write an essay what it's essentially doing is just asking over and over again "given the text so far, what should the next word be?"—and each time adding a word.
💡 Analysis

The Iterative Process

This reveals the autoregressive nature: each word depends only on previous words. There's no "planning ahead" or "thinking about the whole essay"—just repeated next-word prediction.

Why It's Remarkable

This simple loop produces coherent essays, code, poetry, and more. Complexity emerges from 175 billion parameters, not sophisticated reasoning.
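
A sketch of that loop, with a tiny hand-written probability table standing in for the 175-billion-parameter network (the table and words are invented; a real model conditions on the entire context, not just the last word):

import random

# Toy stand-in for the model: fixed next-word probabilities per previous word.
TOY_MODEL = {
    "the": {"cat": 0.5, "dog": 0.3, "best": 0.2},
    "cat": {"sat": 0.6, "slept": 0.4},
    "dog": {"ran": 0.7, "sat": 0.3},
    "sat": {"quietly": 1.0},
    "ran": {"away": 1.0},
    "slept": {"soundly": 1.0},
}

def next_word_distribution(words):
    # ChatGPT would compute this from the whole context with a neural net.
    return TOY_MODEL.get(words[-1], {"the": 1.0})

def generate(prompt, n_words):
    words = prompt.split()
    for _ in range(n_words):
        probs = next_word_distribution(words)
        word = random.choices(list(probs), weights=list(probs.values()))[0]
        words.append(word)                      # add one word at a time
    return " ".join(words)

print(generate("the", 5))   # e.g. "the cat sat quietly the cat"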

Iterative generation
Figure 1.6: Iterative text generation using highest-probability words at each step
But, OK, at each step it gets a list of words with probabilities. But which one should it actually pick to add to the essay (or whatever) that it's writing? One might think it should be the "highest-ranked" word (i.e. the one to which the highest "probability" was assigned). But this is where a bit of voodoo begins to creep in.
💡 Analysis

The Selection Problem

Given probabilities, how do we choose? The naive answer (always pick the highest) turns out to be wrong. "Voodoo" signals that this solution lacks rigorous theoretical justification—it simply works empirically.

Zero temperature output
Figure 1.7: The problem with always selecting highest-probability words: repetitive, confused text that goes in circles.
Because it turns out that if we always pick the highest-ranked word, we'll typically get a very "flat" essay, that never seems to "show any creativity" (and even sometimes repeats word for word). But if sometimes (at random) we pick lower-ranked words, we get a "more interesting" essay.
💡 Analysis

The Creativity Paradox

Counterintuitively, adding randomness makes text better. This introduces variety, surprise, and the appearance of creativity.

The Balance

Too deterministic = boring/repetitive
Too random = nonsensical
Sweet spot = "creative" and coherent

Temperature sampling code
Figure 1.8: Code for temperature-based sampling at T=0.8
Multiple outputs
Figure 1.9: Five different outputs with temperature sampling—much more varied!
There's a particular so-called "temperature" parameter that determines how often lower-ranked words will be used, and for essay generation, it turns out that a "temperature" of 0.8 seems best.
💡 Analysis

Temperature Explained

  • T=0: Always pick highest probability (deterministic)
  • T=0.8: Slight randomness, good for essays
  • T=1: Sample directly from distribution
  • T>1: More random, more "creative"

Why 0.8?

Empirically determined—it "works." Different tasks may need different temperatures (code often uses lower).
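
A minimal sketch of temperature sampling, assuming a next-word distribution is already in hand (the probabilities below are made up). Reweighting each probability by the exponent 1/T is equivalent to dividing the logits by T before the softmax:

import random

def sample_with_temperature(probs, temperature):
    # T = 0: deterministic argmax. Otherwise reweight by 1/T and renormalize,
    # so low T sharpens the distribution and high T flattens it.
    if temperature == 0:
        return max(probs, key=probs.get)
    words = list(probs)
    weights = [probs[w] ** (1.0 / temperature) for w in words]
    return random.choices(words, weights=weights)[0]

# Made-up next-word probabilities for illustration.
probs = {"learn": 0.40, "predict": 0.25, "make": 0.20, "understand": 0.15}
print(sample_with_temperature(probs, 0))     # always "learn"
print(sample_with_temperature(probs, 0.8))   # usually "learn", sometimes others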

Probability distribution
Figure 1.10: Log-log plot showing power-law distribution of word probabilities—a few words are very likely, most are rare.
GPT-2 extended output
Figure 1.11: GPT-2 (smaller model) extended generation—somewhat coherent but tends to wander
GPT-3 zero temp
Figure 1.12: GPT-3 with temperature=0 (deterministic)
GPT-3 temp 0.8
Figure 1.13: GPT-3 with temperature=0.8—much more natural!
Key Concept
ChatGPT operates by repeatedly asking "what should the next word be?" and sampling from probability distributions. Temperature controls the randomness—0.8 works well for creative text.
Chapter 2

Where Do the Probabilities Come From?

OK, so ChatGPT always picks its next word based on probabilities. But where do those probabilities come from? Let's start with a simpler problem. Let's consider generating English text one letter (rather than word) at a time.
💡 Analysis

Pedagogical Strategy

Wolfram starts with letters (26 options) instead of words (~50,000 tokens) to build intuition. The principles transfer directly but are easier to visualize at letter scale.

Letter frequency cats
Figure 2.1: Letter frequencies from Wikipedia's "cats" article
Letter frequency dogs
Figure 2.2: Letter frequencies from Wikipedia's "dogs" article
Aggregated frequencies
Figure 2.3: Aggregated letter frequencies from large English text sample—'e' is most common
And here's a sample of "random text" we get by just picking each letter independently with the same probability that it appears in the Wikipedia article on cats: "tletoramsleraunsouemrctacosyfmtsalrceapmsyaefpnte..."
💡 Analysis

The Result: Gibberish

This has correct letter frequencies but is unreadable. English isn't just about individual letter frequencies—it's about how letters combine. This demonstrates that context matters.

Random letters
Figure 2.4: Random letters by frequency—gibberish
With spaces
Figure 2.5: Adding spaces—still gibberish
Word length distribution
Figure 2.6: With realistic word lengths—fake words but still not English
We can add a bit more "Englishness" by considering not just how probable each individual letter is on its own, but how probable pairs of letters ("2-grams") are. We know, for example, that if we have a "q", the next letter basically has to be "u".
💡 Analysis

Introducing N-grams

  • 1-gram: Individual letter frequencies
  • 2-gram: Pairs like "th", "qu", "er"
  • 3-gram: Triples like "the", "ing"

In English, P(u|q) ≈ 0.99. This single rule dramatically improves generation!
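
A small sketch of a letter-level 2-gram generator, using one short made-up sentence in place of a large corpus (illustrative only):

import random
from collections import Counter, defaultdict

# Tiny sample standing in for a large English corpus.
text = "the cat chased the other cat through the quiet garden and the cat quit"
letters = [c for c in text if c.isalpha()]

# Count 2-grams: how often each letter follows each other letter.
counts = defaultdict(Counter)
for a, b in zip(letters, letters[1:]):
    counts[a][b] += 1           # note: 'q' is only ever followed by 'u' here

def next_letter(prev):
    options = counts[prev]
    return random.choices(list(options), weights=list(options.values()))[0]

# Generate 40 letters, each drawn from the distribution after the previous one.
out = ["t"]
for _ in range(40):
    out.append(next_letter(out[-1]))
print("".join(out))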

Single letter probabilities
Figure 2.7: Single letter probability distribution
2-gram matrix
Figure 2.8: 2-gram probability matrix—note 'q' column only has 'u'
2-gram text
Figure 2.9: Text generated with 2-grams—some real words start appearing!
Progressive n-grams
Figure 2.10: Progression from 2-grams to higher n-grams—increasingly realistic
Let's go back to words now. English has around 40,000 "reasonably commonly used" words. And by looking at a large enough corpus of English text (say a few million books, or a few billion webpages), we can get fairly good estimates of how common each word is.
💡 Analysis

Scale Jump

Moving from 26 letters to 40,000+ words dramatically increases complexity. Word frequencies follow Zipf's Law: a few words are very common ("the" ~7%), most are rare.

Random word sequence
Figure 2.11: Random words by frequency—grammatically nonsensical
Word 2-grams
Figure 2.12: Word-level 2-grams starting with "cat"—slightly more coherent
But how about "2-grams" for words? In principle there are about 40,000×40,000 ≈ 1.6 billion possible 2-grams. And of these, only on the order of a million or so actually appear in reasonable English text.
💡 Analysis

The Combinatorial Explosion

N-gram    Possible combinations       Observed in real text
2-gram    ~1.6 billion                ~1 million
3-gram    ~64 trillion                ~a few million
4-gram    ~2.6 quintillion (2.6×10^18)  ~tens of millions

Most combinations never occur—we need models that generalize!

Key Concept
Direct probability counting fails because there are more possible word sequences than we could ever observe. This motivates neural networks: models that learn patterns and generalize to unseen combinations.
Chapter 3

What Is a Model?

Say we want to know (as Galileo did back in the late 1500s) how long it takes a cannon ball to fall from each floor of the Tower of Pisa. Well, we could just measure it in each case and make a table of the results.
💡 Analysis

The Galileo Analogy

Two approaches to knowledge:

  • Empirical: Measure every case, store in table
  • Theoretical: Find formula that predicts all cases

N-gram counting = measuring each floor
Neural networks = finding the formula

Fall time data
Figure 3.1: Hypothetical fall time data from different heights
Linear fit
Figure 3.2: Simple linear model—doesn't fit well
Quadratic fit
Figure 3.3: Quadratic model—much better fit!
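
The "find a formula" approach can be sketched as a least-squares polynomial fit. The measurements below are synthetic, generated from t = sqrt(2h/g) plus noise rather than taken from Wolfram's figures:

import numpy as np

# Synthetic fall-time measurements (seconds) for various drop heights (meters).
heights = np.array([5.0, 10.0, 20.0, 30.0, 40.0, 50.0])
times = np.sqrt(2 * heights / 9.81) + np.random.normal(0, 0.02, heights.size)

# Two candidate structures with "knobs to turn": a line and a quadratic.
linear = np.polyfit(heights, times, 1)
quadratic = np.polyfit(heights, times, 2)

for name, coeffs in [("linear", linear), ("quadratic", quadratic)]:
    mse = np.mean((np.polyval(coeffs, heights) - times) ** 2)
    print(f"{name}: mean squared error = {mse:.5f}")   # quadratic fits better

The point is not the particular polynomial; it is that choosing a structure and then fitting its parameters is what "modeling" means here.
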
There's never a "model-less model". Any model you use has some particular underlying structure—then a certain set of "knobs you can turn" (i.e. parameters you can set) to fit your data.
💡 Analysis

Model = Structure + Parameters

  • Structure: Architecture (linear, polynomial, neural net)
  • Parameters: Adjustable values (weights, biases)

ChatGPT: Transformer structure + 175 billion parameters

Poor model fit
Figure 3.4: Wrong structure = poor fit regardless of parameters. Model choice matters!
Key Concept
A model provides a procedure for computing answers rather than storing every case. ChatGPT's 175 billion parameters encode patterns that let it generalize to text it's never seen.
Chapter 4

Models for Human-Like Tasks

So far our examples of tasks have involved fairly simple, numerical data. But what about tasks that we humans consider, well, "human tasks"—like recognizing images, or understanding text?
💡 Analysis

The Leap to Human Tasks

Physics has equations like F=ma. What's the equation for "this image contains a cat"? There isn't one we can write simply. Human-like tasks require learning patterns from examples.

Handwritten digits
Figure 4.1: Handwritten digits 0-9—humans recognize these instantly, but how?
Single digit examples
Figure 4.2: Many examples of a single digit—enormous variation in handwriting
Distorted digits
Figure 4.3: The same digit with distortions, rotations, modifications—still recognizable!
Digit recognition network
Figure 4.4: How a neural network processes digit images to produce classification
Blurred digits
Figure 4.5: Progressive blurring—at what point does recognition fail? Where does certainty end?
Key Concept
Neural networks learn to extract hierarchical features from data—edges become shapes, shapes become parts, parts become concepts. No explicit programming, just learning from examples.
Chapter 5

Neural Nets

OK, so how do neural nets actually work? At their core they're based on simple idealizations of how brains seem to work. In a human brain there are about 100 billion neurons, each capable of producing an electrical pulse up to perhaps a thousand times a second.
💡 Analysis

Biological Inspiration

  • Brain: 100 billion neurons, ~1000 activations/sec
  • ChatGPT: 175 billion parameters, billions ops/sec

Modern neural nets are inspired by but not faithful to biology.

Neural network diagram
Figure 5.1: Architecture of a digit-recognition network—11 layers transforming pixels to probabilities
1s and 2s
Figure 5.2: The classification task: distinguish 1s from 2s
Voronoi diagram
Figure 5.3: Voronoi diagram showing "attractor basins"—each region flows to one point
Three points
Figure 5.4: Simple nearest-point classification task
Simple network
Figure 5.5: Simple layered neural network architecture
Data flow
Figure 5.6: How values propagate through layers—each neuron computes weighted sum + activation
ReLU activation
Figure 5.7: The ReLU activation function: f(x) = max(0, x)—simple but powerful
Neuron functions
Figure 5.8: Different neurons compute different functions depending on their weights
Classification boundary
Figure 5.9: Trained network's classification boundary—dividing space into regions
Network size comparison
Figure 5.10: Larger networks produce smoother, more accurate boundaries
Digit separation
Figure 5.11: How the network separates different handwritten digits
Cat dog classification
Figure 5.12: Cat vs dog classification with probability scores
Cat image
Figure 5.13: Input cat image for feature analysis
First layer features
Figure 5.14: First layer features—edges and basic patterns
Tenth layer features
Figure 5.15: Tenth layer—abstract, harder to interpret
Single Neuron Computation
output = f(w · x + b)

Where:
  x = input vector
  w = weights (learned)
  b = bias (learned)
  f = activation (e.g., ReLU)

ReLU: f(x) = max(0, x)
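
The same computation written out as runnable Python, with arbitrary example weights and inputs (not values from any trained network):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def neuron(x, w, b):
    # output = f(w · x + b)
    return relu(np.dot(w, x) + b)

def layer(x, W, b):
    # A layer is many neurons sharing the same input: f(W @ x + b).
    return relu(W @ x + b)

x = np.array([0.5, -1.2, 3.0])              # input vector
w = np.array([0.4, 0.1, -0.2])              # "learned" weights (example values)
b = 0.05                                    # "learned" bias (example value)
print(neuron(x, w, b))                      # single neuron output

W = np.array([[0.4, 0.1, -0.2],
              [-0.3, 0.8, 0.5]])            # a layer of 2 neurons, 3 inputs each
print(layer(x, W, np.array([0.05, -0.1])))  # outputs of the 2-neuron layer
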
Chapter 6

Machine Learning, and the Training of Neural Nets

We've seen that neural nets can do some remarkable things. But how do we get them to do what we want? The basic idea of "machine learning" is to have a procedure that progressively adjusts the parameters ("weights") of the neural net to make it do better.
💡 Analysis

Learning = Parameter Adjustment

Network structure is fixed. Learning means finding the right weights. For ChatGPT: find 175 billion numbers that make it good at predicting text.

Target function
Figure 6.1: Target function the network should learn
Simple network
Figure 6.2: Simple network with one input and output
Random weights
Figure 6.3: Random weights produce random functions—nothing like target
Training progress
Figure 6.4: Progressive training—network gradually learns the target function
Loss curve
Figure 6.5: Loss decreasing during training—the classic learning curve
At the core of machine learning is the idea of "gradient descent". One imagines one's parameters as defining a position on a "landscape"—defined by the loss function. Then the idea is to progressively follow the path of steepest descent down this landscape to its minimum.
💡 Analysis

The Landscape Metaphor

  • High loss = mountain peak (bad)
  • Low loss = valley bottom (good)
  • Gradient descent = rolling downhill

With 175B parameters, this "landscape" has 175B dimensions!
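
A minimal sketch of gradient descent on a two-parameter toy "landscape" (the bowl-shaped loss below is invented for illustration; ChatGPT's actual loss lives in a 175-billion-dimensional space):

# Gradient descent on a simple two-parameter loss surface.
def loss(w1, w2):
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2     # minimum at (3, -1)

def gradient(w1, w2):
    return 2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)          # partial derivatives

w1, w2 = 0.0, 0.0                # starting point on the landscape
learning_rate = 0.1
for step in range(50):
    g1, g2 = gradient(w1, w2)
    w1 -= learning_rate * g1     # step downhill along the steepest slope
    w2 -= learning_rate * g2

print(w1, w2, loss(w1, w2))      # approaches (3, -1) with loss near 0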

2D loss landscape
Figure 6.6: 2D loss landscape—contours show equal loss
Gradient descent path
Figure 6.7: Gradient descent path following steepest descent
Multiple solutions
Figure 6.8: Different starting points → different solutions
Extrapolation
Figure 6.9: Different solutions extrapolate differently
Key Concept
Backpropagation uses the chain rule to compute how each weight affects the loss, enabling efficient gradient computation through hundreds of layers. This makes training deep networks feasible.
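
A tiny worked example of that chain rule, for a one-neuron "network" y = ReLU(w·x + b) with squared-error loss (all numbers arbitrary):

# Forward pass
x, target = 2.0, 1.0
w, b = 0.5, 0.1
z = w * x + b                    # pre-activation: 1.1
y = max(0.0, z)                  # ReLU output: 1.1
loss = (y - target) ** 2         # squared error: 0.01

# Backward pass (chain rule): dloss/dw = dloss/dy * dy/dz * dz/dw
dloss_dy = 2.0 * (y - target)    # 0.2
dy_dz = 1.0 if z > 0 else 0.0    # ReLU derivative
dloss_dw = dloss_dy * dy_dz * x  # 0.4
dloss_db = dloss_dy * dy_dz      # 0.2
print(loss, dloss_dw, dloss_db)

Backpropagation applies exactly this bookkeeping, layer by layer, to every weight in the network.
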
Chapter 7

The Practice and Lore of Neural Net Training

There's an awful lot of "lore" about neural net training. Fundamentally much of it is about what architecture of neural net to use, how to set up training, and what data to train on.
💡 Analysis

Art vs. Science

"Lore" = knowledge passed through practice, not theory. Many decisions work empirically but lack theoretical justification. Architecture, hyperparameters, data—all involve craft knowledge.

Network architecture details
Figure 7.1: Detailed layer-by-layer architecture showing feature transformation
Small network limits
Figure 7.2: Small networks have limited capacity—can't fit complex functions
Training monitor
Figure 7.3: Training interface showing loss reduction over time
Chapter 8

"Surely a Network That's Big Enough Can Do Anything!"

There's a notion that if we just had a "big enough" neural net it would be able to do anything. But this isn't true. The fundamental issue is the phenomenon of "computational irreducibility".
💡 Analysis

Fundamental Limits

Some computations can't be shortcut—they require step-by-step execution. No neural network, regardless of size, can bypass computational irreducibility.

Cellular automaton
Figure 8.1: Cellular automaton evolution—an example of computational irreducibility. You can't predict the future without computing each step.
Key Insight
ChatGPT's success reveals that essay-writing is "computationally shallower" than we thought. Language follows patterns that can be learned—it doesn't require solving computationally irreducible problems.
Chapter 9

The Concept of Embeddings

Neural nets—at least as they're set up today—are fundamentally based on numbers. So if we're going to deal with something like text, we need some way to represent it in terms of numbers.
💡 Analysis

Words as Vectors

Each word becomes a point in high-dimensional space. Similar meanings → similar vectors. "King" and "queen" are close; "king" and "banana" are far.

Word embeddings 2D
Figure 9.1: Word embeddings projected to 2D—semantically similar words cluster together
Digit network layers
Figure 9.2: Network architecture showing transformation from pixels to embeddings to outputs
Softmax output
Figure 9.3: Final softmax output—near-certainty for digit "4"
Pre-softmax
Figure 9.4: Pre-softmax layer—the raw embedding
Multiple embeddings
Figure 9.5: Embeddings for different instances of 4s and 8s—similar digits cluster
3D embedding space
Figure 9.6: 3D projection of digit embeddings—clear clustering by digit identity
GPT-2 embeddings
Figure 9.7: Raw GPT-2 embedding vectors—768 numbers per word
Key Concept
Embeddings convert words to vectors where semantic similarity = geometric proximity. The famous example: king - man + woman ≈ queen. Meaning becomes geometry.
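
A hedged sketch of "meaning becomes geometry" with invented 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and these particular numbers come from no trained model):

import numpy as np

# Made-up 3-D "embeddings" for a handful of words.
emb = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "banana": np.array([-0.7, 0.2, -0.5]),
}

def cosine(a, b):
    # Cosine similarity: 1 = same direction, 0 = unrelated, negative = opposed.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["king"], emb["queen"]))    # relatively high
print(cosine(emb["king"], emb["banana"]))   # low (negative)

# In this toy space, king - man + woman lands nearest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
print(max(emb, key=lambda w: cosine(emb[w], target)))   # "queen"
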
Chapter 10

Inside ChatGPT

OK, so we're finally ready to discuss what's inside ChatGPT. And, yes, ultimately, it's a giant neural net—currently a version of the so-called GPT-3 network with 175 billion weights.
💡 Analysis

The Scale

175 billion parameters × 2 bytes = 350 GB just for the weights. Running it requires multiple high-end GPUs. It was among the largest publicly known models of its era.

Embedding module
Figure 10.1: The embedding module—combines token embeddings with positional embeddings
Hello bye embeddings
Figure 10.2: Token and position embeddings visualized for "hello" and "bye" sequences
The most important thing about transformer neural nets like the ones used in ChatGPT is a piece called an "attention block". The idea of attention is that it provides a way for the sequence of tokens being processed to "pay attention to" (and draw information from) tokens that preceded it in the sequence.
💡 Analysis

The Key Innovation

Attention solves long-range dependencies. In "The cat sat on the mat because it was tired," the word "it" needs to attend to "cat"—attention enables this connection across many tokens.
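
A sketch of the scaled dot-product attention a single head computes, with random matrices standing in for the learned query/key/value projections; a real transformer block runs many such heads in parallel and wraps them with residual connections, layer normalization, and an MLP:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Each token builds a query, scores it against every token's key,
    # and takes a probability-weighted mix of the corresponding values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)     # causal mask: no attending to the future
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 tokens, 8-dim embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)         # (5, 8): one new vector per token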

Attention block
Figure 10.3: Single attention block with multiple attention heads—the core of the transformer
Attention patterns
Figure 10.4: Attention weight patterns for 12 heads—each learns different relationships
Weight matrix
Figure 10.5: 768×768 weight matrix in fully-connected layer
Smoothed weights
Figure 10.6: Smoothed view revealing structure in weights
Attention across blocks
Figure 10.7: Attention patterns evolve across multiple transformer blocks
FC matrices across blocks
Figure 10.8: Fully-connected weight matrices from different layers—each encodes different features
Weight distributions
Figure 10.9: Weight magnitude distributions vary across layers

ChatGPT Architecture Summary

Parameters                175,000,000,000
Transformer Blocks        96 layers
Attention Heads/Block     96
Embedding Dimensions      12,288
Vocabulary Size           ~50,257 tokens
Context Length            4,096 tokens
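
As a back-of-the-envelope check that these numbers hang together, the standard rough count of ~12·d² weights per transformer block (an approximation, not an exact accounting of GPT-3's parameter layout) gives:

d_model, n_layers, vocab = 12_288, 96, 50_257

# Per block: ~4·d² for the attention projections plus ~8·d² for the MLP.
per_block = 12 * d_model ** 2
embeddings = vocab * d_model            # token embedding matrix

total = n_layers * per_block + embeddings
print(f"{total:,}")                     # ~174,600,000,000, close to the quoted 175B
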
Chapter 11

The Training of ChatGPT

ChatGPT's training corpus was essentially "all of the web" (i.e. a few billion pages of text, with a trillion or so words), together with a few million books, and other sources.
💡 Analysis

Training Data Scale

Source                      Volume
Web pages                   ~1 trillion words
Books                       ~100 billion words
Used in training (subset)   ~300 billion tokens
Key Fact
ChatGPT was trained on roughly 300 billion tokens. The number of parameters (~175B) is comparable to the training token count—a rough correspondence Wolfram notes, and one that worked well in practice for GPT-3.
Chapter 12

Beyond Basic Training

The raw GPT-3 model was trained just to "complete text". But ChatGPT was further trained using something called RLHF—"Reinforcement Learning from Human Feedback"—which effectively taught it to produce outputs that humans rate as "good".
💡 Analysis

RLHF Process

  1. Generate multiple responses
  2. Humans rank them
  3. Train reward model on preferences
  4. Use RL to optimize for reward

This makes ChatGPT helpful and safe, not just good at predicting text.
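
Step 3 is commonly implemented with a pairwise preference loss of roughly the following form; this is a generic formulation with made-up scores, not OpenAI's actual training code:

import math

def preference_loss(reward_chosen, reward_rejected):
    # Push the human-preferred response's reward above the rejected one's.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up reward-model scores for one human-ranked pair of responses.
print(preference_loss(2.1, 0.4))   # small loss: reward model agrees with the human
print(preference_loss(0.4, 2.1))   # large loss: reward model disagrees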

Chapter 13

What Really Lets ChatGPT Work?

Human language—and the thought processes behind it—have always seemed to us to be somehow very special. And so it's seemed like something "AI-complete" to be able to produce human language. But now ChatGPT can do these things. So what's going on?
💡 Analysis

The Deep Question

If a statistical model produces convincing language, maybe language itself is more statistical than we thought. This is a discovery about language, not just about AI.

Key Insight
ChatGPT's success suggests producing human-like text requires less computational sophistication than assumed. Language follows learnable patterns—a scientific discovery about language itself.
Chapter 14

Meaning Space and Semantic Laws of Motion

We've talked about ChatGPT working with embeddings. And we can think of these embeddings as defining a kind of "meaning space" in which words, sentences and larger pieces of text get placed.
💡 Analysis

Meaning as Geometry

Reasoning = trajectories through meaning space. Creativity = novel paths. Coherence = smooth, connected paths. Language generation follows "semantic laws of motion."

Chapter 15

Semantic Grammar and the Power of Computational Language

We've been talking so far about the impressive ability of ChatGPT to deal with human natural language. But Wolfram|Alpha uses a different kind of language: computational language.
💡 Analysis

Two Languages

  • Natural: Ambiguous, contextual, human
  • Computational: Precise, formal, executable

Combining both could give the best of both worlds.

Chapter 16

So ... What Is ChatGPT Doing, and Why Does It Work?

The basic answer is that what ChatGPT is doing is generating text by successively adding one token at a time, each time choosing its next token by sampling from a probability distribution that's been "learned" by training on a large corpus of text.
💡 Analysis

The Summary

After 15 chapters, it's simple: predict next token, sample, repeat. The magic is in 175 billion parameters capturing patterns from 300 billion training tokens.

Final Answer
ChatGPT generates text by repeatedly predicting next-token probabilities and sampling. It works because language follows learnable statistical patterns, and 175 billion parameters can capture enough of them. This is a scientific discovery about language itself.

Product Requirements Document

Educational product specification based on Wolfram's essay

Executive Summary

Transform Wolfram's comprehensive essay into accessible, multi-format educational resources for diverse audiences—from executives needing strategic understanding to engineers wanting technical depth.

Target Audiences

Audience       Level          Primary Need              Format
Executives     Beginner       Strategic understanding   2-page summary
Developers     Intermediate   Implementation details    Code examples
ML Engineers   Advanced       Technical depth           Full mathematics
Students       Progressive    Learning path             Interactive course

Technical Specifications

ChatGPT Key Numbers

Parameters        175,000,000,000
Embedding Dims    12,288
Attention Heads   96 per block
Layers            96 blocks
Training Data     ~300B tokens
Vocabulary        ~50,257
Context           4,096 tokens
Temperature       0.8 typical

Glossary

Token
Basic text unit—word or subword piece
Embedding
Vector representation where similar meanings = similar vectors
Attention
Mechanism for tokens to "look at" relevant preceding tokens
Transformer
Architecture using attention—basis of GPT models
Temperature
Controls randomness: 0=deterministic, higher=more random
Loss Function
Measures prediction error—training minimizes this
Gradient Descent
Optimization by following steepest downhill path
Backpropagation
Computing gradients through layers via chain rule
RLHF
Reinforcement Learning from Human Feedback
Computational Irreducibility
Problems requiring step-by-step computation—can't be shortcut