
Transformer Simulation

Interactive visualization of attention mechanisms

Step 1: Tokenization

Convert the input text into tokens. Real Transformers use BPE (Byte-Pair Encoding) so that unknown words can still be represented as subwords.

Tokens

How Tokenization Works

Real transformers use subword tokenization (BPE). "unhappiness" might become ["un", "happiness"] or ["un", "hap", "pi", "ness"]. This simulation uses simple word-level tokenization for clarity.
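As a concrete illustration, here is a minimal sketch of the word-level tokenization used in this simulation (not real BPE). The helper names and the on-the-fly vocabulary are our own, not part of any library:

```python
# Word-level tokenization sketch (real Transformers use subword BPE).
def tokenize(text):
    """Split on whitespace and strip basic punctuation."""
    return [w.strip(".,!?").lower() for w in text.split()]

def build_vocab(tokens):
    """Assign each unique token an integer id, reserving 0 for <unk>."""
    vocab = {"<unk>": 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

tokens = tokenize("The cat sat on the mat.")
vocab = build_vocab(tokens)
ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ids)     # [1, 2, 3, 4, 1, 5]
```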

Step 2: Token Embeddings

Each token is converted to a dense vector. Similar words have similar embeddings.

Token Embeddings (each token → d_model-dimensional vector)
Embedding: token → ℝ^d_model (d_model = 512 in the original paper)

Learned Representations

Embeddings are learned during training. Words used in similar contexts end up with similar vectors. "king" - "man" + "woman" ≈ "queen" demonstrates the semantic structure captured.
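Below is a minimal sketch of an embedding lookup, continuing the toy example above (ids = [1, 2, 3, 4, 1, 5] for "the cat sat on the mat"). The table here is random rather than learned, so the similarity structure described above would only emerge after training:

```python
import numpy as np

# Embedding lookup sketch; the table is a random placeholder.
vocab_size, d_model = 6, 512
ids = [1, 2, 3, 4, 1, 5]

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))   # (vocab_size, d_model)
X = embedding_table[ids]                                   # (seq_len, d_model): one row per token
print(X.shape)                                             # (6, 512)

# Cosine similarity is the usual way to measure "similar embeddings".
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(X[0], X[4]))   # 1.0 — both positions hold the same token "the"
```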

Step 3: Positional Encoding

Since attention by itself has no notion of token order (it is permutation-invariant), we add sinusoidal position information to the embeddings.

Positional Encoding Matrix (position × dimension)
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))   |   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Why Sinusoids?

Sine/cosine waves at different frequencies create unique patterns for each position. The model can learn to attend to relative positions because PE(pos+k) can be expressed as a linear function of PE(pos).
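A short NumPy sketch of this encoding, implementing the formula above (even dimensions get sine, odd dimensions get cosine):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))    # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = positional_encoding(max_len=6, d_model=512)
# The encoding is simply added to the token embeddings:
# X = X + pe[:X.shape[0]]
```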

Step 4: Scaled Dot-Product Attention

The core mechanism: compute attention weights from Query-Key dot products, then weight the Values.

Q, K, V → QK^T → Scale → Softmax → × V
Query, Key, Value Matrices
Attention(Q, K, V) = softmax(QK^T / √d_k) V

Creating Q, K, V

Each token's embedding is projected into three vectors: Query (what am I looking for?), Key (what do I contain?), Value (what information do I have?).
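Here is a minimal NumPy sketch of the whole computation, following the formula above. The projection matrices W_q, W_k, W_v are random placeholders for what would be learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # project embeddings into Q, K, V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ V, weights                    # weighted sum of the Values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 512, 64
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, attn = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)   # (6, 64) (6, 6)
```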

Step 5: Multi-Head Attention

Run multiple attention operations in parallel, each learning different relationship types.

d_model = 512   |   Heads = 4   |   d_k per head = 128
Attention Patterns by Head
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

Why Multiple Heads?

Each head can specialize: one might learn syntax, another semantics, another coreference. The outputs are concatenated and projected back to d_model dimensions.
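A sketch with the numbers shown above (d_model = 512 split into 4 heads of d_k = 128 each). The softmax and single-head attention helpers mirror the previous sketch and are repeated so this block runs on its own; all weights are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, heads, W_q, W_k, W_v, W_o):
    d_model = X.shape[-1]
    d_k = d_model // heads
    head_outputs = []
    for h in range(heads):
        sl = slice(h * d_k, (h + 1) * d_k)           # this head's slice of each projection
        head_outputs.append(attention(X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]))
    concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, heads * d_k) = (seq_len, d_model)
    return concat @ W_o                              # project back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, heads = 6, 512, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, heads, W_q, W_k, W_v, W_o).shape)   # (6, 512)
```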

Step 6: Complete Transformer Flow

See how input flows through the entire encoder layer.


Encoder Layer Structure

Input → Embedding + Positional → Multi-Head Self-Attention → Add & LayerNorm → Feed-Forward → Add & LayerNorm → Output
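To tie the steps together, here is a sketch of one encoder layer end to end. It assumes the positional_encoding and multi_head_attention helpers from the earlier sketches are in scope, and all weights are random placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2          # position-wise ReLU MLP

def encoder_layer(X, self_attention, ffn_params):
    X = layer_norm(X + self_attention(X))                # Multi-Head Self-Attention → Add & Norm
    X = layer_norm(X + feed_forward(X, *ffn_params))     # Feed-Forward → Add & Norm
    return X

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, heads = 6, 512, 2048, 4
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)  # Embed + Position
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
ffn_params = (rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
              rng.normal(size=(d_ff, d_model)), np.zeros(d_model))

out = encoder_layer(X, lambda x: multi_head_attention(x, heads, W_q, W_k, W_v, W_o), ffn_params)
print(out.shape)   # (6, 512) — same shape in, same shape out, so layers can be stacked
```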