
Transformer Simulation

Interactive visualization of attention mechanisms

Step 1: Tokenization

Convert the input text into tokens. Real Transformers use BPE (Byte-Pair Encoding) so that unknown words can still be represented as subwords.

Tokens

How Tokenization Works

Real transformers use subword tokenization (BPE). "unhappiness" might become ["un", "happiness"] or ["un", "hap", "pi", "ness"]. This simulation uses simple word-level tokenization for clarity.
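As a concrete illustration, here is a minimal sketch of the word-level tokenization used in this simulation (not real BPE). The helper names and the on-the-fly vocabulary are our own, not part of any library:

```python
# Word-level tokenization sketch (real Transformers use subword BPE).
def tokenize(text):
    """Split on whitespace and strip basic punctuation."""
    return [w.strip(".,!?").lower() for w in text.split()]

def build_vocab(tokens):
    """Assign each unique token an integer id, reserving 0 for <unk>."""
    vocab = {"<unk>": 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

tokens = tokenize("The cat sat on the mat.")
vocab = build_vocab(tokens)
ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ids)     # [1, 2, 3, 4, 1, 5]
```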

Step 2: Token Embeddings

Each token is converted to a dense vector. Similar words have similar embeddings.

Token Embeddings (each token → d_model-dimensional vector)
Embedding: token → ℝ^d_model (d_model = 512 in the original paper)

Learned Representations

Embeddings are learned during training. Words used in similar contexts end up with similar vectors. "king" - "man" + "woman" ≈ "queen" demonstrates the semantic structure captured.
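Below is a minimal sketch of an embedding lookup, continuing the toy example above (ids = [1, 2, 3, 4, 1, 5] for "the cat sat on the mat"). The table here is random rather than learned, so the similarity structure described above would only emerge after training:

```python
import numpy as np

# Embedding lookup sketch; the table is a random placeholder.
vocab_size, d_model = 6, 512
ids = [1, 2, 3, 4, 1, 5]

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))   # (vocab_size, d_model)
X = embedding_table[ids]                                   # (seq_len, d_model): one row per token
print(X.shape)                                             # (6, 512)

# Cosine similarity is the usual way to measure "similar embeddings".
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(X[0], X[4]))   # 1.0 — both positions hold the same token "the"
```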

Step 3: Positional Encoding

Since attention by itself has no notion of token order (it is permutation-invariant), we add sinusoidal position information to the embeddings.

Positional Encoding Matrix (position × dimension)
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))   |   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Why Sinusoids?

Sine/cosine waves at different frequencies create unique patterns for each position. The model can learn to attend to relative positions because PE(pos+k) can be expressed as a linear function of PE(pos).
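A short NumPy sketch of this encoding, implementing the formula above (even dimensions get sine, odd dimensions get cosine):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))    # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = positional_encoding(max_len=6, d_model=512)
# The encoding is simply added to the token embeddings:
# X = X + pe[:X.shape[0]]
```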

Step 4: Scaled Dot-Product Attention

The core mechanism: compute attention weights from Query-Key dot products, then weight the Values.

Q, K, V → QK^T → Scale → Softmax → × V
Query, Key, Value Matrices
Attention(Q, K, V) = softmax(QK^T / √d_k) V

Creating Q, K, V

Each token's embedding is projected into three vectors: Query (what am I looking for?), Key (what do I contain?), Value (what information do I have?).
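Here is a minimal NumPy sketch of the whole computation, following the formula above. The projection matrices W_q, W_k, W_v are random placeholders for what would be learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # project embeddings into Q, K, V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ V, weights                    # weighted sum of the Values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 512, 64
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, attn = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)   # (6, 64) (6, 6)
```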

Step 5: Multi-Head Attention

Run multiple attention operations in parallel, each learning different relationship types.

d_model = 512   |   Heads = 4   |   d_k per head = 128
Attention Patterns by Head
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

Why Multiple Heads?

Each head can specialize: one might learn syntax, another semantics, another coreference. The outputs are concatenated and projected back to d_model dimensions.
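A sketch with the numbers shown above (d_model = 512 split into 4 heads of d_k = 128 each). The softmax and single-head attention helpers mirror the previous sketch and are repeated so this block runs on its own; all weights are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, heads, W_q, W_k, W_v, W_o):
    d_model = X.shape[-1]
    d_k = d_model // heads
    head_outputs = []
    for h in range(heads):
        sl = slice(h * d_k, (h + 1) * d_k)           # this head's slice of each projection
        head_outputs.append(attention(X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]))
    concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, heads * d_k) = (seq_len, d_model)
    return concat @ W_o                              # project back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, heads = 6, 512, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, heads, W_q, W_k, W_v, W_o).shape)   # (6, 512)
```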

Step 6: Complete Transformer Flow

See how input flows through the entire encoder layer.


Encoder Layer Structure

Input → Embedding + Positional → Multi-Head Self-Attention → Add & LayerNorm → Feed-Forward → Add & LayerNorm → Output
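To tie the steps together, here is a sketch of one encoder layer end to end. It assumes the positional_encoding and multi_head_attention helpers from the earlier sketches are in scope, and all weights are random placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2          # position-wise ReLU MLP

def encoder_layer(X, self_attention, ffn_params):
    X = layer_norm(X + self_attention(X))                # Multi-Head Self-Attention → Add & Norm
    X = layer_norm(X + feed_forward(X, *ffn_params))     # Feed-Forward → Add & Norm
    return X

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, heads = 6, 512, 2048, 4
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)  # Embed + Position
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
ffn_params = (rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
              rng.normal(size=(d_ff, d_model)), np.zeros(d_model))

out = encoder_layer(X, lambda x: multi_head_attention(x, heads, W_q, W_k, W_v, W_o), ffn_params)
print(out.shape)   # (6, 512) — same shape in, same shape out, so layers can be stacked
```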