Interactive visualization of attention mechanisms
Convert input text into tokens. Real Transformers typically use subword methods such as BPE (Byte-Pair Encoding) to handle rare and unknown words.
Real transformers use subword tokenization (BPE). "unhappiness" might become ["un", "happiness"] or ["un", "hap", "pi", "ness"]. This simulation uses simple word-level tokenization for clarity.
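A minimal sketch of the word-level step this simulation describes; the tokenize helper and its regex are illustrative assumptions, not the demo's actual code.

```python
import re

def tokenize(text: str) -> list[str]:
    """Word-level tokenization: lowercase and keep alphanumeric runs.
    A real transformer would instead apply a trained subword (BPE) vocabulary."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("The cat sat on the mat."))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```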
Each token is converted to a dense vector. Similar words have similar embeddings.
Embeddings are learned during training. Words used in similar contexts end up with similar vectors. "king" - "man" + "woman" ≈ "queen" illustrates the semantic structure these vectors capture.
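A minimal sketch of an embedding lookup, assuming a toy vocabulary; the names vocab and embedding_table are illustrative, and the table is randomly initialized here as a stand-in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8                                    # toy embedding size (real models use 512+)
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in practice; random here

def embed(tokens):
    # Look up one d_model-dimensional vector per token.
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]                # shape: (seq_len, d_model)

def cosine(a, b):
    # In a trained model, related words score close to 1.0; with random vectors it is near 0.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x = embed(["the", "cat", "sat"])
print(x.shape)                                 # (3, 8)
print(cosine(embedding_table[vocab["cat"]], embedding_table[vocab["mat"]]))
```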
Since attention itself is permutation-invariant (it has no notion of word order), we add sinusoidal position information to the embeddings.
Sine/cosine waves at different frequencies create unique patterns for each position. The model can learn to attend to relative positions because PE(pos+k) can be expressed as a linear function of PE(pos).
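A small sketch of the standard sinusoidal encoding from "Attention Is All You Need"; the positional_encoding helper name is illustrative.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = positional_encoding(seq_len=6, d_model=8)
# Added to the token embeddings before the first attention layer: x = embed(tokens) + pe
print(pe.shape)                                    # (6, 8)
```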
The core mechanism: compute attention weights by taking a softmax over scaled Query-Key dot products, then use those weights to form a weighted sum of the Values.
Each token's embedding is projected into three vectors: Query (what am I looking for?), Key (what do I contain?), Value (what information do I have?).
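A minimal NumPy sketch of scaled dot-product self-attention following the description above; the projection matrices are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """weights = softmax(Q K^T / sqrt(d_k)); output = weights @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # weighted sum of the Values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))        # token representations
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape, attn.shape)                   # (4, 8) (4, 4)
```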
Run multiple attention operations in parallel, each learning different relationship types.
Each head can specialize: one might learn syntax, another semantics, another coreference. The outputs are concatenated and projected back to d_model dimensions.
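A compact sketch of multi-head self-attention, assuming d_model splits evenly across heads; the weight matrices are random stand-ins for learned parameters and the function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, weights, num_heads):
    """Split d_model into num_heads slices, attend within each, concatenate, project."""
    W_q, W_k, W_v, W_o = weights
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # one score matrix per head
    attn = softmax(scores, axis=-1)
    heads = attn @ V                                       # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                    # project back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
weights = tuple(rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, weights, num_heads).shape)   # (5, 16)
```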
See how input flows through the entire encoder layer.
Input → Embedding + Positional Encoding → Multi-Head Self-Attention → Add & LayerNorm → Feed-Forward → Add & LayerNorm → Output
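A minimal sketch of that flow, using single-head attention for brevity and the post-norm layout shown above; all weights are random stand-ins for learned parameters and the parameter names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 32

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean / unit variance (gain and bias omitted).
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def self_attention(x, W_q, W_k, W_v):
    # Single head for brevity; the multi-head version slots in the same way.
    scores = (x @ W_q) @ (x @ W_k).T / np.sqrt(d_model)
    return softmax(scores) @ (x @ W_v)

def feed_forward(x, W1, W2):
    # Position-wise two-layer MLP with ReLU, applied to each token independently.
    return np.maximum(0, x @ W1) @ W2

def encoder_layer(x, p):
    x = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"]))  # Add & LayerNorm
    x = layer_norm(x + feed_forward(x, p["W1"], p["W2"]))                # Add & LayerNorm
    return x

params = {
    "W_q": rng.normal(size=(d_model, d_model)),
    "W_k": rng.normal(size=(d_model, d_model)),
    "W_v": rng.normal(size=(d_model, d_model)),
    "W1": rng.normal(size=(d_model, d_ff)),
    "W2": rng.normal(size=(d_ff, d_model)),
}
x = rng.normal(size=(seq_len, d_model))   # embeddings + positional encoding
print(encoder_layer(x, params).shape)     # (5, 16)
```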