Attention Is All You Need

The paper that introduced the Transformer architecture and revolutionized NLP

πŸ‘₯ Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin πŸ“… June 2017 (arXiv) Β· NeurIPS 2017 πŸ›οΈ Google Brain & Google Research
  • Model Dimension: 512
  • Attention Heads: 8
  • Encoder Layers: 6
  • Decoder Layers: 6
  • Parameters (Base): 65M
  • BLEU (EN-FR): 41.8

Complete Illustrated Analysis

The foundational paper behind GPT, BERT, and modern large language models

Chapter 1

Introduction: Why Attention Matters

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
πŸ’‘ Analysis

The Revolutionary Claim

This single sentence upends decades of sequence modeling. Before 2017, the dominant paradigm was: RNNs for sequences, CNNs for images. The authors boldly claim attention alone is sufficientβ€”no recurrence, no convolutions.

Why This Matters

  • Recurrence: Processes tokens sequentially (slow, hard to parallelize)
  • Convolutions: Fixed receptive field (limited long-range dependencies)
  • Attention: Direct connections between any positions (parallelizable, global context)
Recurrent models typically factor computation along the symbol positions of the input and output sequences. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths.
πŸ’‘ Analysis

The Sequential Bottleneck

RNNs process tokens one-by-one: token 5 must wait for tokens 1-4 to finish. With 1000 tokens, you need 1000 sequential steps. This is fundamentally incompatible with GPU parallelism.

The Math Problem

Training time scales as O(n) for sequence length n. For long documents, this becomes prohibitive. Modern LLMs process 100K+ tokensβ€”impossible with sequential processing.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models, allowing modeling of dependencies without regard to their distance in the input or output sequences.
πŸ’‘ Analysis

Distance-Independent Dependencies

In RNNs, connecting token 1 to token 1000 requires information to flow through 999 intermediate stepsβ€”gradients vanish. With attention, token 1 directly attends to token 1000 in a single operation.

Historical Context

Attention was already used in seq2seq models (Bahdanau 2014), but always alongside RNNs. This paper asks: what if attention is the only mechanism?

Key Concept
The Transformer eliminates recurrence entirely, using only attention mechanisms. This enables full parallelization during training and direct modeling of long-range dependencies regardless of distance.
Chapter 2

Background: The Limitations of RNNs and CNNs

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet, and ConvS2S, all of which use convolutional neural networks as basic building block.
πŸ’‘ Analysis

Prior Attempts at Parallelism

Others tried CNNs for sequencesβ€”convolutions are parallelizable! But CNNs have fixed receptive fields. To connect distant tokens, you need many stacked layers, each adding computational cost.

The Trade-off

  • ConvS2S: O(n) operations to relate positions at distance n; ByteNet: O(log n)
  • Transformer: O(1) operations for any distance
In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet.
πŸ’‘ Analysis

Computational Complexity Comparison

Model         Operations for Distance n
RNN           O(n), sequential
ConvS2S       O(n), parallel
ByteNet       O(log n), parallel
Transformer   O(1), parallel

The Transformer achieves constant-time dependency modelingβ€”a fundamental improvement.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
πŸ’‘ Analysis

Self-Attention Defined

Traditional attention: decoder attends to encoder. Self-attention: a sequence attends to itself. Every position can directly "look at" every other position in the same sequence.

Why "Self"?

The query, key, and value all come from the same sequence. Position 5 asks "what's relevant to me?" and gets weighted information from positions 1, 2, 3, 4, 6, 7, etc.

Key Concept
Self-attention allows every position to directly attend to every other position in O(1) operations, eliminating the distance-dependent computational cost of RNNs and CNNs.
Chapter 3

Model Architecture Overview

Figure 1: The Transformer model architecture. The encoder (left) processes input through N=6 identical layers of self-attention and feed-forward networks. The decoder (right) adds encoder-decoder attention.
The Transformer follows an encoder-decoder structure using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.
πŸ’‘ Analysis

The Big Picture

Despite the revolutionary attention mechanism, the overall structure is familiar: encoder-decoder, just like seq2seq. The innovation is what's inside each component.

Two Components

  • Encoder: Processes input sequence β†’ contextual representations
  • Decoder: Generates output sequence using encoder output + previous outputs
Architecture Overview
Input Sequence
      ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚      ENCODER      β”‚  ← 6 identical layers
β”‚  (Self-Attention  β”‚     Each: Self-Attention + FFN
β”‚   + Feed-Forward) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚
          ↓ (Keys, Values)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚      DECODER      β”‚  ← 6 identical layers
β”‚  (Masked Self-    β”‚     Each: Masked Self-Attention
β”‚   Attention +     β”‚           + Encoder-Decoder Attention
β”‚   Cross-Attention β”‚           + FFN
β”‚   + Feed-Forward) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↓
    Output Sequence
The encoder maps an input sequence of symbol representations (x₁, ..., xβ‚™) to a sequence of continuous representations z = (z₁, ..., zβ‚™). Given z, the decoder then generates an output sequence (y₁, ..., yβ‚˜) of symbols one element at a time.
πŸ’‘ Analysis

The Two-Phase Process

  1. Encoding: All input tokens processed in parallel β†’ rich contextual vectors
  2. Decoding: Output generated autoregressively (one token at a time)

Why Autoregressive Decoding?

Each output token depends on previous outputs. "The cat sat on the ___" β†’ "mat" depends on knowing "cat" and "sat" came before.

At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
πŸ’‘ Analysis

Auto-regressive Generation

This is exactly how ChatGPT works! Generate token 1, feed it back, generate token 2, feed both back, generate token 3... The Transformer decoder is the ancestor of GPT.

Training vs. Inference

  • Training: Teacher forcingβ€”use ground truth previous tokens
  • Inference: Use model's own predictions as previous tokens
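
A minimal sketch of the inference loop, assuming a hypothetical model(src_tokens, generated) callable that returns a probability distribution over the next token (not from the paper, which actually decodes with beam search; greedy argmax is shown here for clarity):

def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=100):
    generated = [bos_id]                         # start-of-sequence token
    for _ in range(max_len):
        probs = model(src_tokens, generated)     # P(next token | source, outputs so far)
        next_id = max(range(len(probs)), key=lambda i: probs[i])   # greedy argmax
        generated.append(next_id)                # feed the prediction back in
        if next_id == eos_id:                    # stop at end-of-sequence
            break
    return generated
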
Key Concept
The Transformer uses an encoder-decoder structure where the encoder processes all input in parallel, and the decoder generates output auto-regressively, one token at a time.
Chapter 4

The Encoder Stack

The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
πŸ’‘ Analysis

Encoder Layer Anatomy

Each of the 6 encoder layers contains exactly two components:

  1. Multi-Head Self-Attention: Lets each position attend to all positions
  2. Feed-Forward Network: Applies same MLP to each position independently

Why 6 Layers?

Empirically chosen. More layers = more capacity but more compute. 6 was the sweet spot for translation tasks. Modern LLMs use 32-96+ layers.

Single Encoder Layer
        Input (512-dim per position)
              ↓
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Multi-Head         β”‚
    β”‚  Self-Attention     β”‚ ← 8 heads, each 64-dim
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
         Add & Norm        ← Residual connection + LayerNorm
              ↓
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Feed-Forward       β”‚
    β”‚  Network            β”‚ ← 512 β†’ 2048 β†’ 512
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
         Add & Norm        ← Residual connection + LayerNorm
              ↓
        Output (512-dim per position)
We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)).
πŸ’‘ Analysis

Residual Connections

From ResNet (2015): add input to output. This creates "skip connections" that help gradients flow during backpropagation through deep networks.

Layer Normalization

Normalizes across the feature dimension (not batch). Stabilizes training by keeping activations in a reasonable range. Critical for training deep Transformers.

The Formula

output = LayerNorm(x + Sublayer(x))
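
A minimal NumPy sketch of this Add & Norm step (the learned gain and bias of LayerNorm are omitted; an illustration, not the authors' code):

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)        # learned scale/shift omitted for brevity

def add_and_norm(x, sublayer_out):
    return layer_norm(x + sublayer_out)    # residual connection, then LayerNorm

x = np.random.randn(5, 512)
out = add_and_norm(x, np.random.randn(5, 512))   # shape (5, 512)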

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
πŸ’‘ Analysis

Consistent Dimensionality

Every vector throughout the model is 512-dimensional. This uniformity enables residual connections (you can only add vectors of the same size) and simplifies the architecture.

Scaling Up

The "Big" Transformer uses d_model = 1024. GPT-3 uses 12,288. The architecture scales by increasing this dimension.

Key Concept
Each encoder layer has two sub-layers (self-attention + FFN) with residual connections and layer normalization. All vectors maintain dimension d_model = 512 throughout.
Chapter 5

The Decoder Stack

The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
πŸ’‘ Analysis

Decoder = Encoder + Cross-Attention

The decoder has everything the encoder has, plus a third sub-layer: encoder-decoder attention (cross-attention). This is how the decoder "reads" the encoded input.

Three Sub-layers

  1. Masked Self-Attention: Attend to previous output positions only
  2. Encoder-Decoder Attention: Attend to encoder outputs
  3. Feed-Forward Network: Same as encoder
Single Decoder Layer
        Input (previous outputs, 512-dim)
              ↓
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Masked Multi-Head  β”‚
    β”‚  Self-Attention     β”‚ ← Can only see previous positions
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
         Add & Norm
              ↓
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Multi-Head         β”‚
    β”‚  Encoder-Decoder    β”‚ ← Q from decoder, K/V from encoder
    β”‚  Attention          β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
         Add & Norm
              ↓
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Feed-Forward       β”‚
    β”‚  Network            β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              ↓
         Add & Norm
              ↓
        Output (512-dim)
We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that output embeddings are offset by one position, ensures that predictions for position i can depend only on the known outputs at positions less than i.
πŸ’‘ Analysis

Causal Masking

The decoder can't "cheat" by looking at future tokens. When predicting position 5, it can only see positions 1-4. This is enforced by setting attention scores to -∞ for future positions (becomes 0 after softmax).

Why This Matters

During training, we process all positions in parallel for efficiency. But logically, each position should only "know" about previous positionsβ€”as it would during actual generation.

The Mask

A triangular matrix where position i can attend to positions 1...i but not i+1...n.
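
A small NumPy sketch of such a mask, added to the attention scores before the softmax (illustrative only):

import numpy as np

def causal_mask(n):
    # Positions j > i (the "future") get -inf, which softmax turns into weight 0;
    # allowed positions get 0 and are left unchanged.
    return np.triu(np.full((n, n), -np.inf), k=1)

print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]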

Key Concept
The decoder adds encoder-decoder attention to read the input, and uses masked self-attention to prevent "seeing the future" during auto-regressive generation.
Chapter 6

Scaled Dot-Product Attention

Figure 2 (Left): Scaled Dot-Product Attention. MatMul of Q and Kᡀ, scale by 1/√d_k, optional mask, softmax, then MatMul with V.
Figure 2 (Right): Multi-Head Attention. Multiple attention heads run in parallel; outputs are concatenated and projected.
We call our particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension d_k, and values of dimension d_v.
πŸ’‘ Analysis

The Three Vectors

  • Query (Q): "What am I looking for?" - the current position's question
  • Key (K): "What do I contain?" - each position's identifier
  • Value (V): "What information do I have?" - the actual content

Intuition

Like a database: Query searches for matching Keys, then retrieves corresponding Values. High QΒ·K similarity β†’ more weight on that V.

The Attention Formula: Attention(Q, K, V) = softmax(QKᡀ / √d_k) V
We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.
πŸ’‘ Analysis

Step-by-Step Breakdown

  1. QKᡀ: Dot products give similarity scores (an nΓ—n matrix)
  2. ÷ √d_k: Scale down to prevent extreme values
  3. softmax: Convert to probabilities (sum to 1)
  4. Γ— V: Weighted sum of value vectors

Output

Each position gets a weighted combination of all Values, where weights reflect Query-Key similarity.
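
A minimal NumPy sketch of these four steps (an illustration of the formula above, not the authors' code):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # steps 1-2: similarities, scaled by 1/√d_k
    if mask is not None:
        scores = scores + mask          # optional: -inf on disallowed positions
    weights = softmax(scores)           # step 3: each row sums to 1
    return weights @ V                  # step 4: weighted sum of value vectors

# Toy usage: 5 positions, d_k = d_v = 64
Q = K = V = np.random.randn(5, 64)
out = scaled_dot_product_attention(Q, K, V)   # shape (5, 64)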

We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/√d_k.
πŸ’‘ Analysis

Why Scale?

With d_k = 64, dot products of random vectors have variance ~64. Large values β†’ softmax saturates β†’ gradients vanish. Dividing by √64 = 8 keeps variance ~1.

The Math

If q, k are random with variance 1, then q·k has variance d_k. Scaling by √d_k normalizes this back to variance 1.

Practical Impact

Without scaling, training becomes unstable. This simple fix enables training deep attention networks.

Key Concept
Attention computes weighted sums of Values based on Query-Key similarity. Scaling by √d_k prevents gradient vanishing in softmax for large dimensions.
Chapter 7

Multi-Head Attention

Instead of performing a single attention function with d_model-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections.
πŸ’‘ Analysis

Multiple Perspectives

One attention head captures one type of relationship. Multiple heads capture multiple relationship types simultaneously: syntactic, semantic, positional, etc.

The Projection

512-dim vectors are projected to 8 different 64-dim spaces. Each "head" runs attention independently, then results are concatenated back to 512-dim.

Multi-Head Attention Formula: MultiHead(Q, K, V) = Concat(head₁, ..., head_h) W_O

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
πŸ’‘ Analysis

Representation Subspaces

Each head learns to project into a different "subspace" where different relationships are easier to detect. One head might find subject-verb agreement, another might find coreference.

Why Not One Big Head?

A single 512-dim attention would average across all relationship types. Multiple smaller heads can specialize, then combine their findings.

Multi-Head Attention Dimensions
Input: Q, K, V each 512-dim

For each of 8 heads:
  Q Γ— W_Q (512Γ—64) β†’ 64-dim query
  K Γ— W_K (512Γ—64) β†’ 64-dim key
  V Γ— W_V (512Γ—64) β†’ 64-dim value

  Attention output: 64-dim

Concat 8 heads: 8 Γ— 64 = 512-dim
Final projection: 512 Γ— W_O (512Γ—512) β†’ 512-dim output
In this work we employ h = 8 parallel attention heads. For each of these we use d_k = d_v = d_model/h = 64.
πŸ’‘ Analysis

The Numbers

  • h = 8: Number of attention heads
  • d_model = 512: Total model dimension
  • d_k = d_v = 64: Dimension per head (512/8)

Compute Efficiency

8 heads Γ— 64-dim = same total compute as 1 head Γ— 512-dim. Multi-head is "free" in terms of computation, but more expressive.
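
A compact NumPy sketch of self-attention with these dimensions, reusing the scaled_dot_product_attention sketch from Chapter 6; the random matrices below merely stand in for the learned projection weights:

import numpy as np

d_model, h = 512, 8
d_k = d_model // h                       # 64 dimensions per head

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    heads = []
    for i in range(h):
        q = X @ W_q[i]                   # (n, 64) projected queries
        k = X @ W_k[i]                   # (n, 64) projected keys
        v = X @ W_v[i]                   # (n, 64) projected values
        heads.append(scaled_dot_product_attention(q, k, v))
    concat = np.concatenate(heads, axis=-1)   # (n, 8 Γ— 64) = (n, 512)
    return concat @ W_o                       # final projection back to 512

# Toy usage with random matrices standing in for the learned projections
n = 5
X = np.random.randn(n, d_model)
W_q = [np.random.randn(d_model, d_k) / np.sqrt(d_model) for _ in range(h)]
W_k = [np.random.randn(d_model, d_k) / np.sqrt(d_model) for _ in range(h)]
W_v = [np.random.randn(d_model, d_k) / np.sqrt(d_model) for _ in range(h)]
W_o = np.random.randn(d_model, d_model) / np.sqrt(d_model)
out = multi_head_attention(X, W_q, W_k, W_v, W_o)     # shape (5, 512)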

Key Concept
Multi-head attention runs 8 parallel attention operations in different learned subspaces, allowing the model to capture multiple types of relationships simultaneously without additional computational cost.
Chapter 8

Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.
πŸ’‘ Analysis

"Position-wise"

The same MLP is applied independently to each position. No information flows between positions hereβ€”that's the attention layer's job. FFN processes each position's 512-dim vector in isolation.

Role of FFN

Attention aggregates information across positions. FFN processes that aggregated information, adding non-linearity and capacity. Think of it as "thinking about what attention gathered."

Feed-Forward Network: FFN(x) = max(0, xW₁ + b₁)Wβ‚‚ + bβ‚‚
This consists of two linear transformations with a ReLU activation in between. The dimensionality of input and output is d_model = 512, and the inner-layer has dimensionality d_ff = 2048.
πŸ’‘ Analysis

The Expansion

512 β†’ 2048 β†’ 512. The inner layer is 4Γ— larger! This expand-then-contract structure gives the network capacity to learn complex transformations.

Why ReLU?

ReLU(x) = max(0, x). Simple, effective non-linearity. Modern Transformers often use GELU or SwiGLU instead, but ReLU worked well here.

Parameter Count

W₁: 512Γ—2048 β‰ˆ 1M params; Wβ‚‚: 2048Γ—512 β‰ˆ 1M params. The FFN holds the majority of each layer's parameters!
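
A short NumPy sketch of this position-wise FFN (the random weights below are placeholders for learned parameters; illustration only):

import numpy as np

d_model, d_ff = 512, 2048

def position_wise_ffn(x, W1, b1, W2, b2):
    # x has shape (n, 512); the same weights are applied to every position
    hidden = np.maximum(0, x @ W1 + b1)   # expand to 2048 dims, ReLU
    return hidden @ W2 + b2               # project back down to 512 dims

W1 = 0.02 * np.random.randn(d_model, d_ff); b1 = np.zeros(d_ff)
W2 = 0.02 * np.random.randn(d_ff, d_model); b2 = np.zeros(d_model)
out = position_wise_ffn(np.random.randn(5, d_model), W1, b1, W2, b2)   # (5, 512)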

Key Concept
The feed-forward network (512β†’2048β†’512 with ReLU) is applied identically to each position, providing non-linear transformation capacity after attention has aggregated cross-position information.
Chapter 9

Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.
πŸ’‘ Analysis

The Position Problem

Attention is permutation-invariant: "cat sat mat" and "mat sat cat" would produce identical attention patterns! Without positional information, the Transformer treats input as a bag of words.

Why RNNs Don't Need This

RNNs process sequentiallyβ€”position is implicit in the order of processing. Transformers process all positions in parallel, so position must be explicitly encoded.

Sinusoidal Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
We use sine and cosine functions of different frequencies. Each dimension of the positional encoding corresponds to a sinusoid with wavelengths forming a geometric progression from 2Ο€ to 10000Β·2Ο€.
πŸ’‘ Analysis

Why Sinusoids?

  • Unique encoding: Each position gets a distinct 512-dim vector
  • Bounded values: Always between -1 and 1
  • Relative positions: PE(pos+k) can be represented as a linear function of PE(pos)
  • Extrapolation: Works for positions not seen during training

The Geometric Progression

Low dimensions have high-frequency waves (distinguish nearby positions). High dimensions have low-frequency waves (distinguish distant positions).
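
A direct NumPy transcription of the two formulas above (illustrative sketch):

import numpy as np

def positional_encoding(max_len, d_model=512):
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]               # positions 0..max_len-1
    dim = np.arange(0, d_model, 2)[None, :]         # even dimension indices 2i
    angle = pos / np.power(10000.0, dim / d_model)  # pos / 10000^(2i/d_model)
    pe[:, 0::2] = np.sin(angle)                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                     # odd dimensions: cosine
    return pe                                       # added to the token embeddings

pe = positional_encoding(100)   # each row is a distinct 512-dim position vector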

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_pos+k can be represented as a linear function of PE_pos.
πŸ’‘ Analysis

Relative Position Attention

The model can learn "attend to 3 positions back" as a linear transformation: by the angle-addition identities, sin(pos+k) and cos(pos+k) are linear combinations of sin(pos) and cos(pos) with coefficients that depend only on k.

Alternative: Learned Positions

The paper also tested learned positional embeddingsβ€”nearly identical results. But sinusoidal allows extrapolation to longer sequences than seen during training.

Key Concept
Sinusoidal positional encodings inject position information using sine/cosine waves at different frequencies. This enables learning relative position patterns and generalizing to unseen sequence lengths.
Chapter 10

Training Details

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences.
πŸ’‘ Analysis

The Datasets

  • EN-DE: 4.5M sentence pairs (smaller, harder)
  • EN-FR: 36M sentence pairs (larger, easier)

WMT Benchmark

Workshop on Machine Translationβ€”the standard benchmark for translation quality. Results are measured in BLEU score (higher = better).

We trained the base models for 100,000 steps or 12 hours on 8 NVIDIA P100 GPUs. The big models were trained for 300,000 steps (3.5 days).
πŸ’‘ Analysis

Training Time Revolution

12 hours to train a state-of-the-art translation model! Previous RNN-based models took weeks. The parallelization advantage is massive.

Hardware Context (2017)

8Γ— P100 GPUs was high-end hardware. Today's LLMs train on thousands of GPUs for months, but the efficiency breakthrough started here.

We used the Adam optimizer with β₁ = 0.9, Ξ²β‚‚ = 0.98 and Ξ΅ = 10⁻⁹. We varied the learning rate over the course of training using a warmup schedule.
πŸ’‘ Analysis

The Warmup Schedule

Learning rate increases linearly for warmup_steps, then decreases proportionally to 1/√step. This prevents early training instability.

Why Warmup?

Early in training, gradients are noisy and can be large. Starting with a small learning rate prevents the model from making huge, incorrect updates before it "settles down."
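
The paper's schedule is lrate = d_model^(-0.5) Β· min(step^(-0.5), step Β· warmup_steps^(-1.5)) with warmup_steps = 4000; a minimal Python sketch:

def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps, then decay proportional to 1/sqrt(step)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises until step 4000, then slowly decays:
# transformer_lr(400) < transformer_lr(4000) > transformer_lr(40000)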

Training Configuration
Base Model:
  d_model = 512, heads = 8, layers = 6
  d_ff = 2048, dropout = 0.1
  ~65M parameters
  Training: 100K steps, 12 hours, 8Γ— P100

Big Model:
  d_model = 1024, heads = 16, layers = 6
  d_ff = 4096, dropout = 0.3
  ~213M parameters
  Training: 300K steps, 3.5 days, 8Γ— P100
We employ three types of regularization: residual dropout on each sub-layer output (P_drop = 0.1), dropout on the sums of the embeddings and positional encodings, and label smoothing (Ξ΅_ls = 0.1).
πŸ’‘ Analysis

Regularization Techniques

  • Dropout: Randomly zero 10% of values β†’ prevents overfitting
  • Label Smoothing: Instead of a hard one-hot target, assign the correct class probability 0.9 and spread the remaining 0.1 over the other classes β†’ improves generalization

Label Smoothing Effect

Hurts perplexity (model is less confident) but improves BLEU (actual translation quality). The model learns to be appropriately uncertain.
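
A minimal sketch of one common way to build the smoothed target distribution for Ξ΅_ls = 0.1 (the paper cites the technique but does not spell out this exact variant):

import numpy as np

def smoothed_targets(true_class, vocab_size, eps=0.1):
    # Correct class gets 1 - eps; the remaining eps is spread over the other classes
    target = np.full(vocab_size, eps / (vocab_size - 1))
    target[true_class] = 1.0 - eps
    return target

print(smoothed_targets(true_class=2, vocab_size=5))
# [0.025 0.025 0.9   0.025 0.025]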

Key Concept
The base Transformer trains in just 12 hours on 8 GPUsβ€”a dramatic speedup over RNNs. Key techniques: Adam optimizer with warmup schedule, dropout, and label smoothing.
Chapter 11

Results and Experiments

On the WMT 2014 English-to-German translation task, the big transformer model outperforms the best previously reported models including ensembles by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4.
πŸ’‘ Analysis

Breaking Records

+2.0 BLEU is a huge improvement in translation quality. The Transformer didn't just match RNN ensemblesβ€”it crushed single models that took much longer to train.

Quality + Speed

Previous SOTA required ensembling multiple slow models. Transformer achieves better results with a single model in a fraction of the training time.

Model                      EN-DE BLEU   EN-FR BLEU   Training Cost
Previous SOTA (ensemble)   26.4         41.0         High
Transformer (base)         27.3         38.1         12 hours
Transformer (big)          28.4         41.8         3.5 days
On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.8, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model.
πŸ’‘ Analysis

Efficiency Breakthrough

Same quality, 4Γ— less training cost. This is the key result: Transformers aren't just better, they're dramatically more efficient to train.

The Implication

If you can train 4Γ— faster, you can iterate 4Γ— more on architecture/hyperparameters. This accelerates research. The modern LLM explosion was enabled by this efficiency.

To evaluate the importance of different components of the Transformer, we varied our base model in different ways and measured the change in translation quality on the EN-DE development set.
πŸ’‘ Analysis

Ablation Studies

The paper systematically tested: What happens if we change the number of heads? Reduce dimensions? Remove components? This tells us what actually matters.

Key Findings

  • Single-head attention hurts quality significantly
  • Reducing d_k hurts quality (attention needs capacity)
  • Bigger models = better, but with diminishing returns
  • Dropout is essential for regularization
Key Concept
The Transformer achieved new SOTA on translation benchmarks while training 4Γ— faster than previous methods. Ablations confirmed that multi-head attention and model scale are critical for performance.
Chapter 12

Conclusion and Historical Impact

Figure 3: Encoder self-attention in layer 5 of 6. Many attention heads attend to a distant dependency of the verb "making", completing the phrase "making...more difficult". Attentions visualized in different colors for different heads.
Figure 4: Two attention heads in layer 5/6 exhibiting anaphora resolution behavior. The word "its" attends strongly to "Law", showing the model learns coreference.
Figure 4 (cont.): Different sentence showing attention from "its" to the referent. Heads appear to have learned different aspects of syntax.
Figure 5: Attention heads exhibit behavior related to sentence structure. Different heads learn to perform different tasks.
Figure 5 (cont.): Full attentions for head 5-6. Notice how different heads capture different linguistic phenomena.
In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
πŸ’‘ Analysis

The Summary

The paper's contribution is architectural: prove that attention alone is sufficient for sequence transduction. No recurrence needed. This was controversial at the timeβ€”RNNs were deeply entrenched.

What "Attention Is All You Need" Means

The title is a bold claim: you don't need LSTMs, GRUs, or any recurrent structure. Attention mechanisms, when properly designed, can handle everything.

We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text.
πŸ’‘ Analysis

Prophetic Words

This understated conclusion predicted everything that followed. Transformers now dominate:

  • Language: GPT, BERT, T5, LLaMA, Claude
  • Vision: ViT, CLIP, DALL-E
  • Audio: Whisper, AudioLM
  • Multimodal: GPT-4V, Gemini
  • Protein: AlphaFold2
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. We achieved a new state of the art on both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks.
πŸ’‘ Analysis

Historical Impact (2017-2025)

Year   Milestone
2017   Transformer paper published
2018   BERT revolutionizes NLP understanding
2018   GPT-1 shows generative potential
2019   GPT-2 demonstrates emergent abilities
2020   GPT-3 (175B params) stuns the world
2022   ChatGPT brings AI to mainstream adoption
2023   GPT-4 and Claude 2 show reasoning capabilities
2024   Multimodal AI becomes standard
Final Insight
This 2017 paper laid the foundation for all modern large language models. The Transformer architectureβ€”attention, feed-forward networks, positional encodingβ€”remains the blueprint for GPT-4, Claude, and beyond. "Attention Is All You Need" may be the most influential machine learning paper of the decade.

Technical Specifications

Complete architecture parameters and configurations

Model Configurations

Transformer Base

Layers (N): 6
Model Dimension (d_model): 512
Feed-Forward Dimension (d_ff): 2048
Attention Heads (h): 8
Key/Value Dimension (d_k, d_v): 64
Dropout: 0.1
Parameters: 65M

Transformer Big

Layers (N): 6
Model Dimension (d_model): 1024
Feed-Forward Dimension (d_ff): 4096
Attention Heads (h): 16
Key/Value Dimension (d_k, d_v): 64
Dropout: 0.3
Parameters: 213M

Glossary

Attention: Mechanism for computing weighted sums based on query-key similarity
Self-Attention: Attention where Q, K, V all come from the same sequence
Multi-Head Attention: Running h parallel attention operations in different subspaces
Query (Q): Vector representing "what am I looking for?"
Key (K): Vector representing "what do I contain?" for matching
Value (V): Vector containing actual information to retrieve
Positional Encoding: Sinusoidal vectors added to embeddings to encode position
Layer Normalization: Normalizing across features to stabilize training
Residual Connection: Adding input to output: y = x + f(x)
BLEU Score: Bilingual Evaluation Understudyβ€”measures translation quality
Encoder: Component that processes the input sequence into representations
Decoder: Component that generates the output sequence autoregressively