How Attention Works
Self-attention is the core mechanism behind every major language model — GPT, Claude, BERT. It lets every token in a sequence directly attend to every other token, deciding what context it needs. This page walks through the mechanism step by step using real attention weights extracted from BERT, from the math to multi-head patterns to encoder vs. decoder masking.
From Text to Tokens
Before a transformer sees any text, it splits it into tokens — subword pieces from a fixed vocabulary. Common words are usually one token; longer or rarer words are split into familiar fragments. Each token is then mapped to a vector before attention runs.
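As a rough illustration of subword splitting, the sketch below applies greedy longest-match over a tiny made-up vocabulary. The vocabulary, the `##` continuation prefix (borrowed from WordPiece conventions), and the name `tokenize_word` are assumptions for this example; real tokenizers learn their vocabularies from data and differ in the details.

```python
# Toy vocabulary (hypothetical). "##" marks a piece that continues a word.
VOCAB = {"the", "animal", "street", "un", "##believ", "##able", "##s", "[UNK]"}

def tokenize_word(word, vocab=VOCAB):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(tokenize_word("animal"))        # ['animal']
print(tokenize_word("unbelievable"))  # ['un', '##believ', '##able']
```

A common word stays whole, while the rarer word falls apart into fragments the vocabulary already knows.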
Word Embeddings in 3D
Before attention runs, each token is converted into a dense vector called an embedding — a point in high-dimensional space where meaning is encoded as geometry. Words with similar meanings land close together; unrelated words are far apart.
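One common way to make "close together" concrete is cosine similarity. The sketch below uses invented 4-d vectors purely for illustration; real embeddings are learned, and the numbers here are assumptions rather than values from any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings (hypothetical values, 4-d instead of 768-d).
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

cat_dog = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
cat_car = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["car"])
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

With these toy vectors, "cat" and "dog" point in nearly the same direction while "cat" and "car" do not.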
Interactive figure: word embeddings projected from 768 dimensions into 3D.
The Problem Attention Solves
Read this sentence: "The animal didn't cross the street because it was too tired." What does "it" refer to — the animal or the street? You resolved that instantly, but how? Your brain didn't scan every word equally. It attended selectively, weighting animal and tired heavily when interpreting it.
Older architectures like RNNs read words one at a time, left to right. By the time the model reached "it," the signal from "animal" had passed through many intermediate steps and faded. The transformer fixes this with self-attention: every token computes a direct connection to every other token simultaneously, so "it" can attend to "animal" in a single step regardless of how far apart they are.
From Token Embeddings to Q, K, V
Before attention can run, each token needs to be turned into something the model can compare. Every token starts as a 768-dimensional embedding, a vector learned during training that encodes its identity. The model then applies three independent learned linear projections to produce Query, Key, and Value vectors: Q = x·WQ, K = x·WK, V = x·WV, where x is the token's embedding.
Each projection produces a 768-d vector, the same size as the input. That vector is then reshaped into 12 heads of 64 dimensions each (12 × 64 = 768), so every attention head gets its own independent 64-d Q, K, and V to work with. Each head's 64-d vector is not an arbitrary slice of the original embedding; it comes from a learned transformation that mixes all 768 input dimensions together.
After each head runs its attention computation independently, the 12 resulting 64-d vectors are concatenated back into 768-d and passed through a final output projection WO — which learns how to mix the heads together. For a closer look at how that fits into the full transformer block, see the transformer architecture visualizer.
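The project, split, attend, concatenate, and mix pipeline described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: tiny dimensions (8-d model, 2 heads) stand in for 768-d and 12 heads, weights are random rather than learned, bias terms are omitted, and all function names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def split_heads(x, n_heads):
    """Cut each d_model-wide row into n_heads chunks of width d_head."""
    d_head = len(x[0]) // n_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x] for h in range(n_heads)]

def head_attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_head = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_head) for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(d_head)])
    return out

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    # Full-width projections first; each one mixes all input dimensions.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    heads = [
        head_attention(qh, kh, vh)
        for qh, kh, vh in zip(split_heads(q, n_heads), split_heads(k, n_heads), split_heads(v, n_heads))
    ]
    # Concatenate per-head outputs back to d_model, then mix them with W_O.
    concat = [[value for head in heads for value in head[t]] for t in range(len(x))]
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
D_MODEL, N_HEADS, SEQ_LEN = 8, 2, 3  # stand-ins for 768, 12, and a real sentence
x = random_matrix(SEQ_LEN, D_MODEL, rng)
w_q, w_k, w_v, w_o = (random_matrix(D_MODEL, D_MODEL, rng) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, N_HEADS)
print(len(out), len(out[0]))  # 3 8
```

The output has the same shape as the input, which is what lets attention layers stack.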
How Attention Works Mechanically
Every token produces three vectors, a Query, a Key, and a Value, by multiplying its embedding with three learned weight matrices, WQ, WK, and WV. Each vector plays a different role:
Query: What am I looking for? "it" asks: who is tired?
Key: What do I represent? "animal" says: I am a subject.
Value: What do I contribute? "animal" contributes its meaning.
To score how much token A attends to token B, we compute the dot product of A's Query and B's Key, divided by √d (the square root of the head dimension, here √64 = 8) to prevent scores from growing too large. A softmax converts scores into probabilities, and each token's output is a weighted sum of all Value vectors.
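The three steps (score, softmax, weighted sum) can be traced for a single query token with small hand-checkable numbers. The vectors below are invented for illustration, and the function name `attend` is hypothetical.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query against all keys."""
    d = len(query)
    # 1. Score: dot product of the query with each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # 2. Softmax: turn scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # 3. Output: weighted sum of the value vectors.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return scores, weights, output

# Made-up 4-d query and keys, 2-d values (hypothetical numbers).
query = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

scores, weights, output = attend(query, keys, values)
print(scores)  # [1.0, 0.0, 0.5]
print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```

The key that matches the query best gets the largest weight, so its value dominates the output.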
The walkthrough below steps through the computation for "it" using real BERT attention weights; different heads specialize in different relationships.
Query: it
Each cell is one dimension of the 64-d Query vector. Indigo = positive, slate = negative. Dimensions where this Query and a token's Key share the same sign contribute positively to the similarity score.
Keys: all tokens
Scan vertically: where a Key column looks similar to the Query above (same color pattern), the dot product will be high and that token will attract more attention weight.
Q · K / √d
Raw values before softmax. Can be positive or negative.
softmax(scores)
Probabilities summing to 100%. Same ordering as scores.
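A small experiment can show why the scores are divided by √d before the softmax. The sketch below uses random Gaussian vectors as stand-ins for real Query and Key vectors (an assumption; actual BERT vectors are not reproduced here) and compares how peaked the softmax is with and without scaling.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def argsort(xs):
    """Indices that would sort xs in ascending order."""
    return sorted(range(len(xs)), key=lambda i: xs[i])

rng = random.Random(0)
d = 64  # head dimension
query = [rng.gauss(0.0, 1.0) for _ in range(d)]
keys = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(5)]

raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
scaled = [s / math.sqrt(d) for s in raw]

probs_raw = softmax(raw)
probs_scaled = softmax(scaled)
print("largest weight without scaling:", round(max(probs_raw), 3))
print("largest weight with scaling:   ", round(max(probs_scaled), 3))
```

Scaling never changes which token ranks highest; it only keeps the softmax from collapsing onto a single token when d is large.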
Each token projects into a Value vector. The output for "it" is the weighted sum of all Value vectors using the weights above: the token's new representation is no longer just itself but a mixture of the whole sentence, weighted by relevance.
output(it) = Σᵢ wᵢ · V(tokenᵢ)
= 37.3%·V(the) + 45.6%·V(animal) + 3.9%·V(didn) + 1.6%·V(') + 2.4%·V(t) + 0.9%·V(cross) + 4.2%·V(the) + 1.7%·V(street) + 0.2%·V(because) + 1.0%·V(it) + 0.2%·V(was) + 0.3%·V(too) + 0.7%·V(tired)
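The weighted sum above can be reproduced in a few lines. The attention weights below are the ones listed on this page; the Value vectors are random 4-d placeholders, since the real 64-d vectors are not given here.

```python
import random

TOKENS = ["the", "animal", "didn", "'", "t", "cross", "the",
          "street", "because", "it", "was", "too", "tired"]
# Attention weights for "it", in percent, as listed above.
WEIGHTS_PERCENT = [37.3, 45.6, 3.9, 1.6, 2.4, 0.9, 4.2, 1.7, 0.2, 1.0, 0.2, 0.3, 0.7]
weights = [p / 100.0 for p in WEIGHTS_PERCENT]

# Placeholder Value vectors (hypothetical; real ones come from the model).
rng = random.Random(0)
D_VALUE = 4
values = [[rng.uniform(-1.0, 1.0) for _ in range(D_VALUE)] for _ in TOKENS]

# output(it) = sum_i w_i * V(token_i), computed one dimension at a time.
output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(D_VALUE)]

top_weight, top_token = max(zip(weights, TOKENS))
print("weights sum to", round(sum(weights), 6))
print("strongest contributor:", top_token, top_weight)
print("output(it):", [round(o, 3) for o in output])
```

Because the weights are non-negative and sum to 1, each output dimension lies between the smallest and largest Value entry for that dimension.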
Multi-Head Attention
We can lay out every token-to-token attention score as a grid: each row is a token asking a question, each column is a token being considered as an answer, and a bright cell means strong attention. But a single head can only focus on one type of relationship at a time. Transformers run many heads in parallel — each with its own independent WQ, WK, WV weights — letting different heads specialize: one might track coreference, another verb–noun dependencies, another local word order. The four grids below show individual heads, each automatically labeled by the pattern it forms.
Head 1: Coreference
"it" attends strongly to "animal." Across examples this head consistently links pronouns and repeated nouns back to their antecedents — it is the closest thing BERT has to a dedicated reference-tracking head.
Head 4: Verb dependency
"the" and "street" both point strongly to "cross." This head pulls article and noun tokens toward the main verb of their clause — mapping out which words belong to the same predicate.
Head 8: Clause bridging
"was" attends back to "didn't" across the "because" boundary. This head often links the two verb phrases of a complex clause, building long-range structural connections rather than local ones.
Head 12: Local chain
Strong backward attention throughout — each token attends heavily to its left neighbor. This head is tracking local word order rather than meaning. It is the most positional of the four.
Encoder vs. Decoder
BERT, the model shown on this page, is an encoder-only transformer. When processing a sentence, every token can attend to every other token simultaneously — left, right, and across any distance. That bidirectional view makes encoders excellent at understanding tasks: named entity recognition, question answering, sentence classification.
The models you interact with day-to-day (GPT, Claude, Llama, Mistral) are decoder-only. They generate text one token at a time, left to right. At generation time the tokens to the right don't exist yet, and during training they are masked out so the model cannot peek at the answer it is supposed to predict. The attention matrix becomes lower-triangular: a token can only attend to itself and everything to its left.
Encoder (BERT): bidirectional, all tokens visible. A row token can attend anywhere, left or right.
Decoder (GPT-2): causal mask, left context only. Single head (layer 11, head 8); muted cells are masked to −∞.
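The lower-triangular pattern can be produced by setting disallowed scores to −∞ before the softmax. The sketch below uses all-zero scores so the effect of the mask is easy to see; the function names are hypothetical.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, keep):
    """Softmax over one row, with masked positions forced to -inf first."""
    masked = [s if allowed else -math.inf for s, allowed in zip(scores, keep)]
    m = max(masked)  # finite, because every token may attend to itself
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

n = 4
scores = [[0.0] * n for _ in range(n)]  # uniform scores for illustration
mask = causal_mask(n)
weights = [masked_softmax(row, keep) for row, keep in zip(scores, mask)]

for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, row i spreads its attention evenly over positions 0 through i and gives exactly zero weight to everything on its right.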
This causal constraint is a feature, not a limitation. Training a decoder is self-supervised: given any text, the model learns by predicting the next token using only what came before. No labels required. This objective scales to internet-scale corpora, which is why decoder-only models have dominated the scaling curve.
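To see why no labels are needed, note that any token sequence already contains its own training targets. The sketch below builds (context, next token) pairs from a short example; the helper name `next_token_pairs` is hypothetical, and real training pipelines handle all positions of a sequence in one batched pass rather than building a Python list.

```python
def next_token_pairs(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["the", "animal", "was", "tired"]
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

A sequence of n tokens yields n − 1 training examples, each using only left context, which is exactly what the causal mask enforces.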
A third family, encoder-decoder (T5, BART, the original “Attention Is All You Need” translation model), combines both: the encoder reads the full input bidirectionally; the decoder generates output while cross-attending to it. Translation and summarization are natural fits, though very large decoder-only models now match or exceed them on most benchmarks.
Generating One Token at a Time
A decoder doesn't output a whole sentence at once. At every step it runs a real forward pass over the full context and produces a probability distribution over the entire vocabulary. Click any bar to sample that token — the next distribution is computed live from GPT-2 running in your browser. To see how all the pieces fit together, visit the transformer architecture visualizer.
Pick different tokens at each step and watch the distributions shift. This is why temperature matters: lowering it concentrates probability on the tallest bars, making output more deterministic; raising it spreads mass across the distribution, introducing more variety.
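Temperature is typically applied by dividing the model's raw scores (logits) by T before the softmax. The sketch below uses made-up logits for a four-token vocabulary; the function names are hypothetical, and this is not the page's GPT-2 demo code.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four candidate tokens
cold = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)

for name, probs in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in probs])

rng = random.Random(0)
choice = sample_token(neutral, rng)
print("sampled index:", choice)
```

Lower temperatures push more probability onto the top candidate, and higher temperatures flatten the distribution so less likely tokens get sampled more often.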