Attention: how does a model decide where to look?

When a model reads a sentence, it cannot interpret each word in isolation. If a sentence says ‘it works well,’ the model needs the earlier context to know what ‘it’ refers to. Attention is the mechanism that performs that search.

Query: The current token asks: what information do I need right now?

Key: Other tokens provide an index: what clues do I contain?

Value: The actual information passed along when a token is judged relevant.

What are attention weights?

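The model compares the Query with each Key, typically with a dot product, and normalizes the resulting scores so that the weights sum to 1. The better a Key matches, the higher its weight; the higher the weight, the more that token's Value influences the current representation.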
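The comparison described above can be sketched in a few lines. This is a minimal illustration, assuming the common scaled dot-product formulation with a softmax; the text does not name a specific scoring function, and the toy vectors and the helper names `softmax`, `dot`, and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Return (weights, output) for one Query against a list of Keys and Values."""
    scale = math.sqrt(len(query))  # assumed scaling by sqrt of the vector size
    scores = [dot(query, key) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is the weighted average of the Values.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical toy vectors: one Query and three earlier tokens.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The first Key points in the same direction as the Query, so it receives the largest weight and its Value dominates the output; the third Key points the opposite way and contributes least.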

These relationships are not hand-written rules. They are learned from large amounts of text, where the model discovers which words tend to explain, limit, or complete each other.

Why use multiple heads?

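A sentence can contain many relationships at once: subject-verb links, references, time, cause and effect. Multi-head attention runs several attention computations in parallel, each with its own learned Queries, Keys, and Values, so the model can inspect the same sentence from several angles and then combine the results.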
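As a rough sketch of the idea, the example below gives each head its own slice of every vector and lets it compute its own weights. Slicing is a simplifying assumption that stands in for the learned per-head projections a real model would use, and the names `split_heads` and `multi_head_attend`, like the toy vectors, are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_weights(query, keys):
    """Scaled dot-product weights of one Query against every Key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def split_heads(vector, num_heads):
    """Cut a vector into equal slices, one per head (stand-in for learned projections)."""
    size = len(vector) // num_heads
    return [vector[i * size:(i + 1) * size] for i in range(num_heads)]

def multi_head_attend(query, keys, values, num_heads):
    """Run attention once per head and concatenate the per-head outputs."""
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    all_weights, output = [], []
    for h in range(num_heads):
        weights = head_weights(q_heads[h], [k[h] for k in k_heads])
        head_values = [v[h] for v in v_heads]
        size = len(head_values[0])
        mixed = [sum(w * v[i] for w, v in zip(weights, head_values)) for i in range(size)]
        all_weights.append(weights)
        output.extend(mixed)  # concatenate this head's output onto the result
    return all_weights, output

# Hypothetical toy data: one Query, two earlier tokens, vectors of size 4.
query = [2.0, 0.0, 0.0, 2.0]
keys = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
values = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

head_attention, combined = multi_head_attend(query, keys, values, num_heads=2)
for h, weights in enumerate(head_attention):
    print(f"head {h}:", [round(w, 3) for w in weights])
```

With these vectors the two heads disagree about which token matters: the first head favors the first token and the second head favors the second, which is the "several angles" effect in miniature.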

One-sentence takeaway

Attention lets a model do more than read in order. For each token, it dynamically decides which other tokens to draw on for support.