When a model reads a sentence, each word should not be understood in isolation. If a sentence says ‘it works well,’ the model needs earlier context to know what ‘it’ refers to. Attention performs that search.
What are attention weights?
The model compares the Query with each Key. The better they match, the higher the weight; the higher the weight, the more that Value influences the current representation.
These relationships are not hand-written rules. They are learned from large amounts of text, where the model discovers which words tend to explain, limit, or complete each other.
Why use multiple heads?
A sentence can contain many relationships at once: subject-verb links, references, time, cause and effect. Multi-head attention lets the model inspect the same sentence from several angles.
One-sentence takeaway
Attention lets a model do more than read in order. It dynamically decides where the current understanding should look for support.