Attention: what should I focus on?

What it is

When you read “the hungry cat chased the small mouse,” the word chased only makes sense if you remember cat — who did the chasing. Attention is how a machine does the same trick: for every word, it decides which earlier words to focus on.

Go deeper: each word builds a query (“what am I looking for?”) and every word offers a key (“here’s what I am”). How well a query matches each key gives a score, and the scores are rescaled so they add up to 1; those are the attention weights. This is the self-attention at the heart of every transformer, the “T” in models like GPT.
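Here is a minimal sketch of that query-and-key matching in plain Python. The two-number vectors are made up for illustration, and the function names are hypothetical, not the Studio's. A real transformer learns its vectors during training and also uses a third "value" vector, which this sketch leaves out.

```python
import math

def softmax(scores):
    """Turn raw match scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max so exp() stays numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Score one query against every key with a dot product, then normalize."""
    scale = math.sqrt(len(query))  # the usual scaling by sqrt(vector length)
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical vectors, chosen by hand so that "cat" matches the query best.
words = ["the", "cat", "chased"]
keys = [[0.1, 0.0], [1.0, 0.2], [0.0, 1.0]]
query_for_chased = [2.0, 0.0]  # stands in for "who is a thing I can chase?"

weights = attention_weights(query_for_chased, keys)
for word, w in zip(words, weights):
    print(f"{word:>7}: {w:.2f}")
```

With these made-up numbers, cat ends up with the largest weight because its key lines up best with the query from chased.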

Why care

Attention is the engine behind the AI you hear about every day: chatbots that write, tools that translate languages, and apps that summarize a long article in one line. They all work by paying attention to the right words at the right time.

The idea, intuitively

Think of each word raising its hand to ask a question. chased asks, “Who is a thing I can chase?” The earlier words answer. cat is a thing, so it gets a strong, dark connection. Words that don’t fit the question get a faint one.

One important rule in this sim, and in models like GPT that write text: a word can only look backward, at itself and the words before it, never ahead. That’s exactly how a model writes text left to right, one word at a time. In the grid below, that’s why the top-right corner is always blank.
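The look-backward rule can be sketched as a small function that builds the grid row by row. This is an illustration under stated assumptions, not the sim's actual code: the scores are made up, and the name causal_attention is hypothetical.

```python
import math

def causal_attention(scores):
    """Row i may attend only to columns 0..i; later columns get weight 0."""
    grid = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # the word itself and the words before it
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Pad with zeros for the words that come later (the blank squares).
        grid.append(weights + [0.0] * (len(row) - i - 1))
    return grid

# Made-up match scores: one row per word, one column per word.
scores = [
    [0.5, 1.0, 0.2],
    [0.1, 0.3, 2.0],
    [1.2, 0.4, 0.9],
]
grid = causal_attention(scores)
for row in grid:
    print(" ".join(f"{w:.2f}" for w in row))
```

The first word has nothing before it, so all of its attention lands on itself, and every square above the diagonal stays at zero.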

Peek at the data first

A language model’s data is text. Here the example sentence is split into words, each tagged with a simple role — those roles are the only clue the sim uses. Look at the words before you watch them pay attention.

Try it

Pick a sentence, then click any word to see what it pays attention to. Read a row of the grid left to right: darker squares mean more attention. Drag Focus to sharpen or soften how picky each word is.
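The page does not say how Focus works inside the sim. One common way to get this kind of sharpen-or-soften control is to multiply the match scores by a sharpness factor before normalizing them, so the sketch below assumes that. The function name and the focus parameter are hypothetical.

```python
import math

def attention_with_focus(scores, focus=1.0):
    """Normalize scores into weights after multiplying by a sharpness factor.

    A larger focus exaggerates the gaps between scores, so the best match
    takes more of the attention. A smaller focus spreads attention out.
    """
    scaled = [s * focus for s in scores]
    peak = max(scaled)  # subtract the max so exp() stays numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # made-up match scores for three earlier words
soft = attention_with_focus(scores, focus=0.5)
sharp = attention_with_focus(scores, focus=4.0)
print("soft: ", [round(w, 2) for w in soft])
print("sharp:", [round(w, 2) for w in sharp])
```

Under this assumption, a sharp setting puts nearly all the attention on the best-matching word, while a soft setting shares it more evenly.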

Where it shows up

Translation. To turn a sentence into another language, the model attends to the source words that matter for each new word — word order can differ between languages, so focusing on the right one is everything.
Writing & chat. To pick the next word, a chatbot attends back over everything said so far, the way you glance back at a sentence before finishing it.
Summarizing. Attention helps a model find the few important words in a long passage and ignore the filler.

Where it came from

Attention for neural networks was introduced in 2014 by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio to improve machine translation (“Neural Machine Translation by Jointly Learning to Align and Translate”). In 2017 a team at Google — Ashish Vaswani and colleagues — showed you could build a whole model out of attention alone, in the paper “Attention Is All You Need.” That design, the transformer, powers today’s large language models. Many people contributed; these are the papers usually credited.

Try it in code

The Studio has a real (tiny) transformer you can train on synthetic sentences and watch its attention grid light up:

data = load "sayings"
brain = make_model "transformer"
train_model brain, on: data, using: "text"
show_model brain
generate brain, count: 4

Check your understanding

In the grid, why is the very top-right always empty?
The word chased is an action. Which earlier word does it focus on, and why?
When you drag Focus to “sharp,” what happens to where each word looks?