
k-means clustering

What it is

Sometimes you have a pile of things and no labels at all — you just want to know which ones naturally belong together. k-means finds the groups by itself. You tell it how many groups to look for (that’s the k), and it sorts everything into that many clusters.

Go deeper: this is unsupervised learning, which means no right answers are given. Each cluster has a center (the average of its points). The algorithm repeats two steps: (1) color each point by its nearest center, (2) move each center to the average of its points. It stops when a round changes no point’s color, because then the centers have nowhere new to move. That loop is called Lloyd’s algorithm.
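
The two-step loop described above can be sketched in plain Python. This is a minimal sketch, not the Studio’s own implementation: the function names (nearest, mean, lloyd), the sample points, and the choice to start from k randomly picked data points are all assumptions made for illustration.

```python
import math
import random


def nearest(point, centers):
    """Return the index of the center closest to point (straight-line distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def lloyd(points, k, seed=0, max_rounds=100):
    """Run the two-step loop; return the final centers and each point's group."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k of the data points
    labels = None
    for _ in range(max_rounds):
        # Step 1: color each point by its nearest center.
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:
            break  # no point changed color, so the loop has settled
        labels = new_labels
        # Step 2: move each center to the average of its points.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # a center with no points stays where it is
                centers[i] = mean(members)
    return centers, labels


# Two obvious clumps: three points near (1, 1.5) and three near (8.5, 8.5).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = lloyd(points, k=2)
print(centers)
print(labels)
```

The first three points should end up sharing one label and the last three the other; which group gets called 0 and which 1 depends on the random start.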

Why care

Grouping without labels is everywhere: a store sorting shoppers into kinds of customers, a photo app bundling similar pictures, a streaming service finding “types” of listeners. k-means is one of the simplest and most widely used ways to do it.

The idea, intuitively

Drop a few flags onto the field at random. Every point joins the nearest flag’s team. Then each flag walks to the middle of its team. Some points switch teams, so the flags walk again… and again… until everyone is happy and the flags stop moving. The flags have found the groups.

Peek at the data first

Look first — but notice what’s missing: there is no group column. The model only sees two numbers per point and has to discover the groups on its own. Here are a few of the points, with a summary of each column.

Try it

Press Step to run one round — watch the centers (big dots) walk and the points re-color. Keep stepping until it settles. Try New seeds to start the centers somewhere else, or change k to look for a different number of groups.
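
Why the starting spots matter can be seen with four points at the corners of a wide rectangle. The sketch below is plain Python, and the names run_kmeans and spread, the points, and the starting centers are all made up for illustration. It runs the same loop from two different starts and measures how tight each result is.

```python
import math


def run_kmeans(points, centers, max_rounds=100):
    """Run the assign-then-move loop from the given starting centers."""
    centers = list(centers)
    labels = None
    for _ in range(max_rounds):
        # Color each point by its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # settled: no point switched groups
        labels = new_labels
        # Move each center to the average of its points.
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


def spread(points, centers, labels):
    """Total squared distance from each point to its own center (lower is tighter)."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# Corners of a rectangle that is 4 wide and 1 tall.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start A: one center on the left side, one on the right side.
centers_a, side_by_side = run_kmeans(points, [(0.0, 0.0), (4.0, 0.0)])
# Start B: both centers in the middle, one low and one high.
centers_b, top_bottom = run_kmeans(points, [(2.0, 0.0), (2.0, 1.0)])

print(side_by_side, spread(points, centers_a, side_by_side))  # [0, 0, 1, 1] 1.0
print(top_bottom, spread(points, centers_b, top_bottom))      # [0, 1, 0, 1] 16.0
```

Both results are stable, because in each one no point wants to switch groups, but the left/right grouping is far tighter. A common remedy is to run the algorithm from several starts and keep the tightest result.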

Where it shows up

Customer groups. Stores cluster shoppers by what they buy to tailor offers — no one labels the groups in advance.
Squishing colors. An image can be redrawn with just a few colors by clustering all its pixel colors and painting each pixel with its cluster’s center color. This is called color quantization, and it is one way to shrink an image file.
Organizing the unknown. Sorting documents, songs, or sensor readings into “kinds” when nobody has labelled them yet.

Where it came from

The core idea was described by Stuart Lloyd at Bell Labs in 1957 (widely circulated, formally published in 1982) and independently by Edward Forgy in 1965. The name “k-means” was coined by James MacQueen in 1967. Related grouping ideas go back to Hugo Steinhaus in 1956 — so, like many ideas, credit is shared.

Try it in code

The Studio’s clusterer is k-means; show_model reveals the centers it found:

data = load "fruits"

groups = make_model "clusterer"
train_model groups, on: data, using: ["color_score", "size", "sweetness"], groups: 3
show_model groups

say predict(groups, color_score: 9, size: 2, sweetness: 7)
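
Once the centers are known, sorting a new point into a group can be as simple as finding its nearest center. The plain-Python sketch below shows that idea. It is an assumption about what a k-means predict step usually does, not the Studio’s documented behavior, and the center values and group names are invented for illustration.

```python
import math

# Made-up centers for three fruit groups, in the order
# (color_score, size, sweetness). These are not the Studio's real output.
centers = {
    "group 1": (9.0, 2.0, 7.5),
    "group 2": (3.0, 8.0, 4.0),
    "group 3": (6.0, 5.0, 9.0),
}


def predict(centers, point):
    """Return the name of the center closest to point."""
    return min(centers, key=lambda name: math.dist(point, centers[name]))


# Same fruit as in the Studio example: color_score 9, size 2, sweetness 7.
print(predict(centers, (9, 2, 7)))  # group 1
```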


Check your understanding

Why do some points change color between steps?
How do you know when the algorithm is finished?
Why can pressing New seeds give a different final grouping for the same points?