Gradient descent

What it is

Every model that “learns” is really just trying to be less wrong. Picture its wrongness as a valley: high on the sides, low at the bottom. Gradient descent is how the model walks downhill — it feels which way the ground slopes and takes a step that way, again and again, until it reaches the bottom.

Go deeper: the curve is the loss (how wrong the model is) for each setting it could pick. The slope at a point is the gradient. Each step is new = old − speed × slope; the minus sign is what sends the step downhill, against the direction in which the slope rises. The speed is the learning rate: too small and learning crawls; too big and it overshoots and may never settle.
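The update rule can be sketched in a few lines of Python. This is a minimal illustration, not the Studio's own code: the loss (x - 3)**2, its slope 2 * (x - 3), and the names slope and descend are all assumptions, chosen so the bottom of the valley sits at x = 3.

```python
def slope(x):
    """Slope (gradient) of the assumed loss (x - 3)**2 at setting x."""
    return 2 * (x - 3)


def descend(start, speed, steps):
    """Apply new = old - speed * slope for a fixed number of steps."""
    x = start
    for _ in range(steps):
        x = x - speed * slope(x)  # step against the slope, i.e. downhill
    return x


if __name__ == "__main__":
    # Start at 0 and walk toward the bottom of the valley at x = 3.
    print(descend(start=0.0, speed=0.1, steps=50))
```

Each step shrinks the remaining distance to the bottom by the same fraction, so the walk slows down as the ground flattens out.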

Why care

This single idea trains almost all of modern AI, from the lemonade line you fit earlier all the way up to giant language models. When people say a model is “training,” this downhill walk (over millions or even billions of settings at once) is what’s happening.

The idea, intuitively

You’re standing on a foggy hill and want to reach the lowest spot. You can’t see far, but you can feel which way is downhill under your feet. So you take a step that way, feel again, step again. Big confident steps get you down fast — until they’re so big you leap over the bottom and land higher up the other side.

Peek at the data first

This sim isn’t trained on examples — its “data” is the loss (how wrong the model is) at each setting. Training just hunts for the setting with the smallest loss. Here is the loss at a few settings, so you can see the valley before you roll into it.
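To see a valley in numbers, the Python sketch below tabulates a hypothetical loss at a handful of settings. The curve (setting - 3)**2 and the names loss and loss_table are assumptions for illustration, not the sim's actual values.

```python
def loss(setting):
    """Assumed loss curve: a valley whose lowest point is at setting = 3."""
    return (setting - 3) ** 2


def loss_table(settings):
    """Pair each setting with its loss."""
    return [(s, loss(s)) for s in settings]


if __name__ == "__main__":
    for setting, value in loss_table([0, 1, 2, 3, 4, 5, 6]):
        print(f"setting {setting}: loss {value}")
```

The losses fall as the setting approaches 3 and rise again past it, which is the valley shape the ball rolls into.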

Try it

Press Step to take one downhill step, or Roll to watch it go. The short line on the ball is the slope — steep on the sides, flat at the bottom. Now drag Speed up high and roll again to see it overshoot.
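The same experiment can be tried outside the sim. The Python sketch below rolls down an assumed loss, x**2 (bottom at x = 0), with three speeds; the function name roll, the starting point, and the speed values are hypothetical choices, not the sim's settings.

```python
def roll(speed, start=4.0, steps=20):
    """Run gradient descent on the assumed loss x**2 and return every position visited."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - speed * 2 * x  # the slope of x**2 is 2 * x
        path.append(x)
    return path


if __name__ == "__main__":
    for speed in (0.1, 0.9, 1.1):
        final = roll(speed)[-1]
        print(f"speed {speed}: ends {abs(final):.4f} away from the bottom")
```

With speed 0.1 the ball creeps down one side. With 0.9 it leaps across the bottom on every step but still settles. With 1.1 each leap lands higher than the last, so it never settles at all.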

Where it shows up

Where it came from

The method of rolling downhill on a curve goes back to Augustin-Louis Cauchy in 1847, who suggested following the slope to find a minimum. The stochastic version that powers machine learning — taking noisy steps from small samples — grew from work by Herbert Robbins and Sutton Monro in 1951. Together these ideas became the engine of training that we use today.

Try it in code

When you train a network in the Studio, speed is the learning rate from this sim — and plot_training shows the loss rolling downhill:

data = load "fruits"

net = make_network
  layer input  from data
  layer hidden size 16 kind relu
  layer output size 4  kind softmax
end

train_network net, on: data, rounds: 30, speed: 0.1
plot_training net

Check your understanding