Imagine teaching a computer to recognise handwritten digits. You pass it thousands of examples, and over time, it learns to identify the different digits. How does it do this? One of the answers to this question is backpropagation. This is the method that allows neural networks to learn from their mistakes, tweaking their internal settings to improve predictions. Without backpropagation, training deep neural networks would be nearly impossible.
In this blog, I will try to explain how backpropagation works. I'll start with a basic example and then show how it scales to more complex networks. Along the way, we'll cover activation functions, vanishing gradients, and the chain rule.
Backpropagation is one of the cornerstones of neural network training. It is the algorithm that calculates how much each weight in the network contributes to the error in its predictions, and then adjusts those weights to reduce that error. Think of a chef refining a recipe: if the final dish tastes too salty, the chef reduces the salt next time, adjusting each ingredient based on feedback.
Formally, backpropagation computes the gradient of the loss function (a measure of prediction error) with respect to each weight in the network. The gradient tells us how much the error changes if we tweak a weight slightly. Using this information, we update the weights in the direction that reduces the error, typically via gradient descent. The "back" in backpropagation comes from how it works: whereas a prediction flows forwards from the input layer, backpropagation starts at the output layer and propagates the error backwards through the network, layer by layer, using the chain rule. This process is repeated many times, allowing the network to learn patterns from data and improve its accuracy.
Say we have a basic neural network: one input neuron, one hidden neuron, and one output neuron, and we're trying to predict a target value from an input x.
Network Structure
The input x is connected to the hidden neuron h by weight w1, and h is connected to the output neuron o by weight w2 (we leave out bias terms to keep the maths short).
We'll use the sigmoid function as our activation function, defined as:
sigmoid(z) = 1 / (1 + e^(−z))
Its derivative is:
sigmoid'(z) = sigmoid(z) ⋅ (1 − sigmoid(z))
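As a quick sketch of these two formulas (in Python with NumPy; the language and library are my choice for illustration, not something the maths prescribes):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)
```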
1. Forward Pass
Begin by passing an input through the network to get its output. Our network gets its output in two steps: first the hidden neuron computes h = sigmoid(w1 ⋅ x), then the output neuron computes o = sigmoid(w2 ⋅ h). The output o is compared with the target to measure the error; a common choice, and the one used in the sketches below, is the squared error, error = 1/2 ⋅ (target − o)^2.
The goal is to adjust w1 and w2 to minimise this error.
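Continuing the sketch with some arbitrary example numbers (the input, target, and initial weights below are purely illustrative):

```python
x, target = 0.5, 1.0   # input and desired output (made-up values)
w1, w2 = 0.4, 0.7      # initial weights (made-up values)

# Step 1: hidden neuron
z1 = w1 * x
h = sigmoid(z1)

# Step 2: output neuron
z2 = w2 * h
o = sigmoid(z2)

# Squared error; the 1/2 factor makes the derivative tidier
error = 0.5 * (target - o) ** 2
```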
2. Backward Pass: Calculating Gradients
To update the weights, we need the partial derivative of the error with respect to each weight.
For w2
Since w2 directly affects the output o, which in turn affects the error, we apply the chain rule (the symbol ∂, the "curly d", denotes a partial derivative):
∂error / ∂w2 = ∂error / ∂o ⋅ ∂o / ∂w2
For w1
This is trickier because w1 affects h, which affects o, which affects the error:
∂error / ∂w1 = ∂error / ∂o ⋅ ∂o / ∂h ⋅ ∂h / ∂w1
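With the squared error assumed above, ∂error/∂o = o − target, and each remaining factor is either a sigmoid derivative or whatever feeds into that neuron. Continuing the sketch:

```python
# dError/do for the squared error 0.5 * (target - o)^2
d_error_d_o = o - target

# w2: dError/dw2 = dError/do * do/dw2, where do/dw2 = sigmoid'(z2) * h
d_error_d_w2 = d_error_d_o * sigmoid_prime(z2) * h

# w1: one link longer, dError/dw1 = dError/do * do/dh * dh/dw1
d_o_d_h = sigmoid_prime(z2) * w2
d_h_d_w1 = sigmoid_prime(z1) * x
d_error_d_w1 = d_error_d_o * d_o_d_h * d_h_d_w1
```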
3. Updating Weights
Using gradient descent, we adjust the weights in the direction that reduces the error. (In practice this step is handled by an optimiser, either plain gradient descent or a variant such as Adam.)
w2new = w2 − α ⋅ ∂error / ∂w2
w1new = w1 − α ⋅ ∂error / ∂w1
Where α is the learning rate, controlling the step size.
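Putting the three stages together, a plain gradient-descent training loop for this toy network might look like the sketch below; the learning rate and the number of iterations are arbitrary choices:

```python
alpha = 0.5  # learning rate (illustrative value)

for step in range(1000):
    # Forward pass
    z1 = w1 * x
    h = sigmoid(z1)
    z2 = w2 * h
    o = sigmoid(z2)

    # Backward pass (the two chain-rule products derived above)
    d_error_d_o = o - target
    d_error_d_w2 = d_error_d_o * sigmoid_prime(z2) * h
    d_error_d_w1 = d_error_d_o * sigmoid_prime(z2) * w2 * sigmoid_prime(z1) * x

    # Gradient-descent update
    w2 -= alpha * d_error_d_w2
    w1 -= alpha * d_error_d_w1
```

After enough iterations the output o creeps towards the target and the error shrinks.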
Generalising to Multi-Layer Networks
Real neural networks have multiple layers and neurons, but the idea stays the same. Backpropagation starts at the output layer, computes gradients, and works backward to the input layer.
For a weight w^l in layer l, write z^l for the layer's weighted input (pre-activation) and a^l for its activation (output). Then:
∂loss / ∂w^l = ∂loss / ∂a^l ⋅ ∂a^l / ∂z^l ⋅ ∂z^l / ∂w^l
This recursive computation relies on the chain rule, making backpropagation efficient via matrix operations in frameworks like TensorFlow or PyTorch. For large networks, the process scales by repeating these steps across all layers and neurons.
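For a feel of what this looks like in practice, here is a minimal PyTorch sketch (the layer sizes, input, and target are arbitrary). Calling loss.backward() runs exactly this layer-by-layer chain-rule pass and stores a gradient for every parameter:

```python
import torch

x = torch.tensor([[0.5]])        # one sample, one feature (made-up)
target = torch.tensor([[1.0]])

model = torch.nn.Sequential(
    torch.nn.Linear(1, 4),       # input -> hidden
    torch.nn.Sigmoid(),
    torch.nn.Linear(4, 1),       # hidden -> output
    torch.nn.Sigmoid(),
)

loss = torch.nn.MSELoss()(model(x), target)
loss.backward()                  # backpropagation: fills p.grad for every parameter

for name, p in model.named_parameters():
    print(name, p.grad)          # one gradient per weight matrix and bias vector
```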
Activation Functions
Backpropagation requires differentiable activation functions. Common options include the sigmoid (outputs in (0, 1)), tanh (outputs in (−1, 1)), and ReLU (f(x) = max(0, x)).
ReLU is popular in deep networks; however, "dying ReLU" (neurons stuck at 0 with zero gradient) can occur. Variants like Leaky ReLU (f(x) = max(0.01x, x)) mitigate this issue.
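Sticking with the NumPy sketch, ReLU and Leaky ReLU, together with the derivatives backpropagation needs, are only a few lines (the negative slope of 0.01 is a common default, but it is a tunable choice):

```python
def relu(x):
    """ReLU: passes positive values through, zeroes out the rest."""
    return np.maximum(0.0, x)

def relu_prime(x):
    """Derivative used during backprop: 1 for x > 0, 0 otherwise."""
    return (x > 0).astype(float)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: a small slope for negative inputs keeps gradients alive."""
    return np.where(x > 0, x, slope * x)

def leaky_relu_prime(x, slope=0.01):
    return np.where(x > 0, 1.0, slope)
```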
The Vanishing Gradient Problem
In deep networks, gradients can shrink exponentially as they propagate backwards, especially with sigmoid or tanh activations. Early layers barely update, stalling learning. Solutions include ReLU-family activations, careful weight initialisation, batch normalisation, and residual (skip) connections.
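The effect is easy to see numerically: sigmoid'(z) never exceeds 0.25, so even in the best case a chain of ten sigmoid layers scales the gradient by at most 0.25^10 ≈ 10^−6. A tiny sketch, reusing sigmoid_prime from earlier:

```python
grad = 1.0
for layer in range(10):
    # z = 0 is where sigmoid'(z) is largest (0.25), i.e. the best case
    grad *= sigmoid_prime(0.0)
    print(f"after layer {layer + 1}: gradient factor = {grad:.2e}")
# After 10 sigmoid layers the factor is below 1e-6: early layers barely move.
```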
Efficiency
Computing gradients for every weight over the entire dataset is computationally intensive. Mini-batch gradient descent instead uses small subsets of the data for each update, balancing speed and accuracy. Optimisations like momentum or Adam further accelerate convergence.
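As a final sketch, here is what a mini-batch loop might look like for a one-weight model; the synthetic data, batch size of 32, and five epochs are all arbitrary choices made up for illustration:

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))   # 1000 samples, 1 feature (synthetic)
y = sigmoid(3.0 * X)             # made-up target function
w, alpha, batch_size = 0.0, 0.1, 32

for epoch in range(5):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        # Forward and backward pass on the mini-batch only
        pred = sigmoid(w * xb)
        grad = np.mean((pred - yb) * sigmoid_prime(w * xb) * xb)
        w -= alpha * grad        # gradient-descent update
```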
Backpropagation is the heartbeat of neural network training. By computing gradients and adjusting weights, it turns raw data into powerful models. It's why we can train deep networks to recognise images, translate languages, and more.
While tools automate the math, understanding backpropagation unlocks insights into network behaviour: why some designs work better, how to fix training issues, and where to innovate. Next time you train a model, picture backpropagation tirelessly refining the weights behind the scenes.