Backpropagation

Imagine teaching a computer to recognise handwritten digits. You show it thousands of examples, and over time it learns to tell the digits apart. How does it do this? A big part of the answer is backpropagation. This is the method that allows neural networks to learn from their mistakes, tweaking their internal settings to improve their predictions. Without backpropagation, training deep neural networks would be nearly impossible.

In this blog, I will try to explain how backpropagation works. I'll start with a basic example and then show how it scales to complex networks. Along the way, we will cover topics like the chain rule, activation functions, and vanishing gradients.

What is Backpropagation?

Backpropagation is the backbone of neural network training. It is an algorithm that calculates how much each weight in the network contributes to the error in its predictions and adjusts those weights to minimise that error. Think of a chef refining a recipe: if the final dish tastes too salty, the chef reduces the salt next time, adjusting each ingredient based on feedback.

Formally, backpropagation computes the gradient of the loss function, a measure of prediction error, with respect to each weight in the network. The gradient tells us how much the error changes if we tweak a weight slightly. Using this information, we update the weights in the direction that reduces the error, typically via gradient descent. The "back" in backpropagation comes from how it works: whereas a prediction flows forward from the input layer, backpropagation starts at the output layer and propagates the error backwards through the network, layer by layer, using the chain rule. This process is repeated many times, allowing the network to learn patterns from data and improve its accuracy.
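
To make that loop concrete before the proper example below, here is a tiny Python sketch with a single weight and a squared error; all the numbers and the loss function are my own illustrative choices, not a definitive recipe:

```python
# One weight, one training example, squared error -- the smallest possible
# version of "compute the gradient, then step in the direction that reduces
# the error". All numbers here are illustrative assumptions.
x, target = 2.0, 1.0   # a single training example
w = 0.0                # an arbitrary starting weight
alpha = 0.1            # learning rate

for step in range(20):
    prediction = w * x
    error = (prediction - target) ** 2
    # d(error)/dw = 2 * (prediction - target) * x, by the chain rule
    grad = 2 * (prediction - target) * x
    w -= alpha * grad  # gradient descent step

print(w)  # approaches target / x = 0.5
```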

Simple Example

Say we have a basic neural network with one input neuron, one hidden neuron, and one output neuron, and we're trying to predict a target value from an input x.

Network Structure

The input x is multiplied by a weight w1 and passed through an activation function to give the hidden value h. The hidden value is then multiplied by a weight w2 and passed through the activation again to give the output o, which we compare against the target.

We'll use the sigmoid function as our activation function, defined as:

sigmoid(z) = 1 / (1 + e^(−z))

Its derivative is:

sigmoid'(z) = sigmoid(z) ⋅ (1 − sigmoid(z))
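
These two formulas translate directly into code. Here's a quick sketch (I'm using NumPy purely out of habit; math.exp would do just as well):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # The derivative is expressed in terms of the sigmoid itself
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25, the derivative's maximum value
```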

1. Forward Pass

Begin by passing an input through the network to get its output. Our network gets its output in two steps:

h = sigmoid(w1 ⋅ x)

o = sigmoid(w2 ⋅ h)

We then compare the output to the target; a common choice of error measure is the squared error:

error = ½ ⋅ (o − target)²

The goal is to adjust w1 and w2 to minimise this error.
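
Here is the forward pass in code. The input, target, and starting weights are made-up numbers for illustration, and I'm using the squared error from above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values -- the example doesn't depend on any particular numbers
x, target = 1.5, 0.8
w1, w2 = 0.4, 0.6

# Forward pass
h = sigmoid(w1 * x)              # hidden activation
o = sigmoid(w2 * h)              # network output
error = 0.5 * (o - target) ** 2  # squared error

print(h, o, error)
```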

2. Backward Pass: Calculating Gradients

To update the weights, we need the partial derivative of the error with respect to each weight.

For w2

Since w2 directly affects o, which affects the error, we use the chain rule (the symbol ∂, the "curly d", denotes a partial derivative):

∂error / ∂w2 = ∂error / ∂o ⋅ ∂o / ∂w2

For w1

This is trickier because w1 affects h, which affects o, which affects the error:

∂error / ∂w1 = ∂error / ∂o ⋅ ∂o / ∂h ⋅ ∂h / ∂w1
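
Putting both chain-rule expressions into code, continuing with the same made-up numbers and the squared error from the forward pass (so ∂error / ∂o = o − target):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

x, target = 1.5, 0.8
w1, w2 = 0.4, 0.6

# Forward pass, keeping the pre-activations z1 and z2 for the backward pass
z1 = w1 * x
h = sigmoid(z1)
z2 = w2 * h
o = sigmoid(z2)

# Backward pass, one chain-rule factor at a time
d_error_d_o = o - target                 # from error = 1/2 * (o - target)^2
d_o_d_w2 = sigmoid_derivative(z2) * h    # o = sigmoid(w2 * h)
d_o_d_h = sigmoid_derivative(z2) * w2
d_h_d_w1 = sigmoid_derivative(z1) * x    # h = sigmoid(w1 * x)

d_error_d_w2 = d_error_d_o * d_o_d_w2
d_error_d_w1 = d_error_d_o * d_o_d_h * d_h_d_w1

print(d_error_d_w1, d_error_d_w2)
```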

3. Updating Weights

We adjust the weights in the direction that reduces the error, usually with an optimisation algorithm like gradient descent or a variant such as Adam. The basic gradient descent update is:

w2_new = w2 − α ⋅ ∂error / ∂w2

w1_new = w1 − α ⋅ ∂error / ∂w1

Where α is the learning rate, controlling the step size.
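
Wrapping the forward pass, backward pass, and update into a loop gives a complete (if tiny) training run. Again, the data point, starting weights, and learning rate are arbitrary choices for the sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

x, target = 1.5, 0.8   # illustrative data point
w1, w2 = 0.4, 0.6      # illustrative starting weights
alpha = 0.5            # learning rate

for step in range(1000):
    # Forward pass
    z1 = w1 * x
    h = sigmoid(z1)
    z2 = w2 * h
    o = sigmoid(z2)
    error = 0.5 * (o - target) ** 2

    # Backward pass (the chain-rule expressions derived above)
    d_error_d_o = o - target
    d_error_d_w2 = d_error_d_o * sigmoid_derivative(z2) * h
    d_error_d_w1 = d_error_d_o * sigmoid_derivative(z2) * w2 * sigmoid_derivative(z1) * x

    # Gradient descent update
    w2 -= alpha * d_error_d_w2
    w1 -= alpha * d_error_d_w1

print(o, error)  # the output drifts towards the target and the error shrinks
```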

Generalising to Multi-Layer Networks

Real neural networks have multiple layers and neurons, but the idea stays the same. Backpropagation starts at the output layer, computes gradients, and works backward to the input layer.

For a weight w_l in layer l, where z_l is the layer's weighted input and a_l its activation:

∂loss / ∂w_l = ∂loss / ∂a_l ⋅ ∂a_l / ∂z_l ⋅ ∂z_l / ∂w_l

This recursive computation relies on the chain rule, making backpropagation efficient via matrix operations in frameworks like TensorFlow or PyTorch. For large networks, the process scales by repeating these steps across all layers and neurons.
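
As a sketch of how this looks when vectorised, here is a minimal NumPy version for a small fully connected network with sigmoid activations; the layer sizes, random weights, input, and target are all arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Arbitrary layer sizes: 3 inputs, two hidden layers of 4, 1 output
sizes = [3, 4, 4, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(3, 1))  # one input column vector
target = np.array([[0.5]])

# Forward pass: store every activation so the backward pass can reuse them
a = x
activations = [a]
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: start at the output and walk back layer by layer
loss = 0.5 * np.sum((activations[-1] - target) ** 2)
delta = (activations[-1] - target) * activations[-1] * (1 - activations[-1])
grads = [None] * len(weights)
for l in range(len(weights) - 1, -1, -1):
    grads[l] = delta @ activations[l].T   # d(loss)/d(W_l)
    if l > 0:
        # Chain rule: push the error signal through W_l and the activation
        delta = (weights[l].T @ delta) * activations[l] * (1 - activations[l])

print(loss, [g.shape for g in grads])
```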

Practical Considerations

Activation Functions

Backpropagation requires differentiable activation functions. Common options include sigmoid, tanh, and ReLU (f(x) = max(0, x)).

ReLU is popular in deep networks; however, "dying ReLU" (neurons stuck at 0) can occur. Variants like Leaky ReLU (f(x) = max(0.01x, x)) mitigate this issue.
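
For reference, here is what ReLU and Leaky ReLU (and their derivatives, which are what backpropagation actually needs) look like in NumPy:

```python
import numpy as np

def relu(x):
    # Passes positive values through, zeroes out negatives
    return np.maximum(0.0, x)

def relu_derivative(x):
    # Gradient is 1 for positive inputs and 0 otherwise -- hence "dying ReLU"
    return (x > 0).astype(float)

def leaky_relu(x, slope=0.01):
    # Keeps a small slope for negative inputs so the gradient never hits zero
    return np.where(x > 0, x, slope * x)

def leaky_relu_derivative(x, slope=0.01):
    return np.where(x > 0, 1.0, slope)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), relu_derivative(x))
print(leaky_relu(x), leaky_relu_derivative(x))
```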

The Vanishing Gradient Problem

In deep networks, gradients can shrink exponentially as they propagate backwards, especially with sigmoid or tanh. Early layers barely update, stalling learning. Solutions include switching to ReLU-family activations, careful weight initialisation, batch normalisation, and skip connections as in residual networks.
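
You can get a feel for the problem with a back-of-the-envelope calculation. The sigmoid's derivative never exceeds 0.25, so if (purely for illustration) every layer contributed its best-case factor of 0.25, the gradient reaching the early layers would shrink like this:

```python
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) peaks at 0.25 (when z = 0),
# so even in the best case each sigmoid layer scales the gradient by <= 0.25.
factor_per_layer = 0.25

for depth in (2, 5, 10, 20):
    print(depth, factor_per_layer ** depth)
# At 20 layers the factor is roughly 9e-13 -- the early weights barely move
```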

Efficiency

Computing gradients for every weight over the full dataset is computationally intensive. Mini-batch gradient descent uses small subsets of the data, balancing speed and accuracy. Optimisations like momentum or Adam further accelerate convergence.
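
As a rough sketch of the mini-batch idea (on a toy linear model rather than a neural network, with made-up data, batch size, and learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for a toy linear model: y is roughly 3x plus a little noise
X = rng.normal(size=1000)
y = 3.0 * X + 0.1 * rng.normal(size=1000)

w = 0.0
alpha = 0.1
batch_size = 32

for epoch in range(5):
    order = rng.permutation(len(X))           # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        # Gradient of the mean squared error, estimated on this batch only
        grad = 2.0 * np.mean((w * xb - yb) * xb)
        w -= alpha * grad

print(w)  # close to 3
```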

Conclusion

Backpropagation is the heartbeat of neural network training. By computing gradients and adjusting weights, it turns raw data into powerful models. It's why we can train deep networks to recognise images, translate languages, and more.

While tools automate the math, understanding backpropagation unlocks insights into network behaviour: why some designs work better, how to fix training issues, and where to innovate. Next time you train a model, picture backpropagation tirelessly refining the weights behind the scenes.