Imagine teaching a computer to recognise handwritten digits. You pass it thousands of examples, and over time, it learns to identify the different digits. How does it do this? One of the answers to this question is backpropagation. This is the method that allows neural networks to learn from their mistakes, tweaking their internal settings to improve predictions. Without backpropagation, training deep neural networks would be nearly impossible.
In this blog, I will try to explain how backpropagation works. I'll start with a basic example and then show how it scales to more complex networks. Along the way, we'll cover activation functions, vanishing gradients, and the chain rule.
Backpropagation is one of the cornerstones of neural network training. It is the algorithm that calculates how much each weight in the network contributes to the error in its predictions, and then adjusts those weights to reduce that error. Think of a chef refining a recipe: if the final dish tastes too salty, the chef reduces the salt next time, adjusting each ingredient based on feedback.
Formally, backpropagation computes the gradient of the loss function (a measure of prediction error) with respect to each weight in the network. The gradient tells us how much the error changes if we tweak a weight slightly. Using this information, we update the weights in the direction that reduces the error, typically via gradient descent. The "back" in backpropagation comes from how it works: whereas a prediction flows forwards from the input layer, backpropagation starts at the output layer and propagates the error backwards through the network, layer by layer, using the chain rule. This process is repeated many times, allowing the network to learn patterns from data and improve its accuracy.
Say we have a basic neural network: one input neuron, one hidden neuron, and one output neuron, and we're trying to predict a target value from an input x.
Network Structure
The input x is connected to the hidden neuron h by weight w1, and h is connected to the output neuron o by weight w2 (we leave out bias terms to keep the maths short).
We'll use the sigmoid function as our activation function, defined as:
sigmoid(z) = 1 / (1 + e^(−z))
Its derivative is:
sigmoid'(z) = sigmoid(z) ⋅ (1 − sigmoid(z))
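As a quick sketch of these two formulas (in Python with NumPy; the language and library are my choice for illustration, not something the maths prescribes):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)
```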
1. Forward Pass
Begin by passing an input through the network to get its output. Our network gets its output in two steps: first the hidden neuron computes h = sigmoid(w1 ⋅ x), then the output neuron computes o = sigmoid(w2 ⋅ h). The output o is compared with the target to measure the error; a common choice, and the one used in the sketches below, is the squared error, error = 1/2 ⋅ (target − o)^2.
The goal is to adjust w1 and w2 to minimise this error.
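Continuing the sketch with some arbitrary example numbers (the input, target, and initial weights below are purely illustrative):

```python
x, target = 0.5, 1.0   # input and desired output (made-up values)
w1, w2 = 0.4, 0.7      # initial weights (made-up values)

# Step 1: hidden neuron
z1 = w1 * x
h = sigmoid(z1)

# Step 2: output neuron
z2 = w2 * h
o = sigmoid(z2)

# Squared error; the 1/2 factor makes the derivative tidier
error = 0.5 * (target - o) ** 2
```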
2. Backward Pass: Calculating Gradients
To update the weights, we need the partial derivative of the error with respect to each weight.
For w2
Since w2 directly affects the output o, which in turn affects the error, we apply the chain rule (the symbol ∂, the "curly d", denotes a partial derivative):
∂error / ∂w2 = ∂error / ∂o ⋅ ∂o / ∂w2
For w1
This is trickier because w1 affects h, which affects o, which affects the error:
∂error / ∂w1 = ∂error / ∂o ⋅ ∂o / ∂h ⋅ ∂h / ∂w1
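With the squared error assumed above, ∂error/∂o = o − target, and each remaining factor is either a sigmoid derivative or whatever feeds into that neuron. Continuing the sketch:

```python
# dError/do for the squared error 0.5 * (target - o)^2
d_error_d_o = o - target

# w2: dError/dw2 = dError/do * do/dw2, where do/dw2 = sigmoid'(z2) * h
d_error_d_w2 = d_error_d_o * sigmoid_prime(z2) * h

# w1: one link longer, dError/dw1 = dError/do * do/dh * dh/dw1
d_o_d_h = sigmoid_prime(z2) * w2
d_h_d_w1 = sigmoid_prime(z1) * x
d_error_d_w1 = d_error_d_o * d_o_d_h * d_h_d_w1
```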
3. Updating Weights
Using gradient descent, we adjust the weights in the direction that reduces the error. (In practice this step is handled by an optimiser, either plain gradient descent or a variant such as Adam.)
w2new = w2 − α ⋅ ∂error / ∂w2
w1new = w1 − α ⋅ ∂error / ∂w1
Where α is the learning rate, controlling the step size.
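Putting the three stages together, a plain gradient-descent training loop for this toy network might look like the sketch below; the learning rate and the number of iterations are arbitrary choices:

```python
alpha = 0.5  # learning rate (illustrative value)

for step in range(1000):
    # Forward pass
    z1 = w1 * x
    h = sigmoid(z1)
    z2 = w2 * h
    o = sigmoid(z2)

    # Backward pass (the two chain-rule products derived above)
    d_error_d_o = o - target
    d_error_d_w2 = d_error_d_o * sigmoid_prime(z2) * h
    d_error_d_w1 = d_error_d_o * sigmoid_prime(z2) * w2 * sigmoid_prime(z1) * x

    # Gradient-descent update
    w2 -= alpha * d_error_d_w2
    w1 -= alpha * d_error_d_w1
```

After enough iterations the output o creeps towards the target and the error shrinks.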
Generalising to Multi-Layer Networks
Real neural networks have multiple layers and neurons, but the idea stays the same. Backpropagation starts at the output layer, computes gradients, and works backward to the input layer.
For a weight w^l in layer l, write z^l for the layer's weighted input (pre-activation) and a^l for its activation (output). Then:
∂loss / ∂w^l = ∂loss / ∂a^l ⋅ ∂a^l / ∂z^l ⋅ ∂z^l / ∂w^l
This recursive computation relies on the chain rule, making backpropagation efficient via matrix operations in frameworks like TensorFlow or PyTorch. For large networks, the process scales by repeating these steps across all layers and neurons.
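For a feel of what this looks like in practice, here is a minimal PyTorch sketch (the layer sizes, input, and target are arbitrary). Calling loss.backward() runs exactly this layer-by-layer chain-rule pass and stores a gradient for every parameter:

```python
import torch

x = torch.tensor([[0.5]])        # one sample, one feature (made-up)
target = torch.tensor([[1.0]])

model = torch.nn.Sequential(
    torch.nn.Linear(1, 4),       # input -> hidden
    torch.nn.Sigmoid(),
    torch.nn.Linear(4, 1),       # hidden -> output
    torch.nn.Sigmoid(),
)

loss = torch.nn.MSELoss()(model(x), target)
loss.backward()                  # backpropagation: fills p.grad for every parameter

for name, p in model.named_parameters():
    print(name, p.grad)          # one gradient per weight matrix and bias vector
```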
Activation Functions
Backpropagation requires differentiable activation functions. Common options include the sigmoid (outputs in (0, 1)), tanh (outputs in (−1, 1)), and ReLU (f(x) = max(0, x)).
ReLU is popular in deep networks; however, "dying ReLU" (neurons stuck at 0 with zero gradient) can occur. Variants like Leaky ReLU (f(x) = max(0.01x, x)) mitigate this issue.
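Sticking with the NumPy sketch, ReLU and Leaky ReLU, together with the derivatives backpropagation needs, are only a few lines (the negative slope of 0.01 is a common default, but it is a tunable choice):

```python
def relu(x):
    """ReLU: passes positive values through, zeroes out the rest."""
    return np.maximum(0.0, x)

def relu_prime(x):
    """Derivative used during backprop: 1 for x > 0, 0 otherwise."""
    return (x > 0).astype(float)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: a small slope for negative inputs keeps gradients alive."""
    return np.where(x > 0, x, slope * x)

def leaky_relu_prime(x, slope=0.01):
    return np.where(x > 0, 1.0, slope)
```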
The Vanishing Gradient Problem
In deep networks, gradients can shrink exponentially as they propagate backwards, especially with sigmoid or tanh activations. Early layers barely update, stalling learning. Solutions include ReLU-family activations, careful weight initialisation, batch normalisation, and residual (skip) connections.
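The effect is easy to see numerically: sigmoid'(z) never exceeds 0.25, so even in the best case a chain of ten sigmoid layers scales the gradient by at most 0.25^10 ≈ 10^−6. A tiny sketch, reusing sigmoid_prime from earlier:

```python
grad = 1.0
for layer in range(10):
    # z = 0 is where sigmoid'(z) is largest (0.25), i.e. the best case
    grad *= sigmoid_prime(0.0)
    print(f"after layer {layer + 1}: gradient factor = {grad:.2e}")
# After 10 sigmoid layers the factor is below 1e-6: early layers barely move.
```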
Efficiency
Computing gradients for every weight over the entire dataset is computationally intensive. Mini-batch gradient descent instead uses small subsets of the data for each update, balancing speed and accuracy. Optimisations like momentum or Adam further accelerate convergence.
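As a final sketch, here is what a mini-batch loop might look like for a one-weight model; the synthetic data, batch size of 32, and five epochs are all arbitrary choices made up for illustration:

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))   # 1000 samples, 1 feature (synthetic)
y = sigmoid(3.0 * X)             # made-up target function
w, alpha, batch_size = 0.0, 0.1, 32

for epoch in range(5):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        # Forward and backward pass on the mini-batch only
        pred = sigmoid(w * xb)
        grad = np.mean((pred - yb) * sigmoid_prime(w * xb) * xb)
        w -= alpha * grad        # gradient-descent update
```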
Backpropagation is the heartbeat of neural network training. By computing gradients and adjusting weights, it turns raw data into powerful models. It's why we can train deep networks to recognise images, translate languages, and more.
While tools automate the math, understanding backpropagation unlocks insights into network behaviour: why some designs work better, how to fix training issues, and where to innovate. Next time you train a model, picture backpropagation tirelessly refining the weights behind the scenes.