Activation Functions

If backpropagation is the engine of a neural network, then activation functions are the spark that keeps it running. These small functions shape how neurons process information, enabling neural networks to learn everything from recognising cat photos to generating human-like responses. It's the nonlinearity they introduce that lets networks tackle difficult tasks like these in the first place. In this blog, I'm going to explain five of the most popular activation functions: Sigmoid, Tanh, ReLU, Leaky ReLU, and Swish.

Why do these functions even matter?

Activation functions decide how much a neuron "fires" based on its input. In a neural network, each neuron takes a weighted sum of its inputs, adds a bias, and passes the result through an activation function to decide its output. Without nonlinearity, the whole network would collapse into a glorified linear regression model, unable to model curves, edges, or other intricate patterns.
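
To make that concrete, here is a minimal sketch of a single neuron in Python. The input, weight, and bias values are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Illustrative values only -- not from any real network
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.7, -0.2])   # weights
b = 0.1                          # bias

z = np.dot(w, x) + b   # weighted sum of inputs plus bias (the linear part)
a = sigmoid(z)         # activation function decides how much the neuron "fires"

print(f"pre-activation z = {z:.3f}, activation a = {a:.3f}")
```

Drop the sigmoid call and it doesn't matter how many of these neurons you stack: the whole network stays one big linear map.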

How Activation Functions Work With Backpropagation

Backpropagation relies on gradients to update weights, and activation functions play a starring role here. Their derivatives determine how errors flow backwards through the network. A well-behaved activation function has a clear, computable derivative that ensures stable gradient flow.

If the derivative is too small (or vanishes), gradients can shrink, slowing learning or causing "dead" neurons. If it's too large, gradients can explode, which destabilises training. This makes choosing the right activation function critical for efficient backpropagation.
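
To see why small derivatives are a problem, here is a rough sketch using sigmoid's derivative. The ten-layer depth is just an illustrative number:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)   # never larger than 0.25 (its value at z = 0)

# Backpropagation multiplies the error signal by one activation derivative
# per layer as it flows backwards. With sigmoid, each factor is at most 0.25,
# so the gradient shrinks geometrically with depth.
gradient = 1.0
for layer in range(10):                   # illustrative 10-layer network
    gradient *= sigmoid_derivative(0.0)   # best case: 0.25 per layer

print(gradient)   # 0.25 ** 10 is roughly 9.5e-07 -- effectively vanished
```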

Sigmoid
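
The sigmoid function, f(x) = 1 / (1 + e^(-x)), squashes any input into the range (0, 1), which makes it a natural fit for outputs that represent probabilities. Its downside is saturation: for large positive or negative inputs the curve flattens out, its derivative shrinks towards zero, and gradients vanish in deep networks. Its outputs are also not zero-centred, which can slow training.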

Tanh (Hyperbolic Tangent)
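
Tanh, f(x) = (e^x - e^(-x)) / (e^x + e^(-x)), is essentially a rescaled sigmoid that outputs values between -1 and 1. Because its outputs are zero-centred, it often trains a little more smoothly than sigmoid, but it still saturates at the extremes and so still suffers from vanishing gradients in deep networks.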

ReLU (Rectified Linear Unit)
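
ReLU is beautifully simple: f(x) = max(0, x). Positive inputs pass through unchanged; everything else becomes zero. It's cheap to compute, its gradient is 1 for positive inputs (no saturation on that side), and it encourages sparse activations, which is why it became the default choice for deep networks. The catch is the "dying ReLU" problem: a neuron that only ever receives negative inputs outputs zero, gets zero gradient, and can stop learning entirely.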

Leaky ReLU
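
Leaky ReLU is a small tweak designed to fix dying neurons: f(x) = x for x > 0 and f(x) = αx otherwise, where α is a small constant such as 0.01. That gentle slope on the negative side means the neuron always receives at least a small gradient, so it can recover rather than going dead.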

Swish (A Self-Gated Function)
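
Swish, f(x) = x • sigmoid(x), multiplies the input by its own sigmoid gate, hence "self-gated". The result is a smooth, non-monotonic curve that behaves like ReLU for large positive inputs but lets small negative values through, and in practice it often matches or beats ReLU in deeper networks.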

Comparison

Plotting all five functions together highlights their different shapes: Sigmoid and Tanh curve smoothly, ReLU and Leaky ReLU are piecewise linear, and Swish bridges the gap between the two.
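
If you want to reproduce a comparison like this yourself, here is a quick sketch with NumPy and Matplotlib. The 0.1 leak slope and the input range are just illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 500)

sigmoid = 1 / (1 + np.exp(-x))
tanh = np.tanh(x)
relu = np.maximum(0, x)
leaky_relu = np.where(x > 0, x, 0.1 * x)   # 0.1 slope exaggerated for visibility
swish = x * (1 / (1 + np.exp(-x)))         # x * sigmoid(x)

for values, label in [(sigmoid, "Sigmoid"), (tanh, "Tanh"), (relu, "ReLU"),
                      (leaky_relu, "Leaky ReLU"), (swish, "Swish")]:
    plt.plot(x, values, label=label)

plt.legend()
plt.title("Activation function comparison")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
```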

What's Next for Activation Functions?

Researchers, engineers, scientists, you name it, are always looking for new and improved activation functions. Recent innovations like Mish (f(x) = x • tanh(softplus(x))) and GELU (used in transformers) aim for smoother gradients and better performance. Researchers are also looking into learnable activation functions, where the network tunes the shape of the activation itself during training.
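
For the curious, Mish is easy to sketch in the same NumPy style as above, since softplus(x) is just log(1 + e^x):

```python
import numpy as np

def softplus(x):
    # Smooth approximation of ReLU: log(1 + e^x)
    return np.log1p(np.exp(x))

def mish(x):
    # Mish: x * tanh(softplus(x))
    return x * np.tanh(softplus(x))

print(mish(np.array([-2.0, 0.0, 2.0])))
```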

Conclusion

Activation functions are a cornerstone of neural networks, turning linear math into nonlinear magic essential for modelling complex patterns. Pioneering functions like Sigmoid and Tanh laid the foundation for early neural networks, while ReLU introduced computational efficiency and accelerated training. Leaky ReLU enhanced robustness by mitigating issues like dying neurons, and Swish represents a cutting-edge advancement, offering smoother gradients for improved performance in deep architectures.