Activation Functions

If backpropagation is the engine of a neural network, then activation functions are the spark that keeps it running. These small functions shape how neurons process information, enabling neural networks to learn everything from recognising cat photos to generating human-like responses. It's the nonlinearity they introduce that lets networks tackle difficult tasks like these in the first place. In this blog, I'm going to explain five of the most popular activation functions: Sigmoid, Tanh, ReLU, Leaky ReLU, and Swish.

Why do these functions even matter?

Activation functions decide how much a neuron "fires" based on its input. In a neural network, each neuron takes a weighted sum of its inputs, adds a bias, and passes the result through an activation function to decide its output. Without nonlinearity, the whole network would collapse into a glorified linear regression model, unable to model curves, edges, or other intricate patterns.
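
To make that concrete, here is a minimal sketch of a single neuron in Python. The input, weight, and bias values are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Illustrative values only -- not from any real network
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.7, -0.2])   # weights
b = 0.1                          # bias

z = np.dot(w, x) + b   # weighted sum of inputs plus bias (the linear part)
a = sigmoid(z)         # activation function decides how much the neuron "fires"

print(f"pre-activation z = {z:.3f}, activation a = {a:.3f}")
```

Drop the sigmoid call and it doesn't matter how many of these neurons you stack: the whole network stays one big linear map.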

How Activation Functions Work With Backpropagation

Backpropagation relies on gradients to update weights, and activation functions play a starring role here. Their derivatives determine how errors flow backwards through the network. A well-behaved activation function has a clear, computable derivative that ensures stable gradient flow.

If the derivative is too small (or vanishes), gradients can shrink, slowing learning or causing "dead" neurons. If it's too large, gradients can explode, which destabilises training. This makes choosing the right activation function critical for efficient backpropagation.
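
To see why small derivatives are a problem, here is a rough sketch using sigmoid's derivative. The ten-layer depth is just an illustrative number:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)   # never larger than 0.25 (its value at z = 0)

# Backpropagation multiplies the error signal by one activation derivative
# per layer as it flows backwards. With sigmoid, each factor is at most 0.25,
# so the gradient shrinks geometrically with depth.
gradient = 1.0
for layer in range(10):                   # illustrative 10-layer network
    gradient *= sigmoid_derivative(0.0)   # best case: 0.25 per layer

print(gradient)   # 0.25 ** 10 is roughly 9.5e-07 -- effectively vanished
```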

Sigmoid
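
The sigmoid function, f(x) = 1 / (1 + e^(-x)), squashes any input into the range (0, 1), which makes it a natural fit for outputs that represent probabilities. Its downside is saturation: for large positive or negative inputs the curve flattens out, its derivative shrinks towards zero, and gradients vanish in deep networks. Its outputs are also not zero-centred, which can slow training.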

Tanh (Hyperbolic Tangent)
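
Tanh, f(x) = (e^x - e^(-x)) / (e^x + e^(-x)), is essentially a rescaled sigmoid that outputs values between -1 and 1. Because its outputs are zero-centred, it often trains a little more smoothly than sigmoid, but it still saturates at the extremes and so still suffers from vanishing gradients in deep networks.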

ReLU (Rectified Linear Unit)
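
ReLU is beautifully simple: f(x) = max(0, x). Positive inputs pass through unchanged; everything else becomes zero. It's cheap to compute, its gradient is 1 for positive inputs (no saturation on that side), and it encourages sparse activations, which is why it became the default choice for deep networks. The catch is the "dying ReLU" problem: a neuron that only ever receives negative inputs outputs zero, gets zero gradient, and can stop learning entirely.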

Leaky ReLU
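
Leaky ReLU is a small tweak designed to fix dying neurons: f(x) = x for x > 0 and f(x) = αx otherwise, where α is a small constant such as 0.01. That gentle slope on the negative side means the neuron always receives at least a small gradient, so it can recover rather than going dead.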

Swish (A Self-Gated Function)
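
Swish, f(x) = x • sigmoid(x), multiplies the input by its own sigmoid gate, hence "self-gated". The result is a smooth, non-monotonic curve that behaves like ReLU for large positive inputs but lets small negative values through, and in practice it often matches or beats ReLU in deeper networks.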

Comparison

Plotting all five functions together highlights their different shapes: Sigmoid and Tanh curve smoothly, ReLU and Leaky ReLU are piecewise linear, and Swish bridges the gap between the two.
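
If you want to reproduce a comparison like this yourself, here is a quick sketch with NumPy and Matplotlib. The 0.1 leak slope and the input range are just illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 500)

sigmoid = 1 / (1 + np.exp(-x))
tanh = np.tanh(x)
relu = np.maximum(0, x)
leaky_relu = np.where(x > 0, x, 0.1 * x)   # 0.1 slope exaggerated for visibility
swish = x * (1 / (1 + np.exp(-x)))         # x * sigmoid(x)

for values, label in [(sigmoid, "Sigmoid"), (tanh, "Tanh"), (relu, "ReLU"),
                      (leaky_relu, "Leaky ReLU"), (swish, "Swish")]:
    plt.plot(x, values, label=label)

plt.legend()
plt.title("Activation function comparison")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
```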

What's Next for Activation Functions?

Researchers, engineers, scientists, you name it, are always looking for new and improved activation functions. Recent innovations like Mish (f(x) = x • tanh(softplus(x))) and GELU (used in transformers) aim for smoother gradients and better performance. Researchers are also looking into learnable activation functions, where the network tunes the shape of the activation itself during training.
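
For the curious, Mish is easy to sketch in the same NumPy style as above, since softplus(x) is just log(1 + e^x):

```python
import numpy as np

def softplus(x):
    # Smooth approximation of ReLU: log(1 + e^x)
    return np.log1p(np.exp(x))

def mish(x):
    # Mish: x * tanh(softplus(x))
    return x * np.tanh(softplus(x))

print(mish(np.array([-2.0, 0.0, 2.0])))
```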

Conclusion

Activation functions are a cornerstone of neural networks, turning linear math into nonlinear magic essential for modelling complex patterns. Pioneering functions like Sigmoid and Tanh laid the foundation for early neural networks, while ReLU introduced computational efficiency and accelerated training. Leaky ReLU enhanced robustness by mitigating issues like dying neurons, and Swish represents a cutting-edge advancement, offering smoother gradients for improved performance in deep architectures.