Explaining CNNs

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a powerful class of deep learning models, particularly strong at processing visual data such as images. CNNs excel at tasks like image recognition, object detection, and more. Their ability to automatically learn patterns and features from raw data has made them a cornerstone of modern computer vision, driving advancements in areas such as autonomous driving, medical imaging, and facial recognition. This post explores what CNNs are, how they work, and their real-world applications. I will avoid the mathematical details behind these models here, and will likely cover them in a later post.

Brief History

The origins of Convolutional Neural Networks (CNNs) trace back to the early 1980s, when researchers began exploring ways to mimic the visual processing systems found in the human brain. One of the earliest models was the Neocognitron, developed by Kunihiko Fukushima in 1980. It introduced key concepts such as local receptive fields and hierarchical layers for pattern recognition, ideas that would later become central to CNNs. However, it lacked a learning algorithm like backpropagation, which limited its scalability and training efficiency.

In the late 1980s and early 1990s, Yann LeCun built upon these ideas to develop the first practical CNNs for real-world applications. His work led to LeNet-1 and later LeNet-5, a more refined architecture designed to recognise handwritten digits, such as those found on bank checks or ZIP codes. LeNet-5 used convolutional layers, subsampling (pooling), and fully connected layers, trained using backpropagation, a breakthrough at the time. Despite its success, the use of CNNs remained limited due to the era's restricted computational resources and lack of large, labelled datasets.

For much of the 1990s and early 2000s, CNNs and neural networks in general fell out of favour in mainstream AI research, a period sometimes referred to as the "AI winter." During this time, traditional machine learning techniques like support vector machines and decision trees dominated, as they performed better with the smaller datasets and lower computational resources available.

This changed dramatically in the early 2010s, as powerful GPUs became more accessible and vast labelled datasets, such as ImageNet, became available. The breakthrough came in 2012 with the introduction of AlexNet by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. AlexNet, a deep CNN trained on ImageNet, achieved unprecedented performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), cutting the top-5 error rate nearly in half compared to previous models. It introduced several innovations that became standard in deep learning, including the use of ReLU activations, dropout for regularisation, and GPU acceleration for training.

AlexNet's success ignited a wave of research and development in deep learning, leading to a rapid evolution of CNN architectures. In 2014, VGGNet demonstrated that deeper networks with small, uniform convolution filters (3×3) could further improve performance. In 2015, ResNet (Residual Network) introduced the concept of residual connections, solving the vanishing gradient problem and enabling the training of ultra-deep networks with over 100 layers. Other influential architectures followed, including GoogLeNet (Inception), DenseNet, and later, highly optimised models like EfficientNet.

Today, CNNs are at the heart of modern computer vision systems. They power applications ranging from medical imaging and autonomous vehicles to facial recognition and image generation. While newer architectures like Vision Transformers (ViTs) are emerging and gaining traction, CNNs remain a fundamental and widely used approach for understanding visual data.

What They Are

CNNs are specialised neural networks built to handle data with a grid-like structure, such as images, which are typically represented as a 2D grid of pixels (height × width × channels). Unlike traditional neural networks that treat inputs as flat vectors, CNNs preserve the spatial relationships in data by applying filters to local regions. This makes them highly effective for recognising patterns, like edges, textures, or objects in visual inputs.

At their essence, CNNs learn hierarchical feature representations directly from data. They start with simple features (e.g., lines or corners) and progressively build up to complex concepts (e.g., faces or cars). This adaptability, combined with their efficiency in handling high-dimensional data, makes CNNs especially useful for tasks in computer vision.

Architecture

A CNN's architecture consists of multiple layers that work together to extract features and make predictions. The main components are convolutional layers, pooling layers, and fully connected layers, often supplemented by additional techniques like activation functions and normalisation. Each layer plays a distinct role in transforming raw input into meaningful output, such as a classification label.

Convolution Layer

The Convolution layer uses small, learnable filters (or kernels) to scan across the input data, performing a convolution operation. This involves sliding each filter over the input and computing dot products between the filter weights and local regions, producing feature maps that highlight specific patterns.

This layer's ability to focus on local patterns while maintaining spatial context is what makes CNNs so effective for images.
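To make the sliding-filter idea concrete, here is a minimal NumPy sketch of the operation (technically cross-correlation, which is what deep learning frameworks implement as "convolution"). The image and kernel values below are made up purely for illustration:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel over a 2D image ('valid' mode) and compute the
    dot product at each position, producing a feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(region * kernel)
    return feature_map

# A hand-crafted vertical-edge detector applied to a tiny image
# whose right half is bright (1) and left half is dark (0).
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)
print(convolve2d(image, kernel))
# [[3. 3.]
#  [3. 3.]]
```

The strong positive responses show the filter "firing" wherever its receptive field covers the dark-to-bright edge. In a real CNN, the kernel weights are not hand-crafted like this; they are learned during training.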

Pooling Layer

Pooling layers reduce the size of feature maps, making the network more computationally efficient and less prone to overfitting. They downsample the data by summarising local regions, typically using one of two methods: max pooling, which keeps the largest value in each region, or average pooling, which keeps the mean.

Pooling also introduces robustness, allowing the network to recognise features even if they shift slightly in position.
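A short sketch of max pooling, assuming 2×2 windows with a stride of 2 (a common choice, halving each spatial dimension); the feature-map values are made up for illustration:

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Downsample a 2D feature map by taking the maximum value
    inside each size x size window."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

fm = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 3, 2],
    [2, 6, 0, 1],
], dtype=float)
print(max_pool2d(fm))
# [[4. 5.]
#  [6. 3.]]
```

Note that each output value only records that a strong activation occurred somewhere in its window, not exactly where, which is the source of the shift robustness mentioned above.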

Fully Connected Layer

After feature extraction through convolutional and pooling layers, the data is flattened into a vector and fed into fully connected layers. These layers connect every neuron to all neurons in the previous layer, integrating the extracted features for final decision making, such as classifying an image.
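The flatten-then-classify step can be sketched as follows; the feature-map shape, the three output classes, and the random weights are all hypothetical stand-ins for values a trained network would have learned:

```python
import numpy as np

def fully_connected(features, weights, bias):
    """Flatten pooled feature maps into a vector and apply a dense
    layer (every output neuron sees every input value), then softmax
    to turn the scores into class probabilities."""
    x = features.reshape(-1)                 # flatten to a 1D vector
    logits = weights @ x + bias              # one raw score per class
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
pooled = rng.random((2, 2, 4))               # hypothetical 2x2 maps, 4 channels
W = rng.standard_normal((3, pooled.size))    # 3 hypothetical output classes
b = np.zeros(3)

probs = fully_connected(pooled, W, b)
print(probs, probs.sum())                    # probabilities summing to 1
```

The predicted class is simply the index of the largest probability (e.g. `probs.argmax()`).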

How CNNs Differ from Other Neural Networks

CNNs stand out from traditional neural networks due to their focus on spatial data: each neuron in a convolutional layer connects only to a small local region of the input (local connectivity), the same filter weights are reused at every position (weight sharing), and pooling makes the learned features robust to small shifts in the input.

These properties make CNNs more efficient and suited for tasks like image processing compared to fully connected networks.
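Weight sharing is the main source of this efficiency. A back-of-the-envelope comparison makes the gap concrete; the input size and layer widths below are arbitrary illustrative choices, not from any particular model:

```python
# Hypothetical input: a 224x224 RGB image (224 * 224 * 3 = 150,528 values)
input_size = 224 * 224 * 3

# A fully connected layer with 100 neurons needs one weight per input
# value per neuron, plus one bias per neuron.
dense_params = input_size * 100 + 100

# A convolutional layer with 100 filters of size 3x3 (spanning all 3
# channels) reuses the same small filter at every position, so it only
# needs the filter weights plus one bias per filter.
conv_params = (3 * 3 * 3) * 100 + 100

print(f"dense: {dense_params:,} parameters")  # dense: 15,052,900 parameters
print(f"conv:  {conv_params:,} parameters")   # conv:  2,800 parameters
```

The convolutional layer achieves its coverage of the whole image with several thousand times fewer parameters, which also means less data is needed to train it.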

Applications of CNNs

CNNs are widely used across many domains, especially those involving visual data. Core applications include image classification, object detection, and image segmentation.

Beyond these core uses, CNNs have also been adopted in a variety of real-world scenarios, such as analysing medical scans, helping autonomous vehicles perceive roads and pedestrians, and powering facial recognition systems.

In addition to visual data, CNNs are also applied to other signals where local patterns matter, such as audio for speech recognition and one-dimensional data like text and time series.

Challenges and Future Directions

CNNs face hurdles like high data and compute demands, limited interpretability, and vulnerability to adversarial attacks. Future research is exploring lighter models, better explainability, and integration with emerging techniques like transformers.

Conclusion

Convolutional Neural Networks have redefined how machines interpret visual information. Their layered architecture, tailored to spatial data, enables them to tackle complex tasks with efficiency and precision. While challenges like data dependency and interpretability persist, ongoing innovations promise to enhance their capabilities further. As CNNs continue to evolve, they will remain at the forefront of AI, shaping a future where machines see and understand the world much like we do.

Some good resources for understanding and viewing CNNs in action include: