Automatic Image Captioning
This project implements an automatic image captioning system that generates descriptions for images by combining computer vision and natural language processing. The system extracts visual features from images and produces readable captions. The project was built with PyTorch, the Flickr8k dataset, and a CNN-to-RNN architecture.
The goal of this project is to develop an end-to-end image captioning system capable of generating accurate captions for a wide range of images.
Project Overview:
- Dataset: Flickr8k, containing ~8,000 images with ~5 captions each.
- Model: A CNN to RNN architecture with a pre-trained ResNet50 encoder and LSTM decoder.
- Training: Optimised with CrossEntropyLoss over 10 epochs, with a batch size of 32 and a learning rate of 3e-4.
- Evaluation: Measured by BLEU-4 score (a scoring sketch follows this list).
- Visualisation: A tool to overlay generated captions on images using Pillow.
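As a quick illustration of how BLEU-4 can be computed with NLTK's corpus_bleu, here is a minimal sketch; the captions below are made up and the evaluation code in this repo may differ:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each image has a list of tokenised reference captions and one hypothesis.
# These example captions are illustrative only.
references = [
    [["a", "dog", "runs", "through", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "outside"]],
]
hypotheses = [["a", "dog", "is", "running", "in", "the", "grass"]]

# BLEU-4: equal weights over 1- to 4-gram precisions; smoothing avoids
# zero scores when a higher-order n-gram never matches.
bleu4 = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {bleu4:.4f}")
```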
Data
The Flickr8k dataset has ~8,000 images with ~40,000 captions (~5 per image). The images are first resized to 224x224 pixels and normalised for ResNet50. The captions are then tokenised and numericalised using a custom Vocabulary class (frequency threshold: 5), and padded with special tokens. The vocabulary size is ~3,000 words. A current issue is that the dataset is not properly split into train, validation, and test sets, which leads to overfitting; I plan to fix this in the future.
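A rough sketch of the preprocessing described above is shown below; the Vocabulary class here is a simplified stand-in for the project's own implementation, with assumed special token names:

```python
from collections import Counter
import torchvision.transforms as T

# Resize to 224x224 and normalise with the ImageNet statistics ResNet50 expects.
image_transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

class Vocabulary:
    """Minimal word-level vocabulary with a frequency threshold of 5."""
    def __init__(self, freq_threshold=5):
        self.freq_threshold = freq_threshold
        self.itos = {0: "<pad>", 1: "<start>", 2: "<end>", 3: "<unk>"}
        self.stoi = {v: k for k, v in self.itos.items()}

    def build(self, captions):
        # Count every word across all captions; keep only frequent ones.
        counts = Counter(word for caption in captions
                         for word in caption.lower().split())
        for word, count in counts.items():
            if count >= self.freq_threshold and word not in self.stoi:
                idx = len(self.itos)
                self.itos[idx] = word
                self.stoi[word] = idx

    def numericalise(self, caption):
        # Wrap the caption in start/end tokens and map words to indices.
        tokens = caption.lower().split()
        unk = self.stoi["<unk>"]
        return ([self.stoi["<start>"]]
                + [self.stoi.get(tok, unk) for tok in tokens]
                + [self.stoi["<end>"]])
```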
Model Design:
- Encoder
- Pre-trained ResNet50 extracts visual features from images
- Removes the final fully connected layer, outputting a 2048-dimensional feature vector.
- Applies a linear layer to reduce features to 256 dimensions (embedSize), followed by batch normalisation (a code sketch of the encoder and decoder appears after this section).
- Decoder
- LSTM with 1 layer, 512 hidden units, and an embedding size of 256.
- Takes image features and previous words to predict the next word in the caption sequence.
- Uses teacher forcing during training to optimise word predictions
- Inference
- Autoregressive generation starts from the image features and produces words one at a time until an end token is generated or a maximum length of 50 is reached.
- Training vs. Inference
- Training: Uses teacher forcing, where the model is fed the ground-truth caption words as inputs and predicts the next word. For example, for the caption "A dog runs," it gets "A" and predicts "dog," then gets "dog" and predicts "runs."
- Inference: No teacher forcing; the model feeds its own previous prediction back in as the next input (see the training and inference sketch after the Limitations below).
- Limitations
- There is no attention mechanism; I plan to add one in the future.
- During inference, errors early in the sequence can compound, producing captions that diverge from the ground truth and lowering the BLEU score.
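Below is a condensed sketch of the encoder and decoder described above. Layer names, the frozen backbone, and the default vocabulary size are illustrative assumptions; the repo's actual code may differ in the details.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Pre-trained ResNet50 with the final FC layer replaced by a 256-d projection."""
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        for param in resnet.parameters():
            param.requires_grad = False           # keep the backbone frozen (assumption)
        modules = list(resnet.children())[:-1]    # drop the final fully connected layer
        self.resnet = nn.Sequential(*modules)     # outputs (batch, 2048, 1, 1)
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)
        self.bn = nn.BatchNorm1d(embed_size)

    def forward(self, images):
        features = self.resnet(images).flatten(1)   # (batch, 2048)
        return self.bn(self.fc(features))           # (batch, 256)

class DecoderRNN(nn.Module):
    """Single-layer LSTM that predicts the next word from image features + previous words."""
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=3000, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: image features act as the first "word", followed by
        # the ground-truth caption tokens (excluding the final token).
        embeddings = self.embed(captions[:, :-1])                   # (batch, T-1, 256)
        inputs = torch.cat((features.unsqueeze(1), embeddings), 1)  # (batch, T, 256)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                                     # (batch, T, vocab)
```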
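And a sketch contrasting one teacher-forced training step with greedy autoregressive inference, using the EncoderCNN and DecoderRNN classes from the previous sketch. The optimiser setup, the pad index, and the variable names are assumptions rather than the repo's exact code.

```python
import torch
import torch.nn as nn

encoder, decoder = EncoderCNN(), DecoderRNN()
criterion = nn.CrossEntropyLoss(ignore_index=0)   # assume index 0 is <pad>
optimiser = torch.optim.Adam(
    list(decoder.parameters())
    + list(encoder.fc.parameters())
    + list(encoder.bn.parameters()),
    lr=3e-4,
)

def train_step(images, captions):
    """One teacher-forced step: the ground-truth caption is the decoder input."""
    features = encoder(images)
    outputs = decoder(features, captions)                # (batch, T, vocab)
    loss = criterion(outputs.reshape(-1, outputs.size(-1)), captions.reshape(-1))
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

@torch.no_grad()
def generate(image, vocab, max_len=50):
    """Greedy decoding: feed each predicted word back in until the end token or max_len."""
    encoder.eval(); decoder.eval()
    inputs = encoder(image.unsqueeze(0)).unsqueeze(1)    # (1, 1, 256)
    states, words = None, []
    for _ in range(max_len):
        hiddens, states = decoder.lstm(inputs, states)
        predicted = decoder.fc(hiddens.squeeze(1)).argmax(1)  # most likely next word
        word = vocab.itos[predicted.item()]              # vocab: the Vocabulary sketch above
        if word == "<end>":
            break
        words.append(word)
        inputs = decoder.embed(predicted).unsqueeze(1)   # feed the prediction back in
    return " ".join(words)
```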
What I learned
- Deep Learning:
- Gained experience with encoder-decoder architectures, combining CNNs and RNNs for image captioning
- PyTorch:
- More practice designing, building, training, and testing models.
- Computer Vision and NLP:
- Understanding feature extraction with pre-trained CNNs and sequence generation with LSTMs.
- Looked into evaluation metrics like BLEU-4.
- Image Processing:
- Using Pillow for text overlay, handling font rendering, text wrapping, and dynamic layouts (see the overlay sketch after this list).
- Future:
- Enhance the model with an attention mechanism and train on larger, properly split datasets.
- Improve evaluation by fixing the BLEU calculation.
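As an example of the Pillow overlay approach, here is a minimal sketch; the default font, band layout, and function name are placeholders rather than the repo's exact code:

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def overlay_caption(image_path, caption, output_path, width_chars=40):
    """Draw a generated caption on a dark band at the bottom of the image."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()   # swap in ImageFont.truetype(...) for a nicer font
    lines = textwrap.wrap(caption, width=width_chars)

    line_height = 14                  # rough line height for the default font
    band_height = line_height * len(lines) + 10
    band_top = image.height - band_height
    draw.rectangle([0, band_top, image.width, image.height], fill=(0, 0, 0))

    y = band_top + 5
    for line in lines:
        draw.text((10, y), line, fill=(255, 255, 255), font=font)
        y += line_height

    image.save(output_path)

# Example: overlay_caption("example.jpg", "a dog runs through the grass", "captioned.jpg")
```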
The code and a full breakdown are available here, with a screenshot below.