AI Image Captioning

Automatic Image Captioning

This project implements an automatic image captioning system that generates natural-language descriptions for images by combining computer vision and natural language processing. The system extracts visual features from each image and decodes them into a readable caption. It is built with PyTorch on the Flickr8k dataset, using a CNN-to-RNN encoder-decoder architecture.

The goal of this project is to develop an end-to-end image captioning system capable of generating accurate captions for a wide range of images.

Project Overview:

Data

The Flickr8k dataset contains ~8,000 images, each paired with ~5 captions (~40,000 captions in total). Images are resized to 224x224 pixels and normalised for ResNet50. Captions are tokenised and numericalised using a custom Vocabulary class (frequency threshold: 5) and padded with special tokens, giving a vocabulary of ~3,000 words. A known issue is that the dataset is not yet properly split into train, validation, and test sets, which leads to overfitting; I plan to fix this in a future update.
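The caption preprocessing above can be sketched roughly as follows. The class and method names, and the whitespace tokeniser, are illustrative assumptions rather than the exact code in this repo:

```python
from collections import Counter

class Vocabulary:
    """Maps caption words to integer ids, keeping only frequent words."""

    def __init__(self, freq_threshold=5):
        self.freq_threshold = freq_threshold
        # Special tokens: padding, start/end of sentence, unknown word.
        self.itos = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "<unk>"}
        self.stoi = {tok: idx for idx, tok in self.itos.items()}

    @staticmethod
    def tokenise(text):
        # Simple whitespace tokeniser; a real pipeline might use spaCy or NLTK.
        return text.lower().split()

    def build(self, captions):
        # Count word frequencies across all captions and keep frequent words.
        counts = Counter(w for c in captions for w in self.tokenise(c))
        for word, freq in counts.items():
            if freq >= self.freq_threshold and word not in self.stoi:
                idx = len(self.itos)
                self.stoi[word] = idx
                self.itos[idx] = word

    def numericalise(self, caption):
        # Wrap the caption with <sos>/<eos> and map rare words to <unk>.
        ids = [self.stoi["<sos>"]]
        ids += [self.stoi.get(w, self.stoi["<unk>"])
                for w in self.tokenise(caption)]
        ids.append(self.stoi["<eos>"])
        return ids
```

Each batch is then padded with the `<pad>` id up to the length of the longest caption in the batch.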

Model Design:

What I learned

The code and a full breakdown are available here; a screenshot is shown below.