The project focuses on developing an Image Captioning Model that generates textual descriptions for images. Leveraging Deep Learning techniques, the solution integrates a Convolutional Neural Network (CNN) for feature extraction and a Long Short-Term Memory Network (LSTM) for sequence modeling, deployed on a robust FastAPI backend for real-time predictions.
The model is trained on the Flickr8k dataset, which consists of 8,000 images, each paired with five unique captions; the dataset was selected for its diversity.
A pretrained DenseNet201 extracts feature embeddings from each image.
Images are resized to 224×224, normalized, and processed in batches for efficiency.
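As a concrete illustration, a minimal preprocessing helper might look like the following sketch; the function name load_image_batch and the use of Keras' load_img utility are assumptions rather than the project's exact code.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.densenet import preprocess_input

def load_image_batch(paths, target_size=(224, 224)):
    """Resize a batch of images to 224x224 and apply DenseNet normalization."""
    batch = []
    for path in paths:
        img = load_img(path, target_size=target_size)   # resize to 224x224
        batch.append(img_to_array(img))
    return preprocess_input(np.stack(batch))            # normalize for DenseNet201
```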
Captions are tokenized using TensorFlow's Tokenizer.
Words are mapped to integer indices, converted into sequences, and padded to a uniform length for compatibility with the LSTM.
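A minimal sketch of this step, assuming the cleaned captions are available as a Python list named captions (and that each caption was wrapped with startseq/endseq sentinel tokens, a common convention for this setup):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# captions: list of cleaned caption strings, e.g. "startseq two dogs run on grass endseq"
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)

vocab_size = len(tokenizer.word_index) + 1              # +1 for the padding index 0
max_length = max(len(c.split()) for c in captions)      # longest caption in tokens

# Convert each caption into an integer sequence and pad to a uniform length
sequences = tokenizer.texts_to_sequences(captions)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")
```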
DenseNet201's penultimate layer provides rich, high-dimensional image embeddings.
DenseNet uses a densely connected architecture in which, within each dense block, every layer receives the feature maps of all preceding layers. This maximizes information flow and feature reuse while remaining computationally efficient.
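A sketch of the feature extractor under these assumptions: dropping the classifier head (include_top=False) and applying global average pooling yields the 1×1920 embedding described below; the helper name extract_features is illustrative.

```python
from tensorflow.keras.applications import DenseNet201

# Pretrained DenseNet201 without its classifier head; global average pooling
# turns the final feature maps into a single 1920-dimensional embedding per image.
feature_extractor = DenseNet201(weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_batch):
    """image_batch: preprocessed array of shape (N, 224, 224, 3) -> embeddings of shape (N, 1920)."""
    return feature_extractor.predict(image_batch, verbose=0)
```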
Sequential layers include Embedding, LSTM, and a Dense output layer.
The model employs an Additive Attention mechanism to enhance caption accuracy.
Optimized using the Adam optimizer with early stopping and learning rate reduction to prevent overfitting.
Image embeddings and tokenized caption sequences are processed in parallel: each partial caption is paired with its image's features to predict the next word, as sketched below.
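A minimal sketch of how such training pairs can be built from one image feature and one tokenized caption; the helper name build_training_pairs is an assumption, and the real project may stream these pairs from a generator instead of materializing them all at once.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def build_training_pairs(feature, sequence, max_length, vocab_size):
    """Expand one (image feature, caption sequence) pair into supervised samples:
    each partial caption predicts its next word."""
    X_img, X_seq, y = [], [], []
    for i in range(1, len(sequence)):
        in_seq = pad_sequences([sequence[:i]], maxlen=max_length, padding="post")[0]
        out_word = to_categorical([sequence[i]], num_classes=vocab_size)[0]
        X_img.append(feature)
        X_seq.append(in_seq)
        y.append(out_word)
    return np.array(X_img), np.array(X_seq), np.array(y)
```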
The model integrates multiple components for efficient image-to-text mapping:
DenseNet201 was used to generate embeddings of size 1×1920, representing the visual content in a high-dimensional space.
The decoder consists of:
An Embedding layer that maps integer sequences into dense vector representations.
A single-layer LSTM network with 256 units to capture sequential dependencies in the text data.
A Dense output layer with softmax activation to predict the next word in the sequence.
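Putting these components together, a minimal Keras sketch of the encoder-decoder with additive attention could look like the following. The layer sizes follow the description above; since the pooled DenseNet feature is a single vector, this sketch uses the projected image embedding as the attention query over the LSTM's per-timestep states, and the exact wiring in the original code may differ.

```python
from tensorflow.keras import layers, Model

embed_dim = 256
vocab_size, max_length = 8500, 34   # placeholders; take these from the tokenizer above

# Inputs: the 1x1920 DenseNet201 embedding and the (padded) partial caption
image_input = layers.Input(shape=(1920,), name="image_embedding")
caption_input = layers.Input(shape=(max_length,), name="caption_tokens")

# Project the image embedding into the decoder's feature space
img_proj = layers.Dense(embed_dim, activation="relu")(image_input)
img_query = layers.Reshape((1, embed_dim))(img_proj)              # (batch, 1, 256)

# Decoder: Embedding -> single-layer LSTM with 256 units
seq = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(caption_input)
lstm_states = layers.LSTM(256, return_sequences=True)(seq)        # (batch, T, 256)

# Additive (Bahdanau-style) attention over the LSTM states
context = layers.AdditiveAttention()([img_query, lstm_states])    # (batch, 1, 256)
context = layers.Reshape((embed_dim,))(context)

# Combine attended context with the image projection and predict the next word
x = layers.add([context, img_proj])
x = layers.Dense(embed_dim, activation="relu")(x)
output = layers.Dense(vocab_size, activation="softmax")(x)

caption_model = Model(inputs=[image_input, caption_input], outputs=output)
```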
c. Training and Optimization
Categorical crossentropy was used to evaluate how well the predicted captions match the ground truth.
The Adam optimizer was employed with a learning rate scheduler for faster convergence.
Early stopping prevented overfitting by halting training if validation performance ceased improving.
Learning rate reduction on plateau ensured steady progress by reducing the learning rate when validation loss stagnated.
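A sketch of this training configuration, assuming train_generator and val_generator are hypothetical data generators yielding ([image_embeddings, caption_tokens], next_word_one_hot) batches; the patience and factor values are illustrative rather than taken from the project.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

# Categorical crossentropy measures how well the predicted word distribution
# matches the ground-truth next word; Adam drives the optimization.
caption_model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=1e-3))

callbacks = [
    # Halt training once validation loss stops improving
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    # Reduce the learning rate when validation loss plateaus
    ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=3, min_lr=1e-6),
]

history = caption_model.fit(
    train_generator,                 # assumed generator of training batches
    validation_data=val_generator,   # assumed generator of validation batches
    epochs=50,
    callbacks=callbacks,
)
```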
The model generated high-quality, contextually relevant captions.
Generated Caption Sample:
Input: An image of two children playing with a ball.
Output: "Two young boys are playing soccer on a grassy field."
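Captions like the one above are produced by feeding the image embedding and the growing partial caption back into the model one word at a time. A minimal greedy-decoding sketch, assuming captions were wrapped with startseq/endseq sentinel tokens during preprocessing:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, image_embedding, max_length):
    """Greedy decoding: repeatedly append the most probable next word.
    image_embedding is expected to have shape (1, 1920)."""
    caption = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length, padding="post")
        probs = model.predict([image_embedding, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```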
CNNs like DenseNet excel at capturing spatial features in images.
LSTMs are ideal for generating sequential data, such as text captions.
Their combination enables the model to map complex visual data into coherent textual descriptions.
This project demonstrates a robust and scalable solution for image captioning, showcasing:
A well-integrated pipeline leveraging a CNN-LSTM architecture for efficient image-to-text translation.
Deployment on a FastAPI server, enabling real-time inference (a minimal endpoint sketch follows this list).
Flexibility to adapt the solution for different datasets or use cases, such as assistive technology for visually impaired users.
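As referenced above, a minimal FastAPI endpoint sketch; the route name /predict is an assumption, and feature_extractor, caption_model, tokenizer, max_length, and generate_caption are assumed to be loaded at startup (as in the earlier sketches).

```python
from io import BytesIO

import numpy as np
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from tensorflow.keras.applications.densenet import preprocess_input

app = FastAPI(title="Image Captioning API")

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Read the uploaded image, resize to 224x224, and normalize for DenseNet201
    image = Image.open(BytesIO(await file.read())).convert("RGB").resize((224, 224))
    batch = preprocess_input(np.expand_dims(np.asarray(image, dtype="float32"), axis=0))

    # 1x1920 embedding -> greedy-decoded caption (helpers from the earlier sketches)
    embedding = feature_extractor.predict(batch, verbose=0)
    caption = generate_caption(caption_model, tokenizer, embedding, max_length)
    return {"caption": caption}
```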
Explore advanced architectures like Vision Transformers (ViT) or BERT-based decoders for improved performance.
Incorporate multilingual support to generate captions in different languages.
Enhance caption diversity by using generative models such as GPT-based approaches.