GitHub repo: https://github.com/Pikurrot/CAP-GIA
Report: Image_Captioning_Report.pdf
Image captioning continues to push the boundaries of visual understanding, but current solutions struggle to interpret domain-specific images and contexts. In this project, we explore Transformer-based Vision Encoder-Decoder models to enhance caption generation for recipe images. Using a recipe-specific dataset, we show that these methods can learn to produce detailed food descriptions, yet still struggle to match them to the actual content of the images. The task is not trivial, since the caption is often the technical name of the recipe rather than a description of the food shown in the image.
Since image captioning requires both understanding the image and generating text, the approach followed by BLIP (Bootstrapping Language-Image Pre-training) seems to be the best fit. The model architecture adapted for image captioning is shown in the image. Our task was to adapt this model to recipe captioning by fine-tuning it with LoRA (Low-Rank Adaptation). Other models and architectures we experimented with are further explained in the report.
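The snippet below is a minimal sketch of how such a LoRA fine-tuning setup can be built with Hugging Face `transformers` and `peft`. The base checkpoint, target modules, and hyperparameters are illustrative assumptions, not the exact configuration used in the project (see the report for those details).

```python
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Public BLIP base checkpoint used here as a stand-in for the project's starting point.
checkpoint = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

# Wrap the model with low-rank adapters; only the adapter weights are trained.
lora_config = LoraConfig(
    r=16,                               # rank of the low-rank update matrices (assumed)
    lora_alpha=32,                      # scaling factor (assumed)
    target_modules=["query", "value"],  # attention projections to adapt (assumed)
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction is trainable

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,  # assumed learning rate
)

def training_step(image, caption):
    """One optimization step on a (recipe image, caption) pair."""
    inputs = processor(images=image, text=caption, return_tensors="pt")
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```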
This example illustrates the complexity of captioning a sample image from the dataset. BLIP produces a caption that comes close to the ground truth.
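For reference, a caption like the one above can be generated as follows. The checkpoint name and image path are placeholders; in practice our fine-tuned weights from Hugging Face would be loaded instead of the base model.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

checkpoint = "Salesforce/blip-image-captioning-base"  # substitute the fine-tuned checkpoint
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

image = Image.open("recipe_sample.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=40)
caption = processor.decode(generated_ids[0], skip_special_tokens=True)
print(caption)
```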
The table shows that our fine-tuning of BLIP substantially improves performance across all metrics on this dataset; however, its larger version, BLIP-2, is not as well suited for this task.
Find our fine-tuned models on Hugging Face: