Generating high-quality captions for images is a critical challenge in AI, requiring advancements at the intersection of computer vision and natural language processing. This study introduces a novel image captioning pipeline that integrates the CPTR (Full Transformer Network) architecture with Self-critical Sequence Training (SCST) for optimization. By using a two-step training process, we demonstrate substantial improvements in caption quality, as measured by the METEOR score. This work highlights the potential of combining state-of-the-art transformers with reinforcement learning techniques to address complex computer vision tasks.
Image captioning, the task of generating textual descriptions for visual content, bridges the gap between vision and language understanding. While traditional models have achieved notable success in optimizing differentiable objectives, they often fall short when it comes to non-differentiable evaluation metrics such as METEOR and BLEU. These metrics, which better reflect human evaluation criteria, remain challenging to optimize directly with standard training approaches. Addressing this limitation is essential for improving alignment between model-generated captions and human judgment.
This study presents a two-step training pipeline that leverages the strengths of:

- the CPTR architecture, a fully transformer-based encoder-decoder, for supervised baseline training, and
- Self-critical Sequence Training (SCST), a reinforcement learning method, for directly optimizing the METEOR metric.
We hypothesize that reinforcement learning can significantly enhance alignment between model-generated captions and human evaluation metrics. The contributions of this work are: (i) a two-step pipeline that first trains a CPTR-style captioner with cross-entropy and then fine-tunes it with SCST using METEOR as the reward, and (ii) an empirical evaluation on Flickr8K showing that SCST improves the test-set METEOR score over the cross-entropy baseline.
The transformer model, initially proposed by Vaswani et al. (2017), has become a cornerstone of NLP and computer vision tasks. CPTR [1], a fully transformer-based image captioning model, eliminates the need for convolutional backbones by utilizing a Vision Transformer encoder.
SCST [2], introduced by Rennie et al., revolutionized image captioning by enabling direct optimization of evaluation metrics. Unlike supervised learning, SCST trains models to generate captions that maximize rewards such as CIDEr, BLEU, or METEOR.
Our baseline model employs CPTR, which integrates:

- a Vision Transformer encoder that splits the input image into patches and encodes them as a sequence of visual features, and
- a Transformer decoder that attends to these features and generates the caption token by token.
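As a rough illustration of this encoder-decoder structure, the sketch below wires a pretrained ViT encoder to a standard Transformer decoder in PyTorch. The module names, dimensions, and the use of `ViTModel` are illustrative assumptions rather than the exact CPTR implementation.

```python
import torch
import torch.nn as nn
from transformers import ViTModel

class CaptioningTransformer(nn.Module):
    """Minimal ViT-encoder + Transformer-decoder captioner (illustrative sketch)."""

    def __init__(self, vocab_size, d_model=768, nhead=8, num_decoder_layers=4, max_len=64):
        super().__init__()
        # Patch-based image encoder (public checkpoint; hidden size 768).
        self.encoder = ViTModel.from_pretrained("google/vit-base-patch16-384")
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_decoder_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, pixel_values, caption_ids):
        # Encode image patches into a sequence of visual features.
        memory = self.encoder(pixel_values=pixel_values).last_hidden_state
        # Embed the (right-shifted) caption tokens with learned positions.
        positions = torch.arange(caption_ids.size(1), device=caption_ids.device)
        tgt = self.token_embed(caption_ids) + self.pos_embed(positions)
        # Causal mask so each position only attends to earlier tokens.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(
            caption_ids.size(1)
        ).to(caption_ids.device)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)  # per-position logits over the vocabulary
```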
In the second training phase, we apply SCST to optimize captions using the METEOR score as the reward signal. The SCST loss function is defined as:

$$L_{\text{SCST}}(\theta) = -\left(r(w^{s}) - r(\hat{w})\right)\,\log p_\theta(w^{s})$$

where:

- $w^{s}$ is a caption sampled from the model's current policy,
- $\hat{w}$ is the baseline caption obtained by greedy decoding,
- $r(\cdot)$ is the METEOR reward computed against the reference captions, and
- $p_\theta(w^{s})$ is the probability of the sampled caption under the model.
SCST encourages the model to generate captions with higher rewards than the baseline.
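A minimal sketch of one SCST training step under these definitions is shown below. `model.sample`, `model.greedy_decode`, and `meteor_reward` are hypothetical helpers standing in for the model's sampling, greedy decoding, and reward computation.

```python
import torch

def scst_step(model, pixel_values, references, meteor_reward):
    """One SCST update: reward a sampled caption relative to the greedy baseline.

    `model.sample` returns token ids and their per-token log-probabilities
    (keeping gradients); `model.greedy_decode` returns greedily decoded ids.
    `meteor_reward` is assumed to return per-example METEOR scores as a
    tensor of shape (batch,).
    """
    # Caption sampled from the model's current policy.
    sample_ids, sample_logprobs = model.sample(pixel_values)

    # Greedy-decoded baseline caption; no gradients are needed for it.
    with torch.no_grad():
        baseline_ids = model.greedy_decode(pixel_values)

    # Sentence-level rewards against the reference captions.
    advantage = meteor_reward(sample_ids, references) - meteor_reward(baseline_ids, references)

    # REINFORCE with the greedy reward as baseline:
    #   L(theta) = -(r(w^s) - r(w_hat)) * log p_theta(w^s)
    loss = -(advantage.detach() * sample_logprobs.sum(dim=-1)).mean()
    return loss
```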
The baseline training uses a teacher-forced cross-entropy objective:

$$L_{\text{XE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(w_t^{*} \mid w_{1}^{*}, \ldots, w_{t-1}^{*}, I\right)$$

where $w_{1}^{*}, \ldots, w_{T}^{*}$ is the ground-truth caption and $I$ is the input image.
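For concreteness, a minimal PyTorch sketch of this teacher-forced cross-entropy objective is given below; the tensor shapes are assumptions about how the decoder outputs are arranged.

```python
import torch.nn.functional as F

def caption_xe_loss(logits, target_ids, pad_token_id):
    """Teacher-forced cross-entropy over a ground-truth caption.

    logits:     (batch, seq_len, vocab_size) decoder outputs
    target_ids: (batch, seq_len) ground-truth caption token ids
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * seq_len, vocab_size)
        target_ids.reshape(-1),               # (batch * seq_len,)
        ignore_index=pad_token_id,            # skip padding positions
    )
```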
SCST optimizes METEOR directly, as defined in Section 4.2.
We evaluate our pipeline on the Flickr8K dataset, which contains 8,000 images, each paired with five human-written reference captions. Our implementation uses the distilbert-base-uncased tokenizer for captions and the google/vit-base-patch16-384 checkpoint for the Vision Transformer encoder.
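As a small illustration of this setup, the snippet below loads the two public checkpoints with the Hugging Face transformers library; the `ViTImageProcessor` preprocessing shown here is an assumption about how images are prepared, not necessarily our exact pipeline.

```python
from transformers import AutoTokenizer, ViTImageProcessor, ViTModel

# Caption tokenizer and image encoder/preprocessor from public checkpoints.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-384")
vit_encoder = ViTModel.from_pretrained("google/vit-base-patch16-384")

# Tokenize one caption; image preprocessing would follow the same pattern:
#   pixel_values = image_processor(images=pil_image, return_tensors="pt").pixel_values
#   features = vit_encoder(pixel_values=pixel_values).last_hidden_state
tokens = tokenizer("a dog runs through the grass", return_tensors="pt")
```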
Figure: Baseline Training and Validation Loss by Epoch
Figure: Baseline Training and Validation Accuracy by Epoch
We use a batched single-reference METEOR score to evaluate our baseline and post-SCST captions on the test set.
| Metric | Score |
|---|---|
| Single METEOR | 0.276 |
| Metric | Before SCST | After SCST |
|---|---|---|
| Single METEOR | 0.276 | 0.301 |
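The METEOR numbers above are batched averages of single-reference scores. The sketch below shows one way to compute such a score with NLTK; it assumes whitespace tokenization and the NLTK wordnet data, and is not necessarily the exact evaluation code used here.

```python
from nltk.translate.meteor_score import meteor_score
# Requires: nltk.download("wordnet")

def batched_single_reference_meteor(hypotheses, references):
    """Average single-reference METEOR over aligned (hypothesis, reference) pairs.

    Both arguments are lists of plain strings; recent NLTK versions expect
    pre-tokenized input, so we split on whitespace here.
    """
    scores = [
        meteor_score([ref.split()], hyp.split())
        for hyp, ref in zip(hypotheses, references)
    ]
    return sum(scores) / len(scores)

# Example with two caption pairs:
avg = batched_single_reference_meteor(
    ["a dog runs in the grass", "a man rides a bike"],
    ["a dog is running through the grass", "a person riding a bicycle"],
)
```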
We use beam search with a beam size of 3 to decode captions for final evaluation and generation. Hypotheses are ranked with a simple length-normalized log-probability score, defined as:

$$\text{score}(w) = \frac{1}{|w|}\sum_{t=1}^{|w|} \log p_\theta\!\left(w_t \mid w_{<t}, I\right)$$

where $|w|$ is the caption length and $I$ is the input image.
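To make the scoring concrete, the sketch below ranks beam hypotheses by this length-normalized score; the per-token log-probabilities and the example captions are purely illustrative.

```python
def beam_score(token_logprobs):
    """Length-normalized score for one beam hypothesis.

    token_logprobs: per-token log-probabilities log p(w_t | w_<t, image).
    """
    return sum(token_logprobs) / len(token_logprobs)

# Ranking three hypothetical beams (beam size 3) by normalized score.
beams = {
    "a dog runs through the grass": [-0.2, -0.4, -0.3, -0.5, -0.1, -0.2],
    "a dog in grass": [-0.3, -0.6, -0.9, -0.8],
    "dog": [-1.4],
}
best = max(beams, key=lambda caption: beam_score(beams[caption]))
```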
| METEOR | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|---|
| 0.4659 | 0.5528 | 0.4006 | 0.2714 | 0.1743 |
This study demonstrates the synergy between transformers and reinforcement learning in image captioning. By aligning model outputs with human evaluation metrics, the proposed pipeline improves caption quality, raising the test-set METEOR score from 0.276 to 0.301, and provides a foundation for further research.
[1] Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, and Jing Liu. "CPTR: Full Transformer Network for Image Captioning." CoRR, vol. abs/2101.10804, 2021. Available: https://arxiv.org/abs/2101.10804.
[2] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. "Self-critical Sequence Training for Image Captioning." CoRR, vol. abs/1612.00563, 2016. Available: http://arxiv.org/abs/1612.00563.