This work, conducted in 2021, focuses on improving performance in paragraph-level image captioning, a task that produces long, detailed textual descriptions of images rather than simpler, object-centric captions.
We enhance the previously proposed Hierarchical RNN architecture [1] by exploring alternative model configurations, including:
An end-to-end Vision Encoder-Decoder framework.
The use of ConvNeXt [2] in place of a general-purpose ViT for image embedding generation.
By leveraging various pretrained Encoder-Decoder models (with the exception of ViT + BERT-base), we outperform the original complex Regions-Hierarchical architecture on key human-annotation-aligned metrics (namely METEOR, BLEU-3, and BLEU-4) while maintaining comparable performance on BLEU-1 and BLEU-2.
These results demonstrate that effective use of cross-modal, end-to-end architectures can overcome challenges once believed to require specialized structures, thereby efficiently harnessing the potential of contemporary pretrained models.
Image Captioning aims to generate textual descriptions of image content, facilitating better understanding of visual scenes and supporting applications such as improved image search.
Traditionally, the task comprises two major components: image recognition (identifying objects, people, and scenes) and text generation (producing a coherent description informed by these identified elements).
However, a persistent obstacle in Image Captioning lies in capturing rich detail. Many systems only produce simple, single-sentence "subject + verb + object" descriptions, lacking narrative depth and contextual cohesion.
Although Dense Captioning introduced multiple localized captions per image, it fails to weave these fragments into a logically consistent narrative.
To address this limitation, Paragraph-level Image Captioning generates multi-sentence descriptions that resemble human narratives, integrating relationships among various subjects and maintaining thematic consistency.
The ability to produce such structured, coherent paragraphs remains a significant challenge. In this study, we focus on improving Paragraph-level Image Captioning through more effective architectures, ultimately demonstrating that cross-modal, end-to-end models can achieve high-quality, narrative-rich descriptions without the need for overly complex, specialized frameworks.
Krause et al. [1] used a Dense Image Captioning model pre-trained on the Visual Genome dataset as a Region Detector for image recognition, identifying "regions of interest" in images.
To prevent overfitting and improve training efficiency, the parameters of the Region Detector were kept "frozen" during the training process.
For text generation, they employed a Hierarchical RNN to decode image vectors into paragraphs, consisting of two structures: a sentence RNN, which decides how many sentences to generate and produces a topic vector for each one, and a word RNN, which generates the words of each sentence from its topic vector.
We observed that Vision Encoder Decoder models, which combine pretrained transformer-based image encoders (such as ViT, BEiT, DeiT, Swin) with pretrained language models (such as RoBERTa, GPT2, BERT, DistilBERT), are capable of generating paragraph-level text in an end-to-end manner.
Li et al. [3] demonstrate the effectiveness of using pretrained image-to-text models. Although Vision Encoder Decoder models are commonly used for Optical Character Recognition and Image Captioning tasks, we have not found other applications of Vision Encoder Decoder in Paragraph-level Image Captioning.
Therefore, we aim to verify whether "powerful pretrained language models (such as GPT2) combined with pretrained encoders (like ViT) can directly describe various thematic details of images."
In our experiments, we tried several different combinations of pretrained Encoders and Decoders (see Experiments for details) to compare their performance differences and demonstrate the reliability and significance of this approach.
Metrics
BLEU (Bilingual Evaluation Understudy) Score
BLEU is an automated evaluation metric for machine translation quality. It is calculated as the weighted geometric mean of the candidate's modified n-gram precisions, multiplied by a brevity penalty that depends on c and r, where c is the length of the candidate translation and r is the length of the reference translation.
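For reference, the standard definition from the BLEU literature, which is assumed here to be the formula the text refers to, can be written as follows. The symbols p_n (modified n-gram precision), w_n (weights, typically uniform with w_n = 1/N), and N (maximum n-gram order) are not defined in the surrounding text and are introduced only for this sketch; c and r are the candidate and reference lengths as above.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r, \\
\exp\left( 1 - \dfrac{r}{c} \right) & \text{if } c \le r.
\end{cases}
```

Under this definition, BLEU-1 through BLEU-4 would correspond to N = 1, ..., 4 with uniform weights, which is the usual convention in captioning work.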
METEOR (Metric for Evaluation of Translation with Explicit ORdering)
METEOR enhances translation evaluation by incorporating semantic matching. It is computed in two steps: first, a harmonic mean of unigram precision and recall; then the final METEOR score, which discounts this mean by a penalty for fragmented matches.
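One common formulation, following the original METEOR definition, is sketched below. It is an assumption that this matches the variant used for the scores reported here, since later METEOR versions and library implementations parameterize the weights differently. The symbols are introduced for this sketch only: P and R are unigram precision and recall, m is the number of matched unigrams, and ch is the number of chunks (maximal runs of matched unigrams that are contiguous and in the same order in both texts).

```latex
F_{\mathrm{mean}} = \frac{10\,P\,R}{R + 9\,P},
\qquad
\mathrm{Penalty} = 0.5 \left( \frac{ch}{m} \right)^{3},
\qquad
\mathrm{METEOR} = F_{\mathrm{mean}} \cdot \left( 1 - \mathrm{Penalty} \right)
```

Fewer, longer chunks mean the matched words appear in a similar order in both texts, so the penalty shrinks and the score approaches the harmonic mean.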
These metrics complement each other in translation evaluation: BLEU focuses on n-gram precision matching, while METEOR incorporates semantic factors through synonym matching and morphological variants.
The combination of both metrics provides a comprehensive assessment of translation quality, with BLEU capturing fluency and METEOR addressing adequacy aspects of the translation.
The implementation of these metrics involves preprocessing steps such as tokenization and lowercasing, followed by the calculation of various sub-components as detailed in the formulas above. For reliable evaluation, multiple reference translations are recommended when available, particularly for BLEU score calculation.
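The preprocessing and sub-components described above can be sketched in plain Python. This is a minimal illustration, not the evaluation code used for the reported results: the whitespace tokenizer, the function names, and the choice of the closest reference length for the brevity penalty are all assumptions made for the example, and no smoothing is applied.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(c, r):
    """BP = 1 if c > r, otherwise exp(1 - r / c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1.0 - r / c)


def sentence_bleu(candidate_text, reference_texts, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    candidate = tokenize(candidate_text)
    references = [tokenize(text) for text in reference_texts]
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    c = len(candidate)
    # Effective reference length: the reference closest in length to the
    # candidate, preferring the shorter one on ties.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(c, r) * math.exp(log_mean)
```

Passing several reference strings to `sentence_bleu` shows why multiple references help: clipping uses the most generous count across references, so a valid paraphrase is less likely to be penalized.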
Krause et al. [1] proposed a CNN + RPN & Hierarchical RNN model, whose performance is shown in the table below as a baseline. Our goal is not to surpass it, but rather to use it as the performance standard of the Hierarchical Decoder, to show that a pre-trained model alone (without a complicated Visual Extractor design or a special decoding structure) can also achieve the expected results. This serves as a new perspective for paragraph-level image captioning. The "Human" row represents human evaluation.
Model Comparison
From the table above, we can see:
Decoding Strategy
Comparing three beam sizes (1, 3, and 5), greedy search (beam size 1) achieved the best results; the larger beam sizes scored lower.
The repeat penalty is intended to improve fluency, but scores are usually higher without it. This is related to the repeated descriptions in the dataset (the same sentence annotation appears multiple times in the Visual Genome dataset).
As with the repeat penalty, adding a penalty weight reduces the metric scores because of the nature of the dataset.
When we tried adding n-gram repeat penalties and other related penalties, we found that the Visual Genome dataset, due to its partially machine-synthesized nature, contains sentence combinations in which paragraphs share similar structures or repeated phrases.
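To make the beam-size comparison concrete, the following sketch implements a generic beam search over a toy next-token distribution; with a beam size of 1 it reduces to greedy search. The function names, the `<eos>` marker, and the `TOY_MODEL` probabilities are invented for illustration and have no connection to the trained captioning models.

```python
import math


def beam_search(next_log_probs, beam_size, max_len, eos="<eos>"):
    """Return (sequence, log-probability) of the best hypothesis found.

    next_log_probs(prefix) must return a dict {token: log_probability}
    for the token that follows the tuple `prefix`.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        # Keep only the `beam_size` highest-scoring expansions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    return max(finished, key=lambda item: item[1])


# Hypothetical toy distribution: the locally best first token ("a")
# leads to a less probable full sequence than "the".
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.55, "cat": 0.45},
    ("the",): {"dog": 0.9, "cat": 0.1},
}


def toy_next_log_probs(prefix):
    """Look up the toy distribution; unknown prefixes end the sequence."""
    dist = TOY_MODEL.get(prefix, {"<eos>": 1.0})
    return {token: math.log(p) for token, p in dist.items()}


if __name__ == "__main__":
    for size in (1, 3):
        seq, score = beam_search(toy_next_log_probs, beam_size=size, max_len=5)
        print(size, " ".join(seq), round(math.exp(score), 2))
```

A wider beam finds sequences that the model itself scores as more likely, which does not guarantee higher BLEU or METEOR; that is consistent with greedy search scoring best in this comparison.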
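As an illustration of how a repeat penalty acts during decoding, the sketch below applies a CTRL-style repetition penalty to a list of logits. It is an assumption that the penalty used in these experiments takes this form; the function name and the plain-list representation are chosen only for the example.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of `logits` with already-generated tokens made less likely.

    logits: list of floats indexed by token id.
    generated_ids: token ids produced so far.
    penalty: values above 1.0 discourage repeats; 1.0 leaves logits unchanged.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive logit and multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


if __name__ == "__main__":
    print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0))
```

Because the reference paragraphs themselves reuse words and phrases, lowering the probability of previously generated tokens can steer the output away from the references, which would explain the lower scores when the penalty is enabled.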
For instance: A large building with bars on the windows in front of it. There is people walking in front of the building. There is a street in front of the building with many cars on it.
We see phrases like "There is" and "in front of" repeated. This issue also appears in the Flickr dataset, reflecting the fact that not all large-scale datasets are fully human-annotated.
Thus, when applying these models in real-world scenarios (rather than simply chasing benchmark scores), the model must employ certain strategies to balance between evaluation metrics and sentence fluency.
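The repetition in the example paragraph can be measured directly, and the same bookkeeping underlies no-repeat n-gram blocking, one common form of n-gram repeat penalty. The sketch below is an illustration under that assumption; the function names and the simplistic tokenization (lowercasing and dropping periods) are choices made for the example.

```python
from collections import Counter

PARAGRAPH = (
    "A large building with bars on the windows in front of it. "
    "There is people walking in front of the building. "
    "There is a street in front of the building with many cars on it."
)


def repeated_ngrams(text, n):
    """Return {n-gram: count} for every n-gram that occurs more than once."""
    tokens = text.lower().replace(".", " ").split()
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in counts.items() if count > 1}


def banned_next_tokens(tokens, n):
    """Tokens that would complete an n-gram already present in `tokens`.

    A decoder with no-repeat n-gram blocking would exclude these tokens
    at the next step.
    """
    if n < 1 or len(tokens) < n - 1:
        return set()
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


if __name__ == "__main__":
    print(repeated_ngrams(PARAGRAPH, 3))
    print(banned_next_tokens("in front of the building there is a street in front".split(), 3))
```

With trigram blocking, a decoder that has already produced "in front of" could never produce it again, even though the reference paragraph uses it three times; this is the tension between fluency-oriented penalties and metric scores described above.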
In this research, we attempt to improve previous approaches to paragraph-level image captioning.
Traditional methods often rely on a CNN-based encoder to detect individual objects in the image, using an RNN as a decoder to generate a description for each object and then stitching these descriptions together.
We introduce three key improvements:
Visual Extractor
Instead of extracting object regions and using specialized selection techniques with CNN-based models, we employ a pretrained vision encoder to obtain a global vector representation of the image.
Pre-trained Language Model
Rather than using an RNN decoder (which often requires multiple stacked layers or specialized extraction methods), we replace it with a pretrained language model.
Model Characteristics
Benefiting from the above two modifications, we simplify the model architecture and avoid designing specialized structures for specific cases. Furthermore, our experimental results show that a pretrained model can achieve performance comparable to, or even better than, previous methods.
Notably, our approach produces sentences that align more closely with human judgment, especially on more challenging evaluation metrics.
Finally, our experimental results confirm that leveraging a pretrained model can match or surpass previous methods, and on more difficult metrics, the sentences generated by our method are closer to human standards.
[1] Krause, Jonathan, et al. "A hierarchical approach for generating descriptive image paragraphs." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[2] Liu, Zhuang, et al. "A convnet for the 2020s." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[3] Li, Minghao, et al. "TrOCR: Transformer-based optical character recognition with pre-trained models." arXiv preprint arXiv:2109.10282 (2021).
[4] Xu, Chunpu, et al. "Interactive key-value memory-augmented attention for image paragraph captioning." Proceedings of the 28th international conference on computational linguistics. 2020.