Remote sensing image captioning (RSIC) has gained popularity in recent years owing to its broad range of applications. However, the task poses additional challenges because remote sensing images often have low resolution yet contain highly structured semantic content. By incorporating image labeling and segmentation, this work develops an RSIC framework built around a structured attention module that highlights important semantic components while preserving their geometric, structured shape. The UCM-captions images are upsampled to 512×512 pixels, which improves image quality and sharpens edges. The Segment Anything Model (SAM) produces better region proposals than traditional techniques, leading to higher accuracy, and its promptability yields a balanced output of masks for both large and small objects. Exploiting the spatial structure of these proposals, the decoder can more easily learn a suitable statistical model and produce a comprehensive attention map. This work also investigates the effects of several hyperparameters, including teacher forcing, the number of region proposals, and the weighting of the DSR and AVR loss terms. Overall, by combining image labeling and segmentation, this research improves remote sensing image captioning and demonstrates how effectively the structured attention module and SAM work together to raise accuracy while accounting for key hyperparameter choices.
Remote sensing image captioning (RSIC) aims to generate descriptive sentences for complex aerial imagery, a task with growing importance for applications like urban planning, environmental monitoring, and disaster management. Unlike traditional image analysis tasks such as object detection and segmentation, RSIC demands the generation of semantically rich captions that capture relationships among structured image elements. Existing methods based on encoder-decoder architectures and coarse-grained attention mechanisms often struggle to fully leverage the geometric and structured nature of semantic content in high-resolution remote sensing images. This work proposes a novel RSIC framework that integrates structured attention with the Segment Anything Model (SAM) for superior segmentation and caption generation. By enhancing spatial awareness and leveraging fine-grained attention, our approach addresses the limitations of prior methods and achieves significant performance improvements.
The proposed framework combines advanced image processing techniques with structured attention to enhance RSIC. The backbone of the model is a ResNet-50 encoder pre-trained on ImageNet, which extracts high-level image features. For segmentation, we utilize the state-of-the-art SAM network, which generates accurate region proposals. These regions are refined and fed into a structured attention module that emphasizes spatial relationships and assigns distinct attention weights to different regions. The decoder employs an LSTM-based architecture that integrates these attention weights to generate captions. Regularization terms, including Double Stochastic Regularization (DSR) and Attention Variance Regularization (AVR), are applied to ensure balanced and diverse attention across image regions. This methodology enables precise captioning while preserving the geometric integrity of the semantic content.
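A minimal PyTorch sketch of this decoding step is given below, assuming SAM-derived region features of a fixed dimension. The layer sizes, variable names, and the exact forms of the DSR and AVR penalties are illustrative assumptions rather than the authors' reference implementation: the DSR term follows the usual doubly stochastic formulation (each region's attention should sum to roughly one over the caption), while the AVR term here simply rewards variance of the attention weights across regions.

```python
# Illustrative sketch only: dimensions, names, and the DSR/AVR forms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StructuredAttentionDecoder(nn.Module):
    """LSTM decoder that attends over SAM-derived region features."""

    def __init__(self, region_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Attention MLP: scores each region feature against the decoder state.
        self.att_region = nn.Linear(region_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + region_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def attend(self, regions, h):
        # regions: (batch, num_regions, region_dim), h: (batch, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_region(regions) + self.att_hidden(h).unsqueeze(1)))
        alpha = F.softmax(scores.squeeze(-1), dim=1)           # (batch, num_regions)
        context = (alpha.unsqueeze(-1) * regions).sum(dim=1)   # (batch, region_dim)
        return context, alpha

    def forward(self, regions, captions):
        batch, steps = captions.shape
        h = regions.new_zeros(batch, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits, alphas = [], []
        for t in range(steps):
            context, alpha = self.attend(regions, h)
            h, c = self.lstm(torch.cat([self.embed(captions[:, t]), context], dim=1), (h, c))
            logits.append(self.output(h))
            alphas.append(alpha)
        return torch.stack(logits, dim=1), torch.stack(alphas, dim=1)


def regularized_loss(logits, targets, alphas, lambda_dsr=1.0, lambda_avr=1.0):
    # Cross-entropy over all decoding steps.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    # DSR: encourage each region's attention to sum to ~1 over the caption.
    dsr = ((1.0 - alphas.sum(dim=1)) ** 2).mean()
    # AVR (assumed form): reward spread of attention across regions at each step,
    # pushing the model toward distinct weights for distinct regions.
    avr = -alphas.var(dim=2).mean()
    return ce + lambda_dsr * dsr + lambda_avr * avr
```

The returned attention maps would feed the regularized loss during training, with `lambda_dsr` and `lambda_avr` playing the role of the DSR and AVR loss factors examined in the hyperparameter study.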
The experiments are conducted on the UCM-captions dataset, comprising 2,100 aerial images with five descriptive sentences per image. To evaluate performance, we preprocess the dataset by upsampling images to 512×512 pixels and applying data augmentation techniques. Segmentation masks are generated using SAM, and regions with insufficient detail are filtered out. The model is trained using a combination of cross-entropy loss and regularization terms, with hyperparameter optimization for teacher forcing and the number of region proposals. Warm-up training phases are employed to improve the diversity of generated captions. Evaluation metrics include accuracy and BLEU scores, providing a comprehensive measure of the model's ability to generate meaningful and accurate descriptions.
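The upsampling and mask-generation steps can be sketched as follows using the public segment-anything API; the checkpoint path, interpolation choice, and the area threshold used to drop low-detail masks are assumptions made for illustration.

```python
# Preprocessing sketch: upsample a UCM image to 512x512 and keep sufficiently
# detailed SAM masks. Paths and thresholds are placeholders, not the authors' values.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load SAM once (ViT-H variant; checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)


def prepare_regions(image_path, min_area=256):
    """Upsample an image and return SAM masks large enough to be useful."""
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # UCM images are 256x256; upsample to 512x512 to sharpen edges and detail.
    image = cv2.resize(image, (512, 512), interpolation=cv2.INTER_CUBIC)
    # Each mask dict contains 'segmentation', 'area', 'bbox', and quality scores.
    masks = mask_generator.generate(image)
    # Filter out regions with insufficient detail (threshold is illustrative).
    return image, [m for m in masks if m["area"] >= min_area]
```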
The proposed model demonstrates substantial improvements over baseline methods in terms of segmentation accuracy and captioning quality. Using SAM for region proposals enhances the precision of segmentation, particularly for small and irregularly shaped objects, compared to traditional selective search methods. Experiments show that incorporating structured attention leads to fine-grained attention maps and more accurate sentence descriptions. Regularization terms further improve performance by balancing attention across image regions. The model achieves BLEU-4 scores of up to 69.0%, surpassing benchmarks and highlighting the effectiveness of upscaling image resolution and using advanced segmentation techniques. Results also emphasize the significance of warm-up training for increasing linguistic diversity.
This research presents a high-resolution RSIC framework that effectively integrates structured attention and SAM-based segmentation for improved caption generation. By addressing the limitations of coarse-grained attention mechanisms, the proposed method achieves superior accuracy in both segmentation and descriptive sentence generation. The use of SAM enables more detailed and balanced region proposals, while structured attention fully exploits the geometric relationships within images. Experiments on the UCM-captions dataset confirm the efficacy of the framework, achieving state-of-the-art BLEU scores. This approach not only advances the field of RSIC but also provides a foundation for future work in high-resolution image understanding and captioning.