Spatial-Semantic Attention For Grounded Image Captioning

Abstract

Grounded image captioning models usually process high-dimensional vectors from the feature extractor to generate descriptions. However, these vectors alone do not provide adequate information; the model needs more explicit cues for grounded image captioning. Besides high-dimensional vectors, the feature extractor also predicts the locations and categories of objects, which carry low-level spatial information and high-level semantic information. To this end, we propose a new attention module called Spatial-Semantic (SS) Attention, which utilizes the predictions from the backbone network to help the model attend to the correct objects. Specifically, the SS attention module collects the proposal positions and class probabilities from the feature extractor as spatial and semantic information to assist attention weighting. In addition, we propose a grounding loss to supervise the SS attention. Our method achieves strong performance on captioning and grounding metrics and outperforms several powerful previous models on the Flickr30k Entities dataset.
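To make the mechanism concrete, below is a minimal PyTorch sketch of how spatial and semantic cues from a detector could be fused into attention weighting. The module name, the projection-and-sum fusion, and all dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSemanticAttention(nn.Module):
    """Sketch of an SS-style attention module (assumed design).

    Besides its appearance feature, each proposal contributes a
    normalized bounding box (spatial cue) and a class-probability
    vector (semantic cue); all three are fused with the decoder
    state to produce attention logits over proposals.
    """

    def __init__(self, feat_dim, num_classes, hidden_dim=512):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)     # appearance features
        self.proj_box = nn.Linear(4, hidden_dim)             # (x1, y1, x2, y2), normalized
        self.proj_cls = nn.Linear(num_classes, hidden_dim)   # detector class probabilities
        self.proj_query = nn.Linear(hidden_dim, hidden_dim)  # decoder hidden state
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, boxes, cls_probs, query):
        # feats: (B, N, feat_dim), boxes: (B, N, 4),
        # cls_probs: (B, N, num_classes), query: (B, hidden_dim)
        fused = (self.proj_feat(feats)
                 + self.proj_box(boxes)
                 + self.proj_cls(cls_probs)
                 + self.proj_query(query).unsqueeze(1))
        logits = self.score(torch.tanh(fused)).squeeze(-1)   # (B, N)
        weights = F.softmax(logits, dim=-1)                  # attention over proposals
        context = torch.bmm(weights.unsqueeze(1), feats).squeeze(1)
        # The weights are what a grounding loss could supervise,
        # e.g. against annotated region-word alignments.
        return context, weights
```

In this sketch the additive fusion is one plausible choice; concatenation followed by a projection would serve the same purpose of letting box geometry and class probabilities influence where the decoder attends.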

Publication
In International Conference on Image Processing (ICIP), 2022