Xi Tian, Xiaobao Yang, Sugang Ma, Bohui Song, Ziqing He
Transformer-based image captioning models have been widely adopted in recent years, but most existing attention mechanisms are designed to capture only spatial dependencies, which is insufficient for image captioning: performance also depends heavily on the categories and attributes of objects. Moreover, during decoding, textual and visual information are typically combined by simple concatenation, so the two modalities are never fully fused and the visual information is underutilized, which limits the representation capability of the model. To remedy these limitations, we propose a Dual-branch Spatial and Channel Joint Attention for image captioning, which captures both spatial and channel information to strengthen the model's representations. In addition, a Cross Pre-Fusion module in the decoder explores the deep relationship between textual and visual information to improve the quality of the generated sentences. The complete model is abbreviated as DSCJA-captioner. Extensive experiments on the MS COCO dataset validate the effectiveness of our method; compared with state-of-the-art models, our model is competitive.
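To make the joint-attention idea concrete, below is a minimal PyTorch sketch of a dual-branch spatial-and-channel attention. It is an illustration under stated assumptions, not the paper's implementation: we assume the spatial branch is standard multi-head self-attention over region tokens, the channel branch is single-head attention computed over the feature channels, and the two branches are merged with a learned sigmoid gate. The class name DualBranchAttention and all shapes and hyperparameters here are hypothetical.

```python
import torch
import torch.nn as nn


class DualBranchAttention(nn.Module):
    """Hypothetical sketch of a spatial-and-channel joint attention.

    Assumptions (not taken from the paper): the spatial branch is
    token-wise multi-head self-attention, the channel branch attends
    over the d feature channels, and a sigmoid gate fuses the branches.
    """

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Channel-branch projections; after transposing, the "sequence"
        # axis is the channel dimension, so attention weights are (d, d).
        self.channel_q = nn.Linear(d_model, d_model)
        self.channel_k = nn.Linear(d_model, d_model)
        self.channel_v = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d) grid/region features from a visual backbone.
        spatial_out, _ = self.spatial(x, x, x)            # (B, N, d)

        # Channel attention: similarity between channels, (B, d, d),
        # then re-weight the channels of the value projection.
        q = self.channel_q(x).transpose(1, 2)             # (B, d, N)
        k = self.channel_k(x).transpose(1, 2)             # (B, d, N)
        v = self.channel_v(x).transpose(1, 2)             # (B, d, N)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        channel_out = (attn @ v).transpose(1, 2)          # (B, N, d)

        # Gated fusion decides, per position, how much each branch contributes.
        g = torch.sigmoid(self.gate(torch.cat([spatial_out, channel_out], dim=-1)))
        return g * spatial_out + (1 - g) * channel_out


if __name__ == "__main__":
    feats = torch.randn(2, 49, 512)    # e.g. 7x7 grid features, d = 512
    out = DualBranchAttention(512)(feats)
    print(out.shape)                   # torch.Size([2, 49, 512])
```

A similar gated interaction between the text embedding and the attended visual features could stand in for the Cross Pre-Fusion step in the decoder, replacing the simple concatenation the abstract criticizes; the paper's actual fusion scheme may differ.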