Most recent research in image captioning adopts an attention mechanism within the encoder-decoder framework, where the attention module aligns input features for the decoder and consequently boosts performance. A common defect of traditional attention methods is that they ignore the inequality among different types of inputs, leaving certain informative features under-exploited. In this paper, we propose a novel cascade attention module that processes different types of inputs sequentially. The cascade attention module lets inputs of higher priority affect the attention over other inputs, thereby emphasizing this inequality. We implement our model by introducing the global feature of the image into the captioning process of R-CNN-based frameworks; this feature is rich in context information but has little effect under traditional attention modules. Experimental results demonstrate that our proposed method effectively exploits features of different types, achieving improvements on multiple automatic metrics.
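The cascade idea described above — a higher-priority input (the global image feature) shaping the attention computed over other inputs (R-CNN region features) — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes simple dot-product attention, a single decoder hidden state, and additive fusion of the global feature into the attention query; all function names and shapes are hypothetical.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def cascade_attention(hidden, global_feat, region_feats):
    """Hypothetical two-stage cascade (a sketch, not the paper's exact model).

    hidden:       (d,)   decoder hidden state
    global_feat:  (d,)   higher-priority global image feature
    region_feats: (n, d) R-CNN region features
    """
    # Stage 1: the higher-priority global feature modulates the query,
    # so global context influences how regions are attended.
    query = hidden + global_feat
    # Stage 2: attend region features with the modulated query.
    scores = region_feats @ query        # (n,)
    weights = softmax(scores)            # attention distribution over regions
    context = weights @ region_feats     # (d,) weighted sum of regions
    return context, weights
```

In a plain (non-cascaded) attention module, `query` would be `hidden` alone; the cascade makes the global feature a first-class input that biases the second attention stage rather than being averaged in as just another key.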
Qiujuan Tong, Chan He, Jiaqi Li, Yifan Li