WANG Ming-zhan, JI Jun-zhong, JIA Ao-zhe, ZHANG Xiao-dan
In recent years, the encoder-decoder framework based on the self-attention mechanism has become the mainstream model for image captioning. However, self-attention in the encoder models visual relations only among low-scale features and ignores effective information in high-scale visual features, which degrades the quality of the generated descriptions. To address this problem, this paper proposes a cross-scale feature fusion self-attention (CFFSA) method for image captioning. Specifically, CFFSA integrates low-scale and high-scale visual features within self-attention to widen the attention range from a visual perspective, which increases the effective visual information and reduces noise, thereby learning more accurate visual and semantic relationships. Experiments on the MS COCO dataset show that the proposed method captures relationships between cross-scale visual features more accurately and generates more accurate descriptions. In addition, CFFSA is a general method that can further improve performance when combined with other self-attention-based image captioning models.
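The core idea of fusing two feature scales inside self-attention can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual architecture: the function and variable names are hypothetical, and the specific fusion choice (queries from low-scale features, keys/values from the concatenation of both scales) is one plausible way to widen the attention range across scales.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_attention(low, high, d, seed=0):
    """Hypothetical sketch of cross-scale feature fusion self-attention.

    low:  (n_low, d)  low-scale visual features (e.g., region features)
    high: (n_high, d) high-scale visual features (e.g., grid/global features)
    Queries come from the low-scale features; keys and values come from the
    concatenation of both scales, so each low-scale position can attend to
    visual information at either scale.
    """
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    fused = np.concatenate([low, high], axis=0)        # (n_low + n_high, d)
    Q, K, V = low @ Wq, fused @ Wk, fused @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))               # (n_low, n_low + n_high)
    return attn @ V                                    # (n_low, d)
```

The output keeps the low-scale sequence length, so the module can be dropped into an encoder layer in place of plain self-attention, as the abstract suggests for combining CFFSA with existing self-attention-based captioning models.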