Qingqing Lu, Xiaomei Zhang, Xin Kang, Fuji Ren
Generating a natural language description of an image is a challenging but meaningful task. The task combines two major fields of artificial intelligence: computer vision and natural language processing. It is valuable for many applications, such as image search and helping visually impaired people perceive the world. Most approaches adopt an encoder-decoder framework, and many subsequent methods build on it. In these methods, image features are extracted by a VGG network or another backbone, but the feature maps lose important information during processing. In this paper, we fuse the image features extracted by two networks, VGG19 and ResNet50, and feed the fused features into the neural network for training. We also add an attention mechanism to the basic neural encoder-decoder model for generating natural sentence descriptions: at each time step, our model attends to the image features and selects the most meaningful parts to generate the caption. We evaluate our model on the benchmark IAPR TC-12 dataset and, in comparison with other methods, show that it achieves state-of-the-art performance.
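The fusion-plus-attention design described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the torchvision backbones, the 1x1 projection layers, the 7x7 region grid, and the additive (Bahdanau-style) attention form are all choices made here for illustration.

```python
# Sketch: fuse VGG19 and ResNet50 feature maps, then attend over the fused
# regions at each decoding step. Layer choices and dimensions are assumptions.
import torch
import torch.nn as nn
from torchvision import models  # torchvision >= 0.13 "weights" API assumed


class FusedEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # VGG19 conv features: (B, 512, 7, 7) for a 224x224 input
        self.vgg = models.vgg19(weights=None).features
        # ResNet50 up to the last conv block: (B, 2048, 7, 7)
        resnet = models.resnet50(weights=None)
        self.resnet = nn.Sequential(*list(resnet.children())[:-2])
        # Project both maps to a common channel size and spatial grid
        self.proj_vgg = nn.Conv2d(512, dim, kernel_size=1)
        self.proj_res = nn.Conv2d(2048, dim, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d((7, 7))

    def forward(self, images):
        v = self.pool(self.proj_vgg(self.vgg(images)))      # (B, dim, 7, 7)
        r = self.pool(self.proj_res(self.resnet(images)))   # (B, dim, 7, 7)
        fused = torch.cat([v, r], dim=1)                     # (B, 2*dim, 7, 7)
        # Flatten to a set of region vectors the decoder can attend over
        return fused.flatten(2).transpose(1, 2)              # (B, 49, 2*dim)


class SoftAttention(nn.Module):
    """Additive attention over the fused region features (illustrative form)."""

    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_att = nn.Linear(feat_dim, attn_dim)
        self.hid_att = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, 49, feat_dim); hidden: decoder state (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_att(feats)
                                  + self.hid_att(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)           # weights over image regions
        context = (alpha * feats).sum(dim=1)      # attended image feature
        return context, alpha.squeeze(-1)


if __name__ == "__main__":
    enc = FusedEncoder(dim=512)
    attn = SoftAttention(feat_dim=1024, hidden_dim=512)
    regions = enc(torch.randn(2, 3, 224, 224))            # (2, 49, 1024)
    ctx, weights = attn(regions, torch.randn(2, 512))      # (2, 1024), (2, 49)
    print(regions.shape, ctx.shape, weights.shape)
```

In this sketch the context vector would be fed, together with the previous word embedding, into an LSTM-style decoder at each time step; that decoder is omitted here for brevity.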