Abstract

Generating a natural language description of an image is a challenging but meaningful task. It combines two major fields of artificial intelligence: computer vision and natural language processing, and it is valuable for many applications, such as image search and helping visually impaired people perceive the world. Most approaches adopt an encoder-decoder framework, and many later methods build on this framework. In these methods, image features are extracted by VGG or other networks, but the feature map loses important information during processing. In this paper, we fuse two kinds of image features extracted by VGG19 and ResNet50 and feed them into a neural network for training. We also add an attention mechanism to the basic neural encoder-decoder model for generating natural sentence descriptions: at each time step, the model attends to the image features and picks out the most meaningful parts to generate captions. We evaluate our model on the IAPR TC-12 benchmark dataset and, in comparison with other methods, show that it achieves state-of-the-art performance.
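The fusion-plus-attention idea in the abstract can be sketched with NumPy: region features from two backbones are projected into a shared space, concatenated, and scored against a decoder hidden state by soft attention. The feature shapes, projection dimension, and random inputs below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed region-feature shapes: VGG19 conv5 gives 14x14 = 196 regions of
# 512 channels; ResNet50 gives 7x7 = 49 regions of 2048 channels.
vgg_feats = rng.standard_normal((196, 512))
res_feats = rng.standard_normal((49, 2048))

d = 256  # shared embedding dimension (an assumption for this sketch)
W_vgg = rng.standard_normal((512, d)) * 0.01
W_res = rng.standard_normal((2048, d)) * 0.01

# Fuse: project each backbone's regions into the shared space, then
# concatenate the two region sets along the region axis.
fused = np.concatenate([vgg_feats @ W_vgg, res_feats @ W_res], axis=0)  # (245, d)

def soft_attention(features, hidden):
    """Score each region against the decoder hidden state and return
    the attention-weighted context vector plus the weights."""
    scores = features @ hidden                 # one scalar score per region
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ features, weights        # context: (d,), weights: (regions,)

hidden = rng.standard_normal(d)               # stand-in decoder state at one time step
context, weights = soft_attention(fused, hidden)
print(context.shape, weights.shape)
```

At each decoding step the context vector would be fed to the caption decoder alongside the previous word, so the model can emphasize different image regions for different words.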

Keywords:
Closed captioning, Computer science, Artificial intelligence, Feature (linguistics), Image fusion, Image (mathematics), Computer vision, Pattern recognition (psychology), Fusion, Linguistics

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 19
Citation Normalized Percentile: 0.38

Topics

Multimodal Machine Learning Applications
Advanced Image and Video Retrieval Techniques
Image Retrieval and Classification Techniques
All under: Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
© 2026 ScienceGate Book Chapters — All rights reserved.