Abstract

Generating a natural language description of an image is a challenging but meaningful task. It combines two major fields of artificial intelligence: computer vision and natural language processing, and it is valuable for many applications, such as image search and helping visually impaired people perceive the world. Most approaches adopt an encoder-decoder framework, and many later methods build on this framework. In these methods, image features are extracted by VGG or other networks, but the feature map loses important information during processing. In this paper, we fuse two kinds of image features extracted by VGG19 and ResNet50 and feed them into a neural network for training. We also add an attention mechanism to the basic neural encoder-decoder model for generating natural sentence descriptions: at each time step, the model attends to the image features and picks out the most meaningful parts to generate captions. We evaluate our model on the IAPR TC-12 benchmark dataset and, in comparison with other methods, show that it achieves state-of-the-art performance.
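The fusion-plus-attention idea in the abstract can be sketched with NumPy: region features from two backbones are projected into a shared space, concatenated, and scored against a decoder hidden state by soft attention. The feature shapes, projection dimension, and random inputs below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed region-feature shapes: VGG19 conv5 gives 14x14 = 196 regions of
# 512 channels; ResNet50 gives 7x7 = 49 regions of 2048 channels.
vgg_feats = rng.standard_normal((196, 512))
res_feats = rng.standard_normal((49, 2048))

d = 256  # shared embedding dimension (an assumption for this sketch)
W_vgg = rng.standard_normal((512, d)) * 0.01
W_res = rng.standard_normal((2048, d)) * 0.01

# Fuse: project each backbone's regions into the shared space, then
# concatenate the two region sets along the region axis.
fused = np.concatenate([vgg_feats @ W_vgg, res_feats @ W_res], axis=0)  # (245, d)

def soft_attention(features, hidden):
    """Score each region against the decoder hidden state and return
    the attention-weighted context vector plus the weights."""
    scores = features @ hidden                 # one scalar score per region
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ features, weights        # context: (d,), weights: (regions,)

hidden = rng.standard_normal(d)               # stand-in decoder state at one time step
context, weights = soft_attention(fused, hidden)
print(context.shape, weights.shape)
```

At each decoding step the context vector would be fed to the caption decoder alongside the previous word, so the model can emphasize different image regions for different words.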

Keywords:
Closed captioning, Computer science, Artificial intelligence, Feature (linguistics), Image fusion, Image (mathematics), Computer vision, Pattern recognition (psychology), Fusion, Linguistics

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 19
Citation Normalized Percentile: 0.38

Topics

Multimodal Machine Learning Applications
Advanced Image and Video Retrieval Techniques
Image Retrieval and Classification Techniques
All under: Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
© 2026 ScienceGate Book Chapters — All rights reserved.