Fine-Grained and Semantic-Guided Visual Attention for Image Captioning

Zongjian Zhang; Qiang Wu; Yang Wang; Fang Chen

doi:10.1109/wacv.2018.00190

JOURNAL ARTICLE

Year: 2018 Pages: 1709-1717

Abstract

Soft attention is regarded as one of the representative methods for image captioning. Built on the end-to-end CNN-LSTM framework, it was the first approach to link the relevant visual information in the image with the semantic representation in the text (i.e., the caption). In recent years, several state-of-the-art methods motivated by this approach have been published, adding more elegant fine-tuning operations. However, due to the constraints of the CNN architecture, the input image is only divided into a fixed-resolution grid at a coarse level. The overall visual feature computed for each grid cell indiscriminately fuses all the objects, or portions of objects, inside it. There is no semantic link among grid cells, even though a single object may be split across several of them. In addition, large-area stuff (e.g., sky and beach) cannot be represented by current methods. To tackle these problems, this paper proposes a new model based on the FCN-LSTM framework that can segment the input image into a fine-grained grid. Moreover, the visual feature representing each grid cell is contributed only by the principal object, or its portion, in that cell. By adopting pixel-wise labels (i.e., semantic segmentation), the visual representations of different grid cells are correlated with each other. In this way, a mechanism of fine-grained and semantic-guided visual attention is created, which can better link the relevant visual information with each semantic meaning in the text through the LSTM. Without using elegant fine-tuning, comprehensive experiments show promising performance consistently across different evaluation metrics.
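The two ideas in the abstract, pooling each grid cell's feature from its principal object only and then attending softly over the cells, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the majority-label rule for picking the principal object, the mean pooling, and the additive (tanh) scoring form are all assumptions made for illustration, and the weight matrices stand in for learned parameters.

```python
import numpy as np


def principal_label_pool(features, labels, grid):
    """Pool per-pixel features into grid cells, using only the pixels that
    carry each cell's most frequent (principal) segmentation label.

    features: (H, W, D) per-pixel feature map (assumed to come from an FCN)
    labels:   (H, W) non-negative integer semantic-segmentation labels
    grid:     number of cells per side; H and W must be divisible by it
    Returns (grid*grid, D) cell features and (grid*grid,) principal labels.
    """
    H, W, D = features.shape
    ch, cw = H // grid, W // grid
    cell_feats = np.zeros((grid * grid, D))
    cell_labels = np.zeros(grid * grid, dtype=int)
    for i in range(grid):
        for j in range(grid):
            f = features[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].reshape(-1, D)
            l = labels[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].reshape(-1)
            principal = np.bincount(l).argmax()  # majority label in the cell
            mask = l == principal                # drop pixels of other objects
            cell_feats[i * grid + j] = f[mask].mean(axis=0)
            cell_labels[i * grid + j] = principal
    return cell_feats, cell_labels


def soft_attention(cell_feats, hidden, W_f, W_h, v):
    """Additive soft attention over grid cells given a decoder hidden state.

    cell_feats: (N, D), hidden: (Hd,), W_f: (D, A), W_h: (Hd, A), v: (A,)
    Returns the (D,) context vector and the (N,) attention weights.
    """
    scores = np.tanh(cell_feats @ W_f + hidden @ W_h) @ v
    scores = scores - scores.max()               # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ cell_feats               # weighted sum of cell features
    return context, weights
```

In a captioning loop, `soft_attention` would be called once per generated word with the current LSTM hidden state, and the returned context vector would feed the next LSTM step. The per-cell principal labels are returned so that cells sharing a label could be related to each other, which is one possible reading of the "semantic-guided" correlation the abstract describes.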

Keywords:

Closed captioning; Computer science; Grid; Feature (linguistics); Object (grammar); Artificial intelligence; Semantics (computer science); Visualization; Semantic grid; Representation (politics); Pixel; Image (mathematics); Natural language processing; Computer vision; Pattern recognition (psychology); Semantic Web; Linguistics

Metrics

FWCI (Field Weighted Citation Impact): 1.01

Citation Normalized Percentile: 0.76

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Neural Network Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

High-Quality Image Captioning With Fine-Grained and Semantic-Guided Visual Attention

Attention-Guided Hierarchical Parsing for Fine-Grained Person-Centric Image Captioning

Image Captioning With Visual-Semantic Double Attention

An Attention-Guided Visual Semantic Fusion Method for Remote Sensing Image Captioning

SAM-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning