Hierarchical LSTMs with Adaptive Attention for Visual Captioning

Lianli Gao; Xiangpeng Li; Jingkuan Song; Heng Tao Shen

doi:10.1109/tpami.2019.2894139

JOURNAL ARTICLE

Year: 2019
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Vol: 42 (5)
Pages: 1-1
Publisher: IEEE Computer Society

Abstract

Recent progress has been made in using attention-based encoder-decoder frameworks for image and video captioning. Most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). However, these non-visual words can be easily predicted using a natural language model without considering visual signals or attention. Imposing the attention mechanism on non-visual words could mislead the decoder and decrease the overall performance of visual captioning. Furthermore, a hierarchy of LSTMs enables more complex representations of visual data, capturing information at different scales. Considering these issues, we propose a hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning. Specifically, the proposed framework utilizes spatial or temporal attention for selecting specific regions or frames to predict the related words, while the adaptive attention decides whether to depend on the visual information or the language context information. Also, a hierarchy of LSTMs is designed to simultaneously consider both low-level visual information and high-level language context information to support caption generation. We design the hLSTMat model as a general framework, and we first instantiate it for the task of video captioning. Then, we further refine it and apply it to the image captioning task. To demonstrate the effectiveness of our proposed framework, we test our method on both video and image captioning tasks. Experimental results show that our approach achieves state-of-the-art performance for most of the evaluation metrics on both tasks. The effect of important components is also explored in the ablation study.
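
The two attention steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring, the scalar sigmoid gate, and all names (`temporal_attention`, `adaptive_gate`, `gate_weights`) are assumptions chosen for illustration, and the hierarchical LSTMs are replaced by a fixed stand-in hidden state.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def temporal_attention(frame_feats, query):
    """Soft attention over frames (or image regions).

    Assumption: dot-product scoring against the decoder state `query`;
    the paper's scoring function may differ.
    """
    weights = softmax([dot(f, query) for f in frame_feats])
    dim = len(frame_feats[0])
    context = [sum(w * f[i] for w, f in zip(weights, frame_feats))
               for i in range(dim)]
    return context, weights


def adaptive_gate(visual_context, language_context, gate_weights, hidden):
    """Blend visual and language context with a gate beta in (0, 1).

    Assumption: beta is a sigmoid of a linear function of the hidden state.
    beta near 1 relies on visual information, beta near 0 on language context.
    """
    beta = 1.0 / (1.0 + math.exp(-dot(gate_weights, hidden)))
    mixed = [beta * v + (1.0 - beta) * l
             for v, l in zip(visual_context, language_context)]
    return mixed, beta


# Toy inputs: three 2-d frame features and a stand-in decoder hidden state.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hidden = [2.0, 0.0]

context, weights = temporal_attention(frames, hidden)
mixed, beta = adaptive_gate(context, hidden, [0.5, -0.5], hidden)
print(weights, beta, mixed)
```

In this sketch a non-visual word such as "the" would correspond to a small `beta`, so the prediction leans on the language context rather than on the attended visual features.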

Keywords:

Closed captioning; Computer science; Context (archaeology); Artificial intelligence; Encoder; Task (project management); Natural language processing; Word (group theory); Visualization; Natural language; Language model; Representation (politics); Speech recognition; Computer vision; Image (mathematics)

Metrics

Cited By: 248

FWCI (Field Weighted Citation Impact): 21.38

Citation Normalized Percentile: 0.99

Topics

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

DAA: Dual LSTMs with adaptive attention for image captioning

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words

Geometry Attention Transformer with position-aware LSTMs for image captioning

Bengali Image Captioning with Visual Attention