The attention mechanism has been widely applied to the temporal task of video captioning and has shown promising improvements. However, in the decoding stage, some words are visual words with corresponding canonical visual signals, while others, such as "a" and "the", are non-visual words that require textual rather than visual information. Simply imposing attention on non-visual words may therefore mislead the decoder and degrade the overall performance of video captioning. To tackle this issue, we propose a Hierarchical Multi-Attention Model (HMAM), which uses two independent attention mechanisms to perform soft selection over frame features and video attributes respectively, and then integrates a further attention model that automatically decides when to rely on visual signals and when to rely on textual information. Experiments on the benchmark MSVD dataset demonstrate that our method, using only a single feature, achieves comparable or even better results than state-of-the-art video captioning baselines.
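As a rough illustration of the hierarchy described above, the following PyTorch-style sketch wires two soft-attention heads (over frame features and video attributes) into a third gating attention that trades off the fused visual context against the decoder's textual hidden state. All module names, layer shapes, and the additive fusion are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMultiAttention(nn.Module):
    """Minimal sketch of a hierarchical multi-attention step.

    Two independent soft-attention heads attend over frame features and
    video attributes; a sigmoid gate then decides how much to rely on the
    fused visual context versus the decoder's textual hidden state.
    Names and dimensions are hypothetical.
    """

    def __init__(self, frame_dim, attr_dim, hidden_dim):
        super().__init__()
        self.frame_score = nn.Linear(frame_dim + hidden_dim, 1)
        self.attr_score = nn.Linear(attr_dim + hidden_dim, 1)
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        self.attr_proj = nn.Linear(attr_dim, hidden_dim)
        self.gate = nn.Linear(hidden_dim * 2, 1)  # visual-vs-text gate

    def soft_attend(self, feats, h, score_layer):
        # feats: (B, N, D) candidate vectors; h: (B, H) decoder state.
        h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = score_layer(torch.cat([feats, h_exp], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)  # soft selection over N items
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)  # (B, D)

    def forward(self, frames, attrs, h):
        # Independent soft selection over frame features and attributes.
        v_frame = self.frame_proj(self.soft_attend(frames, h, self.frame_score))
        v_attr = self.attr_proj(self.soft_attend(attrs, h, self.attr_score))
        visual = v_frame + v_attr  # fused visual context (assumed additive)
        # beta -> 1: rely on visual signals (visual words);
        # beta -> 0: rely on the textual hidden state (non-visual words).
        beta = torch.sigmoid(self.gate(torch.cat([visual, h], dim=-1)))
        return beta * visual + (1 - beta) * h
```

At each decoding step the returned context would feed the word predictor, so that function words like "a" and "the" can be generated from the textual state while content words draw on the attended visual evidence.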