JOURNAL ARTICLE

Video Captioning using Hierarchical Multi-Attention Model

Abstract

The attention mechanism has been widely used in the temporal task of video captioning and has shown promising improvements. However, in the decoding stage, some words are visual words that have corresponding canonical visual signals, while other words such as "a" and "the" are non-visual words, which require text information rather than visual information. Therefore, simply imposing attention on non-visual words may mislead the decoder and decrease the overall performance of video captioning. To tackle this issue, we propose a Hierarchical Multi-Attention Model named HMAM, which uses two independent attention mechanisms to make soft selections over frame features and video attributes respectively, and then integrates another attention model to automatically decide when to rely on the visual signals and when to rely on the text information. Experiments on the benchmark dataset MSVD demonstrate that our method, which uses only a single feature, achieves comparable or even better results than the state-of-the-art baselines for video captioning.
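The model described in the abstract combines two soft-attention modules (one over frame features, one over video attributes) with a gate that decides how much the next word should depend on visual context versus text information. The following is a minimal NumPy sketch of that idea under stated assumptions: the function names, the dot-product attention form, the sigmoid gate, and the simple averaging of the two visual contexts are illustrative choices, not the authors' exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(query, features, W):
    """Soft-select over a set of feature vectors.

    query:    (d,)   decoder hidden state
    features: (T, d) frame features or attribute embeddings
    W:        (d, d) learned projection (here just a placeholder matrix)
    Returns the attended context vector and the attention weights.
    """
    scores = features @ (W @ query)      # (T,) relevance of each feature
    weights = softmax(scores)            # weights sum to 1
    context = weights @ features         # (d,) weighted combination
    return context, weights

def adaptive_fusion(hidden, frame_ctx, attr_ctx, gate_w):
    """Decide when to rely on visual signals vs. text information.

    A sigmoid gate beta in (0, 1) mixes the visual context (average of
    the frame and attribute contexts -- an assumption for this sketch)
    with the text-side hidden state.
    """
    beta = 1.0 / (1.0 + np.exp(-(gate_w @ hidden)))  # scalar gate
    visual = 0.5 * (frame_ctx + attr_ctx)            # merged visual context
    return beta * visual + (1.0 - beta) * hidden
```

In this sketch, a visual word such as "dog" would ideally drive the gate toward 1 (use visual context), while a function word such as "the" would drive it toward 0 (use the language model's hidden state), which is the behavior the abstract motivates.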

Keywords:
Closed captioning, Computer science, Artificial intelligence, Computer vision, Multimedia, Image (mathematics)

Metrics

Cited By: 2
FWCI (Field Weighted Citation Impact): 0.29
Refs: 29
Citation Normalized Percentile: 0.58

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

BOOK-CHAPTER

Hierarchical Attention-Based Video Captioning Using Key Frames

M. Hemalatha, Karthik Periyasamy

Book: Lecture Notes in Electrical Engineering  Year: 2021  Pages: 295-302
JOURNAL ARTICLE

Video Captioning with Multi-Faceted Attention

Xiang Long, Chuang Gan, Gerard de Melo

Journal: Transactions of the Association for Computational Linguistics  Year: 2018  Vol: 6  Pages: 173-184
JOURNAL ARTICLE

Multimodal-enhanced hierarchical attention network for video captioning

Maosheng Zhong, Youde Chen, Hao Zhang, Hao Xiong, Zhixiang Wang

Journal: Multimedia Systems  Year: 2023  Vol: 29 (5)  Pages: 2469-2482