JOURNAL ARTICLE

Multi-Modal Hierarchical Attention-Based Dense Video Captioning

Abstract

Most existing dense video captioning models use a single modality of features for captioning, yet a video carries a wide variety of information: spatial features, temporal features, audio features, and semantic features. In this paper, we propose a dense video captioning model that captures cross-modal attention between different types of features using an audio-visual attention block in the encoder and a hierarchical attention block in the decoder. The audio-visual attention block applies cross-modal attention between the RGB, flow, and audio features. The hierarchical attention block performs two-level attention between the semantic features and the encoder features when generating descriptions. The results show that the proposed approach performs better than the state-of-the-art approaches.
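The two attention blocks described in the abstract can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy, not the paper's implementation: it uses plain scaled dot-product attention, concatenation-based fusion, and made-up feature shapes (`T`, `d`, the semantic feature count) purely to show the data flow of cross-modal attention in the encoder followed by two-level hierarchical attention in the decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, feats):
    """Scaled dot-product attention of query vectors over a feature set."""
    d = feats.shape[-1]
    scores = query @ feats.T / np.sqrt(d)   # (Q, F) attention scores
    return softmax(scores, axis=-1) @ feats # weighted sum of features

rng = np.random.default_rng(0)
T, d = 8, 16  # hypothetical sizes: T timesteps, d dims per modality
rgb, flow, audio = (rng.standard_normal((T, d)) for _ in range(3))

# --- encoder: audio-visual cross-modal attention ---
# each modality queries the others; fusing by concatenation is one
# plausible choice, not necessarily the paper's exact scheme
audio_ctx = attend(audio, np.concatenate([rgb, flow]))  # audio -> visual
rgb_ctx = attend(rgb, audio)                            # RGB   -> audio
flow_ctx = attend(flow, audio)                          # flow  -> audio
encoder_out = np.concatenate([audio_ctx, rgb_ctx, flow_ctx], axis=-1)

# --- decoder: two-level hierarchical attention ---
semantic = rng.standard_normal((5, 3 * d))  # hypothetical semantic features
state = rng.standard_normal((1, 3 * d))     # current decoder hidden state
level1 = attend(state, semantic)            # level 1: attend to semantics
context = attend(level1, encoder_out)       # level 2: attend to encoder
print(encoder_out.shape, context.shape)     # (8, 48) (1, 48)
```

The resulting `context` vector is what a decoder would combine with its hidden state to predict the next caption word; in a real model the queries, keys, and values would be learned linear projections rather than raw features.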

Keywords:
Dense video captioning, Cross-modal attention, Audio-visual attention, Hierarchical attention, Encoder, Semantics, Audio signal processing, Visual perception

Metrics

Cited By: 0
FWCI (Field-Weighted Citation Impact): 0.00
References: 22
Citation Normalized Percentile: 0.11

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Video Analysis and Summarization (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
© 2026 ScienceGate Book Chapters — All rights reserved.