Most existing dense video captioning models use a single feature modality for captioning. A video, however, contains a wide variety of information, such as spatial, temporal, audio, and semantic features. In this paper, we propose a dense video captioning model that captures cross-modal attention between different types of features, using an audio-visual attention block in the encoder and a hierarchical attention block in the decoder. The audio-visual attention block applies cross-modal attention between the RGB, flow, and audio features. The hierarchical attention block performs two-level attention between the semantic features and the encoder features to generate descriptions. Experimental results show that the proposed approach outperforms state-of-the-art approaches.
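The abstract does not specify how the cross-modal attention in the audio-visual attention block is computed; a common formulation is scaled dot-product attention where one modality supplies the queries and another supplies the keys and values. The sketch below illustrates this general idea with NumPy; the feature dimensions, the choice of RGB as the query modality, and the additive fusion are all illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys_values):
    """One modality (queries) attends over another (keys/values)
    via scaled dot-product attention."""
    d_k = keys_values.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)          # rows sum to 1
    return weights @ keys_values

# Hypothetical shapes: T temporal segments, d-dim features per modality.
T, d = 8, 16
rng = np.random.default_rng(0)
rgb   = rng.standard_normal((T, d))
flow  = rng.standard_normal((T, d))
audio = rng.standard_normal((T, d))

# RGB attends to flow and to audio; the attended features are fused
# additively here (the fusion scheme is an assumption for illustration).
rgb_to_flow  = cross_modal_attention(rgb, flow)
rgb_to_audio = cross_modal_attention(rgb, audio)
fused = rgb + rgb_to_flow + rgb_to_audio       # shape (T, d)
```

In practice such blocks would use learned query/key/value projections and multiple heads (as in the Transformer); this sketch keeps only the cross-modal attention pattern itself.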