Most existing dense video captioning models use a single feature modality for captioning. A video, however, contains a wide variety of information, such as spatial, temporal, audio, and semantic features. In this paper, we propose a dense video captioning model that captures cross-modal attention between different types of features, using an audio-visual attention block in the encoder and a hierarchical attention block in the decoder. The audio-visual attention block applies cross-modal attention between the RGB, flow, and audio features. The hierarchical attention block performs two-level attention between the semantic features and the encoder features to generate descriptions. Experimental results show that the proposed approach outperforms state-of-the-art approaches.
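The abstract does not specify how the cross-modal attention in the audio-visual attention block is computed; a common formulation is scaled dot-product attention where one modality supplies the queries and another supplies the keys and values. The sketch below illustrates this general idea with NumPy; the feature dimensions, the choice of RGB as the query modality, and the additive fusion are all illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys_values):
    """One modality (queries) attends over another (keys/values)
    via scaled dot-product attention."""
    d_k = keys_values.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)          # rows sum to 1
    return weights @ keys_values

# Hypothetical shapes: T temporal segments, d-dim features per modality.
T, d = 8, 16
rng = np.random.default_rng(0)
rgb   = rng.standard_normal((T, d))
flow  = rng.standard_normal((T, d))
audio = rng.standard_normal((T, d))

# RGB attends to flow and to audio; the attended features are fused
# additively here (the fusion scheme is an assumption for illustration).
rgb_to_flow  = cross_modal_attention(rgb, flow)
rgb_to_audio = cross_modal_attention(rgb, audio)
fused = rgb + rgb_to_flow + rgb_to_audio       # shape (T, d)
```

In practice such blocks would use learned query/key/value projections and multiple heads (as in the Transformer); this sketch keeps only the cross-modal attention pattern itself.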