In the task of captioning multiple events in video content, traditional models based on self-attention often miss fine-grained visual semantic details because of the complexity and diversity of the content. The long-tailed distribution of words in video corpora further leaves words tied to visual content under-trained, which in turn reduces the accuracy of the generated captions. This paper therefore proposes a Memory-Enhanced Attention Network that uses a memory module to capture the importance of frame-level features: it combines historical visual semantic information, performs a weighted fusion, and gates the resulting information flow, allowing the model to capture complex visual semantics and long-range inter-frame correlations and thus addressing the loss of visual semantic detail. Additionally, the paper introduces a guided-learning cross-entropy loss that injects rich linguistic knowledge into the model through a pre-trained language model (ELM). By minimizing, at the sentence level, the KL divergence between the predicted word distribution and the ELM probability distribution, the model improves caption accuracy and alleviates the long-tailed word distribution of the video corpus. Experimental results on the ActivityNet Captions and YouCook2 datasets demonstrate that the proposed approach outperforms other state-of-the-art models.
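To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a memory-gated fusion step that weights frame features against a memory of historical visual semantics, and a guided cross-entropy loss with a KL term toward an external language model's distribution. All names (`MemoryGatedFusion`, `guided_caption_loss`, `frame_feats`, `mem_bank`, `lm_probs`, `alpha`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryGatedFusion(nn.Module):
    """Sketch: scores frame-level features against a memory bank of historical
    visual semantics, then gates how much of the memory readout flows forward."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, dim, bias=False)  # frame-to-memory affinity
        self.gate = nn.Linear(2 * dim, dim)           # controls the information flow

    def forward(self, frame_feats: torch.Tensor, mem_bank: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) current frame features
        # mem_bank:    (B, M, D) memory slots holding historical visual semantics
        attn = torch.softmax(self.score(frame_feats) @ mem_bank.transpose(1, 2), dim=-1)  # (B, T, M)
        readout = attn @ mem_bank  # weighted fusion of memory content, (B, T, D)
        g = torch.sigmoid(self.gate(torch.cat([frame_feats, readout], dim=-1)))
        # Gated mix keeps long-range context while preserving the current frame features.
        return g * readout + (1.0 - g) * frame_feats


def guided_caption_loss(logits: torch.Tensor,
                        targets: torch.Tensor,
                        lm_probs: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """Sketch: cross-entropy on ground-truth words plus a KL term that pulls the
    predicted word distribution toward the pre-trained language model's distribution."""
    # logits:   (B, L, V) caption-head scores; targets: (B, L) ground-truth word indices
    # lm_probs: (B, L, V) probabilities from the pre-trained language model (ELM)
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    log_p = F.log_softmax(logits, dim=-1)
    kl = F.kl_div(log_p, lm_probs, reduction="batchmean")
    return ce + alpha * kl  # alpha balances caption supervision and LM guidance
```

In this sketch the KL weight `alpha` and the single-slot gating scheme are assumptions; the paper's actual fusion and sentence-level guidance may be structured differently.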
Yong Qian, Yingchi Mao, Zhihao Chen, Chang Li, Olano Teah Bloh, Qian Huang