JOURNAL ARTICLE

Dense Video Captioning Based on Memory Enhanced Attention and Guided Learning

Abstract

In the task of captioning multiple events in video content, traditional models based on self-attention mechanisms often miss fine-grained visual semantic details because of the complexity and diversity of the content. The long-tail distribution of words in video corpora further leaves words related to visual content under-trained, which in turn degrades the accuracy of video captions. This paper therefore proposes a Memory-Enhanced Attention Network that uses a memory module to capture the importance of frame-level features. It combines historical visual semantic information through weighted fusion, with gated selection controlling the information flow, enabling the model to capture complex visual semantics and long-range inter-frame correlations and thus addressing the loss of visual semantic detail. Additionally, the paper introduces a guided-learning cross-entropy loss function that incorporates rich linguistic knowledge into the model via a pre-trained language model (ELM). By minimizing the KL divergence between the predicted word distribution and the ELM probability distribution at the sentence level, the model's caption accuracy is improved, alleviating the long-tail word-distribution problem in the video corpus. Experimental results on the ActivityNet Captions and YouCook2 datasets demonstrate that the proposed approach outperforms other state-of-the-art models.
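The guided-learning objective described above (sentence-level cross-entropy plus a KL term toward the ELM's word distribution) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the `kl_weight` coefficient, and the toy tensor shapes are assumptions for demonstration.

```python
# Hedged sketch of a guided-learning loss: standard cross-entropy on
# the ground-truth words plus a KL-divergence term that pulls the
# model's predicted word distribution toward the distribution produced
# by a pre-trained language model (ELM). All names and the weighting
# scheme are illustrative assumptions, not values from the paper.
import numpy as np

def guided_learning_loss(model_logits, elm_probs, target_ids, kl_weight=0.5):
    """Cross-entropy + kl_weight * KL(model || ELM), averaged over the sentence.

    model_logits: (T, V) unnormalized scores from the captioning model
    elm_probs:    (T, V) word distributions from the pre-trained ELM
    target_ids:   (T,)   ground-truth word indices
    """
    # numerically stable softmax over the vocabulary axis
    shifted = model_logits - model_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

    eps = 1e-12
    # per-step cross-entropy with the ground-truth word
    ce = -np.log(probs[np.arange(len(target_ids)), target_ids] + eps)

    # per-step KL divergence from the model distribution to the ELM distribution
    kl = (probs * (np.log(probs + eps) - np.log(elm_probs + eps))).sum(axis=-1)

    # sentence-level average of the combined objective
    return float((ce + kl_weight * kl).mean())
```

When the ELM distribution matches the model's own distribution the KL term vanishes and the loss reduces to plain cross-entropy; a mismatched ELM distribution adds a non-negative penalty, which is what nudges the model toward the language model's word statistics.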

Keywords:
Closed captioning, Computer science, Multimedia, Speech recognition, Artificial intelligence, Image (mathematics)

Metrics

Cited By: 0
FWCI (Field-Weighted Citation Impact): 0.00
References: 14
Citation Normalized Percentile: 0.21

Topics

Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Video Analysis and Summarization (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)

Related Documents

JOURNAL ARTICLE

Dense video captioning based on local attention

Yong Qian, Yingchi Mao, Zhihao Chen, Chang Li, Olano Teah Bloh, Qian Huang

Journal: IET Image Processing, Year: 2023, Vol: 17 (9), Pages: 2673-2685
JOURNAL ARTICLE

Post-Attention Modulator for Dense Video Captioning

Zixin Guo, Tzu-Jui Julius Wang, Jorma Laaksonen

Journal: 2022 26th International Conference on Pattern Recognition (ICPR), Year: 2022, Pages: 1536-1542
JOURNAL ARTICLE

Motion Guided Spatial Attention for Video Captioning

Shaoxiang Chen, Yu-Gang Jiang

Journal: Proceedings of the AAAI Conference on Artificial Intelligence, Year: 2019, Vol: 33 (01), Pages: 8191-8198