In the task of captioning multiple events in video content, traditional models based on self-attention often miss fine-grained visual semantic details because of the complexity and diversity of the content. The long-tailed distribution of words in video corpora further leaves words tied to visual content under-trained, which in turn reduces the accuracy of the generated captions. This paper therefore proposes a Memory-Enhanced Attention Network that uses a memory module to capture the importance of frame-level features: it combines historical visual semantic information, performs a weighted fusion, and gates the resulting information flow, allowing the model to capture complex visual semantics and long-range inter-frame correlations and thus addressing the loss of visual semantic detail. Additionally, the paper introduces a guided-learning cross-entropy loss that injects rich linguistic knowledge into the model through a pre-trained language model (ELM). By minimizing, at the sentence level, the KL divergence between the predicted word distribution and the ELM probability distribution, the model improves caption accuracy and alleviates the long-tailed word distribution of the video corpus. Experimental results on the ActivityNet Captions and YouCook2 datasets demonstrate that the proposed approach outperforms other state-of-the-art models.
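To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a memory-gated fusion step that weights frame features against a memory of historical visual semantics, and a guided cross-entropy loss with a KL term toward an external language model's distribution. All names (`MemoryGatedFusion`, `guided_caption_loss`, `frame_feats`, `mem_bank`, `lm_probs`, `alpha`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryGatedFusion(nn.Module):
    """Sketch: scores frame-level features against a memory bank of historical
    visual semantics, then gates how much of the memory readout flows forward."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, dim, bias=False)  # frame-to-memory affinity
        self.gate = nn.Linear(2 * dim, dim)           # controls the information flow

    def forward(self, frame_feats: torch.Tensor, mem_bank: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) current frame features
        # mem_bank:    (B, M, D) memory slots holding historical visual semantics
        attn = torch.softmax(self.score(frame_feats) @ mem_bank.transpose(1, 2), dim=-1)  # (B, T, M)
        readout = attn @ mem_bank  # weighted fusion of memory content, (B, T, D)
        g = torch.sigmoid(self.gate(torch.cat([frame_feats, readout], dim=-1)))
        # Gated mix keeps long-range context while preserving the current frame features.
        return g * readout + (1.0 - g) * frame_feats


def guided_caption_loss(logits: torch.Tensor,
                        targets: torch.Tensor,
                        lm_probs: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """Sketch: cross-entropy on ground-truth words plus a KL term that pulls the
    predicted word distribution toward the pre-trained language model's distribution."""
    # logits:   (B, L, V) caption-head scores; targets: (B, L) ground-truth word indices
    # lm_probs: (B, L, V) probabilities from the pre-trained language model (ELM)
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    log_p = F.log_softmax(logits, dim=-1)
    kl = F.kl_div(log_p, lm_probs, reduction="batchmean")
    return ce + alpha * kl  # alpha balances caption supervision and LM guidance
```

In this sketch the KL weight `alpha` and the single-slot gating scheme are assumptions; the paper's actual fusion and sentence-level guidance may be structured differently.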
Yong Qian, Yingchi Mao, Zhihao Chen, Chang Li, Olano Teah Bloh, Qian Huang