JOURNAL ARTICLE

Improving Dense Video Captioning with a Transformer-based Multimodal Fusion Model

Abstract

Dense video captioning (DVC) is a central task in video understanding, spanning computer vision and natural language processing. Traditional DVC models have focused predominantly on visual information and often neglect the audio track. To address this limitation, we propose a Transformer-based multimodal fusion model that processes audio and visual cues jointly. The model is built on an encoder-decoder architecture. The feature encoder combines self-attention with convolutional neural networks to encode audio features, while the decoder performs multimodal fusion, using intermodal confidence scores to integrate the inputs adaptively. A feedforward neural network enhances historical textual representations, and skip connections remove redundant information so that key video features are prioritized during captioning. Experiments on the benchmark datasets MSR-VTT and MSVD show that our model outperforms existing methods, achieving BLEU-4, ROUGE, METEOR, and CIDEr scores of 0.427, 0.618, 0.294, and 0.532 on MSR-VTT, and 0.539, 0.741, 0.369, and 0.976 on MSVD. By exploiting the complementary strengths of audio and visual data, our model establishes a new benchmark in DVC and provides precise, comprehensive interpretation of video content.
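
The confidence-based fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how the intermodal confidence scores are computed or applied, so everything below is an assumption made for illustration: each modality gets a scalar score from a linear projection, the two scores are normalized with a softmax, and the fused feature is the resulting weighted sum. The names `confidence`, `fuse`, `w_audio`, and `w_visual` are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(features, weights, bias=0.0):
    """Hypothetical scalar confidence: a linear score of one modality's features."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def fuse(audio, visual, w_audio, w_visual):
    """Weight each modality by its softmax-normalized confidence and sum.

    Assumes audio and visual features were already projected to the same size.
    Returns the fused vector and the (audio, visual) mixing weights.
    """
    scores = [confidence(audio, w_audio), confidence(visual, w_visual)]
    alpha_audio, alpha_visual = softmax(scores)
    fused = [alpha_audio * a + alpha_visual * v for a, v in zip(audio, visual)]
    return fused, (alpha_audio, alpha_visual)


if __name__ == "__main__":
    audio_feat = [1.0, 3.0]
    visual_feat = [3.0, 5.0]
    # With all-zero projection weights both modalities score equally,
    # so the fused vector is the plain average of the two inputs.
    fused, alphas = fuse(audio_feat, visual_feat, [0.0, 0.0], [0.0, 0.0])
    print(fused, alphas)
```

In a trained model the projection weights would be learned, so that a noisy or silent audio track receives a low weight and the decoder leans on the visual stream instead. A full Transformer decoder would apply this per decoding step and per attention head rather than to a single pair of vectors.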

Keywords:
Closed captioning; Transformer; Computer science; Fusion; Computer vision; Speech recognition; Artificial intelligence; Engineering; Linguistics; Image (mathematics); Electrical engineering

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 13
Citation Normalized Percentile: 0.33

Topics

Multimodal Machine Learning Applications
Human Pose and Action Recognition
Video Analysis and Summarization
Field (all three topics): Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
