Yixuan Liu, Ziwei Zhou, Shen Hui, Haoyuan Ma, Hong‐Ju Li, Zhibo Zhang
Dense Video Captioning (DVC) plays a pivotal role in advancing video understanding at the intersection of computer vision and natural language processing. Traditional DVC models focus predominantly on visual information and often neglect the auditory component. To address this limitation, we propose a Transformer-based multimodal fusion model that integrates audio and visual cues for comprehensive multimodal input processing. Built on an encoder-decoder architecture, the model fuses the two complementary streams: the feature encoder combines self-attention mechanisms with convolutional neural networks for precise audio feature encoding, while the decoder performs multimodal fusion by using intermodal confidence scores to adaptively weight each input. A feedforward neural network enhances historical textual representations, and skip connections filter out redundant information so that captioning concentrates on the most salient video features. Extensive evaluation on the MSR-VTT and MSVD benchmarks shows that our model outperforms existing methods, achieving BLEU-4, ROUGE, METEOR, and CIDEr scores of 0.427, 0.618, 0.294, and 0.532 on MSR-VTT, and 0.539, 0.741, 0.369, and 0.976 on MSVD. By effectively exploiting the complementary strengths of audio and visual data, our model sets a new state of the art in DVC, offering precise and comprehensive video content interpretation.
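The abstract does not include code, but the confidence-score-based adaptive fusion it describes can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the class name ConfidenceFusion, the per-time-step scalar gating, the shared dimension d_model, and the requirement that the audio and visual feature sequences are temporally aligned are all hypothetical, not the paper's actual architecture.

```python
# Minimal sketch of intermodal confidence-weighted audio-visual fusion.
# All names, shapes, and the gating formulation are illustrative assumptions.
import torch
import torch.nn as nn


class ConfidenceFusion(nn.Module):
    """Adaptively fuses audio and visual token features using learned
    per-modality confidence scores (hypothetical formulation)."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        # One scalar confidence per time step for each modality.
        self.audio_conf = nn.Linear(d_model, 1)
        self.visual_conf = nn.Linear(d_model, 1)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, seq_len, d_model), assumed temporally aligned.
        scores = torch.cat(
            [self.audio_conf(audio), self.visual_conf(visual)], dim=-1
        )                                                # (batch, seq_len, 2)
        weights = torch.softmax(scores, dim=-1)          # normalized confidences
        fused = weights[..., :1] * audio + weights[..., 1:] * visual
        return self.proj(fused)


if __name__ == "__main__":
    fusion = ConfidenceFusion(d_model=512)
    a = torch.randn(2, 20, 512)   # dummy audio features
    v = torch.randn(2, 20, 512)   # dummy visual features
    print(fusion(a, v).shape)     # torch.Size([2, 20, 512])
```

The softmax over the two confidence scores makes the modality weights sum to one at every time step, so when one stream is uninformative (e.g., silent audio) its contribution is automatically down-weighted; the paper's actual mechanism may differ.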