JOURNAL ARTICLE

Video Captioning Method Based on Multi-Modal Information Fusion

Abstract

Video captioning is a challenging problem in computer vision that aims to automatically generate human-readable natural language sentences describing video content. However, traditional methods use only a single modality of the video sequence to generate description sentences, which is not enough to capture the full content of the video. To describe video content more accurately and in greater detail, this paper designs a multi-modal fusion model that captures and integrates multiple cues for video captioning. The different modalities contained in the video, namely the original RGB information, the optical flow information, and the attribute tag information, are fused through feature concatenation and an attention mechanism to generate more detailed description sentences. The model thus extracts appearance information, motion information, and high-level semantic information to represent the video. In experiments, the proposed method achieves comparable and even better results on public video captioning datasets.
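The fusion strategy the abstract describes, combining RGB appearance, optical-flow motion, and attribute-tag features via concatenation and attention, can be sketched as below. This is a minimal illustration, not the paper's actual architecture: all dimensions, the random projection weights, and the query vector are assumptions standing in for learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality features for one video (dimensions are illustrative):
rgb_feat = rng.standard_normal(2048)    # appearance (pooled RGB-stream features)
flow_feat = rng.standard_normal(2048)   # motion (optical-flow-stream features)
attr_feat = rng.standard_normal(300)    # high-level semantics (attribute-tag embedding)

def project(x, out_dim, seed):
    """Linear projection to a shared dimension; random weights stand in for learned ones."""
    w = np.random.default_rng(seed).standard_normal((out_dim, x.shape[0])) / np.sqrt(x.shape[0])
    return w @ x

d = 512
feats = np.stack([
    project(rgb_feat, d, seed=1),
    project(flow_feat, d, seed=2),
    project(attr_feat, d, seed=3),
])                                      # (3, d): one row per modality

# Attention over modalities: score each projected feature against a query
# (e.g. the caption decoder's hidden state), softmax, then weighted sum.
query = rng.standard_normal(d)
scores = feats @ query / np.sqrt(d)     # (3,) one score per modality
weights = np.exp(scores - scores.max())
weights /= weights.sum()                # softmax attention weights, sum to 1
fused_attn = weights @ feats            # (d,) attended multi-modal feature

# The simpler alternative the abstract also mentions: plain feature concatenation.
fused_concat = np.concatenate([rgb_feat, flow_feat, attr_feat])  # (2048 + 2048 + 300,)
```

Attention lets the decoder reweight modalities per step (motion cues for action words, attribute cues for object words), while concatenation preserves every raw feature at the cost of a larger input dimension.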

Keywords:
Video captioning; Multi-modal fusion; Feature concatenation; Attention mechanism; Optical flow; RGB; Computer vision; Natural language generation; Semantics

Metrics

- Cited by: 3
- FWCI (Field-Weighted Citation Impact): 0.20
- References: 17
- Citation Normalized Percentile: 0.48


Topics

- Multimodal Machine Learning Applications (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
- Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
- Video Analysis and Summarization (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)