Video captioning is a challenging problem in computer vision that aims to automatically generate human-readable natural language sentences describing video content. However, traditional methods use only a single modality of the video sequence to generate description sentences, which is insufficient to cover the full content of the video. To describe video content more accurately and in greater detail, this paper designs a multi-modal fusion based model that captures and integrates multiple cues for video captioning. The different modalities contained in the video, including the original RGB frames, the optical flow, and the attribute tags, are fused through feature concatenation and an attention mechanism to generate more detailed description sequences. The model thus represents the video with appearance information, motion information, and high-level semantic information. In experiments, the proposed method achieves comparable and even better results on a public video captioning dataset.
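As a rough illustration of how appearance, motion, and semantic features might be combined through concatenation and an attention mechanism, consider the minimal PyTorch sketch below. The abstract does not specify the exact architecture, so the feature dimensions, the scoring function, and all module names here are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Sketch of attention-based fusion over three video modalities.

    Dimensions are illustrative assumptions (e.g. 2048-d CNN appearance
    features, 1024-d optical-flow features, 300-d attribute embeddings),
    not the paper's actual configuration.
    """
    def __init__(self, rgb_dim=2048, flow_dim=1024, attr_dim=300, hidden_dim=512):
        super().__init__()
        # Project each modality into a shared hidden space before fusion.
        self.proj_rgb = nn.Linear(rgb_dim, hidden_dim)
        self.proj_flow = nn.Linear(flow_dim, hidden_dim)
        self.proj_attr = nn.Linear(attr_dim, hidden_dim)
        # Scores one scalar attention weight per modality.
        self.attn = nn.Linear(hidden_dim, 1)

    def forward(self, rgb, flow, attr):
        # Each input: (batch, modality_dim) -> stacked (batch, 3, hidden)
        feats = torch.stack(
            [self.proj_rgb(rgb), self.proj_flow(flow), self.proj_attr(attr)],
            dim=1)
        # Softmax over the modality axis yields per-modality weights.
        weights = torch.softmax(self.attn(torch.tanh(feats)), dim=1)  # (batch, 3, 1)
        attended = (weights * feats).sum(dim=1)  # weighted sum over modalities
        # Concatenate the attended summary with all per-modality features,
        # mirroring the paper's combination of concatenation and attention.
        fused = torch.cat([attended, feats.flatten(1)], dim=-1)
        return fused  # (batch, 4 * hidden), fed to a caption decoder

# Usage with random stand-in features for a batch of 8 videos:
fusion = MultiModalFusion()
rgb = torch.randn(8, 2048)   # pooled appearance features
flow = torch.randn(8, 1024)  # pooled optical-flow features
attr = torch.randn(8, 300)   # attribute-tag embedding
print(fusion(rgb, flow, attr).shape)  # torch.Size([8, 2048])
```

In a full captioning model, the fused vector would initialize or condition a sequence decoder (e.g. an LSTM) that emits the description word by word.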