Video captioning is a challenging problem in computer vision that aims to automatically generate human-readable natural language sentences describing video content. However, traditional methods use only a single modality of the video sequence to generate description sentences, which is insufficient to cover the full content of the video. To describe video content more accurately and in greater detail, this paper designs a multi-modal fusion based model that captures and integrates multiple cues for video captioning. The different modalities contained in the video, including the original RGB frames, the optical flow, and the attribute tags, are fused through feature concatenation and an attention mechanism to generate more detailed description sequences. The model thus represents the video with appearance information, motion information, and high-level semantic information. In experiments, the proposed method achieves comparable and even better results on a public video captioning dataset.
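As a rough illustration of how appearance, motion, and semantic features might be combined through concatenation and an attention mechanism, consider the minimal PyTorch sketch below. The abstract does not specify the exact architecture, so the feature dimensions, the scoring function, and all module names here are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Sketch of attention-based fusion over three video modalities.

    Dimensions are illustrative assumptions (e.g. 2048-d CNN appearance
    features, 1024-d optical-flow features, 300-d attribute embeddings),
    not the paper's actual configuration.
    """
    def __init__(self, rgb_dim=2048, flow_dim=1024, attr_dim=300, hidden_dim=512):
        super().__init__()
        # Project each modality into a shared hidden space before fusion.
        self.proj_rgb = nn.Linear(rgb_dim, hidden_dim)
        self.proj_flow = nn.Linear(flow_dim, hidden_dim)
        self.proj_attr = nn.Linear(attr_dim, hidden_dim)
        # Scores one scalar attention weight per modality.
        self.attn = nn.Linear(hidden_dim, 1)

    def forward(self, rgb, flow, attr):
        # Each input: (batch, modality_dim) -> stacked (batch, 3, hidden)
        feats = torch.stack(
            [self.proj_rgb(rgb), self.proj_flow(flow), self.proj_attr(attr)],
            dim=1)
        # Softmax over the modality axis yields per-modality weights.
        weights = torch.softmax(self.attn(torch.tanh(feats)), dim=1)  # (batch, 3, 1)
        attended = (weights * feats).sum(dim=1)  # weighted sum over modalities
        # Concatenate the attended summary with all per-modality features,
        # mirroring the paper's combination of concatenation and attention.
        fused = torch.cat([attended, feats.flatten(1)], dim=-1)
        return fused  # (batch, 4 * hidden), fed to a caption decoder

# Usage with random stand-in features for a batch of 8 videos:
fusion = MultiModalFusion()
rgb = torch.randn(8, 2048)   # pooled appearance features
flow = torch.randn(8, 1024)  # pooled optical-flow features
attr = torch.randn(8, 300)   # attribute-tag embedding
print(fusion(rgb, flow, attr).shape)  # torch.Size([8, 2048])
```

In a full captioning model, the fused vector would initialize or condition a sequence decoder (e.g. an LSTM) that emits the description word by word.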