JOURNAL ARTICLE

Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training

Abstract

Pre-trained neural models have recently achieved impressive performance in understanding multimodal content. However, pre-training neural models for video and language understanding remains very challenging, especially for Chinese video-language data, for the following reasons. First, existing video-language pre-training algorithms mainly focus on the co-occurrence of words and video frames but ignore other valuable semantic and structural information in video-language content, e.g., sequential order and spatiotemporal relationships. Second, there are conflicts between video-sentence alignment and other proxy tasks. Third, there is a lack of large-scale, high-quality Chinese video-language datasets (e.g., containing 10 million unique videos), which are a fundamental condition for the success of pre-training techniques. In this work, we propose a novel video-language understanding framework named Victor, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training. Besides general proxy tasks such as masked language modeling, Victor constructs several novel proxy tasks under the contrastive learning paradigm, making the model more robust and able to capture complex multimodal semantic and structural relationships from different perspectives. Victor is trained on a large-scale Chinese video-language dataset comprising over 10 million complete videos with corresponding high-quality textual descriptions. We apply the pre-trained Victor model to a series of downstream applications and demonstrate its superior performance compared against state-of-the-art pre-training methods such as VideoBERT and UniVL.
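The abstract describes proxy tasks built under the contrastive learning paradigm, where matched video-text pairs are pulled together and mismatched pairs in the batch act as negatives. As an illustrative sketch only (not the paper's exact formulation; the function name, batch layout, and temperature value are assumptions), a symmetric InfoNCE-style video-text contrastive loss can be written as:

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over a batch of paired
    video/text embeddings (shape (B, D)): matched pairs (the diagonal of
    the similarity matrix) are treated as positives, all other pairs in
    the batch as negatives."""
    # L2-normalize so the dot product becomes cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))         # diagonal = matched pairs
    # Cross-entropy in both directions: video-to-text and text-to-video.
    log_probs_v2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_v2t = -log_probs_v2t[labels, labels].mean()
    log_probs_t2v = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t2v = -log_probs_t2v[labels, labels].mean()
    return (loss_v2t + loss_t2v) / 2
```

Minimizing this loss drives each video embedding toward its own caption and away from the other captions in the batch; the paper's novel proxy tasks extend this idea to other semantic and structural relationships.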

Keywords:
Computer science, Sentence, Natural language processing, Artificial intelligence, Proxy (statistics), Language model, Quality (philosophy), Focus (optics), Machine learning

Metrics

Cited By: 34
FWCI (Field-Weighted Citation Impact): 3.07
Refs: 60
Citation Normalized Percentile: 0.93


Topics

(all under Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)

Multimodal Machine Learning Applications
Video Analysis and Summarization
Human Pose and Action Recognition

Related Documents

JOURNAL ARTICLE

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

Journal: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Year: 2021, Pages: 6787-6800
JOURNAL ARTICLE

Scaling up Multimodal Pre-Training for Sign Language Understanding

Wengang Zhou, Weichao Zhao, Hezhen Hu, Zecheng Li, Houqiang Li

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, Year: 2025, Vol: 47 (12), Pages: 11753-11767
JOURNAL ARTICLE

Multimodal Hate Speech Detection in Memes Using Contrastive Language-Image Pre-Training

Journal: International Research Journal of Modernization in Engineering Technology and Science, Year: 2025