JOURNAL ARTICLE

Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training

Abstract

Pre-trained neural models have recently achieved impressive performance in understanding multimodal content. However, pre-training neural models for video and language understanding remains very challenging, especially for Chinese video-language data, for the following reasons. First, existing video-language pre-training algorithms mainly focus on the co-occurrence of words and video frames but ignore other valuable semantic and structural information in video-language content, e.g., sequential order and spatiotemporal relationships. Second, there are conflicts between video-sentence alignment and other proxy tasks. Third, there is a lack of large-scale, high-quality Chinese video-language datasets (e.g., containing 10 million unique videos), which are a fundamental condition for the success of pre-training techniques. In this work, we propose a novel video-language understanding framework named Victor, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training. Besides general proxy tasks such as masked language modeling, Victor constructs several novel proxy tasks under the contrastive learning paradigm, making the model more robust and able to capture complex multimodal semantic and structural relationships from different perspectives. Victor is trained on a large-scale Chinese video-language dataset comprising over 10 million complete videos with corresponding high-quality textual descriptions. We apply the pre-trained Victor model to a series of downstream applications and demonstrate its superior performance compared against state-of-the-art pre-training methods such as VideoBERT and UniVL.
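The abstract describes proxy tasks built under the contrastive learning paradigm, where matched video-text pairs are pulled together and mismatched pairs in the batch act as negatives. As an illustrative sketch only (not the paper's exact formulation; the function name, batch layout, and temperature value are assumptions), a symmetric InfoNCE-style video-text contrastive loss can be written as:

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over a batch of paired
    video/text embeddings (shape (B, D)): matched pairs (the diagonal of
    the similarity matrix) are treated as positives, all other pairs in
    the batch as negatives."""
    # L2-normalize so the dot product becomes cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))         # diagonal = matched pairs
    # Cross-entropy in both directions: video-to-text and text-to-video.
    log_probs_v2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_v2t = -log_probs_v2t[labels, labels].mean()
    log_probs_t2v = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t2v = -log_probs_t2v[labels, labels].mean()
    return (loss_v2t + loss_t2v) / 2
```

Minimizing this loss drives each video embedding toward its own caption and away from the other captions in the batch; the paper's novel proxy tasks extend this idea to other semantic and structural relationships.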

Keywords:
Computer science, Sentence, Natural language processing, Artificial intelligence, Proxy (statistics), Language model, Quality (philosophy), Focus (optics), Machine learning

Metrics

Cited By: 34
FWCI (Field-Weighted Citation Impact): 3.07
Refs: 60
Citation Normalized Percentile: 0.93


Topics

(all under Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)

Multimodal Machine Learning Applications
Video Analysis and Summarization
Human Pose and Action Recognition

Related Documents

JOURNAL ARTICLE

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

Journal: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Year: 2021, Pages: 6787-6800
JOURNAL ARTICLE

Scaling up Multimodal Pre-Training for Sign Language Understanding

Wengang Zhou, Weichao Zhao, Hezhen Hu, Zecheng Li, Houqiang Li

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, Year: 2025, Vol: 47 (12), Pages: 11753-11767
JOURNAL ARTICLE

Multimodal Hate Speech Detection in Memes Using Contrastive Language-Image Pre-Training

Journal: International Research Journal of Modernization in Engineering Technology and Science, Year: 2025