JOURNAL ARTICLE

CLAP: Contrastive Language-Audio Pre-training Model for Multi-modal Sentiment Analysis

Abstract

Multi-modal Sentiment Analysis (MSA) is a hotspot of multi-modal fusion research. To make full use of the correlation and complementarity between modalities when fusing multi-modal data, we propose a two-stage framework of Contrastive Language-Audio Pre-training (CLAP) for the MSA task: 1) performing contrastive pre-training on large-scale unlabeled external data to yield better single-modal representations; 2) adopting a Transformer-based multi-modal fusion module to further optimize the single-modal features and predict sentiment via a task-driven training process. Our work demonstrates the importance and necessity of core elements such as pre-training, contrastive learning, and representation learning for the MSA task, and significantly outperforms existing methods on two well-recognized MSA benchmarks.
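The abstract does not spell out the stage-1 objective, but contrastive language-audio pre-training of this kind is typically implemented as a CLIP-style symmetric InfoNCE loss over paired text and audio embeddings. The NumPy sketch below illustrates that general idea only; the function name and the temperature value are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def clip_style_contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss between paired text and audio embeddings.

    text_emb, audio_emb: (N, D) arrays where row i of each is a matched pair.
    (Hypothetical sketch; not the paper's exact objective.)
    """
    # L2-normalize so the dot product becomes cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = (t @ a.T) / temperature  # (N, N); matched pairs lie on the diagonal
    n = logits.shape[0]

    def cross_entropy(lg):
        # numerically stable log-softmax over each row,
        # with the row's matching column as the true class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # pull matched (text, audio) pairs together in both retrieval directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pushes each text embedding toward its paired audio embedding and away from the other clips in the batch, which is what yields the transferable single-modal representations that stage 2 then fuses.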

Keywords:
Computer science, Modal, Artificial intelligence, Natural language processing, Sentiment analysis, Speech recognition

Metrics

Cited By: 7
FWCI (Field Weighted Citation Impact): 1.79
References: 27
Citation Normalized Percentile: 0.84

Topics

Sentiment Analysis and Opinion Mining
Physical Sciences →  Computer Science →  Artificial Intelligence
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence