GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio Pretraining for Accurate Speech Emotion Recognition

Yu Pan; Yanni Hu; Yuguang Yang; Fei Wen; Jixun Yao; Heng Lu; Lei Ma; Jianjun Zhao

doi:10.1109/icassp48485.2024.10448394

JOURNAL ARTICLE

Year: 2024 Pages: 10021-10025

Abstract

Contrastive cross-modality pretraining has recently shown impressive success in diverse fields, yet there is limited research on its merits in speech emotion recognition (SER). In this paper, we propose GEmo-CLAP, a gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for SER. Specifically, we first construct an effective emotion CLAP (Emo-CLAP) for SER, using pretrained text and audio encoders. Second, given the significance of gender information in SER, we further propose two novel models, a multi-task learning based GEmo-CLAP (ML-GEmo-CLAP) and a soft label based GEmo-CLAP (SL-GEmo-CLAP), which incorporate gender information of speech signals to form more reasonable objectives. Experiments on IEMOCAP indicate that both proposed GEmo-CLAP models consistently outperform Emo-CLAP across different pretrained models. Remarkably, the proposed WavLM-based SL-GEmo-CLAP obtains the best weighted average recall (WAR) of 83.16%, outperforming state-of-the-art SER methods.
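As a rough illustration of how gender information could enter a CLAP-style objective through soft labels, the sketch below blends an emotion-match matrix with a gender-match matrix and scores audio-text similarities against the result. It is not taken from the paper: the blending weight `alpha`, the `temperature`, the cross-entropy form of the loss, and every function name are assumptions chosen for illustration, and plain Python lists stand in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def match_matrix(labels):
    """Entry (i, j) is 1.0 when items i and j share a label, else 0.0."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]


def soft_targets(emotions, genders, alpha):
    """Blend emotion and gender match matrices, then normalize each row.

    alpha is a hypothetical weight on the emotion term (assumption).
    """
    m_emo = match_matrix(emotions)
    m_gen = match_matrix(genders)
    targets = []
    for row_e, row_g in zip(m_emo, m_gen):
        blended = [alpha * e + (1.0 - alpha) * g for e, g in zip(row_e, row_g)]
        total = sum(blended)  # at least 1.0, since the diagonal always matches
        targets.append([b / total for b in blended])
    return targets


def cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and soft target rows."""
    loss = 0.0
    for logit_row, target_row in zip(logits, targets):
        probs = softmax(logit_row)
        loss -= sum(t * math.log(p) for t, p in zip(target_row, probs) if t > 0.0)
    return loss / len(logits)


def contrastive_loss(audio, text, emotions, genders, alpha=0.8, temperature=0.1):
    """Symmetric audio-to-text and text-to-audio loss against soft targets."""
    logits = [[cosine(a, t) / temperature for t in text] for a in audio]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    targets = soft_targets(emotions, genders, alpha)
    return 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits_t, targets))
```

Setting `alpha=1.0` in this sketch reduces the targets to emotion matches only, which corresponds loosely to an emotion-only baseline; lower values let same-gender pairs receive part of the target mass.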

Keywords: Computer science; Speech recognition; Modality (human–computer interaction); Artificial intelligence; Task (project management); Natural language processing

Metrics

FWCI (Field Weighted Citation Impact): 4.99

Citation Normalized Percentile: 0.91

Topics

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

Human-CLAP: Human-perception-based Contrastive Language-audio Pretraining

RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval

Multimodal Audio-Language Model for Speech Emotion Recognition

EMO-RL: Emotion-Rule-Based Reinforcement Learning Enhanced Audio-Language Model for Generalized Speech Emotion Recognition