Disentangled Representation Learning for Multimodal Emotion Recognition

Dingkang Yang; Shuai Huang; Haopeng Kuang; Yangtao Du; Lihua Zhang

doi:10.1145/3503161.3547754

JOURNAL ARTICLE

Year: 2022

Journal: Proceedings of the 30th ACM International Conference on Multimedia

Pages: 1642-1651

Abstract

Multimodal emotion recognition aims to identify human emotions from text, audio, and visual modalities. Previous methods either explore correlations between modalities or design sophisticated fusion strategies. However, heterogeneous modalities often exhibit a distribution gap and information redundancy, so the learned multimodal representations may be unrefined. Motivated by these observations, we propose a Feature-Disentangled Multimodal Emotion Recognition (FDMER) method, which learns common and private feature representations for each modality. Specifically, we design common and private encoders to project each modality into modality-invariant and modality-specific subspaces, respectively. The modality-invariant subspace captures the commonality among modalities and reduces the distribution gap. The modality-specific subspaces enhance diversity and capture the unique characteristics of each modality. A modality discriminator then guides the parameter learning of the common and private encoders in an adversarial manner. We enforce modality consistency and disparity constraints through losses tailored to these subspaces. Furthermore, we present a cross-modal attention fusion module that learns adaptive weights to obtain effective multimodal representations. The final representation is used for different downstream tasks. Experimental results show that FDMER outperforms state-of-the-art methods on two multimodal emotion recognition benchmarks. We further verify the effectiveness of our model through experiments on the multimodal humor detection task.
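The pipeline described in the abstract (common and private projections, consistency and disparity constraints, attention-weighted fusion) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the encoders are random linear maps, the consistency loss is a pairwise squared distance, the disparity loss is a squared cosine similarity, and the fusion is a single softmax over dot-product scores. The adversarial modality discriminator is omitted, and all names and feature sizes are hypothetical.

```python
import math
import random

def init_weights(d_out, d_in, rng):
    """Random projection matrix standing in for a trained encoder."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_out)]

def project(weights, x):
    """Apply a linear map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def consistency_loss(common):
    """Mean squared distance between the common vectors of each modality pair."""
    names = sorted(common)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    total = 0.0
    for a, b in pairs:
        total += sum((x - y) ** 2 for x, y in zip(common[a], common[b]))
    return total / len(pairs)

def disparity_loss(common, private):
    """Mean squared cosine similarity between common and private vectors.

    Zero when every modality's private vector is orthogonal to its common one.
    """
    total = 0.0
    for name in common:
        c, p = common[name], private[name]
        norm = math.sqrt(dot(c, c) * dot(p, p)) or 1.0  # guard against zero vectors
        total += (dot(c, p) / norm) ** 2
    return total / len(common)

def attention_fuse(features):
    """Softmax-weighted sum of feature vectors, scored against their mean."""
    dim = len(features[0])
    query = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    scores = [dot(f, query) / math.sqrt(dim) for f in features]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    weights = [e / sum(exps) for e in exps]
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights

rng = random.Random(0)
input_dims = {"text": 6, "audio": 4, "visual": 5}  # hypothetical feature sizes
hidden = 3  # hypothetical subspace size

# Assumption: one common and one private map per modality, because the raw
# input sizes differ here; all of them project into the same hidden size.
common_enc = {m: init_weights(hidden, d, rng) for m, d in input_dims.items()}
private_enc = {m: init_weights(hidden, d, rng) for m, d in input_dims.items()}

inputs = {m: [rng.gauss(0, 1) for _ in range(d)] for m, d in input_dims.items()}
common = {m: project(common_enc[m], x) for m, x in inputs.items()}
private = {m: project(private_enc[m], x) for m, x in inputs.items()}

fused, weights = attention_fuse(list(common.values()) + list(private.values()))
print("consistency:", consistency_loss(common))
print("disparity:", disparity_loss(common, private))
print("attention weights:", weights)
```

In a trained model, minimizing the first loss would pull the common vectors of the three modalities together, while minimizing the second would push each private vector away from its common counterpart. The fused vector is what a downstream classifier would consume.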

Keywords:

Modalities; Computer science; Modality (human–computer interaction); Artificial intelligence; Feature learning; Linear subspace; Multimodal learning; Encoder; Redundancy (engineering); Representation (politics); Feature (linguistics); Subspace topology; Machine learning; Natural language processing; Mathematics

Metrics

Cited By: 195

FWCI (Field Weighted Citation Impact): 30.99

Citation Normalized Percentile: 1.00

Is in top 1%

Is in top 10%

Topics

Emotion and Mood Recognition

Social Sciences → Psychology → Experimental and Cognitive Psychology

Sentiment Analysis and Opinion Mining

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Fine-Grained Disentangled Representation Learning For Multimodal Emotion Recognition

Disentangled Representation and Contrastive Learning with Adaptive Affinity Squeeze-Excitation for Multimodal Emotion Recognition

Disentangled Multimodal Representation Learning for Recommendation

EEG-Based Multimodal Representation Learning for Emotion Recognition