Abstract

Deep learning has emerged as a powerful alternative to hand-crafted methods for emotion recognition over combined acoustic and text modalities. Baseline systems model emotion information in the text and acoustic modalities independently, using Deep Convolutional Neural Networks (DCNNs) and Recurrent Neural Networks (RNNs), and then apply attention, fusion, and classification. In this paper, we present a deep learning-based approach that exploits and fuses text and acoustic data for emotion classification. We use a SincNet layer, based on parameterized sinc functions that implement band-pass filters, to extract acoustic features from raw audio, followed by a DCNN. This approach learns filter banks tuned for emotion recognition and yields more effective features than applying convolutions directly to the raw speech signal. For text processing, we use two parallel branches (a DCNN, and a bidirectional RNN (Bi-RNN) followed by a DCNN), where cross attention is introduced to infer N-gram-level correlations from the hidden representations produced by the Bi-RNN. Following existing state-of-the-art work, we evaluate the proposed system on the IEMOCAP dataset. Experimental results indicate that the proposed system outperforms existing methods, achieving a 5.2% improvement in weighted accuracy.
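The distinctive acoustic front end here is a sinc-based band-pass filter bank whose only learnable parameters are the low and high cutoff frequencies of each filter. As a rough illustration of how such a layer can be built, below is a minimal PyTorch sketch, assuming 16 kHz input; the class name (SincFilterBank), filter count, kernel size, and cutoff initialization are our own assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class SincFilterBank(nn.Module):
    """Minimal sketch of a SincNet-style layer (assumed configuration): each
    filter is a band-pass sinc kernel parameterized only by learnable low and
    high cutoff frequencies, so the layer learns a filter bank rather than
    free-form convolution weights."""

    def __init__(self, n_filters=40, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Learnable cutoffs (Hz); initialization spread over the spectrum is
        # an assumption for illustration.
        self.low_hz = nn.Parameter(
            torch.linspace(30.0, sample_rate / 2 - 200.0, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 100.0))
        # Fixed pieces: symmetric time axis (seconds) and a Hamming window.
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
        self.register_buffer("n", n / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):  # x: (batch, 1, time) raw waveform
        f1 = torch.abs(self.low_hz)            # low cutoff >= 0
        f2 = f1 + torch.abs(self.band_hz)      # high cutoff > low cutoff
        # (A full implementation would also clamp f2 to the Nyquist rate.)
        # Band-pass impulse response as the difference of two low-pass sinc
        # filters: g[n] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n),
        # then multiplied by the window.
        t = self.n.unsqueeze(0)                # (1, kernel_size)
        lp1 = 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * t)
        lp2 = 2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * t)
        kernels = (lp2 - lp1) * self.window    # (n_filters, kernel_size)
        kernels = kernels / self.sample_rate   # keep amplitudes modest
        return nn.functional.conv1d(
            x, kernels.unsqueeze(1), padding=self.kernel_size // 2)
```

For a one-second 16 kHz clip, `SincFilterBank()(torch.randn(1, 1, 16000))` returns a (1, 40, 16000) tensor of band-pass responses, which a DCNN can then consume in place of convolutions over the raw waveform.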

Keywords:
Emotion recognition, Deep learning, Speech recognition, Convolutional neural network, Recurrent neural network, Artificial neural network, Multimodal fusion, Filter (signal processing), Pattern recognition
