Speech Emotion: Investigating Model Representations, Multi-Task Learning and Knowledge Distillation

Vikramjit Mitra; Hsiang-Yun Sherry Chien; Vasudha Kowtha; Joseph Y. Cheng; Erdrin Azemi

doi:10.21437/interspeech.2022-957

JOURNAL ARTICLE

Year: 2022 Journal: Interspeech 2022 Pages: 4715-4719

Abstract

Estimating dimensional emotions, such as activation, valence and dominance, from acoustic speech signals has been widely explored over the past few years. While accurate estimation of activation and dominance from speech seems to be possible, the same for valence remains challenging. Previous research has shown that the use of lexical information can improve valence estimation performance. Lexical information can be obtained from pre-trained acoustic models, where the learned representations can improve valence estimation from speech. We investigate the use of pre-trained model representations to improve valence estimation from the acoustic speech signal. We also explore fusion of representations to improve emotion estimation across all three emotion dimensions: activation, valence and dominance. Additionally, we investigate whether representations from pre-trained models can be distilled into models trained with low-level features, resulting in models with fewer parameters. We show that fusion of pre-trained model embeddings results in a 79% relative improvement in concordance correlation coefficient (CCC) on valence estimation compared to a standard acoustic feature baseline (mel-filterbank energies), while distillation from pre-trained model embeddings to lower-dimensional representations yielded a relative 12% improvement. Such performance gains were observed over two evaluation sets, indicating that our proposed architecture generalizes across those evaluation sets. We report new state-of-the-art "text-free" acoustic-only dimensional emotion estimation CCC values on two MSP-Podcast evaluation sets.
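
The abstract reports its results as concordance correlation coefficient (CCC) values and as relative improvements over a baseline. As a minimal sketch of how such numbers are usually computed, assuming the standard definition of Lin's CCC with population variances, the two quantities could be written as below. The function names `ccc` and `relative_improvement` are illustrative and not taken from the paper, and the paper's own evaluation code may differ in detail.

```python
from statistics import fmean


def ccc(pred, target):
    """Lin's concordance correlation coefficient for two equal-length sequences.

    Assumed standard form:
        CCC = 2 * cov(p, t) / (var(p) + var(t) + (mean(p) - mean(t)) ** 2)
    using population (divide-by-N) variance and covariance.
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_t = fmean(pred), fmean(target)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_t = fmean([(t - mean_t) ** 2 for t in target])
    cov = fmean([(p - mean_p) * (t - mean_t) for p, t in zip(pred, target)])
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        # Both sequences are the same constant; CCC is undefined here.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction (0.2 means 20%)."""
    return (improved - baseline) / baseline
```

Unlike Pearson correlation, CCC also penalizes a shift in mean or scale between predictions and labels, so a prediction that is perfectly correlated with the reference but offset from it scores below 1.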

Keywords:

Computer science; Task (project management); Distillation; Speech recognition; Multi-task learning; Natural language processing; Artificial intelligence; Human–computer interaction; Machine learning; Engineering

Metrics

FWCI (Field Weighted Citation Impact): 0.98

Citation Normalized Percentile: 0.74

Topics

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Application of Knowledge Distillation to Multi-Task Speech Representation Learning

Knowledge Distillation for Multi-task Learning

Lightweight Speech Emotion Recognition Model Based on Multi-Task Learning

Speech Emotion Recognition with Multi-Task Learning

Online Knowledge Distillation for Multi-task Learning