Recent advances in technology have made emotion-aware interaction between humans and machines plausible, and collecting data from human communication, as in customer call centres, is now common practice. Deep learning research has progressed to the point where human speech recordings can be used to automatically infer a speaker's emotion and intention, and many models and architectures have successfully expanded the input from audio alone to additional modalities such as text. Yet existing work focuses largely on the choice of model or architecture, yielding only a superficial understanding of the interactions between modalities. In this thesis, we study methods that explicitly model the interactions between the audio and text modalities.

The attention mechanism is the most widely used method for combining audio and text representations: an attention layer maps parts of the sequential representation of one modality onto parts of the other modality's sequence. Such a mapping is beneficial for translation, however, but not for the fusion required to predict emotion labels. We therefore propose a 2-D attention mechanism that finds the most salient pairs of items, one from the audio sequence and the other from the text sequence. 2-D attention allows emotion to be recognised from the explicit relationships between parts of the audio sequence and parts of the text sequence, which has not been attempted in existing work. In addition, multi-head attention is employed, and the output of each attention head is fed into a recurrent unit for prediction. Our method is evaluated against existing fusion models for multimodal speech emotion recognition on the IEMOCAP dataset, and the results support its advantage: it achieves state-of-the-art accuracy.
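To make the pairwise idea concrete, the following is a minimal sketch of how a 2-D attention layer over an audio and a text sequence could be realised. It assumes a simple scaled dot-product score and additive pairwise features; the class name `PairwiseAttention`, the dimensions, and the pooling choice are illustrative assumptions rather than the exact formulation used in the thesis, which additionally employs multi-head attention and a recurrent prediction layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseAttention(nn.Module):
    """Illustrative 2-D (pairwise) attention over an audio and a text sequence.

    Every (audio frame, text token) pair receives one attention weight, so the
    fused representation is driven by the most salient cross-modal pairs rather
    than by a one-directional alignment. Scoring and pooling choices here are
    assumptions for the sketch, not the exact thesis formulation.
    """

    def __init__(self, audio_dim: int, text_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_a, audio_dim), text: (batch, T_t, text_dim)
        a = self.audio_proj(audio)                     # (batch, T_a, hidden)
        t = self.text_proj(text)                       # (batch, T_t, hidden)

        # Score every audio-text pair: (batch, T_a, T_t)
        scores = torch.einsum("bah,bth->bat", a, t) / a.size(-1) ** 0.5

        # Softmax over the flattened pair grid -> one 2-D attention map
        weights = F.softmax(scores.flatten(1), dim=-1).view_as(scores)

        # Pairwise features (sum of projections, for simplicity), pooled
        # by the 2-D attention weights into a single fused vector
        pair_feats = a.unsqueeze(2) + t.unsqueeze(1)   # (batch, T_a, T_t, hidden)
        fused = (weights.unsqueeze(-1) * pair_feats).sum(dim=(1, 2))
        return fused                                    # (batch, hidden)


# Toy usage: 4 utterances, 50 audio frames, 12 text tokens
if __name__ == "__main__":
    attn = PairwiseAttention(audio_dim=40, text_dim=300)
    fused = attn(torch.randn(4, 50, 40), torch.randn(4, 12, 300))
    print(fused.shape)  # torch.Size([4, 128])
```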
While most research focuses on models or architectures to tackle speech emotion recognition efficiently, generalizability is another challenge; it has been explored less, and even less for multimodal inputs. We therefore study cross-corpus speech emotion recognition and extend it to a multimodal setting with the text modality. We introduce a novel multitask transformer framework for cross-corpus speech emotion recognition, in which pre-trained transformers are fine-tuned on the emotion recognition task with unsupervised objectives, contrastive learning and an information-maximization loss, as auxiliary tasks. The same unsupervised objectives also serve as a transfer learning method from the source dataset to the target dataset. Finally, we extend the framework to multimodal inputs by applying it to the text modality and adopting decision-level fusion of the audio and text transformers. Experiments on the publicly available IEMOCAP, MSP-IMPROV and Emo-DB datasets show that our method surpasses the existing state-of-the-art results in cross-corpus speech emotion recognition and yields a significant further improvement when the text modality is added through simple decision-level fusion.

Fusion is not the only way to exploit multimodal inputs in speech emotion recognition. We also explore a co-learning methodology in which the text modality improves speech emotion recognition. Specifically, we study data augmentation with generative models conditioned on text input. Generative models that map text directly to audio are naïve and limited, since the quality of the generated audio is not guaranteed. We therefore propose a mutual information maximization principle, which has not yet been explored for text-conditioned generation in data augmentation. Experiments on the commonly used IEMOCAP and MSP-IMPROV datasets thoroughly examine the efficacy of our method, and the results yield improved state-of-the-art accuracy on speech emotion recognition.

In conclusion, this thesis contributes several methods that enhance the interaction between the audio and text modalities for multimodal speech emotion recognition. First, we propose a novel attention mechanism that fuses audio and text at a fine-grained level to improve emotion recognition. Second, we improve the generalizability of emotion recognition via a novel multitask transformer framework that is successful in both speech emotion recognition and multimodal emotion recognition. Finally, we propose a generative model that augments the audio input using text information while maximizing mutual information, achieving improved speech emotion recognition.

In future work, we will explore generative models for multimodal data and quantify the advantage of mutual information in multimodal fusion.