Audio is an indispensable attribute of nature, and the study of sound is therefore a worthwhile avenue for expanding our knowledge of it and its peculiarities. One characteristic intrinsically associated with audio is the emotion, or sentiment, that it conveys. Emotions are central to understanding the human psyche and mindset in a behavioural context, and their study advances domains such as sociology and psychology. With this vision of advancement, in this paper we explore the realm of speech emotion recognition: we conduct a comparative study of popular existing deep learning algorithms, such as CNNs and LSTMs, alongside our own variations in architecture and hyperparameters, and, informed by the resulting performance metrics, propose a transformer model incorporating a bidirectional LSTM, an encoder, a decoder, and scaled dot-product attention. The widely used RAVDESS dataset serves as the benchmark. The model has shown promising preliminary results across different testing and validation metrics and can potentially be employed in systems requiring high precision.
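As context for the attention mechanism the abstract names, the following is a minimal sketch of standard scaled dot-product attention (as introduced in "Attention Is All You Need"), not the paper's own implementation; all shapes, names, and values here are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    # Similarity of each query to each key, scaled by sqrt(d_k) to keep
    # the softmax in a well-conditioned range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a convex combination of the value vectors.
    return weights @ V

# Toy example with hypothetical dimensions (2 queries, 3 keys/values).
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 5))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 5)
```

In the proposed architecture this operation sits inside the encoder–decoder transformer, letting each time step of the speech representation weight every other time step when predicting the emotion label.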