JOURNAL ARTICLE

Multimodal Emotion Recognition using Acoustic and Visual Features

Abstract

Emotions are strong messengers that convey our inner experiences, wants, and aspirations. Correctly understanding emotions allows us to negotiate life's problems, make educated decisions, build meaningful connections with others, and develop emotional intelligence. This research aims to automatically and accurately determine a person's emotion using a multimodal emotion recognition strategy that fuses acoustic and visual modalities. The RAVDESS dataset has been used for emotion detection. Machine learning algorithms such as SVM, Random Forest, KNN, Gradient Boosting, MLP, Decision Tree, Naïve Bayes, and ensemble learning techniques were used for training and testing to identify emotion from the auditory components. The LeNet-5 model was used to identify emotion from visual imagery. Metrics such as accuracy, the confusion matrix, and training/validation loss were used to evaluate the performance of these models. The proposed technique uses high-quality audio and video data, with the acoustic ensemble method attaining 65% accuracy and the video CNN model obtaining 86% accuracy. The recognition accuracy increases to 94.5% when the acoustic and visual components are combined at the model level.
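As a hedged illustration of the model-level fusion described above (not the authors' exact implementation), the two modalities' outputs can be combined as a weighted average of per-class probability vectors; the weights and example probabilities below are hypothetical, while the eight emotion labels follow the RAVDESS dataset.

```python
# Hypothetical sketch of model-level (late) fusion: combine the class
# probabilities from the acoustic ensemble and the visual CNN with a
# weighted average, then pick the highest-scoring emotion.

EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]  # RAVDESS classes

def fuse_predictions(audio_probs, video_probs, audio_weight=0.4):
    """Weighted average of per-class probabilities from both modalities."""
    video_weight = 1.0 - audio_weight
    fused = [audio_weight * a + video_weight * v
             for a, v in zip(audio_probs, video_probs)]
    total = sum(fused)                      # renormalise to a distribution
    fused = [p / total for p in fused]
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best], fused

# Example: the audio model leans "sad", the video model strongly
# suggests "happy"; the fused decision favours the video evidence.
audio = [0.05, 0.05, 0.20, 0.40, 0.10, 0.10, 0.05, 0.05]
video = [0.02, 0.03, 0.70, 0.10, 0.05, 0.05, 0.03, 0.02]
label, probs = fuse_predictions(audio, video)
```

In practice the fusion weight would be tuned on a validation split; the 0.4/0.6 choice here is only an assumption for illustration.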

Keywords:
Emotion recognition; Computer science; Speech recognition; Artificial intelligence; Human–computer interaction

