JOURNAL ARTICLE

Multimodal Depression Detection Using Audio Visual Cues

Abstract

Depression is a common mental health condition affecting millions of people worldwide. Early and accurate identification is essential for effective intervention and therapy. Multimodal techniques that combine audio and video data have recently shown promising results in depression detection. In this article, the authors propose a convolutional neural network (CNN) model for multimodal depression detection from audio and video. The model exploits audio and visual cues to capture rich, complementary information about depression. For the audio modality, features such as pitch, intensity, and spectral information are extracted from voice recordings. For the video modality, facial expressions, head movements, and body gestures are captured using facial landmarks, optical flow, and pose estimation. The CNN architecture processes the audio and visual inputs in parallel branches. The audio branch applies three convolutional layers, followed by pooling and dense layers, to learn discriminative audio representations. The video branch uses five convolutional layers with varying filter widths and depths, followed by pooling and fully connected layers, to extract video-specific information. A late fusion strategy is then adopted for multimodal fusion: the learned features from the two modalities are concatenated and passed through additional dense layers for depression prediction. During training, regularization strategies including dropout and batch normalization are applied to mitigate overfitting. The authors evaluated the proposed multimodal CNN on the DAIC-WOZ dataset, which consists of audio and video recordings of individuals with and without depression. The model achieved 77% accuracy, demonstrating superior performance compared to unimodal approaches.
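The architecture described in the abstract (three audio convolutional layers, five video convolutional layers, late fusion by concatenation, dropout and batch normalization) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the layer widths, kernel sizes, input dimensions, and class names here are all assumptions, since the abstract does not specify them.

```python
# Hypothetical sketch of the two-branch multimodal CNN with late fusion.
# Only the branch/layer counts, fusion by concatenation, and the use of
# dropout and batch normalization come from the abstract; every
# dimension and hyperparameter below is an assumed placeholder.
import torch
import torch.nn as nn

class MultimodalDepressionCNN(nn.Module):
    def __init__(self, audio_dim=64, video_channels=3):
        super().__init__()
        # Audio branch: three 1-D conv layers over frame-level features
        # (e.g. pitch, intensity, spectral bins), then pooling + dense.
        self.audio_branch = nn.Sequential(
            nn.Conv1d(audio_dim, 32, kernel_size=3, padding=1),
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Video branch: five 2-D conv layers with varying filter widths
        # over per-frame visual maps (landmarks / optical flow / pose),
        # then pooling and a fully connected layer.
        self.video_branch = nn.Sequential(
            nn.Conv2d(video_channels, 16, kernel_size=7, padding=3),
            nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Late fusion: concatenate the two branch embeddings, then
        # dense layers with dropout for the binary depression output.
        self.classifier = nn.Sequential(
            nn.Linear(64 + 64, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, 1),  # logit; apply sigmoid for a probability
        )

    def forward(self, audio, video):
        fused = torch.cat(
            [self.audio_branch(audio), self.video_branch(video)], dim=1
        )
        return self.classifier(fused)

model = MultimodalDepressionCNN()
audio = torch.randn(2, 64, 100)    # (batch, feature dims, time frames)
video = torch.randn(2, 3, 64, 64)  # (batch, channels, height, width)
logits = model(audio, video)       # shape (2, 1)
```

Late fusion, as used here, keeps the two modalities independent until after feature learning, which lets each branch specialize before the concatenated embedding is classified.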

Keywords:
Computer science; Audio-visual; Depression; Speech recognition; Computer vision; Human–computer interaction; Artificial intelligence; Multimedia

Metrics

Cited by: 3
FWCI (Field-Weighted Citation Impact): 1.25
References: 23
Citation Normalized Percentile: 0.76

Topics

Emotion and Mood Recognition (Social Sciences → Psychology → Experimental and Cognitive Psychology)
Subtitles and Audiovisual Media (Social Sciences → Arts and Humanities → Language and Linguistics)
Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)