JOURNAL ARTICLE

Multimodal Depression Detection Using Audio Visual Cues

Abstract

Depression is a common mental health condition affecting millions of people worldwide. Early and accurate identification is essential for effective intervention and therapy. Multimodal techniques that combine audio and video data have recently shown promising results in depression detection. In this article, the authors propose a convolutional neural network (CNN) model for multimodal depression detection from audio and video. The model exploits audio and visual cues to capture rich, complementary information about depression. For the audio modality, features such as pitch, intensity, and spectral information are extracted from voice recordings. For the video modality, facial expressions, head movements, and body gestures are captured using facial landmarks, optical flow, and pose estimation. The CNN architecture processes the audio and visual inputs in parallel branches. The audio branch applies three convolutional layers, followed by pooling and dense layers, to learn discriminative audio representations. The video branch uses five convolutional layers with varying filter widths and depths, followed by pooling and fully connected layers, to extract video-specific information. A late fusion strategy is then adopted for multimodal fusion: the learned features from the two modalities are concatenated and passed through additional dense layers for depression prediction. During training, regularization strategies including dropout and batch normalization are applied to mitigate overfitting. The authors evaluated the proposed multimodal CNN on the DAIC-WOZ dataset, which consists of audio and video recordings of individuals with and without depression. The model achieved 77% accuracy, demonstrating superior performance compared to unimodal approaches.
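The architecture described in the abstract (three audio convolutional layers, five video convolutional layers, late fusion by concatenation, dropout and batch normalization) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the layer widths, kernel sizes, input dimensions, and class names here are all assumptions, since the abstract does not specify them.

```python
# Hypothetical sketch of the two-branch multimodal CNN with late fusion.
# Only the branch/layer counts, fusion by concatenation, and the use of
# dropout and batch normalization come from the abstract; every
# dimension and hyperparameter below is an assumed placeholder.
import torch
import torch.nn as nn

class MultimodalDepressionCNN(nn.Module):
    def __init__(self, audio_dim=64, video_channels=3):
        super().__init__()
        # Audio branch: three 1-D conv layers over frame-level features
        # (e.g. pitch, intensity, spectral bins), then pooling + dense.
        self.audio_branch = nn.Sequential(
            nn.Conv1d(audio_dim, 32, kernel_size=3, padding=1),
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Video branch: five 2-D conv layers with varying filter widths
        # over per-frame visual maps (landmarks / optical flow / pose),
        # then pooling and a fully connected layer.
        self.video_branch = nn.Sequential(
            nn.Conv2d(video_channels, 16, kernel_size=7, padding=3),
            nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Late fusion: concatenate the two branch embeddings, then
        # dense layers with dropout for the binary depression output.
        self.classifier = nn.Sequential(
            nn.Linear(64 + 64, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, 1),  # logit; apply sigmoid for a probability
        )

    def forward(self, audio, video):
        fused = torch.cat(
            [self.audio_branch(audio), self.video_branch(video)], dim=1
        )
        return self.classifier(fused)

model = MultimodalDepressionCNN()
audio = torch.randn(2, 64, 100)    # (batch, feature dims, time frames)
video = torch.randn(2, 3, 64, 64)  # (batch, channels, height, width)
logits = model(audio, video)       # shape (2, 1)
```

Late fusion, as used here, keeps the two modalities independent until after feature learning, which lets each branch specialize before the concatenated embedding is classified.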

Keywords:
Computer science; Audio-visual; Depression; Speech recognition; Computer vision; Human–computer interaction; Artificial intelligence; Multimedia

Metrics

Cited by: 3
FWCI (Field-Weighted Citation Impact): 1.25
References: 23
Citation Normalized Percentile: 0.76

Topics

Emotion and Mood Recognition (Social Sciences → Psychology → Experimental and Cognitive Psychology)
Subtitles and Audiovisual Media (Social Sciences → Arts and Humanities → Language and Linguistics)
Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)