Abstract

Cross-modal audio-visual perception has been a long-standing topic in psychology and neurology, and numerous studies have found strong correlations between human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem by leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals and consider two scenarios: instrument-oriented generation and pose-oriented generation. As the first to explore this new problem, we compose two new datasets of paired images and sounds of musical performances on different instruments. Our experiments, using both classification and human evaluation, demonstrate that our model can generate one modality (audio or visual) from the other to a good extent. Our experiments on various design choices, together with the datasets, will facilitate future research in this new problem space.
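The core idea, a generator conditioned on an embedding of one modality paired with a discriminator that scores output/condition pairs, can be sketched compactly. Below is a minimal, hypothetical PyTorch sketch of such a conditional GAN for the sound-to-image direction; the layer widths, embedding sizes, and conditioning scheme are illustrative assumptions, not the authors' exact networks.

# Minimal sketch of a conditional GAN for sound-to-image generation,
# in the spirit of the paper's approach. All architecture details
# (embedding size, layer widths, how the condition is injected) are
# illustrative assumptions, not the authors' exact design.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a noise vector concatenated with an audio embedding to a 64x64 image."""
    def __init__(self, z_dim=100, cond_dim=128, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + cond_dim, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z, audio_emb):
        # Condition by concatenating the audio embedding onto the noise vector.
        x = torch.cat([z, audio_emb], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(x)

class Discriminator(nn.Module):
    """Scores image/condition pairs; the condition is tiled onto the feature map."""
    def __init__(self, cond_dim=128, img_channels=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(img_channels, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, True),
        )
        self.head = nn.Sequential(nn.Conv2d(512 + cond_dim, 1, 4, 1, 0), nn.Sigmoid())

    def forward(self, img, audio_emb):
        h = self.conv(img)                                     # (B, 512, 4, 4)
        c = audio_emb[:, :, None, None].expand(-1, -1, 4, 4)   # tile the condition spatially
        return self.head(torch.cat([h, c], dim=1)).view(-1)

# Smoke test with random tensors standing in for encoded sound snippets.
if __name__ == "__main__":
    G, D = Generator(), Discriminator()
    z = torch.randn(8, 100)
    emb = torch.randn(8, 128)   # placeholder for a real audio embedding
    fake = G(z, emb)            # (8, 3, 64, 64)
    score = D(fake, emb)        # (8,)
    print(fake.shape, score.shape)

The image-to-sound direction would mirror this structure with the roles of the modalities swapped: an image encoder supplies the condition and the generator emits an audio representation (e.g., a spectrogram) instead of pixels.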

Keywords:
Computer science, Modal, Generative grammar, Modality (human–computer interaction), Perception, Artificial intelligence, Audio visual, Speech recognition, Machine learning, Human–computer interaction, Multimedia

Metrics

Cited by: 198
FWCI (Field-Weighted Citation Impact): 17.60
References: 34
Citation Normalized Percentile: 0.99 (top 1%)

Topics

Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Music Technology and Sound Studies (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)