Abstract

Cross-modal audio-visual perception has been a long-standing topic in psychology and neurology, and numerous studies have found strong correlations between human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem by leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals and consider two scenarios: instrument-oriented generation and pose-oriented generation. As the first to explore this new problem, we compose two new datasets of paired images and sounds of musical performances on different instruments. Our experiments, using both classification and human evaluation, demonstrate that our model can generate one modality (audio or visual) from the other to a good extent. Our experiments on various design choices, together with the datasets, will facilitate future research in this new problem space.
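The core idea, a generator conditioned on an embedding of one modality paired with a discriminator that scores output/condition pairs, can be sketched compactly. Below is a minimal, hypothetical PyTorch sketch of such a conditional GAN for the sound-to-image direction; the layer widths, embedding sizes, and conditioning scheme are illustrative assumptions, not the authors' exact networks.

# Minimal sketch of a conditional GAN for sound-to-image generation,
# in the spirit of the paper's approach. All architecture details
# (embedding size, layer widths, how the condition is injected) are
# illustrative assumptions, not the authors' exact design.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a noise vector concatenated with an audio embedding to a 64x64 image."""
    def __init__(self, z_dim=100, cond_dim=128, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + cond_dim, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z, audio_emb):
        # Condition by concatenating the audio embedding onto the noise vector.
        x = torch.cat([z, audio_emb], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(x)

class Discriminator(nn.Module):
    """Scores image/condition pairs; the condition is tiled onto the feature map."""
    def __init__(self, cond_dim=128, img_channels=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(img_channels, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, True),
        )
        self.head = nn.Sequential(nn.Conv2d(512 + cond_dim, 1, 4, 1, 0), nn.Sigmoid())

    def forward(self, img, audio_emb):
        h = self.conv(img)                                     # (B, 512, 4, 4)
        c = audio_emb[:, :, None, None].expand(-1, -1, 4, 4)   # tile the condition spatially
        return self.head(torch.cat([h, c], dim=1)).view(-1)

# Smoke test with random tensors standing in for encoded sound snippets.
if __name__ == "__main__":
    G, D = Generator(), Discriminator()
    z = torch.randn(8, 100)
    emb = torch.randn(8, 128)   # placeholder for a real audio embedding
    fake = G(z, emb)            # (8, 3, 64, 64)
    score = D(fake, emb)        # (8,)
    print(fake.shape, score.shape)

The image-to-sound direction would mirror this structure with the roles of the modalities swapped: an image encoder supplies the condition and the generator emits an audio representation (e.g., a spectrogram) instead of pixels.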

Keywords:
Computer science, Modal, Generative grammar, Modality (human–computer interaction), Perception, Artificial intelligence, Audio visual, Speech recognition, Machine learning, Human–computer interaction, Multimedia

Metrics

Cited by: 198
FWCI (Field-Weighted Citation Impact): 17.60
References: 34
Citation Normalized Percentile: 0.99 (top 1%)

Topics

Music and Audio Processing (Physical Sciences → Computer Science → Signal Processing)
Music Technology and Sound Studies (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Speech and Audio Processing (Physical Sciences → Computer Science → Signal Processing)