Abstract

In this work, we present an end-to-end binaural speech synthesis system that combines a low-bitrate audio codec with a powerful binaural decoder that is capable of accurate speech binauralization while faithfully reconstructing environmental factors like ambient noise or reverb. The network is a modified vector-quantized variational autoencoder, trained with several carefully designed objectives, including an adversarial loss. We evaluate the proposed system on an internal binaural dataset with objective metrics and a perceptual study. Results show that the proposed approach matches the ground truth data more closely than previous methods. In particular, we demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
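
As a rough illustration of the vector-quantization step that a vector-quantized variational autoencoder relies on, the sketch below maps each latent vector to its nearest codebook entry, so that only the integer indices would need to be transmitted. The codebook values, latent vectors, and function names are hypothetical and chosen for illustration; the abstract does not describe the paper's modifications, training objectives, or adversarial loss, and none of them are represented here.

```python
# Illustrative sketch only: the codebook, latents, and function names are
# hypothetical and are not taken from the paper.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Map each latent vector to the index and value of its nearest codebook entry."""
    indices = []
    quantized = []
    for z in latents:
        # Pick the codebook index with the smallest distance to this latent.
        idx = min(range(len(codebook)),
                  key=lambda k: squared_distance(z, codebook[k]))
        indices.append(idx)
        quantized.append(codebook[idx])
    return indices, quantized


# Toy codebook with three 2-dimensional entries (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
# Toy encoder outputs (assumed values).
latents = [[0.9, 1.2], [-0.1, 0.2], [-0.8, 0.7]]

indices, quantized = quantize(latents, codebook)
print(indices)    # indices a codec would transmit
print(quantized)  # vectors a decoder would receive after codebook lookup
```

In a codec of this general kind, the bitrate is governed by how many indices are sent per second and by the codebook size, since each index costs about log2(K) bits for a codebook of K entries.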

Keywords:
Computer science; Binaural recording; Speech synthesis; Speech recognition; End-to-end principle; Artificial intelligence

Metrics

Cited By: 10
FWCI (Field Weighted Citation Impact): 1.18
Refs: 30
Citation Normalized Percentile: 0.78

Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Speech and dialogue systems
Physical Sciences →  Computer Science →  Artificial Intelligence
Phonetics and Phonology Research
Social Sciences →  Psychology →  Experimental and Cognitive Psychology