Communicating one's inner state, one's emotions and feelings, is a core element of human social communication and behavior. Emotion is an important component of speech, and its inclusion in synthetic speech would enable breakthroughs in applications such as human-machine interfacing, e-book reading, and voice acting. However, modelling emotion in speech in an end-to-end manner has so far remained an under-explored research topic. To address this, we experiment with novel methods for global emotion modelling in unsupervised, semi-supervised, and adversarial settings using an end-to-end text-to-speech (TTS) architecture. We condition the latent space, duration prediction, and audio generation on novel hybrid labels derived from ground-truth data (14 emotion labels, 64 sentiment-analysis labels, and speaker labels), which can be inferred from the input text at inference time. We also experiment with conditional discriminators. The final proposed model produces high-quality expressive speech comparable to the state of the art.
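To make the hybrid conditioning concrete, the following is a minimal sketch, not the paper's actual implementation: each utterance carries an emotion label (14 classes), a sentiment-analysis label (64 classes), and a speaker label, and their embeddings are concatenated into one global conditioning vector that the latent space, duration predictor, and audio generator could consume. The speaker count, embedding dimension, and function names here are illustrative assumptions.

```python
import numpy as np

# Assumed label inventories: 14 emotions and 64 sentiment labels are
# from the abstract; the speaker count and embedding size are made up.
N_EMOTIONS, N_SENTIMENTS, N_SPEAKERS = 14, 64, 4
EMB_DIM = 16

rng = np.random.default_rng(0)

# One embedding (lookup) table per label type, randomly initialised here;
# in a real model these would be trained jointly with the TTS network.
emotion_table = rng.standard_normal((N_EMOTIONS, EMB_DIM))
sentiment_table = rng.standard_normal((N_SENTIMENTS, EMB_DIM))
speaker_table = rng.standard_normal((N_SPEAKERS, EMB_DIM))

def hybrid_condition(emotion_id, sentiment_id, speaker_id):
    """Concatenate the three label embeddings into one global
    conditioning vector (a hypothetical helper for illustration)."""
    return np.concatenate([
        emotion_table[emotion_id],
        sentiment_table[sentiment_id],
        speaker_table[speaker_id],
    ])

cond = hybrid_condition(emotion_id=3, sentiment_id=10, speaker_id=1)
print(cond.shape)  # (48,) = 3 * EMB_DIM
```

At inference time, the emotion and sentiment ids would come from classifiers run on the input text rather than from ground-truth annotations, which is what allows the model to be driven by text alone.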
Katsuki Inoue, Sunao Hara, Masanobu Abe, Tomoki Hayashi, Ryuichi Yamamoto, Shinji Watanabe
Nafis Sadeq, Nafis Tahmid Chowdhury, Farhan Tanvir Utshaw, Shafayat Ahmed, Muhammad Abdullah Adnan