Although end-to-end text-to-speech (TTS) synthesizers produce human-like speech, they still lack intuitive user control over prosody. Modeling the prosody of interrogative sentences is challenging because question types vary widely. Synthesized intonation often lacks accuracy, richness, and detail when only a small amount of adaptation data from particular sentence types is available. This paper uses speaker adaptation to enhance the modeling of interrogative sentence prosody in speech synthesis, tested on an English dataset. The adaptation data were selected based on the occurrence of interrogative sentences: the first dataset contained frequent interrogative sentences, whereas the second contained declarative sentences. Two target speakers (one male, one female) were adapted. Objective and subjective evaluations show that the proposed model achieves strong performance in intonation. A MUSHRA subjective listening test showed better intonation patterns with the interrogative dataset than with the declarative one. Potential applications of this model include assistive technology for the visually impaired and chatbots/voice bots.
Ali Raheem Mandeel, Mohammed Salah Al-Radhi, Tamás Gábor Csapó