Modern approaches to text-to-speech require the entire input character sequence to be processed before any audio is synthesised. This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation. Interleaving the action of reading a character with that of synthesising audio reduces this latency. However, the order of this sequence of interleaved actions varies across sentences, which raises the question of how the actions should be chosen. We propose a reinforcement learning based framework to train an agent to make this decision. We compare our performance against that of deterministic, rule-based systems. Our results demonstrate that our agent successfully balances the trade-off between the latency of audio generation and the quality of synthesised audio. More broadly, we show that neural sequence-to-sequence models can be adapted to run in an incremental manner.
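The interleaving described above can be illustrated with a minimal sketch (not the paper's implementation): an incremental TTS loop alternates READ actions, which consume one input character, with SPEAK actions, which synthesise audio for the characters read so far. A deterministic wait-k rule stands in for the rule-based baselines the abstract mentions; the names `wait_k_policy` and `run_incremental` are illustrative assumptions, not the authors' code.

```python
READ, SPEAK = "READ", "SPEAK"

def wait_k_policy(n_read, n_spoken, k=3):
    """Rule-based baseline: stay k characters ahead of the audio,
    then alternate reading and speaking (an assumed wait-k rule)."""
    if n_read < n_spoken + k:
        return READ
    return SPEAK

def run_incremental(text, policy):
    """Interleave actions until every character is read and spoken.
    A learned RL agent would replace `policy` with a trained decision
    function that trades off latency against audio quality."""
    n_read = n_spoken = 0
    actions = []
    while n_spoken < len(text):
        # Once the input is exhausted, speaking is the only legal action.
        action = policy(n_read, n_spoken) if n_read < len(text) else SPEAK
        actions.append(action)
        if action == READ:
            n_read += 1
        else:
            n_spoken += 1  # stand-in for synthesising one unit of audio
    return actions

actions = run_incremental("hello world", wait_k_policy)
```

With k=3 the first audio is emitted after only three characters are read, rather than after the full sentence; the RL framework's role is to learn when such early commitments are safe for a given sentence.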