JOURNAL ARTICLE

Incremental Text to Speech for Neural Sequence-to-Sequence Models Using Reinforcement Learning

Abstract

Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised. This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation. Interleaving the action of reading a character with that of synthesising audio reduces this latency. However, the order of this sequence of interleaved actions varies across sentences, which raises the question of how the actions should be chosen. We propose a reinforcement-learning-based framework to train an agent to make this decision. We compare our performance against that of deterministic, rule-based systems. Our results demonstrate that our agent successfully balances the trade-off between the latency of audio generation and the quality of synthesised audio. More broadly, we show that neural sequence-to-sequence models can be adapted to run in an incremental manner.
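The core idea in the abstract (interleaving a READ action, which consumes one input character, with a SPEAK action, which emits one step of synthesised audio, under a decision policy) can be sketched as a simple loop. The policy below is a stand-in threshold rule, not the paper's trained RL agent, and all names (`threshold_policy`, `incremental_tts`, `lookahead`) are illustrative assumptions:

```python
# Minimal sketch of the interleaved READ/SPEAK loop described in the abstract.
# In the paper, an RL-trained agent chooses the next action; here a fixed
# threshold rule stands in for that policy. A real system would call a neural
# synthesiser at each SPEAK step instead of appending a marker.

READ, SPEAK = 0, 1

def threshold_policy(n_read, n_spoken, lookahead=2):
    """Stand-in policy: SPEAK once we have read `lookahead` more
    characters than we have synthesised, otherwise READ."""
    return SPEAK if n_read - n_spoken >= lookahead else READ

def incremental_tts(text, policy=threshold_policy):
    """Interleave reading input characters with synthesis steps.

    Returns the action trace, e.g. ['R', 'R', 'S', 'R', 'S', ...],
    where 'R' consumes one character and 'S' synthesises one step.
    """
    trace = []
    n_read = n_spoken = 0
    while n_spoken < len(text):
        # Once the whole input is read, only SPEAK remains possible.
        if n_read < len(text) and policy(n_read, n_spoken) == READ:
            n_read += 1
            trace.append("R")
        else:
            n_spoken += 1
            trace.append("S")
    return trace

trace = incremental_tts("hello")
# First audio is emitted after only 2 characters are read, rather than
# after the full sentence, which is the latency reduction the paper targets.
```

The `lookahead` parameter makes the latency/quality trade-off explicit: a larger value delays the first audio but gives the synthesiser more context, which is precisely the balance the RL agent is trained to strike per sentence.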

Keywords:
Computer science, Interleaving, Reinforcement learning, Latency (audio), Speech recognition, Artificial intelligence, Sequence learning, Low latency, Natural language processing

Metrics

Cited By: 11
FWCI (Field-Weighted Citation Impact): 1.47
References: 47
Citation Normalized Percentile: 0.85

Topics

Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Natural Language Processing Techniques (Physical Sciences → Computer Science → Artificial Intelligence)
Speech Recognition and Synthesis (Physical Sciences → Computer Science → Artificial Intelligence)