Listeners can perceive and use a wide array of fine-grained phonetic details, including the detailed coarticulatory influences of adjacent sounds, when perceiving speech. Details such as the anticipatory nasalization in can, for example, potentially provide the listener with a rich network of informative cues and are key to understanding listeners' ability to disambiguate speech sounds from seemingly ambiguous input. Unfortunately, these coarticulatory cues are generally missing or contradictory in the output of speech synthesis systems. Such systems work by concatenating variable-length sound units chosen from a large database of recorded speech. Units are chosen to minimize two functions: the cost of aligning a particular unit with the desired speech output (target cost) and the cost of adjoining the next sound to the most recently selected unit (join cost). Generally, these costs are calculated from features that can be automatically extracted from the acoustic speech signal. In this work, a unit selection database is created, automatically segmented, and automatically labeled with nasal and oral airflow feature vectors. These aerodynamic features are used as a proxy for articulatory information in the calculation of the target and join cost functions. Listeners' mean opinion scores are obtained on output from this system and from a baseline acoustic system for comparison.
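The target/join cost framework summarized above can be illustrated with a small sketch: unit selection is typically cast as a search over a lattice of candidate units, picking the sequence that minimizes the summed target and join costs. The code below is not the system described here; the feature representation, the Euclidean cost definitions, and all names (Unit, target_cost, join_cost, select_units) are illustrative assumptions used only to show the dynamic-programming structure.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Unit:
    """A candidate database unit. 'features' stands in for whatever feature
    vector the system uses (acoustic, or aerodynamic nasal/oral airflow)."""
    features: Sequence[float]

def target_cost(target: Sequence[float], unit: Unit) -> float:
    # Cost of aligning a candidate unit with the desired target specification.
    return math.dist(target, unit.features)

def join_cost(prev: Unit, nxt: Unit) -> float:
    # Cost of adjoining a unit to the previously selected unit (boundary mismatch).
    return math.dist(prev.features, nxt.features)

def select_units(targets: List[Sequence[float]],
                 candidates: List[List[Unit]]) -> List[Unit]:
    """Viterbi-style dynamic program: choose one candidate per target position
    so that the total target cost plus join cost along the path is minimal."""
    n = len(targets)
    # best[i][j] = (cheapest path cost ending at candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), -1) for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path through the candidate lattice.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

Under this sketch, swapping the acoustic feature vectors for aerodynamic (nasal/oral airflow) vectors changes only what target_cost and join_cost measure, not the search itself.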
Abhinav Sethy, Shrikanth Narayanan