Modern text-to-speech synthesis systems should deliver speech that is not only intelligible but whose style matches the domain in which the synthesized speech is used. In this paper, three approaches to expressive speech synthesis based on deep neural networks are presented: style codes, model re-training, and an architecture using shared hidden layers. Their usability is tested on a speech corpus containing a limited amount of expressive speech data. A new architecture for transplanting speech styles is also presented and compared with a reference approach from the literature.
Xuehao Zhou, Mingyang Zhang, Yi Zhou, Zhizheng Wu, Haizhou Li
Kanellos, Ioannis; Suciu, Ioana; Moudenc, Thierry
Alexandre Trilla, Francesc Álías
Yuanyuan Zhu, Jiaxu He, Rui Jing, Yaodong Song, Jie Lian, Xiao-Lei Zhang, Jie Li