Tibetan Speech Synthesis Based on Pre-Traind Mixture Alignment FastSpeech2

Qing Zhou; Xiaona Xu; Yue Zhao

doi:10.3390/app14156834

Year: 2024; Journal: Applied Sciences; Vol: 14 (15); Pages: 6834-6834; Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Most current research on Tibetan speech synthesis relies on autoregressive deep learning models. However, these models suffer from slow inference, skipped words, and repetitions. To overcome these issues, we propose an enhanced non-autoregressive acoustic model combined with a vocoder for Tibetan speech synthesis. Specifically, we introduce the mixture alignment FastSpeech2 method to correct errors caused by hard alignment in the original FastSpeech2 method. The new method employs soft alignment at the level of Latin letters and hard alignment at the level of Tibetan characters, which improves the alignment accuracy between text and speech and enhances the naturalness and intelligibility of the synthesized speech. Additionally, we integrate pitch and energy information into the model, further enhancing overall synthesis quality. Tibetan also has smaller text-to-audio datasets than widely studied languages. To address this resource limitation, we use transfer learning: the model is first pre-trained on data from resource-rich languages, and the pre-trained mixture alignment FastSpeech2 model is then fine-tuned for Tibetan speech synthesis. Experimental results demonstrate that the mixture alignment FastSpeech2 model produces higher-quality speech than the original FastSpeech2 model, and that pre-training on an English dataset yields further improvements in clarity and naturalness.
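The abstract does not spell out how the two alignment levels interact, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a soft alignment matrix over Latin (transliterated) letters is already available, and derives hard frame counts per Tibetan character by assigning each frame to the character whose letters hold the most attention mass. The function name `character_durations`, the argmax assignment rule, and the toy numbers are all hypothetical.

```python
def character_durations(soft_align, letters_per_char):
    """Turn a letter-level soft alignment into character-level hard durations.

    soft_align[t][j] is the (assumed) attention weight of speech frame t on
    Latin letter j; each row sums to 1. letters_per_char[c] is the number of
    Latin letters that transliterate Tibetan character c.
    Returns the number of frames assigned to each character.
    """
    # Map every letter index to the character that owns it.
    owner = []
    for char_index, letter_count in enumerate(letters_per_char):
        owner.extend([char_index] * letter_count)

    durations = [0] * len(letters_per_char)
    for row in soft_align:
        # Pool the soft letter weights of this frame per character.
        mass = [0.0] * len(letters_per_char)
        for letter_index, weight in enumerate(row):
            mass[owner[letter_index]] += weight
        # Hard decision: the frame belongs to the heaviest character.
        best = max(range(len(mass)), key=mass.__getitem__)
        durations[best] += 1
    return durations


# Toy example (made-up weights): two characters transliterated with
# 3 and 4 Latin letters, aligned against 5 speech frames.
toy_alignment = [
    [0.7, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.6, 0.2, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.3, 0.4, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.3, 0.5, 0.1],
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.3, 0.6],
]
toy_durations = character_durations(toy_alignment, [3, 4])
print(toy_durations)
```

In this reading, the soft weights stay available inside each character, while the per-character frame counts play the role of the hard durations that a FastSpeech2-style length regulator would consume.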

Keywords: Naturalness; Computer science; Speech synthesis; Intelligibility (philosophy); Speech recognition; Artificial intelligence; Inference; Autoregressive model; Natural language processing; Mathematics

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence
