Decoupling Pronunciation and Language for End-to-End Code-Switching Automatic Speech Recognition

Shuai Zhang; Jiangyan Yi; Zhengkun Tian; Ye Bai; Jianhua Tao; Zhengqi Wen

doi:10.1109/icassp39728.2021.9414428

JOURNAL ARTICLE

Year: 2021 Pages: 6249-6253

Abstract

Despite recent significant advances in end-to-end (E2E) automatic speech recognition (ASR) systems for code-switching, the need for large amounts of paired audio-text data limits further improvement in model performance. In this paper, we propose a decoupled transformer model that uses monolingual paired data and unpaired text data to alleviate the shortage of code-switching data. The model is decoupled into two parts: an audio-to-phoneme (A2P) network and a phoneme-to-text (P2T) network. The A2P network learns acoustic patterns from large-scale monolingual paired data. During training, it also generates multiple phoneme sequence candidates for each audio input in real time. The generated phoneme-text pairs are then used to train the P2T network, which can be pre-trained on large amounts of external unpaired text data. By using monolingual data and unpaired text data, the decoupled transformer model reduces, to a certain extent, the E2E model's heavy dependence on paired code-switching training data. Finally, the two networks are optimized jointly through attention fusion. We evaluate the proposed method on a public Mandarin-English code-switching dataset. Compared with our transformer baseline, the proposed method achieves an 18.14% relative reduction in mix error rate.
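The abstract reports its result as a relative reduction in mix error rate but does not define the metric. The sketch below shows one common way such a score is computed for Mandarin-English speech, assuming that Mandarin is scored per character and English per word. The tokenization rule, the function names, and the numbers in the usage note are illustrative assumptions, not details taken from the paper.

```python
import re

# Assumption: each CJK character is one token, each English word is one token.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize(text):
    """Split a mixed Mandarin-English string into scoring tokens."""
    return _TOKEN.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mixed_error_rate(refs, hyps):
    """Total edit errors over total reference tokens, as a percentage."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = tokenize(ref)
        errors += edit_distance(ref_tokens, tokenize(hyp))
        total += len(ref_tokens)
    return 100.0 * errors / total


def relative_reduction(baseline, proposed):
    """Relative error rate reduction of `proposed` over `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline
```

As a usage illustration with made-up values, `relative_reduction(20.0, 15.0)` gives 25.0: a drop from a 20% to a 15% error rate is a 25% relative reduction. The 18.14% figure in the abstract is a ratio of this kind between the transformer baseline and the proposed model; the absolute error rates are not given in this record.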

Keywords:

End-to-end principle; Computer science; Pronunciation; Decoupling (probability); Speech recognition; Code-switching; Artificial intelligence; Linguistics; Engineering

Metrics

FWCI (Field Weighted Citation Impact): 0.71

Citation Normalized Percentile: 0.75

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Phonetics and Phonology Research

Social Sciences → Psychology → Experimental and Cognitive Psychology

Related Documents

Reducing Multilingual Context Confusion for End-to-End Code-Switching Automatic Speech Recognition

Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching

Improving Transformer Based End-to-End Code-Switching Speech Recognition Using Language Identification

Text-Derived Language Identity Incorporation for End-to-End Code-Switching Speech Recognition

Code-Switching without Switching: Language Agnostic End-to-End Speech Translation