Phoneme Recognition in Korean Singing Voices Using Self-Supervised English Speech Representations

Wenqin Wu; Joonwhoan Lee

doi:10.3390/app14188532

JOURNAL ARTICLE

Year: 2024; Journal: Applied Sciences; Vol: 14 (18); Pages: 8532; Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Large labeled datasets for deep learning-based phoneme recognition in singing voices are difficult to obtain. Singing voices also pose inherent challenges compared to speech because of their distinct variations in pitch, duration, and intensity. This paper proposes a detour method to work around the data scarcity and applies it to the recognition of Korean phonemes in singing voices. The method starts by pre-training HuBERT, a self-supervised speech representation model, on a large-scale English corpus. The model is then adapted to the Korean speech domain with a relatively small Korean corpus, in which Korean phonemes are mapped to similar English ones. Finally, the speech-adapted model is trained again on a very small Korean singing voice corpus for speech–singing adaptation. This final adaptation uses melodic supervision, which exploits pitch information to improve performance. Performance was evaluated with multi-level error rates based on the Word Error Rate (WER). HuBERT-based transfer learning improved the phoneme-level error rate on Korean speech by as much as 31.19%. On singing voices, melodic supervision improved the rate by 0.55%. The significant improvement in speech recognition underscores the considerable potential of a model equipped with general human voice representations learned from the English corpus, which can improve phoneme recognition when little target speech data is available. Moreover, the musical variation in singing voices is beneficial for phoneme recognition in singing voices. The proposed method could be applied to phoneme recognition in other languages that have few speech and singing voice corpora.
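The abstract reports error rates "based on Word Error Rate (WER)" but does not spell out the computation. As a minimal sketch, assuming the conventional definition (edit distance between reference and hypothesis, divided by the reference length), the same formula can be applied at the phoneme level by treating each phoneme as a token. The function names and the romanized phoneme symbols below are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis needs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by reference length (WER-style metric)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical phoneme sequences (illustrative symbols, not the paper's label set).
reference = ["k", "a", "m", "s", "a"]
hypothesis = ["k", "a", "n", "s"]
print(error_rate(reference, hypothesis))  # one substitution + one deletion over 5 tokens
```

Passing lists of words, syllables, or phonemes to the same function gives error rates at different levels, which is one plausible reading of "multi-level"; the paper's exact levels and tokenization are not stated in the abstract.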

Keywords: Singing; Speech recognition; Linguistics; Psychology; Computer science; Communication; Acoustics

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Related Documents

Phoneme Segmentation Using Self-Supervised Speech Models

Evaluating Self-Supervised Speech Representations for Speech Emotion Recognition

Measuring Phoneme-Level Pronunciation Deviations in Japanese Learners of English Using Self-Supervised Speech Representations

Transcription-Guided and Self-Supervised Speech Representations for Singing Voice Conversion