JOURNAL ARTICLE

Phoneme Recognition in Korean Singing Voices Using Self-Supervised English Speech Representations

Wenqin Wu, Joonwhoan Lee

Year: 2024 Journal: Applied Sciences Vol: 14 (18) Pages: 8532-8532 Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Obtaining a large labeled dataset for deep learning-based phoneme recognition in singing voices is generally difficult. Singing voices also pose inherent challenges compared to speech because of their distinct variations in pitch, duration, and intensity. This paper proposes a detour method to overcome the shortage of data and applies it to the recognition of Korean phonemes in singing voices. The method starts by pre-training HuBERT, a self-supervised speech representation model, on a large-scale English corpus. The model is then adapted to the Korean speech domain with a relatively small Korean corpus, in which the Korean phonemes are interpreted as similar English ones. Finally, the speech-adapted model is trained again on a very small Korean singing voice corpus for speech–singing adaptation. In this final adaptation, melodic supervision, which uses pitch information, is applied to improve performance. Evaluation uses multi-level error rates based on the Word Error Rate (WER). HuBERT-based transfer learning for adaptation improved the phoneme-level error rate on Korean speech by as much as 31.19%. On singing voices, melodic supervision improved the rate by 0.55%. The significant improvement in speech recognition underscores the considerable potential of a model equipped with general human voice representations learned from the English corpus to improve phoneme recognition when little target-language speech data is available. Moreover, the musical variation in singing voices is beneficial for phoneme recognition in singing voices. The proposed method could be applied to phoneme recognition in other languages that have few speech and singing voice corpora.
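The abstract reports error rates "based on Word Error Rate (WER)" at several levels, including the phoneme level. As a rough illustration only, the sketch below computes a WER-style rate as the Levenshtein edit distance between a reference and a hypothesis token sequence, divided by the reference length. The function names and the romanized phoneme labels are hypothetical and are not taken from the paper; the authors' exact scoring procedure is not described in this record.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions, and insertions
    needed to turn `ref` into `hyp` (Levenshtein distance)."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """WER-style error rate: edit distance divided by reference length.
    The tokens may be words, syllables, or phonemes."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical phoneme labels, for illustration only.
reference = ["s", "a", "r", "a", "ng"]
hypothesis = ["s", "a", "l", "a"]
print(error_rate(reference, hypothesis))  # one substitution + one deletion over 5 tokens
```

Because the same function accepts any token sequence, passing word lists gives a word-level rate and passing phoneme lists gives a phoneme-level rate, which is one plausible reading of "multi-level" here.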

Keywords:
Singing; Speech recognition; Linguistics; Psychology; Computer science; Communication; Acoustics

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 33
Citation Normalized Percentile: 0.15

Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing

Related Documents

JOURNAL ARTICLE

Phoneme Segmentation Using Self-Supervised Speech Models

Luke Strgar, David Harwath

Journal: 2022 IEEE Spoken Language Technology Workshop (SLT) Year: 2023 Pages: 1067-1073

DISSERTATION

Phoneme segmentation using self-supervised speech models

Strgar, Luke Vincent

University: Texas Digital Library (University of Texas) Year: 2023

JOURNAL ARTICLE

Evaluating Self-Supervised Speech Representations for Speech Emotion Recognition

Bagus Tris Atmaja, Akira Sasou

Journal: IEEE Access Year: 2022 Vol: 10 Pages: 124396-124407

JOURNAL ARTICLE

Transcription-Guided and Self-Supervised Speech Representations for Singing Voice Conversion

Lorenzo, Betty Cortiñas

Journal: Zenodo (CERN European Organization for Nuclear Research) Year: 2022