Lakshani Nissanka, Banuka Athuraliya, Sahan Priyanayana
Accurate pronunciation is essential for successful communication in a second language (L2), as it significantly influences communicative effectiveness and perceived fluency. Mispronunciations frequently arise from the influence of the learner’s first language (L1), posing barriers to effective spoken communication. Pronunciation error detection (PED) has therefore emerged as a critical research area within Computer-Assisted Language Learning (CALL) and Computer-Assisted Pronunciation Training (CAPT). Although numerous PED systems have been developed over recent decades, existing survey papers have mainly compared modeling methodologies or learning paradigms, often neglecting the critical role of feature representation. To address this gap, this survey introduces a feature-based taxonomy that categorizes PED methodologies into four primary groups: Acoustic-based, Acoustic-Phonetic, Linguistic-based, and Hybrid approaches. Each category is systematically reviewed, summarizing over two decades of research with respect to feature extraction techniques, modeling approaches, evaluation metrics, and the nature and quality of the instructional feedback provided to learners. A detailed comparative analysis highlights significant trade-offs among these categories in terms of detection accuracy, interpretability, resource demands, and applicability in real-time or low-resource contexts. The survey also discusses recent and emerging trends in PED research, including self-supervised learning frameworks, multimodal feature fusion, and the integration of phonological knowledge with modern deep learning architectures. By synthesizing existing knowledge and identifying gaps in current methodologies, this paper aims to provide clear insights and directions for future advancements in PED systems.
Priyanka Chhabra, Shailja Chhillar, Riya Tanwar, Muskan Verma, Gaurav Indra