The field of natural language processing (NLP) has rapidly progressed, particularly with the rise of large language models (LLMs), which enhance our understanding of the intrinsic structures of languages in a cross-linguistic manner for complex NLP tasks. However, commonly encountered misspellings in human-written texts adversely affect language understanding for LLMs for various NLP tasks as well as misspelling applications such as auto-proofreading and chatbots. Therefore, this study focuses on the task of spelling correction in the agglutinative language Turkish, where its nature makes spell correction significantly more challenging. To address this, the research introduces a novel dataset, referred to as NoisyWikiTr, to explore encoder-only models based on bidirectional encoder representations from transformers (BERT) and existing auto-correction tools. For the first time in this study, as far as is known, encoder-only models based on BERT are presented as subword prediction models, and encoder-decoder models based on text-cleaning (Text-to-Text Transfer Transformer) architecture are fine-tuned for this task in Turkish. A comprehensive comparison of these models highlights the advantages of context-based approaches over traditional, context-free auto-correction tools. The findings also reveal that among LLMs, a language-specific sequence-to-sequence model outperforms both cross-lingual sequence-to-sequence models and encoder-only models in handling realistic misspellings.
Doga OytacAhmet Emre ErgünTuğba ÇeliktenAytuğ Onan
Jung H. YaeNolan C. SkellyNeil RanlyPhillip M. LaCasse
Saida LaaroussiSi Lhoussain AouraghAbdellah YousfiMohammed NejjaHicham GeddahSaïd Ouatik El Alaoui
Moath R. KhaleelGheith A. Abandah