Leveraging large language models for spelling correction in Turkish

Ceren Güzel Turhan

doi:10.7717/peerj-cs.2889

JOURNAL ARTICLE

Year: 2025 Journal: PeerJ Computer Science Vol: 11 Pages: e2889-e2889 Publisher: PeerJ, Inc.

DOI: 10.7717/peerj-cs.2889

Abstract

Natural language processing (NLP) has progressed rapidly, particularly with the rise of large language models (LLMs), which enhance our understanding of the intrinsic structures of languages in a cross-linguistic manner for complex NLP tasks. However, the misspellings commonly found in human-written text degrade the language understanding of LLMs across NLP tasks and in spelling-sensitive applications such as auto-proofreading and chatbots. This study therefore focuses on spelling correction in Turkish, an agglutinative language whose nature makes the task significantly more challenging. To address this, the research introduces a novel dataset, NoisyWikiTr, to explore encoder-only models based on bidirectional encoder representations from transformers (BERT) alongside existing auto-correction tools. To the best of the author's knowledge, this is the first study in which BERT-based encoder-only models are presented as subword prediction models and encoder-decoder models based on the T5 (Text-to-Text Transfer Transformer) architecture are fine-tuned for this task in Turkish. A comprehensive comparison of these models highlights the advantages of context-based approaches over traditional, context-free auto-correction tools. The findings also show that, among the LLMs, a language-specific sequence-to-sequence model outperforms both cross-lingual sequence-to-sequence models and encoder-only models in handling realistic misspellings.
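To make the contrast with context-free tools concrete, the sketch below shows a minimal dictionary-based corrector that picks the closest vocabulary entry by edit distance. It is an illustration only, not the method or any tool evaluated in the article: the function names and the toy Turkish vocabulary are assumptions introduced here. Because each word is corrected in isolation, the surrounding sentence never influences the result, which is the limitation that context-based models are meant to address.

```python
from __future__ import annotations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def correct_word(word: str, vocabulary: list[str]) -> str:
    """Return the closest vocabulary entry; ties go to the earliest entry."""
    if word in vocabulary:
        return word
    return min(vocabulary, key=lambda v: edit_distance(word, v))


def correct_sentence(sentence: str, vocabulary: list[str]) -> str:
    """Correct each whitespace-separated token independently (no context)."""
    return " ".join(correct_word(w, vocabulary) for w in sentence.split())


# Hypothetical toy vocabulary: suffixed forms of "kitap" (book) and "ev" (home).
VOCAB = ["kitap", "kitaplar", "kitaplarımız", "evde", "okudum"]

if __name__ == "__main__":
    print(correct_sentence("kitaplarımz evd okudum", VOCAB))
```

In an agglutinative language, a single stem can surface in a very large number of suffixed forms, so a fixed word list of this kind covers only a small part of the valid vocabulary.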

Keywords:

Computer science, Agglutinative language, Natural language processing, Language model, Turkish, Spelling, Transformer, Artificial intelligence, Encoder, Speech recognition, Linguistics, Parsing

Metrics

FWCI (Field Weighted Citation Impact): 4.82

Citation Normalized Percentile: 0.93

Topics

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Text Readability and Simplification

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Contextual Spelling Correction with Large Language Models

EmbedTurk: Leveraging Large Language Models as Text Encoders for Turkish Language

Correction: Leveraging large language models for word sense disambiguation

New Language Models for Spelling Correction

Efficient Stochastic Error Injection for Optimizing Large Language Models in Arabic Spelling Correction