JOURNAL ARTICLE

Hate Speech Detection for Transliterated English and Sinhala Code-Mixed Data

Abstract

The rise of online hate speech has highlighted the need for robust detection methods, particularly in linguistically diverse settings. This study focuses on transliterated English and Sinhala code-mixed data collected from Facebook and carefully annotated to maintain dataset accuracy. The data underwent preprocessing and feature extraction, with Term Frequency-Inverse Document Frequency (TF-IDF) and fastText word embeddings applied to capture the complexities of the code-mixed language. The models evaluated ranged from conventional classifiers such as Logistic Regression and Support Vector Machine (SVM) to neural architectures such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM). The study also included gradient-boosting frameworks and transformer-based models such as BERT and GPT-2. Model effectiveness was assessed using accuracy, precision, recall, F1-score, confusion matrices, and the area under the Receiver Operating Characteristic curve (ROC AUC). BERT stood out, achieving 82% accuracy and a ROC AUC of 90%, proving highly effective for this detection task. These results pave the way for future research in diverse linguistic settings and online environments.
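For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how accuracy, precision, recall, F1-score, and ROC AUC can be computed for a binary hate/not-hate classifier. It is not the authors' code: the labels and scores are invented toy values, the function names are hypothetical, and the 0.5 decision threshold is an assumption made only for illustration.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 from binary labels (1 = hate speech)."""
    # Confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration; not taken from the study).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(classification_metrics(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Accuracy and the other threshold-based metrics depend on the chosen cut-off, whereas ROC AUC summarizes ranking quality across all thresholds, which is one reason the two figures reported for a model can differ.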

Keywords:
Computer science; Natural language processing; Code (set theory); Speech recognition; Artificial intelligence; Information retrieval; Programming language

Metrics

Cited By: 3
FWCI (Field Weighted Citation Impact): 1.92
Refs: 24
Citation Normalized Percentile: 0.81

Topics

Hate Speech and Cyberbullying Detection
Physical Sciences →  Computer Science →  Artificial Intelligence
Swearing, Euphemism, Multilingualism
Social Sciences →  Social Sciences →  Communication