The rise of online hate speech has highlighted the need for sophisticated detection methods, particularly in linguistically diverse settings. This study focuses on transliterated English and Sinhala code-mixed data collected from Facebook and carefully annotated to ensure dataset accuracy. The data underwent preprocessing and feature extraction, with techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) and fastText word embeddings applied to capture the complexities of the code-mixed language. A range of models was evaluated, from conventional classifiers such as Logistic Regression and the Support Vector Machine (SVM) to advanced architectures such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM). The study also incorporated gradient-boosting frameworks and transformer-based models, namely BERT and GPT-2. The effectiveness of these models was rigorously assessed using accuracy, precision, recall, F1-score, confusion matrices, and the area under the Receiver Operating Characteristic curve (ROC AUC). BERT stood out, achieving 82% accuracy and a 90% ROC AUC, proving highly effective for this detection task and paving the way for future research in diverse linguistic settings and online environments.
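The TF-IDF weighting mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, not the study's implementation: the toy code-mixed sentences and the `tfidf` helper below are hypothetical, and a production pipeline would use a library vectorizer with sublinear scaling and smoothing.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Compute TF-IDF vectors for a tokenised corpus (list of token lists).

    TF  = count of term in document / total terms in document
    IDF = log(N / number of documents containing the term)
    """
    n_docs = len(corpus)
    df = Counter()                      # document frequency of each term
    for doc in corpus:
        df.update(set(doc))
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Toy transliterated Sinhala-English examples (hypothetical, for illustration)
docs = [
    "mama heta enawa see you soon".split(),
    "see you tomorrow machan".split(),
    "mama gedara yanawa".split(),
]
vecs = tfidf(docs)
```

Terms that occur in every document receive an IDF of zero, while rarer code-mixed tokens are weighted up, which is why this representation can help a downstream classifier separate the classes.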
Oshadhi Liyanage, Krishnakripa Jayakumar
Ian Smith, Uthayasanker Thayasivam
K. Sreelakshmi, B. Premjith, K. P. Soman
Kavishka Gamage, Viraj Welgama, Ruvan Weerasinghe