The rise of online hate speech has highlighted the need for sophisticated detection methods, particularly in linguistically diverse settings. This study focuses on transliterated English and Sinhala code-mixed data collected from Facebook and carefully annotated to ensure dataset accuracy. The data underwent preprocessing and feature extraction, with techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) and fastText word embeddings applied to capture the complexities of the code-mixed language. A range of models was evaluated, from conventional classifiers such as Logistic Regression and the Support Vector Machine (SVM) to advanced architectures such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM). The study also incorporated gradient-boosting frameworks and transformer-based models, namely BERT and GPT-2. The effectiveness of these models was rigorously assessed using accuracy, precision, recall, F1-score, confusion matrices, and the area under the Receiver Operating Characteristic curve (ROC AUC). BERT stood out, achieving 82% accuracy and a 90% ROC AUC, proving highly effective for this detection task and paving the way for future research in diverse linguistic settings and online environments.
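The TF-IDF weighting mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, not the study's implementation: the toy code-mixed sentences and the `tfidf` helper below are hypothetical, and a production pipeline would use a library vectorizer with sublinear scaling and smoothing.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Compute TF-IDF vectors for a tokenised corpus (list of token lists).

    TF  = count of term in document / total terms in document
    IDF = log(N / number of documents containing the term)
    """
    n_docs = len(corpus)
    df = Counter()                      # document frequency of each term
    for doc in corpus:
        df.update(set(doc))
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Toy transliterated Sinhala-English examples (hypothetical, for illustration)
docs = [
    "mama heta enawa see you soon".split(),
    "see you tomorrow machan".split(),
    "mama gedara yanawa".split(),
]
vecs = tfidf(docs)
```

Terms that occur in every document receive an IDF of zero, while rarer code-mixed tokens are weighted up, which is why this representation can help a downstream classifier separate the classes.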
Oshadhi Liyanage, Krishnakripa Jayakumar
Ian Smith, Uthayasanker Thayasivam
K. Sreelakshmi, B. Premjith, K. P. Soman
Kavishka Gamage, Viraj Welgama, Ruvan Weerasinghe