Large, accurately labeled textual corpora are vital to developing efficient hate speech classifiers. This paper introduces an ensemble-based semi-supervised learning approach that exploits the abundance of social media content. Starting with a reliable hate speech dataset, we train and test diverse classifiers, which are then used to label a corpus of one million tweets. Next, we investigate several strategies for selecting the most confident of the resulting pseudo-labels. We assess these strategies by re-training all the classifiers on the seed dataset augmented with the trusted pseudo-labeled data. Finally, we demonstrate that our approach improves classification performance over supervised hate speech classification methods.
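The selection step described above can be illustrated with a minimal sketch. The abstract does not name the selection strategies that were investigated, so the two criteria used here (unanimous agreement among the classifiers and a threshold on the averaged probability) are assumptions chosen for illustration, as are the function name `select_pseudo_labels`, its parameters, and the toy probabilities.

```python
from statistics import mean


def select_pseudo_labels(ensemble_probs, threshold=0.9, require_unanimous=True):
    """Pick trusted pseudo-labels from ensemble outputs (hypothetical helper).

    ensemble_probs: one list per unlabeled example, holding each classifier's
    estimated probability that the example is hate speech.
    Returns a list of (example_index, label) pairs, with label 1 = hate.
    """
    trusted = []
    for index, probs in enumerate(ensemble_probs):
        # Hard vote of each classifier at the usual 0.5 cut-off.
        votes = [int(p >= 0.5) for p in probs]
        unanimous = len(set(votes)) == 1

        # Ensemble label and its confidence, from the averaged probability.
        avg = mean(probs)
        label = int(avg >= 0.5)
        confidence = avg if label == 1 else 1.0 - avg

        # Assumed criteria: confident enough, and (optionally) all agree.
        if confidence >= threshold and (unanimous or not require_unanimous):
            trusted.append((index, label))
    return trusted


# Toy outputs of three classifiers on four unlabeled tweets (made-up numbers).
probs = [
    [0.97, 0.97, 0.97],  # all agree, confident: hate
    [0.02, 0.05, 0.01],  # all agree, confident: not hate
    [0.90, 0.40, 0.70],  # classifiers disagree
    [0.60, 0.55, 0.65],  # agree, but with low confidence
]
tweets = ["tweet a", "tweet b", "tweet c", "tweet d"]  # placeholder texts

strict = select_pseudo_labels(probs)
relaxed = select_pseudo_labels(probs, threshold=0.65, require_unanimous=False)

# Augment a (placeholder) seed set with the trusted pseudo-labeled tweets;
# the classifiers would then be re-trained on `augmented`.
seed = [("seed tweet", 0)]
augmented = seed + [(tweets[i], label) for i, label in strict]
```

Under these assumptions the strict setting keeps only the first two tweets, while the relaxed setting also admits the third. Comparing classifiers re-trained on each augmented set is one way the trade-off between pseudo-label quantity and quality could be assessed.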
Cendra Devayana Putra, Hei‐Chia Wang
Rahul Kumar, Vasu Gupta, Vibhu Sehra, Yashaswi Raj Vardhan
Weichao Liu, Pengyu Wang, Youpeng You
Ahmed Cherif Mazari, Nesrine Boudoukhani, Abdelhamid Djeffal