A text classification model trained on data whose class variable is skewed toward a majority class typically assigns most documents to that class, because doing so raises overall classification accuracy. This is known as the class imbalance problem. This study proposes a feature selection method based on a simplified chi-square statistic that selects features separately for each class, with the aim of building a model that is robust to this problem. The proposed method is compared with typical feature selection methods on the Reuters-21578 dataset. The experiments show that, with naïve Bayes and support vector machine classifiers, the proposed method outperforms typical feature selection methods in terms of robustness to the class imbalance problem.
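The idea of selecting features per class can be illustrated with a minimal sketch. The abstract does not give the simplified chi-square formula, so the code below assumes the standard 2x2 chi-square statistic for a term and a class, and keeps only terms that are positively associated with the class. The function names `chi_square` and `select_features_per_class`, the parameter `k`, and the toy documents are hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Standard 2x2 chi-square statistic (an assumption, not the paper's
    simplified form).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features_per_class(docs, labels, k):
    """Return {class: up to k terms}, ranked by chi-square within each class.

    Ranking inside each class gives a minority class its own feature
    budget instead of letting majority-class terms fill a global top-k list.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    term_df = Counter()                   # document frequency per term
    term_class_df = defaultdict(Counter)  # document frequency per class and term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_df[term] += 1
            term_class_df[label][term] += 1

    selected = {}
    for cls, size in class_sizes.items():
        scores = {}
        for term, df in term_df.items():
            a = term_class_df[cls][term]
            b = df - a
            c = size - a
            d = n_docs - size - b
            # Keep only terms that occur more often inside the class
            # than independence would predict.
            if a * d > b * c:
                scores[term] = chi_square(a, b, c, d)
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected[cls] = ranked[:k]
    return selected


# Toy imbalanced corpus: four "earn" documents and one "grain" document.
docs = [
    ["stock", "market", "profit"],
    ["stock", "trade", "market"],
    ["market", "profit", "trade"],
    ["stock", "profit"],
    ["wheat", "harvest", "market"],
]
labels = ["earn", "earn", "earn", "earn", "grain"]

features = select_features_per_class(docs, labels, k=2)
vocabulary = sorted({term for terms in features.values() for term in terms})
```

In this toy corpus the single "grain" document still contributes its own terms to the final vocabulary, which is the behaviour a per-class selection scheme is meant to guarantee. The resulting vocabulary could then be used to build feature vectors for a naïve Bayes or support vector machine classifier.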
Jieming Yang, Zhaoyang Qu, Zhiying Liu
Surani Matharaarachchi, Michael Domaratzki, Saman Muthukumarana
Małgorzata Bach, Aleksandra Werner
Mohd Shamrie Sainin, Rayner Alfred, Faudziah Ahmad