ZHANG Huiyi, XIE Yeming, YUAN Zhixiang, SUN Guohua
The traditional CHI-square feature selection method does not take into account the category number of words in imbalanced datasets, the frequency of words, or the intra-class and inter-class distribution of words, so it fails to choose valid feature words for different categories. To solve this problem, a probability-based CHI-square feature selection method is proposed. It measures the frequency of words and documents by their probabilities, and calculates the frequency factor of categories, the inter-class concentration factor of words, the intra-class equilibrium degree factor of words, and the inter-class concentration factor of documents. The initial CHI-square value is adjusted by these factors. The difference degree factor of the same word across different classes is then used so that the improved CHI-square selects more effective words. Text classification experiments show that, compared with the unimproved CHI-square feature selection method, the proposed method significantly improves macro-averaged F1 and achieves better classification performance on imbalanced datasets.
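For orientation, the baseline that the abstract sets out to improve can be sketched as follows. This is the classic CHI-square score of a word for a category, computed from a 2x2 table of document counts. It is not the proposed probability-based method, whose adjustment factors are not given as formulas in the abstract. The function name `chi_square_score` and its parameter names are hypothetical and chosen only for illustration.

```python
def chi_square_score(docs_in_cat_with_word: int,
                     docs_out_cat_with_word: int,
                     docs_in_cat_without_word: int,
                     docs_out_cat_without_word: int) -> float:
    """Classic CHI-square score of a word for one category.

    The four arguments are document counts forming a 2x2 contingency table:
    (word present / absent) x (document in category / not in category).
    """
    a = docs_in_cat_with_word
    b = docs_out_cat_with_word
    c = docs_in_cat_without_word
    d = docs_out_cat_without_word
    n = a + b + c + d  # total number of documents

    # Product of the row and column totals of the contingency table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence of association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# A word that occurs in every document of the category and nowhere else.
perfect = chi_square_score(10, 0, 0, 10)

# A word that is spread evenly and carries no category information.
uninformative = chi_square_score(5, 5, 5, 5)
```

Because the score uses only document counts, a word that appears once in a document counts the same as one that appears many times, and the squared difference does not distinguish positive from negative association. This is consistent with the shortcomings the abstract lists for the traditional method.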
Yujia Zhai, Wei Song, Liu Xian-jun, Lizhen Liu, Xinlei Zhao
Emad Mohamed Mashhour, Enas M. F. El Houby, Khaled Wassif, Akram Ibrahim Salah
Jun Ge, Zhenxing Zhang, Lumin Zhou, Wei Zheng, Yilei Wang