The feature selection algorithm strongly influences the accuracy of text categorization. The traditional information gain (IG) feature selection algorithm tends to select features that rarely appear in the specified category but frequently appear in other categories. To overcome this drawback, an improved IG feature selection method is proposed, based on an in-depth analysis of related algorithms. First, features are selected separately for each category of the data set, and the features from the different categories are merged by an optimized method. Then, the IG weight is calculated using the probability of occurrence of these features. Finally, a between-class concentration distribution factor and a within-class word frequency dispersion distribution factor are introduced. An SVM classifier is used to verify the algorithm. The results show that the improved method performs better than the original IG and two other improved methods.
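For orientation, the baseline that the method modifies can be sketched as follows. This is a minimal illustration of the standard IG score for a binary term-presence feature, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t). It is not the improved method, whose merging step and two distribution factors are not specified in this abstract. The function names, the representation of documents as token sets, and the toy corpus are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Standard IG of a term, treating its presence in a document as a binary feature.

    docs   -- list of token sets, one per document (assumed representation)
    labels -- list of category labels, aligned with docs
    """
    n = len(docs)
    classes = sorted(set(labels))
    all_counts = Counter(labels)
    # Class counts among the documents that contain the term.
    with_counts = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    n_with = sum(with_counts.values())
    n_without = n - n_with

    h_c = entropy([all_counts[c] for c in classes])
    h_with = entropy([with_counts[c] for c in classes])
    h_without = entropy([all_counts[c] - with_counts[c] for c in classes])
    return h_c - (n_with / n) * h_with - (n_without / n) * h_without


# Hypothetical toy corpus: two categories, two documents each.
docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"stock", "market"},
    {"market", "bank"},
]
labels = ["sport", "sport", "finance", "finance"]

for term in ("goal", "match", "bank"):
    print(term, round(information_gain(docs, labels, term), 4))
```

Because this score treats the presence and the absence of a term symmetrically, a term that is absent from one category and common in the others can still score highly, which is the behavior the abstract identifies as a drawback.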
Hong Zhang, Yonggong Ren, Xue Yang
Li Zhong, Yang Jing, Lijing Yao, Binbin Gan