The widespread use of social media and the internet offers companies an additional channel for understanding customer sentiment about their brands and products. Sentiment analysis draws on text data from social media, such as customer comments and reviews, which is inherently high-dimensional. Without feature selection, thousands of features (words or phrases) can typically be extracted from a text corpus, many of which are redundant or irrelevant to the sentiment classification task. It is therefore critical to select a compact yet effective set of features, both to avoid complex classifier design and to keep classification fast. However, very few existing metrics improve the efficacy of feature selection by addressing the sparsity of the feature matrix for text data, i.e., the fact that many features appear in only a few documents. In this paper, an improved feature selection metric known as sparsity adjusted information gain (SAIG) is proposed, which modifies the conventional information gain metric to adjust feature ranking scores according to the sparsity of each feature vector. SAIG can reach a targeted performance level with fewer features. Experimental results show that SAIG improves the performance of sentiment classification.
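To make the idea of ranking features by information gain while accounting for sparsity concrete, the following is a minimal sketch in Python. The abstract does not give the SAIG formula, so only `information_gain` reflects the conventional metric; `sparsity` and `adjusted_score` are hypothetical helpers, and the linear coverage penalty in `adjusted_score` is an assumption chosen purely for illustration, not the adjustment proposed in the paper.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(present, labels):
    """Conventional information gain of a binary feature: H(C) - H(C | feature).

    present[i] is truthy if the feature occurs in document i;
    labels[i] is the sentiment class of document i.
    """
    classes = sorted(set(labels))
    n = len(labels)

    def class_counts(rows):
        return [sum(1 for i in rows if labels[i] == c) for c in classes]

    with_f = [i for i in range(n) if present[i]]
    without_f = [i for i in range(n) if not present[i]]
    h_all = entropy(class_counts(range(n)))
    h_cond = sum(
        len(part) / n * entropy(class_counts(part))
        for part in (with_f, without_f)
    )
    return h_all - h_cond


def sparsity(present):
    """Fraction of documents in which the feature is absent."""
    return 1 - sum(1 for p in present if p) / len(present)


def adjusted_score(present, labels):
    """HYPOTHETICAL sparsity adjustment: scale IG by document coverage.

    This is an illustrative placeholder, not the SAIG formula.
    """
    return information_gain(present, labels) * (1 - sparsity(present))


# Toy corpus: six reviews, three positive and three negative.
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
features = {
    "great": [1, 1, 1, 0, 0, 0],   # occurs in every positive review
    "superb": [1, 0, 0, 0, 0, 0],  # occurs in a single review (sparse)
}
ranked = sorted(features, key=lambda f: adjusted_score(features[f], labels),
                reverse=True)
print(ranked)
```

In this toy corpus the sparse feature `superb` is perfectly predictive in the one document where it occurs, yet it ranks below `great`, which covers half of the corpus. Any real adjustment would need to follow the definition given in the body of the paper.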