Sparsity adjusted information gain for feature selection in sentiment analysis

Betty Ong; so-jin Goh; Chi Xu

doi:10.1109/bigdata.2015.7363995

JOURNAL ARTICLE

Year: 2015 Vol: 3 Pages: 2122-2128

Abstract

The widespread use of social media and the internet is an emerging trend that offers companies an additional interaction channel for better understanding customer sentiment about their brands and products. Sentiment analysis uses text data from social media, such as customer comments and reviews, which is inherently high-dimensional. Without selection, at least thousands of features (words or phrases) can typically be extracted from a text corpus, many of which are redundant or irrelevant for the sentiment classification task. Thus, it is critical to select a compact yet effective set of features to avoid complex classifier design and slow classification running time. However, very few existing metrics improve the efficacy of feature selection by addressing the sparsity of the feature matrix for text data, i.e., the fact that many features may appear in only a few documents. In this paper, an improved feature selection metric known as sparsity adjusted information gain (SAIG) is proposed, which modifies the conventional information gain metric and adjusts the feature ranking scores according to the sparsity of the feature vector. It is able to use fewer features to reach a targeted performance level. The experimental results show that SAIG is able to improve the performance of sentiment classification.
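To make the two ingredients named in the abstract concrete, the following is a minimal Python sketch of the conventional information gain of a binary term feature, together with a simple sparsity measure (the fraction of documents in which the term is absent). It does not implement SAIG: the abstract does not state how the adjustment is computed, so the two quantities are only reported side by side. The function names, the whitespace tokenizer, and the toy corpus are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(present, labels):
    """Conventional IG of a binary feature: H(C) - H(C | present/absent)."""
    with_term = [c for p, c in zip(present, labels) if p]
    without_term = [c for p, c in zip(present, labels) if not p]
    total = len(labels)
    conditional = ((len(with_term) / total) * entropy(with_term)
                   + (len(without_term) / total) * entropy(without_term))
    return entropy(labels) - conditional


def sparsity(present):
    """Fraction of documents in which the feature does NOT appear."""
    return 1.0 - sum(present) / len(present)


def score_terms(docs):
    """Return {term: (information_gain, sparsity)} for (text, label) pairs."""
    labels = [label for _, label in docs]
    # Assumption: naive lowercase whitespace tokenization, presence only.
    token_sets = [set(text.lower().split()) for text, _ in docs]
    vocabulary = sorted(set().union(*token_sets))
    scores = {}
    for term in vocabulary:
        present = [term in tokens for tokens in token_sets]
        scores[term] = (information_gain(present, labels), sparsity(present))
    return scores


# Hypothetical toy corpus, not data from the paper.
toy_docs = [
    ("great phone love it", "pos"),
    ("love the screen", "pos"),
    ("terrible battery", "neg"),
    ("hate the terrible screen", "neg"),
]

if __name__ == "__main__":
    ranked = sorted(score_terms(toy_docs).items(), key=lambda kv: -kv[1][0])
    for term, (ig, sp) in ranked:
        print(f"{term:10s} IG={ig:.3f} sparsity={sp:.2f}")
```

In this toy corpus, a term such as `great` occurs in only one of the four documents, so its sparsity is 0.75. A sparsity-aware metric in the spirit of SAIG would take that value into account when ranking the term, whereas plain information gain ignores it.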

Keywords:

Computer science; Feature selection; Sentiment analysis; Classifier (UML); Dimensionality reduction; Artificial intelligence; Feature (linguistics); Social media; Curse of dimensionality; Machine learning; Ranking (information retrieval); Data mining; The Internet; World Wide Web

Metrics

FWCI (Field Weighted Citation Impact): 1.57

Citation Normalized Percentile: 0.92

Topics

Sentiment Analysis and Opinion Mining

Physical Sciences → Computer Science → Artificial Intelligence

Text and Document Classification Technologies

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Text Analysis Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Information Gain Based Feature Selection for Improved Textual Sentiment Analysis

Assessment of Sentiment Analysis Using Information Gain Based Feature Selection Approach

Sentiment Analysis using Naive Bayes Classifier and Information Gain Feature Selection over Twitter

On the Feature Selection and Classification Based on Information Gain for Document Sentiment Analysis

Chinese Sentiment Classifier Machine Learning Based on Optimized Information Gain Feature Selection