Ali Mansour, Juman Mohammad, Yury Kravchenko
With the rapid growth in the amount of text data, the need for efficient methods to process and analyze it is also growing. In this context, extracting features from text is a key step in solving many text mining and information retrieval problems. Traditional text feature extraction methods such as TF-IDF and bag-of-words are effective and intuitively interpretable, but they suffer from the "curse of dimensionality" and cannot capture the meanings of words. Modern distributed methods, on the other hand, capture hidden semantics effectively, but they are computationally intensive and hard to interpret. This paper proposes a new concept-mining-based text vectorization method called Bag of weighted Concepts (BoWC), which aims to generate low-dimensional representations with high representational ability. BoWC vectorizes a document according to the concept information it contains: it creates concepts by clustering word vectors, then uses the frequencies of these concept clusters to build the document vector. To enrich the resulting document representation, new concept weighting functions are proposed, based on statistics extracted from word embedding information. This work extends previous research in which the proposed method was developed and tested on a text classification task. Here, the empirical tests are extended to include tuning the parameters of the proposed method and analyzing the effect of each parameter on its performance. The proposed method was tested on two data mining tasks, document clustering and classification, using five benchmark data sets, and was compared with several baselines, including bag-of-words, TF-IDF, averaged word embeddings, Bag-of-Concepts, and VLAC. The proposed method outperforms most baselines, requiring fewer features while achieving higher classification and clustering accuracy.
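The core idea described in the abstract (cluster word vectors into concepts, then represent a document by how often each concept occurs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy 2-D word vectors are invented, the helper names `kmeans`, `build_concepts`, and `vectorize` are hypothetical, plain k-means stands in for whatever clustering the authors use, and the document vector holds plain normalized concept frequencies. The paper's proposed weighting functions are not reproduced here.

```python
from collections import Counter
from math import dist

# Toy 2-D "embeddings", invented for illustration only.
# Real word vectors would come from a trained embedding model.
WORD_VECTORS = {
    "cat": (0.9, 0.1), "dog": (1.0, 0.2), "horse": (0.8, 0.0),
    "car": (0.1, 0.9), "truck": (0.0, 1.0), "bus": (0.2, 0.8),
}


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-first initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )

    def assign():
        # Label each point with the index of its nearest centroid.
        return [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]

    for _ in range(iters):
        labels = assign()
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                # Move the centroid to the mean of its members.
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign()


def build_concepts(word_vectors, k):
    """Map each word to a concept id by clustering the word vectors."""
    words = list(word_vectors)
    labels = kmeans([word_vectors[w] for w in words], k)
    return dict(zip(words, labels))


def vectorize(tokens, word_to_concept, k):
    """Represent a document as normalized concept frequencies (length k)."""
    counts = Counter(word_to_concept[t] for t in tokens if t in word_to_concept)
    total = sum(counts.values())
    return [counts[j] / total if total else 0.0 for j in range(k)]


concepts = build_concepts(WORD_VECTORS, k=2)
doc_vector = vectorize(["cat", "dog", "car", "the"], concepts, k=2)
print(doc_vector)
```

In this toy setup the document vector has one entry per concept rather than one per vocabulary word, which is where the low dimensionality comes from. Out-of-vocabulary tokens such as "the" are simply skipped in this sketch.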
Vitaly I. Yuferev, Nikolai A. Razin
Shady Shehata, Fakhri Karray, Mohamed S. Kamel