JOURNAL ARTICLE

Text Vectorization Method Based on Concept Mining Using Clustering Techniques

Ali Mansour, Juman Mohammad, Yury Kravchenko

Year: 2022 Journal: 2022 VI International Conference on Information Technologies in Engineering Education (Inforino) Pages: 1-10

Abstract

As the volume of text data grows rapidly, so does the need for efficient methods to process and analyze it. In this context, feature extraction from text is an urgent task for solving many text mining and information retrieval problems. Traditional text feature extraction methods such as TF-IDF and bag-of-words are effective and intuitively interpretable, but they suffer from the "curse of dimensionality" and cannot capture the meanings of words. Modern distributed methods, on the other hand, capture hidden semantics effectively, but they are computationally intensive and uninterpretable. This paper proposes a new concept-mining-based text vectorization method called Bag of weighted Concepts (BoWC), which aims to generate low-dimensional representations with high representational ability. BoWC vectorizes a document according to the concept information it contains: it creates concepts by clustering word vectors, then uses the frequencies of these concept clusters to represent document vectors. To enrich the resulting document representation, new weighting functions are proposed for concept weighting based on statistics extracted from word embedding information. This work extends previous research in which the proposed method was developed and tested on a text classification task. Here, the empirical tests are extended to include tuning the parameters of the proposed method and analyzing the effect of each on its efficiency. The proposed method was tested on two data mining tasks, document clustering and classification, with five benchmark data sets, and it was compared with several baselines, including Bag-of-words, TF-IDF, Averaged word embeddings, Bag-of-Concepts, and VLAC. The proposed method outperforms most baselines in both the number of features required and the classification and clustering accuracy achieved.
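The pipeline described in the abstract (cluster word vectors into concepts, then represent a document by its concept frequencies) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D embeddings, the plain Lloyd's k-means, and all function names are assumptions made for illustration. The abstract does not specify the proposed weighting functions, so the sketch substitutes plain normalized concept frequencies.

```python
import random

# Toy 2-D "word embeddings" (hypothetical values; a real pipeline would
# load pretrained vectors instead).
WORD_VECTORS = {
    "cat": (0.9, 0.1), "dog": (0.8, 0.2), "horse": (0.85, 0.15),
    "car": (0.1, 0.9), "bus": (0.2, 0.8), "train": (0.15, 0.85),
}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def build_concepts(word_vectors, k):
    """Map every vocabulary word to the index of its concept cluster."""
    centroids = kmeans(list(word_vectors.values()), k)
    return {w: nearest(v, centroids) for w, v in word_vectors.items()}


def vectorize(tokens, word_to_concept, k):
    """Return a k-dimensional vector of normalized concept frequencies.

    Out-of-vocabulary tokens are ignored. Normalized counts are a
    stand-in for the paper's weighting functions, which the abstract
    does not define.
    """
    counts = [0] * k
    for t in tokens:
        if t in word_to_concept:
            counts[word_to_concept[t]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * k


concepts = build_concepts(WORD_VECTORS, k=2)
doc_vector = vectorize(["cat", "dog", "car", "unknown"], concepts, k=2)
```

With k = 2, the document vector has only two dimensions regardless of vocabulary size, which is the dimensionality reduction the abstract refers to. Which cluster receives which index depends on the k-means initialization, so downstream code should not rely on a particular concept ordering.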

Keywords:
Vectorization (mathematics), Computer science, Cluster analysis, Data mining, Information retrieval, Artificial intelligence, Natural language processing, Parallel computing

Metrics

Cited By: 6
FWCI (Field Weighted Citation Impact): 0.71
Refs: 27
Citation Normalized Percentile: 0.67

Topics

Text and Document Classification Technologies
Physical Sciences → Computer Science → Artificial Intelligence
Advanced Text Analysis Techniques
Physical Sciences → Computer Science → Artificial Intelligence
Web Data Mining and Analysis
Physical Sciences → Computer Science → Information Systems

Related Documents

JOURNAL ARTICLE

Word-embedding Based Text Vectorization Using Clustering

Vitaly I. Yuferev, Nikolai A. Razin

Journal: Modeling and Analysis of Information Systems Year: 2021 Vol: 28 (3) Pages: 292-311

JOURNAL ARTICLE

Enhancing Text Clustering Using Concept-based Mining Model

Shady Shehata, Fakhri Karray, Mohamed S. Kamel

Journal: Proceedings Year: 2006 Pages: 1043-1048

JOURNAL ARTICLE

TEXT CLUSTERING IN CONCEPT BASED MINING

Pradnya Randive, Nitin Pise

Journal: International Journal of Computer and Communication Technology Year: 2016 Pages: 32-34

JOURNAL ARTICLE

Concept Based Mining in Text Clustering

Pradnya Randive, Nitin Pise

Journal: International Journal Of Recent Advances in Engineering & Technology Year: 2020 Vol: 08 (03) Pages: 1-4

BOOK-CHAPTER

Clustering Text: A Comparison Between Available Text Vectorization Techniques

Lovedeep Singh

Advances in intelligent systems and computing Year: 2021 Pages: 21-27