Word-embedding Based Text Vectorization Using Clustering

Vitaly I. Yuferev; Nikolai A. Razin

doi:10.18255/1818-1015-2021-3-292-311

JOURNAL ARTICLE

Year: 2021; Journal: Modeling and Analysis of Information Systems; Vol: 28 (3); Pages: 292-311; Publisher: Yaroslavl State University

Abstract

In natural language processing, representing texts as fixed-length vectors with word-embedding models is known to work well only when the texts are short: the longer the texts being compared, the worse the approach performs. The reason is that information is lost when the vector representations of the words that make up a text are converted into a single vector for the whole text, which usually has the same dimension as the vector of a single word.

This paper proposes an alternative way of using pre-trained word-embedding models for text vectorization. The method merges semantically similar elements of the dictionary of an existing text corpus by clustering their embeddings. The result is a new dictionary, smaller than the original, in which each element corresponds to one cluster. The original corpus is then reformulated in terms of this new dictionary, and the reformulated texts are vectorized with one of the dictionary-based approaches (TF-IDF was used in this work). The resulting vector representation of a text can additionally be enriched with vectors of the words of the original dictionary, obtained by reducing the dimension of their embeddings within each cluster.

The paper describes a series of experiments to determine the optimal parameters of the method and compares the proposed approach, on the text ranking problem, with other text vectorization methods: averaging word embeddings with and without TF-IDF weighting, and vectorization based on TF-IDF coefficients.
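The core pipeline described in the abstract (cluster the word embeddings, rewrite each text as a sequence of cluster IDs, then apply TF-IDF) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the clustering algorithm, so plain k-means is swapped in; the 2-D `EMBEDDINGS` table is a hand-made stand-in for a pre-trained model; the smoothed IDF formula and all function names are hypothetical choices, not the authors' implementation. The optional enrichment step with reduced-dimension embeddings is not shown.

```python
import math
from collections import Counter

# Hypothetical toy embeddings (2-D for readability). A real setup would load
# vectors from a pre-trained word-embedding model instead.
EMBEDDINGS = {
    "cat": (0.9, 0.1), "dog": (0.8, 0.2), "kitten": (0.95, 0.05),
    "car": (0.1, 0.9), "truck": (0.2, 0.8), "bus": (0.15, 0.95),
}

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means (an assumed choice of clustering algorithm)."""
    # Farthest-first initialisation keeps this toy example deterministic.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels

def build_cluster_vocab(embeddings, k):
    """Map each dictionary word to the ID of its embedding cluster."""
    words = list(embeddings)
    labels = kmeans([embeddings[w] for w in words], k)
    return dict(zip(words, labels))

def reformulate(text, word_to_cluster):
    """Rewrite a text as cluster IDs; unknown words are dropped (a simplification)."""
    return [word_to_cluster[t] for t in text.lower().split()
            if t in word_to_cluster]

def tfidf(docs, k):
    """TF-IDF over cluster IDs, using a smoothed IDF (an assumed variant)."""
    n = len(docs)
    df = Counter(c for d in docs for c in set(d))
    idf = [math.log((1 + n) / (1 + df[j])) + 1 for j in range(k)]
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[j] / len(d) * idf[j] if d else 0.0
                        for j in range(k)])
    return vectors

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

word_to_cluster = build_cluster_vocab(EMBEDDINGS, k=2)
texts = ["cat dog", "kitten", "car truck bus"]
docs = [reformulate(t, word_to_cluster) for t in texts]
vectors = tfidf(docs, k=2)

print(word_to_cluster)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy corpus the first two texts share no words, yet after reformulation both consist only of the "animal" cluster, so their vectors point in the same direction, while the third text falls entirely into the other cluster. The vector length equals the number of clusters `k`, which is the size of the new dictionary rather than the embedding dimension.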

Keywords:

Vectorization (mathematics); Word (group theory); Computer science; Word embedding; Cluster analysis; Representation (politics); Dimension (graph theory); Embedding; Artificial intelligence; Natural language processing; Mathematics; Combinatorics

Metrics

FWCI (Field Weighted Citation Impact): 0.24

Citation Normalized Percentile: 0.59

Topics

Information Systems and Technology Applications

Social Sciences → Business, Management and Accounting → Management Information Systems

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Advanced Text Analysis Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Short Text Clustering Based on Word Semantic Graph with Word Embedding Model

Text Vectorization Method Based on Concept Mining Using Clustering Techniques

Word Embedding Interpretation using Co-Clustering

An Approach for Textual Based Clustering Using Word Embedding

Short Text Embedding for Clustering Based on Word and Topic Semantic Information