JOURNAL ARTICLE

Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling

Abstract

Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually, which requires expert knowledge, time and money. Depending on how imbalanced the data set is, this approach either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. The method achieves unsupervised one-class document classification using out-of-domain training data and correctly classifies more than 80% of the target data. It thus outperforms common machine learning classifiers and is validated on multiple data sets.
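
The abstract names the three ingredients but does not say how they are combined. The sketch below is one possible arrangement, assuming scikit-learn, and is not the authors' implementation. The toy documents, the function name `classify`, the parameter values, and in particular the final rule (keep a document only if the one-class SVM accepts it and its dominant LDA topic matches the most common topic among accepted documents) are all assumptions made for illustration. The scraped texts stand in for out-of-domain training data collected from the web.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical stand-in for web-scraped, out-of-domain text on the class of interest.
scraped_docs = [
    "solar panels convert sunlight into electricity",
    "wind turbines generate renewable electricity",
    "renewable energy reduces carbon emissions",
    "solar and wind power feed the electricity grid",
]

# Hypothetical unlabelled target documents (mixed content).
target_docs = [
    "the city installs solar panels on public buildings",
    "new wind farm supplies electricity to the grid",
    "recipe for tomato soup with fresh basil",
    "bake the bread for forty minutes in the oven",
]


def classify(scraped, target, nu=0.2, n_topics=2, seed=0):
    # Step 1: fit a one-class SVM on the scraped documents only (no target labels).
    tfidf = TfidfVectorizer(stop_words="english")
    svm = OneClassSVM(kernel="linear", nu=nu).fit(tfidf.fit_transform(scraped))
    svm_labels = svm.predict(tfidf.transform(target))  # +1 = inlier, -1 = outlier

    # Step 2: fit LDA on the target documents to get per-document topic mixtures.
    counts = CountVectorizer(stop_words="english")
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topics = lda.fit_transform(counts.fit_transform(target))

    # Step 3 (assumed combination rule, not taken from the paper):
    # accept SVM inliers whose dominant topic is the most common one among inliers.
    dominant = doc_topics.argmax(axis=1)
    inliers = svm_labels == 1
    if inliers.any():
        target_topic = np.bincount(dominant[inliers]).argmax()
        accepted = inliers & (dominant == target_topic)
    else:
        accepted = np.zeros(len(target), dtype=bool)
    return svm_labels, doc_topics, accepted


svm_labels, doc_topics, accepted = classify(scraped_docs, target_docs)
for doc, label, keep in zip(target_docs, svm_labels, accepted):
    print(label, keep, doc)
```

On a corpus this small the printed labels are not meaningful. The point is only the data flow: the SVM never sees target labels, and LDA is fitted on the target documents alone.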

Keywords:
Latent Dirichlet allocation, Topic model, Support vector machine, Document classification, Training set, Unsupervised learning, Data modeling, Data integration

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.36

Topics

Text and Document Classification Technologies
Physical Sciences →  Computer Science →  Artificial Intelligence
Imbalanced Data Classification Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Spam and Phishing Detection
Physical Sciences →  Computer Science →  Information Systems

Related Documents

BOOK-CHAPTER

Unsupervised Document Classification and Topic Detection

Jaromír Novotný, Pavel Ircing

Lecture Notes in Computer Science Year: 2017 Pages: 748-756

JOURNAL ARTICLE

One-class SVMs for document classification

Larry M. Manevitz, Malik Yousef

Journal: Journal of Machine Learning Research Year: 2002 Vol: 2 (2) Pages: 139-154

JOURNAL ARTICLE

One-class document classification via Neural Networks

Larry M. Manevitz, Malik Yousef

Journal: Neurocomputing Year: 2006 Vol: 70 (7-9) Pages: 1466-1481