JOURNAL ARTICLE

Learning from Multi-Class Imbalanced Big Data with Apache Spark

Sleeman, William

Year: 2021 Journal: VCU Scholars Compass (Virginia Commonwealth University) Publisher: Virginia Commonwealth University

Abstract

With data becoming a new form of currency, its analysis has become a top priority in both academia and industry, furthering advancements in high-performance computing and machine learning. However, these large, real-world datasets come with additional complications such as noise and class overlap. Problems are magnified when multi-class data is presented, especially since many of the popular algorithms were originally designed for binary data. Another challenge arises when the examples are not evenly distributed across all classes in a dataset. This often causes classifiers to favor the majority class over the minority classes, leading to undesirable results, as learning from the rare cases may be the primary goal. Many of the classic machine learning algorithms were not designed for multi-class, imbalanced data or parallelism, and so their effectiveness has been hindered. This dissertation addresses some of these challenges through in-depth experimentation with novel implementations of machine learning algorithms on Apache Spark, a distributed computing framework based on the MapReduce model and designed to handle very large datasets. Experimentation showed that many of the traditional classifier algorithms do not translate well to a distributed computing environment, indicating the need for a new generation of algorithms targeting modern high-performance computing. A collection of popular oversampling methods, originally designed for small binary-class datasets, has been implemented using Apache Spark for the first time to improve parallelism and add multi-class support. An extensive study on how instance-level difficulty affects learning from large datasets was also performed.

Keywords:
SPARK (programming language); Big data; Oversampling; Class (philosophy); Implementation; Classifier (UML); Binary classification

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.30

Topics

Imbalanced Data Classification Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Data Stream Mining Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Spam and Phishing Detection
Physical Sciences →  Computer Science →  Information Systems

Related Documents

BOOK-CHAPTER

Impact of Imbalanced Data on Apache Spark

Thet Hsu Aung, Aye Myat Myat Paing

Lecture notes in electrical engineering Year: 2025 Pages: 475-485
JOURNAL ARTICLE

Multi-class imbalanced big data classification on Spark

William C. Sleeman, Bartosz Krawczyk

Journal: Knowledge-Based Systems Year: 2020 Vol: 212 Pages: 106598-106598
BOOK-CHAPTER

SCUT-DS: Learning from Multi-class Imbalanced Canadian Weather Data

Olubukola M. Olaitan, Herna L. Viktor

Lecture notes in computer science Year: 2018 Pages: 291-301