JOURNAL ARTICLE

Bagging Using Instance-Level Difficulty for Multi-Class Imbalanced Big Data Classification on Spark

Abstract

Most machine learning methods assume that classes have a roughly balanced number of instances. However, in many real-life problems some types of instances appear far more frequently than others, which biases classifier training towards the majority class. This becomes even more challenging when dealing with multiple classes, where relationships between them are not easily defined. Learning from multi-class imbalanced data has not been widely considered in the context of big data mining, even though this learning difficulty appears frequently in this domain. In this paper, we address this challenge by proposing a comprehensive ensemble-based framework. We propose to analyze each class to extract instance-level characteristics describing the difficulty level of each instance. We embed this information into the existing UnderBagging framework, so that our ensemble samples instances with probabilities proportional to their difficulty levels. This allows us to focus the learning process on the most difficult instances, better capturing the properties of multi-class imbalanced problems. We implemented our framework on Apache Spark to allow for high-performance computing over big data sets. Our experimental study shows that taking instance-level difficulty into account leads to significantly more accurate ensembles.
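
The sampling idea in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not say how difficulty is measured or how the method is distributed on Spark, so everything below is an assumption made for illustration: difficulty is taken to be the fraction of an instance's k nearest neighbours that belong to another class, the helper names `knn_difficulty` and `difficulty_weighted_bag` are hypothetical, and the small `eps` term (which keeps easy instances selectable) is not from the source. Each bag draws as many instances per class as the smallest class contains, with probability proportional to difficulty.

```python
import random
from collections import Counter, defaultdict


def knn_difficulty(X, y, k=5):
    """Assumed difficulty measure: fraction of the k nearest neighbours
    of each instance that carry a different class label."""
    scores = []
    for i, xi in enumerate(X):
        # Brute-force squared Euclidean distance to every other instance.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        scores.append(sum(y[j] != y[i] for j in neighbours) / len(neighbours))
    return scores


def difficulty_weighted_bag(y, difficulty, rng, eps=0.05):
    """Draw one class-balanced bag of instance indices.

    Every class contributes as many instances as the smallest class,
    sampled with replacement with probability proportional to
    (difficulty + eps)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    bag_size = min(len(ids) for ids in by_class.values())
    bag = []
    for ids in by_class.values():
        weights = [difficulty[i] + eps for i in ids]
        bag.extend(rng.choices(ids, weights=weights, k=bag_size))
    return bag


# Toy three-class imbalanced data set (60 / 15 / 5 instances).
rng = random.Random(0)
X = (
    [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
    + [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(15)]
    + [[rng.gauss(-2, 1), rng.gauss(2, 1)] for _ in range(5)]
)
y = [0] * 60 + [1] * 15 + [2] * 5

difficulty = knn_difficulty(X, y, k=5)
bags = [difficulty_weighted_bag(y, difficulty, rng) for _ in range(10)]
class_counts = [Counter(y[i] for i in bag) for bag in bags]
```

In a full ensemble, one base classifier would be trained on each bag and the predictions combined by voting. The brute-force neighbour search here is quadratic in the number of instances, so it only serves to show the sampling logic and would not scale to big data as written.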

Keywords:
Computer science; Machine learning; Artificial intelligence; Big data; Classifier (UML); Ensemble learning; SPARK (programming language); Class (philosophy); Focus (optics); Process (computing); Context (archaeology); Training set; Data mining

Metrics

Cited By: 17
FWCI (Field Weighted Citation Impact): 1.23
Refs: 37
Citation Normalized Percentile: 0.84

Topics

Imbalanced Data Classification Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Machine Learning and Data Classification
Physical Sciences →  Computer Science →  Artificial Intelligence
Anomaly Detection Techniques and Applications
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Multi-class imbalanced big data classification on Spark

William C. Sleeman, Bartosz Krawczyk

Journal:   Knowledge-Based Systems Year: 2020 Vol: 212 Pages: 106598-106598
BOOK-CHAPTER

Evaluating Difficulty of Multi-class Imbalanced Data

Mateusz Lango, Krystyna Napierała, Jerzy Stefanowski

Lecture notes in computer science Year: 2017 Pages: 312-322
JOURNAL ARTICLE

Improved multi-class classification approach for imbalanced big data on spark

Tinku Singh, Riya Khanna, Satakshi, Manish Kumar

Journal:   The Journal of Supercomputing Year: 2022 Vol: 79 (6) Pages: 6583-6611
JOURNAL ARTICLE

Instance-Level Cost-Sensitive Online Classification Algorithms for Class-Imbalanced Data Streams

Xian Shan, Jinyu You, Y. G. Xie, Wei Xu, Zongrui Li

Journal:   Advances in computer and materials science research. Year: 2025 Vol: 1 (1) Pages: 351-351
JOURNAL ARTICLE

A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty

Sihao Yu, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Zizhen Wang, Xueqi Cheng

Journal:   2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Year: 2022 Pages: 70-79