Bagging Using Instance-Level Difficulty for Multi-Class Imbalanced Big Data Classification on Spark

William C. Sleeman; Bartosz Krawczyk

doi:10.1109/bigdata47090.2019.9006058

JOURNAL ARTICLE

Year: 2019 Pages: 2484-2493

Abstract

Most machine learning methods assume that classes contain roughly equal numbers of instances. In many real-life problems, however, some types of instances appear far more frequently than others, which biases the classifier toward the majority class during training. The problem becomes even more challenging with multiple classes, where the relationships between classes are not easily defined. Learning from multi-class imbalanced data has not been widely considered in the context of big data mining, even though this difficulty arises frequently in that domain. In this paper, we address this challenge by proposing a comprehensive ensemble-based framework. We analyze each class to extract instance-level characteristics that describe the difficulty of its instances, and we embed this information into the existing UnderBagging framework. Our ensemble samples instances with probabilities proportional to their difficulty levels. This focuses the learning process on the most difficult instances and better captures the properties of multi-class imbalanced problems. We implemented our framework on Apache Spark to allow for high-performance computing over big data sets. Our experimental study shows that taking instance-level difficulty into account leads to significantly more accurate ensembles.
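
The sampling idea in the abstract can be illustrated with a small single-machine sketch. It is not the authors' Spark implementation, and the abstract does not say how difficulty is measured. The score used here (the fraction of an instance's k nearest neighbours that belong to another class) is an assumption chosen for illustration, as are the function names `knn_difficulty` and `difficulty_weighted_bag` and the `floor` parameter that keeps easy instances selectable.

```python
import math
import random
from collections import defaultdict


def knn_difficulty(points, labels, k=3):
    """Assumed difficulty score (not taken from the paper): the fraction of
    the k nearest neighbours that belong to a different class.
    0.0 means all neighbours share the class; 1.0 means none do."""
    scores = []
    for i, p in enumerate(points):
        # Brute-force neighbour search; acceptable only for a toy example.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        scores.append(sum(labels[j] != labels[i] for j in neighbours) / k)
    return scores


def difficulty_weighted_bag(labels, difficulty, rng, floor=0.05):
    """Draw one class-balanced bag of instance indices. Every class
    contributes as many instances as the smallest class contains, sampled
    with replacement with probability proportional to difficulty + floor."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    bag_size = min(len(members) for members in by_class.values())
    bag = []
    for members in by_class.values():
        weights = [difficulty[i] + floor for i in members]
        bag.extend(rng.choices(members, weights=weights, k=bag_size))
    return bag


# Toy data: class "a" is the majority; one "b" instance sits inside "a".
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (5, 5), (5, 6), (0.9, 0.9), (9, 9), (9, 8)]
labels = ["a", "a", "a", "a", "a", "b", "b", "b", "c", "c"]

scores = knn_difficulty(points, labels, k=3)
bags = [difficulty_weighted_bag(labels, scores, random.Random(seed))
        for seed in range(5)]  # one bag per ensemble member
```

Each bag would then train one base classifier, with predictions combined by voting as in ordinary bagging. On a cluster, both the neighbour search and the per-class sampling would need distributed counterparts, which this sketch does not attempt.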

Keywords:

Computer science; Machine learning; Artificial intelligence; Big data; Classifier (UML); Ensemble learning; SPARK (programming language); Class (philosophy); Focus (optics); Process (computing); Context (archaeology); Training set; Data mining

Metrics

FWCI (Field Weighted Citation Impact): 1.23

Citation Normalized Percentile: 0.84

Topics

Imbalanced Data Classification Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Machine Learning and Data Classification

Physical Sciences → Computer Science → Artificial Intelligence

Anomaly Detection Techniques and Applications

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Multi-class imbalanced big data classification on Spark

Evaluating Difficulty of Multi-class Imbalanced Data

Improved multi-class classification approach for imbalanced big data on spark

Instance-Level Cost-Sensitive Online Classification Algorithms for Class-Imbalanced Data Streams

A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty