Sequence-based undersampling: an algorithm for managing imbalanced datasets

Bastián Andres Troncoso; José A. Mateo-Cortés; M. Julia Flores; Francisco J. Tapiador

doi:10.7717/peerj-cs.3078

JOURNAL ARTICLE

Year: 2025; Journal: PeerJ Computer Science; Vol: 11; Pages: e3078; Publisher: PeerJ, Inc.

DOI: 10.7717/peerj-cs.3078

Abstract

Imbalanced datasets pose significant challenges in machine learning (ML), often leading to catastrophic failures in models that are insensitive to their statistical particularities. In regression using neural networks (NN), imbalance is particularly problematic because extreme but important values may be undersampled. Traditional classification methods are ineffective with imbalanced datasets and fail to capture the structure of the problem. This article presents a novel approach to handling imbalanced data in general regression. The new preprocessing technique, called “Sequence-Based Undersampling”, leverages the spatial structure of the data to selectively remove overrepresented instances. The method is tested using quantitative precipitation estimates (QPE), a well-known case of an imbalanced distribution in earth physics. The technique demonstrates consistent improvements in model performance compared to existing methods. The results suggest that sequence-aware undersampling improves regression models and ML algorithms, providing a practical solution to a prevalent issue in data-driven research. This method can enhance current satellite precipitation algorithms, because satellite retrievals often exhibit a leptokurtic distribution with few cases of high rainfall rates, many low rates, and numerous no-rain occurrences. This is a paradigmatic case of an imbalanced dataset built sequentially through radar and radiometer measurements.
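
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of sequence-aware undersampling, not the published method. It assumes a generic run-length thinning heuristic: along an ordered sequence of target values, consecutive samples in the overrepresented low-value regime are treated as a run, and only the first few samples of each run are kept. The function name `thin_runs`, the parameters `threshold` and `keep_per_run`, and the rain-rate values are all hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def thin_runs(values: Sequence[float], threshold: float, keep_per_run: int) -> List[int]:
    """Return the indices kept after thinning runs of low-valued samples.

    Illustrative heuristic only (an assumption, not the paper's algorithm).
    A "run" is a maximal stretch of consecutive samples with value <= threshold,
    i.e. the overrepresented regime such as no-rain pixels. Only the first
    `keep_per_run` samples of each run survive; samples above the threshold
    are always kept.
    """
    if keep_per_run < 0:
        raise ValueError("keep_per_run must be non-negative")
    kept: List[int] = []
    run_length = 0
    for i, v in enumerate(values):
        if v <= threshold:
            run_length += 1
            if run_length <= keep_per_run:
                kept.append(i)
        else:
            run_length = 0  # a rare, informative sample ends the current run
            kept.append(i)
    return kept


# Hypothetical rain rates (mm/h) along one ordered scan line.
rain_rates = [0.0, 0.0, 0.0, 0.0, 2.5, 0.0, 0.0, 0.0, 41.0, 0.1, 0.0]
kept = thin_runs(rain_rates, threshold=0.1, keep_per_run=1)
sample = [rain_rates[i] for i in kept]
print(kept)    # [0, 4, 5, 8, 9]
print(sample)  # [0.0, 2.5, 0.0, 41.0, 0.1]
```

In this toy sequence the two samples above the threshold are retained while the nine low-rate samples shrink to three, so the rare high values make up a larger share of the training set. Unlike random undersampling, the choice of which samples to drop depends on their position in the sequence, which is the property the abstract attributes to the method.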

Topics

Imbalanced Data Classification Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Anomaly Detection Techniques and Applications

Physical Sciences → Computer Science → Artificial Intelligence

Electricity Theft Detection Techniques

Physical Sciences → Engineering → Electrical and Electronic Engineering
