Bastián Andres Troncoso, José A. Mateo-Cortés, M. Julia Flores, Francisco J. Tapiador
Imbalanced datasets pose significant challenges in machine learning (ML), often leading to catastrophic failures in models that are insensitive to their statistical particularities. In regression using neural networks (NN), imbalance is particularly problematic due to the potential undersampling of extreme but important values. Traditional classification methods are ineffective with imbalanced datasets and fail to capture the structure of the problem. This article presents a novel approach to handling imbalanced data in general regression. The new preprocessing technique, called “Sequence-Based Undersampling”, leverages the spatial structure of the data to selectively remove overrepresented instances. The method is tested using quantitative precipitation estimates (QPE), a well-known case of an imbalanced distribution in earth physics. The technique demonstrates consistent improvements in model performance compared to existing methods. The results suggest that sequence-aware undersampling improves regression models and ML algorithms, providing a practical solution to a prevalent issue in data-driven research. This method can enhance current satellite precipitation algorithms, as satellite retrievals often exhibit a leptokurtic distribution with few cases of high rainfall rates, many low rates, and numerous no-rain occurrences. This is a paradigmatic case of an imbalanced dataset built sequentially through radar and radiometer measurements.
O. A. Ajilisa, V P Jagathyraj, M. K. Sabu
Yusuf Arslan, Kevin Allix, Clément Lefebvre, Andrey Boytsov, Tegawendé F. Bissyandé, Jacques Klein
Sungjun Seo, Mohammad Afrazi, Kooktae Lee
Pattaramon Vuttipittayamongkol, Eyad Elyan
Sudhansu Ranjan Lenka, Sukant Kishoro Bisoy, Rojalina Priyadarshini, B. Nayak