Sungjun Seo, Mohammad Afrazi, Kooktae Lee
Abstract: This paper investigates a novel undersampling technique based on optimal transport (OT) for managing imbalanced datasets in classification tasks. Undersampling is crucial for reducing dataset size while preserving essential statistical properties, improving both classification performance and computational efficiency. Existing methods, such as random undersampling, NearMiss, Tomek Links, and Edited Nearest Neighbor, often fail to adequately preserve the underlying data distribution. To address this limitation, we propose a Wasserstein distance-based undersampling method that formulates an optimization problem aimed at minimizing distributional distortion. By leveraging the Wasserstein distance to quantify differences between probability distributions, the proposed approach ensures that the reduced dataset retains key geometric and statistical characteristics of the original majority class. Furthermore, we provide a computational complexity analysis and establish a stability property that bounds the Wasserstein deviation introduced by support reduction. Simulation results on synthetically generated imbalanced datasets demonstrate that the proposed method preserves the structural characteristics of the original data more effectively than existing resampling techniques, while achieving balanced classification performance across both majority and minority classes. These results highlight the potential of the proposed approach as an effective and scalable solution for addressing class imbalance in practical classification problems.
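The idea of choosing a reduced majority set that stays close to the original in Wasserstein distance can be illustrated with a small one-dimensional sketch. This is not the optimization formulated in the paper: it assumes a single feature, uniform sample weights, and a simple quantile-midpoint selection rule, and the helper names `wasserstein_1d` and `quantile_undersample` are hypothetical. It only shows how the Wasserstein-1 distance can score an undersampled subset against the full majority class, with random undersampling as the comparison.

```python
import numpy as np


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two 1D empirical distributions
    with uniform weights, computed as the integral of |F_u - F_v|."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    all_vals = np.sort(np.concatenate([u, v]))
    deltas = np.diff(all_vals)
    # Empirical CDFs evaluated at the left end of each interval.
    cdf_u = np.searchsorted(u, all_vals[:-1], side="right") / u.size
    cdf_v = np.searchsorted(v, all_vals[:-1], side="right") / v.size
    return float(np.sum(np.abs(cdf_u - cdf_v) * deltas))


def quantile_undersample(x, k):
    """Illustrative rule (an assumption, not the paper's method):
    keep the sample at the midpoint of each of k equal-mass bins."""
    x_sorted = np.sort(np.asarray(x, dtype=float))
    idx = ((np.arange(k) + 0.5) * x_sorted.size / k).astype(int)
    return x_sorted[idx]


# Synthetic majority class: 1000 samples of one feature.
rng = np.random.default_rng(0)
majority = rng.normal(size=1000)
k = 50

subset_q = quantile_undersample(majority, k)
subset_r = rng.choice(majority, size=k, replace=False)  # random undersampling

w_quantile = wasserstein_1d(majority, subset_q)
w_random = wasserstein_1d(majority, subset_r)
print(f"W1(majority, quantile subset) = {w_quantile:.4f}")
print(f"W1(majority, random subset)   = {w_random:.4f}")
```

In higher dimensions the sorted-sample shortcut no longer applies, and the distance would instead come from solving an optimal transport problem between the full and reduced point sets.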
Sudhansu Ranjan Lenka, Sukant Kishoro Bisoy, Rojalina Priyadarshini, B. Nayak
Yap Bee Wah, Khatijahhusna Abd Rani, Hezlin Aryani Abd Rahman, Simon Fong, Zuraida Khairudin, Nik Nik Abdullah
Bastián Andres Troncoso, José A. Mateo-Cortés, M. Julia Flores, Francisco J. Tapiador