JOURNAL ARTICLE

A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning

Dina Elreedy, Amir F. Atiya, Firuz Kamalov

Year: 2023   Journal: Machine Learning   Vol: 113 (7)   Pages: 4903-4923   Publisher: Springer Science+Business Media

Abstract

Class imbalance occurs when the classes are not equally represented in the data: one class is under-represented (the minority class), while the other has significantly more samples (the majority class). The class imbalance problem is prevalent in many real-world applications, and the under-represented minority class is generally the class of interest. The synthetic minority over-sampling technique (SMOTE) is considered the most prominent method for handling imbalanced data. SMOTE generates new synthetic data patterns by linearly interpolating between minority class samples and their K nearest neighbors. However, the SMOTE-generated patterns do not necessarily conform to the original minority class distribution. This paper develops a novel theoretical analysis of the SMOTE method by deriving the probability distribution of the SMOTE-generated samples. To the best of our knowledge, this is the first work to derive a mathematical formulation for the SMOTE patterns’ probability distribution. This allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are. The derived formula is verified by evaluating it for a number of densities and comparing the results against densities estimated empirically.
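The interpolation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code or a library implementation: the function name `smote_sketch`, its parameters, the brute-force neighbor search, and the uniform interpolation weight are assumptions made for illustration only.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Hypothetical helper: draw n_new synthetic points from minority samples X_min.

    Assumes Euclidean distance and an interpolation weight drawn
    uniformly from [0, 1); both are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # a sample cannot have more than n - 1 neighbors

    # Brute-force pairwise squared Euclidean distances (fine for small n).
    diff = X_min[:, None, :] - X_min[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest minority neighbors of each sample.
    nn_idx = np.argsort(d2, axis=1)[:, :k]

    # Pick a base sample, then one of its k neighbors, uniformly at random.
    base = rng.integers(0, n, size=n_new)
    nbr = nn_idx[base, rng.integers(0, k, size=n_new)]

    # Linear interpolation between the base sample and the chosen neighbor.
    w = rng.random((n_new, 1))
    return X_min[base] + w * (X_min[nbr] - X_min[base])
```

Because every synthetic point lies on a segment joining two existing minority samples, the generated set stays inside the convex hull of the minority data. This gives an intuition for why its density can differ from the true class-conditional density, for example near the boundary of the support.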

Keywords:
Oversampling; Class (philosophy); Interpolation (computer graphics); Mathematics; Distribution (mathematics); Artificial intelligence; Algorithm; Computer science; Conditional probability distribution; Pattern recognition (psychology); Machine learning; Statistics

Metrics

Cited By: 265
FWCI (Field Weighted Citation Impact): 67.69
Refs: 64
Citation Normalized Percentile: 1.00
Is in top 1%
Is in top 10%

Topics

Imbalanced Data Classification Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Electricity Theft Detection Techniques
Physical Sciences →  Engineering →  Electrical and Electronic Engineering
Medical Coding and Health Information
Health Sciences →  Health Professions →  Health Information Management