William C. Sleeman, Bartosz Krawczyk
Most machine learning methods assume that classes contain a roughly balanced number of instances. In many real-life problems, however, some types of instances appear far more frequently than others, which biases the classifier towards the majority class during training. The problem becomes even more challenging with multiple classes, where the relationships between classes are not easily defined. Learning from multi-class imbalanced data has not been widely considered in the context of big data mining, even though this learning difficulty appears frequently in that domain. In this paper, we address this challenge by proposing a comprehensive ensemble-based framework. We analyze each class to extract instance-level characteristics that describe the difficulty of its instances, and we embed this information into the existing UnderBagging framework. Our ensemble samples instances with probabilities proportional to their difficulty levels. This focuses the learning process on the most difficult instances and better captures the properties of multi-class imbalanced problems. We implemented our framework on Apache Spark to allow for high-performance computing over big data sets. Our experimental study shows that taking instance-level difficulty into account leads to significantly more accurate ensembles.
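The sampling idea described above can be illustrated with a minimal single-machine sketch. It is not the Spark implementation from the paper, and the abstract does not say how difficulty is measured, so the sketch assumes a simple stand-in: the fraction of an instance's k nearest neighbours that carry a different label. The function names `knn_difficulty` and `difficulty_weighted_bag`, the smoothing constant `eps`, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def knn_difficulty(X, y, k=3):
    """Assumed difficulty measure: share of the k nearest neighbours
    (squared Euclidean distance) whose label differs from the instance's own."""
    scores = []
    for i, xi in enumerate(X):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X)
            if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        scores.append(sum(y[j] != y[i] for j in neighbours) / k)
    return scores


def difficulty_weighted_bag(y, difficulty, rng, eps=0.05):
    """Draw one balanced bag of instance indices.

    Every class contributes as many instances as the smallest class
    (the UnderBagging idea). Within a class, instances are drawn with
    replacement, with probability proportional to difficulty + eps;
    eps keeps easy instances from being excluded entirely.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    n_min = min(len(ids) for ids in by_class.values())

    bag = []
    for ids in by_class.values():
        weights = [difficulty[i] + eps for i in ids]
        bag.extend(rng.choices(ids, weights=weights, k=n_min))
    return bag


# Toy three-class imbalanced data set (6 / 3 / 2 instances), purely illustrative.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
     (5, 5), (5, 6), (6, 5),
     (1.2, 1.0), (9, 9)]
y = ["a"] * 6 + ["b"] * 3 + ["c"] * 2

difficulty = knn_difficulty(X, y, k=3)
rng = random.Random(0)
bag = difficulty_weighted_bag(y, difficulty, rng)
print(Counter(y[i] for i in bag))
```

In a full ensemble, one such bag would be drawn per base classifier, and each classifier would be trained on its own bag. Sampling with replacement is a choice made for this sketch; the abstract does not state which variant the authors use.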
Mateusz Lango, Krystyna Napierała, Jerzy Stefanowski
Tinku Singh, Riya Khanna, Satakshi, Manish Kumar
Xian Shan, Jinyu You, Y. G. Xie, Wei Xu, Zongrui Li
Sihao Yu, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Zizhen Wang, Xueqi Cheng