HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification

Al Majzoub H., Elgedawy I., Akaydin O., Ulukok M. K.

ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, cilt.45, sa.4, ss.3205-3222, 2020 (SCI İndekslerine Giren Dergi) identifier identifier

  • Yayın Türü: Makale / Tam Makale
  • Cilt numarası: 45 Konu: 4
  • Basım Tarihi: 2020
  • Doi Numarası: 10.1007/s13369-019-04336-1
  • Sayfa Sayıları: ss.3205-3222


Binary datasets are considered imbalanced when one of their two classes has less than 40% of the total number of the data instances (i.e., minority class). Existing classification algorithms are biased when applied on imbalanced binary datasets, as they misclassify instances of minority class. Many techniques are proposed to minimize the bias and to increase the classification accuracy. Synthetic Minority Oversampling Technique (SMOTE) is a well-known approach proposed to address this problem. It generates new synthetic data instances to balance the dataset. Unfortunately, it generates these instances randomly, leading to the generation of useless new instances, which is time and memory consuming. Different SMOTE derivatives were proposed to overcome this problem (such as Borderline SMOTE), yet the number of generated instances slightly changed. To overcome such problem, this paper proposes a novel approach for generating synthesized data instances known as Hybrid Clustered Affinitive Borderline SMOTE (HCAB-SMOTE). It managed to minimize the number of generated instances while increasing the classification accuracy. It combines undersampling for removing majority noise instances and oversampling approaches to enhance the density of the borderline. It uses k-means clustering on the borderline area and identify which clusters to oversample to achieve better results. Experimental results show that HCAB-SMOTE outperformed SMOTE, Borderline SMOTE, AB-SMOTE and CAB-SMOTE approaches which were developed before reaching HCAB-SMOTE, as it provided the highest classification accuracy with the least number of generated instances.