Handling Imbalance Data in Classification Model with Nominal Predictors

Kartika Fithriasari, Iswari Hariastuti, Kinanthi Sukma Wening

Abstract


The decision tree, a classification method, can be used to identify the factors that predict an outcome while producing interpretable results. However, when the class of interest makes up only a small share of the data, the classifier tends to assign observations to the majority class. Class imbalance therefore needs to be handled. One method often used for data with nominal predictors is SMOTE-N. To improve accuracy, the hybrid SMOTE-N-ENN and ADASYN-N were developed. In this study, SMOTE-N, SMOTE-N-ENN, and ADASYN-N are compared for handling class imbalance in the classification of premarital sex among adolescents, using CART as the base classifier. ADASYN-N is found to be the best method for handling class imbalance because it yields a higher AUC than SMOTE-N and SMOTE-N-ENN. The best decision tree indicates that the factors that predict adolescents having premarital sexual relations are dating style, knowledge of the fertile period, knowledge of the risks of young marriage, gender, most recent education, and area of residence.
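As a rough illustration of how oversampling with nominal predictors can work, the sketch below generates synthetic minority rows in the spirit of SMOTE-N: each new row takes, feature by feature, the most frequent category among a minority sample and its nearest minority neighbours. It is not the procedure used in this study. The function names, the toy data, and the use of a simple overlap (Hamming) distance are assumptions made for brevity; SMOTE-N as originally described measures distance with a value difference metric instead.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of features on which two nominal rows disagree (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))


def smote_n_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows from nominal data.

    Hypothetical helper: for a randomly chosen minority row, find its k nearest
    minority neighbours, then build a new row by majority vote per feature.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: hamming(base, row))[:k]
        group = [base] + neighbours
        # Majority vote per column; ties go to the value seen first (the base row).
        new_row = tuple(Counter(col).most_common(1)[0][0] for col in zip(*group))
        synthetic.append(new_row)
    return synthetic


# Toy minority class with three nominal features (made-up categories).
minority = [
    ("urban", "male", "high"),
    ("urban", "female", "high"),
    ("rural", "male", "high"),
    ("urban", "male", "low"),
]
new_rows = smote_n_sketch(minority, n_new=4, k=2)
```

Every synthetic row contains only categories already observed in the minority class, so the predictors stay nominal and a tree-based classifier such as CART can be trained on the augmented data without any numeric encoding of distances between categories.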

Keywords


ADASYN-N; CART; hybrid SMOTE-N; imbalanced data; premarital sex





DOI: http://dx.doi.org/10.12962/j24775401.v6i1.6643



Creative Commons License
International Journal of Computing Science and Applied Mathematics by Pusat Publikasi Ilmiah LPPM, Institut Teknologi Sepuluh Nopember is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://iptek.its.ac.id/index.php/ijcsam.