CART and Random Forest Analysis on Graduation Status of Halu Oleo University Students

Gusti Arviana Rahman, Khairil Anwar Notodiputro, Bagus Sartono, La Surimi

Abstract


Classification and Regression Trees (CART) is a popular classification method used in many fields and applicable under a wide range of data conditions. Random forest is an alternative to CART. These two classification methods were studied in this paper using graduation data from Halu Oleo University. The data were interesting because of a class-imbalance problem. We compared several scenarios, namely CART, Random Forest, Random Forest with oversampling, and Random Forest with undersampling. Three explanatory variables were considered in the model: study program, GPA, and TOEFL score. The results showed that the best method for classifying students' graduation status at Halu Oleo University was Random Forest without any handling of the imbalanced data, as it provided the highest sensitivity. This suggests that Random Forest, even without specific adjustments for class imbalance, can effectively capture the patterns in the data and provide accurate classifications, making it a robust choice for this dataset.
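
For readers who want to reproduce a comparison of this kind, the sketch below outlines the four scenarios described in the abstract (CART, Random Forest, Random Forest with oversampling, and Random Forest with undersampling), scored by sensitivity on a held-out test set. It is a minimal illustration, not the authors' code: it assumes Python with scikit-learn and imbalanced-learn, uses plain random over- and undersampling as stand-ins for whatever resampling scheme the paper applied, and the file name graduation.csv as well as the column names study_program, gpa, toefl, and status are placeholders rather than the paper's actual variable names.

```python
# Minimal, hypothetical sketch of the four classification scenarios compared
# in the paper. File name and column names below are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier       # CART-style decision tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score              # sensitivity = recall of the positive class
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Load the graduation data; assume the class of interest is coded 1.
df = pd.read_csv("graduation.csv")
X = pd.get_dummies(df[["study_program", "gpa", "toefl"]])  # one-hot encode study program
y = df["status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

scenarios = {
    "CART": (DecisionTreeClassifier(random_state=42), None),
    "Random Forest": (RandomForestClassifier(n_estimators=500, random_state=42), None),
    "RF + oversampling": (RandomForestClassifier(n_estimators=500, random_state=42),
                          RandomOverSampler(random_state=42)),
    "RF + undersampling": (RandomForestClassifier(n_estimators=500, random_state=42),
                           RandomUnderSampler(random_state=42)),
}

for name, (model, sampler) in scenarios.items():
    # Resample only the training data, then fit and evaluate on the untouched test set.
    X_tr, y_tr = sampler.fit_resample(X_train, y_train) if sampler else (X_train, y_train)
    model.fit(X_tr, y_tr)
    sensitivity = recall_score(y_test, model.predict(X_test), pos_label=1)
    print(f"{name}: sensitivity = {sensitivity:.3f}")
```

Resampling is applied to the training split only, so the reported sensitivities reflect performance on the original, imbalanced class distribution, which matches the way the scenarios are compared in the abstract.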

Keywords


classification tree; imbalanced data; oversampling; undersampling; statistical learning

Full Text:

PDF

DOI: http://dx.doi.org/10.12962/j27213862.v8i3.23336

Creative Commons License
Inferensi by Department of Statistics ITS is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://iptek.its.ac.id/index.php/inferensi.

ISSN:  0216-308X

e-ISSN: 2721-3862
