Comparison of GMERF and GLMM Tree Models on Poverty Household Data with Imbalanced Categories

Ari Shobri Bukhari, Khairil Anwar Notodiputro, Indahwati Indahwati, Anwar Fitrianto

Abstract


Decision tree and forest methods have become popular approaches in data science and continue to evolve. One such development combines decision trees with Generalized Linear Mixed Models (GLMM), yielding the GLMM Tree, which is applicable to multilevel and longitudinal data. Another model, the Generalized Mixed-Effects Random Forest (GMERF), extends decision forests with GLMM and effectively handles complex data structures with non-linear interactions. This study compares the performance of GLMM Tree and GMERF models in classifying poor households in South Sulawesi Province, a dataset characterized by imbalanced categories. The GLMM Tree provides a simple, interpretable classification through tree diagrams, while GMERF highlights variable importance. Initial tests show that all three models (GLMM, GLMM Tree, and GMERF) achieve high accuracy and specificity but low sensitivity. Applying oversampling significantly improves sensitivity and AUC, though at the cost of accuracy and specificity, revealing a trade-off. The study concludes that while GLMM, GLMM Tree, and GMERF each have strengths, using them together offers a more comprehensive understanding of poverty classification. Oversampling is effective at increasing sensitivity on imbalanced data, but its impact on overall accuracy warrants careful consideration.
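The sensitivity/specificity trade-off described in the abstract can be illustrated with a minimal, self-contained sketch. This is not the study's actual household data or models: it uses synthetic data with one informative feature and a plain logistic classifier, and shows how random oversampling of the minority class raises sensitivity while lowering specificity.

```python
import math
import random

random.seed(42)

def simulate(n, pos_rate):
    """Synthetic stand-in for imbalanced household data: one informative
    feature, positives ('poor') centred at 2, negatives at 0."""
    rows = []
    for _ in range(n):
        y = 1 if random.random() < pos_rate else 0
        x = random.gauss(2.0 if y else 0.0, 1.0)
        rows.append((x, y))
    return rows

def oversample(rows):
    """Random oversampling: duplicate minority-class rows (sampled with
    replacement) until both classes are the same size."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    return neg + pos + [random.choice(pos) for _ in range(len(neg) - len(pos))]

def fit_logistic(rows, epochs=500, lr=0.5):
    """Full-batch gradient descent on the logistic loss (intercept + slope)."""
    b0 = b1 = 0.0
    n = len(rows)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y
            g1 += (p - y) * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def sens_spec(model, rows):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP) at threshold 0.5."""
    b0, b1 = model
    tp = fn = tn = fp = 0
    for x, y in rows:
        pred = 1 if b0 + b1 * x >= 0.0 else 0  # p >= 0.5 iff linear score >= 0
        if y == 1:
            tp, fn = tp + pred, fn + (1 - pred)
        else:
            tn, fp = tn + (1 - pred), fp + pred
    return tp / (tp + fn), tn / (tn + fp)

train = simulate(2000, 0.05)  # ~5% minority class, mimicking rare poor households
test = simulate(1000, 0.05)

sens_raw, spec_raw = sens_spec(fit_logistic(train), test)
sens_over, spec_over = sens_spec(fit_logistic(oversample(train)), test)

print(f"raw:         sensitivity={sens_raw:.2f}  specificity={spec_raw:.2f}")
print(f"oversampled: sensitivity={sens_over:.2f}  specificity={spec_over:.2f}")
```

Training on the raw imbalanced sample pushes the decision boundary toward the minority class, so sensitivity is low and specificity is near perfect; training on the oversampled set moves the boundary back toward the midpoint, trading specificity for a large gain in sensitivity, which mirrors the pattern the study reports.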

Keywords


GLMM Tree, GLMM, GMERF, poverty household classification, oversampling



DOI: http://dx.doi.org/10.12962/j27213862.v8i2.21901





Inferensi by Department of Statistics ITS is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://iptek.its.ac.id/index.php/inferensi.

ISSN: 0216-308X

e-ISSN: 2721-3862
