A Comparative Study of Generalized Linear Mixed Model and Mixed Effects Random Forest for Analyzing Data with Outliers

Reza Arianti; Khairil Anwar Notodiputro; Yenni Angraini

doi:10.52436/1.jutif.2026.7.2.5407

Authors

Reza Arianti, School of Data Science, Mathematics and Informatics, IPB University, Indonesia
Khairil Anwar Notodiputro, School of Data Science, Mathematics and Informatics, IPB University, Indonesia
Yenni Angraini, School of Data Science, Mathematics and Informatics, IPB University, Indonesia

Keywords:

GLMM, Hierarchical Modeling, Data, MERF, Outliers, Tobacco Intensity, Winsorization

Abstract

This study compares the Mixed Effects Random Forest (MERF) with the Generalized Linear Mixed Model with a Negative Binomial distribution (GLMM-NB) for analyzing hierarchical data, focusing on the role of residual outliers and the application of winsorization. A two-stage analytical pipeline was implemented: (1) winsorization to reduce extreme residual values, and (2) model training using MERF and GLMM-NB. The dataset comes from the 2021 National Socio-Economic Survey (Susenas) for West Java Province and measures tobacco consumption intensity. Models were trained under two conditions: without winsorization (WIN0) and with two-sided 5% winsorization (WIN5). Winsorization was applied to the training data, and the test data were adjusted using the thresholds obtained from the training set. Model performance was assessed using Root Mean Squared Error (RMSE) and the train-test ratio. Under WIN0, GLMM-NB recorded an RMSE of 49.65 for training and 42.27 for testing, while MERF achieved 35.96 and 39.94, respectively. Under WIN5, GLMM-NB showed the larger error reduction, with RMSE values of 34.90 (train) and 30.20 (test), while MERF dropped to 26.63 (train) and 28.64 (test). These results indicate that MERF provides higher predictive accuracy in both conditions, whereas GLMM-NB benefits more from winsorization. Household expenditure, employment status, age, and gender consistently emerged as key variables linked to tobacco consumption intensity. This study is the first to compare MERF and GLMM-NB with winsorization using Indonesia's hierarchical data. The analytical framework helps inform public health policies aligned with SDG 3: Good Health and Well-being, particularly in reducing tobacco-related health risks.
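The winsorization and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes that "two-sided 5%" means clipping at the 5th and 95th percentiles, that the train-test ratio is test RMSE divided by training RMSE, and that the clipped quantity is a generic numeric vector; the function names and the synthetic data are hypothetical. Only the four pairs of RMSE values are taken from the abstract.

```python
import numpy as np


def fit_winsor_limits(train_values, lower_pct=5.0, upper_pct=95.0):
    """Estimate clipping thresholds from the training data only.

    Assumption: "two-sided 5%" is read as the 5th and 95th percentiles.
    """
    low, high = np.percentile(train_values, [lower_pct, upper_pct])
    return float(low), float(high)


def apply_winsor(values, limits):
    """Clip values to previously fitted (low, high) thresholds."""
    low, high = limits
    return np.clip(values, low, high)


def rmse(y_true, y_pred):
    """Root Mean Squared Error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic, overdispersed count-like data standing in for the survey response
# (hypothetical; the Susenas data are not reproduced here).
rng = np.random.default_rng(0)
y_train = rng.negative_binomial(n=2, p=0.05, size=1000).astype(float)
y_test = rng.negative_binomial(n=2, p=0.05, size=300).astype(float)

# Thresholds come from the training set and are reused on the test set,
# so no information from the test set leaks into preprocessing.
limits = fit_winsor_limits(y_train)
y_train_w = apply_winsor(y_train, limits)
y_test_w = apply_winsor(y_test, limits)

# RMSE values reported in the abstract, as (train, test).
reported = {
    ("GLMM-NB", "WIN0"): (49.65, 42.27),
    ("MERF", "WIN0"): (35.96, 39.94),
    ("GLMM-NB", "WIN5"): (34.90, 30.20),
    ("MERF", "WIN5"): (26.63, 28.64),
}

# Assumed definition of the train-test ratio: test RMSE / training RMSE.
ratios = {key: test / train for key, (train, test) in reported.items()}

for (model, condition), ratio in ratios.items():
    print(f"{model} {condition}: test/train RMSE ratio = {ratio:.3f}")
```

Under this assumed definition, a ratio above 1 means the test error exceeds the training error, and a ratio below 1 means the opposite. Fitting the thresholds on the training set alone mirrors the abstract's statement that the test data were adjusted using training-set thresholds.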
