Mixed-Data K-Means Clustering with Hyperparameter-Tuned Random Forest for OSS-Based MSME Investment Profiling and Policy Targeting

Laura Sari; Ratih Hafsarah Maharrani; Hety Dwi Hastuti; Adrian Putra Ramadhan; Wahyuni Windasari

doi:10.52436/1.jutif.2026.7.2.5545

Authors

Laura Sari, Teknik Informatika, Politeknik Negeri Cilacap, Indonesia
Ratih Hafsarah Maharrani, Rekayasa Keamanan Siber, Politeknik Negeri Cilacap, Indonesia
Hety Dwi Hastuti, Akuntansi Lembaga Keuangan Syariah, Politeknik Negeri Cilacap, Indonesia
Adrian Putra Ramadhan, Teknik Informatika, Politeknik Negeri Cilacap, Indonesia
Wahyuni Windasari, Sains Data, Universitas Putra Bangsa, Indonesia

Keywords:

Clustering Analysis, Gower Distance, MSME Investment Profiling, Predictive Analytics, Random Forest

Abstract

Administrative data on Micro, Small, and Medium Enterprises (MSMEs) collected through the Online Single Submission (OSS) system are highly heterogeneous: they combine numerical and categorical attributes, which hinders conventional investment segmentation and early-stage policy mapping. This study develops a predictive clustering framework for enterprise investment profiling using mixed-type administrative data. Preprocessing applies RobustScaler to numerical variables and one-hot encoding followed by singular value decomposition to categorical features. Mixed-type similarity is computed using Gower distance, followed by a hybrid Gower–K-Means clustering approach, in which the optimal number of clusters (k = 3) is determined using the Silhouette, Calinski–Harabasz, and Davies–Bouldin indices. In a comparative evaluation of clustering algorithms, K-Prototypes performs best in the initial assessment, while K-Means achieves superior performance after optimization. Cluster membership is then predicted using a Random Forest classifier whose hyperparameters are optimized through randomized search. Experiments on 20,857 enterprise records identify three distinct clusters representing low-capital micro enterprises, transitional firms, and asset-intensive corporate entities. The optimized K-Means model achieves a Silhouette score of 0.97 and a Davies–Bouldin Index of 0.54. Compared with the untuned baseline, the tuned Random Forest model improves recall from 0.25 to 0.75 (a 200% increase) and the F1-score from 0.40 to 0.86 (a 114% improvement), while achieving 99.89% accuracy. These gains correspond to an estimated 20–30% improvement in MSME investment mapping effectiveness compared with traditional profiling approaches, providing a scalable AI-based blueprint for targeted regional economic governance.
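
The abstract names Gower distance as the mixed-type similarity measure but does not give the formula, the feature set, or any feature weights. As an illustration only, the sketch below implements the standard unweighted Gower distance: each numerical feature contributes its absolute difference divided by the feature's observed range, each categorical feature contributes 0 on a match and 1 on a mismatch, and the contributions are averaged. The record layout (capital, workers, scale, risk level) and all values are hypothetical and are not taken from the study's data.

```python
from typing import List, Optional, Sequence, Union

Value = Union[float, str]


def numeric_ranges(records: Sequence[Sequence[Value]],
                   is_numeric: Sequence[bool]) -> List[Optional[float]]:
    """Return max - min for each numerical column, None for categorical ones."""
    ranges: List[Optional[float]] = []
    for j, numeric in enumerate(is_numeric):
        if numeric:
            column = [float(rec[j]) for rec in records]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)
    return ranges


def gower_distance(a: Sequence[Value], b: Sequence[Value],
                   ranges: Sequence[Optional[float]]) -> float:
    """Unweighted Gower distance between two mixed-type records."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical feature: simple matching (0 if equal, 1 otherwise).
            total += 0.0 if x == y else 1.0
        elif r > 0:
            # Numerical feature: range-normalized absolute difference.
            total += abs(float(x) - float(y)) / r
        # A numerical feature with zero range contributes nothing.
    return total / len(ranges)


# Hypothetical records: (capital, workers, scale, risk level).
records = [
    (50.0, 2, "micro", "low"),
    (500.0, 10, "small", "low"),
    (10050.0, 102, "medium", "high"),
]
is_numeric = [True, True, False, False]
ranges = numeric_ranges(records, is_numeric)

# Print the pairwise distance matrix, one row per record.
for a in records:
    print(["%.3f" % gower_distance(a, b, ranges) for b in records])
```

In this toy data the first two records share a risk level and are close on both numerical features, so their distance is small (about 0.28), while the first and third records differ maximally on every feature and sit at distance 1. How the study feeds these distances into K-Means is not stated in the abstract, so that step is left out of the sketch.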
