Overcoming Data Imbalance in Risk Management: A Comparative Study of Sampling Methods

Arya Wijna Astungkara, Achmad Pratama Rifai

Abstract


Data imbalance is a significant challenge in risk management, especially in classification tasks where critical events—such as loan defaults, employee attrition, or company bankruptcy—occur less frequently than normal cases. This paper presents a comparative study of eight sampling methods—Random Undersampling (RUS), Random Oversampling (ROS), Edited Nearest Neighbor (ENN), One-Sided Selection (OSS), SMOTE, ADASYN, SMOTEENN, and SMOTETomek—across three imbalanced datasets: Taiwanese Bankruptcy Prediction, IBM HR Analytics Employee Attrition, and Loan Prediction. With eight machine learning classifiers, the study evaluates performance using the F1 score and negative predictive value (NPV), two metrics well suited to imbalanced data. The results reveal that ENN achieves the highest F1 scores on high-dimensional and severely imbalanced datasets, while SMOTE-based methods perform best on large-scale datasets with moderate imbalance. Notably, RUS consistently delivers the highest NPV, highlighting its effectiveness in minimizing false negatives and supporting conservative decision-making. The findings underscore the importance of aligning sampling strategies with dataset characteristics and specific risk management objectives.
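NPV is reported less often than the F1 score, so the two evaluation metrics and the simplest of the compared methods (RUS) can be made concrete. The sketch below uses only the Python standard library; the function names are illustrative and do not come from the paper, which relies on established implementations for its experiments.

```python
import random
from collections import Counter

def f1_and_npv(y_true, y_pred):
    """F1 on the positive (minority) class and negative predictive value.

    F1  = 2*TP / (2*TP + FP + FN)
    NPV = TN / (TN + FN): the share of predicted negatives that truly are
    negative, so a high NPV means few critical events slip through as 'safe'.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    npv = tn / (tn + fn) if (tn + fn) else 0.0
    return f1, npv

def random_undersample(X, y, seed=0):
    """RUS: drop majority-class rows at random until all classes match
    the minority-class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_min = min(counts.values())
    budget = {label: n_min for label in counts}  # keep n_min of every class
    kept = []
    for i in rng.sample(range(len(y)), len(y)):  # visit rows in random order
        if budget[y[i]] > 0:
            budget[y[i]] -= 1
            kept.append(i)
    return [X[i] for i in kept], [y[i] for i in kept]
```

In practice, all eight methods compared in the paper are available off the shelf: the imbalanced-learn library provides, for example, `RandomUnderSampler`, `SMOTE`, `ADASYN`, and the combined `SMOTEENN`/`SMOTETomek` samplers, which plug into scikit-learn pipelines.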





DOI: http://dx.doi.org/10.24014/jti.v11i1.37368



Copyright (c) 2025 Arya Wijna Astungkara, Achmad Pratama Rifai

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Jurnal Teknik Industri

P-ISSN 2460-898X | E-ISSN 2714-6235

Published by:

Industrial Engineering Department

Universitas Islam Negeri Sultan Syarif Kasim Riau, Indonesia

Office Address:

H.R. Soebrantas KM 15.5, Tampan, Pekanbaru, Riau, Indonesia 28293

email: jti.fst@uin-suska.ac.id

 
