Enhancing Single Nucleotide Polymorphisms Detection from Imbalanced Data: A Study of Resampling Techniques in Machine Learning Algorithms

Rossy Nurhasanah; Dedy Arisandi; Fanindia Purnamasari; Hayatunnufus Hayatunnufus; Daisy Sere Damara Simangunsong; Aflah Mutsanni Pulungan

doi:10.24014/ijaidm.v8i1.32942

Abstract

Identifying true Single Nucleotide Polymorphisms (SNPs) from Next Generation Sequencing (NGS) data is a class imbalance problem, a consequence of the inherently high error rate of NGS technology. Class imbalance harms machine learning algorithms: it produces biased models that perform poorly, particularly when detecting true SNPs, which belong to the underrepresented class. This study evaluates the effectiveness of several resampling techniques, including Borderline-SMOTE, Random Undersampling, and Tomek Links, in improving the performance of two machine learning algorithms, Random Forest (RF) and Artificial Neural Networks (ANN), and compares the techniques to determine the most effective approach. Our results indicate that Borderline-SMOTE raises the F-Measure of RF from 69.72 to 91.52 (a 31.2% relative increase) and that of ANN from 79.75 to 91.32 (a 14.5% relative increase), outperforming the other resampling methods. These findings highlight the crucial role of resampling techniques and of careful algorithm selection in improving classification accuracy on imbalanced datasets.
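To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows how an F-Measure is computed from confusion counts, how a relative gain such as those quoted for RF and ANN is derived, and what plain random undersampling does to class counts. The function names, the toy label vector, and the label encoding (1 for a true SNP, 0 for a non-SNP candidate) are all assumptions made for illustration. Borderline-SMOTE and Tomek Links are not reproduced here.

```python
import random
from collections import Counter


def f_measure(tp, fp, fn):
    """F-Measure (F1): harmonic mean of precision and recall, on a 0-1 scale."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_gain(before, after):
    """Relative change from a baseline score to a new score, in percent."""
    return (after - before) / before * 100


def random_undersample(labels, seed=0):
    """Return indices that keep the smallest class whole and an
    equal-sized random subset of every larger class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(idx if len(idx) == n_min else rng.sample(idx, n_min))
    return sorted(kept)


# Hypothetical toy labels (assumption): 1 = true SNP, 0 = non-SNP candidate.
labels = [0] * 90 + [1] * 10
kept = random_undersample(labels)
balanced = Counter(labels[i] for i in kept)

# Relative gains computed from the rounded F-Measures quoted in the abstract.
rf_gain = relative_gain(69.72, 91.52)
ann_gain = relative_gain(79.75, 91.32)

print(balanced)
print(f"RF gain: {rf_gain:.1f}%  ANN gain: {ann_gain:.1f}%")
```

With the rounded scores quoted in the abstract, the relative gains come out at roughly 31.3% for RF and 14.5% for ANN. The small difference from the reported 31.2% may be due to rounding or truncation of the underlying scores.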

Keywords

Artificial Neural Network; Borderline-SMOTE; Imbalanced Classification; Random Forest; SNP Identification

Office and Secretariat:

Big Data Research Centre
Puzzle Research Data Technology (Predatech)
Laboratory Building 1st Floor of Faculty of Science and Technology
UIN Sultan Syarif Kasim Riau

Jl. HR. Soebrantas KM. 18.5 No. 155 Pekanbaru Riau – 28293
Website: http://predatech.uin-suska.ac.id/ijaidm
Email: ijaidm@uin-suska.ac.id
e-Journal: http://ejournal.uin-suska.ac.id/index.php/ijaidm
Phone: 085275359942