Evaluation of F-Measure and Feature Analysis of C5.0 Implementation on Single Nucleotide Polymorphism Calling

Lailan Sahrina Hasibuan, Sita Nabila, Nurul Hudachair, Muhammad Abrar Istiadi

Abstract


The volume of molecular biology data has grown rapidly since Next-Generation Sequencing (NGS), the latest high-throughput DNA sequencing technology, was introduced in 2000. A Single Nucleotide Polymorphism (SNP) is a DNA-based marker that can be used to identify an organism specifically. In plant breeding, SNPs are commonly used to optimize parent selection for producing high-quality seed. This paper discusses SNP calling on NGS data of cultivated soybean (Glycine max (L.) Merr.) using C5.0, an improved successor of the rule-based algorithm C4.5. The evaluation shows that C5.0 outperforms CART, another rule-based algorithm, on F-measure: 0.63 for C5.0 versus 0.58 for CART. In addition, C5.0 is robust to imbalanced training data up to a ratio of 1:17, but its performance suffers on large training datasets. C5.0's performance may be improved by applying bagging or another ensemble technique, as CART was improved by applying bagging to the final decision. Choosing appropriate features to represent SNP candidates is also important. Based on the information gain computed by C5.0, this paper recommends error probability, homopolymer left, mismatch alt, and mean nearby qual as features for SNP calling.
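The two quantities the abstract relies on, F-measure and information gain, can be illustrated with a minimal sketch. This is not the authors' pipeline. The function names, the toy labels, and the single-threshold split are assumptions made for illustration, and the gain shown is plain entropy-based information gain, not necessarily the exact splitting criterion C5.0 applies internally.

```python
import math
from collections import Counter


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (SNP) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a numeric feature at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


# Hypothetical toy data with a 1:17 class ratio (2 SNPs, 34 non-SNPs).
y_true = [1, 1] + [0] * 34
all_negative = [0] * 36                # trivial classifier: never calls a SNP
one_hit = [1, 0, 1] + [0] * 33         # one true positive, one miss, one false call

accuracy = sum(t == p for t, p in zip(y_true, all_negative)) / len(y_true)
print(f"accuracy of trivial classifier:  {accuracy:.3f}")
print(f"F-measure of trivial classifier: {f_measure(y_true, all_negative):.3f}")
print(f"F-measure with one hit:          {f_measure(y_true, one_hit):.3f}")

# Hypothetical feature values that separate the two classes perfectly at 0.5.
print(information_gain([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], threshold=0.5))
```

On imbalanced data such as this, a classifier that never predicts a SNP still scores high accuracy while its F-measure is zero, which is one reason F-measure is a reasonable basis for comparing classifiers in this setting.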




