Development of algorithms and software for classification of nucleotide sequences
Abstract
Coding and non-coding nucleotide sequences of the human reference genome were investigated. Seven vectorization models for nucleotide sequences were developed, based on mono-, bi-, and trigram nucleotide frequencies, parameters of the category-position-frequency model, sequence lengths, nucleotide correlation factors, and statistical features of coding and non-coding regions of DNA molecules. The most informative features of the vectorization models were determined using feature selection and classification algorithms based on the random forest and support vector machine methods. A difference between coding and non-coding fragments of nucleotide sequences was established. The classification error for coding and non-coding sequences using the random forest method on the set of the 23 most informative features is 2.93 %.
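As an illustration of the simplest of the feature families named in the abstract, the following minimal Python sketch turns a nucleotide sequence into a vector of mono-, bi-, and trigram relative frequencies plus the sequence length. It is not the authors' implementation: the function names `ngram_frequencies` and `vectorize`, the sliding-window counting, and the skipping of windows with ambiguous bases are assumptions made for this example, and the category-position-frequency parameters and correlation factors are not reproduced.

```python
from itertools import product

ALPHABET = "ACGT"


def ngram_frequencies(sequence, n):
    """Relative frequencies of all 4**n n-grams, in lexicographic order.

    Hypothetical helper: overlapping windows are counted, and windows that
    contain a symbol outside ACGT (for example N) are skipped.
    """
    sequence = sequence.upper()
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=n)}
    total = 0
    for i in range(len(sequence) - n + 1):
        gram = sequence[i:i + n]
        if gram in counts:
            counts[gram] += 1
            total += 1
    if total == 0:
        # No valid window (empty or too short sequence): return all zeros.
        return [0.0] * len(counts)
    return [counts[g] / total for g in sorted(counts)]


def vectorize(sequence):
    """Concatenate mono-, bi-, trigram frequencies and the sequence length.

    The result has 4 + 16 + 64 + 1 = 85 features (an assumed layout).
    """
    features = []
    for n in (1, 2, 3):
        features.extend(ngram_frequencies(sequence, n))
    features.append(float(len(sequence)))
    return features
```

Vectors built this way for labelled coding and non-coding fragments could then be passed to any random forest or support vector machine implementation for feature ranking and classification.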
About the Authors
V. R. Zakirava
Belarus
Veranika R. Zakirava, Master Student, Department of Systems Analysis and Computer Modelling, Faculty of Radiophysics and Computer Technologies
D. A. Syrakvash
Belarus
Dzmitry A. Syrakvash, Master, Department of Systems Analysis and Computer Modelling, Faculty of Radiophysics and Computer Technologies
S. V. Hileuski
Belarus
Stanislau V. Hileuski, Associate Professor, Cand. Sci. (Eng.), Department of Systems Analysis and Computer Modelling, Faculty of Radiophysics and Computer Technologies
P. V. Nazarov
Luxembourg
PhD, Scientist, Proteome and Genome Research Unit
Department of Oncology (1A-B, rue Thomas Edison, L-1445 Strassen, Luxembourg)
M. M. Yatskou
Belarus
Mikalai M. Yatskou, Associate Professor, Cand. Sci. (Phys.-Math.), Department of Systems Analysis and Computer Modelling, Faculty of Radiophysics and Computer Technologies
For citations:
Zakirava V.R., Syrakvash D.A., Hileuski S.V., Nazarov P.V., Yatskou M.M. Development of algorithms and software for classification of nucleotide sequences. Informatics. 2019;16(2):109-118. (In Russ.)