
Informatics


Development of algorithms and software for classification of nucleotide sequences

Abstract

Coding and non-coding nucleotide sequences of the human reference genome were investigated. Seven vectorization models for nucleotide sequences were developed, based on mono-, bi-, and trigram nucleotide frequencies, parameters of the category-position-frequency model, sequence lengths, nucleotide correlation factors, and statistical features of coding and non-coding regions of DNA molecules. The most informative features of the vectorization models were determined using feature selection and classification algorithms based on the random forest and support vector machine methods. A difference between coding and non-coding fragments of nucleotide sequences was established. The classification error for coding versus non-coding sequences using the random forest method on the set of the 23 most informative features is 2.93%.
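As an illustration of the simplest of these vectorization models, the sketch below computes mono-, bi-, and trigram nucleotide frequencies and appends the sequence length. It is a minimal example under stated assumptions, not the authors' implementation: the function names `kmer_frequencies` and `vectorize`, the overlapping-window counting, the skipping of windows that contain symbols outside A, C, G, T, and the feature ordering are all hypothetical choices made for this example.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only the four canonical nucleotides are counted


def kmer_frequencies(seq, k):
    """Return relative frequencies of all 4**k k-mers in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide an overlapping window of width k along the sequence.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous symbols such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def vectorize(seq):
    """Concatenate mono-, bi-, and trigram frequencies plus the sequence length."""
    features = []
    for k in (1, 2, 3):
        features.extend(kmer_frequencies(seq, k))
    features.append(float(len(seq)))  # sequence length as one extra feature
    return features
```

With these choices each sequence maps to a vector of 4 + 16 + 64 + 1 = 85 numbers, which could then be passed to a feature selection or classification routine of the kind the abstract mentions.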

For citations:


Zakirava V.R., Syrakvash D.A., Hileuski S.V., Nazarov P.V., Yatskou M.M. Development of algorithms and software for classification of nucleotide sequences. Informatics. 2019;16(2):109-118. (In Russ.)



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1816-0301 (Print)
ISSN 2617-6963 (Online)