<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">inform</journal-id><journal-title-group><journal-title xml:lang="ru">Информатика</journal-title><trans-title-group xml:lang="en"><trans-title>Informatics</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1816-0301</issn><issn pub-type="epub">2617-6963</issn><publisher><publisher-name>UIIP NASB</publisher-name></publisher></journal-meta><article-meta><article-id custom-type="elpub" pub-id-type="custom">inform-471</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>БИОИНФОРМАТИКА</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>BIOINFORMATICS</subject></subj-group></article-categories><title-group><article-title>Разработка алгоритмов и программных средств классификации кодирующих и некодирующих нуклеотидных последовательностей</article-title><trans-title-group xml:lang="en"><trans-title>Development of algorithms and software for classification of nucleotide sequences</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Закирова</surname><given-names>В. Р.</given-names></name><name name-style="western" xml:lang="en"><surname>Zakirava</surname><given-names>V. R.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Закирова Вероника Рашидовна, магистрант, кафедра системного анализа и компьютерного моделирования, факультет радиофизики и компьютерных технологий</p></bio><bio xml:lang="en"><p>Veranika R. 
Zakirava, Master Student, Department of Systems Analysis and Computer Modelling, Faculty of Radiophysics and Computer Technologies</p></bio><email xlink:type="simple">veranika.zakirava@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Сырокваш</surname><given-names>Д. А.</given-names></name><name name-style="western" xml:lang="en"><surname>Syrakvash</surname><given-names>D. A.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Сырокваш Дмитрий Алексеевич, магистр, кафедра системного анализа и компьютерного моделирования, факультет радиофизики и компьютерных технологий</p></bio><bio xml:lang="en"><p>Dzmitry A. Syrakvash, Master, Department of Systems Analysis and Computer Modelling, Faculty of Radiophysics and Computer Technologies</p></bio><email xlink:type="simple">dzmitry.syrakvash@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Гилевский</surname><given-names>С. В.</given-names></name><name name-style="western" xml:lang="en"><surname>Hileuski</surname><given-names>S. V.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Гилевский Станислав Викентьевич, доцент, кандидат технических наук, кафедра системного анализа и компьютерного моделирования, факультет радиофизики и компьютерных технологий</p></bio><bio xml:lang="en"><p>Stanislau V. Hileuski, Associate Professor, Cand. Sci. (Eng.), Department of Systems Analysis and Computer Modelling, Faculty of Radiophysics and Computer Technologies</p></bio><email xlink:type="simple">Hileuski@bsu.by</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Назаров</surname><given-names>П. 
В.</given-names></name><name name-style="western" xml:lang="en"><surname>Nazarov</surname><given-names>P. V.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Назаров Петр Владимирович, кандидат физико-математических наук, отдел исследования протеома и генома</p><p>отделение онкологии</p></bio><bio xml:lang="en"><p>Petr V. Nazarov, PhD, Scientist, Proteome and Genome Research Unit</p><p>Department of Oncology (1A-B, rue Thomas Edison, L-1445 Strassen, Luxembourg)</p></bio><email xlink:type="simple">petr.nazarov@lih.lu</email><xref ref-type="aff" rid="aff-2"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Яцков</surname><given-names>Н. Н.</given-names></name><name name-style="western" xml:lang="en"><surname>Yatskou</surname><given-names>M. M.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Яцков Николай Николаевич, доцент, кандидат физико-математических наук, кафедра системного анализа и компьютерного моделирования, факультет радиофизики и компьютерных технологий</p></bio><bio xml:lang="en"><p>Mikalai M. Yatskou, Associate Professor, Cand. Sci. 
(Phys.-Math.), Department of Systems Analysis and Computer Modelling, Faculty of Radiophysics and Computer Technologies</p></bio><email xlink:type="simple">yatskou@bsu.by</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Белорусский государственный университет</institution></aff><aff xml:lang="en"><institution>Belarusian State University</institution></aff></aff-alternatives><aff-alternatives id="aff-2"><aff xml:lang="ru"><institution>Люксембургский институт здоровья</institution></aff><aff xml:lang="en"><institution>Luxembourg Institute of Health</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2019</year></pub-date><pub-date pub-type="epub"><day>14</day><month>02</month><year>2019</year></pub-date><volume>16</volume><issue>2</issue><fpage>109</fpage><lpage>118</lpage><permissions><copyright-statement>Copyright &#x00A9; Закирова В.Р., Сырокваш Д.А., Гилевский С.В., Назаров П.В., Яцков Н.Н., 2019</copyright-statement><copyright-year>2019</copyright-year><copyright-holder xml:lang="ru">Закирова В.Р., Сырокваш Д.А., Гилевский С.В., Назаров П.В., Яцков Н.Н.</copyright-holder><copyright-holder xml:lang="en">Zakirava V.R., Syrakvash D.A., Hileuski S.V., Nazarov P.V., Yatskou M.M.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://inf.grid.by/jour/article/view/471">https://inf.grid.by/jour/article/view/471</self-uri><abstract><p>Проведено исследование 
кодирующих и некодирующих нуклеотидных последовательностей референсного генома человека. Разработаны семь моделей векторизации нуклеотидных последовательностей на основе частот моно-, би- и триграммов нуклеотидов, параметров модели частот и позиций сочетаний нуклеотидов (category-position-frequency model), длин последовательностей, корреляционных факторов нуклеотидов, статистических признаков кодирующих и некодирующих участков молекул ДНК. Определены наиболее информативные признаки моделей векторизации с использованием алгоритмов автоматического выбора признаков и классификации на основе методов случайного леса и опорных векторов. Установлено различие кодирующих и некодирующих фрагментов нуклеотидных последовательностей. Ошибка классификации последовательностей с использованием метода случайного леса на наборе из 23 наиболее информативных признаков составила 2,93 %.</p></abstract><trans-abstract xml:lang="en"><p>Coding and non-coding nucleotide sequences of the human reference genome were investigated. Seven models for vectorizing nucleotide sequences were developed, based on mono-, bi-, and trigram nucleotide frequencies, parameters of the category-position-frequency model, sequence lengths, nucleotide correlation factors, and statistical features of coding and non-coding regions of DNA molecules. The most informative features of the vectorization models were determined using automatic feature selection and classification algorithms based on the random forest and support vector machine methods. A difference between coding and non-coding fragments of nucleotide sequences was established. 
The classification error for coding and non-coding sequences using the random forest method on the set of the 23 most informative features was 2.93%.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>ДНК</kwd><kwd>экзон</kwd><kwd>интрон</kwd><kwd>классификация</kwd><kwd>метод случайного леса</kwd><kwd>метод опорных векторов</kwd><kwd>алгоритмы автоматического отбора информативных признаков</kwd><kwd>программирование на языке R</kwd></kwd-group><kwd-group xml:lang="en"><kwd>DNA</kwd><kwd>exon</kwd><kwd>intron</kwd><kwd>classification</kwd><kwd>Random Forests</kwd><kwd>Support Vector Machine</kwd><kwd>feature selection</kwd><kwd>R programming</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Edwards, D. J. Beginner's guide to comparative bacterial genome analysis using next-generation sequence data / D. J. Edwards, K. E. Holt // Microbial Informatics and Experimentation. – 2013. – Vol. 3:2. – P. 1–9.</mixed-citation><mixed-citation xml:lang="en">Edwards D. J., Holt K. E. Beginner's guide to comparative bacterial genome analysis using next-generation sequence data. Microbial Informatics and Experimentation, 2013, vol. 3:2, pp. 1–9.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Bao, J. An improved alignment-free model for DNA sequence similarity metric / J. Bao, R. Yuan, Z. Bao // BMC Bioinformatics. – 2014. – Vol. 15:321. – P. 1–15.</mixed-citation><mixed-citation xml:lang="en">Bao J., Yuan R., Bao Z. An improved alignment-free model for DNA sequence similarity metric. BMC Bioinformatics, 2014, vol. 15:321, pp. 1–15.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Li, C. Relative entropy of DNA and its application / C. Li, J. Wang // Physica A. – 2005. – Vol. 347. – P. 
465–471.</mixed-citation><mixed-citation xml:lang="en">Li C., Wang J. Relative entropy of DNA and its application. Physica A, 2005, vol. 347, pp. 465–471.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison / Q. Dai [et al.] // J. of Theoretical Biology. – 2011. – Vol. 276. – P. 174–180.</mixed-citation><mixed-citation xml:lang="en">Dai Q., Liu X., Yao Y., Zhao F. Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison. Journal of Theoretical Biology, 2011, vol. 276, pp. 174–180.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Liu, L. Clustering DNA sequences by feature vectors / L. Liu, Y. K. Ho, S. Yau // Mol Phylogenet Evol. – 2006. – Vol. 41. – P. 64–69.</mixed-citation><mixed-citation xml:lang="en">Liu L., Ho Y. K., Yau S. Clustering DNA sequences by feature vectors. Mol Phylogenet Evol, 2006, vol. 41, pp. 64–69.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Wang, J. Wse, a new sequence distance measure based on word frequencies / J. Wang, X. Zheng // Mathematical Biosciences. – 2008. – Vol. 215. – P. 78–83.</mixed-citation><mixed-citation xml:lang="en">Wang J., Zheng X. Wse, a new sequence distance measure based on word frequencies. Mathematical Biosciences, 2008, vol. 215, pp. 78–83.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Zhao, B. A new distribution vector and its application in genome clustering / B. Zhao, R. L. He, S. T. Yau // Mol Phylogenet Evol. – 2011. – Vol. 59. – P. 438–443.</mixed-citation><mixed-citation xml:lang="en">Zhao B., He R. L., Yau S. 
T. A new distribution vector and its application in genome clustering. Mol Phylogenet Evol, 2011, vol. 59, pp. 438–443.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Application of high-dimensional feature selection: evaluation for genomic prediction in man / M. L. Bermingham [et al.] // Scientific Reports. – 2015. – Vol. 5:10312. – P. 1–12.</mixed-citation><mixed-citation xml:lang="en">Bermingham M. L., Pong-Wong R., Spiliopoulou A., Hayward C., Rudan I., …, Haley C. S. Application of high-dimensional feature selection: evaluation for genomic prediction in man. Scientific Reports, 2015, vol. 5:10312, pp. 1–12.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">GFF/GTF File Format – Definition and Supported Options [Electronic resource]. – 2014. – Mode of access: www.ensembl.org/info/website/upload/gff.html. – Date of access: 16.10.2014.</mixed-citation><mixed-citation xml:lang="en">GFF/GTF File Format – Definition and Supported Options, 2014. Available at: www.ensembl.org/info/website/upload/gff.html (accessed 16.10.2014).</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Comparative analyses between retained introns and constitutively spliced introns in Arabidopsis thaliana using random forest and support vector machine / R. Mao [et al.] // PLoS One. – 2014. – Vol. 9, no. 8. – P. 1–12.</mixed-citation><mixed-citation xml:lang="en">Mao R., Kumar P. K. R., Guo C., Zhang Y., Liang C. Comparative analyses between retained introns and constitutively spliced introns in Arabidopsis thaliana using random forest and support vector machine. PLoS One, 2014, vol. 9, no. 8, pp. 
1–12.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Разработка алгоритмов и автоматизированных программных средств для классификации кодирующих и некодирующих нуклеотидных последовательностей / Д. А. Сырокваш [и др.] // Междунар. конгресс по информатике: информационные системы и технологии : материалы конгресса, Минск, 24–27 окт. 2016 г. ; редкол.: С. В. Абламейко [и др.]. – Минск : БГУ, 2016. – С. 189–193.</mixed-citation><mixed-citation xml:lang="en">Syrakvash D. A., Jackov N. N., Nazarov P. V., Skakun V. V. Razrabotka algoritmov i avtomatizirovannyh programmnyh sredstv dlya klassifikacii kodirujushchih i nekodiruyushchih nukleotidnyh posledovatel’nostey [Development of algorithms and automated software for the classification of coding and non-coding nucleotide sequences]. Mejdunarodnyi congress po informatike: informacionnye sistemy i tehnologii [International Congress on Informatics: Information Systems and Technologies]. Minsk, Belorusskij gosudarstvennyj universitet, 2016, pp. 189–193 (in Russian).</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Do we need hundreds of classifiers to solve real world classification problems? / M. Fernández-Delgado [et al.] // J. of Machine Learning Research. – 2014. – Vol. 15. – P. 3133–3181.</mixed-citation><mixed-citation xml:lang="en">Fernández-Delgado M., Cernadas E., Barro S., Amorim D. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 2014, vol. 15, pp. 3133–3181.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Liaw, A. Breiman and Cutler’s Random Forests for Classification and Regression [Electronic resource] / A. Liaw, M. Wiener. – 2016. 
– Mode of access: http://www.stat.berkley.edu/~breiman/RandomForest/cc_home.htm#workings. – Date of access: 11.02.2016.</mixed-citation><mixed-citation xml:lang="en">Liaw A., Wiener M. Breiman and Cutler’s Random Forests for Classification and Regression, 2016. Available at: http://www.stat.berkley.edu/~breiman/RandomForest/cc_home.htm#workings (accessed 11.02.2016).</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Breiman, L. Random forests / L. Breiman // Machine Learning. – 2001. – Vol. 45(1). – P. 5–32.</mixed-citation><mixed-citation xml:lang="en">Breiman L. Random forests. Machine Learning, 2001, vol. 45(1), pp. 5–32.</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Вапник, В. Н. Восстановление зависимостей по эмпирическим данным / В. Н. Вапник. – М. : Наука, 1979. – 448 с.</mixed-citation><mixed-citation xml:lang="en">Vapnik V. N. Vosstanovlenie zavisimostey po empiricheskim dannym [Recovering Dependencies from Empirical Data]. Moscow, Nauka, 1979, 448 p. (in Russian).</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Вьюгин, В. В. Математические основы машинного обучения и прогнозирования / В. В. Вьюгин. – М. : МЦНМО, 2014. – 304 с.</mixed-citation><mixed-citation xml:lang="en">V’ugin V. V. Matematicheskie osnovy mashinnogo obucheniya i prognozirovaniya [Mathematical Foundations of Machine Learning and Prediction]. Moscow, Moskovskij centr nepreryvnogo matematicheskogo obrazovanija, 2014, 304 p. (in Russian).</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">Мастицкий, С. Э. Статистический анализ и визуализация данных с помощью R [Электронный ресурс] / С. Э. Мастицкий, В. К. Шитиков. – 2014. 
– Режим доступа: http://r-analytics.blogspot.com. – Дата доступа: 13.03.2015.</mixed-citation><mixed-citation xml:lang="en">Mastickiy S. E., Shitikov V. K. Statisticheskiy analiz i vizualizaciya dannyh s pomoshchju R [Statistical Analysis and Data Visualization with R], 2014. Available at: http://r-analytics.blogspot.com (accessed 13.03.2015) (in Russian).</mixed-citation></citation-alternatives></ref><ref id="cit18"><label>18</label><citation-alternatives><mixed-citation xml:lang="ru">Advancing Feature Selection Research – ASU Feature Selection Repository [Electronic resource] / Z. Zhao [et al.]. – 2010. – Mode of access: https://www.researchgate.net/publication/305083748_Advancing_feature_selection_research. – Date of access: 10.04.2019.</mixed-citation><mixed-citation xml:lang="en">Zhao Z., Sharma S., Morstatter F., Alelyani S. Advancing Feature Selection Research – ASU Feature Selection Repository, 2010. Available at: https://www.researchgate.net/publication/305083748_Advancing_feature_selection_research (accessed 10.04.2019).</mixed-citation></citation-alternatives></ref><ref id="cit19"><label>19</label><citation-alternatives><mixed-citation xml:lang="ru">Kuhn, M. The Caret Package [Electronic resource] / M. Kuhn. – 2017. – Mode of access: https://topepo.github.io/caret. – Date of access: 11.04.2017.</mixed-citation><mixed-citation xml:lang="en">Kuhn M. The Caret Package, 2017. Available at: https://topepo.github.io/caret (accessed 11.04.2017).</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare no conflicts of interest.</p></fn></fn-group></back></article>
