Software complex for simulation modelling of single nucleotide genetic polymorphism sites
https://doi.org/10.37661/1816-0301-2025-22-2-81-94
Abstract
Objectives. High-throughput sequencing methods have recently become widely used in fundamental and applied research on various human diseases. Sequencing of functionally significant regions of the human genome enables the simultaneous identification of multiple genetic polymorphism sites that have diagnostic and/or prognostic significance for human genetic diseases. One of the key goals in this area is to develop efficient software tools for processing genomic data and identifying single nucleotide polymorphism sites using computer modelling and big data analysis methods.
Methods. A software complex has been developed for simulation modelling and identification of single nucleotide polymorphism sites using machine learning methods. Single nucleotide polymorphism sites in DNA molecules are simulated using beta or normal distributions whose parameters are determined from the available experimental data. Machine learning models are then trained on the simulated data and used to accurately identify single nucleotide polymorphism sites. The software complex includes an R package, a web application, and auxiliary computational tools for processing experimental genomic sequencing data.
Results. The performance of the developed software complex was tested on sets of simulated and experimental data from human cell genomic sequencing. A comparative analysis of the most effective algorithms for identifying single nucleotide polymorphism sites was performed. The best results were obtained for machine learning models.
Conclusion. The use of the software complex increases the accuracy of identifying genetic polymorphism sites in the analysis of big genomic sequencing data. The software can be used to model synthetic data, either based on experimental data or independently, for comprehensive testing and selection of the best algorithms for identifying single nucleotide polymorphisms, and for generative data modelling when training identification algorithms based on machine learning methods.
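The workflow summarised in the Methods paragraph can be illustrated with a minimal Python sketch. It is not the authors' R package or its API: the function names, the choice of the alternative-allele fraction as the only site feature, the beta parameters, and the simple threshold classifier (a stand-in for the machine learning models) are all assumptions made for illustration. The "observed" values are synthetic stand-ins, not experimental data.

```python
import random
import statistics


def fit_beta_moments(fractions):
    """Method-of-moments estimate of Beta(alpha, beta) from values in (0, 1)."""
    mean = statistics.fmean(fractions)
    var = statistics.variance(fractions)
    if var <= 0.0 or var >= mean * (1.0 - mean):
        raise ValueError("sample moments are incompatible with a beta distribution")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


def simulate_sites(n, alpha, beta, label, rng):
    """Draw n alternative-allele fractions from Beta(alpha, beta), tagged with label."""
    return [(rng.betavariate(alpha, beta), label) for _ in range(n)]


def train_threshold(samples):
    """Choose the allele-fraction cut-off that maximises training accuracy."""
    best_cut, best_correct = 0.0, -1
    for cut, _ in samples:
        correct = sum((x >= cut) == (y == 1) for x, y in samples)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut


def accuracy(samples, cut):
    """Fraction of sites whose predicted class (x >= cut) matches the label."""
    return sum((x >= cut) == (y == 1) for x, y in samples) / len(samples)


def run_demo(seed=1):
    rng = random.Random(seed)
    # Synthetic stand-ins for experimentally observed allele fractions (NOT real data):
    # label 0 = non-polymorphic site, label 1 = heterozygous polymorphic site.
    observed_ref = [rng.betavariate(2, 50) for _ in range(500)]
    observed_snp = [rng.betavariate(20, 20) for _ in range(500)]

    # Step 1: determine distribution parameters from the "experimental" values.
    ref_params = fit_beta_moments(observed_ref)
    snp_params = fit_beta_moments(observed_snp)

    # Step 2: simulate labelled sites from the fitted distributions.
    train = (simulate_sites(400, *ref_params, 0, rng)
             + simulate_sites(400, *snp_params, 1, rng))
    test = (simulate_sites(200, *ref_params, 0, rng)
            + simulate_sites(200, *snp_params, 1, rng))

    # Step 3: train a classifier on simulated data and evaluate it on held-out sites.
    cut = train_threshold(train)
    return ref_params, snp_params, cut, accuracy(test, cut)


if __name__ == "__main__":
    ref_params, snp_params, cut, acc = run_demo()
    print("fitted non-SNP parameters:", ref_params)
    print("fitted SNP parameters:", snp_params)
    print("learned cut-off:", cut, "held-out accuracy:", acc)
```

A real pipeline would presumably use several site features and a stronger classifier; the one-feature threshold is used here only to keep the simulate-then-train loop visible.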
About the Authors
M. M. Yatskou
Belarus
Mikalai M. Yatskou, Ph. D. (Phys.-Math.), Assoc. Prof., Department of Systems Analysis and Computer Modelling, Faculty of Radiophysics and Computer Technologies
av. Nezavisimosti, 4, Minsk, 220030
D. D. Sarnatski
Belarus
Dzianis D. Sarnatski, Student, Department of Systems Analysis and Computer Modelling, Faculty of Radiophysics and Computer Technologies
av. Nezavisimosti, 4, Minsk, 220030
V. V. Skakun
Belarus
Victor V. Skakun, Ph. D. (Phys.-Math.), Assoc. Prof., Head of Department of Systems Analysis and Computer Modelling, Faculty of Radiophysics and Computer Technologies
av. Nezavisimosti, 4, Minsk, 220030
V. V. Grinev
Belarus
Vasily V. Grinev, Ph. D. (Biol.), Assoc. Prof., Department of Genetics, Faculty of Biology
av. Nezavisimosti, 4, Minsk, 220030
For citations:
Yatskou M.M., Sarnatski D.D., Skakun V.V., Grinev V.V. Software complex for simulation modelling of single nucleotide genetic polymorphism sites. Informatics. 2025;22(2):81-94. (In Russ.) https://doi.org/10.37661/1816-0301-2025-22-2-81-94


















