inform

Informatics

ISSN 1816-0301, 2617-6963

UIIP NASB

10.37661/1816-0301-2026-23-1-26-38

inform-1395

Research Article

INTELLIGENT SYSTEMS

BelLitGPT – language model technologies for the Belarusian language

Bandalouski

Andrei M.

Andrei M. Bandalouski, Cand. Sci. (Econ.), Head of Laboratory of Speech Synthesis and Recognition

st. Surganova, 6, Minsk, 220012

a.bandalouski@newman.bas-net.by

Lyakhov

Dmitry A.

Dmitry A. Lyakhov, Cand. Sci. (Phys.-Math.), Senior Researcher

st. Surganova, 6, Minsk, 220012

dlyakhov@newman.bas-net.by

Kruglikov

Sergey V.

Sergey V. Kruglikov, Dr. Sci. (Milit.), Cand. Sci. (Eng.), Assoc. Prof., Principal Researcher

st. Surganova, 6, Minsk, 220012

kruglikov_s@newman.bas-net.by

Shulgan

Konstantin K.

Konstantin K. Shulgan, Deputy General Director for Digital Development

st. Surganova, 6, Minsk, 220012

skk@newman.bas-net.by

The United Institute of Informatics Problems of the National Academy of Sciences of Belarus

2026

27.03.2026

Vol. 23, no. 1, pp. 26–38

2026

Bandalouski A.M., Lyakhov D.A., Kruglikov S.V., Shulgan K.K.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://inf.grid.by/jour/article/view/1395

Objectives

This work concerns specialized generative neural networks for the Belarusian language. The aim is to take a first step towards building a national generative language model.

Methods

The paper describes the development of the BelLitGPT model (700 million parameters), which is based on transfer learning from the Russian-language model ruGPT-3 and consists of three stages: corpus preparation, tokenizer adaptation, and model training. The training corpus is compiled from canonical works of classic Belarusian prose and preprocessed Wikipedia articles. The paper details the tokenizer adaptation method, which expands the vocabulary with specifically Belarusian lexemes, as well as the model training and testing process.
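The idea of expanding a tokenizer vocabulary with Belarusian lexemes can be illustrated with a minimal sketch. It is not the authors' procedure, which this abstract does not specify: the greedy longest-match splitter, the toy vocabulary, the function names, and the thresholds `min_count` and `min_pieces` are all assumptions made for illustration. The sketch appends frequent corpus words that the base vocabulary breaks into many pieces.

```python
from collections import Counter


def greedy_pieces(word, vocab):
    """Split a word into the longest pieces found in vocab.

    Hypothetical stand-in for a real subword tokenizer: any single
    character is accepted as a fallback piece.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def extend_vocab(vocab, corpus_words, min_count=2, min_pieces=3):
    """Return a copy of vocab with frequent, heavily fragmented words appended."""
    extended = dict(vocab)
    for word, count in Counter(corpus_words).most_common():
        if count < min_count or word in extended:
            continue
        if len(greedy_pieces(word, vocab)) >= min_pieces:
            extended[word] = len(extended)  # new ids follow the existing ones
    return extended


# Toy data (assumed): a tiny base vocabulary without the letter "ў".
base_vocab = {"пра": 0, "да": 1, "мо": 2, "ва": 3, "с": 4}
words = ["мова", "мова", "праўда", "праўда", "праўда", "сад"]
new_vocab = extend_vocab(base_vocab, words)
```

In this toy run only "праўда" is added, because it is frequent and splits into three pieces. With a real model, the embedding matrix must also grow to cover the new ids; in the Hugging Face Transformers library this is commonly done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, though whether BelLitGPT was built this way is not stated here.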

Results

The results confirm that BelLitGPT can generate coherent, grammatically and stylistically correct texts. Special attention is given to a hybrid neuro-symbolic approach for generating quatrains that observe rhythm and rhyme.
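One way to picture a neuro-symbolic pipeline for quatrains is a neural model that proposes candidates and a rule-based filter that keeps only those satisfying formal constraints. The sketch below shows only the symbolic filter and rests on loudly stated assumptions: an ABAB rhyme scheme, syllables approximated by counting vowel letters, and rhyme approximated by matching the last two letters. The candidate lines are toy strings, not model output, and the abstract does not describe the authors' actual rules.

```python
# Belarusian vowel letters; "ў" and "й" are non-syllabic and excluded.
VOWELS = set("аеёіоуыэюя")


def syllables(line):
    """Approximate the syllable count as the number of vowel letters."""
    return sum(ch in VOWELS for ch in line.lower())


def rhyme_key(line, size=2):
    """Crude rhyme key: the last `size` letters of the line (stress is ignored)."""
    letters = [ch for ch in line.lower() if ch.isalpha()]
    return "".join(letters[-size:])


def is_abab_quatrain(lines, tolerance=1):
    """Check four lines for ABAB rhyme and matching line lengths in syllables."""
    if len(lines) != 4:
        return False
    counts = [syllables(line) for line in lines]
    keys = [rhyme_key(line) for line in lines]
    metre_ok = (abs(counts[0] - counts[2]) <= tolerance
                and abs(counts[1] - counts[3]) <= tolerance)
    rhyme_ok = keys[0] == keys[2] and keys[1] == keys[3]
    return metre_ok and rhyme_ok


# Toy candidates (assumed); in a real pipeline a language model would supply them.
candidates = [
    ["над ракой туман", "ціхая рака", "спіць стары курган", "цёплая рука"],
    ["над ракой туман", "ціхая рака", "спіць стары курган", "цёплая зіма"],
]
accepted = [c for c in candidates if is_abab_quatrain(c)]
```

Only the first candidate passes, since the second breaks the rhyme in its last line. A production filter would need stress-aware rhyme detection and a proper metre check, which this sketch deliberately omits.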

Conclusion

An experiment on scaling the architecture revealed difficulties in training a larger model (13 billion parameters) when training data are scarce.

Keywords: large language models (LLM), transfer learning, neuro-symbolic approach, poetry generation, BelLitGPT model, Belarusian language

The authors declare no conflicts of interest.