Informatics

LINGUISTIC ANALYSIS FOR THE BELARUSIAN CORPUS WITH THE APPLICATION OF NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING TECHNIQUES

Abstract

The article addresses problems in text-to-speech synthesis. Morphological, lexical and syntactic elements were located using the Belarusian module of the NooJ program. The types of errors that occur in Belarusian texts were analyzed and corrected. A language model and a part-of-speech tagging model were built. The Belarusian corpus was processed with the developed algorithm, which uses machine learning. The precision of the developed machine learning models was 80–90 %. The dictionary was enriched with new words for further use in Belarusian speech synthesis systems.

About the Authors

Yu. S. Hetsevich
United Institute of Informatics Problems, National Academy of Sciences of Belarus
Belarus


I. V. Reentovich
United Institute of Informatics Problems, National Academy of Sciences of Belarus
Belarus


References

1. Kennedy, G. An Introduction to Corpus Linguistics / G. Kennedy. – London : Longman, 1998. – 315 p.

2. Belarusian N-corpus [Electronic resource]. – 2015. – Mode of access : http://bnkorpus.info/. – Date of access : 22.06.2017.

3. Барковіч, А.А. Беларускі корпус тэкстаў : інтэрнэт-дыскурс / А.А. Барковіч // Веснік Беларус. дзярж. ун-та. Сер. 4. Філалогія. Журналістыка. Педагогіка. – 2013. – № 2. – С. 26–29.

4. The First One-Million Corpus for the Belarusian NooJ Module / I. Reentovich [et al.] // Automatic Processing of Natural-Language Electronic Texts with NooJ : 9th Intern. Conf. «NooJ 2015». – Springer International Publishing, 2016. – P. 3–15.

5. Холоденко, А.Б. Использование лексических и синтаксических анализаторов в задачах распознавания для естественных языков / А.Б. Холоденко // Интеллектуальные системы. – 1999. – № 1–2. – С. 185–193.

6. Автоматическая обработка текстов на естественном языке и анализ данных / Е.И. Большакова [и др.]. – М. : Изд-во НИУ ВШЭ, 2017. – 269 с.

7. Silberztein, M. NooJ Manual / M. Silberztein [Electronic resource]. – 2003. – Mode of access : www.nooj4nlp.net. – Date of access : 22.06.2017.

8. Hetsevich, Yu. Overview of Belarusian and Russian Dictionaries and Their Adaptation for NooJ / Yu. Hetsevich, S. Hetsevich // Automatic Processing of Various Levels of Linguistic Phenomena: Selected Papers from the NooJ 2011 Intern. Conf. – Newcastle : Cambridge Scholars Publishing, 2012. – P. 29–40.

9. Kriesel, D. A Brief Introduction to Neural Networks / D. Kriesel [Electronic resource]. – 2005. – Mode of access : http://www.dkriesel.com. – Date of access : 22.06.2017.

10. Quinlan, J.R. Simplifying Decision Trees / J.R. Quinlan // Intern. J. of Man-Machine Studies. – 1987. – Vol. 27, no. 3. – P. 221–234.

11. Cha, S.-H. A Genetic Algorithm for Constructing Compact Binary Decision Trees / S.-H. Cha, C.C. Tappert // J. of Pattern Recognition Research. – 2009. – Vol. 4, no. 1. – P. 1–13.

12. Генератар парадыгмы слова // Лабараторыя распазнавання і сінтэзу маўлення [Электронны рэсурс]. – 2017. – Рэжым доступу : http://ssrlab.by/5047. – Дата доступу : 13.05.2017.

13. Oliveira, H.G. Towards the Automatic Enrichment of a Thesaurus with Information in Dictionaries / H.G. Oliveira, P. Gomes // Expert Systems. – 2013. – Vol. 30, no. 4. – P. 320–332.

14. The Enrichment of Lexical Resources Through Incremental Parsebanking / V. Rosén [et al.] // Language Resources and Evaluation. – 2016. – Vol. 50, no. 2. – P. 291–319.

15. Computer Treatment of Slavic and East European Languages / ed. R. Garabik // Third Intern. Seminar, Bratislava, Slovakia, 10–12 Nov. 2005. – Bratislava : VEDA, 2005. – 246 p.


For citations:


Hetsevich Yu.S., Reentovich I.V. LINGUISTIC ANALYSIS FOR THE BELARUSIAN CORPUS WITH THE APPLICATION OF NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING TECHNIQUES. Informatics. 2017;(4(56)):70-77. (In Russ.)


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1816-0301 (Print)
ISSN 2617-6963 (Online)