AUTOMATIC LANGUAGE IDENTIFICATION OF TEXT DOCUMENTS FOR THE MAJOR EUROPEAN LANGUAGES
Abstract
The main methods for automatically identifying the language of a text document are analyzed. An algorithm is proposed that combines the alphabet method, the function-word (grammatical-word) method, and the alphabet-trigram method. It relies on only minimal statistical and linguistic analysis of language data and provides an effective solution to this task.
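As a rough illustration of how three such methods might be chained, the sketch below narrows the candidate languages by alphabet, then counts function-word hits, and falls back to character-trigram overlap only when the first two stages leave a tie. Everything in it is an assumption made for illustration: the four-language set, the toy alphabets, word lists and trigram samples, the order of the stages, and the name `identify_language` are not taken from the paper, and the trigram stage is only one possible reading of the "alphabet-trigram" method.

```python
import re
from collections import Counter

# Toy, hand-picked data for illustration only (assumption, not from the paper).
BASE = set("abcdefghijklmnopqrstuvwxyz")
ALPHABETS = {
    "en": BASE,
    "de": BASE | set("äöüß"),
    "fr": BASE | set("àâæçéèêëîïôœùûüÿ"),
    "es": BASE | set("áéíóúüñ"),
}
FUNCTION_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
    "fr": {"le", "la", "les", "et", "est", "des"},
    "es": {"el", "la", "los", "y", "es", "que"},
}
TRIGRAM_SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then this",
    "de": "der schnelle braune fuchs springt über den faulen hund und nicht",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et les",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y que los",
}


def trigrams(text):
    """Count character trigrams of the text, padded with one space per side."""
    padded = f" {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


TRIGRAM_PROFILES = {lang: trigrams(s) for lang, s in TRIGRAM_SAMPLES.items()}


def identify_language(text):
    """Return a language code from ALPHABETS, or None if no alphabet fits."""
    text = text.lower()
    letters = {ch for ch in text if ch.isalpha()}

    # Stage 1 (alphabet): keep languages whose alphabet covers every letter seen.
    candidates = [lang for lang, abc in ALPHABETS.items() if letters <= abc]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]

    # Stage 2 (function words): count how many tokens are function words.
    words = re.findall(r"[^\W\d_]+", text)
    hits = {lang: sum(w in FUNCTION_WORDS[lang] for w in words)
            for lang in candidates}
    best = max(hits.values())
    leaders = [lang for lang in candidates if hits[lang] == best]
    if best > 0 and len(leaders) == 1:
        return leaders[0]

    # Stage 3 (trigrams): break the tie by trigram overlap with each profile.
    doc = trigrams(" ".join(words))

    def overlap(lang):
        profile = TRIGRAM_PROFILES[lang]
        return sum(min(count, profile[gram]) for gram, count in doc.items())

    return max(leaders, key=overlap)
```

Ordering the stages from cheapest to most expensive means that a single distinctive letter or a handful of function words can settle the answer before any trigram statistics are consulted. A realistic system would replace the toy tables with alphabets, word lists and trigram profiles derived from corpora.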
For citations:
. Informatics. 2011;(3(31)):112-117. (In Russ.)