References

inform

Информатика

Informatics

ISSN 1816-0301, 2617-6963

UIIP NASB

10.37661/1816-0301-2020-17-4-61-72

inform-1080

Research Article

SIGNAL, IMAGE, SPEECH, TEXT PROCESSING AND PATTERN RECOGNITION

Language modeling and bidirectional encoder representations: an overview of key technologies

Kachkou D. I. (Качков Д. И.)

Dzmitry I. Kachkou, Postgraduate Student, Department of Multiprocessor Systems and Networks, Faculty of Applied Mathematics and Informatics

Minsk

dmitriydikanskiy@gmail.com

Belarusian State University

2020

02.11.2020

Vol. 17, no. 4, pp. 61–72

2021

Качков Д.И.

Kachkou D.I.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://inf.grid.by/jour/article/view/1080

This article surveys the development of the natural language processing technologies that underlie BERT (Bidirectional Encoder Representations from Transformers), a language model from Google that achieves strong results on a whole class of natural language understanding tasks. The two key ideas implemented in BERT are transfer learning and the attention mechanism. The model is pre-trained to solve two tasks on a large corpus of unlabeled data and can reuse the language patterns it has discovered for efficient fine-tuning on a specific text processing problem. The Transformer architecture it uses is based on attention, i.e. it evaluates the relationships between the tokens of the input. The article also notes the strengths and weaknesses of BERT and directions for further improvement of the model.

Keywords: informatics, information technology, language models, natural language processing, attention mechanism, Transformer architecture, BERT model

References

Cho K., van Merriënboer B., Gulcehre C., Bahdanau D., Bougares F., Schwenk H., Bengio Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014, pp. 1724–1734. https://doi.org/10.3115/v1/D14-1179

Sutskever I., Vinyals O., Le Q. V. Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems, 2014, pp. 3104–3112. ArXiv preprint: https://arxiv.org/abs/1409.3215

Serban I. V., Lowe R., Charlin L., Pineau J. Generative Deep Neural Networks for Dialogue: A Short Review. Advances in Neural Information Processing Systems, Workshop on Learning Methods for Dialogue, 2016. ArXiv preprint: https://arxiv.org/abs/1611.06216

Vinyals O., Toshev A., Bengio S., Erhan D. Show and tell: A neural image caption generator. Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164. https://doi.org/10.1109/CVPR.2015.7298935

Loyola P., Marrese-Taylor E., Matsuo Y. A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, Vol. 2, pp. 287-292. https://doi.org/10.18653/v1/P17-2045

Lebret R., Grangier D., Auli M. Neural Text Generation from Structured Data with Application to the Biography Domain. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1203–1213. https://doi.org/10.18653/v1/D16-1128

Nikolenko S., Kadurin A., Arhangelskaja E. Glubokoe obuchenie [Deep Learning]. Saint Petersburg, Piter, 2020, 480 p. (In Russ.)

Bahdanau D., Cho K., Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate. International Conference on Learning Representations, 2015. ArXiv preprint: https://arxiv.org/abs/1409.0473

Schuster M., Paliwal K. K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997, Vol. 45 (11), pp. 2673–2681. https://doi.org/10.1109/78.650093

Luong T., Pham H., Manning C. D. Effective Approaches to Attention-based Neural Machine Translation. Proceedings of EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412-1421. https://doi.org/10.18653/v1/D15-1166

Chung J., Cho K., Bengio Y. A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, Vol. 1, pp. 1693–1703. https://doi.org/10.18653/v1/P16-1160

Rush A., Chopra S., Weston J. A Neural Attention Model for Abstractive Sentence Summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 379–389. https://doi.org/10.18653/v1/D15-1044

Chorowski J., Bahdanau D., Serdyuk D., Cho K., Bengio Y. Attention-Based Models for Speech Recognition. Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015, Vol. 1, pp. 577–585. ArXiv preprint: https://arxiv.org/abs/1506.07503

Chan W., Jaitly N., Le Q. V., Vinyals O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 4960-4964. https://doi.org/10.1109/ICASSP.2016.7472621

Hermann K. M., Kočiský T., Grefenstette E., Espeholt L., Kay W., Suleyman M., Blunsom P. Teaching Machines to Read and Comprehend. Advances in Neural Information Processing Systems 28: 29th Annual Conference on Neural Information Processing Systems, 2015, pp. 1693-1701. ArXiv preprint: https://arxiv.org/abs/1506.03340

Wu Y., Schuster M., Chen Z., Le Q. V., Norouzi M., Macherey W., Krikun M., Cao Y., Gao Q., Macherey K., Klingner J., Shah A., Johnson M., Liu X., Kaiser Ł., Gouws S., Kato Y., Kudo T., Kazawa H., Stevens K., Kurian G., Patil N., Wang W., Young C., Smith J., Riesa J., Rudnick A., Vinyals O., Corrado G., Hughes M., Dean J. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. ArXiv preprint, 2016. https://arxiv.org/abs/1609.08144.

Hochreiter S., Schmidhuber J. Long short-term memory. Neural Computation, 1997, Vol. 9 (8), pp. 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

Cho K., van Merriënboer B., Bahdanau D., Bengio Y. On the properties of neural machine translation: Encoder-decoder approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014, pp. 103–111. https://doi.org/10.3115/v1/W14-4012

Martin E., Cundy C. Parallelizing Linear Recurrent Neural Nets Over Sequence Length. International Conference on Learning Representations, 2018. ArXiv preprint: https://arxiv.org/abs/1709.04057

Kalchbrenner N., Espeholt L., Simonyan K., van den Oord A., Graves A., Kavukcuoglu K. Neural machine translation in linear time. ArXiv preprint, 2016. https://arxiv.org/abs/1610.10099

Gehring J., Auli M., Grangier D., Yarats D., Dauphin Y. N. Convolutional sequence to sequence learning. Proceedings of the 34th International Conference on Machine Learning, 2017, Vol. 70, pp. 1243–1252. ArXiv preprint: https://arxiv.org/abs/1705.03122

LeCun Y., Bottou L., Bengio Y., Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, Vol. 86 (11), pp. 2278–2324. https://doi.org/10.1109/5.726791

Parikh A. P., Täckström O., Das D., Uszkoreit J. A Decomposable Attention Model for Natural Language Inference. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2249–2255. https://doi.org/10.18653/v1/D16-1244

Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser L., Polosukhin I. Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6000–6010. ArXiv preprint: https://arxiv.org/abs/1706.03762

Mitkov R. Anaphora Resolution: The State of the Art. Paper based on the COLING'98/ACL'98 tutorial on anaphora resolution, University of Wolverhampton, 1999.

Ba J. L., Kiros J. R., Hinton G. E. Layer normalization. ArXiv preprint, 2016. https://arxiv.org/abs/1607.06450.

Li N., Liu S., Liu Y., Zhao S., Liu M., Zhou M. Neural Speech Synthesis with Transformer Network. The AAAI Conference on Artificial Intelligence, 2019. ArXiv preprint: https://arxiv.org/abs/1809.08895

Khandelwal U., Clark K., Jurafsky D., Kaiser Ł. Sample Efficient Text Summarization Using a Single Pre-Trained Transformer. ArXiv preprint, 2019. https://arxiv.org/abs/1905.08836

Vlasov V., Mosig J. E. M., Nichol A. Dialogue Transformers. ArXiv preprint, 2019. https://arxiv.org/abs/1910.00486

Griffith K., Kalita J. Solving Arithmetic Word Problems Automatically Using Transformer and Unambiguous Representations. International Conference on Computational Science and Computational Intelligence, 2019, pp. 526-532. https://doi.org/10.1109/CSCI49370.2019.00101

Kang W.-C., McAuley J. Self-Attentive Sequential Recommendation. IEEE International Conference on Data Mining, 2018, pp. 197-206. https://doi.org/10.1109/ICDM.2018.00035

Huang C.-Z. A., Vaswani A., Uszkoreit J., Shazeer N., Simon I., Hawthorne C., Dai A. M., Hoffman M. D., Dinculescu M., Eck D. Music Transformer. ArXiv preprint, 2018. https://arxiv.org/abs/1809.04281

Dehghani M., Gouws S., Vinyals O., Uszkoreit J., Kaiser Ł. Universal Transformers. 7th International Conference on Learning Representations, 2019. ArXiv preprint: https://arxiv.org/abs/1807.03819

Dai Z., Yang Z., Yang Y., Carbonell J., Le Q. V., Salakhutdinov R. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2978–2988. https://doi.org/10.18653/v1/P19-1285

So D. R., Liang C., Le Q. V. The Evolved Transformer. Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 5877-5886. ArXiv preprint: https://arxiv.org/abs/1901.11117

Zhao C., Xiong C., Rosset C., Song X., Bennett P., Tiwary S. Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention. 8th International Conference on Learning Representations, 2020. Available at: https://openreview.net/forum?id=r1eIiCNYwS (Accessed 10 July 2020)

Mikolov T., Sutskever I., Chen K., Corrado G., Dean J. Distributed Representations of Words and Phrases and their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, 2013, Vol. 2, pp. 3111–3119. ArXiv preprint: https://arxiv.org/abs/1310.4546

Pennington J., Socher R., Manning C. D. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532–1543. https://doi.org/10.3115/v1/D14-1162

Sahlgren M. The Distributional Hypothesis. From context to meaning. Distributional models of the lexicon in linguistics and cognitive science (Special issue of the Italian Journal of Linguistics), Rivista di Linguistica, Vol. 20 (1), 2008, pp. 33–53.

McCann B., Bradbury J., Xiong C., Socher R. Learned in Translation: Contextualized Word Vectors. 31st Conference on Neural Information Processing Systems, Long Beach, 2017, pp. 6297–6308. ArXiv preprint: https://arxiv.org/abs/1708.00107

Hedderich M. A., Yates A., Klakow D., de Melo G. Using Multi-Sense Vector Embeddings for Reverse Dictionaries. Proceedings of the 13th International Conference on Computational Semantics - Long Papers, 2019, pp. 247–258. https://doi.org/10.18653/v1/W19-0421

Ruder S. Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway, 2019.

Deng J., Dong W., Socher R., Li L.-J., Li K., Fei-Fei L. ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. https://doi.org/10.1109/CVPR.2009.5206848

Papandreou G., Zhu T., Kanazawa N., Toshev A., Tompson J., Bregler C., Murphy K. Towards Accurate Multi-person Pose Estimation in the Wild. IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3711-3719. https://doi.org/10.1109/CVPR.2017.395

He K., Gkioxari G., Dollár P., Girshick R. Mask R-CNN. IEEE International Conference on Computer Vision, 2017, pp. 2980-2988. https://doi.org/10.1109/ICCV.2017.322

Mahajan D., Girshick R., Ramanathan V., He K., Paluri M., Li Y., Bharambe A., van der Maaten L. Exploring the Limits of Weakly Supervised Pretraining. European Conference on Computer Vision, 2018, pp. 181–196. https://doi.org/10.1007/978-3-030-01216-8_12

Dai A. M., Le Q. V. Semi-supervised Sequence Learning. Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015, Vol. 2, pp. 3079–3087.

Peters M. E., Ammar W., Bhagavatula C., Power R. Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, Vol. 1, pp. 1756–1765. ArXiv preprint: https://arxiv.org/abs/1705.00108

Howard J., Ruder S. Universal Language Model Fine-tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, Vol. 1, pp. 328–339. https://doi.org/10.18653/v1/P18-1031

Peters M. E., Neumann M., Iyyer M., Gardner M., Clark C., Lee K., Zettlemoyer L. Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, Vol. 1, pp. 2227–2237. https://doi.org/10.18653/v1/N18-1202

Merity S., Xiong C., Bradbury J., Socher R. Pointer Sentinel Mixture Models. 5th International Conference on Learning Representations, 2017. ArXiv preprint: https://arxiv.org/abs/1609.07843.

Radford A., Narasimhan K., Salimans T., Sutskever I. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018. Available at: https://openai.com/blog/language-unsupervised/ (Accessed 10 July 2020)

Liu P. J., Saleh M., Pot E., Goodrich B., Sepassi R., Kaiser L., Shazeer N. Generating Wikipedia by Summarizing Long Sequences. 6th International Conference on Learning Representations, 2018. ArXiv preprint: https://arxiv.org/abs/1801.10198

Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, Vol. 1, pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423

Taylor W. L. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 1953, Vol. 30 (4), pp. 415–433.

Zhu Y., Kiros R., Zemel R., Salakhutdinov R., Urtasun R., Torralba A., Fidler S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. Proceedings of the IEEE international conference on computer vision, 2015, pp. 19–27. https://doi.org/10.1109/ICCV.2015.11

Wang A., Singh A., Michael J., Hill F., Levy O., Bowman S. R. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018, pp. 353–355. https://doi.org/10.18653/v1/W18-5446

Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., Levy O., Lewis M., Zettlemoyer L., Stoyanov V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv preprint, 2019. https://arxiv.org/abs/1907.11692

Lan Z., Chen M., Goodman S., Gimpel K., Sharma P., Soricut R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. 8th International Conference on Learning Representations, 2020. Available at: https://openreview.net/forum?id=H1eA7AEtvS (Accessed 10 July 2020)

Sanh V., Debut L., Chaumond J., Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Conference on Neural Information Processing Systems, 2019. ArXiv preprint: https://arxiv.org/abs/1910.01108

Hinton G., Vinyals O., Dean J. Distilling the Knowledge in a Neural Network. Neural Information Processing Systems. Deep Learning and Representation Learning Workshop, 2015. ArXiv preprint: https://arxiv.org/abs/1503.02531

Jiao X., Yin Y., Shang L., Jiang X., Chen X., Li L., Wang F., Liu Q. TinyBERT: Distilling BERT for Natural Language Understanding. ArXiv preprint, 2019. https://arxiv.org/abs/1909.10351

Liu X., He P., Chen W., Gao J. Multi-Task Deep Neural Networks for Natural Language Understanding. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4487–4496. https://doi.org/10.18653/v1/P19-1441

Representation learning using multi-task deep neural networks for semantic classification and information retrieval / X. Liu [et al.] // Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. — 2015. — P. 912–921. https://doi.org/10.3115/v1/N15-1092

Liu X., Gao J., He X., Deng L., Duh K., Wang Y.-Y. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 912–921. https://doi.org/10.3115/v1/N15-1092

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding / W. Wang [et al.] // 8th International Conference on Learning Representations. — 2020. Available at: https://openreview.net/forum?id=BJgQ4lSFPH (Accessed 10 July 2020)

Wang W., Bi B., Yan M., Wu C., Xia J., Bao Z., Peng L., Si L. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding. 8th International Conference on Learning Representations, 2020. Available at: https://openreview.net/forum?id=BJgQ4lSFPH (Accessed 10 July 2020)

Elman J. L. Finding structure in time / J. L. Elman // Cognitive Science. — 1990. — Vol. 14 (2). — P. 179–211.

Elman J. L. Finding structure in time. Cognitive Science, 1990, Vol. 14 (2), pp. 179–211.

BioBERT: a pre-trained biomedical language representation model for biomedical text mining / J. Lee [et al.] // Bioinformatics. — 2020. — Vol. 36 (4). — P. 1234–1240. https://doi.org/10.1093/bioinformatics/btz682

Lee J., Yoon W., Kim S., Kim D., Kim S., So C. H., Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2020, Vol. 36 (4), pp. 1234–1240. https://doi.org/10.1093/bioinformatics/btz682

Lu J. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks / J. Lu, D. Batra, D. Parikh, S. Lee // ArXiv preprint. — 2019. https://arxiv.org/abs/1908.02265

Lu J., Batra D., Parikh D., Lee S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. ArXiv preprint, 2019. https://arxiv.org/abs/1908.02265

Niven T. Probing Neural Network Comprehension of Natural Language Arguments / T. Niven, H.-Y. Kao // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. — 2019. — P. 4658–4664. https://doi.org/10.18653/v1/P19-1459

Niven T., Kao H.-Y. Probing Neural Network Comprehension of Natural Language Arguments. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4658–4664. https://doi.org/10.18653/v1/P19-1459

HellaSwag: Can a Machine Really Finish Your Sentence? / R. Zellers [et al.] // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. — 2019. — P. 4791–4800. https://doi.org/10.18653/v1/P19-1472

Zellers R., Holtzman A., Bisk Y., Farhadi A., Choi Y. HellaSwag: Can a Machine Really Finish Your Sentence? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4791–4800. https://doi.org/10.18653/v1/P19-1472

McCoy T. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference / T. McCoy, E. Pavlick, T. Linzen // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. — 2019. — P. 3428–3448. https://doi.org/10.18653/v1/P19-1334

McCoy T., Pavlick E., Linzen T. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3428–3448. https://doi.org/10.18653/v1/P19-1334

The author declares that there are no conflicts of interest.