<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">inform</journal-id><journal-title-group><journal-title xml:lang="ru">Информатика</journal-title><trans-title-group xml:lang="en"><trans-title>Informatics</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1816-0301</issn><issn pub-type="epub">2617-6963</issn><publisher><publisher-name>UIIP NASB</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.37661/1816-0301-2020-17-4-61-72</article-id><article-id custom-type="elpub" pub-id-type="custom">inform-1080</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>ОБРАБОТКА СИГНАЛОВ, ИЗОБРАЖЕНИЙ, РЕЧИ, ТЕКСТА И РАСПОЗНАВАНИЕ ОБРАЗОВ</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>SIGNAL, IMAGE, SPEECH, TEXT PROCESSING AND PATTERN RECOGNITION</subject></subj-group></article-categories><title-group><article-title>Моделирование языка и двунаправленные представления кодировщиков: обзор ключевых технологий</article-title><trans-title-group xml:lang="en"><trans-title>Language modeling and bidirectional coders representations: an overview of key technologies</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Качков</surname><given-names>Д. И.</given-names></name><name name-style="western" xml:lang="en"><surname>Kachkou</surname><given-names>D. 
I.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Качков Дмитрий Ильич, аспирант кафедры многопроцессорных систем и сетей факультета прикладной математики и информатики</p><p>Минск</p></bio><bio xml:lang="en"><p>Dzmitry I. Kachkou, Postgraduate Student, Department of Multiprocessor Systems and Networks, Faculty of Applied Mathematics and Informatics</p><p>Minsk</p></bio><email xlink:type="simple">dmitriydikanskiy@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Белорусский государственный университет</institution></aff><aff xml:lang="en"><institution>Belarusian State University</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2020</year></pub-date><pub-date pub-type="epub"><day>02</day><month>11</month><year>2020</year></pub-date><volume>17</volume><issue>4</issue><fpage>61</fpage><lpage>72</lpage><permissions><copyright-statement>Copyright &#x00A9; Качков Д.И., 2021</copyright-statement><copyright-year>2021</copyright-year><copyright-holder xml:lang="ru">Качков Д.И.</copyright-holder><copyright-holder xml:lang="en">Kachkou D.I.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://inf.grid.by/jour/article/view/1080">https://inf.grid.by/jour/article/view/1080</self-uri><abstract><p>Представлен очерк развития технологий обработки естественного языка, которые легли в основу BERT (Bidirectional Encoder Representations from 
Transformers) – языковой модели от компании Google, демонстрирующей высокие результаты на целом классе задач, связанных с пониманием естественного языка. Две ключевые идеи, реализованные в BERT, – это перенос знаний и механизм внимания. Модель предобучена решению нескольких задач на обширном корпусе неразмеченных данных и может применять обнаруженные языковые закономерности для эффективного дообучения под конкретную проблему обработки текста. Использованная архитектура Transformer основана на внимании, т. е. предполагает оценку взаимосвязей между токенами входных данных. В статье отмечены сильные и слабые стороны BERT и направления дальнейшего усовершенствования модели.</p></abstract><trans-abstract xml:lang="en"><p>The article surveys the development of the natural language processing technologies that underlie BERT (Bidirectional Encoder Representations from Transformers), a language model from Google that achieves strong results on a whole class of natural language understanding tasks. The two key ideas implemented in BERT are transfer learning and the attention mechanism. The model is pre-trained on two tasks over a large corpus of unlabeled data and can reuse the language patterns it has discovered when it is fine-tuned for a specific text processing problem. The Transformer architecture it uses is based on attention, i.e. it evaluates the relationships between the tokens of the input data. 
The article also notes the strengths and weaknesses of BERT and directions for further improvement of the model.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>информатика</kwd><kwd>информационные технологии</kwd><kwd>языковые модели</kwd><kwd>обработка естественного языка</kwd><kwd>механизм внимания</kwd><kwd>архитектура Transformer</kwd><kwd>модель BERT</kwd></kwd-group><kwd-group xml:lang="en"><kwd>informatics</kwd><kwd>information technology</kwd><kwd>language models</kwd><kwd>natural language processing</kwd><kwd>attention mechanism</kwd><kwd>Transformer architecture</kwd><kwd>BERT model</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Cho K., Merriënboer B. van, Gulcehre C., Bahdanau D., Bougares F., Schwenk H., Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014, pp. 1724–1734. https://doi.org/10.3115/v1/D14-1179</mixed-citation><mixed-citation xml:lang="en">Cho K., van Merriënboer B., Gulcehre C., Bahdanau D., Bougares F., Schwenk H., Bengio Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1724-1734. https://doi.org/10.3115/v1/D14-1179</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Sutskever I. Sequence to Sequence Learning with Neural Networks / I. Sutskever, O. Vinyals, Q. V. Le // Advances in Neural Information Processing Systems. — 2014. — P. 3104–3112. ArXiv preprint: https://arxiv.org/abs/1409.3215</mixed-citation><mixed-citation xml:lang="en">Sutskever I., Vinyals O., Le Q. V. Sequence to Sequence Learning with Neural Networks. 
Advances in Neural Information Processing Systems, 2014, pp. 3104–3112. ArXiv preprint: https://arxiv.org/abs/1409.3215</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Serban I. V., Lowe R., Charlin L., Pineau J. Generative deep neural networks for dialogue: A short review. Neural Information Processing Systems, Workshop on Learning Methods for Dialogue, 2016. Available at: https://arxiv.org/abs/1611.06216 (accessed 07.07.2020).</mixed-citation><mixed-citation xml:lang="en">Serban I. V., Lowe R., Charlin L., Pineau J. Generative Deep Neural Networks for Dialogue: A Short Review. Advances in Neural Information Processing Systems, Workshop on Learning Methods for Dialogue, 2016. ArXiv preprint: https://arxiv.org/abs/1611.06216</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Vinyals O. Show and tell: A neural image caption generator / O. Vinyals, A. Toshev, S. Bengio, D. Erhan // Proceedings of the IEEE conference on computer vision and pattern recognition. — 2015. — P. 3156–3164. https://doi.org/10.1109/CVPR.2015.7298935</mixed-citation><mixed-citation xml:lang="en">Vinyals O., Toshev A., Bengio S., Erhan D. Show and tell: A neural image caption generator. Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164. https://doi.org/10.1109/CVPR.2015.7298935</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Loyola P. A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes / P. Loyola, E. Marrese-Taylor, Y. Matsuo // Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. — 2017. — Vol. 2. — P. 287-292. 
https://doi.org/10.18653/v1/P17-2045</mixed-citation><mixed-citation xml:lang="en">Loyola P., Marrese-Taylor E., Matsuo Y. A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, Vol. 2, pp. 287-292. https://doi.org/10.18653/v1/P17-2045</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Lebret R. Neural Text Generation from Structured Data with Application to the Biography Domain / R. Lebret, D. Grangier, M. Auli // Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. — 2016. — P. 1203–1213. https://doi.org/10.18653/v1/D16-1128</mixed-citation><mixed-citation xml:lang="en">Lebret R., Grangier D., Auli M. Neural Text Generation from Structured Data with Application to the Biography Domain. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1203–1213. https://doi.org/10.18653/v1/D16-1128</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Николенко С., Кандурин А., Архангельская Е. Глубокое обучение. — Санкт-Петербург: Питер, 2020. — 480 с.</mixed-citation><mixed-citation xml:lang="en">Nikolenko S., Kandurin A., Arhangelskaja E. Glubokoe obuchenie [Deep Learning], Saint Petersburg: Piter, 2020, 480 p.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Bahdanau D. Neural Machine Translation by Jointly Learning to Align and Translate / D. Bahdanau, K. Cho, Y. Bengio // International Conference on Learning Representations. — 2015. ArXiv preprint: https://arxiv.org/abs/1409.0473</mixed-citation><mixed-citation xml:lang="en">Bahdanau D., Cho K., Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate. 
International Conference on Learning Representations, 2015. ArXiv preprint:  https://arxiv.org/abs/1409.0473</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Schuster M. Bidirectional recurrent neural networks / M. Schuster, K. K. Paliwal // Signal Processing, IEEE Transactions on 45.11. — 1997. — P. 2673-2681. https://doi.org/10.1109/78.650093</mixed-citation><mixed-citation xml:lang="en">Schuster M., Paliwal K. K. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions, 1997, Vol. 45 (11), pp. 2673-2681. https://doi.org/10.1109/78.650093</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Luong T. Effective Approaches to Attention-based Neural Machine Translation / T. Luong, H. Pham, C. D. Manning // Proceedings of EMNLP 2015: Conference on Empirical Methods in Natural Language Processing. — 2015. — P. 1412-1421. https://doi.org/10.18653/v1/D15-1166</mixed-citation><mixed-citation xml:lang="en">Luong T., Pham H., Manning C. D. Effective Approaches to Attention-based Neural Machine Translation. Proceedings of EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412-1421. https://doi.org/10.18653/v1/D15-1166</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Chung J. A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation / J. Chung, K. Cho, Y. Bengio // Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. — 2016. — Vol. 1. — P. 1693–1703. https://doi.org/10.18653/v1/P16-1160</mixed-citation><mixed-citation xml:lang="en">Chung J., Cho K., Bengio Y. A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation. 
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, Vol. 1, pp. 1693–1703. https://doi.org/10.18653/v1/P16-1160</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Rush A. A Neural Attention Model for Abstractive Sentence Summarization / A. Rush, S. Chopra, J. Weston // Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. — 2015. — P. 379–389. https://doi.org/10.18653/v1/D15-1044</mixed-citation><mixed-citation xml:lang="en">Rush A., Chopra S., Weston J. A Neural Attention Model for Abstractive Sentence Summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 379–389. https://doi.org/10.18653/v1/D15-1044</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Attention-Based Models for Speech Recognition / J. Chorowski [et al.] // Proceedings of the 28th International Conference on Neural Information Processing Systems. — 2015. — Vol. 1. — P. 577–585. ArXiv preprint: https://arxiv.org/abs/1506.07503</mixed-citation><mixed-citation xml:lang="en">Chorowski J., Bahdanau D., Serdyuk D., Cho K., Bengio Y. Attention-Based Models for Speech Recognition. Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015, Vol. 1, pp. 577–585. ArXiv preprint: https://arxiv.org/abs/1506.07503</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Chan W. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition / W. Chan, N. Jaitly, Q. V. Le, O. Vinyals // 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). — 2016. — P. 4960-4964. 
https://doi.org/10.1109/ICASSP.2016.7472621</mixed-citation><mixed-citation xml:lang="en">Chan W., Jaitly N., Le Q. V., Vinyals O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 4960-4964. https://doi.org/10.1109/ICASSP.2016.7472621</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Teaching Machines to Read and Comprehend / K. M. Hermann [et al.] // Advances in Neural Information Processing Systems 28: 29th Annual Conference on Neural Information Processing Systems 2015. — 2015. — P. 1693-1701. ArXiv preprint: https://arxiv.org/abs/1506.03340</mixed-citation><mixed-citation xml:lang="en">Hermann K. M., Kočiský T., Grefenstette E., Espeholt L., Kay W., Suleyman M.,  Blunsom P. Teaching Machines to Read and Comprehend. Advances in Neural Information Processing Systems 28: 29th Annual Conference on Neural Information Processing Systems, 2015, pp. 1693-1701. ArXiv preprint: https://arxiv.org/abs/1506.03340</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation / Y. Wu [et al.] // ArXiv preprint. — 2016. https://arxiv.org/abs/1609.08144</mixed-citation><mixed-citation xml:lang="en">Wu Y., Schuster M., Chen Z., Le Q. V., Norouzi M., Macherey W., Krikun M., Cao Y., Gao Q., Macherey K., Klingner J., Shah A., Johnson M., Liu X., Kaiser Ł., Gouws S., Kato Y., Kudo T., Kazawa H., Stevens K., Kurian G., Patil N., Wang W., Young C., Smith J., Riesa J., Rudnick A., Vinyals O., Corrado G., Hughes M., Dean J. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. ArXiv preprint, 2016. 
https://arxiv.org/abs/1609.08144.</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">Hochreiter S. Long short-term memory / S. Hochreiter, J. Schmidhuber // Neural Computation. — 1997. — Vol. 9 (8). — P. 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735</mixed-citation><mixed-citation xml:lang="en">Hochreiter S., Schmidhuber J. Long short-term memory. Neural Computation, 1997, Vol. 9 (8), pp. 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735</mixed-citation></citation-alternatives></ref><ref id="cit18"><label>18</label><citation-alternatives><mixed-citation xml:lang="ru">Cho K. On the properties of neural machine translation: Encoder-decoder approaches / K. Cho, B. van Merrienboer, D. Bahdanau, Y. Bengio // Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. — 2014. — P. 103–111. https://doi.org/10.3115/v1/W14-4012</mixed-citation><mixed-citation xml:lang="en">Cho K., van Merrienboer B., Bahdanau D., Bengio Y. On the properties of neural machine translation: Encoder-decoder approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014, pp. 103–111. https://doi.org/10.3115/v1/W14-4012</mixed-citation></citation-alternatives></ref><ref id="cit19"><label>19</label><citation-alternatives><mixed-citation xml:lang="ru">Martin E., Cundy C. Parallelizing Linear Recurrent Neural Nets Over Sequence Length // International Conference on Learning Representations. — 2018. ArXiv preprint: https://arxiv.org/abs/1709.04057</mixed-citation><mixed-citation xml:lang="en">Martin E., Cundy C. Parallelizing Linear Recurrent Neural Nets Over Sequence Length. International Conference on Learning Representations, 2018. 
ArXiv preprint: https://arxiv.org/abs/1709.04057</mixed-citation></citation-alternatives></ref><ref id="cit20"><label>20</label><citation-alternatives><mixed-citation xml:lang="ru">Neural machine translation in linear time / N. Kalchbrenner [et al.] // ArXiv preprint. — 2016. https://arxiv.org/abs/1610.10099.</mixed-citation><mixed-citation xml:lang="en">Kalchbrenner N., Espeholt L., Simonyan K., van den Oord A., Graves A., Kavukcuoglu K. Neural machine translation in linear time. ArXiv preprint, 2016. https://arxiv.org/abs/1610.10099</mixed-citation></citation-alternatives></ref><ref id="cit21"><label>21</label><citation-alternatives><mixed-citation xml:lang="ru">Convolutional sequence to sequence learning / J. Gehring [et al.] // Proceedings of the 34th International Conference on Machine Learning — 2017. — Vol. 70. — P. 1243–1252. ArXiv preprint: https://arxiv.org/abs/1705.03122</mixed-citation><mixed-citation xml:lang="en">Gehring J., Auli M., Grangier D., Yarats D., Dauphin Y. N. Convolutional sequence to sequence learning. Proceedings of the 34th International Conference on Machine Learning, 2017, Vol. 70, pp. 1243–1252. ArXiv preprint: https://arxiv.org/abs/1705.03122</mixed-citation></citation-alternatives></ref><ref id="cit22"><label>22</label><citation-alternatives><mixed-citation xml:lang="ru">LeCun Y. Gradient-based learning applied to document recognition / Y. LeCun, L. Bottou, Y. Bengio, P. Haffner // Proceedings of the IEEE. — 1998. — Vol. 86 (11). — P. 2278–2324. https://doi.org/10.1109/5.726791</mixed-citation><mixed-citation xml:lang="en">LeCun Y., Bottou L., Bengio Y., Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, Vol. 86 (11), pp. 2278–2324. https://doi.org/10.1109/5.726791</mixed-citation></citation-alternatives></ref><ref id="cit23"><label>23</label><citation-alternatives><mixed-citation xml:lang="ru">Parikh A. P. A Decomposable Attention Model for Natural Language Inference / A. P. 
Parikh, O. Täckström, D. Das, J. Uszkoreit // Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. — 2016. — P. 2249–2255. https://doi.org/10.18653/v1/D16-1244</mixed-citation><mixed-citation xml:lang="en">Parikh A. P., Täckström O., Das D., Uszkoreit J. A Decomposable Attention Model for Natural Language Inference. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2249–2255. https://doi.org/10.18653/v1/D16-1244</mixed-citation></citation-alternatives></ref><ref id="cit24"><label>24</label><citation-alternatives><mixed-citation xml:lang="ru">Attention Is All You Need / A. Vaswani [et al.] // Publication: NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems. — 2017. — P. 6000–6010. ArXiv preprint: https://arxiv.org/abs/1706.03762</mixed-citation><mixed-citation xml:lang="en">Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser L.,  Polosukhin I. Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp.  6000–6010. ArXiv preprint: https://arxiv.org/abs/1706.03762</mixed-citation></citation-alternatives></ref><ref id="cit25"><label>25</label><citation-alternatives><mixed-citation xml:lang="ru">Mitkov R. Anaphora Resolution: The State of the Art. / R. Mitkov // Paper based on the COLING'98/ACL'98 tutorial on anaphora resolution. — University of Wolverhampton. — 1999.</mixed-citation><mixed-citation xml:lang="en">Mitkov R. Anaphora Resolution: The State of the Art. Paper based on the COLING'98/ACL'98 tutorial on anaphora resolution, University of Wolverhampton, 1999.</mixed-citation></citation-alternatives></ref><ref id="cit26"><label>26</label><citation-alternatives><mixed-citation xml:lang="ru">Ba J. L. Layer normalization / J. L. Ba, J. R. Kiros, G. E. Hinton // ArXiv preprint. — 2016. 
https://arxiv.org/abs/1607.06450.</mixed-citation><mixed-citation xml:lang="en">Ba J. L., Kiros J. R., Hinton G. E. Layer normalization. ArXiv preprint, 2016. https://arxiv.org/abs/1607.06450.</mixed-citation></citation-alternatives></ref><ref id="cit27"><label>27</label><citation-alternatives><mixed-citation xml:lang="ru">Neural Speech Synthesis with Transformer Network / N. Li [et al.] // The AAAI Conference on Artificial Intelligence (AAAI). — 2019. ArXiv preprint: https://arxiv.org/abs/1809.08895</mixed-citation><mixed-citation xml:lang="en">Li N., Liu S., Liu Y., Zhao S., Liu M., Zhou M. Neural Speech Synthesis with Transformer Network. The AAAI Conference on Artificial Intelligence, 2019. ArXiv preprint: https://arxiv.org/abs/1809.08895</mixed-citation></citation-alternatives></ref><ref id="cit28"><label>28</label><citation-alternatives><mixed-citation xml:lang="ru">Khandelwal U. Sample Efficient Text Summarization Using a Single Pre-Trained Transformer / U. Khandelwal, K. Clark, D. Jurafsky, Ł. Kaiser // ArXiv preprint. — 2019. https://arxiv.org/abs/1905.08836</mixed-citation><mixed-citation xml:lang="en">Khandelwal U., Clark K., Jurafsky D., Kaiser Ł. Sample Efficient Text Summarization Using a Single Pre-Trained Transformer. ArXiv preprint, 2019. https://arxiv.org/abs/1905.08836</mixed-citation></citation-alternatives></ref><ref id="cit29"><label>29</label><citation-alternatives><mixed-citation xml:lang="ru">Vlasov V. Dialogue Transformers / V. Vlasov, J. E. M. Mosig, A. Nichol // ArXiv preprint. — 2019. https://arxiv.org/abs/1910.00486</mixed-citation><mixed-citation xml:lang="en">Vlasov V., Mosig J. E. M., Nichol A. Dialogue Transformers. ArXiv preprint, 2019. https://arxiv.org/abs/1910.00486</mixed-citation></citation-alternatives></ref><ref id="cit30"><label>30</label><citation-alternatives><mixed-citation xml:lang="ru">Griffith K. Solving Arithmetic Word Problems Automatically Using Transformer and Unambiguous Representations / K. Griffith, J. 
Kalita // 2019 International Conference on Computational Science and Computational Intelligence (CSCI). — 2019. — P. 526-532. https://doi.org/10.1109/CSCI49370.2019.00101</mixed-citation><mixed-citation xml:lang="en">Griffith K., Kalita J. Solving Arithmetic Word Problems Automatically Using Transformer and Unambiguous Representations. International Conference on Computational Science and Computational Intelligence, 2019, pp. 526-532. https://doi.org/10.1109/CSCI49370.2019.00101</mixed-citation></citation-alternatives></ref><ref id="cit31"><label>31</label><citation-alternatives><mixed-citation xml:lang="ru">Kang W. Self-Attentive Sequential Recommendation / W. Kang, J. McAuley // 2018 IEEE International Conference on Data Mining (ICDM). — 2018. — P. 197-206. https://doi.org/10.1109/ICDM.2018.00035.</mixed-citation><mixed-citation xml:lang="en">Kang W.-C., McAuley J. Self-Attentive Sequential Recommendation. IEEE International Conference on Data Mining, 2018, pp. 197-206. https://doi.org/10.1109/ICDM.2018.00035</mixed-citation></citation-alternatives></ref><ref id="cit32"><label>32</label><citation-alternatives><mixed-citation xml:lang="ru">Music Transformer / C.-Z. A. Huang [et al.] // ArXiv preprint. — 2018. https://arxiv.org/abs/1809.04281.</mixed-citation><mixed-citation xml:lang="en">Huang C.-Z. A., Vaswani A., Uszkoreit J., Shazeer N., Simon I., Hawthorne C., Dai A. M., Hoffman M. D., Dinculescu M., Eck D. Music Transformer. ArXiv preprint, 2018. https://arxiv.org/abs/1809.04281</mixed-citation></citation-alternatives></ref><ref id="cit33"><label>33</label><citation-alternatives><mixed-citation xml:lang="ru">Universal Transformers / M. Dehghani [et al.] // 7th International Conference on Learning Representations. — 2019. ArXiv preprint: https://arxiv.org/abs/1807.03819.</mixed-citation><mixed-citation xml:lang="en">Dehghani M., Gouws S., Vinyals O., Uszkoreit J., Kaiser Ł. Universal Transformers. 7th International Conference on Learning Representations, 2019. 
ArXiv preprint: https://arxiv.org/abs/1807.03819</mixed-citation></citation-alternatives></ref><ref id="cit34"><label>34</label><citation-alternatives><mixed-citation xml:lang="ru">Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context / Z. Dai [et al.] // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. — 2019. — P. 2978–2988. https://doi.org/10.18653/v1/P19-1285</mixed-citation><mixed-citation xml:lang="en">Dai Z., Yang Z., Yang Y., Carbonell J., Le Q. V., Salakhutdinov R. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2978–2988. https://doi.org/10.18653/v1/P19-1285</mixed-citation></citation-alternatives></ref><ref id="cit35"><label>35</label><citation-alternatives><mixed-citation xml:lang="ru">So D. R. The Evolved Transformer / D. R. So, C. Liang, Q. V. Le // Proceedings of the 36th International Conference on Machine Learning. — 2019. — P. 5877-5886. ArXiv preprint: https://arxiv.org/abs/1901.11117</mixed-citation><mixed-citation xml:lang="en">So D. R., Liang C., Le Q. V. The Evolved Transformer. Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 5877-5886. ArXiv preprint: https://arxiv.org/abs/1901.11117</mixed-citation></citation-alternatives></ref><ref id="cit36"><label>36</label><citation-alternatives><mixed-citation xml:lang="ru">Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention / C. Zhao [et al.] // 8th International Conference on Learning Representations. — 2020. Available at: https://openreview.net/forum?id=r1eIiCNYwS (Accessed 10 July 2020)</mixed-citation><mixed-citation xml:lang="en">Zhao C., Xiong C., Rosset C., Song X., Bennett P., Tiwary S. Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention. 8th International Conference on Learning Representations, 2020. 
Available at: https://openreview.net/forum?id=r1eIiCNYwS (Accessed 10 July 2020)</mixed-citation></citation-alternatives></ref><ref id="cit37"><label>37</label><citation-alternatives><mixed-citation xml:lang="ru">Mikolov T. Distributed Representations of Words and Phrases and their Compositionality / T. Mikolov, K. Chen, G. Corrado, J. Dean // Proceedings of the 26th International Conference on Neural Information Processing Systems. — 2013. — Vol. 2. — P. 3111–3119. ArXiv preprint: https://arxiv.org/abs/1310.4546</mixed-citation><mixed-citation xml:lang="en">Mikolov T., Chen K., Corrado G., Dean J. Distributed Representations of Words and Phrases and their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, 2013, Vol. 2, pp. 3111–3119. ArXiv preprint: https://arxiv.org/abs/1310.4546</mixed-citation></citation-alternatives></ref><ref id="cit38"><label>38</label><citation-alternatives><mixed-citation xml:lang="ru">Pennington J. Glove: Global Vectors for Word Representation / J. Pennington, R. Socher, C. D. Manning // Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. — 2014. — P. 1532–1543. https://doi.org/10.3115/v1/D14-1162</mixed-citation><mixed-citation xml:lang="en">Pennington J., Socher R., Manning C. D. Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532–1543. https://doi.org/10.3115/v1/D14-1162</mixed-citation></citation-alternatives></ref><ref id="cit39"><label>39</label><citation-alternatives><mixed-citation xml:lang="ru">Sahlgren M. The Distributional Hypothesis. From context to meaning / M. Sahlgren // Distributional models of the lexicon in linguistics and cognitive science (Special issue of the Italian Journal of Linguistics), Rivista di Linguistica. — Vol. 20 (1). — 2008. — P. 33—53.</mixed-citation><mixed-citation xml:lang="en">Sahlgren M. 
The Distributional Hypothesis. From context to meaning. Distributional models of the lexicon in linguistics and cognitive science (Special issue of the Italian Journal of Linguistics), Rivista di Linguistica, Vol. 20 (1), 2008, pp. 33–53.</mixed-citation></citation-alternatives></ref><ref id="cit40"><label>40</label><citation-alternatives><mixed-citation xml:lang="ru">McCann B. Learned in Translation: Contextualized Word Vectors / B. McCann, J. Bradbury, C. Xiong, R. Socher // 31st Conference on Neural Information Processing Systems, Long Beach. — 2017. — P. 6297–6308. ArXiv preprint: https://arxiv.org/abs/1708.00107</mixed-citation><mixed-citation xml:lang="en">McCann B., Bradbury J., Xiong C., Socher R. Learned in Translation: Contextualized Word Vectors. 31st Conference on Neural Information Processing Systems, 2017, pp. 6297–6308. ArXiv preprint: https://arxiv.org/abs/1708.00107</mixed-citation></citation-alternatives></ref><ref id="cit41"><label>41</label><citation-alternatives><mixed-citation xml:lang="ru">Hedderich M. A. Using Multi-Sense Vector Embeddings for Reverse Dictionaries / M. A. Hedderich, A. Yates, D. Klakow, G. de Melo // Proceedings of the 13th International Conference on Computational Semantics - Long Papers. — 2019. — P. 247–258. https://doi.org/10.18653/v1/W19-0421</mixed-citation><mixed-citation xml:lang="en">Hedderich M. A., Yates A., Klakow D., de Melo G. Using Multi-Sense Vector Embeddings for Reverse Dictionaries. Proceedings of the 13th International Conference on Computational Semantics - Long Papers, 2019, pp. 247–258. https://doi.org/10.18653/v1/W19-0421</mixed-citation></citation-alternatives></ref><ref id="cit42"><label>42</label><citation-alternatives><mixed-citation xml:lang="ru">Ruder S. Neural Transfer Learning for Natural Language Processing / S. Ruder // Ph.D. thesis, National University of Ireland, Galway. — 2019.</mixed-citation><mixed-citation xml:lang="en">Ruder S. 
Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway, 2019.</mixed-citation></citation-alternatives></ref><ref id="cit43"><label>43</label><citation-alternatives><mixed-citation xml:lang="ru">ImageNet: A large-scale hierarchical image database / J. Deng [et al.] // IEEE Conference on Computer Vision and Pattern Recognition. — 2009. — P. 248–255. https://doi.org/10.1109/CVPR.2009.5206848</mixed-citation><mixed-citation xml:lang="en">Deng J., Dong W., Socher R., Li L.-J., Li K., Fei-Fei L. ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. https://doi.org/10.1109/CVPR.2009.5206848</mixed-citation></citation-alternatives></ref><ref id="cit44"><label>44</label><citation-alternatives><mixed-citation xml:lang="ru">Towards Accurate Multi-person Pose Estimation in the Wild / G. Papandreou [et al.] // IEEE Conference on Computer Vision and Pattern Recognition. — 2017. — P. 3711–3719. https://doi.org/10.1109/CVPR.2017.395</mixed-citation><mixed-citation xml:lang="en">Papandreou G., Zhu T., Kanazawa N., Toshev A., Tompson J., Bregler C., Murphy K. Towards Accurate Multi-person Pose Estimation in the Wild. IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3711–3719. https://doi.org/10.1109/CVPR.2017.395</mixed-citation></citation-alternatives></ref><ref id="cit45"><label>45</label><citation-alternatives><mixed-citation xml:lang="ru">He K. Mask R-CNN / K. He, G. Gkioxari, P. Dollár, R. Girshick // IEEE International Conference on Computer Vision. — 2017. — P. 2980–2988. https://doi.org/10.1109/ICCV.2017.322</mixed-citation><mixed-citation xml:lang="en">He K., Gkioxari G., Dollár P., Girshick R. Mask R-CNN. IEEE International Conference on Computer Vision, 2017, pp. 2980–2988. 
https://doi.org/10.1109/ICCV.2017.322</mixed-citation></citation-alternatives></ref><ref id="cit46"><label>46</label><citation-alternatives><mixed-citation xml:lang="ru">Exploring the Limits of Weakly Supervised Pretraining / D. Mahajan [et al.] // European Conference on Computer Vision. — 2018. — P. 181–196. https://doi.org/10.1007/978-3-030-01216-8_12</mixed-citation><mixed-citation xml:lang="en">Mahajan D., Girshick R., Ramanathan V., He K., Paluri M., Li Y., Bharambe A., van der Maaten L. Exploring the Limits of Weakly Supervised Pretraining. European Conference on Computer Vision, 2018, pp. 181–196. https://doi.org/10.1007/978-3-030-01216-8_12</mixed-citation></citation-alternatives></ref><ref id="cit47"><label>47</label><citation-alternatives><mixed-citation xml:lang="ru">Dai A. M. Semi-supervised Sequence Learning / A. M. Dai, Q. V. Le // Proceedings of the 28th International Conference on Neural Information Processing Systems. — 2015. — Vol. 2. — P. 3079–3087. https://doi.org/10.18653/v1/P17-1161</mixed-citation><mixed-citation xml:lang="en">Dai A. M., Le Q. V. Semi-supervised Sequence Learning. Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015, Vol. 2, pp. 3079–3087. https://doi.org/10.18653/v1/P17-1161</mixed-citation></citation-alternatives></ref><ref id="cit48"><label>48</label><citation-alternatives><mixed-citation xml:lang="ru">Peters M. E. Semi-supervised sequence tagging with bidirectional language models / M. E. Peters, W. Ammar, C. Bhagavatula, R. Power // Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. — 2017. — Vol. 1. — P. 1756–1765. ArXiv preprint: https://arxiv.org/abs/1705.00108</mixed-citation><mixed-citation xml:lang="en">Peters M. E., Ammar W., Bhagavatula C., Power R. Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, Vol. 1, pp. 1756–1765. 
ArXiv preprint: https://arxiv.org/abs/1705.00108</mixed-citation></citation-alternatives></ref><ref id="cit49"><label>49</label><citation-alternatives><mixed-citation xml:lang="ru">Howard J. Universal Language Model Fine-tuning for Text Classification / J. Howard, S. Ruder // Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. — 2018. — Vol. 1. — P. 328–339. https://doi.org/10.18653/v1/P18-1031</mixed-citation><mixed-citation xml:lang="en">Howard J., Ruder S. Universal Language Model Fine-tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, Vol. 1, pp. 328–339. https://doi.org/10.18653/v1/P18-1031</mixed-citation></citation-alternatives></ref><ref id="cit50"><label>50</label><citation-alternatives><mixed-citation xml:lang="ru">Deep contextualized word representations / M. E. Peters [et al.] // Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. — 2018. — Vol. 1. — P. 2227–2237. https://doi.org/10.18653/v1/N18-1202</mixed-citation><mixed-citation xml:lang="en">Peters M. E., Neumann M., Iyyer M., Gardner M., Clark C., Lee K., Zettlemoyer L. Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, Vol. 1, pp. 2227–2237. https://doi.org/10.18653/v1/N18-1202</mixed-citation></citation-alternatives></ref><ref id="cit51"><label>51</label><citation-alternatives><mixed-citation xml:lang="ru">Merity S. Pointer Sentinel Mixture Models / S. Merity, C. Xiong, J. Bradbury, R. Socher // 5th International Conference on Learning Representations. — 2017. ArXiv preprint: https://arxiv.org/abs/1609.07843</mixed-citation><mixed-citation xml:lang="en">Merity S., Xiong C., Bradbury J., Socher R. Pointer Sentinel Mixture Models. 
5th International Conference on Learning Representations, 2017. ArXiv preprint: https://arxiv.org/abs/1609.07843</mixed-citation></citation-alternatives></ref><ref id="cit52"><label>52</label><citation-alternatives><mixed-citation xml:lang="ru">Radford A. Improving language understanding with unsupervised learning / A. Radford, K. Narasimhan, T. Salimans, I. Sutskever // Technical report, OpenAI. — 2018. Available at: https://openai.com/blog/language-unsupervised/ (Accessed 10 July 2020)</mixed-citation><mixed-citation xml:lang="en">Radford A., Narasimhan K., Salimans T., Sutskever I. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018. Available at: https://openai.com/blog/language-unsupervised/ (Accessed 10 July 2020)</mixed-citation></citation-alternatives></ref><ref id="cit53"><label>53</label><citation-alternatives><mixed-citation xml:lang="ru">Generating Wikipedia by Summarizing Long Sequences / P. J. Liu [et al.] // 6th International Conference on Learning Representations. — 2018. ArXiv preprint: https://arxiv.org/abs/1801.10198</mixed-citation><mixed-citation xml:lang="en">Liu P. J., Saleh M., Pot E., Goodrich B., Sepassi R., Kaiser L., Shazeer N. Generating Wikipedia by Summarizing Long Sequences. 6th International Conference on Learning Representations, 2018. ArXiv preprint: https://arxiv.org/abs/1801.10198</mixed-citation></citation-alternatives></ref><ref id="cit54"><label>54</label><citation-alternatives><mixed-citation xml:lang="ru">Devlin J. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding / J. Devlin, M.-W. Chang, K. Lee, K. Toutanova // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. — 2019. — Vol. 1. — P. 4171–4186. https://doi.org/10.18653/v1/N19-1423</mixed-citation><mixed-citation xml:lang="en">Devlin J., Chang M.-W., Lee K., Toutanova K. 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, Vol. 1, pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423</mixed-citation></citation-alternatives></ref><ref id="cit55"><label>55</label><citation-alternatives><mixed-citation xml:lang="ru">Taylor W. L. Cloze procedure: A new tool for measuring readability / W. L. Taylor // Journalism Bulletin. — 1953. — Vol. 30 (4). — P. 415–433.</mixed-citation><mixed-citation xml:lang="en">Taylor W. L. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 1953, Vol. 30 (4), pp. 415–433.</mixed-citation></citation-alternatives></ref><ref id="cit56"><label>56</label><citation-alternatives><mixed-citation xml:lang="ru">Aligning books and movies: Towards story-like visual explanations by watching movies and reading books / Y. Zhu [et al.] // Proceedings of the IEEE international conference on computer vision. — 2015. — P. 19–27. https://doi.org/10.1109/ICCV.2015.11</mixed-citation><mixed-citation xml:lang="en">Zhu Y., Kiros R., Zemel R., Salakhutdinov R., Urtasun R., Torralba A., Fidler S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. Proceedings of the IEEE international conference on computer vision, 2015, pp. 19–27. https://doi.org/10.1109/ICCV.2015.11</mixed-citation></citation-alternatives></ref><ref id="cit57"><label>57</label><citation-alternatives><mixed-citation xml:lang="ru">GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding / A. Wang [et al.] // Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. — 2018. — P. 353–355. https://doi.org/10.18653/v1/W18-5446</mixed-citation><mixed-citation xml:lang="en">Wang A., Singh A., Michael J., Hill F., Levy O., Bowman S. R. 
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018, pp. 353–355. https://doi.org/10.18653/v1/W18-5446</mixed-citation></citation-alternatives></ref><ref id="cit58"><label>58</label><citation-alternatives><mixed-citation xml:lang="ru">RoBERTa: A Robustly Optimized BERT Pretraining Approach / Y. Liu [et al.] // ArXiv preprint. — 2019. https://arxiv.org/abs/1907.11692</mixed-citation><mixed-citation xml:lang="en">Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., Levy O., Lewis M., Zettlemoyer L., Stoyanov V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv preprint, 2019. https://arxiv.org/abs/1907.11692</mixed-citation></citation-alternatives></ref><ref id="cit59"><label>59</label><citation-alternatives><mixed-citation xml:lang="ru">ALBERT: A Lite BERT for Self-supervised Learning of Language Representations / Z. Lan [et al.] // 8th International Conference on Learning Representations. — 2020. Available at: https://openreview.net/forum?id=H1eA7AEtvS (Accessed 10 July 2020)</mixed-citation><mixed-citation xml:lang="en">Lan Z., Chen M., Goodman S., Gimpel K., Sharma P., Soricut R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. 8th International Conference on Learning Representations, 2020. Available at: https://openreview.net/forum?id=H1eA7AEtvS (Accessed 10 July 2020)</mixed-citation></citation-alternatives></ref><ref id="cit60"><label>60</label><citation-alternatives><mixed-citation xml:lang="ru">Sanh V. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter / V. Sanh, L. Debut, J. Chaumond, T. Wolf // Conference on Neural Information Processing Systems. — 2019. ArXiv preprint: https://arxiv.org/abs/1910.01108</mixed-citation><mixed-citation xml:lang="en">Sanh V., Debut L., Chaumond J., Wolf T. 
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Conference on Neural Information Processing Systems, 2019. ArXiv preprint: https://arxiv.org/abs/1910.01108</mixed-citation></citation-alternatives></ref><ref id="cit61"><label>61</label><citation-alternatives><mixed-citation xml:lang="ru">Hinton G. Distilling the Knowledge in a Neural Network / G. Hinton, O. Vinyals, J. Dean // Neural Information Processing Systems. Deep Learning and Representation Learning Workshop. — 2015. ArXiv preprint: https://arxiv.org/abs/1503.02531</mixed-citation><mixed-citation xml:lang="en">Hinton G., Vinyals O., Dean J. Distilling the Knowledge in a Neural Network. Neural Information Processing Systems. Deep Learning and Representation Learning Workshop, 2015. ArXiv preprint: https://arxiv.org/abs/1503.02531</mixed-citation></citation-alternatives></ref><ref id="cit62"><label>62</label><citation-alternatives><mixed-citation xml:lang="ru">TinyBERT: Distilling BERT for Natural Language Understanding / X. Jiao [et al.] // ArXiv preprint. — 2019. https://arxiv.org/abs/1909.10351</mixed-citation><mixed-citation xml:lang="en">Jiao X., Yin Y., Shang L., Jiang X., Chen X., Li L., Wang F., Liu Q. TinyBERT: Distilling BERT for Natural Language Understanding. ArXiv preprint, 2019. https://arxiv.org/abs/1909.10351</mixed-citation></citation-alternatives></ref><ref id="cit63"><label>63</label><citation-alternatives><mixed-citation xml:lang="ru">Liu X. Multi-Task Deep Neural Networks for Natural Language Understanding / X. Liu, P. He, W. Chen, J. Gao // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. — 2019. — P. 4487–4496. https://doi.org/10.18653/v1/P19-1441</mixed-citation><mixed-citation xml:lang="en">Liu X., He P., Chen W., Gao J. Multi-Task Deep Neural Networks for Natural Language Understanding. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4487–4496. 
https://doi.org/10.18653/v1/P19-1441</mixed-citation></citation-alternatives></ref><ref id="cit64"><label>64</label><citation-alternatives><mixed-citation xml:lang="ru">Representation learning using multi-task deep neural networks for semantic classification and information retrieval / X. Liu [et al.] // Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. — 2015. — P. 912–921. https://doi.org/10.3115/v1/N15-1092</mixed-citation><mixed-citation xml:lang="en">Liu X., Gao J., He X., Deng L., Duh K., Wang Y.-Y. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 912–921. https://doi.org/10.3115/v1/N15-1092</mixed-citation></citation-alternatives></ref><ref id="cit65"><label>65</label><citation-alternatives><mixed-citation xml:lang="ru">StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding / W. Wang [et al.] // 8th International Conference on Learning Representations. — 2020. Available at: https://openreview.net/forum?id=BJgQ4lSFPH (Accessed 10 July 2020)</mixed-citation><mixed-citation xml:lang="en">Wang W., Bi B., Yan M., Wu C., Xia J., Bao Z., Peng L., Si L. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding. 8th International Conference on Learning Representations, 2020. Available at: https://openreview.net/forum?id=BJgQ4lSFPH (Accessed 10 July 2020)</mixed-citation></citation-alternatives></ref><ref id="cit66"><label>66</label><citation-alternatives><mixed-citation xml:lang="ru">Elman J. L. Finding structure in time / J. L. Elman // Cognitive science. — 1990. — Vol. 14 (2). — P. 179–211.</mixed-citation><mixed-citation xml:lang="en">Elman J. L. Finding structure in time. 
Cognitive science, 1990, Vol. 14 (2), pp. 179–211.</mixed-citation></citation-alternatives></ref><ref id="cit67"><label>67</label><citation-alternatives><mixed-citation xml:lang="ru">BioBERT: a pre-trained biomedical language representation model for biomedical text mining / J. Lee [et al.] // Bioinformatics. — 2020. — Vol. 36 (4). — P. 1234–1240. https://doi.org/10.1093/bioinformatics/btz682</mixed-citation><mixed-citation xml:lang="en">Lee J., Yoon W., Kim S., Kim D., Kim S., So C. H., Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2020, Vol. 36 (4), pp. 1234–1240. https://doi.org/10.1093/bioinformatics/btz682</mixed-citation></citation-alternatives></ref><ref id="cit68"><label>68</label><citation-alternatives><mixed-citation xml:lang="ru">Lu J. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks / J. Lu, D. Batra, D. Parikh, S. Lee // ArXiv preprint. — 2019. https://arxiv.org/abs/1908.02265</mixed-citation><mixed-citation xml:lang="en">Lu J., Batra D., Parikh D., Lee S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. ArXiv preprint, 2019. https://arxiv.org/abs/1908.02265</mixed-citation></citation-alternatives></ref><ref id="cit69"><label>69</label><citation-alternatives><mixed-citation xml:lang="ru">Niven T. Probing Neural Network Comprehension of Natural Language Arguments / T. Niven, H.-Y. Kao // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. — 2019. — P. 4658–4664. https://doi.org/10.18653/v1/P19-1459</mixed-citation><mixed-citation xml:lang="en">Niven T., Kao H.-Y. Probing Neural Network Comprehension of Natural Language Arguments. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4658–4664. 
https://doi.org/10.18653/v1/P19-1459</mixed-citation></citation-alternatives></ref><ref id="cit70"><label>70</label><citation-alternatives><mixed-citation xml:lang="ru">HellaSwag: Can a Machine Really Finish Your Sentence? / R. Zellers [et al.] // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. — 2019. — P. 4791–4800. https://doi.org/10.18653/v1/P19-1472</mixed-citation><mixed-citation xml:lang="en">Zellers R., Holtzman A., Bisk Y., Farhadi A., Choi Y. HellaSwag: Can a Machine Really Finish Your Sentence? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4791–4800. https://doi.org/10.18653/v1/P19-1472</mixed-citation></citation-alternatives></ref><ref id="cit71"><label>71</label><citation-alternatives><mixed-citation xml:lang="ru">McCoy T. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference / T. McCoy, E. Pavlick, T. Linzen // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. — 2019. — P. 3428–3448. https://doi.org/10.18653/v1/P19-1334</mixed-citation><mixed-citation xml:lang="en">McCoy T., Pavlick E., Linzen T. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3428–3448. https://doi.org/10.18653/v1/P19-1334</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
