Language modeling and bidirectional encoder representations: an overview of key technologies
https://doi.org/10.37661/1816-0301-2020-17-4-61-72
Abstract
This paper presents an overview of the natural language processing technologies that underlie BERT (Bidirectional Encoder Representations from Transformers), a language model from Google that achieves strong results on a whole class of natural language understanding tasks. Two key ideas implemented in BERT are transfer learning and the attention mechanism. The model is pre-trained on several tasks over a large corpus of unlabeled data, and the language regularities it discovers allow it to be fine-tuned efficiently for a specific text processing problem. The underlying Transformer architecture is attention-based, that is, it estimates the relationships between the tokens of the input data. The article notes the strengths and weaknesses of BERT and directions for further improvement of the model.
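To make the attention idea concrete: the core operation of the Transformer encoder used in BERT is scaled dot-product attention, which scores every pair of input tokens and mixes token representations according to those scores. Below is a minimal NumPy sketch of that operation; it is an illustration added for this overview, not code from the article, and all names in it are chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(q k^T / sqrt(d_k)) v over a sequence of token vectors.

    q, k, v: arrays of shape (seq_len, d_k) holding the query, key and
    value vectors of every input token.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # pairwise token affinities
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v                              # weighted mix of value vectors

# Toy self-attention: three tokens with four-dimensional representations,
# and q = k = v = x, as in an encoder layer before the learned projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(x, x, x).shape)  # (3, 4)
```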
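The transfer-learning workflow the abstract describes (pre-train on unlabeled text, then fine-tune on a specific labeled task) can be sketched, for example, with the Hugging Face transformers library. This is a hypothetical usage sketch, assuming that library and PyTorch are installed and using a toy two-example batch; it is not the training setup from the article.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a BERT checkpoint pre-trained on unlabeled text (masked-token and
# next-sentence objectives) and attach a fresh classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Hypothetical toy batch standing in for a downstream labeled dataset.
texts = ["the movie was great", "a dull, lifeless plot"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

# One fine-tuning step: the pre-trained encoder weights are only nudged,
# while the new head learns the task-specific mapping.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
optimizer.zero_grad()
loss = model(**batch, labels=labels).loss  # cross-entropy on the head's logits
loss.backward()
optimizer.step()
```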
About the author
D. I. Kachkou, Belarus
Kachkou Dzmitry I., postgraduate student at the Department of Multiprocessor Systems and Networks, Faculty of Applied Mathematics and Computer Science
Minsk
For citation (in Russian):
Качков Д.И. Моделирование языка и двунаправленные представления кодировщиков: обзор ключевых технологий. Информатика. 2020;17(4):61-72. https://doi.org/10.37661/1816-0301-2020-17-4-61-72
For citation:
Kachkou D.I. Language modeling and bidirectional coders representations: an overview of key technologies. Informatics. 2020;17(4):61-72. (In Russ.) https://doi.org/10.37661/1816-0301-2020-17-4-61-72