Informatics

Language modeling and bidirectional coders representations: an overview of key technologies

https://doi.org/10.37661/1816-0301-2020-17-4-61-72

Abstract

The article surveys the development of natural language processing technologies that form the basis of BERT (Bidirectional Encoder Representations from Transformers), a language model from Google that achieves high results across a broad class of natural language understanding problems. Two key ideas implemented in BERT are knowledge transfer and the attention mechanism. The model is pre-trained on two tasks over a large unlabeled data set and can reuse the identified language patterns for effective training on a specific text processing problem. The Transformer architecture is based on the attention mechanism, i.e. it evaluates the relationships between input tokens. In addition, the article notes the strengths and weaknesses of BERT and directions for further improvement of the model.
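To make the attention idea mentioned in the abstract concrete, below is a minimal NumPy sketch of scaled dot-product self-attention, the operation at the core of the Transformer architecture. It is an illustration only, not code from the article; all names, shapes and sizes are assumptions chosen for the example.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Single-head scaled dot-product self-attention (illustrative sketch).
    # X: (seq_len, d_model) token embeddings; W_q, W_k, W_v: (d_model, d_k) projection matrices.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # pairwise token-to-token relevance scores
    weights = softmax(scores, axis=-1)           # each row attends over all input tokens
    return weights @ V                           # every output mixes information from the whole sequence

# Toy usage: 4 tokens with 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)    # (4, 8)

In a full Transformer this operation is applied with multiple heads and combined with positional information and feed-forward layers, but the pairwise weighting shown here is what lets every output token draw on every input token in a single step.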

About the Author

Dzmitry I. Kachkou, Postgraduate Student, Department of Multiprocessor Systems and Networks, Faculty of Applied Mathematics and Informatics, Belarusian State University, Minsk, Belarus



References

1. Cho K., van Merriënboer B., Gulcehre C., Bahdanau D., Bougares F., Schwenk H., Bengio Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1724-1734. https://doi.org/10.3115/v1/D14-1179

2. Sutskever I., Vinyals O., Le Q. V. Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems, 2014, pp. 3104–3112. ArXiv preprint: https://arxiv.org/abs/1409.3215

3. Serban I. V., Lowe R., Charlin L., Pineau J. Generative Deep Neural Networks for Dialogue: A Short Review. Advances in Neural Information Processing Systems, Workshop on Learning Methods for Dialogue, 2016. ArXiv preprint: https://arxiv.org/abs/1611.06216

4. Vinyals O., Toshev A., Bengio S., Erhan D. Show and tell: A neural image caption generator. Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164. https://doi.org/10.1109/CVPR.2015.7298935

5. Loyola P., Marrese-Taylor E., Matsuo Y. A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, Vol. 2, pp. 287-292. https://doi.org/10.18653/v1/P17-2045

6. Lebret R., Grangier D., Auli M. Neural Text Generation from Structured Data with Application to the Biography Domain. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1203–1213. https://doi.org/10.18653/v1/D16-1128

7. Nikolenko S., Kadurin A., Arhangelskaja E. Glubokoe obuchenie [Deep Learning]. Saint Petersburg: Piter, 2020, 480 p.

8. Bahdanau D., Cho K., Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate. International Conference on Learning Representations, 2015. ArXiv preprint: https://arxiv.org/abs/1409.0473

9. Schuster M., Paliwal K. K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997, Vol. 45 (11), pp. 2673-2681. https://doi.org/10.1109/78.650093

10. Luong T., Pham H., Manning C. D. Effective Approaches to Attention-based Neural Machine Translation. Proceedings of EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412-1421. https://doi.org/10.18653/v1/D15-1166

11. Chung J., Cho K., Bengio Y. A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, Vol. 1, pp. 1693–1703. https://doi.org/10.18653/v1/P16-1160

12. Rush A., Chopra S., Weston J. A Neural Attention Model for Abstractive Sentence Summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 379–389. https://doi.org/10.18653/v1/D15-1044

13. Chorowski J., Bahdanau D., Serdyuk D., Cho K., Bengio Y. Attention-Based Models for Speech Recognition. Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015, Vol. 1, pp. 577–585. ArXiv preprint: https://arxiv.org/abs/1506.07503

14. Chan W., Jaitly N., Le Q. V., Vinyals O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 4960-4964. https://doi.org/10.1109/ICASSP.2016.7472621

15. Hermann K. M., Kočiský T., Grefenstette E., Espeholt L., Kay W., Suleyman M., Blunsom P. Teaching Machines to Read and Comprehend. Advances in Neural Information Processing Systems 28: 29th Annual Conference on Neural Information Processing Systems, 2015, pp. 1693-1701. ArXiv preprint: https://arxiv.org/abs/1506.03340

16. Wu Y., Schuster M., Chen Z., Le Q. V., Norouzi M., Macherey W., Krikun M., Cao Y., Gao Q., Macherey K., Klingner J., Shah A., Johnson M., Liu X., Kaiser Ł., Gouws S., Kato Y., Kudo T., Kazawa H., Stevens K., Kurian G., Patil N., Wang W., Young C., Smith J., Riesa J., Rudnick A., Vinyals O., Corrado G., Hughes M., Dean J. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. ArXiv preprint, 2016. https://arxiv.org/abs/1609.08144.

17. Hochreiter S., Schmidhuber J. Long short-term memory. Neural Computation, 1997, Vol. 9 (8), pp. 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

18. Cho K., van Merrienboer B., Bahdanau D., Bengio Y. On the properties of neural machine translation: Encoder-decoder approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014, pp. 103–111. https://doi.org/10.3115/v1/W14-4012

19. Martin E., Cundy C. Parallelizing Linear Recurrent Neural Nets Over Sequence Length. International Conference on Learning Representations, 2018. ArXiv preprint: https://arxiv.org/abs/1709.04057

20. Kalchbrenner N., Espeholt L., Simonyan K., van den Oord A., Graves A., Kavukcuoglu K. Neural machine translation in linear time. ArXiv preprint, 2016. https://arxiv.org/abs/1610.10099

21. Gehring J., Auli M., Grangier D., Yarats D., Dauphin Y. N. Convolutional sequence to sequence learning. Proceedings of the 34th International Conference on Machine Learning, 2017, Vol. 70, pp. 1243–1252. ArXiv preprint: https://arxiv.org/abs/1705.03122

22. LeCun Y., Bottou L., Bengio Y., Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, Vol. 86 (11), pp. 2278–2324. https://doi.org/10.1109/5.726791

23. Parikh A. P., Täckström O., Das D., Uszkoreit J. A Decomposable Attention Model for Natural Language Inference. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2249–2255. https://doi.org/10.18653/v1/D16-1244

24. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser L., Polosukhin I. Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6000–6010. ArXiv preprint: https://arxiv.org/abs/1706.03762

25. Mitkov R. Anaphora Resolution: The State of the Art. Paper based on the COLING'98/ACL'98 tutorial on anaphora resolution, University of Wolverhampton, 1999.

26. Ba J. L., Kiros J. R., Hinton G. E. Layer normalization. ArXiv preprint, 2016. https://arxiv.org/abs/1607.06450.

27. Li N., Liu S., Liu Y., Zhao S., Liu M., Zhou M. Neural Speech Synthesis with Transformer Network. The AAAI Conference on Artificial Intelligence, 2019. ArXiv preprint: https://arxiv.org/abs/1809.08895

28. Khandelwal U., Clark K., Jurafsky D., Kaiser Ł. Sample Efficient Text Summarization Using a Single Pre-Trained Transformer. ArXiv preprint, 2019. https://arxiv.org/abs/1905.08836

29. Vlasov V., Mosig J. E. M., Nichol A. Dialogue Transformers. ArXiv preprint, 2019. https://arxiv.org/abs/1910.00486

30. Griffith K., Kalita J. Solving Arithmetic Word Problems Automatically Using Transformer and Unambiguous Representations. International Conference on Computational Science and Computational Intelligence, 2019, pp. 526-532. https://doi.org/10.1109/CSCI49370.2019.00101

31. Kang W.-C., McAuley J. Self-Attentive Sequential Recommendation. IEEE International Conference on Data Mining, 2018, pp. 197-206. https://doi.org/10.1109/ICDM.2018.00035

32. Huang C.-Z. A., Vaswani A., Uszkoreit J., Shazeer N., Simon I., Hawthorne C., Dai A. M., Hoffman M. D., Dinculescu M., Eck D. Music Transformer. ArXiv preprint, 2018. https://arxiv.org/abs/1809.04281

33. Dehghani M., Gouws S., Vinyals O., Uszkoreit J., Kaiser Ł. Universal Transformers. 7th International Conference on Learning Representations, 2019. ArXiv preprint: https://arxiv.org/abs/1807.03819

34. Dai Z., Yang Z., Yang Y., Carbonell J., Le Q. V., Salakhutdinov R. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2978–2988. https://doi.org/10.18653/v1/P19-1285

35. So D. R., Liang C., Le Q. V. The Evolved Transformer. Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 5877-5886. ArXiv preprint: https://arxiv.org/abs/1901.11117

36. Zhao C., Xiong C., Rosset C., Song X., Bennett P., Tiwary S. Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention. 8th International Conference on Learning Representations, 2020. Available at: https://openreview.net/forum?id=r1eIiCNYwS (Accessed 10 July 2020)

37. Mikolov T., Sutskever I., Chen K., Corrado G., Dean J. Distributed Representations of Words and Phrases and their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, 2013, Vol. 2, pp. 3111–3119. ArXiv preprint: https://arxiv.org/abs/1310.4546

38. Pennington J., Socher R., Manning C. D. Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532–1543. https://doi.org/10.3115/v1/D14-1162

39. Sahlgren M. The Distributional Hypothesis. From context to meaning. Distributional models of the lexicon in linguistics and cognitive science (Special issue of the Italian Journal of Linguistics), Rivista di Linguistica, Vol. 20 (1), 2008, pp. 33–53.

40. McCann B., Bradbury J., Xiong C., Socher R. Learned in Translation: Contextualized Word Vectors. 31st Conference on Neural Information Processing Systems, 2017, pp. 6297–6308. ArXiv preprint: https://arxiv.org/abs/1708.00107

41. Hedderich M. A., Yates A., Klakow D., de Melo G. Using Multi-Sense Vector Embeddings for Reverse Dictionaries. Proceedings of the 13th International Conference on Computational Semantics - Long Papers, 2019, pp. 247–258. https://doi.org/10.18653/v1/W19-0421

42. Ruder S. Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway, 2019.

43. Deng J., Dong W., Socher R., Li L.-J., Li K., Fei-Fei L. ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. https://doi.org/10.1109/CVPR.2009.5206848

44. Papandreou G., Zhu T., Kanazawa N., Toshev A., Tompson J., Bregler C., Murphy K. Towards Accurate Multi-person Pose Estimation in the Wild. IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3711-3719. https://doi.org/10.1109/CVPR.2017.395

45. He K., Gkioxari G., Dollár P., Girshick R. Mask R-CNN. IEEE International Conference on Computer Vision, 2017, pp. 2980-2988. https://doi.org/10.1109/ICCV.2017.322

46. Mahajan D., Girshick R., Ramanathan V., He K., Paluri M., Li Y., Bharambe A., van der Maaten L. Exploring the Limits of Weakly Supervised Pretraining. European Conference on Computer Vision, 2018, pp. 181–196. https://doi.org/10.1007/978-3-030-01216-8_12

47. Dai A. M., Le Q. V. Semi-supervised Sequence Learning. Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015, Vol. 2, pp. 3079–3087. ArXiv preprint: https://arxiv.org/abs/1511.01432

48. Peters M. E., Ammar W., Bhagavatula C., Power R. Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, Vol. 1, pp. 1756-1765. ArXiv preprint: https://arxiv.org/abs/1705.00108

49. Howard J., Ruder S. Universal Language Model Fine-tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, Vol. 1, pp. 328–339. https://doi.org/10.18653/v1/P18-1031

50. Peters M. E., Neumann M., Iyyer M., Gardner M., Clark C., Lee K., Zettlemoyer L. Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, Vol. 1, pp. 2227–2237. https://doi.org/10.18653/v1/N18-1202

51. Merity S., Xiong C., Bradbury J., Socher R. Pointer Sentinel Mixture Models. 5th International Conference on Learning Representations, 2017. ArXiv preprint: https://arxiv.org/abs/1609.07843.

52. Radford A., Narasimhan K., Salimans T., Sutskever I. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018. Available at: https://openai.com/blog/language-unsupervised/ (Accessed 10 July 2020)

53. Liu P. J., Saleh M., Pot E., Goodrich B., Sepassi R., Kaiser L., Shazeer N. Generating Wikipedia by Summarizing Long Sequences. 6th International Conference on Learning Representations, 2018. ArXiv preprint: https://arxiv.org/abs/1801.10198

54. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, Vol. 1, pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423

55. Taylor W. L. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 1953, Vol. 30 (4), pp. 415–433.

56. Zhu Y., Kiros R., Zemel R., Salakhutdinov R., Urtasun R., Torralba A., Fidler S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. Proceedings of the IEEE international conference on computer vision, 2015, pp. 19–27. https://doi.org/10.1109/ICCV.2015.11

57. Wang A., Singh A., Michael J., Hill F., Levy O., Bowman S. R. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018, pp. 353–355. https://doi.org/10.18653/v1/W18-5446

58. Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., Levy O., Lewis M., Zettlemoyer L., Stoyanov V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv preprint, 2019. https://arxiv.org/abs/1907.11692

59. Lan Z., Chen M., Goodman S., Gimpel K., Sharma P., Soricut R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. 8th International Conference on Learning Representations, 2020. Available at: https://openreview.net/forum?id=H1eA7AEtvS (Accessed 10 July 2020)

60. Sanh V., Debut L., Chaumond J., Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Conference on Neural Information Processing Systems, 2019. ArXiv preprint: https://arxiv.org/abs/1910.01108

61. Hinton G., Vinyals O., Dean J. Distilling the Knowledge in a Neural Network. Neural Information Processing Systems. Deep Learning and Representation Learning Workshop, 2015. ArXiv preprint: https://arxiv.org/abs/1503.02531

62. Jiao X., Yin Y., Shang L., Jiang X., Chen X., Li L., Wang F., Liu Q. TinyBERT: Distilling BERT for Natural Language Understanding. ArXiv preprint, 2019. https://arxiv.org/abs/1909.10351

63. Liu X., He P., Chen W., Gao J. Multi-Task Deep Neural Networks for Natural Language Understanding. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4487–4496. https://doi.org/10.18653/v1/P19-1441

64. Liu X., Gao J., He X., Deng L., Duh K., Wang Y.-Y. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 912–921. https://doi.org/10.3115/v1/N15-1092

65. Wang W., Bi B., Yan M., Wu C., Xia J., Bao Z., Peng L., Si L. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding. 8th International Conference on Learning Representations, 2020. Available at: https://openreview.net/forum?id=BJgQ4lSFPH (Accessed 10 July 2020)

66. Elman J. L. Finding structure in time. Cognitive science, 1990, Vol. 14 (2), pp. 179–211.

67. Lee J., Yoon W., Kim S., Kim D., Kim S., So C. H., Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2020, Vol. 36 (4), pp. 1234–1240. https://doi.org/10.1093/bioinformatics/btz682

68. Lu J., Batra D., Parikh D., Lee S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. ArXiv preprint, 2019. https://arxiv.org/abs/1908.02265

69. Niven T., Kao H.-Y. Probing Neural Network Comprehension of Natural Language Arguments. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4658–4664. https://doi.org/10.18653/v1/P19-1459

70. Zellers R., Holtzman A., Bisk Y., Farhadi A., Choi Y. HellaSwag: Can a Machine Really Finish Your Sentence? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4791–4800. https://doi.org/10.18653/v1/P19-1472

71. McCoy T., Pavlick E., Linzen T. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3428–3448. https://doi.org/10.18653/v1/P19-1334


For citations:


Kachkou D.I. Language modeling and bidirectional coders representations: an overview of key technologies. Informatics. 2020;17(4):61-72. (In Russ.) https://doi.org/10.37661/1816-0301-2020-17-4-61-72


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1816-0301 (Print)
ISSN 2617-6963 (Online)