Language modeling and bidirectional coders representations: an overview of key technologies
https://doi.org/10.37661/1816-0301-2020-17-4-61-72
Abstract
About the Author
D. I. Kachkou, Belarus
Dzmitry I. Kachkou, Postgraduate Student at the Department of Multiprocessor Systems and Networks, Faculty of Applied Mathematics and Informatics
Minsk
For citations:
Kachkou D.I. Language modeling and bidirectional coders representations: an overview of key technologies. Informatics. 2020;17(4):61-72. (In Russ.) https://doi.org/10.37661/1816-0301-2020-17-4-61-72