<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">inform</journal-id><journal-title-group><journal-title xml:lang="ru">Информатика</journal-title><trans-title-group xml:lang="en"><trans-title>Informatics</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1816-0301</issn><issn pub-type="epub">2617-6963</issn><publisher><publisher-name>UIIP NASB</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.37661/1816-0301-2026-23-1-69-87</article-id><article-id custom-type="elpub" pub-id-type="custom">inform-1391</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>ОБРАБОТКА СИГНАЛОВ, ИЗОБРАЖЕНИЙ, РЕЧИ, ТЕКСТА И РАСПОЗНАВАНИЕ ОБРАЗОВ</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>SIGNAL, IMAGE, SPEECH, TEXT PROCESSING AND PATTERN RECOGNITION</subject></subj-group></article-categories><title-group><article-title>Распознавание эмоций по речи на основе LSTM-сетей с мультивекторным механизмом внимания</article-title><trans-title-group xml:lang="en"><trans-title>Speech emotion recognition based on LSTM networks with multi-vector attention</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Краснопрошин</surname><given-names>Д. 
В.</given-names></name><name name-style="western" xml:lang="en"><surname>Krasnoproshin</surname><given-names>Daniil V.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Краснопрошин Даниил Вадимович, магистр технических наук, аспирант кафедры электронных вычислительных средств</p><p>ул. П. Бровки, 6, Минск, 220013</p></bio><bio xml:lang="en"><p>Daniil V. Krasnoproshin, M. Sci. (Eng.), Postgraduate Student of Computer Engineering Department</p><p>st. Brovki, 6, Minsk, 220013</p></bio><email xlink:type="simple">daniil.krasnoproshin@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Вашкевич</surname><given-names>М. И.</given-names></name><name name-style="western" xml:lang="en"><surname>Vashkevich</surname><given-names>Maxim I.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Вашкевич Максим Иосифович, доктор технических наук, доцент, профессор кафедры электронных вычислительных средств</p><p>ул. П. Бровки, 6, Минск, 220013</p></bio><bio xml:lang="en"><p>Maxim I. Vashkevich, Dr. Sci. (Eng.), Assoc. Prof., Prof. of Computer Engineering Department</p><p>st. 
Brovki, 6, Minsk, 220013</p></bio><email xlink:type="simple">vashkevich@bsuir.by</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Белорусский государственный университет информатики и радиоэлектроники</institution></aff><aff xml:lang="en"><institution>Belarusian State University of Informatics and Radioelectronics</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2026</year></pub-date><pub-date pub-type="epub"><day>27</day><month>03</month><year>2026</year></pub-date><volume>23</volume><issue>1</issue><fpage>69</fpage><lpage>87</lpage><permissions><copyright-statement>Copyright &#x00A9; Краснопрошин Д.В., Вашкевич М.И., 2026</copyright-statement><copyright-year>2026</copyright-year><copyright-holder xml:lang="ru">Краснопрошин Д.В., Вашкевич М.И.</copyright-holder><copyright-holder xml:lang="en">Krasnoproshin D.V., Vashkevich M.I.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://inf.grid.by/jour/article/view/1391">https://inf.grid.by/jour/article/view/1391</self-uri><abstract><sec><title>Цели</title><p>Цели. Целью исследования является повышение точности распознавания эмоций по речевому сигналу с помощью моделей на основе рекуррентных нейронных сетей (РНС) с долгой краткосрочной памятью.</p></sec><sec><title>Методы</title><p>Методы. В работе предложен мультивекторный механизм внимания для РНС на основе ячеек LSTM. 
Данный механизм представляет собой обобщение классического мягкого внимания и позволяет модели одновременно анализировать различные аспекты временны́х зависимостей. Предложенные архитектуры РНС применены к задаче распознавания эмоций по речевому сигналу. В качестве входных данных использовались последовательности мел-частотных кепстральных коэффициентов, отражающих частотно-временную структуру речевого сигнала. Эксперименты проводились на общедоступном наборе данных RAVDESS. Для автоматизированного подбора оптимальных гиперпараметров моделей использовался метод байесовской оптимизации.</p></sec><sec><title>Результаты</title><p>Результаты. Результаты экспериментов с LSTM-сетями, имеющими различную размерность скрытого состояния (64, 96, 128), показывают, что применение мультивекторного механизма внимания приводит к статистически значимому улучшению среднего значения точности на величину от 0,88 до 1,56 %.</p></sec><sec><title>Заключение</title><p>Заключение. Полученные результаты подтверждают целесообразность использования предложенного механизма мультивекторного внимания в архитектурах LSTM-сетей для задачи классификации эмоций в речи.</p></sec></abstract><trans-abstract xml:lang="en"><sec><title>Objectives</title><p>Objectives. Improvement of speech emotion recognition accuracy using Long Short-Term Memory (LSTM) recurrent neural network (RNN) models.</p></sec><sec><title>Methods</title><p>Methods. The paper proposes a multi-vector attention mechanism for LSTM-based RNNs. This mechanism generalizes the classical soft attention and allows the model to simultaneously analyze different aspects of temporal dependencies. The proposed RNN architectures were applied to the task of speech emotion recognition. Input data consisted of sequences of mel-frequency cepstral coefficients (MFCCs), which reflect the time-frequency structure of the speech signal. Experiments were conducted on the publicly available RAVDESS dataset. 
Bayesian optimization was employed for automated hyperparameter tuning of the models.</p></sec><sec><title>Results</title><p>Results. The experimental results with LSTM networks having different hidden state dimensions (64, 96, 128) demonstrate that the application of the multi-vector attention mechanism leads to a statistically significant improvement in the average accuracy metric (UAR) by 0.88 to 1.56 %.</p></sec><sec><title>Conclusion</title><p>Conclusion. The obtained results confirm the effectiveness of using the proposed multi-vector attention mechanism in LSTM-based architectures for speech emotion classification.</p></sec></trans-abstract><kwd-group xml:lang="ru"><kwd>обработка речи</kwd><kwd>распознавание эмоций</kwd><kwd>глубокое обучение</kwd><kwd>рекуррентные нейронные сети</kwd><kwd>механизм внимания</kwd></kwd-group><kwd-group xml:lang="en"><kwd>speech processing</kwd><kwd>emotion recognition</kwd><kwd>deep learning</kwd><kwd>recurrent neural networks</kwd><kwd>attention mechanism</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">A review of affective computing: From unimodal analysis to multimodal fusion / S. Poria, E. Cambria, R. Bajpai, A. Hussain // Information Fusion. – 2017. – Vol. 37. – Р. 98–125.</mixed-citation><mixed-citation xml:lang="en">Poria S., Cambria E., Bajpai R., Hussain A. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 2017, vol. 37, рр. 98–125.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Multimodal emotion recognition on RAVDESS dataset using transfer learning / C. Luna-Jiménez, D. Griol, Z. Callejas [et al.] // Sensors. – 2021. – Vol. 21. – P. 1–29.</mixed-citation><mixed-citation xml:lang="en">Luna-Jiménez C., Griol D., Callejas Z., Kleinlein R., Montero J. 
M., Fernández-Martínez F. Multimodal emotion recognition on RAVDESS dataset using transfer learning. Sensors, 2021, vol. 21, рр. 1–29.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Mirsamadi, S. Automatic speech emotion recognition using recurrent neural networks with local attention / S. Mirsamadi, E. Barsoum, C. Zhang // Proc. of IEEE Intern. Conf. on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 05–09 Mar. 2017. – New Orleans, 2017. – P. 2227–2231.</mixed-citation><mixed-citation xml:lang="en">Mirsamadi S., Barsoum E., Zhang C. Automatic speech emotion recognition using recurrent neural networks with local attention. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 05–09 March 2017. New Orleans, 2017, рр. 2227–2231.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Краснопрошин, Д. В. Отбор признаков на основе техники переноса обучения для классификации эмоций в речи с помощью полносвязной нейронной сети прямого распространения / Д. В. Краснопрошин, М. И. Вашкевич // Системный анализ и прикладная информатика. – 2025. – № 1. – С. 38–43.</mixed-citation><mixed-citation xml:lang="en">Krasnoproshin D. V., Vashkevich M. I. Transfer learning based feature selection for feedforward neural network for speech emotion classifier. Sistemnyj analiz i prikladnaja informatika [System Analysis and Applied Information Science], 2025, no. 1, рр. 38–43 (In Russ.).</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Краснопрошин, Д. В. Анализ подходов к построению систем распознавания эмоций по речи с использованием методов глубокого обучения / Д. В. Краснопрошин, М. И. Вашкевич // Big Data and Advanced Analytics : сб. науч. ст. XI Междунар. 
науч.-практ. конф., Минск, 23–24 апр. 2025 г. – Мн., 2025. – С. 343–353.</mixed-citation><mixed-citation xml:lang="en">Krasnoproshin D. V., Vashkevich M. I. Analysis of approaches to building speech emotion recognition systems using deep learning methods. Big Data and Advanced Analytics: sbornik nauchnyh statej XI Mezhdunarodnoj nauchno-prakticheskoj konferencii, Minsk, 23–24 aprelja 2025 g. [Big Data and Advanced Analytics: Collection of Scientific Articles of the XI International Scientific and Practical Conference, Minsk, 23–24 April 2025]. Minsk, 2025, рр. 343–353 (In Russ.).</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Dal Rí, F. A. Speech emotion recognition and deep learning: an extensive validation using convolutional neural networks / F. A. Dal Rí, F. C. Ciardi, N. Conci // IEEE Access. – 2023. – Vol. 11. – Р. 116638–116649.</mixed-citation><mixed-citation xml:lang="en">Dal Rí F. A., Ciardi F. C., Conci N. Speech emotion recognition and deep learning: an extensive validation using convolutional neural networks. IEEE Access, 2023, vol. 11, рр. 116638–116649.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Waleed, G. T. Speech emotion recognition on MELD and RAVDESS datasets using CNN / G. T. Waleed, S. H. Shaker // Information. – 2025. – Vol. 16, no. 7. – Р. 518.</mixed-citation><mixed-citation xml:lang="en">Waleed G. T., Shaker S. H. Speech emotion recognition on MELD and RAVDESS datasets using CNN. Information, 2025, vol. 16, no. 7, р. 518.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">PANNs: Large-scale pretrained audio neural networks for audio pattern recognition / Q. Kong, Y. Cao, T. Iqbal [et al.] // IEEE/ACM Transactions on Audio, Speech, and Language Processing. – 2020. – Vol. 28. – Р. 
2880–2894.</mixed-citation><mixed-citation xml:lang="en">Kong Q., Cao Y., Iqbal T., Wang Y., Wang W., Plumbley M. D. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, vol. 28, рр. 2880–2894.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Николенко, С. Глубокое обучение / С. Николенко, А. Кадурин, Е. Архангельская. – СПб. : Питер, 2019. – 480 с.</mixed-citation><mixed-citation xml:lang="en">Nikolenko S., Kadurin А., Archangelskaya Е. Glubokoe obuchenie. Deep Learning. Saint Petersburg, Piter, 2019, 480 p. (In Russ.).</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Hochreiter, S. Long short-term memory / S. Hochreiter, J. Schmidhuber // Neural Computation. – 1997. – Vol. 9, no. 8. – P. 1735–1780.</mixed-citation><mixed-citation xml:lang="en">Hochreiter S., Schmidhuber J. Long short-term memory. Neural Computation, 1997, vol. 9, no. 8, pp. 1735–1780.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Context-aware attention mechanism for speech emotion recognition / G. Ramet, P. N. Garner, M. Baeriswyl, A. Lazaridis // 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 Dec. 2018. – Athens, 2018. – Р. 126–131.</mixed-citation><mixed-citation xml:lang="en">Ramet G., Garner P. N., Baeriswyl M., Lazaridis A. Context-aware attention mechanism for speech emotion recognition. 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018. Athens, 2018, рр. 126–131.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Краснопрошин, Д. В. 
Метод распознавания эмоций в речевом сигнале с использованием машины опорных векторов и надсегментных акустических признаков / Д. В. Краснопрошин, М. И. Вашкевич // Доклады БГУИР. – 2024. – Т. 22, № 3. – С. 93–100.</mixed-citation><mixed-citation xml:lang="en">Krasnoproshin D. V., Vashkevich M. I. Speech emotion recognition method based on support vector machine and suprasegmental acoustic features. Doklady BGUIR [BGUIR Proceedings], 2024, vol. 22, no. 3, pp. 93–100 (In Russ.).</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Bahdanau, D. Neural machine translation by jointly learning to align and translate / D. Bahdanau, K. Cho, Y. Bengio // 3rd Intern. Conf. on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. – San Diego, 2015. – URL: https://arxiv.org/abs/1409.0473 (date of access: 13.11.2025).</mixed-citation><mixed-citation xml:lang="en">Bahdanau D., Cho K., Bengio Y. Neural machine translation by jointly learning to align and translate. 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. San Diego, 2015. Available at: https://arxiv.org/abs/1409.0473 (accessed 13.11.2025).</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Optuna: A next-generation hyperparameter optimization framework / T. Akiba, S. Sano, T. Yanase [et al.] // Proc. of the 25th ACM SIGKDD Intern. Conf. on Knowledge Discovery &amp; Data Mining (KDD'19), Anchorage, AK, USA, 4–8 Aug. 2019. – Anchorage, 2019. – Р. 2623–2631.</mixed-citation><mixed-citation xml:lang="en">Akiba T., Sano S., Yanase T., Ohta T., Koyama M. Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining (KDD'19), Anchorage, AK, USA, 4–8 August 2019. Anchorage, 2019, рр. 
2623–2631.</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Algorithms for hyper-parameter optimization / J. S. Bergstra, R. Bardenet, Y. Bengio, B. Kégl // NIPS'11: Proc. of the 25th Intern. Conf. on Neural Information Processing Systems, Granada, Spain, 12–15 Dec. 2011. – Granada, 2011. – Р. 2546–2554.</mixed-citation><mixed-citation xml:lang="en">Bergstra J. S., Bardenet R., Bengio Y., Kégl B. Algorithms for hyper-parameter optimization. NIPS'11: Proceedings of the 25th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011. Granada, 2011, рр. 2546–2554.</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Spectrogram based multi-task audio classification / Y. Zeng, H. Mao, D. Peng, Z. Yi // Multimedia Tools and Applications. – 2019. – Vol. 78, no. 3. – Р. 3705–3722.</mixed-citation><mixed-citation xml:lang="en">Zeng Y., Mao H., Peng D., Yi Z. Spectrogram based multi-task audio classification. Multimedia Tools and Applications, 2019, vol. 78, no. 3, рр. 3705–3722.</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">A proposal for multimodal emotion recognition using aural transformers and action units on RAVDESS dataset / C. Luna-Jiménez, R. Kleinlein, D. Griol [et al.] // Applied Sciences. – 2022. – Vol. 12, no. 1. – P. 1–23.</mixed-citation><mixed-citation xml:lang="en">Luna-Jiménez C., Kleinlein R., Griol D., Callejas Z., Montero J. M., Fernández-Martínez F. A proposal for multimodal emotion recognition using aural transformers and action units on RAVDESS dataset. Applied Sciences, 2022, vol. 12, no. 1, рр. 
1–23.</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
