Informatics

ISSN 1816-0301, 2617-6963

UIIP NASB

10.37661/1816-0301-2026-23-1-69-87

Research Article

SIGNAL, IMAGE, SPEECH, TEXT PROCESSING AND PATTERN RECOGNITION

Speech emotion recognition based on LSTM networks with multi-vector attention

Daniil V. Krasnoproshin, M. Sci. (Eng.), Postgraduate Student, Computer Engineering Department

P. Brovki St., 6, Minsk, 220013

daniil.krasnoproshin@gmail.com

Maxim I. Vashkevich, Dr. Sci. (Eng.), Assoc. Prof., Professor, Computer Engineering Department

P. Brovki St., 6, Minsk, 220013

vashkevich@bsuir.by

Belarusian State University of Informatics and Radioelectronics

2026, vol. 23, no. 1, pp. 69–87

27.03.2026

© 2026 Krasnoproshin D.V., Vashkevich M.I.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://inf.grid.by/jour/article/view/1391

Objectives

The study aims to improve the accuracy of speech emotion recognition using recurrent neural network (RNN) models based on long short-term memory (LSTM) cells.

Methods

The paper proposes a multi-vector attention mechanism for RNNs built on LSTM cells. The mechanism generalizes classical soft attention and lets the model analyze several aspects of temporal dependencies simultaneously. The proposed RNN architectures were applied to speech emotion recognition. The input data were sequences of mel-frequency cepstral coefficients (MFCCs), which capture the time-frequency structure of the speech signal. Experiments were conducted on the publicly available RAVDESS dataset. Bayesian optimization was used to tune the model hyperparameters automatically.
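As an illustration only, the sketch below shows one plausible reading of "multi-vector" attention pooling over a sequence of LSTM hidden states: each of K scoring vectors produces its own softmax weights and its own context vector, and the contexts are concatenated. The abstract does not give the authors' formulation, so the dot-product scoring, the concatenation step, and all names (`soft_attention`, `multi_vector_attention`, `hidden`, `ws`) are assumptions. With a single scoring vector the function reduces to ordinary soft attention, which is the sense in which such a scheme generalizes it.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(hidden: List[Vector], w: Vector) -> Vector:
    """Classical soft attention pooling with one scoring vector w.

    hidden: T hidden states, each of dimension d (hypothetical LSTM outputs).
    Returns the weighted sum of hidden states (dimension d).
    """
    # Assumption: dot-product scoring e_t = w . h_t
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in hidden]
    alphas = softmax(scores)
    dim = len(hidden[0])
    return [sum(a * h[j] for a, h in zip(alphas, hidden)) for j in range(dim)]


def multi_vector_attention(hidden: List[Vector], ws: List[Vector]) -> Vector:
    """Assumed multi-vector variant: one context per scoring vector, concatenated."""
    context: Vector = []
    for w in ws:
        context.extend(soft_attention(hidden, w))
    return context


# Toy input: T = 3 time steps, d = 2, K = 2 scoring vectors.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ws = [[0.0, 0.0], [5.0, -5.0]]
ctx = multi_vector_attention(hidden, ws)  # length K * d = 4
```

In this toy input the zero scoring vector weights all time steps equally, while the second vector concentrates almost all weight on the first time step, so the two halves of `ctx` summarize the sequence differently.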

Results

Experiments with LSTM networks of different hidden-state dimensions (64, 96, 128) show that the multi-vector attention mechanism yields a statistically significant improvement in the mean value of the accuracy metric (UAR) of 0.88 to 1.56 %.
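UAR commonly stands for unweighted average recall: the recall of each class is computed separately and the per-class values are averaged with equal weight, so that frequent classes do not dominate the score. The sketch below assumes that standard definition; the abstract does not say how the authors compute the metric, and the function name and toy labels are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def unweighted_average_recall(y_true: Sequence[Hashable],
                              y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recalls, each class weighted equally."""
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy labels (not from RAVDESS): recalls are 2/3 (angry), 1/1 (happy), 1/2 (sad).
y_true = ["angry", "angry", "angry", "happy", "sad", "sad"]
y_pred = ["angry", "angry", "sad", "happy", "sad", "happy"]
uar = unweighted_average_recall(y_true, y_pred)  # 13/18, about 0.722
```

For these toy labels plain accuracy is 4/6, about 0.667, which shows how UAR can differ from accuracy when the classes are imbalanced.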

Conclusion

The results confirm that the proposed multi-vector attention mechanism is effective in LSTM-based architectures for speech emotion classification.

Keywords: speech processing, emotion recognition, deep learning, recurrent neural networks, attention mechanism

The authors declare no conflicts of interest.