Speech emotion recognition based on LSTM networks with multi-vector attention

Daniil V. Krasnoproshin; Maxim I. Vashkevich

doi:10.37661/1816-0301-2026-23-1-69-87

Abstract

Objectives. To improve the accuracy of speech emotion recognition using long short-term memory (LSTM) recurrent neural network (RNN) models.

Methods. The paper proposes a multi-vector attention mechanism for LSTM-based RNNs. This mechanism generalizes the classical soft attention and allows the model to simultaneously analyze different aspects of temporal dependencies. The proposed RNN architectures were applied to the task of speech emotion recognition. Input data consisted of sequences of mel-frequency cepstral coefficients (MFCCs), which reflect the time-frequency structure of the speech signal. Experiments were conducted on the publicly available RAVDESS dataset. Bayesian optimization was employed for automated hyperparameter tuning of the models.
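The abstract does not spell out the mechanism, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each of K attention vectors scores every LSTM hidden state by a dot product, that the scores are normalized over time with a softmax, and that the K weighted sums are concatenated. With K = 1 this reduces to classical soft attention. The function names, the dot-product scoring, and the toy values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_vector_attention(hidden_states, attention_vectors):
    """Pool T hidden states (each of size d) with K attention vectors.

    Assumed formulation (hypothetical, not taken from the paper):
      score[k][t] = <attention_vectors[k], hidden_states[t]>
      alpha[k]    = softmax over t of score[k]
      context     = concatenation over k of sum_t alpha[k][t] * hidden_states[t]
    Returns (context of length K*d, list of K weight vectors of length T).
    """
    d = len(hidden_states[0])
    context, all_weights = [], []
    for w in attention_vectors:
        scores = [sum(h_j * w_j for h_j, w_j in zip(h, w)) for h in hidden_states]
        alphas = softmax(scores)
        pooled = [sum(a * h[j] for a, h in zip(alphas, hidden_states))
                  for j in range(d)]
        context.extend(pooled)
        all_weights.append(alphas)
    return context, all_weights


# Toy example: T = 3 time steps, d = 2 hidden units, K = 2 attention vectors.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
context, weights = multi_vector_attention(H, W)

# A single all-zero attention vector gives uniform weights (mean pooling).
mean_context, mean_weights = multi_vector_attention(H, [[0.0, 0.0]])
```

In this sketch each attention vector can emphasize different time steps, which is one way to interpret "simultaneously analyzing different aspects of temporal dependencies"; in a trained model the attention vectors would be learned parameters rather than fixed values.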

Results. Experiments with LSTM networks of different hidden state dimensions (64, 96, 128) demonstrate that the multi-vector attention mechanism yields a statistically significant improvement in unweighted average recall (UAR) of 0.88 to 1.56 %.
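For readers unfamiliar with the metric, UAR is commonly expanded as unweighted average recall: the recall of each class is computed separately and the values are averaged with equal weight, so frequent classes do not dominate. The sketch below illustrates that standard definition on hypothetical toy labels; it does not reproduce the paper's evaluation code or data.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, each class weighted equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (not data from the paper).
y_true = ["calm", "calm", "calm", "calm", "angry", "angry"]
y_pred = ["calm", "calm", "calm", "calm", "calm", "angry"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the recall is 1.0 for "calm" and 0.5 for "angry", so UAR is 0.75, while plain accuracy is 5/6 because the majority class is weighted more heavily.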

Conclusion. The results confirm the effectiveness of the proposed multi-vector attention mechanism in LSTM-based architectures for speech emotion classification.

Keywords

speech processing, emotion recognition, deep learning, recurrent neural networks, attention mechanism

About the Authors

Daniil V. Krasnoproshin

Belarusian State University of Informatics and Radioelectronics
Belarus

Daniil V. Krasnoproshin, M. Sci. (Eng.), Postgraduate Student, Computer Engineering Department

6 Brovki St., Minsk, 220013

Maxim I. Vashkevich

Belarusian State University of Informatics and Radioelectronics
Belarus

Maxim I. Vashkevich, Dr. Sci. (Eng.), Assoc. Prof., Professor, Computer Engineering Department

6 Brovki St., Minsk, 220013

For citations:

Krasnoproshin D.V., Vashkevich M.I. Speech emotion recognition based on LSTM networks with multi-vector attention. Informatics. 2026;23(1):69-87. (In Russ.) https://doi.org/10.37661/1816-0301-2026-23-1-69-87

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 1816-0301 (Print)
ISSN 2617-6963 (Online)