Informatics

Speech emotion recognition based on LSTM networks with multi-vector attention

https://doi.org/10.37661/1816-0301-2026-23-1-69-87

Abstract

Objectives. To improve the accuracy of speech emotion recognition using recurrent neural network (RNN) models based on Long Short-Term Memory (LSTM).

Methods. The paper proposes a multi-vector attention mechanism for LSTM-based RNNs. This mechanism generalizes classical soft attention and allows the model to analyze different aspects of temporal dependencies simultaneously. The proposed RNN architectures were applied to the task of speech emotion recognition. The input data were sequences of mel-frequency cepstral coefficients (MFCCs), which reflect the time-frequency structure of the speech signal. Experiments were conducted on the publicly available RAVDESS dataset. Bayesian optimization was used for automated hyperparameter tuning of the models.
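The abstract does not give the exact formulation of multi-vector attention, so the following is only a minimal sketch under stated assumptions: each of K hypothetical attention vectors scores the LSTM hidden states, a softmax over time turns the scores into weights, and the K resulting context vectors are concatenated. With K = 1 this reduces to classical soft attention pooling. The function names, the concatenation step, and the toy values are illustrative assumptions, not the authors' implementation; plain Python is used to keep the example dependency-free.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(hidden_states, w):
    """Classical soft attention: one scoring vector w, one context vector."""
    # Score each time step by the dot product of its hidden state with w.
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    # Context vector: attention-weighted sum of hidden states over time.
    return [sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)]

def multi_vector_attention(hidden_states, attention_vectors):
    """Assumed generalization: K scoring vectors, K context vectors concatenated."""
    context = []
    for w in attention_vectors:
        context.extend(soft_attention(hidden_states, w))
    return context

# Toy LSTM outputs: T = 3 time steps, hidden size 2 (illustrative values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # K = 2 hypothetical attention vectors
pooled = multi_vector_attention(H, W)  # length K * hidden size = 4
```

In a trained model the attention vectors would be learned parameters, and the pooled vector would feed the emotion classifier. Each vector is free to emphasize different time steps, which is one plausible reading of "different aspects of temporal dependencies".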

Results. Experiments with LSTM networks of different hidden state dimensions (64, 96, 128) show that the multi-vector attention mechanism yields a statistically significant improvement in unweighted average recall (UAR) of 0.88 to 1.56 %.
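UAR is conventionally defined as the unweighted average recall, that is, the mean of the per-class recall values, so every emotion class counts equally regardless of how many samples it has. The sketch below assumes that standard definition; the function name and the toy labels are illustrative and are not taken from the paper.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall; each class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # Recall of class c = correctly predicted samples of c / all samples of c.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Illustrative imbalanced toy labels (not real experimental data).
y_true = ["angry", "angry", "angry", "angry", "sad", "sad"]
y_pred = ["angry", "angry", "angry", "sad", "sad", "angry"]
uar = unweighted_average_recall(y_true, y_pred)
```

On this toy data the recall is 3/4 for "angry" and 1/2 for "sad", so UAR is 0.625, while plain accuracy is 4/6. The gap shows why UAR is often preferred when the classes are imbalanced.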

Conclusion. The results confirm the effectiveness of the proposed multi-vector attention mechanism in LSTM-based architectures for speech emotion classification.

About the Authors

Daniil V. Krasnoproshin
Belarusian State University of Informatics and Radioelectronics
Belarus

Daniil V. Krasnoproshin, M. Sci. (Eng.), Postgraduate Student, Computer Engineering Department

Brovki St., 6, Minsk, 220013



Maxim I. Vashkevich
Belarusian State University of Informatics and Radioelectronics
Belarus

Maxim I. Vashkevich, Dr. Sci. (Eng.), Assoc. Prof., Professor, Computer Engineering Department

Brovki St., 6, Minsk, 220013



For citations:


Krasnoproshin D.V., Vashkevich M.I. Speech emotion recognition based on LSTM networks with multi-vector attention. Informatics. 2026;23(1):69-87. (In Russ.) https://doi.org/10.37661/1816-0301-2026-23-1-69-87

This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1816-0301 (Print)
ISSN 2617-6963 (Online)