<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">inform</journal-id><journal-title-group><journal-title xml:lang="ru">Информатика</journal-title><trans-title-group xml:lang="en"><trans-title>Informatics</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1816-0301</issn><issn pub-type="epub">2617-6963</issn><publisher><publisher-name>UIIP NASB</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.37661/1816-0301-2020-17-2-36-43</article-id><article-id custom-type="elpub" pub-id-type="custom">inform-968</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>ОБРАБОТКА СИГНАЛОВ, ИЗОБРАЖЕНИЙ, РЕЧИ, ТЕКСТА И РАСПОЗНАВАНИЕ ОБРАЗОВ</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>SIGNAL, IMAGE, SPEECH, TEXT PROCESSING AND PATTERN RECOGNITION</subject></subj-group></article-categories><title-group><article-title>Выделение речевой активности на фоне шумов при помощи компактной сверточной нейронной сети</article-title><trans-title-group xml:lang="en"><trans-title>Voice activity detection in noisy conditions using tiny convolutional neural network</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Вашкевич</surname><given-names>Г. С.</given-names></name><name name-style="western" xml:lang="en"><surname>Vashkevich</surname><given-names>R. 
S.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Вашкевич Григорий Сергеевич, магистр технических наук, аспирант кафедры ЭВС</p><p>Минск</p></bio><bio xml:lang="en"><p>Ryhor S. Vashkevich, M. Sci. (Eng.), Postgraduate Student of the Department of EMU</p><p>Minsk</p></bio><email xlink:type="simple">ryhorv@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Азаров</surname><given-names>И. С.</given-names></name><name name-style="western" xml:lang="en"><surname>Azarov</surname><given-names>E. S.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Азаров Илья Сергеевич, доктор технических наук, доцент, заведующий кафедрой ЭВС</p><p>Минск</p></bio><bio xml:lang="en"><p>Elias S. Azarov, Dr. Sci. (Eng.), Associate Professor, Head of the Department of EMU</p><p>Minsk</p></bio><email xlink:type="simple">azarov@bsuir.by</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Белорусский государственный университет информатики и радиоэлектроники</institution></aff><aff xml:lang="en"><institution>Belarusian State University of Informatics and Radioelectronics</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2020</year></pub-date><pub-date pub-type="epub"><day>22</day><month>04</month><year>2020</year></pub-date><volume>17</volume><issue>2</issue><fpage>36</fpage><lpage>43</lpage><permissions><copyright-statement>Copyright &#x00A9; Вашкевич Г.С., Азаров И.С., 2020</copyright-statement><copyright-year>2020</copyright-year><copyright-holder xml:lang="ru">Вашкевич Г.С., Азаров И.С.</copyright-holder><copyright-holder xml:lang="en">Vashkevich R.S., Azarov E.S.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" 
xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://inf.grid.by/jour/article/view/968">https://inf.grid.by/jour/article/view/968</self-uri><abstract><p>Исследуется задача выделения речевой активности из зашумленного звукового сигнала. Предлагается компактная модель сверточной нейронной сети, которая имеет всего 385 параметров. Модель нетребовательна к вычислительным ресурсам, что позволяет использовать ее в рамках концепции Интернета вещей для портативных устройств с низким энергопотреблением. В то же время эта модель обеспечивает высокую точность определения речевой активности на уровне лучших современных аналогов. Указанные полезные свойства достигаются путем применения специального сверточного слоя, учитывающего гармоническую структуру вокализованной речи и устраняющего избыточность модели за счет инвариантности к изменениям частоты основного тона. В рамках экспериментов производительность модели оценивалась в различных шумовых условиях для разных соотношений сигнала и шума. Результаты экспериментов показали, что предложенная модель обеспечивает более высокую точность определения речевой активности по сравнению с моделью, представленной компанией Google в фреймворке WebRTC.</p></abstract><trans-abstract xml:lang="en"><p>The paper investigates the problem of voice activity detection in a noisy audio signal. An extremely compact convolutional neural network is proposed. The model has only 385 trainable parameters. The proposed model does not require substantial computational resources, which allows it to be used within the Internet of Things concept on compact low-power devices. 
At the same time, the model achieves state-of-the-art voice activity detection accuracy. These properties are achieved by using a special convolutional layer that takes into account the harmonic structure of voiced speech. This layer also eliminates redundancy in the model because it is invariant to changes in fundamental frequency. The model's performance is evaluated under various noise conditions at different signal-to-noise ratios. The results show that the proposed model provides higher accuracy than the voice activity detection model from Google's WebRTC framework.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>детектор речевой активности</kwd><kwd>гармонический сигнал</kwd><kwd>сверточная нейронная сеть</kwd><kwd>частота основного тона</kwd><kwd>обработка речи</kwd></kwd-group><kwd-group xml:lang="en"><kwd>voice activity detector</kwd><kwd>harmonic signal</kwd><kwd>convolutional neural network</kwd><kwd>pitch</kwd><kwd>speech processing</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Yoo I.-C., Lim H., Yook D. Formant-based robust voice activity detection. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2015, vol. 23, no. 12, pp. 2238–2245. https://doi.org/10.1109/TASLP.2015.2476762</mixed-citation><mixed-citation xml:lang="en">Yoo I.-C., Lim H., Yook D. Formant-based robust voice activity detection. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2015, vol. 23, no. 12, pp. 2238–2245. https://doi.org/10.1109/TASLP.2015.2476762</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Pang J. Spectrum energy based voice activity detection. The 7th IEEE Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, 9–11 January 2017. Las Vegas, 2017, pp. 1–5. 
https://doi.org/10.1109/CCWC.2017.7868454</mixed-citation><mixed-citation xml:lang="en">Pang J. Spectrum energy based voice activity detection. The 7th IEEE Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, 9–11 January 2017. Las Vegas, 2017, pp. 1–5. https://doi.org/10.1109/CCWC.2017.7868454</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Kinnunen T., Chernenko E., Tuononen M., Fränti P., Li H. Voice activity detection using MFCC features and support vector machine. The 12th International Conference on Speech and Computer (SPECOM07), Moscow, Russia, 15–18 October 2007. Moscow, 2007, vol. 2, pp. 556–561.</mixed-citation><mixed-citation xml:lang="en">Kinnunen T., Chernenko E., Tuononen M., Fränti P., Li H. Voice activity detection using MFCC features and support vector machine. The 12th International Conference on Speech and Computer (SPECOM07), Moscow, Russia, 15–18 October 2007. Moscow, 2007, vol. 2, pp. 556–561.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Zazo R., Sainath T. N., Simko G., Parada C. Feature learning with raw-waveform CLDNNs for voice activity detection. 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, 8–12 September 2016. San Francisco, 2016, pp. 3668–3672. https://doi.org/10.21437/Interspeech.2016-268</mixed-citation><mixed-citation xml:lang="en">Zazo R., Sainath T. N., Simko G., Parada C. Feature learning with raw-waveform CLDNNs for voice activity detection. 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, 8–12 September 2016. San Francisco, 2016, pp. 3668–3672. 
https://doi.org/10.21437/Interspeech.2016-268</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Zhang X., Wu J. Denoising deep neural networks based voice activity detection. International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013. Vancouver, 2013, pp. 853–857. https://doi.org/10.1109/ICASSP.2013.6637769</mixed-citation><mixed-citation xml:lang="en">Zhang X., Wu J. Denoising deep neural networks based voice activity detection. International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013. Vancouver, 2013, pp. 853–857. https://doi.org/10.1109/ICASSP.2013.6637769</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Hughes T., Mierle K. Recurrent neural networks for voice activity detection. International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013. Vancouver, 2013, pp. 7378–7382. https://doi.org/10.1109/ICASSP.2013.6639096</mixed-citation><mixed-citation xml:lang="en">Hughes T., Mierle K. Recurrent neural networks for voice activity detection. International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013. Vancouver, 2013, pp. 7378–7382. https://doi.org/10.1109/ICASSP.2013.6639096</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Eyben F., Weninger F., Squartini S., Schuller B. Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies. International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013. Vancouver, 2013, pp. 483–487. 
https://doi.org/10.1109/ICASSP.2013.6637694</mixed-citation><mixed-citation xml:lang="en">Eyben F., Weninger F., Squartini S., Schuller B. Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies. International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013. Vancouver, 2013, pp. 483–487. https://doi.org/10.1109/ICASSP.2013.6637694</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Wang Q., Du J., Bao X., Wang Z.-R., Dai L.-R., Lee C.-H. A universal VAD based on jointly trained deep neural networks. 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015. Dresden, 2015, pp. 2282–2286.</mixed-citation><mixed-citation xml:lang="en">Wang Q., Du J., Bao X., Wang Z.-R., Dai L.-R., Lee C.-H. A universal VAD based on jointly trained deep neural networks. 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015. Dresden, 2015, pp. 2282–2286.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Ryant N., Liberman M., Yuan J. Speech activity detection on YouTube using deep neural networks. 14th Annual Conference of the International Speech Communication Association, Lyon, France, 25–29 August 2013. Lyon, 2013, pp. 728–731.</mixed-citation><mixed-citation xml:lang="en">Ryant N., Liberman M., Yuan J. Speech activity detection on YouTube using deep neural networks. 14th Annual Conference of the International Speech Communication Association, Lyon, France, 25–29 August 2013. Lyon, 2013, pp. 728–731.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Snyder D., Chen G., Povey D. Musan: a Music, Speech, and Noise Corpus, 2015. 
Available at: https://arxiv.org/abs/1510.08484 (accessed 20.10.2019).</mixed-citation><mixed-citation xml:lang="en">Snyder D., Chen G., Povey D. Musan: a Music, Speech, and Noise Corpus, 2015. Available at: https://arxiv.org/abs/1510.08484 (accessed 20.10.2019).</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Kasi K., Zahorian S. A. Yet another algorithm for pitch tracking. International Conference on Acoustics, Speech, and Signal Processing, Orlando, 13–17 May 2002. Orlando, 2002, vol. 1, pp. 361–364. https://doi.org/10.1109/ICASSP.2002.5743729</mixed-citation><mixed-citation xml:lang="en">Kasi K., Zahorian S. A. Yet another algorithm for pitch tracking. International Conference on Acoustics, Speech, and Signal Processing, Orlando, 13–17 May 2002. Orlando, 2002, vol. 1, pp. 361–364. https://doi.org/10.1109/ICASSP.2002.5743729</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Kingma D. P., Ba J. Adam: a Method for Stochastic Optimization, 2014. Available at: https://arxiv.org/abs/1412.6980 (accessed 20.10.2019).</mixed-citation><mixed-citation xml:lang="en">Kingma D. P., Ba J. Adam: a Method for Stochastic Optimization, 2014. Available at: https://arxiv.org/abs/1412.6980 (accessed 20.10.2019).</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
