<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">inform</journal-id><journal-title-group><journal-title xml:lang="ru">Информатика</journal-title><trans-title-group xml:lang="en"><trans-title>Informatics</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1816-0301</issn><issn pub-type="epub">2617-6963</issn><publisher><publisher-name>UIIP NASB</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.37661/1816-0301-2021-18-3-83-96</article-id><article-id custom-type="elpub" pub-id-type="custom">inform-1156</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>ОБРАБОТКА СИГНАЛОВ, ИЗОБРАЖЕНИЙ, РЕЧИ, ТЕКСТА И РАСПОЗНАВАНИЕ ОБРАЗОВ</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>SIGNAL, IMAGE, SPEECH, TEXT PROCESSING AND PATTERN RECOGNITION</subject></subj-group></article-categories><title-group><article-title>Нормализация данных в машинном обучении</article-title><trans-title-group xml:lang="en"><trans-title>Data normalization in machine learning</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Старовойтов</surname><given-names>В. В.</given-names></name><name name-style="western" xml:lang="en"><surname>Starovoitov</surname><given-names>V. V.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Старовойтов Валерий Васильевич - доктор технических наук, профессор, главный научный сотрудник.</p><p>Ул. Сурганова, 6, Минск, 220012</p></bio><bio xml:lang="en"><p>Valery V.
Starovoitov - Dr. Sci. (Eng.), Professor, Chief Researcher.</p><p>6 Surganova St., Minsk, 220012</p></bio><email xlink:type="simple">valerystar@mail.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Голуб</surname><given-names>Ю. И.</given-names></name><name name-style="western" xml:lang="en"><surname>Golub</surname><given-names>Yu. I.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Голуб Юлия Игоревна - кандидат технических наук, доцент, старший научный сотрудник.</p><p>Ул. Сурганова, 6, Минск, 220012</p></bio><bio xml:lang="en"><p>Yuliya I. Golub - Cand. Sci. (Eng.), Associate Professor, Senior Researcher.</p><p>6 Surganova St., Minsk, 220012</p></bio><email xlink:type="simple">6423506@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Объединенный институт проблем информатики, Национальная академия наук Беларуси</institution></aff><aff xml:lang="en"><institution>The United Institute of Informatics Problems, National Academy of Sciences of Belarus</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2021</year></pub-date><pub-date pub-type="epub"><day>30</day><month>09</month><year>2021</year></pub-date><volume>18</volume><issue>3</issue><fpage>83</fpage><lpage>96</lpage><permissions><copyright-statement>Copyright &#x00A9; Старовойтов В.В., Голуб Ю.И., 2021</copyright-statement><copyright-year>2021</copyright-year><copyright-holder xml:lang="ru">Старовойтов В.В., Голуб Ю.И.</copyright-holder><copyright-holder xml:lang="en">Starovoitov V.V., Golub Y.I.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution
4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://inf.grid.by/jour/article/view/1156">https://inf.grid.by/jour/article/view/1156</self-uri><abstract><p>В задачах машинного обучения исходные данные часто заданы в разных единицах измерения и типах шкал. Такие данные следует преобразовывать в единое представление путем их нормализации или стандартизации. В работе показана разница между этими операциями. Систематизированы основные типы шкал, операции над данными, представленными в этих шкалах, и основные варианты нормализации функций. Предложена новая шкала частей и приведены примеры использования нормализации данных для их более корректного анализа.</p><p>На сегодняшний день универсального метода нормализации данных, превосходящего другие методы, не существует, но нормализация исходных данных позволяет повысить точность их классификации. Кластеризацию данных методами, использующими функции расстояния, лучше выполнять после преобразования всех признаков в единую шкалу.</p><p>Результаты классификации и кластеризации разными методами можно сравнивать различными оценочными функциями, которые зачастую имеют разные диапазоны значений. Для выбора наиболее точной функции можно выполнить нормализацию нескольких из них и сравнить оценки в единой шкале.</p><p>Правила разделения признаков древовидных классификаторов инвариантны к шкалам количественных признаков. Они используют только операцию сравнения. Возможно, благодаря этому свойству классификатор типа «случайный лес» в результате многочисленных экспериментов признан одним из лучших при анализе данных разной природы.</p></abstract><trans-abstract xml:lang="en"><p>In machine learning, input data are often given in different units of measurement and types of scales.
A review of the scientific literature shows that input data described in different types of scales and units of measurement should be converted into a single representation by normalization or standardization. The difference between these two operations is shown. The paper systematizes the main types of scales, the operations applicable to data presented in these scales, and the main variants of function normalization. A new scale of parts is proposed, and examples are given of how data normalization enables a more correct analysis. The analysis of publications shows that no universal normalization method outperforms all others, but normalizing the initial data can increase classification accuracy. Clustering by methods that use distance functions is better performed after all features have been converted to a single scale. The results of classification and clustering by different methods can be compared using different scoring functions, which often have different ranges of values. To select the most accurate function, it is reasonable to normalize several of them and compare their estimates on a single scale. The splitting rules of tree-based classifiers are invariant to the scales of quantitative features; they use only the comparison operation.
Perhaps owing to this property, the random forest classifier has been recognized, in numerous experiments, as one of the best classifiers for data of diverse nature.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>классификация объектов</kwd><kwd>кластеризация</kwd><kwd>нормализация данных</kwd><kwd>нормализация функций</kwd><kwd>сигмоида</kwd><kwd>гиперболический тангенс</kwd><kwd>случайный лес</kwd></kwd-group><kwd-group xml:lang="en"><kwd>object classification</kwd><kwd>clustering</kwd><kwd>data normalization</kwd><kwd>function normalization</kwd><kwd>sigmoid</kwd><kwd>hyperbolic tangent</kwd><kwd>random forest</kwd></kwd-group><funding-group><funding-statement xml:lang="ru">Работа частично выполнена в рамках проектов БРФФИ Ф20РА-014 и Ф21ПАКГ-001</funding-statement><funding-statement xml:lang="en">This work was carried out in part within the framework of BRFFR projects F20RA-014 and F21PAKG-001</funding-statement></funding-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Aksoy, S. Feature normalization and likelihood-based similarity measures for image retrieval / S. Aksoy, R. M. Haralick // Pattern Recognition Letters. - 2001. - Vol. 22, no. 5. - P. 563-582.</mixed-citation><mixed-citation xml:lang="en">Aksoy S., Haralick R. M. Feature normalization and likelihood-based similarity measures for image retrieval. Pattern Recognition Letters, 2001, vol. 22, no. 5, pp. 563-582.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Singh, B. Investigating the impact of data normalization on classification performance / B. Singh // Applied Soft Computing J. - 2020. - Vol. 97. - P. 105524.</mixed-citation><mixed-citation xml:lang="en">Singh B. Investigating the impact of data normalization on classification performance. Applied Soft Computing Journal, 2020, vol.
97, p. 105524.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Nayak, S. C. Impact of data normalization on stock index forecasting / S. C. Nayak, B. B. Misra, H. S. Behera // Intern. J. of Computer Information Systems and Industrial Management Applications. - 2014. - Vol. 6. - P. 257-269.</mixed-citation><mixed-citation xml:lang="en">Nayak S. C., Misra B. B., Behera H. S. Impact of data normalization on stock index forecasting. International Journal of Computer Information Systems and Industrial Management Applications, 2014, vol. 6, pp. 257-269.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Naeini, A. A. Assessment of normalization techniques on the accuracy of hyperspectral data clustering / A. A. Naeini, M. Babadi, S. Homayouni // Intern. Archives of the Photogrammetry, Remote Sensing &amp; Spatial Information Sciences. - 2017. - Vol. 42. - P. 27-30.</mixed-citation><mixed-citation xml:lang="en">Naeini A. A., Babadi M., Homayouni S. Assessment of normalization techniques on the accuracy of hyperspectral data clustering. International Archives of the Photogrammetry, Remote Sensing &amp; Spatial Information Sciences, 2017, vol. 42, pp. 27-30.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Stevens, S. S. On the theory of scales of measurement / S. S. Stevens // Science. New Series. - 1946. - Vol. 103, no. 2684. - P. 677-680.</mixed-citation><mixed-citation xml:lang="en">Stevens S. S. On the theory of scales of measurement. Science. New Series, 1946, vol. 103, no. 2684, pp. 677-680.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Орлов, А. И. Теория измерений как часть методов анализа данных / А. И.
Орлов // Социология: методология, методы, математическое моделирование. - 2012. - № 35. - C. 155-174.</mixed-citation><mixed-citation xml:lang="en">Orlov A. I. Measurement theory as part of data analysis methods. Sotsiologiya: metodologiya, metody, matematicheskoe modelirovanie [Sociology: Methodology, Methods, Mathematical Modeling], 2012, no. 35, pp. 155-174 (In Russ.).</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Velleman, P. F. Nominal, ordinal, interval, and ratio typologies are misleading / P. F. Velleman, L. Wilkinson // The American Statistician. - 1993. - Vol. 47, no. 1. - P. 65-72.</mixed-citation><mixed-citation xml:lang="en">Velleman P. F., Wilkinson L. Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 1993, vol. 47, no. 1, pp. 65-72.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Tukey, J. W. Exploratory Data Analysis / J. W. Tukey. - Massachusetts : Addison-Wesley, 1977. -P. 39-49.</mixed-citation><mixed-citation xml:lang="en">Tukey J. W. Exploratory Data Analysis. Massachusetts, Addison-Wesley, 1977, pp. 39-49.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Bruffaerts, C. A generalized boxplot for skewed and heavy-tailed distributions / C. Bruffaerts, V. Verardi, C. Vermandele // Statistics &amp; Probability Letters. - 2014. - Vol. 95. - P. 110-117.</mixed-citation><mixed-citation xml:lang="en">Bruffaerts C., Verardi V., Vermandele C. A generalized boxplot for skewed and heavy-tailed distributions. Statistics &amp; Probability Letters, 2014, vol. 95, pp. 110-117.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Kimber, A. C. 
Exploratory data analysis for possibly censored data from skewed distributions / A. C. Kimber // Applied Statistics. - 1990. - Vol. 39. - P. 21-30.</mixed-citation><mixed-citation xml:lang="en">Kimber A. C. Exploratory data analysis for possibly censored data from skewed distributions. Applied Statistics, 1990, vol. 39, pp. 21-30.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Carling, K. Resistant outlier rules and the non-Gaussian case / K. Carling // Computational Statistics &amp; Data Analysis. - 2000. - Vol. 33, no. 3. - P. 249-258.</mixed-citation><mixed-citation xml:lang="en">Carling K. Resistant outlier rules and the non-Gaussian case. Computational Statistics &amp; Data Analysis, 2000, vol. 33, no. 3, pp. 249-258.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Hubert, M. An adjusted boxplot for skewed distributions / M. Hubert, E. Vandervieren // Computational Statistics &amp; Data Analysis. - 2008. - Vol. 52, no. 12. - P. 5186-5201.</mixed-citation><mixed-citation xml:lang="en">Hubert M., Vandervieren E. An adjusted boxplot for skewed distributions. Computational Statistics &amp; Data Analysis, 2008, vol. 52, no. 12, pp. 5186-5201.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Brys, G. A robust measure of skewness / G. Brys, M. Hubert, A. Struyf // J. of Computational and Graphical Statistics. - 2004. - Vol. 13. - P. 996-1017.</mixed-citation><mixed-citation xml:lang="en">Brys G., Hubert M., Struyf A. A robust measure of skewness. Journal of Computational and Graphical Statistics, 2004, vol. 13, pp. 996-1017.</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Kyurkchiev, N. 
Sigmoid Functions: Some Approximation and Modelling Aspects / N. Kyurkchiev, S. Markov. - Saarbrucken : LAP Lambert Academic Publishing, 2015. - 120 p.</mixed-citation><mixed-citation xml:lang="en">Kyurkchiev N., Markov S. Sigmoid Functions: Some Approximation and Modelling Aspects. Saarbrucken, LAP Lambert Academic Publishing, 2015, 120 p.</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Флах, П. Машинное обучение. Наука и искусство построения алгоритмов, которые извлекают знания из данных / П. Флах. - М. : ДМК Пресс, 2015. - 402 с.</mixed-citation><mixed-citation xml:lang="en">Flach P. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. 1st ed. Cambridge University Press, 2012, 409 p.</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Bicego, M. Properties of the Box-Cox transformation for pattern classification / M. Bicego, S. Baldo // Neurocomputing. - 2016. - Vol. 218. - P. 390-400.</mixed-citation><mixed-citation xml:lang="en">Bicego M., Baldo S. Properties of the Box-Cox transformation for pattern classification. Neurocomputing, 2016, vol. 218, pp. 390-400.</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">Zhang, Q. Weighted data normalization based on eigenvalues for artificial neural network classification / Q. Zhang, S. Sun // Proc. of Intern. Conf. Neural Information Processing. - 2009. - Vol. 5863. - P. 349-356. https://doi.org/10.1007/978-3-642-10677-4_39</mixed-citation><mixed-citation xml:lang="en">Zhang Q., Sun S. Weighted data normalization based on eigenvalues for artificial neural network classification. Proceedings of International Conference Neural Information Processing, 2009, vol. 5863, pp. 349-356.
https://doi.org/10.1007/978-3-642-10677-4_39</mixed-citation></citation-alternatives></ref><ref id="cit18"><label>18</label><citation-alternatives><mixed-citation xml:lang="ru">Zadeh, L. A. Fuzzy sets / L. A. Zadeh // Information and Control. - 1965. - Vol. 8, no. 3. - P. 338-353.</mixed-citation><mixed-citation xml:lang="en">Zadeh L. A. Fuzzy sets. Information and Control, 1965, vol. 8, no. 3, pp. 338-353.</mixed-citation></citation-alternatives></ref><ref id="cit19"><label>19</label><citation-alternatives><mixed-citation xml:lang="ru">Więckowski, J. How the normalization of the decision matrix influences the results in the VIKOR method? / J. Więckowski, W. Salabun // Procedia Computer Science. - 2020. - Vol. 176. - P. 2222-2231.</mixed-citation><mixed-citation xml:lang="en">Więckowski J., Salabun W. How the normalization of the decision matrix influences the results in the VIKOR method? Procedia Computer Science, 2020, vol. 176, pp. 2222-2231.</mixed-citation></citation-alternatives></ref><ref id="cit20"><label>20</label><citation-alternatives><mixed-citation xml:lang="ru">Ioffe, S. Batch normalization: accelerating deep network training by reducing internal covariate shift / S. Ioffe, C. Szegedy // 32nd Intern. Conf. on Machine Learning, Lille, France, 7-9 July 2015. - Lille, 2015. - Vol. 37. - P. 448-456.</mixed-citation><mixed-citation xml:lang="en">Ioffe S., Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. 32nd International Conference on Machine Learning, Lille, France, 7-9 July 2015. Lille, 2015, vol. 37, pp. 448-456.</mixed-citation></citation-alternatives></ref><ref id="cit21"><label>21</label><citation-alternatives><mixed-citation xml:lang="ru">Do we need hundreds of classifiers to solve real world classification problems? / M. Fernandez-Delgado [et al.] // The J. of Machine Learning Research. - 2014. - Vol. 15, no. 1. - P.
3133-3181.</mixed-citation><mixed-citation xml:lang="en">Fernandez-Delgado M., Cernadas E., Barro S., Amorim D. Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 2014, vol. 15, no. 1, pp. 3133-3181.</mixed-citation></citation-alternatives></ref><ref id="cit22"><label>22</label><citation-alternatives><mixed-citation xml:lang="ru">Lemons, K. Comparison between Naive Bayes and random forest to predict breast cancer / K. A. Lemons // Intern. J. of Undergraduate Research &amp; Creative Activities. - 2020. - Vol. 12, art. 12. - P. 1-5. http://doi.org/10.7710/2168-0620.0287</mixed-citation><mixed-citation xml:lang="en">Lemons K. A. Comparison between Naive Bayes and random forest to predict breast cancer. International Journal of Undergraduate Research &amp; Creative Activities, 2020, vol. 12, art. 12, pp. 1-5. http://doi.org/10.7710/2168-0620.0287</mixed-citation></citation-alternatives></ref><ref id="cit23"><label>23</label><citation-alternatives><mixed-citation xml:lang="ru">Chicco, D. The benefits of the Matthews correlation coefficient (MCC) over the diagnostic odds ratio (DOR) in binary classification assessment / D. Chicco, V. Starovoitov, G. Jurman // IEEE Access. - 2021. - Vol. 9. - P. 47112-47124. https://doi.org/10.1109/ACCESS.2021.3068614</mixed-citation><mixed-citation xml:lang="en">Chicco D., Starovoitov V., Jurman G. The benefits of the Matthews correlation coefficient (MCC) over the diagnostic odds ratio (DOR) in binary classification assessment. IEEE Access, 2021, vol. 9, pp. 47112-47124. http://doi.org/10.1109/ACCESS.2021.3068614</mixed-citation></citation-alternatives></ref><ref id="cit24"><label>24</label><citation-alternatives><mixed-citation xml:lang="ru">Новиков, Д. А. Статистические методы в педагогических исследованиях (типовые случаи) / Д. А. Новиков. - М. : МЗ-Пресс, 2004. - 67 с.</mixed-citation><mixed-citation xml:lang="en">Novikov D. A.
Statisticheskie metody v pedagogicheskikh issledovaniyakh (tipovye sluchai) [Statistical Methods in Pedagogical Research (Typical Cases)]. Moscow, MZ-Press, 2004, 67 p. (In Russ.).</mixed-citation></citation-alternatives></ref><ref id="cit25"><label>25</label><citation-alternatives><mixed-citation xml:lang="ru">Cheddad, A. On Box-Cox transformation for image normality and pattern classification / A. Cheddad // IEEE Access. - 2020. - Vol. 8. - P. 154975-154983. https://doi.org/10.1109/ACCESS.2020.3018874</mixed-citation><mixed-citation xml:lang="en">Cheddad A. On Box-Cox transformation for image normality and pattern classification. IEEE Access, 2020, vol. 8, pp. 154975-154983. http://doi.org/10.1109/ACCESS.2020.3018874</mixed-citation></citation-alternatives></ref><ref id="cit26"><label>26</label><citation-alternatives><mixed-citation xml:lang="ru">Han, J. The influence of the sigmoid function parameters on the speed of backpropagation learning / J. Han, C. Moraga // Intern. Workshop on Artificial Neural Networks, Malaga-Torremolinos, Spain, 7-9 June 1995. - Malaga-Torremolinos, 1995. - P. 195-201.</mixed-citation><mixed-citation xml:lang="en">Han J., Moraga C. The influence of the sigmoid function parameters on the speed of backpropagation learning. International Workshop on Artificial Neural Networks, Malaga-Torremolinos, Spain, 7-9 June 1995. Malaga-Torremolinos, 1995, pp. 195-201.</mixed-citation></citation-alternatives></ref><ref id="cit27"><label>27</label><citation-alternatives><mixed-citation xml:lang="ru">Jain, A. Score normalization in multimodal biometric systems / A. Jain, K. Nandakumar, A. Ross // Pattern Recognition. - 2005. - Vol. 38, no. 12. - P. 2270-2285.</mixed-citation><mixed-citation xml:lang="en">Jain A., Nandakumar K., Ross A. Score normalization in multimodal biometric systems. Pattern Recognition, 2005, vol. 38, no. 12, pp.
2270-2285.</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare no conflict of interest.</p></fn></fn-group></back></article>
