
Informatics

ISSN 1816-0301; 2617-6963

UIIP NASB

10.37661/1816-0301-2021-18-3-83-96


Research Article

SIGNAL, IMAGE, SPEECH, TEXT PROCESSING AND PATTERN RECOGNITION

Data normalization in machine learning

Starovoitov V. V.

Valery V. Starovoitov, Dr. Sci. (Eng.), Professor, Chief Researcher.

6 Surganova St., Minsk, 220012

valerystar@mail.ru

Golub Yu. I.

Yuliya I. Golub, Cand. Sci. (Eng.), Associate Professor, Senior Researcher.

6 Surganova St., Minsk, 220012

6423506@gmail.com

The United Institute of Informatics Problems, National Academy of Sciences of Belarus

2021, vol. 18, no. 3, pp. 83-96 (30.09.2021)

© 2021 Starovoitov V.V., Golub Y.I.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://inf.grid.by/jour/article/view/1156

In machine learning problems, the input data are often given in different units of measurement and on different types of scales. Such data should be converted into a single representation by normalization or standardization. The paper shows the difference between these two operations. It systematizes the main types of scales, the operations on data represented in these scales, and the main variants of function normalization. A new scale of parts is proposed, and examples are given of how data normalization supports a more correct analysis.

A review of publications shows that no universal normalization method currently outperforms all others, but normalizing the input data makes it possible to increase classification accuracy. Clustering with methods that rely on distance functions is better performed after all features have been converted to a single scale.

The results of classification and clustering by different methods can be compared using various scoring functions, which often have different ranges of values. To select the most accurate function, several of them can be normalized and their estimates compared on a single scale.

The splitting rules of tree-based classifiers are invariant to the scales of quantitative features, because they use only the comparison operation. Possibly owing to this property, the random forest classifier has been recognized in numerous experiments as one of the best classifiers for data of different nature.
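The points made in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper. It assumes that "normalization" means min-max rescaling to [0, 1] and "standardization" means the z-score transform; the function names and the three data objects are hypothetical. The sketch shows that a Euclidean distance on raw data is dominated by the feature with the largest numeric range, and that the ordering of feature values, which is what a threshold comparison relies on, is unchanged by either transform.

```python
import math
import statistics


def min_max_normalize(values):
    """Linearly rescale values to [0, 1] (assumed meaning of 'normalization')."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and unit population std (assumed 'standardization')."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def nearest_neighbor(points, i):
    """Index of the point closest to points[i] in Euclidean distance."""
    others = [j for j in range(len(points)) if j != i]
    return min(others, key=lambda j: math.dist(points[i], points[j]))


def rank_order(values):
    """Indices of values in ascending order; a threshold split on one
    feature can only separate objects along this ordering."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Hypothetical objects measured in metres and grams.
heights_m = [1.60, 1.62, 1.95]
masses_g = [70000.0, 66000.0, 70500.0]

raw = list(zip(heights_m, masses_g))
scaled = list(zip(min_max_normalize(heights_m), min_max_normalize(masses_g)))

# Nearest neighbour of object 0 before and after rescaling.
print(nearest_neighbor(raw, 0), nearest_neighbor(scaled, 0))

# The ordering of a feature survives both transforms.
print(rank_order(masses_g) == rank_order(min_max_normalize(masses_g)))
print(rank_order(masses_g) == rank_order(standardize(masses_g)))
```

With these made-up values the nearest neighbour of object 0 differs between the raw and the rescaled data, because in the raw data the mass in grams outweighs the height in metres. The ordering checks hold because both transforms are strictly increasing for a non-constant feature.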

object classification, clustering, data normalization, function normalization, sigmoid, hyperbolic tangent, random forest

This work was carried out in part within BRFFR projects F20RA-014 and F21PAKG-001.


The authors declare no conflicts of interest.