Data normalization in machine learning
https://doi.org/10.37661/1816-0301-2021-18-3-83-96
Abstract
In machine learning, input features are often measured in different units and on different scales. A review of the scientific literature shows that source data described in different types of measurement scales and units should be converted into a single representation by normalization or standardization; the difference between these two operations is explained. The paper systematizes the basic operations defined on these scales, as well as the main variants of feature normalization functions. A new scale of parts is proposed, and examples of data normalization for correct analysis are given. The analysis of publications shows that there is no universal method of data normalization, but normalizing the source data generally improves classification accuracy. Data clustering by methods that use distance functions is best performed after converting all features to a single scale. The results of classification and clustering by different methods can be compared with different scoring functions, which often have different ranges of values; to select the most accurate function, it is reasonable to normalize several such functions and compare their estimates on a single scale. The splitting rules of tree-based classifiers are invariant to the scales of quantitative features, since only the comparison operation is used. Perhaps due to this property, the random forest classifier has been recognized in numerous experiments as one of the best classifiers for data of different nature.
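The distinction drawn in the abstract between normalization (a linear mapping onto a fixed range) and standardization (shifting to zero mean and unit variance, the z-score) can be sketched as follows. This is a minimal illustration, not the paper's own code; the function names and the sample data are illustrative assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: map values linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

def standardize(values):
    """Standardization (z-score): shift to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# A feature in its original units (here, hypothetical heights in cm).
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
print(min_max_normalize(heights_cm))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(heights_cm))        # zero mean, unit variance
```

Note that both transforms are monotone, which is why the comparison-based splitting rules of tree classifiers mentioned above are unaffected by them, while distance-based clustering methods are not.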
About the Authors
V. V. Starovoitov
Belarus
Valery V. Starovoitov - Dr. Sci. (Eng.), Professor, Chief Researcher.
St. Surganova, 6, Minsk, 220012
Yu. I. Golub
Belarus
Yuliya I. Golub - Cand. Sci. (Eng.), Associate Professor, Senior Researcher.
St. Surganova, 6, Minsk, 220012
For citations:
Starovoitov V.V., Golub Yu.I. Data normalization in machine learning. Informatics. 2021;18(3):83-96. (In Russ.) https://doi.org/10.37661/1816-0301-2021-18-3-83-96