Informatics

Software module for detecting fraudulent websites using classification based on machine learning methods

https://doi.org/10.37661/1816-0301-2025-22-3-83-94

Abstract

Objectives. Phishing web resources are among the most common tools of online fraud aimed at obtaining users' confidential information. The goal of this research was to develop a software module for the automatic detection of phishing websites using machine learning methods.

Methods. Existing datasets of phishing website URLs and datasets for natural language processing (NLP) were analyzed, which made it possible to identify key features characteristic of fraudulent resources. Two datasets (18.9 MB and 1.08 GB) incorporating URL attributes and web page content were then built with a custom-developed parser. The machine learning algorithms SVM, Random Forest, Logistic Regression, and Multilayer Perceptron (MLP) were applied to website classification, and the potential of the TinyBERT language model for analyzing textual content was also explored.
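As an illustration of what "URL attributes" can look like in practice, the following is a minimal sketch of lexical feature extraction from a URL. The abstract does not list the attributes the authors actually used, so the feature names and the helper `url_features` are hypothetical and chosen only because they are commonly used in phishing-URL work.

```python
import re
from urllib.parse import urlparse

# Assumption: this feature set is illustrative only; the attributes used
# in the paper's datasets are not enumerated in the abstract.
IP_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_features(url: str) -> dict[str, float]:
    """Turn a URL into a small dictionary of numeric lexical features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": float(len(url)),
        "host_length": float(len(host)),
        "num_dots": float(host.count(".")),
        "num_hyphens": float(host.count("-")),
        "num_digits": float(sum(ch.isdigit() for ch in url)),
        "has_at_symbol": float("@" in url),
        "host_is_ip": float(bool(IP_HOST.match(host))),
        "uses_https": float(parsed.scheme == "https"),
    }
```

A feature dictionary of this kind could be converted to a vector and passed to any of the classifiers named above; how the authors encoded their features is not stated here.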

Results. The MLP model demonstrated the best performance for URL classification, while the TinyBERT model performed best on textual content. A software module was developed, consisting of a server-side application and a browser extension. The extension collects data from web resources and transmits it to the server, where trained machine learning models analyze it. The server calculates the likelihood of phishing activity, and the result is displayed to the user via the extension's interface. The implementation used Python 3.12, Flask, Pickle, Langdetect, Re, NLTK, JavaScript, and the Google Chrome API.
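The server-side step described above (receive page data from the extension, run the trained models, return a phishing likelihood) might be sketched as follows. This is a framework-agnostic outline, not the authors' implementation: the JSON field names, the function `score_page`, the 0.5 threshold, and the rule of taking the larger of the two model scores are all assumptions, since the abstract does not say how the URL and text predictions are combined. The models are passed in as plain callables so that stubs can stand in for the trained MLP and TinyBERT classifiers.

```python
import json
from typing import Callable

# A scorer maps a string (URL or page text) to a phishing probability in [0, 1].
Scorer = Callable[[str], float]


def score_page(payload: str, url_model: Scorer, text_model: Scorer,
               threshold: float = 0.5) -> str:
    """Score one page from a JSON payload and return a JSON response."""
    data = json.loads(payload)
    url_p = url_model(data["url"])

    # Only run the text model when the extension actually sent page text.
    text = data.get("text", "")
    text_p = text_model(text) if text.strip() else None

    # Assumption: combine by taking the maximum of the available scores.
    scores = [p for p in (url_p, text_p) if p is not None]
    probability = max(scores)

    return json.dumps({
        "url_probability": url_p,
        "text_probability": text_p,
        "probability": probability,
        "is_phishing": probability >= threshold,
    })
```

In a deployment like the one described, a function of this shape would sit behind a web endpoint that the browser extension calls, with the extension rendering the returned probability in its interface.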

Conclusion. The developed software module was tested and demonstrated high efficiency in phishing website classification tasks. The theoretical significance of the work lies in applying modern machine learning algorithms for analyzing textual content and URLs. The practical significance is reflected in the creation of a ready-to-use solution for real-time phishing site detection.

About the Authors

S. N. Petrov
Belarusian State University of Informatics and Radioelectronics
Belarus

Sergei N. Petrov - Ph. D. (Eng.), Assoc. Prof., Assoc. Prof. of the Information Security Department, Faculty of Infocommunications, Belarusian State University of Informatics and Radioelectronics.

P. Brovki st., 6, Minsk, 220013

https://www.elibrary.ru/author_profile.asp?authorid=1088896



A. O. Myadelets
National Children’s Technopark
Belarus

Artyom O. Myadelets - Student, National Children’s Technopark.

Francis Skorina st., 25/3, Minsk, 220076



E. V. Kundas
National Children’s Technopark
Belarus

Elizaveta V. Kundas - Student, National Children’s Technopark.

Francis Skorina st., 25/3, Minsk, 220076



For citations:


Petrov S.N., Myadelets A.O., Kundas E.V. Software module for detecting fraudulent websites using classification based on machine learning methods. Informatics. 2025;22(3):83-94. (In Russ.) https://doi.org/10.37661/1816-0301-2025-22-3-83-94


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1816-0301 (Print)
ISSN 2617-6963 (Online)