inform

Информатика

Informatics

1816-0301

2617-6963

UIIP NASB

10.37661/1816-0301-2025-22-3-83-94

inform-1357

Research Article

ЗАЩИТА ИНФОРМАЦИИ И НАДЕЖНОСТЬ СИСТЕМ

INFORMATION PROTECTION AND SYSTEM RELIABILITY

Программный модуль для детектирования мошеннических веб-сайтов с использованием классификации на основе методов машинного обучения

Software module for detecting fraudulent websites using classification based on machine learning methods

Петров

С. Н.

Petrov

S. N.

Петров Сергей Николаевич - кандидат технических наук, доцент, доцент кафедры защиты информации, факультет инфокоммуникаций.

ул. П. Бровки, 6, Минск, 220013

https://www.elibrary.ru/author_profile.asp?authorid=1088896

Sergei N. Petrov - Ph. D. (Eng.), Assoc. Prof., Assoc. Prof. of the Information Security Department, Faculty of Infocommunications, Belarusian State University of Informatics and Radioelectronics.

P. Brovki St., 6, Minsk, 220013

sergpetrov@inbox.ru

Мяделец

А. О.

Myadelets

A. O.

Мяделец Артем Олегович – учащийся.

ул. Франциска Скорины, 25/3, Минск, 220076

Artyom O. Myadelets - Student, National Children’s Technopark.

Francis Skorina st., 25/3, Minsk, 220076

artemmuadzelets@gmail.com

Кундас

Е. В.

Kundas

E. V.

Кундас Елизавета Владимировна – учащийся.

ул. Франциска Скорины, 25/3, Минск, 220076

Elizaveta V. Kundas - Student, National Children’s Technopark.

Francis Skorina st., 25/3, Minsk, 220076

kundaselizaveta@gmail.com

Белорусский государственный университет информатики и радиоэлектроники

Belarusian State University of Informatics and Radioelectronics

Национальный детский технопарк

National Children’s Technopark

2025

10.10.2025

Vol. 22, no. 3, pp. 83–94

2025

Петров С.Н., Мяделец А.О., Кундас Е.В.

Petrov S.N., Myadelets A.O., Kundas E.V.

Данная работа распространяется под лицензией Creative Commons Attribution 4.0.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://inf.grid.by/jour/article/view/1357

Ц е л и. Целью исследования является разработка программного модуля для автоматического выявления фишинговых веб-сайтов с использованием алгоритмов машинного обучения для классификации сайтов.

М е т о д ы. Для достижения поставленной цели проведен анализ существующих датасетов, содержащих URL-адреса фишинговых сайтов, а также изучены датасеты для обработки естественного языка. Это позволило определить ключевые признаки, характерные для мошеннических ресурсов. Были созданы два набора данных (размерами 18,9 Мб и 1,08 Гб), включающих признаки URL и текстовое наполнение веб-страниц, с использованием разработанного парсера. Для классификации веб-ресурсов применялись алгоритмы машинного обучения, такие как SVM, Random Forest, Logistic Regression и Multilayer Perceptron (MLP). Также изучены возможности использования языковой модели TinyBERT для анализа текстового содержимого.

Р е з у л ь т а т ы. По результатам проведенных исследований для работы с URL использована модель MLP (F1-score 99,3 %), а для анализа текстовой части веб-ресурса – модель TinyBERT (F1-score 95 %). Разработан программный модуль для выявления мошеннических веб-сайтов, состоящий из серверной части и браузерного расширения. Расширение собирает данные с веб-ресурса, передает их на сервер, где они анализируются обученными моделями машинного обучения. На сервере рассчитывается вероятность фишинговой активности, а результаты отображаются пользователю через интерфейс расширения. Реализация выполнена с использованием стека технологий Python 3.12, Flask, Pickle, Langdetect, Re и NLTK, а также JavaScript и Google Chrome API.

З а к л ю ч е н и е. Разработанный программный модуль был протестирован и продемонстрировал высокую эффективность в задачах классификации фишинговых сайтов. Теоретическая значимость работы заключается в применении современных алгоритмов машинного обучения для анализа текстового контента и URL. Практическая значимость заключается в создании готового решения для выявления фишинговых сайтов в реальном времени.

Objectives. Phishing web resources are among the most common tools of online fraud aimed at obtaining users' confidential information. The goal of this research was to develop a software module that automatically detects phishing websites using machine learning methods.

Methods. Existing datasets of phishing website URLs and datasets for natural language processing (NLP) were analyzed to identify the key features characteristic of fraudulent resources. Two datasets (18.9 MB and 1.08 GB) containing URL features and web page text were then built with a custom parser. The SVM, Random Forest, Logistic Regression, and Multilayer Perceptron (MLP) algorithms were applied to classify web resources, and the TinyBERT language model was evaluated for analyzing textual content.
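As a rough illustration of what URL attributes can look like in practice, the sketch below derives a few lexical features from a URL string using only the Python standard library. The function name `url_features` and the particular features are assumptions chosen for illustration; the abstract does not list the attributes the authors actually extracted.

```python
import re
from urllib.parse import urlparse

# Hypothetical feature set: the source does not say which URL attributes were used.
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_features(url: str) -> dict[str, float]:
    """Return a small dictionary of lexical features for one URL."""
    # urlparse only recognises the host when a scheme is present,
    # so a default scheme is prepended for bare host/path strings.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": float(len(url)),            # length of the string as given
        "host_length": float(len(host)),
        "dot_count": float(host.count(".")),      # many dots suggest nested subdomains
        "hyphen_count": float(host.count("-")),
        "digit_count": float(sum(ch.isdigit() for ch in url)),
        "has_at_symbol": float("@" in url),
        "host_is_ip": float(bool(IPV4_HOST.match(host))),
        "uses_https": float(parsed.scheme == "https"),
    }
```

A vector built this way could be passed to any of the classifiers named above; which features carry real signal would have to be checked against the datasets themselves.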

Results. The MLP model performed best for URL classification (F1-score 99.3%), and the TinyBERT model performed best for analyzing web page text (F1-score 95%). A software module was developed that consists of a server-side application and a browser extension. The extension collects data from the web resource and sends it to the server, where the trained machine learning models analyze it. The server calculates the probability of phishing activity, and the result is shown to the user through the extension's interface. The implementation uses Python 3.12, Flask, Pickle, Langdetect, Re, and NLTK, as well as JavaScript and the Google Chrome API.
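The server-side step, in which two trained models score a page and a single probability is reported back, might be sketched as follows. This is a minimal sketch under stated assumptions: the names `assess_page` and `Verdict`, the 0.5 threshold, and the rule of taking the larger of the two probabilities are all hypothetical, since the abstract does not say how the URL and text scores are combined. The models are represented as plain callables so the example runs without any trained artifacts.

```python
from dataclasses import dataclass
from typing import Callable

# A scorer maps an input string to a phishing probability in [0, 1].
Scorer = Callable[[str], float]


@dataclass
class Verdict:
    url_probability: float
    text_probability: float
    combined: float
    is_phishing: bool


def assess_page(url: str, text: str, url_model: Scorer, text_model: Scorer,
                threshold: float = 0.5) -> Verdict:
    """Score one page with both models and derive a single verdict."""
    p_url = url_model(url)
    p_text = text_model(text)
    for p in (p_url, p_text):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    # Assumption: the page is treated as suspicious if either model is confident.
    combined = max(p_url, p_text)
    return Verdict(p_url, p_text, combined, combined >= threshold)


if __name__ == "__main__":
    # Stand-in scorers; a real deployment would wrap the trained models here.
    demo = assess_page("http://example.test/login", "Verify your account now",
                       url_model=lambda u: 0.9, text_model=lambda t: 0.2)
    print(demo)
```

Taking the maximum favors catching phishing pages over avoiding false alarms; a weighted average or a second-stage classifier would be equally plausible designs.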

Conclusion. The developed software module was tested and proved highly effective at classifying phishing websites. The theoretical significance of the work lies in applying modern machine learning algorithms to the analysis of textual content and URLs. Its practical significance is a ready-to-use solution for detecting phishing sites in real time.

фишинговые сайты, мошенничество, машинное обучение, классификация, обработка естественного языка, датасеты

phishing websites, fraud, machine learning, classification, natural language processing, datasets

The authors declare no conflicts of interest.