Natural Language Processing (NLP) and Support Vector Machine (SVM) Optimization in Detecting Phishing Website URLs

Mhd Adi Setiawan Aritonang; Maradona Jonas Simanulang; Toras Pangidoan Batubara; Imanuel Zega; M Hafis Afrizal

doi:10.52436/1.jutif.2026.7.1.5334

Authors

Mhd Adi Setiawan Aritonang, Information Technology, Computer Engineering, Institut Teknologi Batam, Indonesia
Maradona Jonas Simanulang, Information Technology, Universitas Senior Medan, Indonesia
Toras Pangidoan Batubara, Information Systems, Universitas Murni Teguh, Indonesia
Imanuel Zega, Information Systems, Universitas Pignatelli Triputra, Indonesia
M Hafis Afrizal, Information Technology, Computer Engineering, Institut Teknologi Batam, Indonesia

Keywords:

Cybersecurity, Natural Language Processing, Phishing Detection, Support Vector Machine, URL Classification

Abstract

Phishing remains one of the most pervasive cyber-threats, with recent reports indicating a sharp rise in both the volume and the sophistication of attacks. According to the Anti-Phishing Working Group, phishing incidents reached nearly 1 million in Q4 2024. To address this evolving threat, this study develops an automated phishing-URL classification system based on Natural Language Processing (NLP) and Support Vector Machine (SVM). We utilised the Kaggle "PhiUSIIL Phishing URL Dataset", comprising 256,795 URL records, and applied comprehensive preprocessing, feature extraction (structural URL features plus NLP-based keyword analysis), and SVM training with grid search optimisation. The model was evaluated using a confusion matrix and the standard metrics of accuracy, precision, recall, and F1-score. The best model achieved an accuracy of 99.99%, a precision of 99.98%, a recall of 100%, and an F1-score of 99.99%. These results demonstrate that the combined NLP + SVM approach can distinguish phishing URLs from legitimate ones with very high reliability. The proposed system contributes to cybersecurity by offering a feasible AI-based solution for real-time URL screening that can be integrated into browser extensions or enterprise email filters to bolster phishing defences.
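
The feature extraction and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the study's implementation: the keyword list, the feature names, and the helper functions `extract_features`, `metrics_from_confusion`, and `f1_from` are assumptions introduced here for illustration, and the SVM training and grid search stage is not shown. The sketch derives a few structural URL features plus a simple keyword count, and computes the four reported metrics from confusion-matrix counts.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list; the vocabulary used in the study is not stated.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update", "bank")


def extract_features(url: str) -> dict:
    """Return illustrative structural and keyword features for one URL."""
    # Prepend a scheme when missing so urlparse fills in the hostname.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    # Split the lowercased URL into alphanumeric tokens for keyword matching.
    tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    return {
        "url_length": len(url),
        "num_dots": host.count("."),
        "num_digits": sum(c.isdigit() for c in url),
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parsed.scheme == "https",
        "keyword_hits": sum(t in SUSPICIOUS_KEYWORDS for t in tokens),
    }


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def metrics_from_confusion(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 with phishing as the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


if __name__ == "__main__":
    print(extract_features("http://192.168.0.1/secure-login/verify.php"))
    # Made-up counts, used only to exercise the formulas.
    print(metrics_from_confusion(tp=50, fp=1, fn=0, tn=49))
    # Consistency check on the reported precision (99.98%) and recall (100%).
    print(round(f1_from(0.9998, 1.0), 4))
```

As a consistency check, the harmonic mean of the reported precision (99.98%) and recall (100%) rounds to 99.99%, which agrees with the F1-score stated in the abstract. In a full pipeline, feature dictionaries of this kind would be converted to numeric vectors before being passed to an SVM.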
