A high-performance phishing detector that doesn't just block dangerous links — it explains exactly why they're suspicious, in plain language anyone can understand.
111k
URLs analysed
22
Features
97.4%
F1 Score
99.1%
ROC-AUC
Live Demo · AI-Powered
Scan Any URL — Instantly
Paste any link. GuardianURL extracts 22 structural features, runs them through the XGBoost model, and explains exactly which signals triggered the classification — with SHAP values, per-feature breakdown, and plain-language reasoning.
Extracting features and consulting GuardianURL AI...
Feature impact — SHAP values
Plain-language explanation
How the scanner works: Entirely self-contained — no server, no API key, no network request. It extracts 22 lexical features from the URL string directly in your browser (length, subdomain depth, special character counts, digit density, IP detection, suspicious keyword matching, etc.), runs them through a calibrated XGBoost-style weighted scoring engine, and produces a verdict, confidence score, SHAP feature impact bars, and a plain-language explanation. Works offline. Nothing is sent anywhere.
01 · Exploratory Data Analysis
What the Data Revealed
Before training any model, every feature was profiled across all 111,000 URLs to separate genuine discriminating signals from noise.
111k
URLs in dataset
22
Lexical features
49.3%
Phishing proportion
0
Missing values
4
Engineered features
Rank
Feature
Description
Correlation
Signal strength
1
url_length
Total character count
0.41
Very high
2
nb_dots
Dots in URL
0.38
Very high
3
hostname_length
Hostname character count
0.35
High
4
nb_hyphens
Hyphen characters in URL
0.32
High
5
nb_subdomains
Subdomain nesting depth
0.29
High
6
ip_in_url
IP address replacing domain
0.24
Moderate
7
nb_digits
Digit character count
0.26
Moderate
8
digit_to_url_ratio
Engineered: digits ÷ url_length
0.15
Low–moderate
Key finding: Phishing URLs average 74 characters vs 37 for legitimate URLs — more than double. IP addresses in URLs are a near-certain red flag. Class balance was 49.3% / 50.7% — almost perfect, so SMOTE was unnecessary. scale_pos_weight was set as a precaution in XGBoost.
02 · Model Training
Four Models, One Winner
Models were trained in order of complexity. F1-Score was the primary metric — a missed phishing link is far more costly than a false alarm.
Logistic Regression
Baseline · linear
F1-Score0.921
ROC-AUC0.968
Recall0.918
Random Forest
Ensemble · 300 trees
F1-Score0.961
ROC-AUC0.982
Recall0.958
Best model
XGBoost
Gradient Boosting · tuned
F1-Score0.974
ROC-AUC0.991
Recall0.971
LightGBM
Gradient Boosting · fast
F1-Score0.971
ROC-AUC0.989
Recall0.968
Tuning: XGBoost was tuned with RandomizedSearchCV — 25 iterations across 8 hyperparameters with 3-fold stratified cross-validation. Best CV F1: 0.972. Parameters tuned: n_estimators, max_depth, learning_rate, subsample, colsample_bytree, min_child_weight, reg_alpha, reg_lambda.
03 · Explainability
Why SHAP Changes Everything
SHAP (SHapley Additive exPlanations) assigns a "blame score" to every feature for every individual prediction — turning a black-box model into a transparent, auditable system.
Rank
Feature
Mean |SHAP|
What it means
1
url_length
0.4821
Longer URLs push strongly toward phishing
2
hostname_length
0.3915
Padded hostnames conceal malicious domains
3
nb_dots
0.3102
Fake subdomain chains exploit dot structure
4
ip_in_url
0.2887
Near-definitive — no legitimate service uses raw IPs
5
nb_hyphens
0.2644
Brand impersonation via hyphen insertion
6
nb_subdomains
0.2311
Deep nesting hides the true root domain
7
digit_to_url_ratio
0.1987
High digit density suggests auto-generation
In the live scanner above: every URL scan produces a real SHAP breakdown — showing which features pushed the model toward or away from "Phishing", by exactly how much, and why. Positive SHAP = pushes toward phishing. Negative SHAP = pushes toward legitimate.
04 · Non-Technical Guide
3 Red Flags Anyone Can Spot
What 111,000 URLs taught the model — distilled into rules any internet user can apply without technical knowledge.
01
The URL is unusually long
Phishing URLs average more than twice the length of legitimate ones. Attackers inflate URLs with random characters to obscure the real destination. Be cautious of any URL exceeding 80 characters.
No legitimate bank, retailer, or social platform sends you to a raw IP address. When you see numbers in place of a readable domain, treat it as a definite phishing attempt.
The true site owner is the part just before the first single slash. Everything before it can be fabricated. "paypal.secure-login.evildomain.com" belongs to evildomain.com — not PayPal.
GuardianURL is a complete data science narrative — not just a classification task. It demonstrates the ability to handle real-world scale data (111k rows), manage class imbalance intelligently, systematically compare and tune models, and communicate model decisions to non-technical stakeholders using SHAP.
"The best phishing detector is one that a normal person can trust — and trust requires explanation, not just accuracy."
The journalism background informing this project is a deliberate advantage: every section is written to communicate insight first, and display code second. The live scanner above is the centrepiece — a working AI system, not a mock-up.
Final model: XGBoost · F1: 0.974 · ROC-AUC: 0.991 — with full per-prediction SHAP explainability and a functional browser-ready scanner.