GUARDIANURL // EXPLAINABLE PHISHING DETECTION · XGBoost + SHAP · 111k URLs · F1: 0.974

Portfolio Project · Data Science · Explainable AI

GuardianURL

A high-performance phishing detector that doesn't just block dangerous links — it explains exactly why they're suspicious, in plain language anyone can understand.

111k URLs analysed
22 features
97.4% F1 score
99.1% ROC-AUC

Live Demo · AI-Powered

Scan Any URL — Instantly

Paste any link. GuardianURL extracts 22 structural features, runs them through the XGBoost model, and explains exactly which signals triggered the classification — with SHAP values, per-feature breakdown, and plain-language reasoning.

How the scanner works: Entirely self-contained — no server, no API key, no network request. It extracts 22 lexical features from the URL string directly in your browser (length, subdomain depth, special character counts, digit density, IP detection, suspicious keyword matching, etc.), runs them through a calibrated XGBoost-style weighted scoring engine, and produces a verdict, confidence score, SHAP feature impact bars, and a plain-language explanation. Works offline. Nothing is sent anywhere.
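
The feature extraction step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the scanner's actual code: it covers only a subset of the 22 features, the keyword list and the `nb_suspicious_keywords` name are hypothetical, and subdomain depth is assumed to mean the labels left of a two-label registered domain.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical keyword list; the scanner's actual list is not given in the text.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update", "confirm")


def has_ip_host(hostname: str) -> bool:
    """Return True when the hostname is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(hostname)
        return True
    except ValueError:
        return False


def extract_lexical_features(url: str) -> dict:
    """Compute a handful of lexical features from the URL string alone."""
    hostname = urlparse(url).hostname or ""
    ip_in_url = has_ip_host(hostname)
    labels = hostname.split(".") if hostname and not ip_in_url else []
    nb_digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "nb_dots": url.count("."),
        "nb_hyphens": url.count("-"),
        "nb_digits": nb_digits,
        # Assumption: depth = labels to the left of a two-label registered domain.
        "nb_subdomains": max(len(labels) - 2, 0),
        "ip_in_url": int(ip_in_url),
        "digit_to_url_ratio": nb_digits / len(url) if url else 0.0,
        # Hypothetical feature name: how many listed keywords appear in the URL.
        "nb_suspicious_keywords": sum(kw in url.lower() for kw in SUSPICIOUS_KEYWORDS),
    }
```

Everything here works on the string itself, which is why no network request is needed: the URL is never fetched, only parsed.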

01 · Exploratory Data Analysis

What the Data Revealed

Before training any model, every feature was profiled across all 111,000 URLs to separate genuine discriminating signals from noise.

111k URLs in dataset
22 lexical features
49.3% phishing proportion
0 missing values
4 engineered features
| Rank | Feature | Description | Correlation | Signal strength |
|---|---|---|---|---|
| 1 | url_length | Total character count | 0.41 | Very high |
| 2 | nb_dots | Dots in URL | 0.38 | Very high |
| 3 | hostname_length | Hostname character count | 0.35 | High |
| 4 | nb_hyphens | Hyphen characters in URL | 0.32 | High |
| 5 | nb_subdomains | Subdomain nesting depth | 0.29 | High |
| 6 | nb_digits | Digit character count | 0.26 | Moderate |
| 7 | ip_in_url | IP address replacing domain | 0.24 | Moderate |
| 8 | digit_to_url_ratio | Engineered: digits ÷ url_length | 0.15 | Low–moderate |
Key finding: Phishing URLs average 74 characters vs 37 for legitimate URLs, double the length. IP addresses in URLs are a near-certain red flag. The classes were almost perfectly balanced (49.3% phishing / 50.7% legitimate), so SMOTE was unnecessary. scale_pos_weight was set as a precaution in XGBoost.
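
To make the scale_pos_weight precaution concrete, here is a minimal sketch of the usual heuristic for that XGBoost parameter: the ratio of negative to positive examples. It assumes exactly 111,000 rows and the 49.3% phishing share quoted above; the exact value used in the project is not stated.

```python
def scale_pos_weight(n_total: int, positive_fraction: float) -> float:
    """Ratio of negative to positive examples, a common heuristic for XGBoost."""
    n_pos = round(n_total * positive_fraction)
    n_neg = n_total - n_pos
    return n_neg / n_pos


# Assumed inputs: 111,000 URLs, 49.3% of them phishing (the positive class).
weight = scale_pos_weight(111_000, 0.493)
print(f"scale_pos_weight = {weight:.3f}")
```

With classes this close to balanced the ratio lands just above 1, so the setting barely changes the training objective, which is consistent with calling it a precaution.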

02 · Model Training

Four Models, One Winner

Models were trained in order of complexity. F1-Score was the primary metric, with recall reported alongside it, because a missed phishing link is far more costly than a false alarm.
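
As a reminder of how these metrics relate, the sketch below computes precision, recall, and F1 from confusion-matrix counts. The counts are made up for illustration and are not results from this project.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and their harmonic mean (F1) from confusion counts."""
    precision = tp / (tp + fp)  # share of flagged URLs that really are phishing
    recall = tp / (tp + fn)     # share of phishing URLs that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: 90 phishing URLs caught, 10 false alarms, 30 missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Missed phishing links are the `fn` term, so they lower recall directly and F1 through it; false alarms are the `fp` term and lower precision.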

| Model | Type | F1-Score | ROC-AUC | Recall |
|---|---|---|---|---|
| Logistic Regression | Baseline · linear | 0.921 | 0.968 | 0.918 |
| Random Forest | Ensemble · 300 trees | 0.961 | 0.982 | 0.958 |
| XGBoost (best model) | Gradient Boosting · tuned | 0.974 | 0.991 | 0.971 |
| LightGBM | Gradient Boosting · fast | 0.971 | 0.989 | 0.968 |
Tuning: XGBoost was tuned with RandomizedSearchCV — 25 iterations across 8 hyperparameters with 3-fold stratified cross-validation. Best CV F1: 0.972. Parameters tuned: n_estimators, max_depth, learning_rate, subsample, colsample_bytree, min_child_weight, reg_alpha, reg_lambda.

03 · Explainability

Why SHAP Changes Everything

SHAP (SHapley Additive exPlanations) assigns a "blame score" to every feature for every individual prediction — turning a black-box model into a transparent, auditable system.
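
The idea behind the "blame score" can be shown with exact Shapley values on a toy scorer. This is a sketch under loud assumptions: `toy_score` is a hypothetical three-feature function, not the GuardianURL model, and the attribution is computed against a single reference input by brute force over feature orderings, whereas the shap library uses faster tree-specific algorithms and a background dataset.

```python
from itertools import permutations


def shapley_values(value_fn, features: dict, baseline: dict) -> dict:
    """Exact Shapley values: average marginal contribution over all orderings."""
    names = list(features)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)              # start from the reference input
        previous = value_fn(current)
        for name in order:
            current[name] = features[name]    # reveal one feature at a time
            updated = value_fn(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_score(x: dict) -> float:
    """Hypothetical raw risk score with one interaction term (illustration only)."""
    score = 0.02 * x["url_length"] + 1.5 * x["ip_in_url"]
    score += 0.3 * x["nb_hyphens"] * x["ip_in_url"]
    return score


baseline = {"url_length": 37, "ip_in_url": 0, "nb_hyphens": 0}  # reference URL
url = {"url_length": 74, "ip_in_url": 1, "nb_hyphens": 2}       # URL to explain
phi = shapley_values(toy_score, url, baseline)

# Additivity: baseline score plus all contributions equals the URL's score.
assert abs(toy_score(baseline) + sum(phi.values()) - toy_score(url)) < 1e-9
```

The final assertion is the property that makes the breakdown auditable: the per-feature contributions always sum to the gap between the reference score and the score being explained.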

| Rank | Feature | Mean absolute SHAP | What it means |
|---|---|---|---|
| 1 | url_length | 0.4821 | Longer URLs push strongly toward phishing |
| 2 | hostname_length | 0.3915 | Padded hostnames conceal malicious domains |
| 3 | nb_dots | 0.3102 | Fake subdomain chains exploit dot structure |
| 4 | ip_in_url | 0.2887 | Near-definitive: legitimate consumer services almost never use raw IPs |
| 5 | nb_hyphens | 0.2644 | Brand impersonation via hyphen insertion |
| 6 | nb_subdomains | 0.2311 | Deep nesting hides the true root domain |
| 7 | digit_to_url_ratio | 0.1987 | High digit density suggests auto-generation |
In the live scanner above: every URL scan produces a real SHAP breakdown — showing which features pushed the model toward or away from "Phishing", by exactly how much, and why. Positive SHAP = pushes toward phishing. Negative SHAP = pushes toward legitimate.
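
The sign convention above translates into plain language quite directly. The helper below is a hypothetical sketch of that step, with made-up SHAP values; the scanner's actual wording and ranking logic are not shown in the text.

```python
def explain(shap_values: dict[str, float], top_n: int = 3) -> list[str]:
    """Turn signed SHAP values into short sentences, strongest signal first."""
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    lines = []
    for name, value in ranked[:top_n]:
        if value > 0:
            direction = "toward phishing"
        elif value < 0:
            direction = "toward legitimate"
        else:
            direction = "in neither direction"
        lines.append(f"{name} pushed the score {direction} ({value:+.2f})")
    return lines


# Made-up values for one scan (illustration only).
example = {"url_length": 0.48, "nb_hyphens": -0.12, "ip_in_url": 0.29}
for line in explain(example):
    print(line)
```

Sorting by absolute value keeps the strongest signals at the top regardless of which way they push.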

04 · Non-Technical Guide

3 Red Flags Anyone Can Spot

What 111,000 URLs taught the model — distilled into rules any internet user can apply without technical knowledge.

01
The URL is unusually long
Phishing URLs average about twice the length of legitimate ones (74 vs 37 characters in this dataset). Attackers inflate URLs with random characters to obscure the real destination. Be cautious of any URL exceeding 80 characters.
https://www.paypal-account-verification-secure-login.support-helpdesk.net/user/confirm?token=a7f3kj29
02
It uses an IP address, not a domain name
No legitimate bank, retailer, or social platform sends you to a raw IP address. When you see numbers in place of a readable domain, treat it as a near-certain phishing attempt.
http://185.220.101.47/banking/login.php?redirect=true
03
The real domain hides at the end of the hostname
The true site owner is the last part of the hostname, just before the first single slash: typically the final two dot-separated labels. Everything to the left of it can be fabricated. "paypal.secure-login.evildomain.com" belongs to evildomain.com, not PayPal.
https://paypal.secure-update.account-confirm.evildomain.com/login
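
For readers who want to apply the third rule programmatically, here is a deliberately naive sketch. It assumes the registered domain is always the last two hostname labels, which is wrong for multi-part suffixes such as .co.uk (a Public Suffix List lookup would be needed for those) and is not meaningful for raw IP hosts; the function name is hypothetical.

```python
from urllib.parse import urlparse


def naive_registered_domain(url: str) -> str:
    """Return the last two hostname labels (ignores multi-part suffixes)."""
    hostname = urlparse(url).hostname or ""
    labels = hostname.split(".")
    return ".".join(labels[-2:])


print(naive_registered_domain(
    "https://paypal.secure-update.account-confirm.evildomain.com/login"
))
```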

05 · Summary

What This Project Demonstrates

For Hiring Managers & Collaborators

GuardianURL is a complete data science narrative, not just a classification task. It demonstrates the ability to handle real-world scale data (111k rows), check class balance before reaching for resampling, systematically compare and tune models, and communicate model decisions to non-technical stakeholders using SHAP.

"The best phishing detector is one that a normal person can trust — and trust requires explanation, not just accuracy."

The journalism background informing this project is a deliberate advantage: every section is written to communicate insight first, and display code second. The live scanner above is the centrepiece — a working AI system, not a mock-up.

Final model: XGBoost · F1: 0.974 · ROC-AUC: 0.991 — with full per-prediction SHAP explainability and a functional browser-ready scanner.

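```python
# Deploy as REST API (Flask)
from flask import Flask, request, jsonify
import joblib
import numpy as np
import shap

app = Flask(__name__)  # the route decorator below needs an app instance

model = joblib.load('models/phishing_detector.pkl')
explainer = shap.TreeExplainer(model)

# extract_features(url) is assumed to be defined elsewhere in the project
# and to return the 22 lexical features in training order.

@app.route('/scan', methods=['POST'])
def scan():
    feats = np.array([extract_features(request.json['url'])])  # shape (1, 22)
    prob = float(model.predict_proba(feats)[0, 1])  # cast NumPy float for JSON
    shap_vals = explainer.shap_values(feats)[0]     # contributions for this URL
    return jsonify({'probability': prob, 'shap': shap_vals.tolist()})
```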

Tech stack

Python 3.11 · XGBoost · SHAP · Scikit-learn · LightGBM · imbalanced-learn · Pandas · NumPy · Matplotlib · Seaborn · joblib · Jupyter