Portfolio Project · Data Science · Explainable AI

GuardianURL

A high-performance phishing detector that doesn't just block dangerous links — it explains exactly why they're suspicious, in plain language anyone can understand.

111k

URLs analysed

Features

97.4%

F1 Score

99.1%

ROC-AUC

Live Demo · AI-Powered

Scan Any URL — Instantly

Paste any link. GuardianURL extracts 22 structural features, runs them through the XGBoost model, and explains exactly which signals triggered the classification — with SHAP values, per-feature breakdown, and plain-language reasoning.

guardianurl — phishing scanner v1.0 · claude-powered

How the scanner works: Entirely self-contained — no server, no API key, no network request. It extracts 22 lexical features from the URL string directly in your browser (length, subdomain depth, special character counts, digit density, IP detection, suspicious keyword matching, etc.), runs them through a calibrated XGBoost-style weighted scoring engine, and produces a verdict, confidence score, SHAP feature impact bars, and a plain-language explanation. Works offline. Nothing is sent anywhere.
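The extraction step described above can be sketched in a few lines. This is a minimal illustration, not the scanner's actual code: it covers only eight of the 22 features, the function name `extract_lexical_features` is hypothetical, and the subdomain count is a naive approximation that ignores multi-part suffixes such as `co.uk`.

```python
import ipaddress
from urllib.parse import urlparse


def extract_lexical_features(url: str) -> dict:
    """Compute a handful of lexical features from the raw URL string."""
    hostname = urlparse(url).hostname or ""

    # ip_in_url: 1 when the hostname parses as an IPv4 or IPv6 address.
    try:
        ipaddress.ip_address(hostname)
        ip_in_url = 1
    except ValueError:
        ip_in_url = 0

    # Naive subdomain depth: labels to the left of the last two.
    # Assumption: a two-label registered domain (ignores suffixes like co.uk).
    labels = hostname.split(".") if hostname and not ip_in_url else []
    nb_subdomains = max(len(labels) - 2, 0)

    nb_digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "nb_dots": url.count("."),
        "nb_hyphens": url.count("-"),
        "nb_subdomains": nb_subdomains,
        "ip_in_url": ip_in_url,
        "nb_digits": nb_digits,
        "digit_to_url_ratio": nb_digits / len(url) if url else 0.0,
    }


features = extract_lexical_features(
    "http://185.220.101.47/banking/login.php?redirect=true"
)
print(features)
```

Everything here is plain string handling, which is why this kind of extraction can run without any network request.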

01 · Exploratory Data Analysis

What the Data Revealed

Before training any model, every feature was profiled across all 111,000 URLs to separate genuine discriminating signals from noise.

111k

URLs in dataset

Lexical features

49.3%

Phishing proportion

Missing values

Engineered features

Rank	Feature	Description	Correlation	Signal strength
1	url_length	Total character count	0.41	Very high
2	nb_dots	Dots in URL	0.38	Very high
3	hostname_length	Hostname character count	0.35	High
4	nb_hyphens	Hyphen characters in URL	0.32	High
5	nb_subdomains	Subdomain nesting depth	0.29	High
6	nb_digits	Digit character count	0.26	Moderate
7	ip_in_url	IP address replacing domain	0.24	Moderate
8	digit_to_url_ratio	Engineered: digits ÷ url_length	0.15	Low–moderate

Key finding: Phishing URLs average 74 characters vs 37 for legitimate URLs — more than double. IP addresses in URLs are a near-certain red flag. Class balance was 49.3% / 50.7% — almost perfect, so SMOTE was unnecessary. scale_pos_weight was set as a precaution in XGBoost.
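For reference, the usual heuristic for XGBoost's `scale_pos_weight` is the ratio of negative to positive examples. The sketch below applies it to the rounded figures quoted above (111,000 rows, 49.3% phishing), so the counts are approximations rather than the project's exact numbers.

```python
def scale_pos_weight(n_negative: int, n_positive: int) -> float:
    """Common heuristic: number of negatives divided by number of positives."""
    return n_negative / n_positive


# Approximate counts reconstructed from the rounded figures on this page.
n_total = 111_000
n_phishing = round(n_total * 0.493)  # positive class
n_legit = n_total - n_phishing       # negative class

weight = scale_pos_weight(n_legit, n_phishing)
print(f"scale_pos_weight = {weight:.3f}")
```

A value this close to 1 barely changes training, which is consistent with treating it as a precaution rather than a correction.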

02 · Model Training

Four Models, One Winner

Models were trained in order of increasing complexity. F1-score was the primary metric because it balances precision and recall; recall is reported alongside it, since a missed phishing link is far more costly than a false alarm.
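As a reminder of what the metric measures, F1 is the harmonic mean of precision and recall, so a model that misses many phishing links is penalised even when its alarms are mostly correct. A minimal sketch with hypothetical confusion-matrix counts (not taken from this project):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 90 phishing URLs caught, 10 false alarms, 30 missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```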

Logistic Regression

Baseline · linear

F1-Score: 0.921

ROC-AUC: 0.968

Recall: 0.918

Random Forest

Ensemble · 300 trees

F1-Score: 0.961

ROC-AUC: 0.982

Recall: 0.958

Best model

XGBoost

Gradient Boosting · tuned

F1-Score: 0.974

ROC-AUC: 0.991

Recall: 0.971

LightGBM

Gradient Boosting · fast

F1-Score: 0.971

ROC-AUC: 0.989

Recall: 0.968

Tuning: XGBoost was tuned with RandomizedSearchCV — 25 iterations across 8 hyperparameters with 3-fold stratified cross-validation. Best CV F1: 0.972. Parameters tuned: n_estimators, max_depth, learning_rate, subsample, colsample_bytree, min_child_weight, reg_alpha, reg_lambda.
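A search of this shape could be set up roughly as follows. The eight parameter names, 25 iterations, 3 stratified folds, and F1 scoring come from the description above; the candidate values, the `random_state`, and the `build_search` helper are assumptions made for illustration, since the actual search ranges are not stated.

```python
# Hypothetical candidate values: the write-up names the parameters, not their ranges.
PARAM_DISTRIBUTIONS = {
    "n_estimators": [200, 400, 600, 800],
    "max_depth": [3, 4, 5, 6, 8],
    "learning_rate": [0.01, 0.03, 0.05, 0.1, 0.2],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "min_child_weight": [1, 3, 5],
    "reg_alpha": [0.0, 0.1, 1.0],
    "reg_lambda": [0.5, 1.0, 2.0],
}


def build_search(random_state: int = 42):
    """Return an unfitted RandomizedSearchCV matching the described setup."""
    # Imported lazily so the search space can be inspected without
    # scikit-learn or xgboost installed.
    from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
    from xgboost import XGBClassifier

    return RandomizedSearchCV(
        estimator=XGBClassifier(eval_metric="logloss", random_state=random_state),
        param_distributions=PARAM_DISTRIBUTIONS,
        n_iter=25,      # 25 sampled configurations
        scoring="f1",   # primary metric
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=random_state),
        random_state=random_state,
        n_jobs=-1,
    )
```

Typical usage would be `search = build_search()`, then `search.fit(X_train, y_train)`, after which `search.best_score_` holds the best cross-validated F1 and `search.best_params_` the winning configuration.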

03 · Explainability

Why SHAP Changes Everything

SHAP (SHapley Additive exPlanations) assigns a "blame score" to every feature for every individual prediction — turning a black-box model into a transparent, auditable system.

Rank	Feature	Mean |SHAP|	What it means
1	url_length	0.4821	Longer URLs push strongly toward phishing
2	hostname_length	0.3915	Padded hostnames conceal malicious domains
3	nb_dots	0.3102	Fake subdomain chains exploit dot structure
4	ip_in_url	0.2887	Near-definitive: legitimate consumer services almost never link to raw IPs
5	nb_hyphens	0.2644	Brand impersonation via hyphen insertion
6	nb_subdomains	0.2311	Deep nesting hides the true root domain
7	digit_to_url_ratio	0.1987	High digit density suggests auto-generation
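The Mean |SHAP| column is a global summary: take each feature's SHAP value on every row, drop the sign, and average. A minimal sketch with made-up SHAP rows (three features, three predictions) and a hypothetical helper name:

```python
def mean_abs_shap(shap_rows: list, feature_names: list) -> list:
    """Rank features by the mean absolute SHAP value across all rows."""
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for i, value in enumerate(row):
            totals[i] += abs(value)  # sign is dropped so pushes in both directions count
    n_rows = len(shap_rows)
    importance = {name: total / n_rows for name, total in zip(feature_names, totals)}
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


feature_names = ["url_length", "ip_in_url", "nb_hyphens"]
# Made-up per-prediction SHAP values, one row per scanned URL.
shap_rows = [
    [0.60, 0.00, -0.10],
    [-0.40, 0.00, 0.20],
    [0.50, 0.90, 0.30],
]

for name, score in mean_abs_shap(shap_rows, feature_names):
    print(f"{name}: {score:.2f}")
```

Taking the absolute value matters: a feature that pushes some URLs toward phishing and others toward legitimate would otherwise average out to near zero.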

In the live scanner above: every URL scan produces a real SHAP breakdown — showing which features pushed the model toward or away from "Phishing", by exactly how much, and why. Positive SHAP = pushes toward phishing. Negative SHAP = pushes toward legitimate.
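The sign convention follows from SHAP's additivity property: the base value plus the sum of a prediction's SHAP values equals the model's raw output. The sketch below assumes the values are expressed in log-odds, which is the default output space for tree explainers on XGBoost classifiers, and uses hypothetical numbers.

```python
import math


def probability_from_shap(base_value: float, shap_values) -> float:
    """Sum the base value and SHAP contributions (log-odds), then apply the sigmoid."""
    log_odds = base_value + sum(shap_values)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical contributions for a single URL, in log-odds units.
base_value = 0.0  # corresponds to a 50% prior on a balanced dataset
contributions = {"url_length": 1.2, "ip_in_url": 0.9, "nb_hyphens": -0.3}

prob = probability_from_shap(base_value, contributions.values())
print(f"phishing probability = {prob:.3f}")
```

Here two features push toward phishing and one pushes back; the net log-odds of 1.8 maps to a probability of roughly 0.86.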

04 · Non-Technical Guide

3 Red Flags Anyone Can Spot

What 111,000 URLs taught the model — distilled into rules any internet user can apply without technical knowledge.

The URL is unusually long

Phishing URLs average more than twice the length of legitimate ones. Attackers inflate URLs with random characters to obscure the real destination. Be cautious of any URL exceeding 80 characters.

https://www.paypal-account-verification-secure-login.support-helpdesk.net/user/confirm?token=a7f3kj29

It uses an IP address, not a domain name

No legitimate bank, retailer, or social platform sends you to a raw IP address. When you see numbers in place of a readable domain, treat it as almost certainly a phishing attempt.

http://185.220.101.47/banking/login.php?redirect=true

The real domain sits just before the first single slash

The true site owner is the registered domain: typically the last two labels of the hostname, immediately before the first single slash. Every subdomain to its left can be fabricated. "paypal.secure-login.evildomain.com" belongs to evildomain.com, not PayPal.

https://paypal.secure-update.account-confirm.evildomain.com/login
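The same check can be automated. This is a deliberately naive sketch: `naive_registered_domain` is a hypothetical helper that keeps the last two hostname labels, so it gets multi-part suffixes such as `co.uk` wrong and is not meaningful for raw IP addresses. A production version would consult the Public Suffix List.

```python
from urllib.parse import urlparse


def naive_registered_domain(url: str) -> str:
    """Return the last two labels of the hostname (naive registered domain)."""
    hostname = urlparse(url).hostname or ""
    labels = hostname.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else hostname


owner = naive_registered_domain(
    "https://paypal.secure-update.account-confirm.evildomain.com/login"
)
print(owner)  # evildomain.com
```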

05 · Summary

What This Project Demonstrates

For Hiring Managers & Collaborators

GuardianURL is a complete data science narrative, not just a classification task. It demonstrates the ability to handle real-world scale data (111k rows), verify class balance before reaching for resampling, systematically compare and tune models, and communicate model decisions to non-technical stakeholders using SHAP.

"The best phishing detector is one that a normal person can trust — and trust requires explanation, not just accuracy."

The journalism background informing this project is a deliberate advantage: every section is written to communicate insight first, and display code second. The live scanner above is the centrepiece — a working AI system, not a mock-up.

Final model: XGBoost · F1: 0.974 · ROC-AUC: 0.991 — with full per-prediction SHAP explainability and a functional browser-ready scanner.

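# Deploy as REST API (Flask)
from flask import Flask, request, jsonify
import joblib
import numpy as np
import shap

app = Flask(__name__)
model = joblib.load('models/phishing_detector.pkl')
explainer = shap.TreeExplainer(model)

# extract_features(url) is the project's own extractor and must be defined or imported here.

@app.route('/scan', methods=['POST'])
def scan():
    # Build a 1 x 22 array: one row of lexical features
    feats = np.array([extract_features(request.json['url'])])
    # Cast the NumPy float to a Python float so it can be serialised to JSON
    prob = float(model.predict_proba(feats)[0, 1])
    shap_vals = explainer.shap_values(feats)[0]
    return jsonify({'probability': prob, 'shap': shap_vals.tolist()})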

Tech stack

Python 3.11 · XGBoost · SHAP · Scikit-learn · LightGBM · imbalanced-learn · Pandas · NumPy · Matplotlib · Seaborn · joblib · Jupyter