M02 Customer Segmentation, Unsupervised

Which customers
earn their keep.

A segmentation model that sorts every customer by value. Aims every marketing dollar at the 21% who generate 64% of revenue. The other 79% stop draining the budget.

Below: a live classifier, segment profiles, and a campaign ROI simulator. Try the ROI simulator first. Every chart is computed in your browser from the JSON the Python script writes to /assets/data/segmentation/. No hosted inference, no screenshots, no hand-picked numbers.

Filed

Dataset

UCI · Online Retail II · ~1M transactions

Source

notebooks/segmentation_model.py ↗

§ I Why segmentation matters

Mass marketing dilutes both signal and budget. A 1% response rate from a blanket email is a 99% waste, and the 1% who responded were going to buy anyway. Segmentation replaces that with targeted outreach: the right message to the right subset, priced against the revenue each segment actually produces.

Well-run segmentation programs routinely deliver double-digit percentage-point lift in campaign ROI over blanket sends. This page shows the method: RFM features, three clustering algorithms compared, and each segment profiled against its revenue contribution, all run against the public UCI Online Retail II dataset so every number is reproducible from the linked script.

The winning model here is K-Means at k=4. Four segments unpack a real 1M-transaction book into groups with dramatically different economics. As you're about to see, one of them earns roughly 64% of revenue on 21% of customers.

§ II Data Profile

§ III Three clustering methods, one dataset

All three methods run on the same log+scaled R/F/M matrix. Higher silhouette and Calinski-Harabasz are better; lower Davies-Bouldin is better. No single metric decides on its own, but silhouette is the primary criterion here. The winner row is highlighted.

Method	Best k	Silhouette	Calinski-Harabasz	Davies-Bouldin	Inertia
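The three scores in the table header map directly onto scikit-learn metric functions. What follows is a minimal sketch, not the linked script: it assumes a synthetic R/F/M matrix as a stand-in for the real data, uses K-Means only (the other two methods are not named on this page), and `score_clustering` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler


def score_clustering(X_scaled, labels):
    """Return the three internal validity scores for one labeling."""
    return {
        "silhouette": silhouette_score(X_scaled, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(X_scaled, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X_scaled, labels),        # lower is better
    }


# Synthetic stand-in for raw R/F/M values (assumption: not the UCI data).
rng = np.random.default_rng(42)
rfm = np.column_stack([
    rng.integers(1, 700, size=500),    # recency in days
    rng.integers(1, 50, size=500),     # frequency (order count)
    rng.gamma(2.0, 300.0, size=500),   # monetary (total spend)
])

# Same preprocessing the page describes: log1p, then standardize.
X = StandardScaler().fit_transform(np.log1p(rfm))

km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
scores = score_clustering(X, km.labels_)
scores["inertia"] = km.inertia_  # K-Means only; other methods may not define it
print(scores)
```

Inertia is specific to K-Means, which is one reason it sits in its own column rather than serving as a cross-method criterion.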

§ IV · M02.1 K selection · silhouette and elbow, all three methods

Left: silhouette score vs k; higher means tighter, better-separated clusters. The accent dot marks the chosen operational k. The k=2 column is shaded and labeled (degenerate): it wins on score because RFM data has a strong bimodal Pareto split, but two clusters are operationally useless for targeted outreach. Right: K-Means inertia elbow, showing diminishing returns after the operational k.

Fig. 02.1 a · Silhouette vs k

Fig. 02.1 b · K-Means inertia elbow
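The two panels can be reproduced in outline with a sweep over k. This is a sketch under assumptions: the data is synthetic, the k grid of 2 to 8 is illustrative, and the rule "best silhouette among k ≥ 3" in the hypothetical `pick_operational_k` is one plausible way to encode the exclusion of the degenerate k=2 split, not necessarily how the linked script chooses.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def sweep_k(X, k_values, seed=42):
    """Fit K-Means for each k and record silhouette and inertia."""
    rows = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append({
            "k": k,
            "silhouette": silhouette_score(X, km.labels_),
            "inertia": km.inertia_,
        })
    return rows


def pick_operational_k(rows, min_k=3):
    """Best silhouette among k >= min_k (assumed rule; skips the k=2 split)."""
    candidates = [r for r in rows if r["k"] >= min_k]
    return max(candidates, key=lambda r: r["silhouette"])["k"]


# Synthetic stand-in for the log+scaled R/F/M matrix.
rng = np.random.default_rng(42)
rfm = np.column_stack([
    rng.integers(1, 700, size=500),
    rng.integers(1, 50, size=500),
    rng.gamma(2.0, 300.0, size=500),
])
X = StandardScaler().fit_transform(np.log1p(rfm))

rows = sweep_k(X, range(2, 9))
operational_k = pick_operational_k(rows)
for r in rows:
    print(r["k"], round(r["silhouette"], 3), round(r["inertia"], 1))
print("operational k:", operational_k)
```

On synthetic data the chosen k will not necessarily be 4; the point is the shape of the procedure, not the number.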

§ V · M02.2 The segment map · 2,000 customers, PCA(2)

Each point is one customer. Axes are the top two principal components of the log-scaled R/F/M matrix. Hover a point for raw values. Click a legend entry to isolate a cluster.

Fig. 02.2 · Segment scatter (PCA)
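A projection like the one in Fig. 02.2 can be sketched with scikit-learn's PCA. The assumptions here: the input is synthetic, and the 2,000 plotted customers are taken to be a uniform random sample of the full base, which the page does not state explicitly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for raw R/F/M values (assumption: not the UCI data).
rng = np.random.default_rng(42)
rfm = np.column_stack([
    rng.integers(1, 700, size=5000),
    rng.integers(1, 50, size=5000),
    rng.gamma(2.0, 300.0, size=5000),
])
X = StandardScaler().fit_transform(np.log1p(rfm))

# Fit PCA on every customer, then project only the plotted sample.
pca = PCA(n_components=2, random_state=42).fit(X)
sample_idx = rng.choice(len(X), size=2000, replace=False)
coords = pca.transform(X[sample_idx])  # shape (2000, 2): x/y for the scatter

explained = pca.explained_variance_ratio_.sum()
print(coords.shape, round(float(explained), 3))
```

Fitting on the full matrix and sampling only for display keeps the axes stable regardless of which customers happen to be drawn.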


§ VI · M02.3 Classify a customer, live

Inputs are log1p-transformed and standardized with the same scaler the model was trained on, then assigned to the nearest K-Means centroid by Euclidean distance. Move the sliders and the assignment updates in real time.

Assigned segment

—

Customers: —
Of base: —
Of revenue: —

This is the same K-Means model used in Fig. 02.2. The inset re-projects through the same PCA so the "you" dot lives in the same space as the scatter above.
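The assignment step described in this section is small enough to write out in plain Python. In this sketch the scaler parameters, centroid coordinates, and segment names are all hypothetical placeholders; the real values would come from the JSON the pipeline writes.

```python
import math

# Hypothetical frozen parameters (placeholders, not the trained model's values).
SCALER_MEAN = [4.0, 1.2, 6.0]    # per-feature mean of log1p(R), log1p(F), log1p(M)
SCALER_SCALE = [1.5, 0.8, 1.3]   # per-feature standard deviation
CENTROIDS = {                    # centroids in scaled space
    "segment_0": [-1.2, 1.4, 1.5],
    "segment_1": [0.8, -0.6, -0.7],
    "segment_2": [-0.3, 0.2, 0.1],
    "segment_3": [1.1, -0.9, -1.2],
}


def transform(recency, frequency, monetary):
    """Apply log1p, then standardize with the frozen scaler."""
    raw = (recency, frequency, monetary)
    return [
        (math.log1p(value) - mean) / scale
        for value, mean, scale in zip(raw, SCALER_MEAN, SCALER_SCALE)
    ]


def classify(recency, frequency, monetary):
    """Return the name of the nearest centroid by Euclidean distance."""
    x = transform(recency, frequency, monetary)
    return min(CENTROIDS, key=lambda name: math.dist(x, CENTROIDS[name]))


print(classify(recency=10, frequency=12, monetary=2500.0))
```

Because the scaler and centroids are frozen, the same function can run in the browser or on a server and return the same segment for the same inputs.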

§ VII · M02.4 Segment profiles

One card per cluster. The small bars show where each segment's median customer lands on the overall distribution for Recency, Frequency, and Monetary. The revenue-vs-customers bar tells you whether the segment is pulling its weight.
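The numbers behind each card can be sketched as follows. This is an illustration on toy data, assuming "where the median lands" means the percentile rank of the segment median within the overall distribution; `profile_segments` and `percentile_rank` are hypothetical names, and the segment labels are invented for the example.

```python
from bisect import bisect_right
from statistics import median

FEATURES = ("recency", "frequency", "monetary")


def percentile_rank(sorted_values, x):
    """Fraction of values in sorted_values that are <= x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def profile_segments(customers):
    """Per-segment customer share, revenue share, and median percentile ranks."""
    total_revenue = sum(c["monetary"] for c in customers)
    overall = {f: sorted(c[f] for c in customers) for f in FEATURES}
    profiles = {}
    for seg in sorted({c["segment"] for c in customers}):
        members = [c for c in customers if c["segment"] == seg]
        profiles[seg] = {
            "customer_share": len(members) / len(customers),
            "revenue_share": sum(c["monetary"] for c in members) / total_revenue,
            "median_percentile": {
                f: percentile_rank(overall[f], median(c[f] for c in members))
                for f in FEATURES
            },
        }
    return profiles


# Toy data (assumption): one high-value customer, three low-value ones.
customers = [
    {"segment": "high", "recency": 5, "frequency": 10, "monetary": 800.0},
    {"segment": "low", "recency": 100, "frequency": 1, "monetary": 50.0},
    {"segment": "low", "recency": 200, "frequency": 2, "monetary": 100.0},
    {"segment": "low", "recency": 300, "frequency": 1, "monetary": 50.0},
]
profiles = profile_segments(customers)
print(profiles["high"]["customer_share"], profiles["high"]["revenue_share"])
```

A segment whose revenue share exceeds its customer share is pulling more than its weight; the reverse marks a segment that costs more attention than it returns.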

§ VIII · M02.5 Campaign ROI simulator

Pick a segment, set a per-customer campaign cost and an expected lift multiplier. The baseline response rate is the segment's own measured 60-day repurchase rate, not a fabricated number. The ROI math and all intermediates are computed live.

Segment

Cost / customer $1.00

Expected lift 1.5×

contacted

→

baseline buyers
size × repurchase rate

→

with-campaign buyers
baseline × lift

→

incremental buyers

Incremental revenue

—

incremental buyers × AOV

Campaign cost

—

size × cost per customer

Net ROI

—

The 60-day repurchase rate is measured directly from the data: the fraction of each segment's customers who bought in the last 60 days of the snapshot. The lift multiplier is user-adjustable; industry direct-response benchmarks typically land between 1.3× and 2.0× for win-back campaigns. Your mileage, as always, varies.
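The simulator's intermediates follow the chain shown in the widget: size × repurchase rate, baseline × lift, incremental buyers × AOV, and size × cost per customer. A minimal sketch of that arithmetic is below. The page does not spell out the Net ROI formula, so this sketch assumes (incremental revenue − campaign cost) / campaign cost; the function and field names and the input values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoiResult:
    contacted: int
    baseline_buyers: float
    campaign_buyers: float
    incremental_buyers: float
    incremental_revenue: float
    campaign_cost: float
    net_roi: float


def simulate_campaign(size, repurchase_rate, aov, cost_per_customer, lift):
    """Walk the simulator chain for one segment."""
    if cost_per_customer <= 0:
        raise ValueError("cost_per_customer must be positive")
    baseline_buyers = size * repurchase_rate        # size × repurchase rate
    campaign_buyers = baseline_buyers * lift        # baseline × lift
    incremental_buyers = campaign_buyers - baseline_buyers
    incremental_revenue = incremental_buyers * aov  # incremental buyers × AOV
    campaign_cost = size * cost_per_customer        # size × cost per customer
    # Assumed definition: net return per dollar of campaign spend.
    net_roi = (incremental_revenue - campaign_cost) / campaign_cost
    return RoiResult(
        contacted=size,
        baseline_buyers=baseline_buyers,
        campaign_buyers=campaign_buyers,
        incremental_buyers=incremental_buyers,
        incremental_revenue=incremental_revenue,
        campaign_cost=campaign_cost,
        net_roi=net_roi,
    )


# Illustrative inputs, not measured values from the dataset.
result = simulate_campaign(
    size=1000, repurchase_rate=0.25, aov=40.0, cost_per_customer=1.0, lift=1.5
)
print(result.incremental_buyers, result.net_roi)  # 125.0 4.0
```

With these inputs, 250 baseline buyers become 375 with the campaign, so 125 incremental buyers at a $40 AOV return $5,000 against a $1,000 spend. Note that the sketch does not cap campaign buyers at segment size, so a high baseline rate combined with a large lift can overshoot.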

§ IX Methodology & Colophon

Dataset

UCI ML Repository · Online Retail II ↗: two years of UK online retail transactions, roughly 1M rows before cleaning.

Pipeline script

notebooks/segmentation_model.py ↗. RFM aggregation, log1p + StandardScaler, three clustering methods × k grid, silhouette / Calinski-Harabasz / Davies-Bouldin, PCA(2) for the map, deterministic percentile-band auto-namer.
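The first step in that list, RFM aggregation, can be sketched with pandas. This is an assumption-laden illustration rather than the script's code: the column names follow the public Online Retail II file as I understand it (`Customer ID`, `Invoice`, `InvoiceDate`, `Quantity`, `Price`), the snapshot date is passed in by hand, `build_rfm` is a hypothetical name, and cleaning steps such as handling returns are omitted.

```python
import pandas as pd


def build_rfm(tx, snapshot):
    """Aggregate transactions into one Recency/Frequency/Monetary row per customer."""
    tx = tx.dropna(subset=["Customer ID"]).copy()   # drop anonymous transactions
    tx["Customer ID"] = tx["Customer ID"].astype(int)
    tx["Revenue"] = tx["Quantity"] * tx["Price"]
    return tx.groupby("Customer ID").agg(
        # Recency: days between the snapshot and the customer's last purchase.
        recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
        # Frequency: number of distinct invoices.
        frequency=("Invoice", "nunique"),
        # Monetary: total spend.
        monetary=("Revenue", "sum"),
    )


# Toy transactions (assumption: invented rows, not the UCI data).
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, None],
    "Invoice": ["A1", "A2", "B1", "C1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-10", "2011-01-05", "2011-01-07"]
    ),
    "Quantity": [2, 1, 3, 1],
    "Price": [5.0, 10.0, 2.0, 1.0],
})
rfm = build_rfm(tx, snapshot=pd.Timestamp("2011-01-11"))
print(rfm)
```

The resulting three-column frame is what the log1p and StandardScaler steps would then consume.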

Reproducibility

random_state=42 everywhere. Last regenerated —. Running python notebooks/segmentation_model.py twice on the same CSV produces byte-identical JSON.

Limitations

Static snapshot: drift between training time and today is not modeled. RFM captures transaction patterns but misses product affinity and channel mix. The live classifier snaps a new customer to the nearest frozen centroid; a production program would retrain on a schedule. Four segments is deliberately few for readability; real programs usually run a dozen or more.
