M02 Customer Segmentation, Unsupervised
Which customers earn their keep.
A segmentation model that sorts every customer by value. Aims every marketing dollar at the 21% who generate 64% of revenue. The other 79% stop draining the budget.
Below: a live classifier, segment profiles, and a campaign ROI simulator. Try the ROI simulator first. Every chart is computed in your browser from the JSON the Python script writes to /assets/data/segmentation/. No hosted inference, no screenshots, no hand-picked numbers.
§ I Why segmentation matters
Mass marketing dilutes both signal and budget. A 1% response rate from a blanket email is a 99% waste, and many of the 1% who responded were going to buy anyway. Segmentation replaces that with targeted outreach: the right message to the right subset, priced against the revenue each segment actually produces.
Well-run segmentation programs routinely deliver double-digit percentage-point lift in campaign ROI over blanket sends. This page shows the method: RFM features, three clustering algorithms compared, and each segment profiled against its revenue contribution. Everything runs against the public UCI Online Retail II dataset, so every number is reproducible from the linked script.
The winning model here is K-Means at k=4. Four segments unpack a real 1M-transaction book into groups with dramatically different economics: as you're about to see, one of them earns roughly 64% of revenue on 21% of customers.
§ II Data Profile
§ III Three clustering methods, one dataset
All three methods run on the same log+scaled R/F/M matrix. Higher silhouette and Calinski-Harabasz are better; lower Davies-Bouldin is better. No single metric decides on its own, but silhouette is the primary criterion here. The winner row is highlighted.
| Method | Best k | Silhouette | Calinski-Harabasz | Davies-Bouldin | Inertia |
|---|---|---|---|---|---|
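A minimal sketch of how a comparison like this could be computed with scikit-learn. The page names K-Means but not the other two methods, so agglomerative clustering and a Gaussian mixture are used here purely as stand-ins. The data is synthetic, not the UCI set, and `score_methods` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an R/F/M table (NOT the UCI data): 600 customers.
rng = np.random.default_rng(42)
rfm = np.column_stack([
    rng.integers(1, 365, size=600),     # recency, days since last purchase
    rng.lognormal(1.0, 0.8, size=600),  # frequency (continuous for simplicity)
    rng.lognormal(5.0, 1.2, size=600),  # monetary, total spend
])
X = StandardScaler().fit_transform(np.log1p(rfm))


def score_methods(X, k=4, seed=42):
    """Fit each method at k clusters and return the three internal metrics."""
    labelers = {
        "kmeans": lambda: KMeans(
            n_clusters=k, n_init=10, random_state=seed
        ).fit_predict(X),
        # Assumed stand-ins: the source does not name methods two and three.
        "agglomerative": lambda: AgglomerativeClustering(
            n_clusters=k
        ).fit_predict(X),
        "gmm": lambda: GaussianMixture(
            n_components=k, random_state=seed
        ).fit_predict(X),
    }
    rows = {}
    for name, fit in labelers.items():
        labels = fit()
        rows[name] = {
            "silhouette": silhouette_score(X, labels),                # higher is better
            "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
            "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        }
    return rows


scores = score_methods(X)
for name, row in scores.items():
    print(name, {metric: round(value, 3) for metric, value in row.items()})
```

Inertia is only defined for K-Means (`KMeans.inertia_`), which is why it is left out of the shared metric dictionary in this sketch.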
§ IV · M02.1 K selection · silhouette and elbow, all three methods
Left: silhouette score vs k; higher means tighter, better-separated clusters. The accent dot marks the chosen operational k. The k=2 column is shaded and labeled degenerate: it wins on score because RFM data has a strong bimodal Pareto split, but two clusters are operationally useless for targeted outreach. Right: K-Means inertia elbow, with diminishing returns after the operational k.
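The sweep behind a chart like this can be sketched as follows. The data is synthetic (`make_blobs`), and the rule "ignore k=2, then take the best silhouette" is an assumption about how an operational k might be picked, not something the page spells out. `sweep_k` is a hypothetical helper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled R/F/M matrix (NOT the UCI data).
X, _ = make_blobs(n_samples=500, centers=4, n_features=3, random_state=42)


def sweep_k(X, k_values, seed=42):
    """Fit K-Means at each k and record silhouette and inertia."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append({
            "k": k,
            "silhouette": silhouette_score(X, model.labels_),
            "inertia": model.inertia_,
        })
    return rows


sweep = sweep_k(X, range(2, 9))

# Raw winner on score alone; on real RFM data this can be the degenerate k=2.
best_by_score = max(sweep, key=lambda r: r["silhouette"])["k"]

# Assumed operational rule: exclude k=2, then take the best silhouette.
operational_k = max(
    (r for r in sweep if r["k"] >= 3), key=lambda r: r["silhouette"]
)["k"]

for r in sweep:
    print(r["k"], round(r["silhouette"], 3), round(r["inertia"], 1))
```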
§ V · M02.2 The segment map · 2,000 customers, PCA(2)
Each point is one customer. Axes are the top two principal components of the log-scaled R/F/M matrix. Hover a point for raw values. Click a legend entry to isolate a cluster.
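A minimal sketch of the projection behind a map like this, assuming scikit-learn's `PCA` applied to the log1p-transformed, standardized matrix. The 2,000 rows below are synthetic, not sampled from the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 2,000 customers' R/F/M values (NOT the UCI data).
rng = np.random.default_rng(42)
rfm = np.column_stack([
    rng.integers(1, 365, size=2000),     # recency
    rng.lognormal(1.0, 0.8, size=2000),  # frequency
    rng.lognormal(5.0, 1.2, size=2000),  # monetary
])

# Same preprocessing order the page describes: log1p, then standardize.
X = StandardScaler().fit_transform(np.log1p(rfm))

# Keep the fitted PCA object: new points must go through the same transform.
pca = PCA(n_components=2, random_state=42).fit(X)
coords = pca.transform(X)                 # shape (2000, 2), one row per point
explained = pca.explained_variance_ratio_  # variance share of each axis

print(coords[:3])
print(explained)
```

The explained-variance ratios are worth showing next to the axes, since they tell the reader how much of the three-feature structure the two-dimensional map actually preserves.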
§ VI · M02.3 Classify a customer, live
Inputs are log1p-transformed and standardized with the same scaler the model was trained on, then assigned to the nearest K-Means centroid by Euclidean distance. Move the sliders and the assignment updates in real time.
- Customers
- —
- Of base
- —
- Of revenue
- —
This is the same K-Means model used in Fig. 02.2. The inset re-projects through the same PCA so the "you" dot lives in the same space as the scatter above.
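The classification step described above can be sketched in plain Python. Every number below (`MEAN`, `SCALE`, `CENTROIDS`, `COMPONENTS`) is a hypothetical placeholder, not a value from the exported JSON, and the projection assumes the PCA mean is zero because the inputs are already standardized.

```python
import math


def classify(recency, frequency, monetary, mean, scale, centroids):
    """Return (index of the nearest centroid, standardized feature vector)."""
    logged = [math.log1p(v) for v in (recency, frequency, monetary)]
    z = [(x - m) / s for x, m, s in zip(logged, mean, scale)]
    distances = [math.dist(z, c) for c in centroids]  # Euclidean distance
    return distances.index(min(distances)), z


def project(z, components):
    """Project a standardized vector onto PCA axes (PCA mean assumed ~0)."""
    return [sum(zi * ci for zi, ci in zip(z, comp)) for comp in components]


# Hypothetical frozen parameters; the real ones would come from the trained
# scaler, K-Means model, and PCA.
MEAN = [3.5, 1.2, 6.0]    # per-feature mean of the log1p values
SCALE = [1.4, 0.9, 1.3]   # per-feature standard deviation
CENTROIDS = [
    [-1.2, 1.3, 1.4],     # recent, frequent, high spend
    [0.9, -0.8, -0.7],    # lapsed, infrequent, low spend
]
COMPONENTS = [
    [0.0, 0.7071, 0.7071],
    [1.0, 0.0, 0.0],
]

segment, z = classify(5, 20, 5000, MEAN, SCALE, CENTROIDS)
xy = project(z, COMPONENTS)
print(segment, xy)

lapsed_segment, _ = classify(300, 1, 100, MEAN, SCALE, CENTROIDS)
print(lapsed_segment)
```

Because the centroids are frozen, the browser only needs the scaler parameters, the centroid coordinates, and the PCA components to reproduce the assignment and place the "you" dot.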
§ VII · M02.4 Segment profiles
One card per cluster. The small bars show where each segment's median customer lands on the overall distribution for Recency, Frequency, and Monetary. The revenue-vs-customers bar tells you whether the segment is pulling its weight.
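A sketch of the two quantities each card relies on: where a segment's median falls on the overall distribution, and its share of revenue against its share of customers. The percentile-rank definition (share of customers at or below the value) and the use of the monetary total as revenue are assumptions. The five customers are toy data and `profile_segments` is a hypothetical helper.

```python
from bisect import bisect_right
from statistics import median

FEATURES = ("recency", "frequency", "monetary")


def percentile_rank(sorted_values, x):
    """Percentage of values that are <= x."""
    return 100.0 * bisect_right(sorted_values, x) / len(sorted_values)


def profile_segments(customers):
    """customers: list of dicts with keys segment, recency, frequency, monetary."""
    total_revenue = sum(c["monetary"] for c in customers)
    overall = {f: sorted(c[f] for c in customers) for f in FEATURES}
    profiles = {}
    for seg in sorted({c["segment"] for c in customers}):
        members = [c for c in customers if c["segment"] == seg]
        profiles[seg] = {
            "customer_share": len(members) / len(customers),
            "revenue_share": sum(c["monetary"] for c in members) / total_revenue,
            # Where the segment's median customer lands on the overall scale.
            "median_percentile": {
                f: percentile_rank(overall[f], median(c[f] for c in members))
                for f in FEATURES
            },
        }
    return profiles


# Toy data: one heavy spender and four light ones.
customers = [
    {"segment": "A", "recency": 5, "frequency": 20, "monetary": 800},
    {"segment": "B", "recency": 40, "frequency": 4, "monetary": 100},
    {"segment": "B", "recency": 90, "frequency": 2, "monetary": 50},
    {"segment": "B", "recency": 200, "frequency": 1, "monetary": 30},
    {"segment": "B", "recency": 300, "frequency": 1, "monetary": 20},
]

profiles = profile_segments(customers)
for seg, p in profiles.items():
    print(seg, p["customer_share"], p["revenue_share"], p["median_percentile"])
```

A segment pulls its weight when `revenue_share` exceeds `customer_share`; in the toy data, segment A holds 20% of customers and 80% of revenue.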
§ VIII · M02.5 Campaign ROI simulator
Pick a segment, set a per-customer campaign cost and an expected lift multiplier. The baseline response rate is the segment's own measured 60-day repurchase rate, not a fabricated number. The ROI math and all intermediates are computed live.
Baseline responders = size × repurchase rate
Lifted responders = baseline × lift
The 60-day repurchase rate is measured directly from the data: the fraction of each segment's customers who bought in the last 60 days of the snapshot. The lift multiplier is user-adjustable; industry direct-response benchmarks typically land between 1.3× and 2.0× for win-back campaigns. Your mileage, as always, varies.
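One way the simulator's arithmetic could look. Only the first two intermediates (size × repurchase rate, baseline × lift) come from the page. Crediting the campaign with incremental responders only, the `revenue_per_response` input, the cap at segment size, and the ROI definition (net return over cost) are all assumptions made for this sketch.

```python
def campaign_roi(size, repurchase_rate, lift, cost_per_customer,
                 revenue_per_response):
    """Return the ROI intermediates for one segment (hedged sketch)."""
    baseline = size * repurchase_rate    # buyers expected with no campaign
    lifted = min(baseline * lift, size)  # assumed cap: cannot exceed the segment
    incremental = lifted - baseline      # buyers attributed to the campaign
    revenue = incremental * revenue_per_response
    cost = size * cost_per_customer      # every customer in the segment is contacted
    roi = (revenue - cost) / cost if cost else float("nan")
    return {
        "baseline": baseline,
        "lifted": lifted,
        "incremental": incremental,
        "revenue": revenue,
        "cost": cost,
        "roi": roi,
    }


# Illustrative inputs only, not figures from the dataset.
result = campaign_roi(
    size=1000,
    repurchase_rate=0.30,
    lift=1.5,
    cost_per_customer=2.0,
    revenue_per_response=50.0,
)
for key, value in result.items():
    print(key, round(value, 2))
```

With these inputs the campaign adds 150 buyers on a 300-buyer baseline, bringing in 7,500 in revenue against 2,000 in cost, for an ROI of 2.75. Counting only incremental responders keeps the campaign from taking credit for purchases that would have happened anyway.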
§ IX Methodology & Colophon
UCI ML Repository · Online Retail II: two years of UK online retail transactions, roughly 1M rows before cleaning.
notebooks/segmentation_model.py: RFM aggregation, log1p + StandardScaler, three clustering methods × k grid, silhouette / Calinski-Harabasz / Davies-Bouldin, PCA(2) for the map, deterministic percentile-band auto-namer.
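A hedged sketch of the RFM aggregation and scaling steps listed above, using pandas. The column names follow the Online Retail II spreadsheet as commonly distributed (`Customer ID`, `Invoice`, `InvoiceDate`, `Quantity`, `Price`) and are an assumption about the script's input. The six rows are toy data, `build_rfm` is a hypothetical helper, and cleaning steps such as dropping cancellations or rows without a customer ID are not shown.

```python
import numpy as np
import pandas as pd

# Toy transactions with assumed Online Retail II-style column names.
tx = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2, 3],
    "Invoice": ["A1", "A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-11-01", "2011-12-01",
        "2011-06-01", "2011-06-15", "2010-12-20",
    ]),
    "Quantity": [2, 1, 5, 10, 1, 3],
    "Price": [3.0, 4.0, 2.0, 1.5, 20.0, 5.0],
})


def build_rfm(tx, snapshot):
    """Aggregate line items into one Recency/Frequency/Monetary row per customer."""
    tx = tx.assign(Revenue=tx["Quantity"] * tx["Price"])
    grouped = tx.groupby("Customer ID")
    return pd.DataFrame({
        "recency": (snapshot - grouped["InvoiceDate"].max()).dt.days,
        "frequency": grouped["Invoice"].nunique(),  # distinct invoices
        "monetary": grouped["Revenue"].sum(),       # total spend
    })


# Snapshot date: one day after the last transaction (an assumed convention).
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)
rfm = build_rfm(tx, snapshot)

# log1p tames the heavy right tail; z-scoring with ddof=0 uses the population
# standard deviation, which is what StandardScaler computes.
logged = np.log1p(rfm)
scaled = (logged - logged.mean()) / logged.std(ddof=0)

print(rfm)
print(scaled.round(3))
```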
random_state=42 everywhere. Running python notebooks/segmentation_model.py twice on the same CSV produces byte-identical JSON.
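A claim like this can be checked by hashing the serialized output of two runs. The sketch below uses a hypothetical `run_pipeline` stand-in rather than the real script, and assumes the JSON is written with sorted keys and fixed rounding so that serialization itself cannot introduce differences.

```python
import hashlib
import json
import random


def run_pipeline(seed=42):
    """Hypothetical stand-in for the real script: any fully seeded computation."""
    rng = random.Random(seed)
    centroids = [[round(rng.gauss(0, 1), 6) for _ in range(3)] for _ in range(4)]
    return {"k": 4, "centroids": centroids}


def canonical_bytes(payload):
    """Serialize with sorted keys and fixed separators for a stable byte stream."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


digest_a = hashlib.sha256(canonical_bytes(run_pipeline(42))).hexdigest()
digest_b = hashlib.sha256(canonical_bytes(run_pipeline(42))).hexdigest()
digest_other = hashlib.sha256(canonical_bytes(run_pipeline(7))).hexdigest()

print(digest_a == digest_b)      # same seed, same bytes
print(digest_a == digest_other)  # different seed, different bytes
```

Against the real pipeline, the same comparison would be run on the files written to /assets/data/segmentation/ after two consecutive runs.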
Static snapshot: drift between training time and today is not modeled. RFM captures transaction patterns but misses product affinity and channel mix. The live classifier snaps a new customer to the nearest frozen centroid; a production program would retrain on a schedule. Four segments is deliberately few for readability; real programs usually run a dozen or more.