The probability boundary.
The simplest classifier worth deploying. Returns a calibrated probability instead of a hard label, which is the difference between "this lead might convert" and "this lead has a 73% chance, so chase the top decile and skip the rest."
A logistic regression trained live in the browser by batch gradient descent. Plant points by clicking, switch the active class with the toggle, and watch the sigmoid slide its decision boundary into place. Every weight update, every iteration of log-loss, every shift in accuracy is recomputed on this page. Nothing is precomputed.
A hard "yes or no" classifier is a coin that has been told to stop spinning early. Logistic regression keeps it spinning, and reports the angle.
A spam filter that flags an email is useful. A spam filter that flags an email and says "I'm 0.62 confident" is more useful, because now you can route the soft calls to a review queue and auto-archive only the 0.99s. A churn model that returns a probability lets the retention team budget against expected saves rather than gut feel about who looks "shaky."
Calibration is the quiet superpower here. The output of a logistic regression isn't just a number between zero and one; it's a number that, when calibrated, actually means what it sounds like: of all the customers it scores at 0.30, roughly thirty percent really do convert. That is not true of an arbitrary deep network's softmax, and it is what makes logistic regression the model production keeps coming back to even when fancier options exist.
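Checking that claim is cheap. A minimal sketch, assuming hypothetical probs and labels arrays of held-out predictions and true 0/1 outcomes: bucket the predictions and compare each bucket's average predicted probability to the fraction that actually converted.

```js
// Reliability check sketch. `probs` and `labels` are hypothetical held-out arrays.
function reliabilityBins(probs, labels, nBins = 10) {
  const bins = Array.from({ length: nBins }, () => ({ n: 0, pSum: 0, ySum: 0 }));
  probs.forEach((p, i) => {
    const b = Math.min(nBins - 1, Math.floor(p * nBins));
    bins[b].n += 1;
    bins[b].pSum += p;
    bins[b].ySum += labels[i];
  });
  // Well calibrated: meanPredicted is close to actualRate in every populated bin.
  return bins
    .filter(b => b.n > 0)
    .map(b => ({ meanPredicted: b.pSum / b.n, actualRate: b.ySum / b.n, count: b.n }));
}
```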
Click the canvas to drop a point of the active class. The model refits as you go. The shaded region is the model's probability surface, deepest at the highest-confidence regions. The line is the 0.5 decision boundary.
The sigmoid squash
A linear score, w · x + b, can land anywhere on the number line. The sigmoid, 1 / (1 + e^-z), compresses that score z into a probability between zero and one. Big positive scores saturate near one, big negative scores near zero, and the slope is steepest at zero, where the output is exactly 0.5, which is why the boundary is where the model is most uncertain.
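In code the forward pass is two small functions. This is a sketch for the two-feature case on this page; the names sigmoid and predict are mine, not necessarily the page's.

```js
// Squash a linear score into (0, 1).
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Forward pass for two features: z = w·x + b, then the sigmoid.
function predict(w, b, x) {           // w = [w1, w2], x = [x1, x2]
  const z = w[0] * x[0] + w[1] * x[1] + b;
  return sigmoid(z);                   // z = 0 lands exactly on 0.5
}
```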
Log loss has a gradient
Squared error has nasty plateaus on a sigmoid. Log loss does not. Its gradient with respect to the weights collapses to a clean vector, the average of (predicted minus actual) times the input. That gradient is what every step uses to nudge the boundary toward a better fit.
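Reusing the predict sketch above, the batch gradient is literally that average, with the bias treated as a feature whose input is always one. The points shape below is an assumption of mine.

```js
// Batch gradient of the (unregularized) log loss.
// points is assumed to look like [{ x: [x1, x2], y: 0 or 1 }, ...].
function logLossGradient(w, b, points) {
  const gw = [0, 0];
  let gb = 0;
  for (const { x, y } of points) {
    const err = predict(w, b, x) - y;  // (predicted minus actual)
    gw[0] += err * x[0];
    gw[1] += err * x[1];
    gb += err;                          // the bias's "input" is 1
  }
  const n = points.length;
  return { gw: [gw[0] / n, gw[1] / n], gb: gb / n };
}
```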
L2 keeps the line honest
With perfectly separable points, the loss minimum lies at infinite weights, which turns the probability surface into a hard step: a brittle wall of a boundary. The L2 penalty adds a small cost for each squared weight. That trades a microscopic amount of training accuracy for a smoother boundary that generalizes better when new points arrive.
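One gradient step with the penalty folded in might look like the sketch below. The learning rate and lambda values are illustrative, not the page's actual settings, and the bias is left unregularized by convention.

```js
// One batch gradient-descent step with L2: the penalty's gradient is just lambda * w.
function step(w, b, points, lr = 0.5, lambda = 0.01) {
  const { gw, gb } = logLossGradient(w, b, points);
  return {
    w: [
      w[0] - lr * (gw[0] + lambda * w[0]),
      w[1] - lr * (gw[1] + lambda * w[1]),
    ],
    b: b - lr * gb,                     // bias is conventionally not penalized
  };
}
```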
The model fits one straight line in feature space. Three failure modes worth memorizing before reaching for it in production.
Non-linear truth
Try the moons preset above. Two interlocking arcs cannot be separated by any straight line. The boundary settles into the least-bad compromise and accuracy stalls in the seventies. The fix is feature engineering or a non-linear model, not more iterations.
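The cheapest version of that feature engineering is a hand-built expansion. The sketch below (the expandFeatures name is my own) maps two features to five, so the same linear machinery can trace a curved boundary; training then needs five weights instead of two, but the update rule is unchanged.

```js
// Quadratic feature expansion: the model stays linear in the expanded space,
// but its boundary becomes a conic section in the original two dimensions.
function expandFeatures([x1, x2]) {
  return [x1, x2, x1 * x2, x1 * x1, x2 * x2];
}
```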
Outlier domination
A handful of mislabeled points far from the boundary can pull the line into a worse position than ignoring them. Logistic regression is not robust by default. Real deployments add weight clipping, sample reweighting, or a Huber-style loss to make it less brittle to bad labels.
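A sample-reweighting variant of the gradient takes only a few lines. The weights here are assumed to come from some upstream guess about label quality, which is the hard part in practice.

```js
// Weighted batch gradient: suspect points get a small weight so they cannot
// drag the boundary on their own. sampleWeights is assumed to be given.
function weightedGradient(w, b, points, sampleWeights) {
  const gw = [0, 0];
  let gb = 0;
  let total = 0;
  points.forEach(({ x, y }, i) => {
    const sw = sampleWeights[i];
    const err = sw * (predict(w, b, x) - y);
    gw[0] += err * x[0];
    gw[1] += err * x[1];
    gb += err;
    total += sw;
  });
  return { gw: [gw[0] / total, gw[1] / total], gb: gb / total };
}
```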
Class imbalance
Drop ten positives and one negative on the canvas. The model will happily predict positive everywhere and report roughly ninety percent accuracy. Production fixes are class weighting in the loss, or scoring against a metric that actually rewards minority-class recall, such as AUC or F1.
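Scoring with something other than accuracy is the easier half to sketch. Precision, recall, and F1 for the minority class expose the ten-to-one failure immediately: in that example, score the negative class, and a model that predicts positive everywhere gets recall zero on it. The function below is my illustration, not part of the page.

```js
// Precision, recall, and F1 for one class (pass the minority label).
function classMetrics(probs, labels, minorityLabel = 1, threshold = 0.5) {
  let tp = 0, fp = 0, fn = 0;
  probs.forEach((p, i) => {
    const predicted = (p >= threshold ? 1 : 0) === minorityLabel;
    const actual = labels[i] === minorityLabel;
    if (predicted && actual) tp += 1;
    else if (predicted && !actual) fp += 1;
    else if (!predicted && actual) fn += 1;
  });
  const precision = tp / (tp + fp) || 0;
  const recall = tp / (tp + fn) || 0;
  const f1 = (2 * precision * recall) / (precision + recall) || 0;
  return { precision, recall, f1 };
}
```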
Pure JavaScript, HTML5 Canvas 2D, no libraries. Three weights (bias plus two features) updated by batch gradient descent every animation frame. Decision surface drawn by sampling the sigmoid on a 60 by 45 grid and shading by probability. No external scoring service, no model file, no precomputed anything.
Forward pass is one dot product and one sigmoid per point. For each iteration, we compute predictions on every planted point, derive the gradient of cross-entropy plus L2, and step. The Python notebook linked in the source field uses the identical update rule with NumPy and converges to the same weights to within four decimal places.
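Put together, the per-frame loop described above comes down to something like this sketch, reusing the predict and step functions from earlier. The grid resolution matches the 60 by 45 mentioned above, but the color, the canvas layout, and the use of raw canvas coordinates as features are simplifying assumptions of mine.

```js
// One animation frame: a single batch gradient step, then the probability
// surface resampled on a 60 x 45 grid and shaded by confidence.
const GRID_W = 60, GRID_H = 45;

function frame(ctx, state) {
  // state = { w: [w1, w2], b, points } — one step() per frame.
  if (state.points.length > 0) {
    ({ w: state.w, b: state.b } = step(state.w, state.b, state.points));
  }

  const cellW = ctx.canvas.width / GRID_W;
  const cellH = ctx.canvas.height / GRID_H;
  for (let i = 0; i < GRID_W; i++) {
    for (let j = 0; j < GRID_H; j++) {
      const p = predict(state.w, state.b, [(i + 0.5) * cellW, (j + 0.5) * cellH]);
      // Darkest where |p - 0.5| is largest, i.e. where the model is most confident.
      ctx.fillStyle = `rgba(70, 130, 180, ${2 * Math.abs(p - 0.5)})`;
      ctx.fillRect(i * cellW, j * cellH, cellW, cellH);
    }
  }
  requestAnimationFrame(() => frame(ctx, state));
}
```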
Hastie, Tibshirani, Friedman · The Elements of Statistical Learning ↗
Tom Mitchell · Generative and Discriminative Classifiers ↗
scikit-learn · the canonical production implementation ↗
Two-feature toy by design. Real deployments handle hundreds of features and sparse inputs, and need stochastic or mini-batch optimization for memory reasons. Calibration beyond the training distribution is not free; it requires Platt scaling or isotonic regression. The decision boundary is always a hyperplane in feature space, so anything truly non-linear needs explicit feature crosses.
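Platt scaling itself is small enough to sketch: fit a one-dimensional logistic regression, sigmoid(a · s + c), to held-out labels, where s is the model's uncalibrated score, and use the fitted map on future scores. The names and the fixed learning rate below are illustrative assumptions, not the page's code.

```js
// Platt scaling sketch: learn (a, c) by gradient descent on the log loss of
// sigmoid(a * s + c) against held-out labels.
function fitPlatt(scores, labels, iters = 2000, lr = 0.1) {
  let a = 1, c = 0;
  const n = scores.length;
  for (let t = 0; t < iters; t++) {
    let ga = 0, gc = 0;
    for (let i = 0; i < n; i++) {
      const err = sigmoid(a * scores[i] + c) - labels[i];  // same (p - y) form as before
      ga += err * scores[i];
      gc += err;
    }
    a -= lr * (ga / n);
    c -= lr * (gc / n);
  }
  return s => sigmoid(a * s + c);       // calibrated probability for a new score
}
```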