Linear Classifiers in Python

Learn how linear classifiers work in Python: logistic regression, SVMs, loss functions, regularization, and kernels.

Strip away the hype and a surprising share of production machine learning comes down to drawing a single straight line. Spam filters, sentiment scorers, credit decisions, churn models: behind a great many of them sits a linear classifier, usually logistic regression or a support vector machine. They are fast to train, cheap to run, and easy to interrogate, which is exactly why they have outlived a decade of flashier alternatives. This article is about how they actually work and how to use them well in scikit-learn.

The one idea behind every linear classifier

A linear classifier scores an example by multiplying each feature by a weight, adding the results together, and tacking on a bias term. If the total comes out positive it predicts one class, and if it comes out negative it predicts the other. The dividing line, the place where the score is exactly zero, is the decision boundary. In two dimensions that boundary is a line, in three it is a flat plane, and beyond that it is a hyperplane you cannot picture but can still compute with.
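
To make the scoring rule concrete, here is a minimal sketch in plain NumPy. The weights, the bias, and the `predict` helper are made-up values and names for illustration only, not something a fitted model produced.

```python
import numpy as np

# Hypothetical weights and bias for a two-feature problem (made-up values).
w = np.array([2.0, -1.0])
b = -0.5

def predict(x):
    """Return +1 if the raw score w.x + b is positive, otherwise -1."""
    raw_score = w @ x + b
    return 1 if raw_score > 0 else -1

print(predict(np.array([1.0, 0.5])))  # raw score: 2.0 - 0.5 - 0.5 = 1.0
print(predict(np.array([0.0, 1.0])))  # raw score: -1.0 - 0.5 = -1.5
```

The decision boundary for these values is the set of points where 2.0 * x1 - 1.0 * x2 - 0.5 equals zero, which is a straight line in the plane.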

Everything the model has learned lives in two places: the weight vector and the intercept. In scikit-learn these are coef_ and intercept_, and they are worth reading directly. A large positive weight means the feature pushes predictions toward the positive class, a large negative weight pushes the other way, and a weight near zero means the feature barely matters. That transparency is one of the main reasons linear models survive in regulated settings where you have to explain a decision, not just make it.
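
One way to convince yourself that nothing else is hidden inside the model is to rebuild its raw scores from coef_ and intercept_ and compare them with decision_function. The sketch below assumes a synthetic dataset from make_classification; the names X_demo, y_demo, and manual_scores are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

print(model.coef_)       # one weight per feature, shape (1, 4) for a binary problem
print(model.intercept_)  # the bias term, shape (1,)

# Rebuild the raw scores by hand: weights times features, plus the intercept.
manual_scores = X_demo @ model.coef_.ravel() + model.intercept_[0]
sklearn_scores = model.decision_function(X_demo)
print(np.allclose(manual_scores, sklearn_scores))
```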

Scikit-learn gives you three linear classifiers worth knowing. LogisticRegression is the one to reach for when you need probabilities, because it is built to output probability estimates rather than bare labels. LinearSVC chases the widest possible margin between the classes and is fast on large datasets, but it does not give you probabilities. SVC(kernel='linear') solves essentially the same problem as LinearSVC but more slowly, earning its keep only when you want access to the support vectors. The differences between them come down almost entirely to one choice, which is the loss function each one minimizes.
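
The probability difference is easy to check for yourself. This is a small sketch on synthetic data, with X_demo, y_demo, log_reg, and lin_svc as illustrative names: LogisticRegression exposes predict_proba, while LinearSVC only offers the signed distance from the boundary through decision_function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

log_reg = LogisticRegression().fit(X_demo, y_demo)
lin_svc = LinearSVC().fit(X_demo, y_demo)

# Logistic regression returns one probability per class; each row sums to 1.
probabilities = log_reg.predict_proba(X_demo[:3])
print(probabilities)

# LinearSVC has no predict_proba method, only raw scores.
print(hasattr(lin_svc, "predict_proba"))
print(lin_svc.decision_function(X_demo[:3]))
```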

The encouraging part is that they all behave identically from your side of the keyboard. Create the model, call fit, call score. The same three steps apply to non-linear models like k-nearest neighbours too, which makes swapping models a one-line experiment rather than a rewrite.

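from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# X_train, y_train, X_test, y_test are assumed to be defined already.
# SVC defaults to an RBF kernel, so ask for the linear one explicitly.
models = [LogisticRegression(), LinearSVC(), SVC(kernel="linear"), KNeighborsClassifier()]
for model in models:
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))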

KNN is the odd one out in that list. It is not linear at all. Instead of learning weights, it simply stores the training data and, when asked about a new point, finds its nearest neighbours and takes a majority vote. That gives it wandering, irregular boundaries rather than a clean line, and it is included here only as a contrast to what the linear models are doing.

Training is just minimizing a loss

To train a linear classifier you need a way to measure how wrong it is, and that is what a loss function does. It assigns a penalty to each prediction based on how far off it was, and training means searching for the weights that make the total penalty across all your examples as small as possible. Linear regression uses squared error, which punishes big misses far more harshly than small ones. Classifiers use something better suited to yes-or-no answers.

The two losses that matter for classification are logistic loss and hinge loss, and the difference between them explains the personalities of the two model families. Logistic loss, the one logistic regression uses, never quite stops complaining. Even a confident, correct prediction still earns a tiny penalty, and that gentle pressure is exactly what keeps the model’s probability estimates meaningful. Hinge loss, the one support vector machines use, is more relaxed. Once a prediction is correct by a comfortable margin, the penalty drops to exactly zero and the model stops caring about that example entirely. That is why an SVM ends up focusing all its attention on the awkward points near the boundary and ignoring the easy ones far from it.
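
The two personalities show up clearly if you write both losses as functions of the margin, meaning the raw score multiplied by the true label (plus one or minus one). This is a minimal NumPy sketch; the function names and the sample margins are chosen here for illustration.

```python
import numpy as np

def logistic_loss(margin):
    """Logistic loss: always positive, shrinking smoothly as the margin grows."""
    return np.log(1 + np.exp(-margin))

def hinge_loss(margin):
    """Hinge loss: exactly zero once the margin reaches 1."""
    return np.maximum(0, 1 - margin)

# margin = label * raw score; negative means a wrong prediction.
margins = np.array([-2.0, 0.0, 1.0, 3.0])
print(logistic_loss(margins))
print(hinge_loss(margins))
```

At a margin of 3 the hinge loss is exactly zero while the logistic loss is still a small positive number, which is the "never quite stops complaining" behaviour described above.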

You can see the whole mechanism by rebuilding logistic regression by hand and checking it against scikit-learn. The function below loops over every example, computes the raw score, and accumulates the logistic loss; the total is then handed to an optimizer that searches for the best weights. The one subtlety is multiplying the score by the label. Because the label is plus one for positives and minus one for negatives, that flip makes a single formula penalize wrong answers correctly for both classes at once.

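import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

# X is the feature matrix and y holds labels encoded as +1 / -1 (both assumed defined).
def logistic_loss(w):
    total = 0
    for i in range(len(y)):
        raw_score = w @ X[i]
        total = total + np.log(1 + np.exp(-raw_score * y[i]))
    return total

# X[0] only serves as an initial guess with the right shape.
w_fit = minimize(logistic_loss, X[0]).x
print(w_fit)

# With C set very large, regularization is negligible and coef_ should be close to w_fit.
lr = LogisticRegression(fit_intercept=False, C=1000000).fit(X, y)
print(lr.coef_)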

Watching those two sets of coefficients come out nearly identical is the moment the abstraction stops being mysterious. Scikit-learn is running a far better optimizer than this toy loop, but it is solving the same problem you just wrote down.

Regularization and the parameter that confuses everyone

Left to its own devices, a flexible model will happily memorize the noise in your training data and then fall apart on anything new. Regularization is the cure. It adds a penalty for large coefficients, which pulls the model back toward something simpler and more general. In scikit-learn this is controlled by C, and C trips up almost everyone the first time because it works backwards from intuition. A large C means weak regularization and a more complex model that risks overfitting. A small C means strong regularization and a simpler model that risks underfitting. C is the inverse of regularization strength, and once that clicks the rest follows.

You can watch the tradeoff happen by sweeping C across several orders of magnitude and plotting training error against validation error. At the strongly regularized end, both errors are high because the model is too simple to capture the pattern. As you loosen regularization, training error keeps falling because the model grows more flexible, but validation error eventually turns back upward as it begins fitting noise. The best C is the one sitting at the bottom of that validation curve, and finding it is the whole game.

import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# X_train, y_train, X_valid, y_valid are assumed to be defined already.
train_errs, valid_errs = [], []
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
for C_value in C_values:
    lr = LogisticRegression(C=C_value).fit(X_train, y_train)
    train_errs.append(1.0 - lr.score(X_train, y_train))
    valid_errs.append(1.0 - lr.score(X_valid, y_valid))

plt.semilogx(C_values, train_errs, C_values, valid_errs)
plt.legend(("train", "validation"))
plt.show()

There is also a deeper choice about what kind of penalty to apply. L2 regularization, the default, shrinks every coefficient a little but almost never pushes any of them all the way to zero. L1 regularization behaves differently in kind rather than degree, because it can drive some coefficients to exactly zero, which removes those features from the model completely. That makes L1 a form of automatic feature selection. Switch the penalty to 'l1', pair it with a solver that supports it, such as liblinear, and after fitting you can count the non-zero coefficients to see how many features actually earned their place. Everything that got zeroed out was judged irrelevant and quietly dropped. Reach for L1 when you suspect many of your features are noise, and stick with L2 when you believe most of them carry real signal.
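
Here is a rough sketch of that counting exercise. It assumes synthetic data in which only 3 of 30 features are informative, and the value C=0.05 is picked purely for illustration; the exact counts on your own data will differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 30 features, of which only 3 carry signal.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=3, n_redundant=0, random_state=0
)

# Same solver and same C, so the only difference is the penalty type.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_demo, y_demo)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X_demo, y_demo)

n_l1 = np.count_nonzero(l1_model.coef_)
n_l2 = np.count_nonzero(l2_model.coef_)
print("Non-zero coefficients with L1:", n_l1)
print("Non-zero coefficients with L2:", n_l2)
```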

Regularization does one more thing that often surprises people. It does not only control accuracy, it controls confidence. Strong regularization keeps the coefficients small, and small coefficients produce probabilities that hug 0.5 rather than racing to the extremes. So if your model is making wildly overconfident predictions, tightening regularization will calm it down as a side effect of simplifying it.
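
The effect on confidence can be measured by looking at the largest predicted probability for each example. The sketch below uses synthetic data and two arbitrary values of C; mean_confidence is a helper invented for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

strong_reg = LogisticRegression(C=0.001, max_iter=1000).fit(X_demo, y_demo)
weak_reg = LogisticRegression(C=100, max_iter=1000).fit(X_demo, y_demo)

def mean_confidence(model):
    """Average of the highest class probability across all examples."""
    return model.predict_proba(X_demo).max(axis=1).mean()

print("Strong regularization (C=0.001):", mean_confidence(strong_reg))
print("Weak regularization (C=100):", mean_confidence(weak_reg))
```

The strongly regularized model should report an average confidence much closer to 0.5, because its smaller coefficients produce raw scores closer to zero.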

From two classes to many

Everything so far assumes two classes, but real problems often have more. There are two ways linear models handle this. The first, one-vs-rest, trains a separate binary classifier for each class, each one answering the question “is this that class or not?” At prediction time every classifier scores the example and the most confident one wins. It is simple and usually works, and it is what LinearSVC does by default, but it can leave ambiguous regions where no single classifier is confident, so the result there depends on whichever one happens to be least uncertain.

The alternative, softmax or multinomial logistic regression, trains one joint model that considers all the classes together and forces their probabilities to sum to one. Recent versions of scikit-learn use it by default when LogisticRegression is given more than two classes with its default solver. Because it reasons about the whole picture at once rather than each class in isolation, it tends to produce tighter, better-calibrated boundaries when the classes overlap. The tradeoff is speed, since one-vs-rest can be faster when there are a great many classes. As a rule of thumb, prefer multinomial when classes overlap and you want robustness, and lean on one-vs-rest when you have a large number of classes and a simpler problem.
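
A small sketch of the two strategies side by side, using the three-class iris dataset. It assumes a recent scikit-learn, where a plain LogisticRegression fits the multinomial model for multiclass targets; one-vs-rest is requested explicitly through the OneVsRestClassifier wrapper.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# One-vs-rest: one independent binary classifier per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)

# Multinomial (softmax): a single joint model over all classes.
softmax = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

print(len(ovr.estimators_))  # number of binary classifiers, one per class
print(softmax.coef_.shape)   # one row of weights per class
print(np.allclose(softmax.predict_proba(X_demo).sum(axis=1), 1.0))
print("OvR training accuracy:", ovr.score(X_demo, y_demo))
print("Softmax training accuracy:", softmax.score(X_demo, y_demo))
```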

Support vector machines and the points that matter

Support vector machines deserve their own moment because of one elegant idea. An SVM does not treat all your training data as equally important. It searches for the boundary that sits as far as possible from the nearest points of either class, maximizing the margin between them, and the only examples that actually shape that boundary are the handful sitting closest to it (along with any that fall inside the margin or on the wrong side). Those are the support vectors. You could delete every other point in the dataset, retrain, and get the same boundary back. That is what makes SVMs both efficient and robust to outliers that sit far from the dividing line on their own side of it, since those distant points have no say at all.
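
You can try the delete-and-retrain experiment directly. This is a sketch on two synthetic Gaussian clusters, with illustrative names throughout; the two fits should agree up to the solver's numerical tolerance rather than bit for bit.

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic clusters, used only for illustration.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(-2, 1, size=(100, 2)),
    rng.normal(2, 1, size=(100, 2)),
])
y_demo = np.array([0] * 100 + [1] * 100)

# Fit on everything, then refit on the support vectors alone.
full = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)
sv_idx = full.support_  # indices of the support vectors in the training set
reduced = SVC(kernel="linear", C=1.0).fit(X_demo[sv_idx], y_demo[sv_idx])

print(len(sv_idx), "support vectors out of", len(X_demo), "points")
print("Full fit:   ", full.coef_, full.intercept_)
print("Reduced fit:", reduced.coef_, reduced.intercept_)

# Fraction of points on which the two models make the same prediction.
agreement = np.mean(full.predict(X_demo) == reduced.predict(X_demo))
print("Prediction agreement:", agreement)
```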

A linear SVM draws a straight boundary, but the real power arrives with kernels. The RBF kernel lets the model bend its boundary into smooth curves by implicitly mapping the data into a higher-dimensional space, which is what you want when the classes are not linearly separable. It introduces a second knob called gamma that controls how far the influence of each training point reaches. A large gamma makes each point’s influence local, producing a wiggly boundary that can overfit, while a small gamma spreads influence widely and yields a smoother boundary. Tuning C and gamma together, usually with a grid search and cross-validation, is the standard recipe for getting a kernel SVM to perform.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X_train, y_train, X_test, y_test are assumed to be defined already.
parameters = {'C': [0.1, 1, 10],
              'gamma': [0.00001, 0.0001, 0.001, 0.01, 0.1]}
searcher = GridSearchCV(SVC(), parameters)
searcher.fit(X_train, y_train)

print("Best CV params", searcher.best_params_)
# Score the refit best model on the held-out test set, once.
print("Test accuracy:", searcher.score(X_test, y_test))

The discipline that matters here is touching the test set only once, at the very end. Cross-validation on the training data picks the parameters, and the held-out test set is the single honest measurement of how the chosen model will behave on data it has never seen.

One classifier to rule them all

There is a tidy way to avoid choosing between logistic regression and a linear SVM in advance. SGDClassifier fits linear models using stochastic gradient descent, and by changing its loss parameter it becomes one model or the other. Set loss='log_loss' and it is logistic regression, set loss='hinge' and it is a linear SVM. Put both options into a grid search and you let cross-validation decide not just the best hyperparameter values but the best model type itself, all in a single fair competition. The only quirk to remember is that SGDClassifier uses alpha for regularization, and alpha runs the opposite way to C, so larger alpha means stronger regularization.

Where to go from here

Linear classifiers reward you for understanding them rather than treating them as black boxes. Read the coefficients to learn what the model thinks matters, tune C deliberately rather than leaving it at the default, choose L1 or L2 based on whether you expect noise or signal, and reach for an RBF kernel only when a straight line genuinely will not do. Master those decisions and you will get more mileage out of logistic regression and SVMs than most practitioners get out of far heavier machinery.

See you soon.
