Hyperparameter Tuning in Python: Full Guide

A trained model is rarely the best version of itself on the first try. Its parameters, the coefficients and tree splits, are learned automatically from data, but its hyperparameters, the learning rate, the tree depth, the number of neighbours, are settings you choose before training even begins. Get those settings right and a mediocre model often becomes a strong one. Hyperparameter tuning is the systematic search for the best combination of those settings, and there is a clear progression of strategies from crude to sophisticated: manual loops, grid search, random search, coarse-to-fine refinement, Bayesian optimisation, and genetic algorithms. This article walks the whole ladder, with the trade-offs that tell you which rung to stand on.

Looking Inside a Trained Model First

Before tuning, it helps to understand what a model has actually learned, because that often guides where to focus the search. For logistic regression, the coefficients tell you the direction and size of each feature’s effect.

			
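import pandas as pd
# logreg is assumed to be an already fitted LogisticRegression
feature_names = list(X_train.columns)
coefficients = logreg.coef_[0]   # single row of weights for binary classification
coef_df = pd.DataFrame({"Variable": feature_names, "Coefficient": coefficients})
# Three features with the largest positive coefficients
top_three = coef_df.sort_values(by="Coefficient", ascending=False).head(3)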

Logistic regression assigns a weight to each feature and combines them into a single score, like a panel of judges where each feature votes positively or negatively and its coefficient is the loudness of its vote. A large positive coefficient pushes predictions toward class one, a large negative one pushes toward class zero. The [0] after coef_ is just unpacking, because sklearn stores coefficients in a two-dimensional array with one row per class, and binary classification has a single row.
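
The voting picture can be made concrete in a few lines of plain Python. This is a minimal sketch rather than sklearn's implementation: the feature names, coefficients, and intercept are made-up values for illustration, and predict_probability is a hypothetical helper.

```python
import math

# Hypothetical coefficients and intercept, invented for illustration only.
coefficients = {"age": 0.8, "balance": -1.5, "tenure": 0.3}
intercept = -0.2


def predict_probability(features):
    """Sum the weighted feature votes into one score, then squash it into (0, 1)."""
    score = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))  # logistic (sigmoid) function


customer = {"age": 1.0, "balance": 0.5, "tenure": 2.0}
probability = predict_probability(customer)
predicted_class = int(probability >= 0.5)  # class one when the probability reaches 0.5
print(round(probability, 3), predicted_class)
```

The negative weight on balance pulls the score down, but the two positive votes outweigh it, so this invented customer lands on class one.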

A random forest is harder to read because it is a whole collection of trees, but you can reach into any single one.

			
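# forest is assumed to be an already fitted RandomForestClassifier
chosen_tree = forest.estimators_[6]             # the seventh tree in the forest
split_index = chosen_tree.tree_.feature[0]      # index of the feature tested at the root
split_name = X_train.columns[split_index]
split_value = chosen_tree.tree_.threshold[0]    # threshold tested at the root
print(f"Split on {split_name} at value {split_value}")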

		

The estimators_ attribute is the list of fitted trees, and inside each tree is a record of every decision it made. Reading tree_.feature[0] and tree_.threshold[0] reveals the very first question that particular tree asks at its root. It is like pulling one expert from a panel of a hundred and asking what they look at first, and different trees give different answers because each was trained on a random slice of the data, which is exactly the diversity that makes the forest robust.

Manual Search: Loops and Curves

The simplest tuning is a loop over candidate values. Here is a sweep of learning rates for a gradient boosting model.

			
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
learning_rates = [0.001, 0.01, 0.05, 0.1, 0.2, 0.5]
results = []
for rate in learning_rates:
    model = GradientBoostingClassifier(learning_rate=rate)
    predictions = model.fit(X_train, y_train).predict(X_test)
    results.append([rate, accuracy_score(y_test, predictions)])
results_df = pd.DataFrame(results, columns=["learning_rate", "accuracy"])

		

The learning rate controls how big a step the model takes when correcting its mistakes. Too small and it barely learns from each correction, like nudging a thermostat by a hundredth of a degree at a time. Too large and it overcorrects wildly. The chained .fit().predict() works because .fit() returns the model itself, so you can immediately predict on it. The loop produces a table, but turning that table into a curve reveals the shape of the relationship far better.
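
The too-small and too-large behaviour is easy to see on a toy problem. The sketch below is an analogy, not gradient boosting itself: it runs plain gradient descent on f(x) = x**2, where the step size plays the role of the learning rate. The descend helper and the three step sizes are assumptions chosen for illustration.

```python
def descend(step_size, start=10.0, steps=20):
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x              # derivative of x**2
        x = x - step_size * gradient
    return x


too_small = descend(0.001)   # barely moves away from the starting point
reasonable = descend(0.1)    # ends close to the minimum at 0
too_large = descend(1.1)     # overshoots further on every step

for name, value in [("too small", too_small), ("reasonable", reasonable), ("too large", too_large)]:
    print(f"{name:>10}: final x = {value:.3f}")
```

After twenty steps the smallest rate is still near the start, the middle one has almost reached the minimum, and the largest has moved further away than where it began.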

			
import numpy as np
import matplotlib.pyplot as plt
rates = np.linspace(0.01, 2, num=30)
accuracies = []
for rate in rates:
    model = GradientBoostingClassifier(learning_rate=rate)
    predictions = model.fit(X_train, y_train).predict(X_test)
    accuracies.append(accuracy_score(y_test, predictions))
plt.plot(rates, accuracies)
plt.gca().set(xlabel="learning_rate", ylabel="Accuracy", title="Accuracy vs learning rate")
plt.show()

		

np.linspace(0.01, 2, num=30) lays down thirty evenly spaced tick marks across the range instead of six arbitrary ones, and plotting accuracy against them draws a smooth curve. A picture instantly shows whether accuracy rises then falls toward a sweet spot, plateaus after some point, or collapses at high rates, in a way a table of numbers never can.

The same manual approach scales to comparing a few discrete options, such as the neighbour count in KNN, where you simply train three models and print their scores side by side. A small neighbour count is sensitive to local noise while a large one smooths the boundaries, and the comparison tells you which suits the data. When the comparisons get repetitive, you wrap the build-train-predict-score steps in a function and call it from nested loops, one loop per hyperparameter.

			
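# Candidate values: five learning rates and four depths (illustrative choices)
learn_rate_list = [0.001, 0.01, 0.1, 0.2, 0.5]
max_depth_list = [2, 4, 6, 8]

def gbm_grid_search(learn_rate, max_depth):
    # Train one model and return its settings together with its accuracy
    model = GradientBoostingClassifier(learning_rate=learn_rate, max_depth=max_depth)
    predictions = model.fit(X_train, y_train).predict(X_test)
    return [learn_rate, max_depth, accuracy_score(y_test, predictions)]

results = []
for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        results.append(gbm_grid_search(learn_rate, max_depth))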

		

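This is grid search done by hand, the exact logic that the next tool automates. With five learning rates and four depths you get twenty function calls, each returning one row of parameters and accuracy. One caveat: the function scores every combination on X_test, which is convenient for a first look but lets the test set steer the choice. Cross-validation on the training data, covered next, avoids that.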

GridSearchCV: Exhaustive and Automated

GridSearchCV tests every combination in a grid, but with two upgrades over the manual version: it uses cross-validation rather than a single train-test split, and it handles all the bookkeeping.

			
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rf = RandomForestClassifier(criterion="entropy")
param_grid = {
    "max_depth": [2, 4, 8, 15],
    "max_features": ["sqrt", "log2"]
}
grid_rf = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring="roc_auc",
    n_jobs=4,
    cv=5,
    refit=True,
    return_train_score=True
)
grid_rf.fit(X_train, y_train)

		

This grid has four depths times two feature strategies, which is eight combinations, and with five-fold cross-validation that means forty model fits. Think of it as a competition where every entry is scored by five independent judges rather than one, which is a much fairer comparison than a single lucky or unlucky split. The scoring argument defines what “best” means, n_jobs=4 lets four fits run in parallel, and refit=True retrains the winning combination on the full training set so it is ready to predict. A small but important note on max_features: the old 'auto' value was deprecated and has since been removed in recent sklearn, so use 'sqrt' or 'log2'.
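
The fit count can be checked without training anything. The sketch below enumerates the same grid with itertools.product from the standard library. It is only a counting aid, not how GridSearchCV is implemented, and the variable names are assumptions.

```python
from itertools import product

param_grid = {
    "max_depth": [2, 4, 8, 15],
    "max_features": ["sqrt", "log2"],
}
cv_folds = 5

# Every combination of the listed values, as a list of dictionaries.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_fits = len(combinations) * cv_folds
total_fits = cv_fits + 1  # refit=True adds one final fit on the full training set

print(len(combinations), cv_fits, total_fits)
```

This prints 8, 40 and 41: the forty cross-validation fits from the text, plus the single refit of the winner.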

Everything the search learned lands in cv_results_, which becomes a tidy table once wrapped in a DataFrame, with one row per combination and columns for timings, train and test scores, and a rank. You can filter it for the row where rank_test_score equals one to find the winner, but sklearn also exposes direct shortcuts: best_score_ is the winning mean CV score, best_params_ is the dictionary of winning settings, and best_index_ is the winner’s row number. The fitted best model is available directly for prediction.

			
from sklearn.metrics import confusion_matrix, roc_auc_score
predictions = grid_rf.best_estimator_.predict(X_test)
print(confusion_matrix(y_test, predictions))
proba = grid_rf.best_estimator_.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, proba))

		

Because refit=True was set, best_estimator_ is already retrained on all the training data and behaves like any fitted model. The [:, 1] slice matters for ROC-AUC, because predict_proba returns two columns and the metric needs the positive-class probability, which is the second one.
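
To see why hard labels hurt the metric, here is a small dependency-free sketch. It uses a hand-rolled pairwise_auc helper (hypothetical, standing in for roc_auc_score) that relies on the fact that AUC is the share of positive-negative pairs ranked in the right order, with ties counting half. The labels and probabilities are invented.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1, 1]
proba = [0.20, 0.45, 0.40, 0.48, 0.90]          # made-up positive-class probabilities
hard_labels = [int(p >= 0.5) for p in proba]    # what a 0.5 threshold would turn them into

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(round(auc_from_proba, 3), round(auc_from_labels, 3))
```

Thresholding collapses most of the scores into ties, so the ranking information the metric depends on is lost and the two numbers differ even though the underlying model is the same.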

RandomizedSearchCV: Smarter Coverage of Large Spaces

When the grid is large or continuous, testing every combination is hopeless, and RandomizedSearchCV samples a fixed number of random combinations instead.

			
from sklearn.model_selection import RandomizedSearchCV
param_grid = {
    "learning_rate": np.linspace(0.1, 2, 150),
    "min_samples_leaf": list(range(20, 65))
}
random_gbm = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(),
    param_distributions=param_grid,
    n_iter=10,
    scoring="accuracy",
    n_jobs=4,
    cv=5,
    refit=True,
    return_train_score=True
)
random_gbm.fit(X_train, y_train)

		

This space holds 150 learning rates times 45 leaf sizes, which is 6,750 combinations, and exhausting it with five-fold CV would mean over thirty thousand fits. With n_iter=10 the search samples ten combinations for fifty fits total. The reason this works surprisingly well is that good regions of a hyperparameter space tend to be broad rather than needle-thin, so a handful of random samples usually lands near a strong region. Note the argument is param_distributions, not param_grid, signalling that these values are sampled rather than enumerated, and inspecting cv_results_['param_learning_rate'] shows exactly which values were drawn.

The headline comparison between the two automated searches: grid search is exhaustive but slow and best for small, well-understood grids, while random search is fast, controlled by n_iter, and far better suited to large or continuous spaces, at the small risk of missing the true optimum.

If you want full control over the sampling, you can build the combinations yourself with itertools.product to generate the full Cartesian product, then draw a random subset with np.random.choice(..., replace=False) or, more simply, random.sample. This is useful when you need custom constraints but still want a random subset to feed a manual loop.
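
A minimal sketch of that manual sampling, using only the standard library. The grid mirrors the 150 by 45 space from the RandomizedSearchCV example. The constraint on which pairs are allowed is a made-up rule for illustration, and the variable names are assumptions.

```python
import random
from itertools import product

# 150 evenly spaced learning rates from 0.1 to 2, and 45 leaf sizes from 20 to 64.
learning_rates = [0.1 + i * (2 - 0.1) / 149 for i in range(150)]
min_samples_leaves = list(range(20, 65))

# Full Cartesian product of the two lists.
all_combinations = list(product(learning_rates, min_samples_leaves))

# Hypothetical custom constraint: high learning rates only with larger leaves.
allowed = [(lr, leaf) for lr, leaf in all_combinations if lr <= 1.0 or leaf >= 40]

random.seed(42)                        # fixed seed so the subset is repeatable
sampled = random.sample(allowed, 10)   # ten distinct combinations, drawn without replacement

print(len(all_combinations), len(allowed), len(sampled))
```

Each tuple in sampled can then be passed to a manual loop such as the gbm_grid_search function shown earlier.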

Coarse to Fine: Two Zoom Levels

A smart hybrid runs a wide, coarse random search first to find promising regions, then a narrow, dense search to refine within them. After the wide pass, you sort results by accuracy and plot each hyperparameter against the score to see which ones actually matter.

			
results_df.sort_values(by="accuracy", ascending=False).head(10)
max_depth_list = list(range(1, 21))          # narrowed from the wide search
learn_rate_list = np.linspace(0.001, 1, 50)   # narrowed from the wide search

The analogy is radar then sniper scope. The first stage sweeps a wide angle to orient you toward the good region, and the visualisations are the map you draw from that sweep: if accuracy collapses above a learning rate of one, or depth barely matters, you learn that here. The second stage zooms into the promising region with a finer search, replacing a range of one to a hundred with one to twenty because the coarse pass showed nothing useful happens beyond that. A flat scatter for a parameter is its own lesson: that parameter is not worth fine-tuning.
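
The two zoom levels can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: toy_accuracy is an invented score surface standing in for a real model fit, random_search is a hypothetical helper, and narrowing to the box spanned by the top five results is just one simple rule among many.

```python
import random

random.seed(0)


def toy_accuracy(learning_rate, max_depth):
    """Made-up score surface that peaks near learning_rate=0.3, max_depth=6."""
    return 0.9 - (learning_rate - 0.3) ** 2 - 0.001 * (max_depth - 6) ** 2


def random_search(rate_range, depth_range, n_trials):
    """Sample n_trials random points and return (score, rate, depth) sorted best-first."""
    trials = []
    for _ in range(n_trials):
        rate = random.uniform(*rate_range)
        depth = random.randint(*depth_range)
        trials.append((toy_accuracy(rate, depth), rate, depth))
    return sorted(trials, reverse=True)


# Stage 1: wide, sparse sweep over the whole space.
coarse = random_search((0.001, 2.0), (1, 100), n_trials=30)
top = coarse[:5]

# Stage 2: narrow both ranges to the box spanned by the top coarse results.
rate_range = (min(t[1] for t in top), max(t[1] for t in top))
depth_range = (min(t[2] for t in top), max(t[2] for t in top))
fine = random_search(rate_range, depth_range, n_trials=30)

print("narrowed ranges:", rate_range, depth_range)
print("coarse best:", coarse[0])
print("fine best:  ", fine[0])
```

The second pass spends all thirty trials inside the promising box instead of scattering them across the full range. In real use you would inspect the plots before choosing the narrowed ranges rather than trusting a fixed rule.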

Bayesian Optimisation: A Search With Memory

Random search samples blindly, ignoring everything earlier trials revealed. Bayesian optimisation remembers. After each evaluation it updates a probabilistic model of which regions are likely to perform well, and draws the next sample from the most promising area rather than at random. The Hyperopt library implements this idea through its Tree-structured Parzen Estimator, selected with algo=tpe.suggest.

			
from hyperopt import hp, fmin, tpe
from sklearn.model_selection import cross_val_score
space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 2),
    "learning_rate": hp.uniform("learning_rate", 0.001, 0.9)
}
def objective(params):
    params = {"max_depth": int(params["max_depth"]),
              "learning_rate": params["learning_rate"]}
    model = GradientBoostingClassifier(n_estimators=100, **params)
    score = cross_val_score(model, X_train, y_train, scoring="accuracy", cv=2, n_jobs=4).mean()
    return 1 - score
best = fmin(fn=objective, space=space, max_evals=20,
            rstate=np.random.default_rng(42), algo=tpe.suggest)

		

Think of a treasure hunter keeping a map, marking where they found gold and where they found nothing, and using it to choose the next dig rather than picking a random spot. The objective function is what Hyperopt minimises, which is why it returns 1 - score: minimising one minus accuracy is the same as maximising accuracy. The **params syntax unpacks the dictionary into keyword arguments, so a sampled dictionary becomes the model’s constructor arguments directly, and int() is needed because hp.quniform returns floats. The search space is defined with Hyperopt’s distribution helpers: hp.uniform for continuous ranges like a learning rate, hp.quniform for discrete steps like depth, hp.loguniform for regularisation parameters that span orders of magnitude, and hp.choice for categorical options. In practice, give Bayesian search enough iterations to learn the space, at least fifty rather than the small demo value here.
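
Two of these helpers are easier to use once you see what they compute. The sketch below mimics the sampling formulas given in Hyperopt's documentation using only the standard library. quniform_like and loguniform_like are hypothetical stand-ins written for illustration, not Hyperopt functions.

```python
import math
import random

random.seed(1)


def quniform_like(low, high, q):
    """Mimic the documented hp.quniform formula: round(uniform(low, high) / q) * q."""
    # Hyperopt hands back a float, so this sketch does too.
    return float(round(random.uniform(low, high) / q) * q)


def loguniform_like(low, high):
    """Mimic the documented hp.loguniform formula: exp(uniform(low, high))."""
    return math.exp(random.uniform(low, high))


depths = [quniform_like(2, 10, 2) for _ in range(5)]
print(depths)                      # float multiples of 2 between 2 and 10
print([int(d) for d in depths])    # cast to int before passing to max_depth

# The bounds are given in log space: this covers 1e-4 up to 1.
penalties = [loguniform_like(math.log(1e-4), math.log(1.0)) for _ in range(5)]
print(penalties)
```

The point to take away for hp.loguniform is that its bounds are the logarithms of the range you want, so a range of 0.0001 to 1 is written with math.log(1e-4) and math.log(1.0).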

Genetic Algorithms: Evolving Whole Pipelines

TPOT takes a different approach again, using evolutionary algorithms to search over both the model type and its hyperparameters at once.

			
from tpot import TPOTClassifier
tpot = TPOTClassifier(
    generations=3,
    population_size=4,
    offspring_size=3,
    scoring="accuracy",
    verbosity=2,
    random_state=2,
    cv=2
)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

		

The mechanism mirrors biological evolution. It starts with a population of randomly assembled pipelines, scores each one as a fitness test, lets the strongest reproduce by mixing and mutating their settings to create offspring, and drops the weakest. Repeating that over several generations pushes the population toward pipelines that suit your data. Its distinguishing feature is that it chooses the model family and the individual hyperparameters simultaneously, like selecting both the species and its traits at once. The cost is very long runtimes, and because it is inherently random, a different random_state often finds a different best model, so treat a short run as a starting point rather than a final verdict.
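
The evolutionary loop itself fits in a short toy example. This sketch is not TPOT's implementation and is much simpler: it evolves only two hyperparameters rather than whole pipelines, and the fitness function is an invented score surface instead of a cross-validated model. All function names are assumptions.

```python
import random

random.seed(3)


def fitness(individual):
    """Made-up fitness, higher is better, best near learning_rate=0.2 and max_depth=5."""
    return -((individual["learning_rate"] - 0.2) ** 2) - 0.01 * (individual["max_depth"] - 5) ** 2


def random_individual():
    """One randomly assembled candidate."""
    return {"learning_rate": random.uniform(0.01, 1.0), "max_depth": random.randint(1, 15)}


def crossover(parent_a, parent_b):
    """The child takes each setting from one parent chosen at random."""
    return {key: random.choice([parent_a[key], parent_b[key]]) for key in parent_a}


def mutate(individual):
    """Nudge each setting slightly, keeping it inside its valid range."""
    return {
        "learning_rate": min(1.0, max(0.01, individual["learning_rate"] + random.gauss(0, 0.05))),
        "max_depth": min(15, max(1, individual["max_depth"] + random.choice([-1, 0, 1]))),
    }


population = [random_individual() for _ in range(8)]
initial_best = max(fitness(ind) for ind in population)

for generation in range(10):
    survivors = sorted(population, key=fitness, reverse=True)[:4]   # drop the weakest half
    offspring = [mutate(crossover(*random.sample(survivors, 2))) for _ in range(4)]
    population = survivors + offspring

final_best = max(population, key=fitness)
print(final_best, fitness(final_best))
```

Because the survivors are carried over unchanged, the best fitness can never get worse from one generation to the next. Changing the seed changes the path the population takes, which is the same sensitivity to random_state described above.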

The Bayesian Idea Underneath

The intelligence in Bayesian optimisation rests on Bayes’ rule, which updates a belief given new evidence. The posterior probability equals the likelihood times the prior, divided by the probability of the evidence. A quick worked example makes it concrete.

			
p_unhappy = 0.15          # P(unhappy)
p_unhappy_given_close = 0.35   # P(unhappy | closes)
p_close = 0.07            # P(closes)
p_close_given_unhappy = (p_unhappy_given_close * p_close) / p_unhappy
print(p_close_given_unhappy)   # about 0.163

		

You cannot directly observe whether a specific unhappy customer will leave, but you can measure three population facts: fifteen percent of customers are unhappy, seven percent eventually close, and of those who close, thirty-five percent were unhappy. Bayes’ rule inverts this to give the probability of closing given unhappiness, about sixteen percent. Bayesian optimisation applies the same logic to tuning, updating its belief about which parameter regions are good based on the evidence from past trials.
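
Written as an equation, with the three measured quantities from the example substituted in:

```latex
P(\text{close} \mid \text{unhappy})
  = \frac{P(\text{unhappy} \mid \text{close}) \, P(\text{close})}{P(\text{unhappy})}
  = \frac{0.35 \times 0.07}{0.15}
  \approx 0.163
```

The numerator is the likelihood times the prior, and the denominator is the probability of the evidence, matching the verbal statement of the rule.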

Choosing a Method

The methods form a ladder from simple to sophisticated. A manual loop or learning curve is fast and ideal for a quick first look at a single hyperparameter. GridSearchCV gives exhaustive, cross-validated coverage of small grids. RandomizedSearchCV trades exhaustiveness for speed on large or continuous spaces. Coarse-to-fine suits the case where you can interpret visual results and iterate. Bayesian optimisation earns its keep when each model fit is expensive and you want the search itself to be intelligent. And TPOT is for when you do not even know which model to use, since it searches the model family and the hyperparameters together, at the price of long runtimes.

The Pitfalls That Recur

A few mistakes account for most tuning errors. Evaluating hyperparameters on the test set leaks information, so always tune with cross-validation on the training data alone. Using max_features='auto' breaks on recent sklearn, so use 'sqrt' or 'log2'. Setting refit=False leaves best_estimator_ unavailable for prediction, so keep the default refit=True unless you plan to retrain the winner yourself. Scoring ROC-AUC with predict instead of predict_proba(X_test)[:, 1] runs without error but gives a misleading score, because the metric needs probabilities to rank cases rather than hard labels. Calling random.sample without importing random raises a NameError. And running Bayesian search or TPOT with too few iterations or generations gives unstable results, because both need room to learn before their intelligence pays off.

Code-along

I prepared a code-along article where you can copy the code directly: https://datalad.co.uk/hyperparameter-tuning-start-to-finish-a-code-along/

Conclusion

Hyperparameter tuning is a search, and the art is matching the search strategy to the situation. Start by understanding your model through its coefficients or trees, then explore single parameters with loops and curves. Reach for GridSearchCV when the grid is small and you want exhaustive, cross-validated coverage, and RandomizedSearchCV when the space is large or continuous. Use coarse-to-fine to spend compute intelligently, Bayesian optimisation when each fit is costly and you want a search with memory, and TPOT when you want the machine to choose the model itself. Throughout, keep tuning strictly on the training data with cross-validation, because the entire point is to find settings that will generalise, not ones that happen to flatter your test set.
