June 9, 2026

12 min read

ML Engineering

Model Validation in Machine Learning

A good model generalizes to new data instead of memorizing training examples. Model validation techniques detect overfitting, guide systematic tuning, and produce reliable performance estimates.

A model that scores perfectly on training data but fails on new data is not a good model. It has memorized examples rather than learning patterns that generalize. Model validation is the set of techniques that catch this problem, measure its severity, and guide you toward a model that performs reliably in the real world.

The validation pipeline follows a consistent sequence: split the data so the test set is sealed away; fit the model on training data; evaluate predictions against held-out data; diagnose whether errors reflect overfitting or underfitting; cross-validate for a more reliable estimate; and tune hyperparameters systematically to find the best configuration.

Seen vs Unseen Data

Almost every model performs better on data it was trained on than on data it has never seen. The gap between training and test performance is the core measurement of overfitting. A model that memorized specific examples rather than learning general patterns will show a large gap; a model that learned real signal will show a small one.

			
from sklearn.metrics import mean_absolute_error
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
test_predictions  = model.predict(X_test)
train_error = mean_absolute_error(y_train, train_predictions)
test_error  = mean_absolute_error(y_test,  test_predictions)
print("Error on training data: {0:.2f}".format(train_error))
print("Error on unseen data:   {0:.2f}".format(test_error))

		

The training error will almost always be lower. What you are watching for is the size of the gap. Think of it like studying for an exam using the exact questions that will appear versus studying the subject itself. A student who memorized the answer key will score perfectly on those specific questions but fail the moment the questions change. Training error is the practice-question score; test error is the real exam.
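The gap can be reproduced end to end on synthetic data. The sketch below is illustrative only: the dataset comes from scikit-learn's make_regression, and the sizes, noise level, and seed are assumptions rather than values from the dataset used elsewhere in this article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumed sizes and noise level)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=1111)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

model = RandomForestRegressor(n_estimators=50, random_state=1111)
model.fit(X_train, y_train)

train_error = mean_absolute_error(y_train, model.predict(X_train))
test_error = mean_absolute_error(y_test, model.predict(X_test))
gap = test_error - train_error  # the overfitting signal

print("Train MAE: {0:.2f}".format(train_error))
print("Test MAE:  {0:.2f}".format(test_error))
print("Gap:       {0:.2f}".format(gap))
```

The absolute numbers mean little here; the point is that the two errors are computed the same way on different rows, so the difference between them is attributable to the model having seen one set and not the other.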

Splitting Data

One holdout set

A single train/test split is sufficient when you need a final performance estimate and are not tuning hyperparameters.

			
from sklearn.model_selection import train_test_split
import pandas as pd
X = pd.get_dummies(match_data.iloc[:, 0:9])
y = match_data.iloc[:, 9]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=1111
)

		

pd.get_dummies converts categorical feature columns into binary numeric columns before passing them to the model. test_size=0.10 reserves 10% as the held-out test set. The random_state argument fixes the shuffle, so the same split is produced every time the code runs and anyone working with the same dataset can reproduce the results.

Two holdout sets

When tuning hyperparameters, a third set is carved out of the training data. The validation set guides tuning decisions; the test set is reserved for the final honest evaluation.

			
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1111
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111
)

		

The first split reserves 20% as the untouched test set. The second takes the remaining 80% and divides it 75/25, yielding roughly 60% train, 20% validation, and 20% test of the original total. Using the test set to guide tuning corrupts the evaluation: you would be optimizing for that specific holdout rather than for general performance.
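The proportions are easy to check on placeholder data. This is a minimal sketch assuming 1,000 rows, a round number chosen only so the percentages are obvious; the arrays are stand-ins, not the article's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 placeholder rows (assumed size) so the proportions are easy to read
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=1111)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=1111)

sizes = {"train": len(X_train), "validation": len(X_val), "test": len(X_test)}
for name, n in sizes.items():
    print("{0}: {1} rows ({2:.0%})".format(name, n, n / len(X)))

# No row may appear in more than one set
no_overlap = (
    set(y_train).isdisjoint(y_val)
    and set(y_train).isdisjoint(y_test)
    and set(y_val).isdisjoint(y_test)
)
print("Sets are disjoint: {}".format(no_overlap))
```

The second test_size is a fraction of the remaining rows, not of the original total, which is why 0.25 rather than 0.20 produces a validation set equal in size to the test set.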

Fitting a Random Forest

Random forests are an ensemble: many decision trees, each trained on a random sample of the rows and restricted to a random subset of features at each split, with the final prediction produced by majority vote (classification) or averaging (regression).

			
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
rfr = RandomForestRegressor(
    n_estimators=100,
    max_depth=6,
    max_features=4,
    random_state=1111
)
rfr.fit(X_train, y_train)
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
rfc.fit(X_train, y_train)

		

n_estimators sets the number of trees. More trees produce a more stable result but take longer to train. max_depth limits how many levels of splits each tree can grow; without this cap, individual trees can perfectly memorize training data, causing overfitting. max_features restricts how many feature columns each tree considers at each split, which introduces diversity and prevents all trees from learning the same patterns.

After training, the model tracks how much it relied on each input feature across all its trees.

			
rfr.fit(X_train, y_train)
for i, importance in enumerate(rfr.feature_importances_):
    print("{0:s}: {1:.2f}".format(X_train.columns[i], importance))

Importances are normalized to sum to 1.0 across all features. A score near 1.0 means that feature accounts for nearly all of the error reduction achieved by the trees' splits; a score near 0 means it barely contributed. Feature importances reflect what was useful to this specific model on this specific dataset. They are a useful starting point for understanding model behavior but not a definitive statement about which variables matter in the real world. A feature with low importance may still carry genuine significance that the model failed to detect given the available data.
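A small synthetic check makes the scores easier to read. In this sketch the column names signal, noise_a, and noise_b are hypothetical, and the target is constructed so that only signal matters; a real dataset will rarely be this clean.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1111)
# Hypothetical feature names; only "signal" actually drives the target
X = pd.DataFrame(rng.normal(size=(400, 3)), columns=["signal", "noise_a", "noise_b"])
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=400)

rfr = RandomForestRegressor(n_estimators=50, random_state=1111)
rfr.fit(X, y)

# Pair each importance with its column name instead of indexing by position
importances = pd.Series(rfr.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).round(2))
print("Sum of importances: {0:.2f}".format(importances.sum()))
```

Wrapping the array in a Series keyed by column name avoids the positional lookup and makes sorting straightforward.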

You can inspect any hyperparameter on a fitted model at any time.

			
print(rfc)
print('Random state: {}'.format(rfc.random_state))
print(rfc.get_params())

get_params() returns a dictionary of all current hyperparameter values. This matters during tuning: if you forget which configuration produced a good result, calling get_params() on the fitted model recovers the exact settings used.

Making Predictions

Regression

predictions = rfr.predict(X_test)

.predict() passes each test observation through the fitted forest, collects a numeric estimate from each tree, averages them, and returns one continuous value per row. The output might be a price, a duration, or a measurement, depending on what the target variable represents.

Classification

			
predictions = rfc.predict(X_test)
print(pd.Series(predictions).value_counts())

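For classification, the forest combines the trees' votes: scikit-learn averages each tree's class probabilities and returns the class with the highest mean, which in most cases matches a simple majority vote. .predict() returns one label per row. .value_counts() tallies the predicted labels as a quick sanity check. If a binary classifier predicts only one class for every row, something has gone wrong.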

Predicted probabilities

			
prob_predictions = rfc.predict_proba(X_test)
print(prob_predictions[0])

.predict_proba() exposes the voting breakdown as probabilities. In scikit-learn these are the per-tree class probabilities averaged across the forest; when every tree gives a hard vote, that is simply the share of trees voting for each class, so if 68 of 100 trees voted for the positive class, the output for that row is [0.32, 0.68]. This is more informative than a hard label: you might only act on a prediction when confidence exceeds 80%, rather than treating every slim majority as equally certain. .predict() gives the decision; .predict_proba() shows the confidence behind it.
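Acting only on confident predictions can be sketched as a simple filter on the probability array. The data below is synthetic (make_classification), and the 0.80 cutoff is just the 80% figure from the text, not a recommended value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=1111)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1111)

rfc = RandomForestClassifier(n_estimators=100, random_state=1111)
rfc.fit(X_train, y_train)

proba = rfc.predict_proba(X_test)      # shape: (n_rows, n_classes)
labels = rfc.predict(X_test)
confidence = proba.max(axis=1)         # probability of the predicted class
confident = confidence >= 0.80         # assumed cutoff, taken from the 80% example

print("Rows predicted:        {}".format(len(labels)))
print("Rows above 80% cutoff: {}".format(int(confident.sum())))
```

Columns of the probability array follow the order of rfc.classes_, so the label returned by .predict() is always the class whose column holds the row maximum.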

A quick accuracy shortcut skips manual computation entirely.

print(rfc.score(X_test, y_test))

.score() calls .predict() internally, compares results to y_test, and returns the fraction of correct predictions. It is the fastest way to get an overall performance number, though it hides the detail of which types of mistakes the model is making.

Regression Error Metrics

Mean Absolute Error

MAE is the average absolute difference between predictions and true values, expressed in the same units as the target variable.

			
from sklearn.metrics import mean_absolute_error
mae_manual  = sum(abs(y_test - predictions)) / len(predictions)
mae_sklearn = mean_absolute_error(y_test, predictions)

If a model predicts ages and the errors for five people are 1, 3, 2, 5, and 4 years, MAE is (1+3+2+5+4)/5 = 3 years. The word “absolute” means direction is ignored: being 3 years too high counts the same as 3 years too low. An MAE of 3 years means the predictions are off by 3 years on average, which is a statement any audience can understand directly.

Mean Squared Error

MSE squares each error before averaging, which penalizes large mistakes far more heavily than small ones.

			
from sklearn.metrics import mean_squared_error
mse_manual  = sum((y_test - predictions) ** 2) / len(predictions)
mse_sklearn = mean_squared_error(y_test, predictions)

Using the same age example: squaring gives 1, 9, 4, 25, 16. The 5-year error contributes 25 while the 1-year error contributes only 1. MSE is useful when a large error is disproportionately worse than a small one. The downside is that the result is in squared units, which is hard to interpret as a standalone number; MSE is most useful when comparing two models against each other.

Property	MAE	MSE
Units	Same as target	Squared units
Sensitive to large errors	No	Yes
Easy to interpret alone	Yes	Less so
Use when	All errors matter equally	Large errors are especially costly
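The age example can be worked through in a few lines of plain Python. The only addition to the figures already given is the share of each metric contributed by the single largest error, which is included to show how squaring shifts the weight.

```python
# Absolute errors from the age example, in years
errors = [1, 3, 2, 5, 4]

mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)

# Share of each metric contributed by the single largest error (5 years)
largest = max(errors)
share_of_mae = largest / sum(errors)
share_of_mse = largest ** 2 / sum(e ** 2 for e in errors)

print("MAE: {0:.1f} years".format(mae))
print("MSE: {0:.1f} years squared".format(mse))
print("Largest error's share of MAE: {0:.0%}".format(share_of_mae))
print("Largest error's share of MSE: {0:.0%}".format(share_of_mse))
```

The 5-year error makes up a third of the MAE total but close to half of the MSE total, which is the sense in which MSE penalizes large mistakes more heavily.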

Evaluating on subsets

A model may look excellent overall while quietly underperforming for a specific group.

			
# divisions holds the segment label for each test row, in the same order as y_test
north_mask  = divisions == "North"
true_north  = y_test[north_mask]
preds_north = predictions[north_mask]
print("North division MAE: {}".format(mean_absolute_error(true_north, preds_north)))

The boolean mask acts as a sieve: only rows where divisions == "North" pass through. Running the metric on just those filtered rows answers the question “how does the model perform for this specific segment?” rather than for the full population. Overall strong performance can mask systematic failure for a subgroup that matters to the business.
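When there are several segments, the same check can be run for all of them at once. The sketch below uses a small hand-made results table; the division labels and numbers are hypothetical and chosen so that one segment is clearly worse than the other.

```python
import pandas as pd

# Hypothetical test-set results: true values, predictions, and a segment label
results = pd.DataFrame({
    "division":  ["North", "North", "South", "South", "South", "South"],
    "actual":    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "predicted": [16.0, 12.0, 31.0, 39.0, 51.0, 60.0],
})
results["abs_error"] = (results["actual"] - results["predicted"]).abs()

overall_mae = results["abs_error"].mean()
# Mean absolute error within each segment
mae_by_division = results.groupby("division")["abs_error"].mean()

print("Overall MAE: {0:.2f}".format(overall_mae))
print(mae_by_division)
```

Because the weaker segment has fewer rows, its errors are diluted in the overall figure, which is how a subgroup problem stays hidden.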

Classification Error Metrics

Confusion matrix

Accuracy alone hides how a classifier is failing. A confusion matrix breaks performance into four categories.

			
from sklearn.metrics import confusion_matrix
test_predictions = rfc.predict(X_test)
cm = confusion_matrix(y_test, test_predictions)
print(cm)
print("True Positives:  {}".format(cm[1, 1]))
print("False Positives: {}".format(cm[0, 1]))
print("True Negatives:  {}".format(cm[0, 0]))
print("False Negatives: {}".format(cm[1, 0]))

		

The matrix is indexed as cm[actual, predicted]. A True Positive (cm[1,1]) is a positive case correctly identified. A True Negative (cm[0,0]) is a negative case correctly dismissed. A False Positive (cm[0,1]) is a negative case incorrectly flagged. A False Negative (cm[1,0]) is a positive case the model missed. A classifier that handles 90% of cases correctly might still be systematically missing every instance of the rare class that matters most.
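A tiny hand-made example makes the indexing concrete. The labels below are made up for illustration, with 1 as the positive class; for binary labels, ravel() flattens the matrix row by row, which yields the four counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made example (assumed labels): 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)
# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN={0} FP={1} FN={2} TP={3}".format(tn, fp, fn, tp))
```

Unpacking the counts once by name is less error-prone than repeating index pairs throughout the code.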

Precision and Recall

Precision measures how reliable the model’s positive predictions are: of all the cases flagged as positive, what fraction actually were? Recall measures how comprehensive the detection is: of all the real positive cases, what fraction did the model catch?

			
# Manual calculation (example: TN=324, FP=15, FN=123, TP=491)
accuracy  = (324 + 491) / (324 + 15 + 123 + 491)   # 0.86
precision = 491 / (491 + 15)                         # 0.97
recall    = 491 / (491 + 123)                        # 0.80
# scikit-learn
from sklearn.metrics import precision_score, recall_score
precision = precision_score(y_test, test_predictions)
recall    = recall_score(y_test, test_predictions)

		

The example shows high precision (0.97) but moderate recall (0.80): when the model predicted positive it was almost always right, but it missed 20% of actual positives. Whether that trade-off is acceptable depends entirely on the cost of each mistake type. A smoke alarm with high precision rarely cries wolf; one with high recall rarely misses a real fire. Most systems cannot maximize both simultaneously, and the right balance is a business decision before it is a modeling decision.

Situation	Prioritize	Reason
Medical screening	Recall	Missing a real case is the worst outcome
Spam filter	Precision	Flagging a legitimate message is costly
Only act when confident	Precision	You need to be right when you say yes
Must catch every positive	Recall	Missing any positive is unacceptable
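The trade-off can be seen by moving the cutoff at which a score is flagged as positive. This is a plain-Python sketch; the scores, labels, and the two thresholds are hypothetical values picked to make the effect visible.

```python
# Hypothetical classifier scores and true labels (1 = positive)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    1,    0,    1,    0,    1,    0]

def precision_recall(threshold):
    """Precision and recall when scores >= threshold are flagged positive."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, l in zip(flagged, labels) if f and l == 1)
    fp = sum(1 for f, l in zip(flagged, labels) if f and l == 0)
    fn = sum(1 for f, l in zip(flagged, labels) if not f and l == 1)
    return tp / (tp + fp), tp / (tp + fn)

strict = precision_recall(0.75)   # flag only high-scoring rows
lenient = precision_recall(0.25)  # flag almost everything
print("Threshold 0.75 -> precision {0:.2f}, recall {1:.2f}".format(*strict))
print("Threshold 0.25 -> precision {0:.2f}, recall {1:.2f}".format(*lenient))
```

The strict cutoff is right every time it flags a row but misses two of the five positives; the lenient cutoff catches all five at the cost of two false alarms.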

Overfitting and Underfitting

Comparing training and test error diagnoses two problems. Overfitting produces low training error but high test error: the model memorized training noise rather than generalizing patterns. Underfitting produces high error on both sets: the model is too simple to capture the signal.

The max_features parameter illustrates both extremes.

			
from sklearn.ensemble import RandomForestRegressor
rfr_underfit = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=2)
rfr_underfit.fit(X_train, y_train)
print('Train error: {0:.2f}'.format(mean_absolute_error(y_train, rfr_underfit.predict(X_train))))
print('Test error:  {0:.2f}'.format(mean_absolute_error(y_test,  rfr_underfit.predict(X_test))))
rfr_overfit  = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=11)
rfr_balanced = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)

		

With only 2 features at each split, each tree works with almost no information: errors are high on both training and test data. With all 11 features available, trees can engineer a near-perfect fit on training data by memorizing its noise, and test error rises. Four features provide enough diversity to find real patterns without enabling the trees to overfit.

Looping over the number of trees shows how an ensemble that is too small underfits.

			
from sklearn.metrics import accuracy_score
test_scores, train_scores = [], []
for n_trees in [1, 2, 3, 4, 5, 10, 20, 50]:
    rfc = RandomForestClassifier(n_estimators=n_trees, random_state=1111)
    rfc.fit(X_train, y_train)
    train_scores.append(round(accuracy_score(y_train, rfc.predict(X_train)), 2))
    test_scores.append(round(accuracy_score(y_test,  rfc.predict(X_test)),  2))
print("Training scores: {}".format(train_scores))
print("Testing scores:  {}".format(test_scores))

		

A single tree’s judgment is unreliable. As more trees are added, their individual errors partially cancel out and test accuracy rises. Eventually the gain levels off. If accuracy is still climbing at the last data point in the loop, the ensemble has not reached a stable consensus and more trees may help.

Sampling Variability

Different random samples drawn from the same dataset can produce different class distributions, causing model scores to vary across runs even when the model itself has not changed.

			
sample1 = match_data.sample(200, random_state=1111)
sample2 = match_data.sample(200, random_state=1171)
print(len([i for i in sample1.index if i in sample2.index]))
print(sample1['result'].value_counts())
print(sample2['result'].value_counts())

		

The list comprehension counts how many rows appear in both samples. The value_counts() calls reveal the class distribution in each. If one sample has significantly more of one class than the other, a model trained on that sample will behave differently than one trained on the alternative. A single train/test split can give misleading results because the outcome is partly an accident of which rows happened to land in which pile. Cross-validation addresses this directly by averaging over many different splits.
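The comparison is easier to read as class shares side by side. The frame below is a hypothetical stand-in for match_data with an invented three-class result column; Index.intersection is used in place of the list comprehension to count shared rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1111)
# Hypothetical stand-in for match_data with a three-class 'result' column
data = pd.DataFrame(
    {"result": rng.choice(["win", "loss", "draw"], size=1000, p=[0.5, 0.3, 0.2])}
)

sample1 = data.sample(200, random_state=1111)
sample2 = data.sample(200, random_state=1171)

# Index.intersection finds shared rows without a Python-level loop
overlap = len(sample1.index.intersection(sample2.index))
print("Rows in both samples: {}".format(overlap))

# normalize=True reports class shares, which are easier to compare than counts
shares = pd.DataFrame({
    "sample1": sample1["result"].value_counts(normalize=True),
    "sample2": sample2["result"].value_counts(normalize=True),
})
print(shares.round(3))
```

Both samples come from the same population, so any difference between the two columns is sampling noise alone.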

Cross-Validation

KFold

KFold divides the data into k folds of equal size (or as close to equal as the row count allows) and runs k rounds of evaluation. In each round a different fold is held out for validation and the model trains on the rest. Every observation ends up in the validation set exactly once.

			
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=1111)
splits = kf.split(X)
for train_index, val_index in splits:
    print("Training rows:   %s" % len(train_index))
    print("Validation rows: %s" % len(val_index))

		

shuffle=True randomizes the row order before dividing into folds, preventing early rows from clustering in the first fold and late rows in the last. The result is 5 index pairs that can be used to slice features and labels for each training and validation round.

			
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
rfr = RandomForestRegressor(n_estimators=25, random_state=1111)
# kf.split(X) returns a one-shot generator, so call it again for this loop
for train_index, val_index in kf.split(X):
    # X and y are pandas objects, so select rows by position with .iloc
    X_train_fold, y_train_fold = X.iloc[train_index], y.iloc[train_index]
    X_val_fold,   y_val_fold   = X.iloc[val_index],   y.iloc[val_index]
    rfr.fit(X_train_fold, y_train_fold)
    predictions = rfr.predict(X_val_fold)
    print("Fold MSE: " + str(mean_squared_error(y_val_fold, predictions)))

		

If the five MSE values are consistent, the model is stable across different subsets of the data. If one fold produces an outlier score, something unusual landed in that fold. Consistent scores across folds mean the performance estimate is reliable rather than a lucky or unlucky draw.
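The claim that every row is validated exactly once can be verified directly. This sketch uses 23 placeholder rows, an assumed size picked because it does not divide evenly by 5, so the fold sizes differ by one.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 23  # assumed size, deliberately not divisible by 5
X = np.arange(n_rows).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=1111)
times_validated = np.zeros(n_rows, dtype=int)
fold_sizes = []
folds_disjoint = True

for train_index, val_index in kf.split(X):
    # Training and validation rows should never overlap within a fold
    if len(np.intersect1d(train_index, val_index)) > 0:
        folds_disjoint = False
    times_validated[val_index] += 1
    fold_sizes.append(len(val_index))

print("Validation fold sizes:   {}".format(fold_sizes))
print("Each row validated once: {}".format(bool((times_validated == 1).all())))
```

When the row count is not a multiple of k, the first few folds receive one extra row each rather than the remainder being dropped.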

`cross_val_score`

The manual KFold loop compresses into a single function call.

			
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer
rfr = RandomForestRegressor(n_estimators=25, random_state=1111)
mse_scorer = make_scorer(mean_squared_error)
cv_scores = cross_val_score(
    estimator=rfr,
    X=X_train,
    y=y_train,
    cv=10,
    scoring=mse_scorer
)
print(cv_scores.mean())

		

make_scorer converts a raw metric function into the scorer format that cross_val_score expects. With cv=10, the function runs 10 rounds of splitting, fitting, predicting, and scoring internally, returning an array of 10 results. .mean() collapses that array into a single summary metric that is more stable than any single fold’s result.
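scikit-learn also ships named scorers that can be passed as strings instead of building one with make_scorer. The sketch below uses "neg_mean_squared_error" on synthetic data; the dataset and its sizes are assumptions. Built-in scorers follow a higher-is-better convention, so the MSE values come back negated and are flipped before reporting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=1111)
rfr = RandomForestRegressor(n_estimators=25, random_state=1111)

# Built-in scorers treat higher as better, so MSE is returned as a negative number
neg_mse = cross_val_score(rfr, X, y, cv=10, scoring="neg_mean_squared_error")
mse_per_fold = -neg_mse

print("Mean MSE: {0:.2f}".format(mse_per_fold.mean()))
print("Std MSE:  {0:.2f}".format(mse_per_fold.std()))
```

Reporting the spread alongside the mean shows how much the estimate depends on which rows landed in which fold.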

Leave-One-Out Cross-Validation

LOOCV takes KFold to its extreme: setting cv equal to the number of rows means each observation is a validation set of one. The model trains on every other row, validates on the single held-out row, and repeats for every row in the dataset.

			
import numpy as np
mae_scorer = make_scorer(mean_absolute_error)
rfr = RandomForestRegressor(n_estimators=15, random_state=1111)
scores = cross_val_score(rfr, X=X, y=y, cv=len(X), scoring=mae_scorer)  # len(X) is 85 here: one fold per row
print("Mean error:     %s" % np.mean(scores))
print("Std of errors:  %s" % np.std(scores))

		

This maximizes the use of available data, making it well-suited to small datasets where a standard split would waste too much. The cost is computational: with 85 rows, 85 separate model fits are required. The standard deviation across scores reveals how sensitive the model is to individual data points. A high standard deviation suggests the model is unstable or that the dataset contains unusual observations with outsized influence.
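scikit-learn also provides a LeaveOneOut splitter, which expresses the same idea without hard-coding the row count. A minimal sketch on 30 synthetic rows, an assumed size kept small so the run finishes quickly:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (assumed size) to keep the number of fits manageable
X, y = make_regression(n_samples=30, n_features=4, noise=5.0, random_state=1111)
rfr = RandomForestRegressor(n_estimators=15, random_state=1111)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X)  # one fit per row
scores = cross_val_score(rfr, X, y, cv=loo, scoring=make_scorer(mean_absolute_error))

print("Model fits:    {}".format(n_fits))
print("Mean error:    {0:.2f}".format(np.mean(scores)))
print("Std of errors: {0:.2f}".format(np.std(scores)))
```

Each score here is the absolute error on a single held-out row, so the standard deviation describes how unevenly the model handles individual observations.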

Method	Use when
Single train/test split	Large dataset, quick baseline
KFold (manual)	Need control over each fold
`cross_val_score`	Standard cross-validation
LOOCV (`cv=n`)	Very small datasets

Hyperparameter Tuning

Hyperparameters are settings chosen before training that control how the model learns. Unlike the model's internal parameters, they are not learned when fit() runs, so they have to be chosen by comparing candidate values. max_depth, min_samples_split, and max_features are all hyperparameters for a random forest.

			
print(rfr.get_params())
max_depth         = [4, 8, 12]
min_samples_split = [2, 5, 10]
max_features      = [4, 6, 8, 10]

get_params() returns all current values as a dictionary, which is a useful starting point before deciding what to search over. The lists define the candidate values: not a commitment, but a menu.

A simple first approach picks randomly from each list.

			
import random
rfr = RandomForestRegressor(
    n_estimators=100,
    max_depth=random.choice(max_depth),
    min_samples_split=random.choice(min_samples_split),
    max_features=random.choice(max_features)
)
print(rfr.get_params())

		

This is fast but unsystematic. The print(rfr.get_params()) call records which combination was selected, since the choice was random and needs to be logged for reproducibility.

`RandomizedSearchCV`

A more rigorous approach samples many combinations and evaluates each with cross-validation.

			
from sklearn.model_selection import RandomizedSearchCV
param_dist = {
    "max_depth":          [2, 4, 6, 8],
    "max_features":       [2, 4, 6, 8, 10],
    "min_samples_split":  [2, 4, 8, 16]
}
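rfr = RandomForestRegressor(n_estimators=10, random_state=1111)
# MSE is an error, so lower is better: greater_is_better=False makes the
# scorer return negative MSE and the search then keeps the lowest-error model
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
random_search = RandomizedSearchCV(
    estimator=rfr,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring=mse_scorer
)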

		

The full grid here is 4 × 5 × 4 = 80 combinations. Setting n_iter=10 samples 10 of those 80, each evaluated with 5-fold cross-validation, for 50 total model fits. This covers the search space efficiently without exhaustively testing every option.
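The arithmetic can be confirmed with ParameterGrid, which enumerates every combination in a dictionary of candidate lists. The counts below cover only the cross-validation fits; by default both search classes also refit the best configuration once on all the data they were given.

```python
from sklearn.model_selection import ParameterGrid

param_dist = {
    "max_depth":         [2, 4, 6, 8],
    "max_features":      [2, 4, 6, 8, 10],
    "min_samples_split": [2, 4, 8, 16],
}

n_combinations = len(ParameterGrid(param_dist))  # every possible combination
n_iter, cv = 10, 5
random_search_fits = n_iter * cv                 # fits during a randomized search
grid_search_fits = n_combinations * cv           # fits an exhaustive search would need

print("Combinations in full grid: {}".format(n_combinations))
print("Randomized search fits:    {}".format(random_search_fits))
print("Exhaustive search fits:    {}".format(grid_search_fits))
```

Running this kind of count before launching a search is a cheap way to decide whether an exhaustive grid is affordable.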

The same mechanism works with any scoring metric. To optimize for precision instead of MSE:

			
from sklearn.metrics import precision_score
precision_scorer = make_scorer(precision_score)
rs = RandomizedSearchCV(
    estimator=rfc,
    param_distributions=param_dist,
    scoring=precision_scorer,
    cv=5,
    n_iter=10,
    random_state=1111
)
rs.fit(X_train, y_train)  # search on training data only; the test set stays sealed
print("Scores per combination: {}".format(rs.cv_results_['mean_test_score']))
print("Best score: {}".format(rs.best_score_))

		

rs.fit() runs all 50 fits and records every result. cv_results_['mean_test_score'] is the leaderboard: one mean cross-validated score per combination tried. best_score_ is the top entry. random_state=1111 ensures the same 10 combinations are sampled every run, making the search reproducible.

	GridSearchCV	RandomizedSearchCV
Tests	Every combination	Random sample
Speed	Slow on large grids	Faster
Coverage	Exhaustive	Approximate
Use when	Small parameter space	Large parameter space
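For comparison, an exhaustive search looks almost identical in code. This is a sketch on synthetic data with a deliberately tiny grid; the candidate values, fold count, and dataset are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=1111)

# Deliberately tiny grid (assumed values): 2 x 2 = 4 combinations
param_grid = {"max_depth": [2, 6], "max_features": [2, 4]}

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=1111),
    param_grid=param_grid,
    cv=3,
    scoring="neg_mean_squared_error",  # negated so that higher is better
)
grid_search.fit(X, y)

n_tried = len(grid_search.cv_results_["params"])
print("Combinations tried: {}".format(n_tried))
print("Best parameters:    {}".format(grid_search.best_params_))
print("Best MSE:           {0:.2f}".format(-grid_search.best_score_))
```

There is no n_iter argument because every combination is evaluated; the cost grows with the product of the list lengths.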

Common Pitfalls

Fitting a scaler or any transformation on the full dataset before splitting leaks test-set information into the training process. Always split first, then fit transforms on the training fold only.

Reporting only training error without evaluating on held-out data gives a falsely optimistic picture of real-world performance. Always report test-set results.

Using the test set to guide tuning decisions corrupts the final evaluation. The test set should be touched exactly once, at the very end. Tuning belongs on a validation set or inside cross-validation.

Not setting random_state means results will differ every run, making experiments impossible to reproduce. Set it consistently.

Choosing between precision and recall without first defining the cost of each mistake leads to optimizing for the wrong objective. Define what a false positive and a false negative cost in your specific context before selecting the metric.

High accuracy is not always a good result. On an imbalanced dataset where one class makes up 95% of observations, a model that always predicts the majority class achieves 95% accuracy without learning anything. Precision and recall expose this failure that accuracy conceals.
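The 95% scenario takes only a few lines to reproduce. The labels below are hypothetical: 95 negatives, 5 positives, and a "model" that predicts the majority class for every row.

```python
# Hypothetical imbalanced labels: 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print("Accuracy: {0:.2f}".format(accuracy))
print("Recall:   {0:.2f}".format(recall))
```

Precision is left out on purpose: with no positive predictions at all, it is a division by zero, which is itself a sign that the model never attempts the task.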

Quick Reference

			
# Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)
# Fit
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
rfr = RandomForestRegressor(n_estimators=100, max_depth=6, random_state=1111)
rfc = RandomForestClassifier(n_estimators=50,  max_depth=6, random_state=1111)
model = rfr                       # or rfc for classification
model.fit(X_train, y_train)
# Predict
model.predict(X_test)             # labels or continuous values
rfc.predict_proba(X_test)         # per-class probabilities (classifiers only)
model.score(X_test, y_test)       # accuracy (classifier) or R² (regressor)
# Regression metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error
mean_absolute_error(y_test, predictions)
mean_squared_error(y_test, predictions)
# Classification metrics
from sklearn.metrics import confusion_matrix, precision_score, recall_score
confusion_matrix(y_test, predictions)    # cm[actual, predicted]
precision_score(y_test, predictions)
recall_score(y_test, predictions)
# Feature importances
for i, v in enumerate(rfr.feature_importances_):
    print(X_train.columns[i], round(v, 2))
# Cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
cv = cross_val_score(model, X_train, y_train, cv=10, scoring=make_scorer(mean_squared_error))
print(cv.mean())
# LOOCV
cross_val_score(model, X, y, cv=len(X), scoring=make_scorer(mean_absolute_error))
# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
param_dist = {"max_depth": [2, 4, 6, 8], "max_features": [2, 4, 6, 8]}
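mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)  # lower MSE is better
rs = RandomizedSearchCV(estimator=model, param_distributions=param_dist,
                        n_iter=10, cv=5, scoring=mse_scorer, random_state=1111)
rs.fit(X_train, y_train)          # keep the test set out of tuning
print(rs.best_score_)             # negative MSE of the best combination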
print(rs.cv_results_['mean_test_score'])

		

The workflow is always the same sequence: lock away the test set; train on the training portion; predict on held-out data; choose the right metric for the problem type; diagnose overfitting or underfitting by comparing train versus test error; cross-validate for a reliable estimate; tune hyperparameters to find the best configuration; report final performance on the test set exactly once.

See you soon.


Andrei
