A model that scores perfectly on training data but fails on new data is not a good model. It has memorized examples rather than learning patterns that generalize. Model validation is the set of techniques that catch this problem, measure its severity, and guide you toward a model that performs reliably in the real world.
The validation pipeline follows a consistent sequence: split the data so the test set is sealed away; fit the model on training data; evaluate predictions against held-out data; diagnose whether errors reflect overfitting or underfitting; cross-validate for a more reliable estimate; and tune hyperparameters systematically to find the best configuration.
Seen vs Unseen Data
Almost every model performs better on the data it was trained on than on data it has never seen. The gap between training and test performance is the core measurement of overfitting. A model that memorized specific examples rather than learning general patterns will show a large gap; a model that learned real signal will show a small one.
```python
from sklearn.metrics import mean_absolute_error

model.fit(X_train, y_train)

# Predict on data the model has seen and on data it has not
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

train_error = mean_absolute_error(y_train, train_predictions)
test_error = mean_absolute_error(y_test, test_predictions)

print("Error on training data: {0:.2f}".format(train_error))
print("Error on unseen data: {0:.2f}".format(test_error))
```
The training error will almost always be lower. What you are watching for is the size of the gap. Think of it like studying for an exam using the exact questions that will appear versus studying the subject itself. A student who memorized the answer key will score perfectly on those specific questions but fail the moment the questions change. Training error is the practice-question score; test error is the real exam.
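The memorization analogy can be made concrete with a toy model. The sketch below is only an illustration: the data points are made up, and `memorizer_predict` is a hypothetical helper (not a scikit-learn function) that answers every query with the target of the closest training point.

```python
# Made-up data: y is roughly 2 * x, with a little hand-written noise.
train = [(1, 2.3), (2, 3.8), (3, 6.4), (4, 7.7), (5, 10.2)]
test = [(1.4, 2.8), (2.6, 5.2), (3.4, 6.8), (4.6, 9.2)]


def memorizer_predict(x):
    """Return the target of the nearest training point (pure memorization)."""
    nearest_x, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y


def mean_abs_error(pairs):
    """Average absolute difference between true targets and predictions."""
    return sum(abs(y - memorizer_predict(x)) for x, y in pairs) / len(pairs)


train_error = mean_abs_error(train)  # every training point is its own neighbor
test_error = mean_abs_error(test)    # unseen points inherit a neighbor's noise

print("Error on training data: {0:.2f}".format(train_error))
print("Error on unseen data: {0:.2f}".format(test_error))
print("Gap: {0:.2f}".format(test_error - train_error))
```

The training error is exactly zero because the model has stored the answers; the test error is not, and the difference between the two is the gap this section describes.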
Splitting Data
One holdout set
A single train/test split is sufficient when you need a final performance estimate and are not tuning hyperparameters.
```python
from sklearn.model_selection import train_test_split
import pandas as pd

X = pd.get_dummies(match_data.iloc[:, 0:9])
y = match_data.iloc[:, 9]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=1111)
```
pd.get_dummies converts categorical feature columns into binary indicator columns before they are passed to the model. test_size=0.10 reserves 10% of the rows as the held-out test set. The random_state argument fixes the shuffle, so the same split is produced every time the code runs.
Two holdout sets
When tuning hyperparameters, a third set is carved out of the training data. The validation set guides tuning decisions; the test set is reserved for the final honest evaluation.
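```python
# First split: seal away 20% of all rows as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1111)

# Second split: carve a validation set out of the remaining 80%
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111)
```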
The first split reserves 20% as the untouched test set. The second takes the remaining 80% and divides it 75/25, yielding roughly 60% train, 20% validation, and 20% test of the original total. Using the test set to guide tuning corrupts the evaluation: you would be optimizing for that specific holdout rather than for general performance.
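The 60/20/20 arithmetic is easy to verify by hand. The sketch below assumes a hypothetical dataset of 1,000 rows (a number chosen only so the splits come out in whole rows) and counts how many rows land in each set.

```python
# Hypothetical dataset size, chosen so the arithmetic comes out in whole rows.
n_rows = 1000

# First split: hold out 20% of all rows as the test set.
n_test = round(n_rows * 0.20)
n_temp = n_rows - n_test

# Second split: hold out 25% of the remaining rows for validation.
n_val = round(n_temp * 0.25)
n_train = n_temp - n_val

for name, count in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print("{0}: {1} rows ({2:.0%} of the original)".format(name, count, count / n_rows))
```

Note that the second test_size is a fraction of what remains after the first split, not of the original data, which is why 0.25 produces a 20% validation set.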
Fitting a Random Forest
Random forests are an ensemble: many decision trees, each trained on a random sample of the rows and restricted to a random subset of features at each split. The final prediction combines the trees' votes (classification) or averages their outputs (regression).
```python
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rfr = RandomForestRegressor(
    n_estimators=100, max_depth=6, max_features=4, random_state=1111)
rfr.fit(X_train, y_train)

rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
rfc.fit(X_train, y_train)
```
n_estimators sets the number of trees. More trees produce a more stable result but take longer to train. max_depth caps how many levels of splits each tree can grow; without this cap, individual trees can keep splitting until they memorize the training data, causing overfitting. max_features restricts how many feature columns a tree considers at each split, which introduces diversity and prevents all trees from learning the same patterns.
After training, the model tracks how much it relied on each input feature across all its trees.
```python
rfr.fit(X_train, y_train)

# Print each feature's name next to its importance score
for i, importance in enumerate(rfr.feature_importances_):
    print("{0:s}: {1:.2f}".format(X_train.columns[i], importance))
```
Importances are relative: they sum to 1.0 across all features. A score near 1.0 means that one feature accounted for nearly all of the improvement made by the forest's splits; a score near 0 means the model barely used it. Feature importances reflect what was useful to this specific model on this specific dataset. They are a useful starting point for understanding model behavior but not a definitive statement about which variables matter in the real world. A feature with low importance may still carry genuine significance that the model failed to detect given the available data.
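To see how importances behave, it can help to rank them on data where the answer is partly known. This sketch is an illustration under stated assumptions: the data is synthetic (generated with scikit-learn's make_regression, five features of which two are informative), and the names X_demo, y_demo, and forest are placeholders rather than anything from the match dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): 5 features, only 2 informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, n_informative=2, noise=5.0, random_state=1111)

forest = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=1111)
forest.fit(X_demo, y_demo)

importances = forest.feature_importances_

# Rank features from most to least important.
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, importance in ranked:
    print("feature_{0}: {1:.2f}".format(index, importance))

# The scores are shares of a whole, so they add up to 1.
print("Sum of importances: {0:.2f}".format(importances.sum()))
```

Because the scores are shares, a feature's importance can shrink simply because another correlated feature was added, which is one reason to read them as a description of the model rather than of the world.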
You can inspect any hyperparameter on a fitted model at any time.
```python
print(rfc)
print('Random state: {}'.format(rfc.random_state))
print(rfc.get_params())
```
get_params() returns a dictionary of all current hyperparameter values. This matters during tuning: if you forget which configuration produced a good result, calling get_params() on the fitted model recovers the exact settings used.
Making Predictions
Regression
```python
predictions = rfr.predict(X_test)
```
.predict() passes each test observation through the fitted forest, collects a numeric estimate from each tree, averages them, and returns one continuous value per row. The output might be a price, a duration, or a measurement, depending on what the target variable represents.
Classification
```python
predictions = rfc.predict(X_test)
print(pd.Series(predictions).value_counts())
```
For classification, each tree votes for a class and the majority wins. .predict() returns one label per row. .value_counts() tallies the predicted labels as a quick sanity check. If a binary classifier predicts only one class for every row, something has gone wrong.
Predicted probabilities
```python
prob_predictions = rfc.predict_proba(X_test)
print(prob_predictions[0])
```
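.predict_proba() exposes the voting breakdown as probabilities. In scikit-learn the forest averages each tree's class-probability estimate, so if the trees' estimates for the positive class average 0.68, the output for that row is [0.32, 0.68]. When every tree is fully grown and casts an all-or-nothing vote, this is the same as 68 of 100 trees voting positive. This is more informative than a hard label: you might only act on a prediction when confidence exceeds 80%, rather than treating every slim majority as equally certain. .predict() gives the decision; .predict_proba() shows the confidence behind it.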
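The step from per-tree estimates to a forest-level probability, and from there to a decision, can be sketched in plain Python. scikit-learn's forests average the per-tree class probabilities; when every tree is fully confident, that average is the same as the share of trees voting for each class. The four per-tree estimates below are hypothetical numbers chosen for illustration, and the 0.80 threshold is the example figure from the text.

```python
# Hypothetical per-tree estimates for one row: [P(negative), P(positive)].
per_tree_probs = [[0.2, 0.8], [0.4, 0.6], [0.1, 0.9], [0.5, 0.5]]

n_trees = len(per_tree_probs)

# Average each class's probability across the trees.
forest_probs = [sum(tree[c] for tree in per_tree_probs) / n_trees for c in range(2)]

# The hard label is the class with the highest averaged probability.
label = max(range(2), key=lambda c: forest_probs[c])

# Only act when the winning probability clears a confidence threshold.
threshold = 0.80
act_on_prediction = forest_probs[label] >= threshold

print("Forest probabilities:", [round(p, 2) for p in forest_probs])
print("Predicted label:", label)
print("Confident enough to act:", act_on_prediction)
```

Here the forest predicts the positive class, but at roughly 0.70 the prediction falls short of the threshold, so a cautious system would decline to act on it.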
A quick accuracy shortcut skips manual computation entirely.
```python
print(rfc.score(X_test, y_test))
```
.score() calls .predict() internally, compares results to y_test, and returns the fraction of correct predictions. It is the fastest way to get an overall performance number, though it hides the detail of which types of mistakes the model is making.
Regression Error Metrics
Mean Absolute Error
MAE is the average absolute difference between predictions and true values, expressed in the same units as the target variable.
```python
from sklearn.metrics import mean_absolute_error

# By hand: average of the absolute differences
mae_manual = sum(abs(y_test - predictions)) / len(predictions)

# Same result from scikit-learn
mae_sklearn = mean_absolute_error(y_test, predictions)
```
If a model predicts ages and the errors for five people are 1, 3, 2, 5, and 4 years, MAE is (1+3+2+5+4)/5 = 3 years. The word “absolute” means direction is ignored: being 3 years too high counts the same as 3 years too low. An MAE of 3 years means the predictions are off by 3 years on average, which is a statement any audience can understand directly.
Mean Squared Error
MSE squares each error before averaging, which penalizes large mistakes far more heavily than small ones.
```python
from sklearn.metrics import mean_squared_error

# By hand: average of the squared differences
mse_manual = sum((y_test - predictions) ** 2) / len(predictions)

# Same result from scikit-learn
mse_sklearn = mean_squared_error(y_test, predictions)
```
Using the same age example: squaring gives 1, 9, 4, 25, 16. The 5-year error contributes 25 while the 1-year error contributes only 1. MSE is useful when a large error is disproportionately worse than a small one. The downside is that the result is in squared units, which is hard to interpret as a standalone number; MSE is most useful when comparing two models against each other.
| | MAE | MSE |
|---|---|---|
| Units | Same as target | Squared units |
| Sensitive to large errors | No | Yes |
| Easy to interpret alone | Yes | Less so |
| Use when | All errors matter equally | Large errors are especially costly |
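The age example can be worked through in a few lines of plain Python. The sketch uses the same five errors as the text and also computes how much of each metric comes from the single largest error, which is one way to see the "penalizes large mistakes" effect in numbers.

```python
errors = [1, 3, 2, 5, 4]  # the five age errors from the example, in years

mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)

# Share of each metric's total that comes from the largest (5-year) error.
largest = max(errors, key=abs)
share_in_mae = abs(largest) / sum(abs(e) for e in errors)
share_in_mse = largest ** 2 / sum(e ** 2 for e in errors)

print("MAE: {0:.1f} years".format(mae))
print("MSE: {0:.1f} squared years".format(mse))
print("5-year error's share of MAE total: {0:.0%}".format(share_in_mae))
print("5-year error's share of MSE total: {0:.0%}".format(share_in_mse))
```

The same 5-year miss accounts for a third of the MAE total but almost half of the MSE total, so a model selected on MSE is pushed harder to avoid that one large error.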
Evaluating on subsets
A model may look excellent overall while quietly underperforming for a specific group.
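```python
# `divisions` is assumed to hold one division label per test row,
# aligned row-for-row with y_test and predictions
north_mask = divisions == "North"

true_north = y_test[north_mask]
preds_north = predictions[north_mask]

print("North division MAE: {}".format(mean_absolute_error(true_north, preds_north)))
```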
The boolean mask acts as a sieve: only rows where divisions == "North" pass through. Running the metric on just those filtered rows answers the question “how does the model perform for this specific segment?” rather than for the full population. Overall strong performance can mask systematic failure for a subgroup that matters to the business.
Classification Error Metrics
Confusion matrix
Accuracy alone hides how a classifier is failing. A confusion matrix breaks performance into four categories.
```python
from sklearn.metrics import confusion_matrix

test_predictions = rfc.predict(X_test)
cm = confusion_matrix(y_test, test_predictions)
print(cm)

# Rows are actual classes, columns are predicted classes
print("True Positives: {}".format(cm[1, 1]))
print("False Positives: {}".format(cm[0, 1]))
print("True Negatives: {}".format(cm[0, 0]))
print("False Negatives: {}".format(cm[1, 0]))
```
The matrix is indexed as cm[actual, predicted]. A True Positive (cm[1,1]) is a positive case correctly identified. A True Negative (cm[0,0]) is a negative case correctly dismissed. A False Positive (cm[0,1]) is a negative case incorrectly flagged. A False Negative (cm[1,0]) is a positive case the model missed. A classifier that handles 90% of cases correctly might still be systematically missing every instance of the rare class that matters most.
Precision and Recall
Precision measures how reliable the model’s positive predictions are: of all the cases flagged as positive, what fraction actually were? Recall measures how comprehensive the detection is: of all the real positive cases, what fraction did the model catch?
```python
# Manual calculation (example: TN=324, FP=15, FN=123, TP=491)
accuracy = (324 + 491) / (324 + 15 + 123 + 491)  # 0.86
precision = 491 / (491 + 15)                     # 0.97
recall = 491 / (491 + 123)                       # 0.80

# scikit-learn
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_test, test_predictions)
recall = recall_score(y_test, test_predictions)
```
The example shows high precision (0.97) but moderate recall (0.80): when the model predicted positive it was almost always right, but it missed 20% of actual positives. Whether that trade-off is acceptable depends entirely on the cost of each mistake type. A smoke alarm with high precision rarely cries wolf; one with high recall rarely misses a real fire. Most systems cannot maximize both simultaneously, and the right balance is a business decision before it is a modeling decision.
| Situation | Prioritize | Reason |
|---|---|---|
| Medical screening | Recall | Missing a real case is the worst outcome |
| Spam filter | Precision | Flagging a legitimate message is costly |
| Only act when confident | Precision | You need to be right when you say yes |
| Must catch every positive | Recall | Missing any positive is unacceptable |
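One common way the trade-off shows up is through the probability threshold used to flag a row as positive. The sketch below uses eight made-up predicted probabilities and labels, and a hypothetical helper `precision_recall`, to show how raising the threshold buys precision at the cost of recall.

```python
def precision_recall(probabilities, labels, threshold):
    """Return (precision, recall) when rows at or above threshold are flagged.

    Assumes at least one row is flagged and at least one label is positive,
    so neither denominator is zero.
    """
    flagged = [p >= threshold for p in probabilities]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    return tp / (tp + fp), tp / (tp + fn)


# Made-up predicted probabilities of the positive class, with true labels.
probabilities = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0]

for threshold in (0.50, 0.75):
    precision, recall = precision_recall(probabilities, labels, threshold)
    print("threshold {0:.2f}: precision {1:.2f}, recall {2:.2f}".format(
        threshold, precision, recall))
```

At 0.50 both metrics are 0.80; at 0.75 the model stops making false alarms (precision 1.00) but now misses two of the five real positives (recall 0.60).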
Overfitting and Underfitting
The gap between training and test error diagnoses two problems. Overfitting produces low training error but high test error: the model memorized training noise rather than generalizing patterns. Underfitting produces high error on both sets: the model is too simple to capture the signal.
The max_features parameter illustrates both extremes.
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Very few features per split: likely to underfit
rfr_underfit = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=2)
rfr_underfit.fit(X_train, y_train)
print('Train error: {0:.2f}'.format(mean_absolute_error(y_train, rfr_underfit.predict(X_train))))
print('Test error: {0:.2f}'.format(mean_absolute_error(y_test, rfr_underfit.predict(X_test))))

# Fit and score these two the same way to compare
rfr_overfit = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=11)
rfr_balanced = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)
```
With only 2 features at each split, each tree works with very little information: errors are high on both training and test data. With all 11 features available, trees can engineer a near-perfect fit on training data by memorizing its noise, and test error rises. Four features provide enough diversity to find real patterns without enabling the trees to overfit.
Varying the number of trees in a loop shows underfitting from a different angle.
```python
from sklearn.metrics import accuracy_score

test_scores, train_scores = [], []

for n_trees in [1, 2, 3, 4, 5, 10, 20, 50]:
    rfc = RandomForestClassifier(n_estimators=n_trees, random_state=1111)
    rfc.fit(X_train, y_train)
    train_scores.append(round(accuracy_score(y_train, rfc.predict(X_train)), 2))
    test_scores.append(round(accuracy_score(y_test, rfc.predict(X_test)), 2))

print("Training scores: {}".format(train_scores))
print("Testing scores: {}".format(test_scores))
```
A single tree’s judgment is unreliable. As more trees are added, their individual errors partially cancel out and test accuracy rises. Eventually the gain levels off. If accuracy is still climbing at the last data point in the loop, the ensemble has not reached a stable consensus and more trees may help.
Sampling Variability
Different random samples drawn from the same dataset can produce different class distributions, causing model scores to vary across runs even when the model itself has not changed.
```python
sample1 = match_data.sample(200, random_state=1111)
sample2 = match_data.sample(200, random_state=1171)

# Number of rows that appear in both samples
print(len([i for i in sample1.index if i in sample2.index]))

# Class distribution in each sample
print(sample1['result'].value_counts())
print(sample2['result'].value_counts())
```
The list comprehension counts how many rows appear in both samples. The value_counts() calls reveal the class distribution in each. If one sample has significantly more of one class than the other, a model trained on that sample will behave differently than one trained on the alternative. A single train/test split can give misleading results because the outcome is partly an accident of which rows happened to land in which pile. Cross-validation addresses this directly by averaging over many different splits.
Cross-Validation
KFold
KFold divides the data into k folds of equal size (or as close to equal as the row count allows) and runs k rounds of evaluation. In each round a different fold is held out for validation and the model trains on the rest. Every observation ends up in the validation set exactly once.
```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=1111)

# kf.split(X) returns a one-shot generator; store it as a list so it can be reused
splits = list(kf.split(X))

for train_index, val_index in splits:
    print("Training rows: %s" % len(train_index))
    print("Validation rows: %s" % len(val_index))
```
shuffle=True randomizes the row order before dividing into folds, preventing early rows from clustering in the first fold and late rows in the last. The result is 5 index pairs that can be used to slice features and labels for each training and validation round.
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rfr = RandomForestRegressor(n_estimators=25, random_state=1111)

# Call kf.split(X) again: a generator that has already been looped over is empty
for train_index, val_index in kf.split(X):
    # X and y are pandas objects, so select rows by position with .iloc
    X_train_fold, y_train_fold = X.iloc[train_index], y.iloc[train_index]
    X_val_fold, y_val_fold = X.iloc[val_index], y.iloc[val_index]

    rfr.fit(X_train_fold, y_train_fold)
    predictions = rfr.predict(X_val_fold)
    print("Fold MSE: " + str(mean_squared_error(y_val_fold, predictions)))
```
If the five MSE values are consistent, the model is stable across different subsets of the data. If one fold produces an outlier score, something unusual landed in that fold. Consistent scores across folds mean the performance estimate is reliable rather than a lucky or unlucky draw.
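The claim that every observation is validated exactly once is easy to check with a hand-rolled splitter. The sketch below is a simplified stand-in for KFold without shuffling: `k_fold_indices` is a hypothetical helper, and the 23-row dataset size is an arbitrary choice that does not divide evenly by 5.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, val_indices) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, val
        start += size


n_rows = 23
validation_counts = [0] * n_rows  # how many times each row was held out
val_sizes = []

for train, val in k_fold_indices(n_rows, k=5):
    val_sizes.append(len(val))
    for i in val:
        validation_counts[i] += 1
    print("Training rows: {0}, validation rows: {1}".format(len(train), len(val)))

print("Each row validated exactly once:", all(c == 1 for c in validation_counts))
```

With 23 rows the folds hold 5, 5, 5, 4, and 4 rows: as equal as the row count allows, with no row ever appearing in both the training and validation side of the same round.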
cross_val_score
The manual KFold loop compresses into a single function call.
```python
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer

rfr = RandomForestRegressor(n_estimators=25, random_state=1111)
mse_scorer = make_scorer(mean_squared_error)

cv_scores = cross_val_score(
    estimator=rfr, X=X_train, y=y_train, cv=10, scoring=mse_scorer)

print(cv_scores.mean())
```
make_scorer converts a raw metric function into the scorer format that cross_val_score expects. With cv=10, the function runs 10 rounds of splitting, fitting, predicting, and scoring internally, returning an array of 10 results. .mean() collapses that array into a single summary metric that is more stable than any single fold’s result.
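As an alternative to building a scorer by hand, scikit-learn also accepts scorer names as strings. The sketch below uses the built-in "neg_mean_squared_error" name on synthetic data; the dataset and the variable names are assumptions for illustration. Built-in scorers follow a "higher is better" convention, so the MSE comes back negated and has to be flipped before reporting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption for illustration).
X_demo, y_demo = make_regression(
    n_samples=120, n_features=6, noise=10.0, random_state=1111)

model = RandomForestRegressor(n_estimators=25, random_state=1111)

# One negated MSE value per fold.
neg_mse = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")

# Flip the sign to get ordinary (positive) MSE values.
mse_per_fold = -neg_mse

print("MSE per fold:", mse_per_fold.round(1))
print("Mean MSE: {0:.1f}".format(mse_per_fold.mean()))
print("Std of MSE: {0:.1f}".format(mse_per_fold.std()))
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the estimate depends on which rows landed in which fold.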
Leave-One-Out Cross-Validation
LOOCV takes KFold to its extreme: setting cv equal to the number of rows means each observation is a validation set of one. The model trains on every other row, validates on the single held-out row, and repeats for every row in the dataset.
```python
import numpy as np

mae_scorer = make_scorer(mean_absolute_error)
rfr = RandomForestRegressor(n_estimators=15, random_state=1111)

# cv=85 matches the number of rows in this dataset, so each fold holds one row
scores = cross_val_score(rfr, X=X, y=y, cv=85, scoring=mae_scorer)

print("Mean error: %s" % np.mean(scores))
print("Std of errors: %s" % np.std(scores))
```
This maximizes the use of available data, making it well-suited to small datasets where a standard split would waste too much. The cost is computational: with 85 rows, 85 separate model fits are required. The standard deviation across scores reveals how sensitive the model is to individual data points. A high standard deviation suggests the model is unstable or that the dataset contains unusual observations with outsized influence.
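Hard-coding the row count works, but scikit-learn also provides a LeaveOneOut splitter that derives the number of folds from the data. The sketch below is a small illustration on synthetic data (30 rows, an arbitrary choice to keep the run short); the variable names are placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (an assumption for illustration).
X_small, y_small = make_regression(
    n_samples=30, n_features=4, noise=5.0, random_state=1111)

model = RandomForestRegressor(n_estimators=15, random_state=1111)
loo = LeaveOneOut()

# One fit per row; each score is the negated absolute error on the held-out row.
neg_mae = cross_val_score(
    model, X_small, y_small, cv=loo, scoring="neg_mean_absolute_error")
abs_errors = -neg_mae

print("Number of fits:", loo.get_n_splits(X_small))
print("Mean error: {0:.2f}".format(abs_errors.mean()))
print("Std of errors: {0:.2f}".format(abs_errors.std()))
```

Passing the splitter object instead of a number means the code keeps working unchanged if rows are added or removed.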
| Method | Use when |
|---|---|
| Single train/test split | Large dataset, quick baseline |
| KFold (manual) | Need control over each fold |
| cross_val_score | Standard cross-validation |
| LOOCV (cv=n) | Very small datasets |
Hyperparameter Tuning
Hyperparameters are settings chosen before training that control how the model learns. Unlike the splits inside each tree, they are not learned from the data during fitting. max_depth, min_samples_split, and max_features are all hyperparameters for a random forest.
```python
print(rfr.get_params())

# Candidate values to search over
max_depth = [4, 8, 12]
min_samples_split = [2, 5, 10]
max_features = [4, 6, 8, 10]
```
get_params() returns all current values as a dictionary, which is a useful starting point before deciding what to search over. The lists define the candidate values: not a commitment, but a menu.
A simple first approach picks randomly from each list.
```python
import random

rfr = RandomForestRegressor(
    n_estimators=100,
    max_depth=random.choice(max_depth),
    min_samples_split=random.choice(min_samples_split),
    max_features=random.choice(max_features))

print(rfr.get_params())
```
This is fast but unsystematic. The print(rfr.get_params()) call records which combination was selected, since the choice was random and needs to be logged for reproducibility.
RandomizedSearchCV
A more rigorous approach samples many combinations and evaluates each with cross-validation.
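```python
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "max_depth": [2, 4, 6, 8],
    "max_features": [2, 4, 6, 8, 10],
    "min_samples_split": [2, 4, 8, 16]}

rfr = RandomForestRegressor(n_estimators=10, random_state=1111)

# greater_is_better=False negates MSE so the search prefers lower error
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

random_search = RandomizedSearchCV(
    estimator=rfr,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring=mse_scorer)
```
The full grid here is 4 × 5 × 4 = 80 combinations. Setting n_iter=10 samples 10 of those 80, each evaluated with 5-fold cross-validation, for 50 total model fits. This covers the search space efficiently without exhaustively testing every option. Because the search keeps the combination with the highest score, the MSE scorer is built with greater_is_better=False; with the default setting the search would select the combination with the largest error.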
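The counting behind those numbers can be reproduced with the standard library. The sketch below only mimics the idea of sampling without replacement from a grid of lists; it is not scikit-learn's internal sampler, and the seed is arbitrary.

```python
import itertools
import random

param_dist = {
    "max_depth": [2, 4, 6, 8],
    "max_features": [2, 4, 6, 8, 10],
    "min_samples_split": [2, 4, 8, 16],
}

# Every possible combination of one value per hyperparameter.
all_combinations = list(itertools.product(*param_dist.values()))

# Draw 10 distinct combinations, as a randomized search over lists would.
rng = random.Random(1111)
sampled = rng.sample(all_combinations, 10)

n_iter, cv = 10, 5
total_fits = n_iter * cv

print("Full grid size:", len(all_combinations))
print("Combinations sampled:", len(sampled))
print("Cross-validation fits:", total_fits)
print("Fits an exhaustive search would need:", len(all_combinations) * cv)
```

An exhaustive search over the same grid would need 400 fits, so the randomized search does one eighth of the work in exchange for covering only a sample of the space.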
The same mechanism works with any scoring metric. To optimize for precision instead of MSE:
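```python
from sklearn.metrics import precision_score

# precision_score with default settings assumes a binary target
precision_scorer = make_scorer(precision_score)

rs = RandomizedSearchCV(
    estimator=rfc,
    param_distributions=param_dist,
    scoring=precision_scorer,
    cv=5,
    n_iter=10,
    random_state=1111)

# Search on the training data only, so the test set stays untouched
rs.fit(X_train, y_train)

print("Scores per combination: {}".format(rs.cv_results_['mean_test_score']))
print("Best score: {}".format(rs.best_score_))
```
rs.fit() runs all 50 cross-validation fits, records every result, and then refits the best combination once on the data it was given. cv_results_['mean_test_score'] is the leaderboard: one mean cross-validated score per combination tried. best_score_ is the top entry. random_state=1111 ensures the same 10 combinations are sampled every run, making the search reproducible. The search is fitted on the training data rather than the full dataset so that the test set plays no part in choosing hyperparameters.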
| | GridSearchCV | RandomizedSearchCV |
|---|---|---|
| Tests | Every combination | Random sample |
| Speed | Slow on large grids | Faster |
| Coverage | Exhaustive | Approximate |
| Use when | Small parameter space | Large parameter space |
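For comparison, an exhaustive search over a small grid might look like the sketch below. It runs on synthetic data, and the grid values, dataset, and variable names are assumptions chosen to keep the example quick; the built-in "neg_mean_squared_error" scorer is used so that a higher score means a lower error.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption for illustration).
X_demo, y_demo = make_regression(
    n_samples=120, n_features=6, noise=10.0, random_state=1111)

# A deliberately small grid: 2 x 3 = 6 combinations.
param_grid = {"max_depth": [2, 4], "max_features": [2, 4, 6]}

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=1111),
    param_grid=param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X_demo, y_demo)

print("Combinations tried:", len(grid_search.cv_results_["params"]))
print("Best parameters:", grid_search.best_params_)
print("Best mean MSE: {0:.1f}".format(-grid_search.best_score_))
```

Every one of the six combinations is evaluated, which is affordable here but grows multiplicatively with each hyperparameter added to the grid.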
Common Pitfalls
Fitting a scaler or any transformation on the full dataset before splitting leaks test-set information into the training process. Always split first, then fit transforms on the training fold only.
Reporting only training error without evaluating on held-out data gives a falsely optimistic picture of real-world performance. Always report test-set results.
Using the test set to guide tuning decisions corrupts the final evaluation. The test set should be touched exactly once, at the very end. Tuning belongs on a validation set or inside cross-validation.
Not setting random_state means results can differ from run to run, making experiments hard to reproduce. Set it consistently.
Choosing between precision and recall without first defining the cost of each mistake leads to optimizing for the wrong objective. Define what a false positive and a false negative cost in your specific context before selecting the metric.
High accuracy is not always a good result. On an imbalanced dataset where one class makes up 95% of observations, a model that always predicts the majority class achieves 95% accuracy without learning anything. Precision and recall expose this failure that accuracy conceals.
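The 95% figure can be checked directly. The sketch below builds a made-up set of 100 labels in which the rare class (coded as 1) makes up 5%, and scores a "model" that always predicts the majority class.

```python
# Made-up labels: 5 rare positives (1) and 95 majority negatives (0).
y_true = [1] * 5 + [0] * 95

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print("Accuracy: {0:.2f}".format(accuracy))
print("Recall on the rare class: {0:.2f}".format(recall))
# Precision is undefined here: the model never predicts the positive class.
```

The accuracy looks strong at 0.95, yet recall on the rare class is zero: the model has not found a single one of the cases that matter.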
Quick Reference
```python
# Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

# Fit (model stands for either rfr or rfc)
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
rfr = RandomForestRegressor(n_estimators=100, max_depth=6, random_state=1111)
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
model.fit(X_train, y_train)

# Predict
model.predict(X_test)        # labels or continuous values
rfc.predict_proba(X_test)    # per-class probabilities
model.score(X_test, y_test)  # accuracy (classifier) or R² (regressor)

# Regression metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error
mean_absolute_error(y_test, predictions)
mean_squared_error(y_test, predictions)

# Classification metrics
from sklearn.metrics import confusion_matrix, precision_score, recall_score
confusion_matrix(y_test, predictions)  # cm[actual, predicted]
precision_score(y_test, predictions)
recall_score(y_test, predictions)

# Feature importances
for i, v in enumerate(rfr.feature_importances_):
    print(X_train.columns[i], round(v, 2))

# Cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
cv = cross_val_score(model, X_train, y_train, cv=10, scoring=make_scorer(mean_squared_error))
print(cv.mean())

# LOOCV
cross_val_score(model, X, y, cv=len(X), scoring=make_scorer(mean_absolute_error))

# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
param_dist = {"max_depth": [2, 4, 6, 8], "max_features": [2, 4, 6, 8]}
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)  # lower MSE ranks higher
rs = RandomizedSearchCV(estimator=model, param_distributions=param_dist,
                        n_iter=10, cv=5, scoring=mse_scorer, random_state=1111)
rs.fit(X_train, y_train)
print(rs.best_score_)
print(rs.cv_results_['mean_test_score'])
```
The workflow is always the same sequence: lock away the test set; train on the training portion; predict on held-out data; choose the right metric for the problem type; diagnose overfitting or underfitting by comparing train versus test error; cross-validate for a reliable estimate; tune hyperparameters to find the best configuration; report final performance on the test set exactly once.
See you soon.