A machine learning model is only ever as good as the data you feed it, and raw data almost never arrives in a state a model can use. It has missing values, columns stored as the wrong type, categories that need translating into numbers, features on wildly different scales, and far more columns than the model actually needs. Preprocessing is everything you do to fix that.
A useful way to picture it: imagine a giant box of LEGO bricks tipped onto the floor. Some are broken, some are dirty, some are duplicates, and a few belong to a different set entirely. Before you can build the spaceship on the box, you have to sort, clean, and select the right pieces. The model is the spaceship, the data is the LEGO, and everything in between is making sure the bricks actually snap together. This guide walks through that whole process, from cleaning through to building complete preprocessing pipelines.
Cleaning the Data
Everything starts with cleaning, and the two most common jobs are handling missing values and fixing column types.
A DataFrame is like a spreadsheet where some cells are empty. The isnull() method puts a marker on every empty cell, and .sum() counts them:
# Count missing values in one columnprint(df['category'].isnull().sum())# Count across several columns at onceprint(df[['time_text', 'state_code', 'posting_type']].isnull().sum())
To remove the bad rows, you ask the opposite question, “is this cell filled?”, using notnull(), and keep only the rows that answer yes. When several columns must all be present, you chain the conditions with &, which means AND, so a row survives only if every condition is true, like a bouncer checking ID, ticket, and dress code all at once:
# Keep rows where one column is presentdf_kept = df[df['category'].notnull()]# Keep rows where several columns are all presentdf_full = df[df['time_text'].notnull() & df['state_code'].notnull() & df['posting_type'].notnull()]print(df_full.shape)
The other cleaning job is fixing types. Pandas sometimes reads a column of numbers as text, like a calculator that thinks “42” is a word rather than a number, and you cannot do maths on words. astype() tells pandas to treat the column as the real type, and pd.to_datetime does the same for dates:
print(df.dtypes) # check every column's typedf['view_count'] = df['view_count'].astype('int') # text → integerdf['duration_sec'] = df['duration_sec'].astype(float) # text → floatdf['created_at'] = pd.to_datetime(df['created_at']) # string → datetimeprint(df[['duration_sec', 'created_at']].dtypes) # verify
Once a date string like “2024-01-15” becomes a real datetime, pandas understands it as an actual day on the calendar, so you can later ask “what month?” or “how many days ago?” That conversion unlocks a whole category of feature engineering we will come back to.
Splitting with Stratification
Before scaling or modelling, you split your data into training and test sets, and how you split matters more than it first appears. If your target has an uneven class distribution, say 80% of one label and 20% of another, a purely random split might leave your test set with almost none of the rare class, making it useless for measuring how well you predict that class.
Picture a bag of 80 red marbles and 20 blue. Reach in blindfolded for 25 and you might pull only one blue. Stratified sampling sorts the marbles by colour first and takes 25% from each pile, so both halves keep the original 80/20 ratio. That is exactly what stratify=y does:
from sklearn.model_selection import train_test_splitX = postings.drop("category", axis=1) # all featuresy = postings[["category"]] # target onlyX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)print(y_train["category"].value_counts()) # distribution preserved
For any imbalanced dataset, stratifying is not optional; it is what makes your evaluation trustworthy.
Normalising and Scaling
Models that rely on distances between data points, like K-Nearest Neighbours, are thrown off when features sit on very different scales. There are two common fixes, and which you use depends on the problem.
When a single column has extreme variance compared to the others, log normalisation helps. Variance measures how spread out the numbers are, and if most columns range from 0 to 10 but one ranges from 500 to 100,000, that giant column dominates everything. The logarithm is a zoom-out function that squishes large numbers: log(10) is 2.3, log(1000) is 6.9, log(100000) is 11.5, so tenfold jumps in the input become small steps in the output.
import numpy as npprint(blends['Caffeine'].var()) # e.g. 99,166 — very highblends['Caffeine_log'] = np.log(blends['Caffeine'])print(blends['Caffeine_log'].var()) # e.g. 0.17 — much better
One caveat: np.log needs positive values, so for columns containing zeros use np.log1p, which computes log(1 + x) and handles zero safely.
When several columns are simply on different scales, standardisation with StandardScaler is the tool. Suppose one column is height in centimetres (150 to 200) and another is income in dollars (20,000 to 200,000). The model has no idea these are different units; it just sees that income’s numbers are a thousand times larger and assumes income is a thousand times more important. Standardisation transforms every column so its average is 0 and its typical spread is 1, after which a value of 1.5 means “1.5 standard deviations above average” and means the same thing for height as for income:
from sklearn.preprocessing import StandardScalerscaler = StandardScaler()blend_subset = blends[['Acidity', 'Bitterness', 'Density']]blend_subset_scaled = scaler.fit_transform(blend_subset)
The method name fit_transform is two actions in one. The fit step learns the mean and standard deviation from the data, and the transform step subtracts the mean and divides by the standard deviation. Doing both at once is convenient, but only on training data, which leads to one of the most important rules in all of preprocessing.
Split First, Then Scale
You must split your data before scaling, and the reason is subtle but critical. Think of the test set as a sealed envelope you open only at the end to grade yourself. If you compute the mean using the test set’s numbers, you have peeked at the answers and leaked information from the test set into your training. The honest approach is to compute the mean and standard deviation from the training data alone, then apply that same mean and standard deviation to the test set. The test set never gets to influence its own scaling:
from sklearn.preprocessing import StandardScalerfrom sklearn.model_selection import train_test_split# Split FIRSTX_train, X_test, y_train, y_test = train_test_split(X, y)# Fit the scaler only on training datascaler = StandardScaler()X_train_scaled = scaler.fit_transform(X_train) # learn + apply on trainX_test_scaled = scaler.transform(X_test) # apply only on testknn.fit(X_train_scaled, y_train)print(knn.score(X_test_scaled, y_test))
This is worth flagging because a common shortcut, scaling the whole dataset before splitting, quietly leaks test-set statistics into training and inflates your scores. The correct order is always: split, then fit_transform on train, then transform on test.
Encoding Categorical Variables
Computers are maths machines; they understand numbers, not words like “Yes” and “No,” so categories must be translated. How you translate depends on how many categories there are.
For a column with just two values, binary encoding maps them to 0 and 1. Scikit-learn’s LabelEncoder does this automatically, assigning numbers in alphabetical order:
from sklearn.preprocessing import LabelEncoderencoder = LabelEncoder()routes['PetFriendly_enc'] = encoder.fit_transform(routes['PetFriendly'])print(routes[['PetFriendly', 'PetFriendly_enc']].head())# N → 0, Y → 1 (alphabetical)
When you want specific logic rather than alphabetical numbering, a lambda gives you control. A lambda is a tiny one-line function, and .apply() runs it on every cell. Here you specifically want to distinguish “is it the US?” rather than “what is this country’s alphabetical position?”:
incidents['nation_enc'] = incidents['nation'].apply(lambda val: 1 if val == 'us' else 0)
For columns with three or more categories, neither of those works well, and here is why. If you label-encoded Red as 0, Green as 1, and Blue as 2, the model would assume Blue is “twice as much” as Green, which is nonsense, because colours have no natural order. One-hot encoding avoids this by creating a separate 0/1 column for each category, so a red row becomes [1, 0, 0] and a green row [0, 1, 0], with no fake ordering implied:
category_dummies = pd.get_dummies(postings['category'])# For a multi-value column, then attach it backtype_dummies = pd.get_dummies(incidents['incident_type'])incidents = pd.concat([incidents, type_dummies], axis=1)
The rule of thumb: two categories take a lambda or a label encoder, and three or more take pd.get_dummies.
Engineering New Features
Encoding translates what is already there. Feature engineering creates new columns that expose information the model could not otherwise see.
You can aggregate several numeric columns into one. If each swimmer has five heat times, you might want their average as a single feature. The key is direction: axis=1 means “walk across the columns within one row,” so for each swimmer pandas grabs all five times and averages them:
heat_cols = ["heat1", "heat2", "heat3", "heat4", "heat5"]swimmers["avg_time"] = swimmers.apply( lambda row: row[heat_cols].mean(), axis=1)
A handy mnemonic is that axis=1 means “one row at a time,” whereas axis=0 walks down a column instead.
Dates are a rich source of features. A raw date is one chunk of information, but a model might learn more from its parts, since incidents or sales might spike in summer or every July. Once a string is a datetime object, it gains attributes you can pull out into their own columns:
incidents["logged_date"] = pd.to_datetime(incidents["logged_date"])incidents["month"] = incidents["logged_date"].apply(lambda row: row.month)incidents["year"] = incidents["logged_date"].apply(lambda row: row.year)
Beyond .month and .year, datetime objects offer .day, .hour, .weekday(), and .quarter, each a potential signal.
Messy text fields often hide numbers, and regular expressions, or regex, are a small pattern-matching language for extracting them. The pattern \d+\.\d+ reads as “one or more digits, a literal dot, one or more digits,” which captures a decimal like 3.5 from “3.5 miles”:
import redef extract_distance(text): pattern = re.compile(r"\d+\.\d+") match = re.match(pattern, text) if match is not None: return float(match.group(0))routes["Distance_num"] = routes["Distance"].apply(lambda row: extract_distance(row))
For whole numbers, the simpler pattern \d+ grabs one or more digits in a row, pulling “15” out of “15 minutes”:
def extract_minutes(time_string): pattern = re.compile(r"\d+") match = re.match(pattern, time_string) if match is not None: return int(match.group(0))incidents["mins"] = incidents["duration_text"].apply(extract_minutes)
Regex is far more robust than fragile alternatives like text.split(' ')[0], which breaks the moment the spacing or format changes.
Turning Text into Numbers with TF-IDF
When a feature is free text, you need a way to convert it into numbers a model can use, and TF-IDF is the standard approach. It stands for Term Frequency times Inverse Document Frequency, and it scores each word by how distinctive it is. Words that appear often in one document but rarely across all documents get a high score, because they are meaningful and specific. Words like “the,” “is,” and “and” appear everywhere, so they get a low score.
Imagine trying to work out what each book in a library is about from word counts alone. “The” appears in every book and tells you nothing, but “thermodynamics” appearing fifty times in one physics book and nowhere else is a giant fingerprint. TF-IDF gives every word a score that is high when it is frequent here but rare everywhere else:
from sklearn.feature_extraction.text import TfidfVectorizervectorizer = TfidfVectorizer()headline_tfidf = vectorizer.fit_transform(postings["headline"])
The result is a large table where rows are documents, columns are words, and each cell says how distinctive that word is for that document. It comes back as a sparse matrix, which simply means most cells are zero (most documents do not contain most words) and only the non-zero scores are stored, for efficiency.
Once text is numbers, you can feed it to a model. Naive Bayes is a natural fit for word-count features, asking “given the words I see, which category is most likely?” The .toarray() call converts the compact sparse format into the regular grid some models expect:
y = postings["category"]X_train, X_test, y_train, y_test = train_test_split( headline_tfidf.toarray(), y, stratify=y)nb.fit(X_train, y_train)print(nb.score(X_test, y_test))
Keeping Only the Most Informative Words
A TF-IDF matrix can have thousands of word columns, most of them noise. You can keep only the most distinctive words per document. The first helper returns the top N weighted words for a single document by pairing each word’s ID with its score, looking up the actual word, sorting, and taking the top few:
def top_words_for_doc(vocab, original_vocab, vector, vector_index, top_n): zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data)) zipped_series = pd.Series( {vocab[i]: zipped[i] for i in vector[vector_index].indices} ) zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index return [original_vocab[i] for i in zipped_index]
The second helper runs that across every document and collects the union of all their top words, using set() to remove duplicates, then slices the matrix down to just those columns:
def collect_top_words(vocab, original_vocab, vector, top_n): keep_list = [] for i in range(0, vector.shape[0]): top = top_words_for_doc(vocab, original_vocab, vector, i, top_n) keep_list.extend(top) return set(keep_list)keep_words = collect_top_words(vocab, vectorizer.vocabulary_, headline_tfidf, 3)filtered_tfidf = headline_tfidf[:, list(keep_words)]
Instead of ten thousand word columns you might end up with five hundred informative ones: smaller, faster, and with less noise, which often means better test accuracy because you have cut the clutter the model would otherwise try to fit.
Selecting Features
Not every column earns its place. Removing redundant features makes models faster and less prone to overfitting.
Some redundancy you spot with human judgement. If you already have a “town” column, you do not also need “area”; they say the same thing twice, and duplicate information just adds noise:
to_drop = ["area", "zone", "created_at", "slots", "category"]postings_subset = postings.drop(to_drop, axis=1)
Other redundancy is data-driven. Correlation measures how strongly two columns move together, and two columns that correlate near 1.0 or -1.0 carry essentially the same information, like height in centimetres and height in inches. The corr()method produces a table of every column against every other, and you drop one of each highly correlated pair:
print(blends.corr()) # values near ±1.0 flag redundancyblends = blends.drop("Aroma", axis=1)
The same logic applies to engineered columns: once you have replaced a raw column with an encoded or transformed version, there is no reason to keep both, so the original gets dropped.
Reducing Dimensions with PCA
When you have many correlated features, Principal Component Analysis, or PCA, compresses them into fewer components while preserving as much of the information, the variance, as possible.
Think of photographing a pencil. The pencil is three-dimensional, but from the right angle a flat photo captures nearly everything you need, its length and orientation, reducing three dimensions to two with almost no loss. PCA does this in higher dimensions, finding the directions along which the data varies most and letting you keep only the top few:
from sklearn.decomposition import PCApca = PCA()blends_X = blends.drop("Grade", axis=1) # features onlytransformed_X = pca.fit_transform(blends_X)print(pca.explained_variance_ratio_) # e.g. [0.998, 0.001, ...]
If the first two or three components explain over 95% of the variance, you can drop the rest, train faster, and reduce overfitting with barely any loss. One important caveat: PCA is sensitive to scale, so always run StandardScaler before PCA, or the column with the largest numeric range will dominate the components purely because of its units.
After transforming, you train a model on the smaller, uncorrelated feature set exactly as normal. KNN especially benefits, because it is distance-based and suffers in high dimensions, a problem known as the curse of dimensionality:
knn.fit(pca_X_train, y_train)print(knn.score(pca_X_test, y_test))
Putting It All Together
The real value of these techniques shows when you chain them into a pipeline tuned to the shape of your data.
For a purely numerical dataset, the assembly line tames the wild high-variance column with a log, cuts a redundant correlated column, puts everything on a common scale, then splits, trains, and scores. The cleaner version splits before scaling, as discussed earlier:
blends['Caffeine_log'] = np.log(blends['Caffeine']) # tame varianceblends = blends.drop("Aroma", axis=1) # cut redundancyX_train, X_test, y_train, y_test = train_test_split(X, y)scaler = StandardScaler()X_train_scaled = scaler.fit_transform(X_train)X_test_scaled = scaler.transform(X_test)knn.fit(X_train_scaled, y_train)print(knn.score(X_test_scaled, y_test))
A dataset of text and categories looks completely different. You drop rows with a missing target, one-hot encode the categories, vectorise the text with TF-IDF, filter to the most informative words, then train Naive Bayes on a stratified split, which matters because the posting categories are usually imbalanced:
postings = postings[postings['category'].notnull()]category_dummies = pd.get_dummies(postings['category'])vectorizer = TfidfVectorizer()headline_tfidf = vectorizer.fit_transform(postings['headline'])keep_words = collect_top_words(vocab, vectorizer.vocabulary_, headline_tfidf, 3)filtered_tfidf = headline_tfidf[:, list(keep_words)]X_train, X_test, y_train, y_test = train_test_split( filtered_tfidf.toarray(), y, stratify=y)nb.fit(X_train, y_train)print(nb.score(X_test, y_test))
A full end-to-end dataset can even feed two complementary models. After fixing types, dropping incomplete rows, engineering features from messy text and dates, encoding categories both binary and one-hot, and vectorising a description field, you drop the original columns you have already replaced and train KNN on the structured numeric features and Naive Bayes on the text features. Each model sees a different view of the same record:
incidents['elapsed_sec'] = incidents['elapsed_sec'].astype(float)incidents['logged_date'] = pd.to_datetime(incidents['logged_date'])incidents = incidents[incidents['duration_text'].notnull() & incidents['province'].notnull() & incidents['incident_type'].notnull()]incidents['mins'] = incidents['duration_text'].apply(extract_minutes)incidents['elapsed_sec_log'] = np.log(incidents['elapsed_sec'])incidents['month'] = incidents['logged_date'].apply(lambda row: row.month)incidents['year'] = incidents['logged_date'].apply(lambda row: row.year)incidents['nation_enc'] = incidents['nation'].apply(lambda val: 1 if val == 'us' else 0)type_dummies = pd.get_dummies(incidents['incident_type'])incidents = pd.concat([incidents, type_dummies], axis=1)to_drop = ['nation', 'longitude', 'latitude', 'town', 'province', 'logged_date', 'narrative', 'duration_text', 'mins', 'elapsed_sec', 'archived']incidents_dropped = incidents.drop(to_drop, axis=1)
Dropping the originals after replacing them matters, because keeping both the raw and the engineered version is double-counting the same information.
Conclusion
Preprocessing is the work that turns raw data into something a model can actually learn from, and it follows a consistent arc. Clean the data by handling missing values and fixing types. Split before you scale, stratifying when classes are imbalanced, so your evaluation stays honest and no test-set information leaks into training. Bring features onto comparable footing with log normalisation or standardisation. Translate categories into numbers, using one-hot encoding for anything with more than two values to avoid implying a false order. Engineer new features from dates, strings, and aggregates to expose hidden signal, and vectorise text with TF-IDF. Then trim the result down, dropping redundant and correlated columns and reaching for PCA when you have too many features. Chain these into a pipeline matched to your data, and the model you build on top stands the best possible chance of working.
[…] Feature Engineering and Preprocessing: Getting Data Ready for a Model […]
[…] Feature Engineering and Preprocessing: Getting Data Ready for a Model […]