Dimensionality Reduction in Python: A Guide

More features sound like more information, but past a point they become a liability. As the number of columns grows, you need exponentially more data to avoid overfitting, training slows, irrelevant features inject noise that hurts generalization, and you lose any hope of plotting the data, since the eye tops out at two or three dimensions. Dimensionality reduction is the practice of shrinking a dataset while keeping most of its signal, and it splits into two families. Feature selection keeps a subset of the original columns, which stays interpretable. Feature extraction builds new columns from combinations of the originals, which is more powerful but harder to read. This article walks through both, from the quick manual checks to PCA.

Dropping What Carries No Information

The cheapest win is removing features with no variance, because a column that holds the same value in every row tells a model nothing. The first pass is simply selecting the columns worth keeping, the numeric ones with real spread plus any labels you need for identification, and quietly discarding the rest.

			
number_cols = ['speed', 'strength', 'stamina']
label_cols = ['name', 'type']
selected = players_df[number_cols + label_cols]

To find low-information columns systematically, .describe() or .var() flags anything with variance near zero as a candidate for removal.
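
As a minimal sketch of that check, the snippet below builds a small hypothetical frame (the column names and values are invented for illustration) and lists the numeric columns whose variance is exactly zero.

```python
import pandas as pd

# Hypothetical data: 'n_legs' is constant, so it carries no information.
toy_df = pd.DataFrame({
    'speed': [61, 74, 55, 90],
    'stamina': [40, 42, 80, 65],
    'n_legs': [2, 2, 2, 2],
})

# Variance per numeric column; a constant column has variance exactly 0.
variances = toy_df.var(numeric_only=True)
zero_var_cols = variances[variances == 0].index.tolist()

# Drop the constant columns and keep the rest.
reduced = toy_df.drop(columns=zero_var_cols)
print(variances)
print(zero_var_cols)
```

On real data you would usually compare against a small positive cut-off rather than exactly zero, since near-constant columns are just as uninformative.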

Redundancy is the next target, and a pairplot is the fastest way to spot it visually. If two features trace a straight line against each other, one is derivable from the other and can go; if a column shows the same value everywhere, it has no variance; and two identical distributions usually mean duplicates.

			
import seaborn as sns
import matplotlib.pyplot as plt
reduced = measures_df.drop('n_limbs', axis=1)
sns.pairplot(reduced, hue='sex', diag_kind='hist')
plt.show()

Colouring by a category with hue also reveals whether the classes separate in any feature pair, which hints at what will be predictive later.

Seeing High-Dimensional Data with t-SNE

When the data lives in dozens of dimensions, t-SNE projects it down to two so you can actually look at it, placing similar points near each other and dissimilar ones far apart. One critical limitation: it is for visualization only, with no transform method for new data, just fit_transform.

			
from sklearn.manifold import TSNE
numeric = measures_df.drop(['branch', 'sex', 'component'], axis=1)
embedding = TSNE(learning_rate=50).fit_transform(numeric)

Think of t-SNE as a careful map-maker: it studies who resembles whom across all the original dimensions, then arranges everyone on a flat sheet so the neighbourhoods are preserved. Adding the two output columns back to the frame and colouring by a category shows whether the high-dimensional data naturally separates by that category, even though the category was never part of the compression. The learning_rate, typically between 50 and 300, controls the spread: too low and points clump together, too high and they scatter uniformly.
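
The step of adding the two output columns back to the frame can be sketched as follows. The data here is synthetic (two invented groups in ten dimensions), and the perplexity and random_state values are assumptions chosen so the example runs quickly on 60 rows; they are not tuned recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical data: two groups of 30 points each in 10 dimensions.
group_a = rng.normal(loc=0.0, size=(30, 10))
group_b = rng.normal(loc=5.0, size=(30, 10))
toy_df = pd.DataFrame(
    np.vstack([group_a, group_b]),
    columns=[f'f{i}' for i in range(10)],
)
toy_df['group'] = ['a'] * 30 + ['b'] * 30

# Fit on the numeric columns only; the label is never shown to t-SNE.
numeric = toy_df.drop('group', axis=1)
tsne = TSNE(learning_rate=50, perplexity=10, random_state=0)
embedding = tsne.fit_transform(numeric)

# Attach the two embedding coordinates so they can be plotted by category.
toy_df['x'] = embedding[:, 0]
toy_df['y'] = embedding[:, 1]
print(toy_df[['x', 'y', 'group']].head())
```

A scatter plot of x against y coloured by group (for example with seaborn's scatterplot and its hue argument) then shows whether the groups separate.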

Why Fewer Features Reduce Overfitting

The clearest motivation for reduction is overfitting, which shows up as a large gap between training and test accuracy. A train/test split is the instrument that exposes it: you lock away part of the data, train on the rest, and judge the model on what it never saw.

			
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
y = measures_df['sex']
X = measures_df.drop('sex', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
svc = SVC()
svc.fit(X_train, y_train)
train_acc = accuracy_score(y_train, svc.predict(X_train))
test_acc = accuracy_score(y_test, svc.predict(X_test))

		

If the model scores near-perfectly on the training data but poorly on the test set, it has memorised rather than learned, and that gap is your overfitting detector. Reducing dimensions shrinks the gap, sometimes dramatically. Trained on a hundred body measurements an SVM might hit 99% on training and 50% on test, but trained on a single well-chosen feature it might land at 70% on both. The lower headline number hides a real improvement: a simple model that generalizes beats a complex one that memorizes, and the tighter the train-test gap, the more you can trust the result.
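
As a rough illustration of that gap, here is a sketch on synthetic data rather than the body-measurement dataset. It trains one SVC on a single informative feature plus 100 columns of pure noise, and another on the informative feature alone. The data-generating recipe and the helper name train_test_gap are assumptions made for the example, and the exact accuracies will vary with the seed and library version.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200

# One informative feature; the label is a noisy function of it.
signal = rng.normal(size=n)
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

# 100 irrelevant features that carry no information about y.
noise = rng.normal(size=(n, 100))
X_full = np.column_stack([signal, noise])
X_one = signal.reshape(-1, 1)

def train_test_gap(X, y):
    """Fit an SVC and return (train accuracy, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )
    svc = SVC().fit(X_train, y_train)
    return svc.score(X_train, y_train), svc.score(X_test, y_test)

full_train, full_test = train_test_gap(X_full, y)
one_train, one_test = train_test_gap(X_one, y)
print(f"101 features: train={full_train:.2f}, test={full_test:.2f}")
print(f"1 feature:    train={one_train:.2f}, test={one_test:.2f}")
```

Comparing the two printed lines shows how much of the training score survives on unseen data in each case.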

Automatic Selection by Variance and Missingness

To remove low-variance features without eyeballing them, VarianceThreshold does it by rule, but variances are only comparable once features are on the same scale, so normalize first by dividing each column by its mean. That turns every feature into a scale-free ratio where one equals the typical value, and the variances can then be fairly compared.

			
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.001)
sel.fit(head_measures_df / head_measures_df.mean())
reduced = head_measures_df.loc[:, sel.get_support()]

The selector acts like a bouncer: you set the bar, it checks each column’s normalized variance, and anything below the bar is turned away, with get_support() returning the boolean list of who got in.
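
To make the bouncer concrete, here is a small worked sketch with two invented columns: one that barely moves around its mean and one with real spread. The frame and its column names are hypothetical; the threshold is the same 0.001 used above.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical measurements: the first column is almost constant.
toy_df = pd.DataFrame({
    'nearly_constant': [10.0, 10.01, 9.99, 10.0],
    'spread_out': [1.0, 2.0, 3.0, 4.0],
})

# Divide by the mean so variances are comparable across scales.
normalized = toy_df / toy_df.mean()

sel = VarianceThreshold(threshold=0.001)
sel.fit(normalized)

# get_support() is a boolean mask over the columns that passed.
kept = toy_df.loc[:, sel.get_support()]
print(sel.variances_)
print(list(kept.columns))
```

The normalized variance of the first column is far below the bar, so only spread_out is kept.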

A related quick filter handles missing data. A feature that is mostly empty is rarely worth imputing, since filling more than half a column is mostly guessing, so a simple mask drops columns above a missingness threshold.

			
mask = schools_df.isna().sum() / len(schools_df) < 0.5
reduced = schools_df.loc[:, mask]

The rule of thumb is to consider dropping any feature more than thirty to fifty percent missing, unless it is genuinely critical.
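
A tiny worked example of the missingness mask, using an invented frame in which one column is 75% empty and another 25% empty (all names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical school data with different amounts of missingness.
toy_df = pd.DataFrame({
    'enrolment': [120, 340, 95, 410],
    'library_size': [np.nan, np.nan, np.nan, 5000],  # 75% missing
    'budget': [1.2, np.nan, 0.8, 2.5],               # 25% missing
})

# Share of missing values per column (same as isna().sum() / len(df)).
missing_share = toy_df.isna().mean()

# Keep only columns that are less than half empty.
mask = missing_share < 0.5
reduced = toy_df.loc[:, mask]
print(missing_share)
print(list(reduced.columns))
```

Lowering the cut-off from 0.5 towards 0.3 makes the filter stricter, in line with the rule of thumb above.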

Removing Redundancy by Correlation

Highly correlated features carry overlapping information, so keeping both is wasteful. A correlation heatmap reveals the pairs, and masking the upper triangle avoids reading each pair twice.

			
import numpy as np
corr = measures_df.select_dtypes('number').corr()  # correlations are defined for numeric columns only
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, cmap='coolwarm', center=0, annot=True, fmt='.2f')
plt.show()

		

To automate the cull, take the absolute correlations, blank out the upper triangle so each pair appears once, and drop any feature that correlates above a high threshold with another.

			
corr_matrix = measures_df.select_dtypes('number').corr().abs()  # numeric columns only
tri = corr_matrix.mask(np.triu(np.ones_like(corr_matrix, dtype=bool)))
to_drop = [c for c in tri.columns if any(tri[c] > 0.95)]
reduced = measures_df.drop(to_drop, axis=1)

The masking matters. If you scanned the full matrix naively, every column would match itself on the diagonal and both members of each correlated pair would be flagged, so you would drop both instead of one. Masking the upper triangle and the diagonal leaves each pair visible exactly once, in the lower triangle, which is what keeps one feature of each pair. A permanent caution applies here too, because correlation is not causation. Two unrelated time series, such as streaming hours and avocado imports, can correlate strongly simply because both trend upward over the years, and a naive algorithm would treat one as a predictor of the other. A high correlation coefficient is a fact about the numbers, not the world, so always sanity-check with domain knowledge.
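
The difference between a naive scan and a masked scan is easy to see on a toy frame. In this sketch the columns a, b and c are invented: b is almost exactly twice a, and c is unrelated to both.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'b': [2.1, 3.9, 6.0, 8.2, 9.9, 12.0],  # almost exactly 2 * a
    'c': [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],   # unrelated to a and b
})
corr_matrix = toy_df.corr().abs()

# Naive scan: hide only the diagonal, so each pair is seen from both sides.
off_diagonal = corr_matrix.mask(np.eye(len(corr_matrix), dtype=bool))
naive_drop = [c for c in off_diagonal.columns if any(off_diagonal[c] > 0.95)]

# Masked scan: hide the upper triangle and the diagonal, so each pair is seen once.
tri = corr_matrix.mask(np.triu(np.ones_like(corr_matrix, dtype=bool)))
to_drop = [c for c in tri.columns if any(tri[c] > 0.95)]

print(naive_drop)
print(to_drop)
```

The naive scan flags both a and b, while the masked scan flags only a, leaving b to carry the shared information.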

Letting a Model Pick the Features

Models that assign a weight per feature can rank importance directly. Logistic regression is the simplest case, but its coefficients are only comparable once the features are scaled.

			
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
scaler = StandardScaler()
lr = LogisticRegression()
lr.fit(scaler.fit_transform(X_train), y_train)
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

		

A large absolute coefficient means the model leaned on that feature; one near zero means it barely used it. You can remove the weakest feature, refit, and repeat, which is exactly what Recursive Feature Elimination automates.

			
from sklearn.feature_selection import RFE
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3, verbose=1)
rfe.fit(scaler.fit_transform(X_train), y_train)  # scale first so coefficients are comparable
print(X.columns[rfe.support_])

RFE is like pruning a bush one weakest branch at a time, re-examining the whole plant after each cut because removing one branch can change which remaining one looks weakest. Its support_ attribute marks the surviving features, and ranking_ records the order of elimination: survivors get rank 1, and a higher rank means the feature was dropped earlier. Crucially, it re-evaluates importances at every step rather than judging once.
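
Here is a self-contained sketch of what support_ and ranking_ look like after a fit. It uses synthetic data from make_classification as a stand-in (eight features, three of them informative), which is an assumption of the example and not the dataset used elsewhere in this article.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 8 features, 3 of which are informative.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)
X_scaled = StandardScaler().fit_transform(X_toy)

# Eliminate one feature per round until three remain.
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3)
rfe.fit(X_scaled, y_toy)

print(rfe.support_)  # True for the three surviving features
print(rfe.ranking_)  # 1 for survivors; larger numbers were eliminated earlier
```

With eight features and three survivors, five features are eliminated, so the ranks run from 1 (kept) up to 6 (first to go).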

Tree-based models offer a different importance signal. A random forest scores each feature by how much it reduces impurity across all its trees, and you can select by a threshold or, more carefully, with RFE.

			
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)
print(dict(zip(X.columns, rf.feature_importances_.round(2))))

A single threshold like keeping features above 0.15 is fast but crude, because importances shift once weaker features are removed; wrapping the forest in RFE and removing a few features per step recomputes those importances each round, which is more principled. This connects to the broader ensemble methods toolkit, where the same forests do the predicting.
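
The two approaches can be put side by side in a short sketch. The data is synthetic, and the forest size, the step of 2 and the target of four features are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data: 10 features, 3 of which are informative.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# One-shot selection: a single fit and a single importance threshold.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_toy, y_toy)
threshold_mask = rf.feature_importances_ > 0.15

# Recursive selection: drop 2 features per round, refitting the forest each time.
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=4,
    step=2,
)
rfe.fit(X_toy, y_toy)

print(threshold_mask)
print(rfe.support_)
```

A larger step trades some care for speed, since the forest is refit fewer times.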

Lasso: Selection Built into the Model

Lasso is linear regression with an L1 penalty that charges a fee proportional to the absolute size of every coefficient, and that fee drives weak coefficients to exactly zero, removing those features automatically. Like the others it needs scaled inputs.

			
from sklearn.linear_model import Lasso
la = Lasso(alpha=0.1, random_state=0)
la.fit(scaler.fit_transform(X_train), y_train)
n_ignored = sum(la.coef_ == 0)

The strength of the penalty is set by alpha: a high alpha is a tight budget that zeros out most features for a very simple model, while a low alpha is a loose budget that keeps more. Rather than guess, LassoCV searches many alpha values by cross-validation and picks the best.

			
from sklearn.linear_model import LassoCV
lcv = LassoCV()
lcv.fit(scaler.fit_transform(X_train), y_train)  # Lasso needs scaled inputs
selected = lcv.coef_ != 0
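
The effect of alpha on the number of zeroed coefficients can be seen with a small sweep. This is a sketch on synthetic regression data; standardizing the target as well as the features is an assumption made here so that the alpha values are on a predictable scale.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 10 features, 3 of which drive the target.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=10.0, random_state=0,
)
X_scaled = StandardScaler().fit_transform(X_toy)
# Assumption for this sketch: put the target on unit scale too.
y_scaled = (y_toy - y_toy.mean()) / y_toy.std()

# Count how many coefficients are exactly zero at each penalty strength.
zero_counts = {}
for alpha in [0.01, 0.1, 1.0]:
    la = Lasso(alpha=alpha).fit(X_scaled, y_scaled)
    zero_counts[alpha] = int(np.sum(la.coef_ == 0))

print(zero_counts)
```

With both sides standardized, an alpha of 1.0 is already a budget so tight that every coefficient is zeroed, while the smaller values let more features through.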

Because different models disagree about which features matter, combining their verdicts is more robust than trusting any one. You run several selectors (say a LassoCV, an RFE wrapped around a random forest, and an RFE wrapped around gradient boosting), collect each one’s boolean mask, sum the masks, and keep only the features that every selector chose.

			
votes = np.sum([lcv_mask, rf_mask, gb_mask], axis=0)
consensus = votes >= 3
X_reduced = X.loc[:, consensus]

If three very different models independently agree a feature matters, it almost certainly does, and unanimous agreement is a far stronger signal than one model’s opinion.
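
The vote itself is plain arithmetic on boolean arrays. In this sketch the three masks are written out by hand (hypothetical feature names and verdicts) so the counting is easy to follow; in practice each mask would come from a fitted selector.

```python
import numpy as np

# Hypothetical features and the verdict of each selector on them.
feature_names = np.array(['height', 'weight', 'arm_span', 'shoe_size'])
lcv_mask = np.array([True, True, False, True])
rf_mask = np.array([True, False, False, True])
gb_mask = np.array([True, True, True, True])

# Each True counts as one vote.
votes = np.sum([lcv_mask, rf_mask, gb_mask], axis=0)

# Unanimous agreement, as in the text; votes >= 2 would be a majority rule.
consensus = votes >= 3
majority = votes >= 2

print(votes)
print(feature_names[consensus])
print(feature_names[majority])
```

Relaxing the rule from unanimity to a majority keeps more features at the cost of a weaker guarantee.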

Building Better Features by Hand

Feature extraction does not have to be fancy. When domain knowledge tells you several columns encode the same underlying quantity, you can replace them with a single summary. If you have revenue and quantity, their ratio is unit price, and that one column often captures what a model needs from the two.

			
sales_df['price'] = sales_df['revenue'] / sales_df['quantity']
reduced = sales_df.drop(['revenue', 'quantity'], axis=1)

Likewise, three noisy repeated measurements of the same thing collapse into a single, more reliable average.

			
height_df['height'] = height_df[['height_1', 'height_2', 'height_3']].mean(axis=1)
reduced = height_df.drop(['height_1', 'height_2', 'height_3'], axis=1)

Two or three redundant columns become one better column, with no real information lost, which is the whole idea of manual extraction. This is the same spirit as the work in the feature engineering article.

Principal Component Analysis

PCA is the workhorse of feature extraction. It builds new, uncorrelated features called principal components, each a linear combination of the originals, ordered so the first captures the most variance, the second the next most, and so on. It must always be preceded by scaling, or features with large magnitudes dominate the components purely because of their units.

			
from sklearn.decomposition import PCA
scaler = StandardScaler()
std = scaler.fit_transform(measures_df.select_dtypes('number'))  # scale the numeric columns only
pca = PCA()
components = pca.fit_transform(std)

		

Think of PCA as finding the best camera angles on a sculpture: it rotates the axes to align with the directions of greatest spread, so the new axes are uncorrelated by design. The cumulative sum of the explained-variance ratios tells you how much of the story you keep as you add components.

print(pca.explained_variance_ratio_.cumsum())

If the first three components reach eighty percent, you can keep three and discard the rest with little loss. Each component is also interpretable through its loadings, the weights it places on the original features, which you can read off a fitted pipeline.

			
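from sklearn.pipeline import Pipeline
pipe = Pipeline([('scaler', StandardScaler()), ('reducer', PCA(n_components=2))])
# Fit on the numeric stats only; label columns such as name cannot be scaled
stats = players_df.select_dtypes('number')
pipe.fit(stats)
loadings = pipe.named_steps['reducer'].components_.round(2)
print(dict(zip(stats.columns, loadings[0])))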

		

If the first component loads positively on every stat, it captures overall strength; if the second loads positively on attack and negatively on defense, it captures an offense-versus-defense trade-off, a meaning PCA discovered without knowing anything about the domain. Dropped into a full pipeline, PCA becomes a clean modelling step that also enforces good hygiene, since the scaler and PCA learn only from the training data.

			
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reducer', PCA(n_components=3)),
    ('classifier', RandomForestClassifier(random_state=0))
])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)

		

Choosing how many components to keep has two clean approaches. You can hand PCA a fraction and let it select enough components to reach it, passing n_components=0.9 to retain ninety percent of the variance, which is like telling a packing service to fit everything that matters into the bag rather than specifying the bag size. Or you can plot the explained-variance ratio against the component index and look for the elbow, the bend where the steep drop flattens into a tail, keeping the components before it and discarding the diminishing returns after.
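
The variance-target approach can be sketched in two ways that should agree: passing a fraction to PCA, and counting by hand from the cumulative explained variance of a full fit. The data is synthetic (twelve features, several of them redundant by construction), which is an assumption of the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with deliberately redundant features.
X_toy, _ = make_classification(
    n_samples=300, n_features=12, n_informative=4,
    n_redundant=6, random_state=0,
)
std = StandardScaler().fit_transform(X_toy)

# Approach 1: pass a fraction and let PCA pick the number of components.
pca_90 = PCA(n_components=0.9).fit(std)

# The same count by hand: first position where the running total reaches 90%.
cumulative = PCA().fit(std).explained_variance_ratio_.cumsum()
n_by_hand = int(np.argmax(cumulative >= 0.9)) + 1

print(pca_90.n_components_, n_by_hand)
print(cumulative.round(3))
```

Plotting the per-component ratios from the full fit against their index gives the curve on which to look for the elbow.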

PCA even compresses images, where each pixel is a feature. A twenty-eight by twenty-eight image is 784 numbers, and PCA can re-describe it as a recipe of perhaps seventy-eight learned visual patterns that together capture most of the variation, a tenfold compression. Running inverse_transform on the components reconstructs a slightly blurry version of the original, the blur being precisely the small fraction of detail traded away for compactness.
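
A round trip through PCA can be sketched with scikit-learn's bundled digits dataset. Note the swap: these are 8 by 8 images (64 pixels) rather than the 28 by 28 images described above, and keeping 16 components is an arbitrary choice that gives a fourfold compression. Scaling is skipped here on the assumption that all pixels already share the same intensity range.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Each row is one 8x8 image flattened to 64 pixel intensities.
digits = load_digits()
X_img = digits.data

# Learn 16 visual patterns and describe each image as a mix of them.
pca = PCA(n_components=16).fit(X_img)
compressed = pca.transform(X_img)

# Rebuild approximate images from the 16 numbers per image.
reconstructed = pca.inverse_transform(compressed)

# Mean squared pixel error: the detail traded away for compactness.
mse = np.mean((X_img - reconstructed) ** 2)
print(compressed.shape, reconstructed.shape)
print(f"reconstruction MSE: {mse:.3f}")
```

Reshaping a row of reconstructed back to 8 by 8 and displaying it next to the original makes the blur visible.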

Selection or Extraction?

The choice between the two families comes down to one question: do you need the original features to stay interpretable? If yes, use feature selection, dropping low-variance or heavily missing columns, filtering by correlation, or letting Lasso or RFE pick, and combining several selectors when robustness matters most. If you only need fewer dimensions and can accept less interpretable mixtures, use feature extraction, reaching for PCA when you want a reversible linear method that also works on new data, t-SNE when you only need a two-dimensional picture, and manual ratios or means when domain knowledge points the way. Selection keeps a subset of the originals; extraction builds new columns from them.

The Pitfalls That Recur

A handful of mistakes account for most dimensionality-reduction bugs. Forgetting to scale before PCA or Lasso lets large-magnitude features dominate, so always standardize first. Treating t-SNE as a modelling tool fails because it has no transform method and exists only for visualization. Filtering correlated features without a triangle mask removes both members of every pair instead of one. Reading correlation as causation invites spurious predictors. Running any feature selection on the full dataset before splitting leaks test information into the choice of features, so split first and fit selectors on training data alone. Keeping too few components throws away real signal, while keeping too many defeats the purpose, so let the cumulative variance or the elbow guide you. And selecting by a single importance threshold is fragile, because importances shift as features are removed, which is exactly why the recursive approach is more trustworthy.

Conclusion

Dimensionality reduction makes datasets smaller, faster, less prone to overfitting, and possible to visualize. Start with the cheap wins, dropping zero-variance, mostly-missing, and highly correlated columns, then escalate to model-driven selection with Lasso, RFE, or tree importances, combining several selectors when you want a robust consensus. When interpretability is not required, extract instead: build features by hand where domain knowledge allows, and use PCA, always after scaling, to compress many correlated features into a few uncorrelated components, choosing the number by a variance target or the elbow. Throughout, split before you select to avoid leakage, scale before PCA and Lasso, and remember that the goal is never fewer columns for their own sake, but a simpler model that generalizes better than the crowded one it replaces.

See you soon.
