A Practical Guide to Exploratory Data Analysis in Python

Exploratory Data Analysis (EDA) is essential for understanding datasets before drawing conclusions. This guide outlines the Python EDA workflow, detailing techniques for validating data, handling missing values, detecting outliers, and visualising findings effectively.

Before building a model, writing a report, or drawing any conclusion from a dataset, there is a step that separates analysts who get it right from those who do not. Exploratory Data Analysis is the process of actually understanding your data before you trust it. It is how you find the missing values, the type mismatches, the outliers that will silently wreck a model, and the distributional patterns that tell you what the data is really saying.

This guide walks through the full EDA workflow in Python, from the first five-line inspection to outlier removal and visualisation.

The First Look

Every dataset deserves the same four-line introduction before anything else happens.

df.head() # first 5 rows
df.info() # column names, types, non-null counts
df.describe() # summary stats for numeric columns
df["status"].value_counts() # frequency of each category
df["status"].value_counts(normalize=True) # proportions instead of counts

head() gives you a concrete preview of what the data looks like. info() is the metadata layer: column names, data types, and how many non-null values each column holds. Gaps in non-null counts are missing values. Type mismatches show up here too, a column that should be numeric but reads as object is a red flag. describe() runs a battery of summary statistics across every numeric column so outliers and unusual ranges surface immediately. value_counts() answers how often each category appears in a column. Adding normalize=True converts raw counts to proportions, which is more useful when you are checking class balance.

These five calls together take under a minute and give you a working mental model of any new dataset.

Validating the Data

The first validation step is a simple range check. Do the smallest and largest values in a column make sense for what that column is supposed to represent?

df["age"].min()
df["age"].max()

A maximum age of 250 or a negative salary signals dirty data immediately. Cheaper to catch here than after building three months of analysis on top of it.

When you need to filter out specific categories, .isin() combined with the ~ operator is the standard pattern:

not_apac = ~df["region"].isin(["APAC"])
df[not_apac] # all rows except APAC

.isin() produces a True/False value per row. The tilde flips every True to False and every False to True, so the filter reads: keep every row whose region is not in this list.

For a two-dimensional view of how a numeric variable behaves across two categorical dimensions, pd.crosstab with aggregation is more useful than filtering manually:

pd.crosstab(df["role_type"], df["org_size"],
values=df["base_salary"], aggfunc="mean")

By default, crosstab counts occurrences. Adding values and aggfunc changes each cell from a count to an aggregation of a third column. The result here is the average base salary for each combination of role type and organisation size, laid out as a grid.

Grouped Summaries

agg is how you compute multiple statistics in a single pass. The cleanest form is named aggregation, where you control the output column names directly:

df.groupby("region").agg(
mean_revenue = ("q2_revenue", "mean"),
std_revenue = ("q2_revenue", "std")
)

Each keyword argument becomes a column name in the output. The tuple pairs a source column with the aggregation function. This is preferable to passing a list of function names, which produces multi-level column names that are awkward to work with downstream.

When you want to add a group-level statistic back to the original DataFrame rather than collapsing it, transform is the right tool:

df["carrier_fare_std"] = df.groupby("carrier")["fare"].transform(lambda x: x.std())
df["carrier_median_hrs"] = df.groupby("carrier")["flight_hrs"].transform(lambda x: x.median())
df["destination_avg_fare"] = df.groupby("destination")["fare"].transform(lambda x: x.mean())

While agg returns one row per group, transform broadcasts the group statistic back to every row in that group. The DataFrame keeps its original shape, and each row gains a new column showing the group-level value it belongs to. This is the standard approach for feature engineering: adding columns like “this carrier’s average fare” to every individual booking row, so you can later compare each booking against its group’s baseline.

Handling Missing Values

The first step with missing data is knowing how much of it you have:

df.isna().sum()
df.isna().sum() / len(df)

isna() returns a boolean DataFrame where True marks every NaN. Summing column-wise converts those booleans to counts. Dividing by the row count gives proportions, which is what actually drives the decision about what to do next. Below roughly five percent missing, dropping the affected rows is usually acceptable. Above that, you need to think harder.

Dropping rows with few missing values:

threshold = len(df) * 0.05
cols_to_drop = df.columns[df.isna().sum() <= threshold]
df.dropna(subset=cols_to_drop, inplace=True)

This finds every column with fewer than five percent missing values and drops only the rows that have NaN in any of those columns. The data loss is small enough that it rarely biases analysis.

Dropping a column entirely:

df.drop(columns=["remarks"], inplace=True)

When a column has too many missing values to salvage, is redundant with another column, or contains unstructured text you cannot use, removing it entirely is cleaner than carrying a half-empty column through the rest of the analysis.

Imputing with group median:

group_medians = df.groupby("carrier")["fare"].median()
fares_dict = group_medians.to_dict()
df["fare"] = df["fare"].fillna(df["carrier"].map(fares_dict))

Filling all missing values with a single global median ignores the structure in the data. A budget carrier’s missing fare should be filled with a budget carrier’s median, not the overall median across all carriers. This approach computes a median per group, converts it to a lookup dictionary, then maps each row’s group to the appropriate fill value before passing it to fillna. Group-aware imputation preserves the differences between groups that a global fill would wash out.

Data Type Issues

String-typed columns require a different approach from numeric ones. This loop gives you a quick picture of the cardinality of every object column in the dataset:

non_numeric = df.select_dtypes("object")
for col in non_numeric.columns:
print(f"{col}: {non_numeric[col].nunique()} unique values")

Low cardinality (a handful of distinct values) usually means a categorical variable that can be encoded for modelling. High cardinality (thousands of values) usually means free text or identifiers that need separate handling.

When a numeric column has been loaded as a string because of a unit suffix, the fix is a two-step clean-then-cast:

df["flight_hrs"] = df["flight_hrs"].str.replace("h", "")
df["flight_hrs"] = df["flight_hrs"].astype(float)

For cases where some values might not parse cleanly, pd.to_numeric(df["col"], errors="coerce") is more tolerant: bad values become NaN instead of throwing an error.

To create categories from text patterns, np.select is a vectorised if/elif/else that operates on the whole column at once:

import numpy as np
short = "0h|1h|2h|3h|4h"
medium = "5h|6h|7h|8h|9h"
long_ = "10h|11h|12h|13h|14h|15h|16h"
conditions = [
df["flight_hrs"].str.contains(short),
df["flight_hrs"].str.contains(medium),
df["flight_hrs"].str.contains(long_)
]
labels = ["Short-haul", "Medium-haul", "Long-haul"]
df["haul_type"] = np.select(conditions, labels, default="Ultra-long")

For each row, numpy walks down the condition list and assigns the label of the first condition that evaluates to True. The pipe character inside the regex strings means “or”, so "0h|1h|2h" matches any of those substrings. The default value covers anything that matches none of the conditions.

To create categories from numeric ranges, pd.cut assigns each value to a named bucket based on boundaries you define:

fare_ranges = [0, p25, median_fare, p75, df["fare"].max()]
fare_labels = ["budget", "economy", "business", "premium"]
df["fare_tier"] = pd.cut(df["fare"], bins=fare_ranges, labels=fare_labels)

There is always one more boundary than labels, because four buckets need five cut points to define them. A close alternative, pd.qcut, splits by quantiles instead of fixed boundaries, so each bucket contains roughly the same number of rows rather than covering the same numeric range.

Detecting and Removing Outliers

The IQR method is the standard approach for flagging outliers without making assumptions about the underlying distribution:

q75 = df["fare"].quantile(0.75)
q25 = df["fare"].quantile(0.25)
iqr = q75 - q25
upper = q75 + (1.5 * iqr)
lower = q25 - (1.5 * iqr)
df = df[(df["fare"] > lower) & (df["fare"] < upper)]

Anything more than 1.5 times the IQR above Q3 or below Q1 is treated as an outlier. The boolean filter keeps only rows that fall within those fences. Whether to actually remove them is a judgement call: sometimes outliers are data entry errors and removal is correct; sometimes they are the most analytically interesting rows and removing them throws away signal. Always investigate what the outliers are before deleting them.

Working with Dates

By default, pandas reads date columns as plain strings. Passing parse_dates at load time tells pandas to convert those columns to proper datetime values during the read, which unlocks date arithmetic and the .dt accessor:

df = pd.read_csv("bookings.csv", parse_dates=["booking_date", "join_date"])

If you need to convert after loading, pd.to_datetime handles it:

df["booking_date"] = pd.to_datetime(df["booking_date"])

Once a column is datetime64, you can extract any component of the date as a numeric feature:

df["year"] = df["booking_date"].dt.year
df["month"] = df["booking_date"].dt.month
df["weekday"] = df["booking_date"].dt.weekday # 0 = Monday, 6 = Sunday

This is the standard recipe for turning one date column into multiple features a model can use directly. The .dt accessor only works on datetime64 columns, so confirm the dtype before using it.

Visualisation

Each chart type answers a different question about the data. This is the EDA visualisation toolkit:

import seaborn as sns
import matplotlib.pyplot as plt
# Distribution of one numeric variable
sns.histplot(data=df, x="fare", binwidth=50)
# Distribution by group
sns.boxplot(data=df, x="fare", y="carrier")
# Smoothed distribution, multiple groups
sns.kdeplot(data=df, x="fare", hue="haul_type", cut=0, cumulative=True)
# Average of numeric by category
sns.barplot(data=df, x="carrier", y="fare")
sns.barplot(data=df, x="carrier", y="fare", hue="haul_type")
# Count of categories
sns.countplot(data=df, x="carrier", hue="haul_type")
# Relationship between two numeric variables
sns.scatterplot(data=df, x="flight_hrs", y="fare", hue="haul_type")
# All pairwise relationships at once
sns.pairplot(data=df, vars=["fare", "flight_hrs", "stops"])
# Correlation heatmap
sns.heatmap(df.corr(), annot=True)
plt.show()

histplot shows the shape of one variable’s distribution. boxplot compares distributions across categories using five-number summaries and makes outliers visible. kdeplot is a smoothed histogram, better than a standard histogram for overlaying multiple groups. barplot shows means per category with confidence intervals. countplot shows category frequencies. scatterplot shows the relationship between two numeric variables, with hue adding a third dimension through colour. pairplot is the quickest way to survey all pairwise relationships at once: it produces a grid of scatterplots between every listed variable, making correlations and unusual shapes visible in a single call. heatmap(df.corr()) summarises those pairwise correlations as a coloured grid with annot=True writing the values into each cell. Every seaborn plot needs plt.show() at the end to render.

The EDA Checklist

1. df.head() / .info() / .describe() understand shape and types
2. .value_counts() check categorical distributions
3. .isna().sum() find missing values
4. histplot / boxplot spot outliers and distributions
5. groupby + agg compare groups
6. scatterplot / pairplot / heatmap find relationships
7. fix types (dates, strings to numbers) prepare for analysis
8. handle missing (drop or impute) clean the data
9. remove outliers if needed (IQR method) validate ranges

Quick Reference

TaskCode
Count missingdf.isna().sum()
Drop rows with missingdf.dropna(subset=["col"])
Fill with valuedf["col"].fillna(0)
Fill with group mediandf["col"].fillna(df["group"].map(medians_dict))
Group and aggregatedf.groupby("col").agg(["mean", "std"])
Named aggregation.agg(new_name=("col", "mean"))
Add group stat as column.groupby("col")["val"].transform(lambda x: x.mean())
Convert to number.str.replace("h", "").astype(float)
Category from numberpd.cut(df["col"], bins=[...], labels=[...])
Category from textnp.select(conditions, labels, default="other")
IQR outlier boundsq75 + 1.5 * iqr and q25 - 1.5 * iqr
Parse dates on loadpd.read_csv("f.csv", parse_dates=["col"])
Convert to datetimepd.to_datetime(df["col"])
Extract year and monthdf["col"].dt.year / .dt.month
Exclude categories~df["col"].isin(["val1", "val2"])
Cross-tabulatepd.crosstab(df["a"], df["b"], values=df["c"], aggfunc="mean")
Check class balancedf["col"].value_counts(normalize=True)

See you soon

View Comments (6)

Leave a Reply

  1. […] Descriptive statistics and visualisation are two views of the same thing. The charts, dot plots, histograms, stem-and-leaf diagrams, and boxplots, let you see the shape of a distribution, where it centres, how it spreads, and whether it leans one way. The numbers, mean, median, mode, range, IQR, and standard deviation, pin that shape down precisely. The two reinforce each other: a histogram suggests skewness, and comparing the mean to the median confirms it; a boxplot hints at spread, and the IQR quantifies it. Master both, and you can understand any dataset before you ever try to model it. […]

Prev Next

Subscribe to My Newsletter

Subscribe to my email newsletter to get the latest posts delivered right to your email. Pure inspiration, zero spam.

Discover more from Datalad - Data Science and ML

Subscribe now to keep reading and get access to the full archive.

Continue reading