July 17, 2026

7 min read

Data Science

A Practical Guide to Exploratory Data Analysis in Python

Exploratory Data Analysis (EDA) is essential for understanding datasets before drawing conclusions. This guide outlines the Python EDA workflow, detailing techniques for validating data, handling missing values, detecting outliers, and visualising findings effectively.

Before building a model, writing a report, or drawing any conclusion from a dataset, there is a step that separates analysts who get it right from those who do not. Exploratory Data Analysis is the process of actually understanding your data before you trust it. It is how you find the missing values, the type mismatches, the outliers that will silently wreck a model, and the distributional patterns that tell you what the data is really saying.

This guide walks through the full EDA workflow in Python, from the first five-line inspection to outlier removal and visualisation.

The math: https://datalad.co.uk/the-mathematics-behind-descriptive-statistics-and-eda/

The First Look

Every dataset deserves the same four-line introduction before anything else happens.

			
df.head()                                    # first 5 rows
df.info()                                    # column names, types, non-null counts
df.describe()                                # summary stats for numeric columns
df["status"].value_counts()                  # frequency of each category
df["status"].value_counts(normalize=True)    # proportions instead of counts

		

head() gives you a concrete preview of what the data looks like. info() is the metadata layer: column names, data types, and how many non-null values each column holds. Gaps in non-null counts are missing values. Type mismatches show up here too, a column that should be numeric but reads as object is a red flag. describe() runs a battery of summary statistics across every numeric column so outliers and unusual ranges surface immediately. value_counts() answers how often each category appears in a column. Adding normalize=True converts raw counts to proportions, which is more useful when you are checking class balance.

These five calls together take under a minute and give you a working mental model of any new dataset.

Validating the Data

The first validation step is a simple range check. Do the smallest and largest values in a column make sense for what that column is supposed to represent?

			
df["age"].min()
df["age"].max()

A maximum age of 250 or a negative salary signals dirty data immediately. Cheaper to catch here than after building three months of analysis on top of it.

When you need to filter out specific categories, .isin() combined with the ~ operator is the standard pattern:

			
not_apac = ~df["region"].isin(["APAC"])
df[not_apac]    # all rows except APAC

.isin() produces a True/False value per row. The tilde flips every True to False and every False to True, so the filter reads: keep every row whose region is not in this list.

For a two-dimensional view of how a numeric variable behaves across two categorical dimensions, pd.crosstab with aggregation is more useful than filtering manually:

			
pd.crosstab(df["role_type"], df["org_size"],
            values=df["base_salary"], aggfunc="mean")

By default, crosstab counts occurrences. Adding values and aggfunc changes each cell from a count to an aggregation of a third column. The result here is the average base salary for each combination of role type and organisation size, laid out as a grid.

Grouped Summaries

agg is how you compute multiple statistics in a single pass. The cleanest form is named aggregation, where you control the output column names directly:

			
df.groupby("region").agg(
    mean_revenue = ("q2_revenue", "mean"),
    std_revenue  = ("q2_revenue", "std")
)

Each keyword argument becomes a column name in the output. The tuple pairs a source column with the aggregation function. This is preferable to passing a list of function names, which produces multi-level column names that are awkward to work with downstream.

When you want to add a group-level statistic back to the original DataFrame rather than collapsing it, transform is the right tool:

			
df["carrier_fare_std"]     = df.groupby("carrier")["fare"].transform(lambda x: x.std())
df["carrier_median_hrs"]   = df.groupby("carrier")["flight_hrs"].transform(lambda x: x.median())
df["destination_avg_fare"] = df.groupby("destination")["fare"].transform(lambda x: x.mean())

While agg returns one row per group, transform broadcasts the group statistic back to every row in that group. The DataFrame keeps its original shape, and each row gains a new column showing the group-level value it belongs to. This is the standard approach for feature engineering: adding columns like “this carrier’s average fare” to every individual booking row, so you can later compare each booking against its group’s baseline.

Handling Missing Values

The first step with missing data is knowing how much of it you have:

			
df.isna().sum()
df.isna().sum() / len(df)

isna() returns a boolean DataFrame where True marks every NaN. Summing column-wise converts those booleans to counts. Dividing by the row count gives proportions, which is what actually drives the decision about what to do next. Below roughly five percent missing, dropping the affected rows is usually acceptable. Above that, you need to think harder.

Dropping rows with few missing values:

			
threshold = len(df) * 0.05
cols_to_drop = df.columns[df.isna().sum() <= threshold]
df.dropna(subset=cols_to_drop, inplace=True)

This finds every column with fewer than five percent missing values and drops only the rows that have NaN in any of those columns. The data loss is small enough that it rarely biases analysis.

Dropping a column entirely:

df.drop(columns=["remarks"], inplace=True)

When a column has too many missing values to salvage, is redundant with another column, or contains unstructured text you cannot use, removing it entirely is cleaner than carrying a half-empty column through the rest of the analysis.

Imputing with group median:

			
group_medians = df.groupby("carrier")["fare"].median()
fares_dict    = group_medians.to_dict()
df["fare"]    = df["fare"].fillna(df["carrier"].map(fares_dict))

Filling all missing values with a single global median ignores the structure in the data. A budget carrier’s missing fare should be filled with a budget carrier’s median, not the overall median across all carriers. This approach computes a median per group, converts it to a lookup dictionary, then maps each row’s group to the appropriate fill value before passing it to fillna. Group-aware imputation preserves the differences between groups that a global fill would wash out.

Data Type Issues

String-typed columns require a different approach from numeric ones. This loop gives you a quick picture of the cardinality of every object column in the dataset:

			
non_numeric = df.select_dtypes("object")
for col in non_numeric.columns:
    print(f"{col}: {non_numeric[col].nunique()} unique values")

Low cardinality (a handful of distinct values) usually means a categorical variable that can be encoded for modelling. High cardinality (thousands of values) usually means free text or identifiers that need separate handling.

When a numeric column has been loaded as a string because of a unit suffix, the fix is a two-step clean-then-cast:

			
df["flight_hrs"] = df["flight_hrs"].str.replace("h", "")
df["flight_hrs"] = df["flight_hrs"].astype(float)

For cases where some values might not parse cleanly, pd.to_numeric(df["col"], errors="coerce") is more tolerant: bad values become NaN instead of throwing an error.

To create categories from text patterns, np.select is a vectorised if/elif/else that operates on the whole column at once:

			
import numpy as np
short  = "0h|1h|2h|3h|4h"
medium = "5h|6h|7h|8h|9h"
long_  = "10h|11h|12h|13h|14h|15h|16h"
conditions = [
    df["flight_hrs"].str.contains(short),
    df["flight_hrs"].str.contains(medium),
    df["flight_hrs"].str.contains(long_)
]
labels = ["Short-haul", "Medium-haul", "Long-haul"]
df["haul_type"] = np.select(conditions, labels, default="Ultra-long")

		

For each row, numpy walks down the condition list and assigns the label of the first condition that evaluates to True. The pipe character inside the regex strings means “or”, so "0h|1h|2h" matches any of those substrings. The default value covers anything that matches none of the conditions.

To create categories from numeric ranges, pd.cut assigns each value to a named bucket based on boundaries you define:

			
fare_ranges = [0, p25, median_fare, p75, df["fare"].max()]
fare_labels = ["budget", "economy", "business", "premium"]
df["fare_tier"] = pd.cut(df["fare"], bins=fare_ranges, labels=fare_labels)

There is always one more boundary than labels, because four buckets need five cut points to define them. A close alternative, pd.qcut, splits by quantiles instead of fixed boundaries, so each bucket contains roughly the same number of rows rather than covering the same numeric range.

Detecting and Removing Outliers

The IQR method is the standard approach for flagging outliers without making assumptions about the underlying distribution:

			
q75 = df["fare"].quantile(0.75)
q25 = df["fare"].quantile(0.25)
iqr = q75 - q25
upper = q75 + (1.5 * iqr)
lower = q25 - (1.5 * iqr)
df = df[(df["fare"] > lower) & (df["fare"] < upper)]

		

Anything more than 1.5 times the IQR above Q3 or below Q1 is treated as an outlier. The boolean filter keeps only rows that fall within those fences. Whether to actually remove them is a judgement call: sometimes outliers are data entry errors and removal is correct; sometimes they are the most analytically interesting rows and removing them throws away signal. Always investigate what the outliers are before deleting them.

Working with Dates

By default, pandas reads date columns as plain strings. Passing parse_dates at load time tells pandas to convert those columns to proper datetime values during the read, which unlocks date arithmetic and the .dt accessor:

df = pd.read_csv("bookings.csv", parse_dates=["booking_date", "join_date"])

If you need to convert after loading, pd.to_datetime handles it:

df["booking_date"] = pd.to_datetime(df["booking_date"])

Once a column is datetime64, you can extract any component of the date as a numeric feature:

			
df["year"]    = df["booking_date"].dt.year
df["month"]   = df["booking_date"].dt.month
df["weekday"] = df["booking_date"].dt.weekday    # 0 = Monday, 6 = Sunday

This is the standard recipe for turning one date column into multiple features a model can use directly. The .dt accessor only works on datetime64 columns, so confirm the dtype before using it.

Visualisation

Each chart type answers a different question about the data. This is the EDA visualisation toolkit:

			
import seaborn as sns
import matplotlib.pyplot as plt
# Distribution of one numeric variable
sns.histplot(data=df, x="fare", binwidth=50)
# Distribution by group
sns.boxplot(data=df, x="fare", y="carrier")
# Smoothed distribution, multiple groups
sns.kdeplot(data=df, x="fare", hue="haul_type", cut=0, cumulative=True)
# Average of numeric by category
sns.barplot(data=df, x="carrier", y="fare")
sns.barplot(data=df, x="carrier", y="fare", hue="haul_type")
# Count of categories
sns.countplot(data=df, x="carrier", hue="haul_type")
# Relationship between two numeric variables
sns.scatterplot(data=df, x="flight_hrs", y="fare", hue="haul_type")
# All pairwise relationships at once
sns.pairplot(data=df, vars=["fare", "flight_hrs", "stops"])
# Correlation heatmap
sns.heatmap(df.corr(), annot=True)
plt.show()

		

histplot shows the shape of one variable’s distribution. boxplot compares distributions across categories using five-number summaries and makes outliers visible. kdeplot is a smoothed histogram, better than a standard histogram for overlaying multiple groups. barplot shows means per category with confidence intervals. countplot shows category frequencies. scatterplot shows the relationship between two numeric variables, with hue adding a third dimension through colour. pairplot is the quickest way to survey all pairwise relationships at once: it produces a grid of scatterplots between every listed variable, making correlations and unusual shapes visible in a single call. heatmap(df.corr()) summarises those pairwise correlations as a coloured grid with annot=True writing the values into each cell. Every seaborn plot needs plt.show() at the end to render.

The EDA Checklist

			
df.head() / .info() / .describe()        understand shape and types
.value_counts()                           check categorical distributions
.isna().sum()                             find missing values
histplot / boxplot                        spot outliers and distributions
groupby + agg                             compare groups
scatterplot / pairplot / heatmap          find relationships
fix types (dates, strings to numbers)     prepare for analysis
handle missing (drop or impute)           clean the data
remove outliers if needed (IQR method)    validate ranges

		

Quick Reference

Task	Code
Count missing	`df.isna().sum()`
Drop rows with missing	`df.dropna(subset=["col"])`
Fill with value	`df["col"].fillna(0)`
Fill with group median	`df["col"].fillna(df["group"].map(medians_dict))`
Group and aggregate	`df.groupby("col").agg(["mean", "std"])`
Named aggregation	`.agg(new_name=("col", "mean"))`
Add group stat as column	`.groupby("col")["val"].transform(lambda x: x.mean())`
Convert to number	`.str.replace("h", "").astype(float)`
Category from number	`pd.cut(df["col"], bins=[...], labels=[...])`
Category from text	`np.select(conditions, labels, default="other")`
IQR outlier bounds	`q75 + 1.5 * iqr` and `q25 - 1.5 * iqr`
Parse dates on load	`pd.read_csv("f.csv", parse_dates=["col"])`
Convert to datetime	`pd.to_datetime(df["col"])`
Extract year and month	`df["col"].dt.year` / `.dt.month`
Exclude categories	`~df["col"].isin(["val1", "val2"])`
Cross-tabulate	`pd.crosstab(df["a"], df["b"], values=df["c"], aggfunc="mean")`
Check class balance	`df["col"].value_counts(normalize=True)`

See you soon

Data Science

Andrei

July 17, 2026

7 min read

View Comments (9)

Recommended for You

Why a Central Data Platform is a Strategic Imperative

A Data Scientist’s Perspective for Financial Services Organizations Financial institutions today generate massive volumes of customer, transactional, operational,…

June 17, 2026

5 min read

Data Scientist – Roadmap

The Data Scientist Roadmap outlines a two-phase journey to becoming a data scientist, focusing on mathematics and coding, beginning with foundational theory and progressing to hands-on machine learning applications.

July 24, 2026

3 min read

CSS Display Property Cheatsheet

Understanding the CSS Display Property: 10 Code-Along Examples

Standard SQL in BigQuery Cheatsheet

Standard SQL in BigQuery: 10 Code-Along Examples

A Practical Guide to Exploratory Data Analysis in Python

The First Look

Validating the Data

Grouped Summaries

Handling Missing Values

Data Type Issues

Detecting and Removing Outliers

Working with Dates

Visualisation

The EDA Checklist

Quick Reference

Related

Leave a ReplyCancel reply

Recommended for You

Why a Central Data Platform is a Strategic Imperative

Data Scientist – Roadmap

CSS Display Property Cheatsheet

Understanding the CSS Display Property: 10 Code-Along Examples

Standard SQL in BigQuery Cheatsheet

Standard SQL in BigQuery: 10 Code-Along Examples

A Practical Guide to Exploratory Data Analysis in Python

The First Look

Validating the Data

Grouped Summaries

Handling Missing Values

Data Type Issues

Detecting and Removing Outliers

Working with Dates

Visualisation

The EDA Checklist

Quick Reference

Related

Leave a ReplyCancel reply

Subscribe to My Newsletter

Recommended for You

Why a Central Data Platform is a Strategic Imperative

Data Scientist – Roadmap

Discover more from Discuss Data Science, Machine Learning and Analytics