Data Visualization in Python: Customization, Annotations, and Statistical Plots

This article explains how to turn the default output of seaborn or matplotlib into an effective chart by highlighting points, adding annotations, plotting distributions, handling many categories, choosing color palettes, and visualizing uncertainty.

The default output from seaborn or matplotlib is a starting point, not a finished chart. It shows your data, but it does not direct attention. Every point looks equally important, every line carries the same weight, and the viewer has to do all the interpretive work themselves.

This article covers the toolkit for going from that raw output to something that communicates clearly: highlighting specific observations, annotating stories, comparing distributions, handling messy multi-category data, choosing color palettes that match your data type, and visualizing uncertainty honestly through confidence intervals and bootstrap resampling.

Highlighting Specific Points

The most direct way to focus a viewer’s eye is to make one point colored and everything else gray. The mechanism is simple: build a list of colors the same length as your DataFrame, one per row, then pass it to the scatter function.

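import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# air_data is the air-quality DataFrame used throughout this article
point_colors = [
    'orangered' if day == 330 and year == 2014 else 'lightgray'
    for day, year in zip(air_data.day, air_data.year)
]
sns.regplot(
    x='NO2', y='SO2',
    data=air_data,
    fit_reg=False,
    scatter_kws={'facecolors': point_colors, 'alpha': 0.7}
)
plt.show()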

zip(air_data.day, air_data.year) pairs each row’s day and year together. The list comprehension walks every pair and assigns a color. scatter_kws is a pass-through dict that goes straight to the underlying matplotlib scatter call, which lets seaborn expose low-level matplotlib options without wrapping every single one. fit_reg=False strips the regression line so only the dots render.
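The same color list can also be built without an explicit loop. The following is a minimal sketch using np.where on a small made-up frame; toy_air and its values are invented for illustration and only stand in for air_data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for air_data (values invented for illustration)
toy_air = pd.DataFrame({
    "day":  [329, 330, 330, 331],
    "year": [2014, 2014, 2013, 2014],
    "NO2":  [18.0, 35.5, 20.1, 17.2],
    "SO2":  [1.1, 6.3, 1.4, 0.9],
})

# Boolean mask: True only for the row we want to highlight
is_target = (toy_air["day"] == 330) & (toy_air["year"] == 2014)

# One color per row, in the same order as the DataFrame
point_colors = np.where(is_target, "orangered", "lightgray").tolist()
print(point_colors)
```

The resulting list can be passed to scatter_kws={'facecolors': point_colors} exactly like the list-comprehension version.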

When the condition is a data-derived rule rather than a hardcoded coordinate, a cleaner approach is to add a categorical column and use hue=:

houston_df = air_data[air_data.city == 'Houston'].copy()
peak_o3 = houston_df.O3.max()
houston_df['point_label'] = [
    'Highest O3 Day' if O3 == peak_o3 else 'Others'
    for O3 in houston_df.O3
]
sns.scatterplot(x='NO2', y='SO2', hue='point_label', data=houston_df)
plt.show()

The .copy() is not optional: pandas cannot always tell whether a filtered DataFrame is a view of the original or a copy, so assigning to houston_df['point_label'] without it can trigger a SettingWithCopyWarning, and the assignment may not behave as expected. With an explicit copy, you own the DataFrame and can modify it freely.
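To make the effect of .copy() concrete, here is a small sketch on invented data (toy_air is a hypothetical stand-in for air_data). It only demonstrates that the new column lands on the copy and leaves the source frame untouched.

```python
import pandas as pd

# Hypothetical stand-in for air_data (values invented for illustration)
toy_air = pd.DataFrame({
    "city": ["Houston", "Houston", "Denver"],
    "O3":   [0.031, 0.054, 0.040],
})

# Explicit copy: houston_df is an independent DataFrame
houston_df = toy_air[toy_air["city"] == "Houston"].copy()
peak_o3 = houston_df["O3"].max()
houston_df["point_label"] = [
    "Highest O3 Day" if o3 == peak_o3 else "Others"
    for o3 in houston_df["O3"]
]

# The new column exists on the copy only; the source frame is unchanged
print(houston_df)
print(list(toy_air.columns))
```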

The hue= approach assigns colors automatically and generates a legend with no extra work. Use scatter_kws={'facecolors': color_list} when you want full manual control of the palette; use hue='column' when you want seaborn to handle colors and the legend.

To highlight an entire group rather than a single point, the same pattern applies across all rows of that group:

city_colors = [
    'orangered' if city == 'Long Beach' else 'lightgray'
    for city in air_data['city']
]
sns.regplot(
    x='CO', y='O3',
    data=air_data,
    fit_reg=False,
    scatter_kws={'facecolors': city_colors, 'alpha': 0.3}
)
plt.text(1.6, 0.072, 'April 30th, Bad Day')
plt.show()

alpha=0.3 is intentional here. With hundreds of overlapping gray points, transparency reveals density: darker patches mean more points stacking on top of each other. The colored group still pops because its saturated hue contrasts with the gray. plt.text(x, y, 'message') places a label at data coordinates, so x and y are in the same units as CO and O3.

Annotations

plt.text is enough for open areas of a chart, but when the point you want to label sits in a crowded region, you need to separate the label from the point. plt.annotate does this by drawing an arrow from a text position to the data coordinate.

lb_jan1 = jan_df.query("(day == 1) & (city == 'Long Beach')")
plt.annotate(
    'Long Beach New Years',
    # .iloc[0] pulls scalar coordinates out of the one-row result
    xy=(lb_jan1.CO.iloc[0], lb_jan1.NO2.iloc[0]),
    xytext=(2, 15),
    arrowprops={'facecolor': 'gray', 'width': 3, 'shrink': 0.03},
    backgroundcolor='white'
)
plt.show()

xy is the arrow tip, placed at the actual data point. xytext is where the label sits. shrink=0.03 pulls the arrowhead back slightly so it does not overlap the dot. backgroundcolor='white' puts a white rectangle behind the text so it stays readable when it lands on top of other data.
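For readers who want to try the arrow annotation without the article's dataset, the following self-contained sketch uses plain matplotlib and invented coordinates. The label text and the numbers are illustrative assumptions, not values from jan_df.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Invented coordinates standing in for CO (x) and NO2 (y) readings
co = [0.4, 0.5, 0.55, 0.6, 1.2]
no2 = [10, 12, 11, 13, 30]

fig, ax = plt.subplots()
ax.scatter(co, no2, color="lightgray")

# xy is the arrow tip (the data point); xytext is where the label sits
note = ax.annotate(
    "Outlier day",
    xy=(co[-1], no2[-1]),
    xytext=(0.5, 25),
    arrowprops={"facecolor": "gray", "width": 3, "shrink": 0.03},
    backgroundcolor="white",
)
fig.canvas.draw()  # render once so layout problems surface here
```

Moving xytext into an empty part of the axes is usually the only tuning needed; the arrow follows automatically.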

For a different use case, annotations can replace a legend entirely. When you have 15 or 20 categories and a legend would be a wall of text, labeling the points directly is cleaner:

# x and y must be passed by keyword in current seaborn releases
g = sns.scatterplot(x='category', y='sell_rate', data=goods_by_state, s=0)
for _, row in goods_by_state.iterrows():
    g.annotate(
        row['state'],
        (row['category'], row['sell_rate']),
        ha='center',
        size=10
    )
plt.show()

s=0 makes the dots invisible. The loop then drops each state’s name at its (category, sell_rate) coordinate. The text becomes the point. g.annotate (called on the axes object returned by seaborn) ensures the annotations land on the correct subplot in multi-panel figures.

Distribution Plots

Histograms are fine for a single distribution but misleading when comparing groups, because changing bar width or bin offset can make the same data look very different. Kernel density estimates sidestep this by drawing a smooth continuous curve. Two KDE curves on the same axes make group comparison direct.

sns.kdeplot(air_data[air_data.year == 2012].O3, fill=True, label='2012')
sns.kdeplot(air_data[air_data.year != 2012].O3, fill=True, label='Other years')
plt.legend()
plt.show()

fill=True fills the area under each curve (seaborn releases before 0.11 call this parameter shade, which is now deprecated). When the two filled regions overlap, the overlapping patch is visibly darker because two transparent fills are stacking, which immediately shows where the distributions agree and where they diverge.
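The claim that histograms are sensitive to bin placement can be checked numerically. This sketch uses six invented values and np.histogram with the same bin width but two different offsets; it is an illustration of the problem, not part of the article's dataset.

```python
import numpy as np

# Invented sample: two tight clusters of three values each
values = np.array([0.9, 1.0, 1.1, 1.9, 2.0, 2.1])

# Same bin width (1.0), two different bin offsets
counts_a, _ = np.histogram(values, bins=np.arange(0.0, 4.0, 1.0))  # edges 0, 1, 2, 3
counts_b, _ = np.histogram(values, bins=np.arange(0.5, 3.5, 1.0))  # edges 0.5, 1.5, 2.5

print(counts_a)  # clusters are split across bins
print(counts_b)  # clusters fall cleanly into one bin each
```

Shifting the edges by half a bin changes the apparent shape from lopsided to two equal groups, which is the kind of artifact a KDE avoids.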

A rug plot adds an honesty check beneath a KDE. The row of small tick marks along the x-axis shows exactly where real observations sit, preventing the smooth curve from implying more data than actually exists.

# fill=True is the current spelling of the deprecated shade=True
sns.kdeplot(
    air_data[air_data.city == 'Vandenberg Air Force Base'].O3,
    fill=True, label='Vandenberg', color='steelblue'
)
sns.rugplot(
    air_data[air_data.city == 'Vandenberg Air Force Base'].O3,
    color='steelblue'
)
sns.kdeplot(
    air_data[air_data.city != 'Vandenberg Air Force Base'].O3,
    fill=True, label='Other cities', color='gray'
)
plt.legend()
plt.show()

Note: sns.distplot was deprecated in seaborn 0.11. The equivalent is sns.kdeplot for the density curve and sns.rugplot for the tick marks, used separately.

For showing individual observations while also conveying density, a beeswarm plot (seaborn’s swarmplot) places every dot at its exact x-position but nudges vertically so nothing overlaps.

mar_df = air_data[air_data.month == 3]
sns.swarmplot(y='city', x='O3', data=mar_df, size=3)
plt.title('March Ozone levels by city')
plt.show()

Where data is dense, the swarm fans out wide. Where it is sparse, the swarm is narrow. Unlike a box plot, every individual observation is visible. size=3 keeps the dots small to reduce crowding; for datasets with more than a few hundred observations per category, sns.violinplot or sns.stripplot(jitter=True) scales better.

Handling Too Many Categories

The human eye can reliably distinguish around six to eight colors. Beyond that, a legend becomes a lookup table and the viewer loses any intuition about which line is which. Two solutions: split into small multiples, or collapse the unimportant categories into a single color.

FacetGrid produces one panel per category, so comparison happens by position rather than color.

g = sns.FacetGrid(data=air_data, col='city', col_wrap=3)
g.map(sns.scatterplot, 'CO', 'NO2', alpha=0.2)
plt.show()

col_wrap=3 starts a new row after every three panels. Each subplot shows just that city’s data at full clarity, with no legend confusion. This pattern, called small multiples, scales gracefully to dozens of categories.

When you want all lines visible for context but only a few to stand out, collapse the unimportant ones into a single label:

featured_series = [
    'Vandenberg Air Force Base NO2',
    'Long Beach CO',
    'Cincinnati SO2'
]
city_poll_monthly['series_color'] = [
    x if x in featured_series else 'other'
    for x in city_poll_monthly['city_poll']
]
sns.lineplot(
    x='month', y='value',
    hue='series_color',
    units='city_poll',
    estimator=None,
    palette='Set2',
    data=city_poll_monthly
)
plt.show()

The key parameters are units='city_poll' and estimator=None. Without them, seaborn would average all the rows labeled 'other' together into one line. units= tells seaborn to still draw one line per original city_poll value; estimator=None disables aggregation so raw values are plotted directly. The result: all original lines are visible, but the 47 unimportant ones share a single muted color while the three featured ones each have a distinct color.
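The relabeling step can also be written with Series.where and isin. The sketch below runs on a tiny invented frame (toy_monthly is a hypothetical stand-in for city_poll_monthly) and shows why units= is needed: the hue column has fewer distinct values than there are original series.

```python
import pandas as pd

# Invented long-format data standing in for city_poll_monthly
toy_monthly = pd.DataFrame({
    "city_poll": ["Long Beach CO", "Denver CO", "Cincinnati SO2", "Houston O3"],
    "month": [1, 1, 1, 1],
    "value": [0.5, 0.3, 1.2, 0.04],
})
featured_series = ["Long Beach CO", "Cincinnati SO2"]

# Keep the label where it is featured, replace everything else with 'other'
toy_monthly["series_color"] = toy_monthly["city_poll"].where(
    toy_monthly["city_poll"].isin(featured_series), "other"
)

# Fewer hue levels than original series
print(toy_monthly["series_color"].nunique(), toy_monthly["city_poll"].nunique())
```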

Color Palettes

Choosing the wrong palette type distorts the information in your chart even before the viewer reads the axes. The type of palette should match the type of data being encoded.

Sequential palettes are for continuous ordered data where higher values should be visually darker.

pal = sns.light_palette('orangered', as_cmap=True)
sns.scatterplot(x='CO', y='NO2', hue='O3', data=cinci_df, palette=pal)
plt.show()

as_cmap=True tells seaborn to treat this as a continuous color map rather than a list of discrete colors. Without it, you get a fixed number of buckets instead of a smooth gradient.

Diverging palettes are for data with a meaningful center point, typically zero, where values can go in either direction.

pal = sns.diverging_palette(250, 0, as_cmap=True)
sns.heatmap(nov_co_heatmap, cmap=pal, center=0, vmin=-4, vmax=4)
plt.yticks(rotation=0)
plt.show()

center=0 anchors the neutral color at zero. Without this, the palette stretches across the data range, making a slightly positive value look red even when it is nearly identical to zero. The two numeric arguments to diverging_palette are hues in the HUSL (HSLuv) color space: 250 is blue, 0 is red.

On a dark background, the white midpoint of the default diverging palette jumps out as the most prominent thing on the chart, the opposite of what you want. Setting center='dark' fixes this:

plt.style.use('dark_background')
pal = sns.diverging_palette(250, 0, center='dark', as_cmap=True)
sns.heatmap(oct_o3_heatmap, cmap=pal, center=0)
plt.show()
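To see why anchoring the midpoint at zero matters for a diverging palette, this sketch compares two matplotlib Normalize objects on invented values. It is an illustration of the idea behind center=0, not a description of how seaborn implements it internally.

```python
import numpy as np
from matplotlib.colors import Normalize

# Invented differences, mostly positive
diffs = np.array([-1.0, 0.0, 0.5, 2.0, 3.0])

# Stretching the colormap over the raw data range puts zero off-center
raw_norm = Normalize(vmin=diffs.min(), vmax=diffs.max())

# Symmetric limits around zero put zero at the colormap midpoint (0.5)
limit = np.abs(diffs).max()
centered_norm = Normalize(vmin=-limit, vmax=limit)

print(float(raw_norm(0.0)), float(centered_norm(0.0)))
```

With the raw range, zero maps a quarter of the way into the colormap, so it would be drawn in the negative color rather than the neutral one.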

ColorBrewer palettes are a well-known collection designed specifically for data visualization. Many of them are colorblind-friendly and print-safe, and the ColorBrewer site flags which ones qualify. Three families cover most cases:

  • Set1, Set2, Set3, Paired for categorical data (no ordering, maximize distinction)
  • Blues, GnBu, YlOrRd for ordered continuous data or ordinal buckets
  • RdBu, BrBG, PuOr for diverging data

# Categorical: city lines, no ordering
sns.lineplot(
    x='day', y='CO',
    hue='city',
    palette='Set2',
    linewidth=3,
    data=jan13_df
)
plt.show()

# Ordinal: CO quartiles, light to dark
air_data['CO quartile'] = pd.qcut(air_data['CO'], q=4, labels=False)
des_moines = air_data.query("city == 'Des Moines'")
sns.scatterplot(
    x='SO2', y='NO2',
    hue='CO quartile',
    data=des_moines,
    palette='GnBu'
)
plt.show()

pd.qcut(col, q=4, labels=False) divides a continuous variable into four equal-frequency bins. labels=False assigns integer labels (0, 1, 2, 3) instead of interval strings, which pair cleanly with a sequential palette where the bin order matches the color order.
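A quick way to see what "equal-frequency" means is to run pd.qcut on a handful of invented values that include one outlier. The numbers below are made up for illustration.

```python
import pandas as pd

# Invented CO readings: eight values, one of them an outlier
co = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 5.0])

quartile = pd.qcut(co, q=4, labels=False)  # integer codes 0..3
intervals = pd.qcut(co, q=4)               # default: Interval labels

print(quartile.tolist())
print(quartile.value_counts().sort_index().tolist())
print(intervals.cat.categories)
```

Each bin receives two observations even though the top bin spans a far wider range of values, which is the difference between qcut and the equal-width pd.cut.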

For bar charts, edgecolor='black' is a small touch with a large payoff. It adds a thin outline around every bar, which prevents lightly shaded bars from blending together.

import numpy as np
sns.barplot(
    y='city', x='CO',
    estimator=np.mean,
    errorbar=None,  # seaborn >= 0.12; older releases use ci=None
    data=air_data,
    edgecolor='black'
)
plt.show()

Confidence Intervals

A point estimate by itself says “the mean CO level in Cincinnati is 0.43.” A confidence interval adds “and we expect the true mean to fall between 0.38 and 0.48 with 95% confidence.” The standard 95% CI formula is:

lower = mean - 1.96 * std_err
upper = mean + 1.96 * std_err
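The formula takes std_err as given. As a minimal sketch, assuming std_err means the standard error of the mean (sample standard deviation divided by the square root of n) and that the normal approximation is acceptable, the bounds can be computed from raw observations like this. The readings are invented.

```python
import numpy as np

# Invented sample of daily CO readings
sample = np.array([0.40, 0.45, 0.38, 0.50, 0.42, 0.47, 0.41, 0.43])

mean = sample.mean()
# Standard error of the mean: sample standard deviation / sqrt(n)
std_err = sample.std(ddof=1) / np.sqrt(len(sample))

lower = mean - 1.96 * std_err
upper = mean + 1.96 * std_err
print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

For samples this small, a t-distribution quantile would usually replace 1.96; the z-score is kept here only to mirror the formula above.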

For a dot-and-whisker chart that shows one CI per estimate, horizontal lines are the right geometry.

avg_estimates['lower'] = avg_estimates['mean'] - 1.96 * avg_estimates['std_err']
avg_estimates['upper'] = avg_estimates['mean'] + 1.96 * avg_estimates['std_err']
g = sns.FacetGrid(avg_estimates, row='pollutant', sharex=False)
g.map(plt.hlines, 'y', 'lower', 'upper')
g.map(plt.scatter, 'seen', 'y', color='orangered')
g.set_ylabels('').set_xlabels('')
plt.show()

sharex=False gives each panel its own x-axis scale, which matters when different pollutants live on very different scales.

When you are plotting differences rather than raw estimates, adding a vertical line at zero makes statistical significance immediately readable: if a CI crosses the line, the difference might be noise; if it sits entirely to one side, the difference is significant.

plt.hlines(
y='year', xmin='lower', xmax='upper',
linewidth=5, color='steelblue', alpha=0.7,
data=yearly_diffs
)
plt.plot('mean', 'year', 'k|', data=yearly_diffs)
plt.axvline(x=0, color='orangered', linestyle='--')
plt.xlabel('95% CI')
plt.title('Avg SO2 differences between Cincinnati and Indianapolis')
plt.show()

'k|' is matplotlib shorthand for black vertical bar markers. They mark the point estimate without a filled circle that would obscure the CI line.

For time series, the CI becomes a band rather than individual whiskers.

vandenberg_no2['lower'] = vandenberg_no2['mean'] - 2.58 * vandenberg_no2['std_err']
vandenberg_no2['upper'] = vandenberg_no2['mean'] + 2.58 * vandenberg_no2['std_err']
plt.plot('day', 'mean', data=vandenberg_no2, color='white', alpha=0.4)
plt.fill_between(x='day', y1='lower', y2='upper', data=vandenberg_no2)
plt.show()

plt.fill_between shades the region between two y-curves over a shared x. The 2.58 z-score gives a 99% CI rather than 95%. A higher confidence level produces a wider band, because the interval has to cover more of the plausible values; at a fixed confidence level, a narrower band means a more precise estimate.

You can layer multiple CI widths on the same chart by looping over z-scores. With horizontal lines, vary the line thickness so that the shorter, inner intervals are drawn thicker and stay distinguishable from the longer ones.

sizes = [15, 10, 5]
int_widths = ['90% CI', '95% CI', '99% CI']
z_scores = [1.645, 1.96, 2.58]
for label, Z, size in zip(int_widths, z_scores, sizes):
    plt.hlines(
        y=coef_df.pollutant,
        xmin=coef_df['est'] - Z * coef_df['std_err'],
        xmax=coef_df['est'] + Z * coef_df['std_err'],
        label=label,
        linewidth=size,
        color='gray'
    )
# White dots mark the point estimates on top of the gray intervals
plt.plot('est', 'pollutant', 'wo', data=coef_df, label='Point Estimate')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

bbox_to_anchor=(1, 0.5) places the legend just outside the right edge of the plot, centered vertically, so it does not obscure any of the chart content.

For time series, layered bands with fill_between and transparency produce a similar effect.

int_widths = ['90%', '99%']
z_scores = [1.645, 2.58]
colors = ['#fc8d59', '#fee08b']
for label, Z, color in zip(int_widths, z_scores, colors):
    plt.fill_between(
        x=cinci_no2.day,
        y1=cinci_no2['mean'] - Z * cinci_no2['std_err'],
        y2=cinci_no2['mean'] + Z * cinci_no2['std_err'],
        alpha=0.4, color=color, label=label
    )
plt.legend()
plt.show()

With alpha=0.4, the 90% inner region appears darker because both bands overlap there, which intuitively reads as “more certain.”

Z-score reference:

Confidence level    Z-score
90%                 1.645
95%                 1.96
99%                 2.58
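These z-scores do not have to be memorized. As a sketch, they can be derived from the standard normal distribution with Python's standard library; z_for_confidence is a hypothetical helper name introduced here for illustration.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided z-score for a confidence level given as a fraction, e.g. 0.95."""
    tail = (1 - level) / 2  # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: {z_for_confidence(level):.3f}")
```

The 99% value comes out slightly below 2.58; the table rounds it to two decimals.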

Bootstrap Visualization

Bootstrapping estimates a confidence interval empirically rather than mathematically. From your observed data, repeatedly draw samples of the same size with replacement, compute your statistic on each sample, and collect the results. The middle 95% of those simulated values is your 95% CI, with no distributional assumption required.

cinci_may = air_data.query("city == 'Cincinnati' & month == 5").NO2
resample_means = resample_stat(cinci_may, 1000)
lower, upper = np.percentile(resample_means, [2.5, 97.5])
plt.axvspan(lower, upper, color='gray', alpha=0.2)
sns.histplot(resample_means, bins=100)
plt.show()

np.percentile(resample_means, [2.5, 97.5]) slices off the bottom 2.5% and top 2.5%, leaving the central 95%. plt.axvspan shades the full vertical strip between those two x-values. The histogram of the resampled means shows the simulated sampling distribution; its spread directly reflects how certain the estimate is.
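The article uses resample_stat without showing its definition. One plausible version is sketched below; the stat and seed parameters are assumptions added for illustration, and the NO2 readings are invented.

```python
import numpy as np

def resample_stat(values, n_resamples, stat=np.mean, seed=0):
    """Hypothetical helper: bootstrap `stat` over `values` n_resamples times."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    return np.array([
        # Draw a sample of the same size, with replacement
        stat(rng.choice(values, size=len(values), replace=True))
        for _ in range(n_resamples)
    ])

# Invented NO2 readings standing in for cinci_may
no2 = np.array([12.0, 15.5, 9.8, 20.1, 14.3, 11.7, 18.2, 13.4, 16.0, 10.9])
resample_means = resample_stat(no2, 1000)
lower, upper = np.percentile(resample_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lower:.2f}, {upper:.2f})")
```

Passing stat=np.median (or any function of a 1-D array) would bootstrap a different statistic with the same loop.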

The same idea works for regression. Instead of drawing one line with a confidence band, draw 100 bootstrapped regression lines, each with low opacity.

sns.lmplot(
    x='NO2', y='SO2',  # keyword arguments are required in current seaborn
    data=boot_pairs,
    hue='sample',
    line_kws={'color': 'steelblue', 'alpha': 0.2},
    ci=None,
    legend=False,
    scatter=False
)
plt.scatter('NO2', 'SO2', data=raw_pairs)
plt.show()

hue='sample' tells seaborn to draw one line per bootstrap sample. alpha=0.2 makes each line 20% opaque. Where the lines agree, they stack into a dark, confident region. Where they fan out, the chart is light, signaling uncertainty.
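How boot_pairs was built is not shown. The idea behind it can be sketched with NumPy alone: resample the (NO2, SO2) pairs together, refit a line each time, and look at how much the slope varies. The data below are simulated, and the true slope of 0.1 is an arbitrary choice for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated paired readings: SO2 rises roughly linearly with NO2
no2 = rng.uniform(5, 40, size=60)
so2 = 0.1 * no2 + rng.normal(0, 0.5, size=60)

slopes = []
for _ in range(100):
    # Resample row indices so each (NO2, SO2) pair stays together
    idx = rng.integers(0, len(no2), size=len(no2))
    slope, intercept = np.polyfit(no2[idx], so2[idx], deg=1)
    slopes.append(slope)

slopes = np.array(slopes)
print(f"slope range across bootstraps: {slopes.min():.3f} to {slopes.max():.3f}")
```

Drawing each fitted line with low opacity gives the same fan of lines that the lmplot call produces.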

To compare bootstrapped estimates across multiple cities, loop over cities, resample each one, stack the results, and swarmplot the distributions.

city_resamples = pd.DataFrame()
for city in ['Cincinnati', 'Des Moines', 'Indianapolis', 'Houston']:
    city_NO2 = may_df[may_df.city == city].NO2
    cur_boot = pd.DataFrame({
        'NO2_avg': resample_stat(city_NO2, 100),
        'city': city
    })
    city_resamples = pd.concat([city_resamples, cur_boot])
sns.swarmplot(y='city', x='NO2_avg', data=city_resamples, color='coral')
plt.show()

Each dot in the beeswarm is one bootstrap estimate. The horizontal spread of dots for a given city shows its uncertainty; the center of the spread shows the estimate. Cities with wide, scattered swarms have high uncertainty; cities with tight swarms have low uncertainty. All of this is visible at a glance without any formal hypothesis test.

Multi-Panel Layouts

plt.subplots is the foundation for placing multiple charts in one figure. plt.subplots(2, 1) gives two rows, one column. plt.subplots(1, 2) gives one row, two columns. Pass the target axes to seaborn via ax=.

_, (ax1, ax2) = plt.subplots(2, 1)
sns.scatterplot(
    x='lon', y='lat',
    hue='weeks_open',
    palette=sns.light_palette('orangered', n_colors=12),
    legend=False,
    data=venues,
    ax=ax1
)
sns.regplot(
    x='lon', y='weeks_open',  # keyword arguments are required in current seaborn
    scatter_kws={'alpha': 0.2, 'color': 'gray'},
    lowess=True,
    data=venues,
    ax=ax2
)
plt.show()

Stacking panels works well when both share the same x-axis (longitude here), so the eye can scan vertically and align features. lowess=True uses locally weighted scatterplot smoothing instead of a global linear fit, which adapts to nonlinear trends.
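If the two stacked panels should stay aligned, plt.subplots can link their x-axes. This sketch uses plain matplotlib and invented coordinates; sharex=True is an addition for illustration and is not used in the article's own example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Two stacked panels that share one x-axis
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)

lon = [-120, -100, -90, -75]           # invented longitudes
ax_top.scatter(lon, [34, 40, 30, 41])  # invented latitudes
ax_bottom.plot(lon, [10, 14, 22, 18])  # invented weeks_open values

# With sharex=True, changing limits on one panel updates the other
ax_top.set_xlim(-125, -70)
print(ax_bottom.get_xlim())
```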

Side-by-side panels suit cases where you want two different views of the same entities.

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 15))
sns.barplot(
    x='people_per_venue', y='state',  # keyword arguments are required in current seaborn
    hue='is_featured',
    dodge=False,
    data=venues_by_state,
    ax=ax1
)
sns.scatterplot(
    x='log_population', y='log_venues',
    hue='is_featured',
    data=venues_by_state,
    ax=ax2,
    s=100
)
ax1.legend_.remove()
ax2.legend_.remove()
plt.show()

Both panels color by is_featured, so each would generate its own legend. Removing both with legend_.remove() prevents duplication; you would then add one shared legend manually.
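One way to add that single shared legend is fig.legend, fed with the handles from one of the panels. The sketch below uses plain matplotlib and invented points; the labels "Other" and "Featured" are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Invented data: the same two groups appear in both panels
for ax in (ax1, ax2):
    ax.scatter([1, 2], [3, 4], color="lightgray", label="Other")
    ax.scatter([1.5], [3.5], color="orangered", label="Featured")

# Take the handles from one panel and attach a single legend to the figure
handles, labels = ax1.get_legend_handles_labels()
shared = fig.legend(handles, labels, loc="upper center", ncol=2)
```

Because the legend belongs to the figure rather than to either axes, neither panel loses plotting area to it.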

Polishing

Small finishing touches make a substantial visual difference.

sns.despine removes axis borders (called spines). By default it removes the top and right borders. Passing left=True, bottom=True removes all four, leaving the data in open space.

sns.set_style('whitegrid')
sns.despine(left=True, bottom=True)

Set the style once at the top of a notebook and every subsequent plot inherits it.

When you have multiple confidence bands that overlap heavily, faceting separates them instead of letting them pile into an unreadable blob.

g = sns.FacetGrid(region_so2, col='city', col_wrap=2)
g.map(plt.fill_between, 'day', 'lower', 'upper', color='coral')
g.map(plt.plot, 'day', 'mean', color='white')
plt.show()

When you only have two groups to compare, keeping them on one chart with transparency is often cleaner than faceting.

for city, color in [('Denver', '#66c2a5'), ('Long Beach', '#fc8d62')]:
    city_data = so2_compare[so2_compare.city == city]
    plt.fill_between(
        x='day', y1='lower', y2='upper',
        data=city_data, color=color, alpha=0.4
    )
    plt.plot('day', 'mean', data=city_data, label=city, color=color, alpha=0.25)
plt.legend()
plt.show()

The hex codes above are the first two colors from ColorBrewer’s Set2, chosen to be distinct and harmonious.

First Pass on a New Dataset

When you meet an unfamiliar DataFrame, a consistent opening routine surfaces problems and patterns quickly before you commit to any specific chart.

# First rows, transposed so wide DataFrames do not get cut off
print(venues.head(3).transpose())
# Summary stats for all columns, including text, with median
print(venues.describe(include='all', percentiles=[0.5]).transpose())
# All-pairs scatter grid for numeric columns
pd.plotting.scatter_matrix(venues[numeric_columns], figsize=[15, 10], alpha=0.5)
plt.show()
# Log-transform heavily skewed variables
venues['log_population'] = np.log(venues['population'])

describe(include='all') covers both numeric and text columns. For text columns it returns count, unique, top, and frequency rather than mean and std. Transposing again keeps it readable when there are many columns. scatter_matrix shows every numeric column against every other; the diagonal shows each column’s own distribution. Population is a classic case for a log transform: a few enormous values compress everything else toward zero on a linear scale. After logging, the distribution becomes more symmetric and relationships to other variables become clearer.
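The effect of the log transform on skew can be checked directly. This sketch simulates a long-tailed "population" column from a lognormal distribution (an assumption made purely for illustration) and compares skewness before and after the transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulated populations: lognormal, so a few values are enormous
population = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=500))
log_population = np.log(population)

# Skewness near 0 indicates a roughly symmetric distribution
print(round(population.skew(), 2), round(log_population.skew(), 2))
```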

Quick Reference

# Per-point color list
point_colors = ['orangered' if condition else 'lightgray' for ...]
sns.regplot(..., fit_reg=False, scatter_kws={'facecolors': point_colors, 'alpha': 0.7})
# Highlight via column
df['point_label'] = ['Special' if cond else 'Other' for ...]
sns.scatterplot(..., hue='point_label')
# Text annotation
plt.text(x, y, 'message', fontdict={'ha': 'left', 'size': 'large'})
# Arrow annotation
plt.annotate('label', xy=(x, y), xytext=(tx, ty),
arrowprops={'facecolor': 'gray', 'width': 3, 'shrink': 0.03},
backgroundcolor='white')
# KDE comparison
sns.kdeplot(df[cond].col, fill=True, label='group1')  # fill= replaces the deprecated shade=
sns.kdeplot(df[~cond].col, fill=True, label='group2')
# Beeswarm
sns.swarmplot(y='category', x='value', data=df, size=3)
# FacetGrid
g = sns.FacetGrid(data=df, col='category', col_wrap=3)
g.map(sns.scatterplot, 'x_col', 'y_col', alpha=0.2)
# Color palettes
pal = sns.light_palette('orangered', as_cmap=True) # sequential
pal = sns.diverging_palette(250, 0, as_cmap=True) # diverging
palette = 'Set2' # categorical (ColorBrewer)
palette = 'GnBu' # ordinal (ColorBrewer)
# CI bounds
df['lower'] = df['mean'] - 1.96 * df['std_err']
df['upper'] = df['mean'] + 1.96 * df['std_err']
# CI as whiskers
plt.hlines(y='row', xmin='lower', xmax='upper', linewidth=5, data=df)
plt.axvline(x=0, color='orangered', linestyle='--') # null reference
# CI as band
plt.fill_between(x='day', y1='lower', y2='upper', data=df, alpha=0.4)
# Multi-panel
_, (ax1, ax2) = plt.subplots(2, 1) # stacked
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5)) # side by side
sns.despine(left=True, bottom=True) # strip borders

See you soon.
