Visualising Data in Python with Matplotlib

Matplotlib is a foundational Python plotting library for data visualisation. It offers tools for creating many kinds of chart, labelling them, and customising their appearance.

If you have ever tried to understand a dataset by staring at numbers in a spreadsheet, you already know why visualisation matters. Matplotlib is the foundational plotting library in Python, and understanding how it works gives you control over every chart you will ever build.

How Matplotlib Thinks About Plots

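Every matplotlib chart starts with the same few lines. Only the pyplot import and the plt.subplots() call are needed to draw; pandas is imported here because later examples use it: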

import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots()
plt.show()

The mental model here is important. fig is the entire canvas, like a sheet of paper. ax is the rectangular drawing area on that paper where your actual chart goes. plt.subplots() creates both at once and hands them to you via two variables. You then draw by calling methods on ax, and when you are done, plt.show() renders the result. The reason matplotlib separates figure and axes will become clear when you start placing multiple charts on one canvas.

Drawing Lines

ax.plot() is the call you will use most often. Pass it two sequences of equal length, one for x values and one for y values, and it connects the corresponding points with a line:

ax.plot(months, london_revenue)

Calling it a second time on the same axes adds a second line rather than starting a new chart. Matplotlib assigns different colours automatically so the lines are distinguishable:

ax.plot(months, london_revenue)
ax.plot(months, paris_revenue)

Both lines appear on the same plot. The first call is drawn first, the second on top.
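The snippets above assume that months, london_revenue, and paris_revenue already exist. As a minimal self-contained sketch, with sample values invented purely for illustration, the two-line chart could be built like this:

```python
import matplotlib.pyplot as plt

# Hypothetical sample data, invented for illustration only
months = ["Jan", "Feb", "Mar", "Apr"]
london_revenue = [120, 135, 128, 150]
paris_revenue = [100, 110, 125, 140]

fig, ax = plt.subplots()
ax.plot(months, london_revenue)  # first line, first colour in the default cycle
ax.plot(months, paris_revenue)   # second line on the same axes, next colour

# Each ax.plot() call added one line object to the same axes
line_colours = [line.get_color() for line in ax.get_lines()]
print(line_colours)
```

Add plt.show() at the end to display the chart in an interactive session.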

Controlling How Lines Look

Three keyword arguments inside ax.plot() control the visual appearance of a line:

ax.plot(months, london_revenue,
        color='#2E86AB',
        marker='o',
        linestyle='--')

color accepts a short code like 'b' for blue or a hex string. marker places a symbol at each data point: circles with 'o', squares with 's', downward-pointing triangles with 'v'. linestyle controls the pattern of the connecting line: solid with '-', dashed with '--', dotted with ':'. Setting linestyle='None' (the string, not Python’s None) removes the line entirely and leaves only the markers, which is a quick way to produce a scatter-like effect using ax.plot.
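To make the markers-only trick concrete, here is a small sketch. The data values are hypothetical, and the variable name points is just a handle on the line object that ax.plot returns:

```python
import matplotlib.pyplot as plt

months = [1, 2, 3, 4]                  # hypothetical month numbers
london_revenue = [120, 135, 128, 150]  # hypothetical revenue values

fig, ax = plt.subplots()
# ax.plot returns a list of line objects; unpack the single one
(points,) = ax.plot(months, london_revenue,
                    color="#2E86AB",
                    marker="o",
                    linestyle="None")  # the string "None": markers only, no connecting line

print(points.get_marker(), points.get_linestyle())
```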

Labelling Your Chart

A chart without labels is a puzzle. These three calls put text where it belongs:

ax.set_xlabel("Month")
ax.set_ylabel("Revenue (GBP)")
ax.set_title("Monthly Revenue: London vs Paris")

All three are called on ax rather than plt. This matters when you have multiple subplots later, because each axes object carries its own labels independently. Always include units in axis labels. Bare numbers tell the reader nothing about what they are looking at.

When category labels along the x-axis are long enough to overlap, rotating them fixes the problem:

ax.set_xticklabels(results.index, rotation=90)

Vertical labels at 90 degrees give long names room to breathe. A 45-degree rotation is a good middle ground when you want something less abrupt.
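If you only want to change the angle and keep whatever labels matplotlib has already chosen, one alternative is ax.tick_params with labelrotation. This is a sketch of that alternative rather than the approach shown above, and the category names are invented:

```python
import matplotlib.pyplot as plt

# Hypothetical categories and values
regions = ["North Region", "South Region", "East Region"]
sales = [3, 5, 2]

fig, ax = plt.subplots()
ax.bar(regions, sales)

# Rotate the existing x tick labels without replacing their text
ax.tick_params(axis="x", labelrotation=45)

fig.canvas.draw()  # make sure the tick labels are laid out before inspecting them
rotations = [label.get_rotation() for label in ax.get_xticklabels()]
print(rotations)
```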

Placing Multiple Charts on One Canvas

When you want to compare several charts side by side rather than stacking everything onto one plot, pass row and column counts to plt.subplots():

fig, ax = plt.subplots(2, 2)

Now ax is a two-dimensional array. You access each individual plot by its position in the grid: ax[0, 0] is the top left, ax[0, 1] is the top right, ax[1, 0] is the bottom left, and ax[1, 1] is the bottom right. Each one is independent and accepts its own plot() calls, labels, and titles.
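A quick way to check the grid layout is to title each panel by its position. The titles dictionary below is only an illustrative helper:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 2)
print(ax.shape)  # (2, 2)

# Map each (row, column) position to a descriptive title
titles = {
    (0, 0): "Top left",
    (0, 1): "Top right",
    (1, 0): "Bottom left",
    (1, 1): "Bottom right",
}
for (row, col), title in titles.items():
    ax[row, col].set_title(title)

fig.tight_layout()  # reduce overlap between neighbouring panels
```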

For a single column of stacked charts, one index is enough:

fig, ax = plt.subplots(2, 1, sharey=True)
ax[0].plot(months, london_revenue)
ax[1].plot(months, berlin_revenue)

The sharey=True argument forces both charts to use the same y-axis scale, which is essential for honest comparison. Without it, matplotlib auto-scales each subplot independently, and a small change can be made to look large simply because the axes are different.
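The effect of sharey can be checked by comparing the y-axis limits with and without it. The revenue figures below are made up, chosen so the two cities sit on very different scales:

```python
import matplotlib.pyplot as plt

# Hypothetical data on very different scales
months = [1, 2, 3, 4]
london_revenue = [120, 135, 128, 150]
berlin_revenue = [20, 22, 21, 25]

# With sharey=True both panels use one common y-axis range
fig, ax = plt.subplots(2, 1, sharey=True)
ax[0].plot(months, london_revenue)
ax[1].plot(months, berlin_revenue)
shared_limits = (ax[0].get_ylim(), ax[1].get_ylim())

# Without it, each panel is auto-scaled to its own data
fig_free, ax_free = plt.subplots(2, 1)
ax_free[0].plot(months, london_revenue)
ax_free[1].plot(months, berlin_revenue)
independent_limits = (ax_free[0].get_ylim(), ax_free[1].get_ylim())

print(shared_limits)
print(independent_limits)
```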

Two Variables on One Chart with Twin Axes

Sometimes you want to plot two variables that share an x-axis but live on completely different scales. Plotting them both on the same y-axis would make one of them unreadable:

fig, ax = plt.subplots()
ax.plot(energy_data.index, energy_data['emissions'], color='steelblue')
ax2 = ax.twinx()
ax2.plot(energy_data.index, energy_data['avg_temp'], color='firebrick')

ax.twinx() creates a second axes that shares the x-axis but has its own independent y-axis on the right side. Each variable gets its own scale. Colour-code the lines so it is obvious which axis belongs to which line. Use this technique sparingly: because you control both axes independently, it is easy to adjust the scales in ways that make a relationship look stronger or weaker than it really is.
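One way to carry the colour coding through to the axes themselves is to colour each y-axis label and its tick labels to match its line. The following is a sketch with invented values and placeholder units:

```python
import matplotlib.pyplot as plt

# Hypothetical yearly values, for illustration only
years = [2018, 2019, 2020, 2021]
emissions = [410, 395, 360, 380]
avg_temp = [14.6, 14.8, 15.0, 14.9]

fig, ax = plt.subplots()
ax.plot(years, emissions, color="steelblue")
ax.set_ylabel("Emissions (hypothetical units)", color="steelblue")
ax.tick_params(axis="y", labelcolor="steelblue")

# Second axes: same x-axis, independent y-axis on the right
ax2 = ax.twinx()
ax2.plot(years, avg_temp, color="firebrick")
ax2.set_ylabel("Average temperature (°C)", color="firebrick")
ax2.tick_params(axis="y", labelcolor="firebrick")
```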

Pointing at What Matters

Annotations let you direct attention to a specific moment or data point:

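# Annotate on ax, the axes that holds the emissions line,
# so that xy is interpreted on the emissions scale
ax.annotate("Peak emissions",
            xy=(pd.Timestamp('2022-06-15'), 1.2),
            xytext=(pd.Timestamp('2019-01-01'), 0.4),
            arrowprops={"arrowstyle": "->", "color": "gray"})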

xy is where the arrow points: the actual data point you want to highlight. xytext is where the text label sits, usually somewhere with more room. Both are given in the data coordinates of the axes you call annotate on, so call it on the axes that holds the line you are highlighting. Because arrowprops is supplied, matplotlib draws an arrow from the text to the data point. Without xytext, the label is placed at the data point itself, leaving no room for an arrow. This is the standard way to call out a notable event in a time-series chart, whether that is a peak, a threshold being crossed, or a change in trend.
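Rather than hard-coding the coordinates, you can look up the point to highlight from the data. A minimal sketch with invented numbers, where the label position (1, 45) is an arbitrary spot with free space:

```python
import matplotlib.pyplot as plt

# Hypothetical series with an obvious peak
x = [1, 2, 3, 4, 5]
y = [10, 25, 55, 30, 20]

fig, ax = plt.subplots()
ax.plot(x, y)

# Find the position of the largest y value
peak_index = y.index(max(y))

note = ax.annotate("Peak",
                   xy=(x[peak_index], y[peak_index]),  # where the arrow points
                   xytext=(1, 45),                     # where the text sits
                   arrowprops={"arrowstyle": "->", "color": "gray"})
```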

Other Chart Types

Bar charts compare counts or amounts across categories. Each category gets a rectangle with height proportional to its value:

ax.bar(channel_labels, conversion_counts)
ax.set_ylabel("Conversions")

For stacked bars, each layer starts where the previous one ended, which the bottom argument controls:

ax.bar(results.index, results['first_place'], label='Gold')
ax.bar(results.index, results['second_place'], label='Silver', bottom=results['first_place'])
ax.bar(results.index, results['third_place'], label='Bronze',
       bottom=results['first_place'] + results['second_place'])
ax.legend()

Each subsequent call uses the cumulative sum of the layers below it as its starting point. The label arguments feed into ax.legend(), which draws the colour key.
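With more than a few layers, writing each cumulative sum by hand gets error-prone. One possible approach is to keep a running total in a loop. The results table here is invented, with column names mirroring the example above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical medal table
results = pd.DataFrame(
    {"first_place": [3, 1, 2], "second_place": [2, 2, 1], "third_place": [1, 4, 0]},
    index=["Team A", "Team B", "Team C"],
)

fig, ax = plt.subplots()
running_total = pd.Series(0, index=results.index)  # height of the stack so far

layers = [("first_place", "Gold"), ("second_place", "Silver"), ("third_place", "Bronze")]
for column, label in layers:
    # Each layer starts where the stack currently ends
    ax.bar(results.index, results[column], bottom=running_total, label=label)
    running_total = running_total + results[column]

ax.legend()
print(running_total.tolist())  # [6, 7, 3]
```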

Histograms show how a single variable is distributed across a range:

ax.hist(sprinters['power_output'], label='Sprinters', histtype='step', bins=5)
ax.hist(cyclists['power_output'], label='Cyclists', histtype='step', bins=5)
ax.legend()

When comparing two distributions on the same axes, histtype='step' draws only the outline of each bar rather than a filled rectangle. Overlapping filled bars become a visual mess. Outlines stay legible even when the distributions overlap heavily.
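One detail worth checking when overlaying histograms: with bins=5, each call computes its five bins from its own data range, so the two outlines may not share bin edges. A possible refinement, sketched here with randomly generated stand-in data, is to compute one set of edges from the combined data and pass it to both calls:

```python
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated stand-in data (not real measurements)
rng = np.random.default_rng(0)
sprinters_power = rng.normal(1200, 100, 200)
cyclists_power = rng.normal(900, 150, 200)

# One common set of 5 bins spanning both groups
combined = np.concatenate([sprinters_power, cyclists_power])
shared_edges = np.histogram_bin_edges(combined, bins=5)

fig, ax = plt.subplots()
_, edges_a, _ = ax.hist(sprinters_power, bins=shared_edges, histtype="step", label="Sprinters")
_, edges_b, _ = ax.hist(cyclists_power, bins=shared_edges, histtype="step", label="Cyclists")
ax.legend()
```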

Error bars show uncertainty around a central value. On bar charts, the yerr argument adds a vertical whisker:

ax.bar("Sprinters", sprinters['time_seconds'].mean(), yerr=sprinters['time_seconds'].std())
ax.bar("Cyclists", cyclists['time_seconds'].mean(), yerr=cyclists['time_seconds'].std())

On line charts, ax.errorbar does the same thing point by point:

ax.errorbar(months, london_revenue['mean'], yerr=london_revenue['std'])

A short whisker means the data clusters tightly around the average. A long one means high variability. Showing only the mean without the spread is incomplete.

Box plots summarise a full distribution in a compact shape:

ax.boxplot([sprinters['power_output'], cyclists['power_output']])
ax.set_xticklabels(['Sprinters', 'Cyclists'])

The box spans the middle fifty percent of the data, with the median as a line inside it. By default, whiskers extend to the most extreme values that lie within 1.5 times the interquartile range of the box edges, and anything beyond that appears as individual points flagged as outliers. Box plots are useful when you want to compare several distributions at a glance without building a separate histogram for each.
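The outlier rule can be reproduced by hand. The sketch below uses a small invented dataset, computes the quartiles and the 1.5 × IQR fences with NumPy, and compares the result with the points matplotlib flags:

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented data with one obvious outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones drawn as individual points
outliers = data[(data < lower_fence) | (data > upper_fence)]

fig, ax = plt.subplots()
box = ax.boxplot(data)
fliers = box["fliers"][0].get_ydata()  # outlier points drawn by matplotlib

print(outliers, fliers)
```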

Scatter plots show the relationship between two variables:

ax.scatter(energy_data['emissions'], energy_data['avg_temp'])

Use ax.scatter rather than ax.plot when there is no meaningful order between the points and connecting them with a line would imply a sequence that does not exist. The c argument encodes a third variable through colour:

ax.scatter(energy_data['emissions'], energy_data['avg_temp'],
           c=energy_data.index)

Passing an array-like of values to c (here the DataFrame's index) colours each point by that variable, turning a two-variable chart into one that shows three variables at once.
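Colour only carries information if the reader can decode it, so a colour bar is a natural companion to c. A sketch using fig.colorbar with invented values, where the year is the third variable:

```python
import matplotlib.pyplot as plt

# Hypothetical values for illustration
emissions = [410, 395, 360, 380]
avg_temp = [14.6, 14.8, 15.0, 14.9]
years = [2018, 2019, 2020, 2021]

fig, ax = plt.subplots()
points = ax.scatter(emissions, avg_temp, c=years)  # colour encodes the year

# The colour bar maps colours back to year values
fig.colorbar(points, ax=ax, label="Year")

ax.set_xlabel("Emissions (hypothetical units)")
ax.set_ylabel("Average temperature (°C)")
```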

Working with Time-Series Data

Loading a CSV with dates requires two arguments to make it fully functional:

energy_data = pd.read_csv('energy_records.csv',
                          parse_dates=['report_date'],
                          index_col='report_date')

parse_dates tells pandas to treat the listed columns as actual dates rather than strings. index_col promotes the date column to the row index, giving you a proper DatetimeIndex. Once both are in place, plotting time series is straightforward:

ax.plot(energy_data.index, energy_data['avg_temp'])

Matplotlib recognises datetime values on the x-axis and formats the tick labels automatically, choosing months for short spans and years for longer ones.
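To try this without a file on disk, the same read_csv arguments can be applied to an in-memory CSV. The three rows below are invented and stand in for energy_records.csv:

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Invented CSV content standing in for energy_records.csv
csv_text = """report_date,avg_temp
2022-01-01,4.1
2022-02-01,5.3
2022-03-01,8.0
"""

energy_data = pd.read_csv(io.StringIO(csv_text),
                          parse_dates=["report_date"],
                          index_col="report_date")

print(type(energy_data.index).__name__)  # DatetimeIndex

fig, ax = plt.subplots()
ax.plot(energy_data.index, energy_data["avg_temp"])
ax.set_ylabel("Average temperature (°C)")
```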

Automating Repetitive Charts

When you have the same chart to produce for many categories, a loop is far cleaner than copying and pasting:

for channel in channels:
    # Filter to one channel, then add its bar with an error whisker
    channel_df = campaign_data[campaign_data['channel'] == channel]
    ax.bar(channel, channel_df['conversion_rate'].mean(),
           yerr=channel_df['conversion_rate'].std())

# These apply to the finished chart, so they run once after the loop
ax.set_ylabel("Conversion Rate")
ax.set_xticklabels(channels, rotation=90)
fig.savefig("channel_performance.png")

Each iteration filters the data to one channel, calculates the mean and standard deviation, and adds one bar to the chart. The labels and the save call happen once after the loop because they apply to the finished chart as a whole.

Saving Charts to File

When you want the chart as an image rather than a screen display, call savefig on the figure object:

fig.savefig('channel_performance.png')
fig.savefig('channel_performance_hires.png', dpi=300)

The file format is inferred from the extension. PNG works well for screens and the web. PDF and SVG produce vector graphics that scale without becoming pixelated. The default resolution is 100 dots per inch, which is fine for screen use. Print-quality output typically needs at least 300.
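The effect of the extension and of dpi can be seen by saving one figure several ways and comparing file sizes. This sketch writes into a temporary directory, and the file names are arbitrary:

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])

output_dir = Path(tempfile.mkdtemp())  # throwaway directory for the demo
saved = {}

# Same figure, three outputs: default PNG, high-resolution PNG, vector SVG
for name, options in [("chart.png", {}),
                      ("chart_hires.png", {"dpi": 300}),
                      ("chart.svg", {})]:
    path = output_dir / name
    fig.savefig(path, **options)  # format inferred from the extension
    saved[name] = path.stat().st_size

print(saved)  # file sizes in bytes
```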

The Standard Template

Almost every matplotlib chart follows the same five-step structure. Memorise this skeleton and adapt the middle step for each chart type:

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(months, london_revenue,
        color='steelblue',
        marker='o',
        linestyle='--',
        label='London')
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (GBP)")
ax.set_title("Monthly Revenue Performance")
ax.legend()
plt.show()

The one thing that trips people up: the legend will not appear unless both conditions are met. The label argument must be passed inside the ax.plot() call to name the series, and ax.legend() must be called separately to actually draw the legend box. Either one alone produces nothing.
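The two-step nature of the legend can be checked directly: after ax.plot with a label there is still no legend object on the axes, and one only exists once ax.legend() has been called. A small sketch with invented data:

```python
import matplotlib.pyplot as plt

months = [1, 2, 3, 4]                  # hypothetical
london_revenue = [120, 135, 128, 150]  # hypothetical

fig, ax = plt.subplots()
ax.plot(months, london_revenue, label="London")

legend_before = ax.get_legend()  # None: a label alone draws nothing

legend = ax.legend()             # now the legend box is created
legend_labels = [text.get_text() for text in legend.get_texts()]

print(legend_before, legend_labels)
```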

Quick Reference

Task: Code
Create canvas: fig, ax = plt.subplots()
Line plot: ax.plot(x, y)
Bar chart: ax.bar(x, y)
Histogram: ax.hist(data, bins=5)
Scatter plot: ax.scatter(x, y)
Box plot: ax.boxplot([col1, col2])
Error bars on line: ax.errorbar(x, y, yerr=err)
Error bars on bar: ax.bar(x, y, yerr=err)
Second y-axis: ax2 = ax.twinx()
2×2 subplots: fig, ax = plt.subplots(2, 2)
Shared y-axis: plt.subplots(2, 1, sharey=True)
Annotate with arrow: ax.annotate("text", xy=(x, y), xytext=(tx, ty), arrowprops={...})
X label: ax.set_xlabel("label")
Y label: ax.set_ylabel("label")
Title: ax.set_title("title")
Legend: ax.legend()
Rotate tick labels: ax.set_xticklabels(labels, rotation=90)
Save to file: fig.savefig("file.png", dpi=300)

See you soon
