July 22, 2026

10 min read

Statistics

Understanding Your Data with Visualisation and Descriptive Statistics

Before you can model data, test a hypothesis, or build a dashboard, you need to understand what the data actually looks like. That understanding comes from two complementary toolkits: visualisations that reveal the shape of a distribution at a glance, and descriptive statistics that summarise it in numbers. This guide walks through both, from classifying variables to calculating spread, with worked examples throughout.

Knowing What Kind of Variable You Have

Everything starts with the type of variable you are dealing with, because it determines which charts and statistics make sense. Variables fall into two broad families. Measurable variables have a genuine numerical scale with a natural ordering. Categorical variables do not; they simply sort observations into groups.

Within measurable variables, there is a further split. Discrete variables are counts in whole numbers, like the number of passengers on a flight. Continuous variables can be measured to any precision, like height, weight, or time. Categorical variables split too. Ordinal categories have a natural order, like a satisfaction rating of poor, fair, or good. Nominal categories have no order at all, like nationality or gender.

Type	Sub-type	Description	Examples
Measurable	Discrete	Count in whole numbers	Passengers on a flight
Measurable	Continuous	Measured to any precision	Height, weight, time
Categorical	Ordinal	Categories with a natural order	Satisfaction: poor / fair / good
Categorical	Nominal	Categories with no natural order	Gender, nationality, religion

To see how this works in practice, consider four variables. The number of phone calls received today is something you count in whole numbers, $0, 1, 2, \ldots$, so it is discrete. Time spent on hold is measured in seconds to any decimal, so it is continuous. A customer satisfaction rating of dissatisfied, neutral, or satisfied has ordered categories but no numeric scale, so it is ordinal categorical. Country of birth has no ordering between values, so it is nominal categorical. Getting this classification right is the foundation everything else rests on.

Dot Plots: The Simplest View

For small datasets, a dot plot is often the quickest way to see what is going on. You place one dot per observation above a horizontal axis, stacking dots vertically where values repeat. The result gives an immediate sense of where values cluster, where the gaps are, and whether anything looks extreme.

There is no formula. You draw a horizontal axis spanning the range of the data, then place a dot at each observation’s value. Take the values 3, 5, 5, 7, 8, 8, 8, 10. The axis runs from 3 to 10. You place one dot at 3, two at 5, one at 7, three at 8, and one at 10. The stack of three dots at 8 immediately marks it as the most common value, and the gap between 3 and 5 stands out as a region worth a second look.
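The construction above can be sketched in a few lines of Python. This is only an illustration, assuming integer data; the function name `text_dot_plot` is made up for this example and prints asterisks in place of dots.

```python
from collections import Counter

def text_dot_plot(values):
    """Return a text dot plot: one column per axis value, one mark per observation."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    tallest = max(counts.values())
    rows = []
    # Build rows from the top of the tallest stack down to the axis.
    for level in range(tallest, 0, -1):
        marks = ["*" if counts[v] >= level else " " for v in axis]
        rows.append(" ".join(f"{m:>2}" for m in marks))
    # Final row is the axis itself, including values with no observations.
    rows.append(" ".join(f"{v:>2}" for v in axis))
    return "\n".join(rows)

data = [3, 5, 5, 7, 8, 8, 8, 10]
print(text_dot_plot(data))
```

The empty columns at 4, 6, and 9 are kept on purpose, so gaps in the data stay visible.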

Histograms: Grouping into Bins

When datasets get larger, you group values into class intervals, or bins, and draw a histogram. The single most important rule of a histogram is that the area of each bar is proportional to the frequency of that interval:

\text{Area of bar} \propto \text{frequency of that interval}

This rule matters most when bin widths are unequal. In that case, you must plot frequency density on the vertical axis rather than raw frequency:

\text{Frequency density} = \frac{\text{frequency}}{\text{interval width}}

When all bins have the same width, the heights are proportional to the frequencies directly and you can plot frequency without worry. But the moment widths differ, plotting raw frequency would distort the picture, making wide intervals look more important than they are.

Consider weekly production data with unequal bin widths. The first step is to build a frequency table that includes the density:

Interval	Width	Frequency	Frequency density
[300, 360)	60	6	6/60 = 0.100
[360, 380)	20	14	14/20 = 0.700
[380, 400)	20	10	10/20 = 0.500
[400, 420)	20	4	4/20 = 0.200
[420, 460)	40	13	13/40 = 0.325
[460, 500)	40	3	3/40 = 0.075

You then draw adjacent bars over each interval, with the height equal to the frequency density. A quick sanity check confirms the logic: the area of the [360, 380) bar is $0.700 \times 20 = 14$, which matches its frequency exactly. The bars sit flush against each other with no gaps, the vertical axis is frequency density rather than frequency, and as a rule of thumb you want around six or seven bins to balance detail against clarity.
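As a quick check of the table, here is a minimal Python sketch that computes the densities and confirms that each bar's area recovers its frequency. The tuple layout `(lower, upper, frequency)` and the function name are assumptions made for this example.

```python
# Weekly production classes from the table above: (lower, upper, frequency).
classes = [
    (300, 360, 6), (360, 380, 14), (380, 400, 10),
    (400, 420, 4), (420, 460, 13), (460, 500, 3),
]

def frequency_densities(classes):
    """Return one density per class: frequency divided by interval width."""
    return [freq / (upper - lower) for lower, upper, freq in classes]

densities = frequency_densities(classes)
for (lower, upper, freq), density in zip(classes, densities):
    area = density * (upper - lower)  # bar area recovers the class frequency
    print(f"[{lower}, {upper})  density={density:.3f}  area={area:.0f}")
```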

Stem-and-Leaf Diagrams: Keeping the Raw Values

A stem-and-leaf diagram is a clever hybrid. It shows the shape of the distribution like a histogram, but unlike a histogram it preserves every original data value. The stem holds the leading digits and each leaf is the remaining digit. If you rotate the diagram 90 degrees anti-clockwise, the leaves form the same shape a histogram would.

To build one, sort the data, choose a stem unit such as tens, so that the stem is everything up to the tens digit and the leaf is the units digit, then list each stem vertically with its leaves written horizontally beside it. Take the values 354, 358, 360, 362, 365, 371, 381, 393 with the tens as the stem:

Stem (tens)	Leaves (units)
35	4 8
36	0 2 5
37	1
38	1
39	3

The diagram labels itself. The first row reads as 354 and 358, and you can recover every original value exactly. Rotating it shows a distribution concentrated around the 360s. This ability to keep the raw numbers while still seeing the shape is what makes the stem-and-leaf diagram uniquely useful.
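The same table can be produced programmatically. The sketch below assumes non-negative integers and a units-digit leaf; `stem_and_leaf` is a hypothetical helper, and it skips stems with no leaves, which a hand-drawn diagram would normally keep as empty rows.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by stem (all digits but the last), with the units digit as leaf."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    return dict(rows)

data = [354, 358, 360, 362, 365, 371, 381, 393]
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```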

The Mean: The Average Everyone Knows

Now we move from pictures to numbers, starting with measures of location, which describe the centre of the data. The sample mean, written $\bar{x}$, is the arithmetic average and the most commonly used measure of location. It uses every observation, which is its strength, but that also makes it sensitive to outliers.

For raw data, the mean is the sum of all values divided by how many there are:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

For grouped data, you weight each class midpoint by its frequency:

\bar{x} = \frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k}

where $x_k$ is the midpoint of class k and $f_k$ is its frequency.

Take the raw dataset 32, 28, 67, 39, 19, 48, 32, 44, 37, 24. With n = 10, the values sum to 370, so the mean is 370 / 10 = 37.

For grouped data, consider trading-volume data with seven equal classes of width 10. You extend the frequency table with a column for $f_k x_k$:

Midpoint x_k	Frequency f_k	f_k x_k
125	1	125
135	4	540
145	5	725
155	6	930
165	7	1,155
175	5	875
185	1	185
Total	29	4,535

The mean is then:

\bar{x} = \frac{4{,}535}{29} = 156.4 \text{ million shares/week}
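Both calculations are easy to reproduce in Python. The sketch below uses the standard library's `statistics.mean` for the raw data and a small hypothetical helper, `grouped_mean`, for the frequency table.

```python
from statistics import mean

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
print(mean(raw))  # raw-data mean: sum of values divided by n

midpoints = [125, 135, 145, 155, 165, 175, 185]
frequencies = [1, 4, 5, 6, 7, 5, 1]

def grouped_mean(midpoints, frequencies):
    """Weighted average of class midpoints, weighted by class frequency."""
    total = sum(f * x for x, f in zip(midpoints, frequencies))
    return total / sum(frequencies)

print(round(grouped_mean(midpoints, frequencies), 1))
```

The grouped figure is an approximation, since it treats every observation in a class as if it sat at the midpoint.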

The Median: The Middle Value

The sample median, written $m$, is the middle value of the ordered data. Roughly half the observations lie below it and half above. Its great advantage over the mean is that it is barely affected by outliers, which makes it the better measure of location for skewed data.

To find it, sort the data first. If there is an odd number of values, the median is the single middle one. If there is an even number, it is the average of the two middle values:

If n is odd: $m = x_{((n+1)/2)}$
If n is even: $m = \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2}$

For grouped data, you interpolate within the class that contains the middle observation:

m = \text{(lower endpoint of median class)} + \frac{\text{class width} \times (\text{target rank} - \text{cumulative frequency before class})}{\text{class frequency}}

Take the earlier dataset, ordered as 19, 24, 28, 32, 32, 37, 39, 44, 48, 67. With n = 10, an even number, you average the 5th and 6th values, which are 32 and 37, giving a median of

(32 + 37)/2 = 34.5

For grouped weekly production data with n = 50, the target rank is the 25.5th observation. The cumulative frequency reaches 20 after the [360, 380) interval and 30 after [380, 400), so the 25.5th value falls in [380, 400). Interpolating gives

380 + \frac{20 \times (25.5 - 20)}{10} = 380 + 11 = 391
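The interpolation can be written as a short function. This is a sketch under the convention used above, where the target rank is $(n + 1)/2$; the name `grouped_median` and the `(lower, upper, frequency)` layout are assumptions for the example.

```python
from statistics import median

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
print(median(raw))  # averages the 5th and 6th ordered values

def grouped_median(classes):
    """Interpolate the median within the class holding rank (n + 1) / 2.

    classes: list of (lower, upper, frequency) in increasing order.
    """
    n = sum(freq for _, _, freq in classes)
    target = (n + 1) / 2
    cumulative = 0  # frequency accumulated before the current class
    for lower, upper, freq in classes:
        if cumulative + freq >= target:
            return lower + (upper - lower) * (target - cumulative) / freq
        cumulative += freq
    raise ValueError("classes must contain at least one observation")

production = [
    (300, 360, 6), (360, 380, 14), (380, 400, 10),
    (400, 420, 4), (420, 460, 13), (460, 500, 3),
]
print(grouped_median(production))
```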

Skewness: Reading Asymmetry

Comparing the mean and median tells you something a single number cannot: how lopsided the distribution is. This is skewness. When the mean exceeds the median, the distribution is positively skewed, with a long tail stretching to the right. When they are equal, it is symmetric. When the mean is below the median, it is negatively skewed, with a long tail to the left.

Condition	Name	Shape
$\bar{x} > m$	Positively skewed (right-skewed)	Long tail to the right
$\bar{x} = m$	Symmetric	No tail
$\bar{x} < m$	Negatively skewed (left-skewed)	Long tail to the left

No calculation is needed beyond comparing the two figures. For the weekly production data, the mean is 399.72 and the median is 392.5. Since the mean exceeds the median, the long tail is pulling the mean to the right, so the distribution is positively skewed. The interpolated grouped median of 391 from the previous section leads to the same verdict. This is the intuition behind why the median is more robust: outliers in the tail drag the mean toward them but leave the median largely untouched.
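The comparison is simple enough to automate. The sketch below applies it to the ten-value dataset from the mean and median sections; `skew_direction` is a hypothetical helper, and the rule is best treated as a quick heuristic rather than a formal test of skewness.

```python
from statistics import mean, median

def skew_direction(values):
    """Classify skew by comparing mean and median (a heuristic, not a formal test)."""
    x_bar, m = mean(values), median(values)
    if x_bar > m:
        return "positive"
    if x_bar < m:
        return "negative"
    return "symmetric"

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
print(skew_direction(raw))  # mean 37 vs median 34.5
```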

The Mode: The Most Common Value

The mode is simply the value that occurs most often. A distribution can have more than one mode, making it bimodal or multimodal, and for grouped data you report the modal class, which is the interval with the highest frequency, or the highest frequency density when the class widths differ.

For the raw dataset 32, 28, 67, 39, 19, 48, 32, 44, 37, 24, the value 32 appears twice while every other value appears once, so the mode is 32. For grouped weekly production data, scanning the frequency column shows that [360, 380) has a frequency of 14, the highest of any interval. It also has the highest frequency density, 0.700, so it is the modal class on either criterion.
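A short Python sketch of both cases follows. It uses `statistics.multimode` (available from Python 3.8), which returns every value tied for most frequent, and picks the modal class by frequency density, which is the safer criterion when class widths are unequal. Here both criteria select the same class.

```python
from statistics import multimode

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
print(multimode(raw))  # list of all values tied for the highest count

# Weekly production classes: (lower, upper, frequency).
classes = [
    (300, 360, 6), (360, 380, 14), (380, 400, 10),
    (400, 420, 4), (420, 460, 13), (460, 500, 3),
]

# Modal class chosen by frequency density (frequency / width).
modal_class = max(classes, key=lambda c: c[2] / (c[1] - c[0]))
print(modal_class)
```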

The Range: The Simplest Measure of Spread

Having located the centre of the data, we now turn to measures of spread, which describe how dispersed it is. The simplest is the range, the gap between the largest and smallest values:

\text{Range} = x_{(n)} - x_{(1)}

For the dataset 19, 24, 28, 32, 32, 37, 39, 44, 48, 67, the smallest value is 19 and the largest is 67, so the range is 67 – 19 = 48. The range is trivial to compute but dangerously sensitive to outliers. If that 67 were replaced by 167, the range would leap to 148, even though only one value changed. That fragility is why we need a more robust measure.
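The sensitivity to a single value is easy to demonstrate. A minimal sketch, with `value_range` as a hypothetical helper name:

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

data = [19, 24, 28, 32, 32, 37, 39, 44, 48, 67]
with_outlier = data[:-1] + [167]  # replace 67 with 167, everything else unchanged

print(value_range(data))
print(value_range(with_outlier))
```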

The Interquartile Range: Spread of the Middle

The interquartile range, or IQR, measures the spread of the central 50% of the data, discarding the top and bottom quarters entirely. Because it ignores the extremes, it is robust to outliers in a way the range is not:

\text{IQR} = Q_3 - Q_1

Here $Q_1$ is the lower quartile, the 25th percentile, and $Q_3$ is the upper quartile, the 75th percentile. For a small sample with an even number of values, a clean method is to split the ordered data at the median into a lower half and an upper half, then take the median of each half.

For the ordered dataset 19, 24, 28, 32, 32, 37, 39, 44, 48, 67, the lower half is 19, 24, 28, 32, 32. It has five values, so its median is the third one, giving $Q_1 = 28$. The upper half is 37, 39, 44, 48, 67, whose median is $Q_3 = 44$. The IQR is therefore 44 – 28 = 16. Compared with the range of 48, this is much smaller, precisely because it excludes the extreme values 19 and 67. Other quartile conventions give slightly different figures: averaging the values either side of ranks n/4 and 3n/4, for example, gives 26 and 41.5 here.
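A minimal sketch of the split-at-the-median method in Python follows. Each half of this dataset holds five values, so its median is simply the third one. The helper `quartiles_by_halves` is hypothetical, and quartile conventions vary between textbooks and libraries, so other methods can return slightly different figures for the same data.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data.

    For an odd count the overall median is left out of both halves
    (one common convention; others exist). Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return median(lower), median(upper)

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
q1, q3 = quartiles_by_halves(raw)
print(q1, q3, q3 - q1)
```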

Boxplots: Five Numbers in One Picture

A boxplot turns these quartile-based measures into a visual summary. It displays five statistics at once: the minimum within the whisker bounds, $Q_1$, the median, $Q_3$, and the maximum within the whisker bounds, with any outliers plotted as individual points.

To construct one, draw a box from $Q_1$ to $Q_3$ and mark the median with a line inside it. Extend whiskers out to the furthest data points that lie within $1.5 \times \text{IQR}$ of each quartile. Any points beyond the whiskers are plotted individually as outliers.

Reading a boxplot becomes second nature once you know what each feature means. The width of the box is the IQR, showing the spread of the central half. The position of the median line within the box hints at symmetry or skewness. The whisker lengths show how far the tails reach, and the dots beyond them flag extreme observations. A long lower tail or outliers below suggests negative skewness; a long upper tail or outliers above suggests positive skewness.

Suppose a boxplot shows $Q_1 \approx 63$, a median of about 74, $Q_3 \approx 77$, and many outlier dots below the lower whisker. The upper part of the box spans only 77 – 74 = 3, while the lower part spans 74 – 63 = 11, so the spread below the median is far larger. Combined with the outliers sitting below, both signals point clearly to negative, or left, skewness.
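The whisker rule can be sketched as follows. The quartiles passed in are the approximate figures quoted above, but the `scores` list is invented purely for illustration, and `whisker_ends` is a hypothetical helper rather than any plotting library's API.

```python
def whisker_ends(values, q1, q3):
    """Return (lower whisker, upper whisker, outliers) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return min(inside), max(inside), outliers

# Hypothetical scores, invented for this example.
scores = [30, 38, 55, 63, 66, 70, 72, 74, 75, 76, 76, 77, 79, 82, 85]
print(whisker_ends(scores, q1=63, q3=77))
```

With an IQR of 14 the fences sit at 42 and 98, so the two lowest scores fall outside and would be drawn as individual points below the lower whisker.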

Variance and Standard Deviation: The Workhorses of Spread

The most important measures of spread for formal statistics are the variance and standard deviation. Both quantify how far the data spreads around the mean, they use every observation, and they underpin most of the statistical inference you will go on to do.

The starting point is the corrected sum of squares, which has a convenient computational form:

S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}x_i^2 - n\bar{x}^2

From this, the sample variance and standard deviation follow:

s^2 = \frac{S_{xx}}{n-1}, \qquad s = \sqrt{s^2}

The detail that catches people out is the division by n – 1 rather than n. For sample data, dividing by n – 1 corrects a bias that would otherwise make the variance too small. For grouped data with large n, where $n - 1 \approx n$, the variance can be computed directly from the frequency table:

s^2 = \frac{\sum_{k=1}^{K} f_k x_k^2}{\sum_{k=1}^{K} f_k} - \left(\frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k}\right)^2

Take the raw dataset 32, 28, 67, 39, 19, 48, 32, 44, 37, 24, which has a mean of 37. First find the sum of the squared values:

32^2 + 28^2 + 67^2 + 39^2 + 19^2 + 48^2 + 32^2 + 44^2 + 37^2 + 24^2 = 15{,}388

Then the corrected sum of squares is $S_{xx} = 15{,}388 - 10 \times 37^2 = 15{,}388 - 13{,}690 = 1{,}698$. The variance is $s^2 = 1{,}698 / 9 = 188.7$, and the standard deviation is

s = \sqrt{188.7} = 13.74
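The same steps in Python, as a sketch that follows the computational form of $S_{xx}$ and then cross-checks against the standard library's `statistics.variance`, which also divides by n – 1:

```python
from math import sqrt
from statistics import variance

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
n = len(raw)
x_bar = sum(raw) / n

# Corrected sum of squares: sum of squared values minus n times the squared mean.
s_xx = sum(x * x for x in raw) - n * x_bar ** 2
s2 = s_xx / (n - 1)  # sample variance
s = sqrt(s2)         # sample standard deviation

print(s_xx, round(s2, 1), round(s, 2))
print(round(variance(raw), 1))  # library cross-check
```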

For grouped trading-volume data with n = 29, you extend the table with a column for $f_k x_k^2$:

$x_k$	$f_k$	$f_k x_k$	$f_k x_k^2$
125	1	125	15,625
135	4	540	72,900
145	5	725	105,125
155	6	930	144,150
165	7	1,155	190,575
175	5	875	153,125
185	1	185	34,225
Total	29	4,535	715,725

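The mean is $4{,}535 / 29 \approx 156.38$, which rounds to 156.4. Carry the unrounded mean into the variance, because the formula subtracts two large, nearly equal numbers and even a small rounding error in the mean is magnified:

s^2 = \frac{715{,}725}{29} - \left(\frac{4{,}535}{29}\right)^2 = 24{,}680.17 - 24{,}454.49 = 225.7

and the standard deviation is $s = \sqrt{225.7} = 15.0$ million shares per week. Squaring the rounded mean of 156.4 instead would give 219.2 and $s = 14.8$, an understatement caused purely by rounding. The standard deviation is reported in the original units, which is exactly why it is more interpretable than the variance, whose units are squared and therefore harder to reason about.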
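The grouped formula can be sketched in Python as below; `grouped_variance` is a hypothetical helper that divides by n, as the large-n formula does. It keeps the mean at full precision. That matters here: with the unrounded mean the variance comes out near 225.7, whereas squaring a mean rounded to 156.4 gives roughly 219.2, because the formula is a small difference between two large numbers.

```python
from math import sqrt

midpoints = [125, 135, 145, 155, 165, 175, 185]
frequencies = [1, 4, 5, 6, 7, 5, 1]

def grouped_variance(midpoints, frequencies):
    """Mean of squared midpoints minus squared mean, both weighted by frequency."""
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(midpoints, frequencies)) / n
    mean_of_squares = sum(f * x * x for x, f in zip(midpoints, frequencies)) / n
    return mean_of_squares - mean ** 2

s2 = grouped_variance(midpoints, frequencies)
print(round(s2, 1), round(sqrt(s2), 1))
```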

Cumulative Frequency Diagrams: Reading Quartiles off a Curve

The final tool is the cumulative frequency diagram, which plots the running total of frequencies against the upper boundary of each class. Its great use is that you can read the median and quartiles straight off the graph. A close relative, the cumulative relative frequency diagram or ogive, plots cumulative percentages instead, using the relative frequency of each class:

\text{relative frequency}_k = \frac{f_k}{n}

Cumulating these gives percentages that build up to 100%.

For the trading-volume data with n = 29, you build a cumulative relative frequency table:

Interval	$f_k$	Relative freq (%)	Cumulative rel freq (%)
[120,130)	1	3.45	3.4
[130,140)	4	13.79	17.2
[140,150)	5	17.24	34.5
[150,160)	6	20.69	55.2
[160,170)	7	24.14	79.3
[170,180)	5	17.24	96.6
[180,190)	1	3.45	100.0

The cumulative column is computed from the running frequency total divided by 29, not by adding the rounded percentages, which is why the first entry is 3.4 rather than 3.5. You then plot a point at each upper class boundary, giving (130, 3.4), (140, 17.2), and so on up to (190, 100), anchor the curve at (120, 0), and connect the points with straight lines. To read the median, draw a horizontal line across at 50% and read off the corresponding value on the horizontal axis, which lands at roughly 157 million here. The same trick at 25% gives $Q_1$ and at 75% gives $Q_3$. This turns the diagram into a quick graphical calculator for any percentile you need.
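Reading a percentile off the curve amounts to linear interpolation between the plotted points, which can be sketched in Python. The helper names are hypothetical, and the curve is assumed to start at 0% at the lowest class boundary of 120.

```python
def cumulative_percentages(frequencies):
    """Running total of frequencies, expressed as a percentage of n."""
    n = sum(frequencies)
    running, result = 0, []
    for f in frequencies:
        running += f
        result.append(100 * running / n)
    return result

def read_percentile(boundaries, cum_pct, p):
    """Interpolate linearly along the curve; boundaries has one more entry than cum_pct."""
    points = list(zip(boundaries, [0.0] + cum_pct))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= p <= y1 and y1 > y0:
            return x0 + (x1 - x0) * (p - y0) / (y1 - y0)
    raise ValueError("p must lie between 0 and 100")

boundaries = [120, 130, 140, 150, 160, 170, 180, 190]
frequencies = [1, 4, 5, 6, 7, 5, 1]
cum_pct = cumulative_percentages(frequencies)

for p in (25, 50, 75):
    print(p, round(read_percentile(boundaries, cum_pct, p), 1))
```

The 50% reading comes out at about 157.5, in line with the graphical estimate of roughly 157.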

Bringing It Together

Descriptive statistics and visualisation are two views of the same thing. The charts (dot plots, histograms, stem-and-leaf diagrams, and boxplots) let you see the shape of a distribution: where it centres, how it spreads, and whether it leans one way. The numbers (mean, median, mode, range, IQR, and standard deviation) pin that shape down precisely. The two reinforce each other: a histogram suggests skewness, and comparing the mean to the median confirms it; a boxplot hints at spread, and the IQR quantifies it. Master both, and you can understand any dataset before you ever try to model it.

See you soon.

Andrei
