Understanding Your Data with Visualisation and Descriptive Statistics

Before you can model data, test a hypothesis, or build a dashboard, you need to understand what the data actually looks like. That understanding comes from two complementary toolkits: visualisations that reveal the shape of a distribution at a glance, and descriptive statistics that summarise it in numbers. This guide walks through both, from classifying variables to calculating spread, with worked examples throughout.

Knowing What Kind of Variable You Have

Everything starts with the type of variable you are dealing with, because it determines which charts and statistics make sense. Variables fall into two broad families. Measurable variables have a genuine numerical scale with a natural ordering. Categorical variables do not; they simply sort observations into groups.

Within measurable variables, there is a further split. Discrete variables are counts in whole numbers, like the number of passengers on a flight. Continuous variables can be measured to any precision, like height, weight, or time. Categorical variables split too. Ordinal categories have a natural order, like a satisfaction rating of poor, fair, or good. Nominal categories have no order at all, like nationality or gender.

| Type | Sub-type | Description | Examples |
|---|---|---|---|
| Measurable | Discrete | Count in whole numbers | Passengers on a flight |
| Measurable | Continuous | Measured to any precision | Height, weight, time |
| Categorical | Ordinal | Categories with a natural order | Satisfaction: poor / fair / good |
| Categorical | Nominal | Categories with no natural order | Gender, nationality, religion |

To see how this works in practice, consider four variables. The number of phone calls received today is something you count in whole numbers, $0, 1, 2, \ldots$, so it is discrete. Time spent on hold is measured in seconds to any decimal, so it is continuous. A customer satisfaction rating of dissatisfied, neutral, or satisfied has ordered categories but no numeric scale, so it is ordinal categorical. Country of birth has no ordering between values, so it is nominal categorical. Getting this classification right is the foundation everything else rests on.

Dot Plots: The Simplest View

For small datasets, a dot plot is often the quickest way to see what is going on. You place one dot per observation above a horizontal axis, stacking dots vertically where values repeat. The result gives an immediate sense of where values cluster, where the gaps are, and whether anything looks extreme.

There is no formula. You draw a horizontal axis spanning the range of the data, then place a dot at each observation’s value. Take the values $3, 5, 5, 7, 8, 8, 8, 10$. The axis runs from 3 to 10. You place one dot at 3, two at 5, one at 7, three at 8, and one at 10. The stack of three dots at 8 immediately marks it as the most common value, and the gap between 3 and 5 stands out as a region worth a second look.
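As a rough illustration of the stacking rule, here is a minimal Python sketch that prints a text dot plot for the values above. The function name `dot_plot` is made up for this example, and it assumes small integer values so that each integer on the axis gets its own column.

```python
from collections import Counter

def dot_plot(values):
    """Return a text dot plot: one column per integer, one 'o' per observation."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    tallest = max(counts.values())
    rows = []
    # Build rows from the top of the tallest stack down to the axis.
    for level in range(tallest, 0, -1):
        cells = ("o".rjust(2) if counts[v] >= level else "  " for v in axis)
        rows.append(" ".join(cells).rstrip())
    # Final row: the axis labels themselves.
    rows.append(" ".join(str(v).rjust(2) for v in axis))
    return "\n".join(rows)

data = [3, 5, 5, 7, 8, 8, 8, 10]
print(dot_plot(data))
```

The empty columns at 4, 6, and 9 show up as gaps, and the tallest stack sits above 8.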

Histograms: Grouping into Bins

When datasets get larger, you group values into class intervals, or bins, and draw a histogram. The single most important rule of a histogram is that the area of each bar is proportional to the frequency of that interval:

$$\text{Area of bar} \propto \text{frequency of that interval}$$

This rule only becomes visible when bin widths are unequal. In that case, you must plot frequency density on the vertical axis rather than raw frequency:

$$\text{Frequency density} = \frac{\text{frequency}}{\text{interval width}}$$

When all bins have the same width, the heights are proportional to the frequencies directly and you can plot frequency without worry. But the moment widths differ, plotting raw frequency would distort the picture, making wide intervals look more important than they are.

Consider weekly production data with unequal bin widths. The first step is to build a frequency table that includes the density:

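| Interval | Width | Frequency | Frequency density |
|---|---|---|---|
| [300, 360) | 60 | 6 | 6/60 = 0.100 |
| [360, 380) | 20 | 14 | 14/20 = 0.700 |
| [380, 400) | 20 | 10 | 10/20 = 0.500 |
| [400, 420) | 20 | 4 | 4/20 = 0.200 |
| [420, 460) | 40 | 13 | 13/40 = 0.325 |
| [460, 500) | 40 | 3 | 3/40 = 0.075 |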

You then draw adjacent bars over each interval, with the height equal to the frequency density. A quick sanity check confirms the logic: the area of the $[360, 380)$ bar is $0.700 \times 20 = 14$, which matches its frequency exactly. The bars sit flush against each other with no gaps, the vertical axis is frequency density rather than frequency, and as a rule of thumb you want around six or seven bins to balance detail against clarity.
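The density column and the area check can be reproduced in a few lines of Python. This is a minimal sketch: the `(lower, upper, frequency)` tuple layout and the name `frequency_densities` are choices made for this example, not part of any library.

```python
# (lower, upper, frequency) for each class interval in the production table
classes = [(300, 360, 6), (360, 380, 14), (380, 400, 10),
           (400, 420, 4), (420, 460, 13), (460, 500, 3)]

def frequency_densities(classes):
    """Return frequency / interval width for each class."""
    return [freq / (upper - lower) for lower, upper, freq in classes]

densities = frequency_densities(classes)
for (lower, upper, freq), density in zip(classes, densities):
    # Bar area = height * width, which should recover the frequency.
    area = density * (upper - lower)
    print(f"[{lower}, {upper})  density={density:.3f}  area={area:.0f}")
```

If you were drawing the chart itself, these densities would be the bar heights and the interval endpoints would be the bar edges.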

Stem-and-Leaf Diagrams: Keeping the Raw Values

A stem-and-leaf diagram is a clever hybrid. It shows the shape of the distribution like a histogram, but unlike a histogram it preserves every original data value. The stem holds the leading digits and each leaf is the remaining digit. If you rotate the diagram 90 degrees anti-clockwise, the leaves form the same shape a histogram would.

To build one, sort the data, choose a stem unit such as the tens, then list each stem vertically with its leaves written horizontally beside it. Take the values 354, 358, 360, 362, 365, 371, 381, 393. The stem is the number of tens (the hundreds and tens digits together) and the leaf is the units digit:

| Stem (tens) | Leaves (units) |
|---|---|
| 35 | 4 8 |
| 36 | 0 2 5 |
| 37 | 1 |
| 38 | 1 |
| 39 | 3 |

The diagram labels itself. The first row reads as 354 and 358, and you can recover every original value exactly. Rotating it shows a distribution concentrated around the 360s. This ability to keep the raw numbers while still seeing the shape is what makes the stem-and-leaf diagram uniquely useful.
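The construction is mechanical enough to sketch in Python. The helper below (`stem_and_leaf` is a name invented here) assumes non-negative integers and always splits off the last digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by stem (all digits but the last) and leaf (last digit)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)

data = [354, 358, 360, 362, 365, 371, 381, 393]
diagram = stem_and_leaf(data)
for stem, leaves in diagram.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because each value is just `10 * stem + leaf`, nothing is lost: the original data can be rebuilt from the diagram.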

The Mean: The Average Everyone Knows

Now we move from pictures to numbers, starting with measures of location, which describe the centre of the data. The sample mean, written $\bar{x}$, is the arithmetic average and the most commonly used measure of location. It uses every observation, which is its strength, but that also makes it sensitive to outliers.

For raw data, the mean is the sum of all values divided by how many there are:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For grouped data, you weight each class midpoint by its frequency:

$$\bar{x} = \frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k}$$

where $x_k$ is the midpoint of class $k$ and $f_k$ is its frequency.

Take the raw dataset $32, 28, 67, 39, 19, 48, 32, 44, 37, 24$. With $n = 10$, the values sum to 370, so the mean is $370 / 10 = 37$.

For grouped data, consider trading-volume data with seven equal classes of width 10. You extend the frequency table with a column for $f_k x_k$:

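| Midpoint $x_k$ | Frequency $f_k$ | $f_k x_k$ |
|---|---|---|
| 125 | 1 | 125 |
| 135 | 4 | 540 |
| 145 | 5 | 725 |
| 155 | 6 | 930 |
| 165 | 7 | 1,155 |
| 175 | 5 | 875 |
| 185 | 1 | 185 |
| Total | 29 | 4,535 |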

The mean is then:

$$\bar{x} = \frac{4{,}535}{29} = 156.4 \text{ million shares/week}$$
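Both versions of the mean are easy to check in Python. This is a small sketch with function names (`mean`, `grouped_mean`) chosen for this example; the grouped version assumes you supply class midpoints and frequencies in matching order.

```python
def mean(values):
    """Arithmetic mean of raw observations."""
    return sum(values) / len(values)

def grouped_mean(midpoints, freqs):
    """Frequency-weighted mean of class midpoints."""
    weighted_total = sum(f * x for x, f in zip(midpoints, freqs))
    return weighted_total / sum(freqs)

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
midpoints = [125, 135, 145, 155, 165, 175, 185]
freqs = [1, 4, 5, 6, 7, 5, 1]

raw_mean = mean(raw)
volume_mean = grouped_mean(midpoints, freqs)
print(raw_mean)               # 37.0
print(round(volume_mean, 1))  # 156.4
```

The grouped figure is an approximation, because every observation in a class is treated as if it sat at the midpoint.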

The Median: The Middle Value

The sample median, written $m$, is the middle value of the ordered data: half the observations lie at or below it and half at or above. Its great advantage over the mean is that it is barely affected by outliers, which makes it the better measure of location for skewed data.

To find it, sort the data first. If there is an odd number of values, the median is the single middle one. If there is an even number, it is the average of the two middle values:

  • If $n$ is odd: $m = x_{((n+1)/2)}$
  • If $n$ is even: $m = \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2}$

For grouped data, you interpolate within the class that contains the middle observation:

$$m = \text{(lower endpoint of median class)} + \frac{\text{class width} \times (\text{target rank} - \text{cumulative frequency before class})}{\text{class frequency}}$$

Take the earlier dataset, ordered as $19, 24, 28, 32, 32, 37, 39, 44, 48, 67$. With $n = 10$, an even number, you average the 5th and 6th values, which are 32 and 37, giving a median of

$$(32 + 37)/2 = 34.5$$

For grouped weekly production data with $n = 50$, the target rank is the 25.5th observation. The cumulative frequency reaches 20 after the $[360, 380)$ interval and 30 after $[380, 400)$, so the 25.5th value falls in $[380, 400)$. Interpolating gives

$$380 + \frac{20 \times (25.5 - 20)}{10} = 380 + 11 = 391$$
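Here is a minimal Python sketch of both calculations. The function names are made up for this example, and `grouped_median` follows the convention used above of taking $(n + 1)/2$ as the target rank; some texts use $n/2$ instead, which gives a slightly different answer.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def grouped_median(classes):
    """Interpolate the median from (lower, upper, frequency) classes."""
    n = sum(freq for _, _, freq in classes)
    target = (n + 1) / 2  # target rank, as in the worked example
    cumulative = 0        # cumulative frequency before the current class
    for lower, upper, freq in classes:
        if cumulative + freq >= target:
            return lower + (upper - lower) * (target - cumulative) / freq
        cumulative += freq
    raise ValueError("classes must contain at least one observation")

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
production = [(300, 360, 6), (360, 380, 14), (380, 400, 10),
              (400, 420, 4), (420, 460, 13), (460, 500, 3)]

print(median(raw))                 # 34.5
print(grouped_median(production))  # 391.0
```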

Skewness: Reading Asymmetry

Comparing the mean and median tells you something a single number cannot: how lopsided the distribution is. This is skewness. When the mean exceeds the median, the distribution is positively skewed, with a long tail stretching to the right. When they are equal, it is symmetric. When the mean is below the median, it is negatively skewed, with a long tail to the left.

| Condition | Name | Shape |
|---|---|---|
| $\bar{x} > m$ | Positively skewed (right-skewed) | Long tail to the right |
| $\bar{x} = m$ | Symmetric | No tail |
| $\bar{x} < m$ | Negatively skewed (left-skewed) | Long tail to the left |

No calculation is needed beyond comparing the two figures. For the weekly production data, the mean is 399.72 and the median is 392.5 (the grouped interpolation earlier gave a similar 391). Since the mean exceeds the median, the long tail is pulling the mean to the right, so the distribution is positively skewed. This is the intuition behind why the median is more robust: outliers in the tail drag the mean toward them but leave the median largely untouched.
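The comparison rule in the table can be written as a tiny Python helper. This is only a sketch of that rule of thumb: `skew_direction` and its `tolerance` parameter are invented for this example, and the tolerance is there because real means and medians are rarely exactly equal.

```python
def skew_direction(mean_value, median_value, tolerance=0.0):
    """Classify skewness by comparing the mean with the median."""
    difference = mean_value - median_value
    if difference > tolerance:
        return "positive"   # long tail to the right
    if difference < -tolerance:
        return "negative"   # long tail to the left
    return "symmetric"

# Weekly production figures quoted in the text
print(skew_direction(399.72, 392.5))  # positive
```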

The Mode: The Most Common Value

The mode is simply the value that occurs most often. A distribution can have more than one mode, making it bimodal or multimodal, and for grouped data you report the modal class, which is the interval with the highest frequency.

For the raw dataset $32, 28, 67, 39, 19, 48, 32, 44, 37, 24$, the value 32 appears twice while every other value appears once, so the mode is 32. For grouped weekly production data, scanning the frequency column shows that $[360, 380)$ has a frequency of 14, the highest of any interval, so that is the modal class. Because the bin widths differ, it is worth confirming with frequency density, and $[360, 380)$ also has the highest density, 0.700.
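In Python, the standard library's `statistics.multimode` (available from Python 3.8) returns every most-common value, which handles the bimodal case naturally. The modal-class lookup below is a sketch that assumes `(lower, upper, frequency)` tuples, a layout chosen for this example.

```python
from statistics import multimode

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
modes = multimode(raw)  # list of all values tied for most frequent
print(modes)            # [32]

production = [(300, 360, 6), (360, 380, 14), (380, 400, 10),
              (400, 420, 4), (420, 460, 13), (460, 500, 3)]

# Modal class by raw frequency, and by frequency density for unequal widths.
modal_by_frequency = max(production, key=lambda c: c[2])
modal_by_density = max(production, key=lambda c: c[2] / (c[1] - c[0]))
print(modal_by_frequency[:2], modal_by_density[:2])
```

Here both criteria pick the same interval, but with unequal widths they can disagree, and the density-based answer matches the tallest bar in the histogram.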

The Range: The Simplest Measure of Spread

Having located the centre of the data, we now turn to measures of spread, which describe how dispersed it is. The simplest is the range, the gap between the largest and smallest values:

$$\text{Range} = x_{(n)} - x_{(1)}$$

For the dataset $19, 24, 28, 32, 32, 37, 39, 44, 48, 67$, the smallest value is 19 and the largest is 67, so the range is $67 - 19 = 48$. The range is trivial to compute but dangerously sensitive to outliers. If that 67 were replaced by 167, the range would leap to 148, even though only one value changed. That fragility is why we need a more robust measure.
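The sensitivity is easy to demonstrate in Python. This short sketch uses a helper named `value_range` (a name chosen here to avoid shadowing the built-in `range`).

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

data = [19, 24, 28, 32, 32, 37, 39, 44, 48, 67]
# Same data with the single largest value replaced by 167.
stretched = [167 if v == 67 else v for v in data]

print(value_range(data))       # 48
print(value_range(stretched))  # 148
```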

The Interquartile Range: Spread of the Middle

The interquartile range, or IQR, measures the spread of the central 50% of the data, discarding the top and bottom quarters entirely. Because it ignores the extremes, it is robust to outliers in a way the range is not:

$$\text{IQR} = Q_3 - Q_1$$

Here $Q_1$ is the lower quartile, the 25th percentile, and $Q_3$ is the upper quartile, the 75th percentile. For a small sample with an even number of values, a clean method is to split the ordered data at the median into a lower half and an upper half, then take the median of each half.

For the ordered dataset $19, 24, 28, 32, 32, 37, 39, 44, 48, 67$, the lower half is $19, 24, 28, 32, 32$. It has five values, so its median is the third one, giving $Q_1 = 28$. The upper half is $37, 39, 44, 48, 67$, whose median is $Q_3 = 44$. The IQR is therefore $44 - 28 = 16$. Compared with the range of 48, this is much smaller, precisely because it excludes the extreme values 19 and 67. Other textbooks and software packages use slightly different quartile conventions, so small differences in $Q_1$ and $Q_3$ are normal.
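The split-at-the-median method can be sketched in Python as follows. The function name `quartiles_by_halves` is made up for this example, and the sketch deliberately covers only the even-$n$ case described above. Library routines generally interpolate between ranks instead, so their quartiles may not match these exactly.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves (even n only)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0 or n % 2 != 0:
        raise ValueError("this sketch only covers a non-empty, even-sized sample")
    half = n // 2
    return median(ordered[:half]), median(ordered[half:])

data = [19, 24, 28, 32, 32, 37, 39, 44, 48, 67]
q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, q3, iqr)
```

Each half here has five values, so each quartile is simply the third value of its half.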

Boxplots: Five Numbers in One Picture

A boxplot turns these quartile-based measures into a visual summary. It displays five statistics at once: the minimum within the whisker bounds, $Q_1$, the median, $Q_3$, and the maximum within the whisker bounds, with any outliers plotted as individual points.

To construct one, draw a box from $Q_1$ to $Q_3$ and mark the median with a line inside it. Extend whiskers out to the furthest data points that lie within $1.5 \times \text{IQR}$ of each quartile. Any points beyond the whiskers are plotted individually as outliers.

Reading a boxplot becomes second nature once you know what each feature means. The width of the box is the IQR, showing the spread of the central half. The position of the median line within the box hints at symmetry or skewness. The whisker lengths show how far the tails reach, and the dots beyond them flag extreme observations. A long lower tail or outliers below suggests negative skewness; a long upper tail or outliers above suggests positive skewness.

Suppose a boxplot shows $Q_1 \approx 63$, a median of about 74, $Q_3 \approx 77$, and many outlier dots below the lower whisker. The upper part of the box spans only $77 - 74 = 3$, while the lower part spans $74 - 63 = 11$, so the spread below the median is far larger. Combined with the outliers sitting below, both signals point clearly to negative, or left, skewness.
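The construction rule described earlier (whiskers reaching the furthest points within $1.5 \times \text{IQR}$ of each quartile, everything beyond plotted as an outlier) can be sketched in Python. `boxplot_stats` is a hypothetical helper written for this example; it assumes at least two values and takes the quartiles as medians of the lower and upper halves, which is only one of several conventions.

```python
from statistics import median

def boxplot_stats(values):
    """Quartiles, 1.5*IQR fences, whisker ends, and outliers for a sample."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])  # middle value is skipped if n is odd
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median(ordered), "q3": q3,
        "fences": (low_fence, high_fence),
        "whiskers": (inside[0], inside[-1]),  # furthest points inside the fences
        "outliers": outliers,
    }

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
stretched = [167 if v == 67 else v for v in raw]  # one extreme value added

print(boxplot_stats(raw))
print(boxplot_stats(stretched))
```

With the original data nothing falls outside the fences, so the whiskers end at 19 and 67. Replacing 67 with 167 leaves the box unchanged but turns 167 into an outlier, and the upper whisker pulls back to 48.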

Variance and Standard Deviation: The Workhorses of Spread

The most important measures of spread for formal statistics are the variance and standard deviation. Both quantify how far the data spreads around the mean, they use every observation, and they underpin most of the statistical inference you will go on to do.

The starting point is the corrected sum of squares, which has a convenient computational form:

$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}x_i^2 - n\bar{x}^2$$

From this, the sample variance and standard deviation follow:

$$s^2 = \frac{S_{xx}}{n-1}, \qquad s = \sqrt{s^2}$$

The detail that catches people out is the division by $n - 1$ rather than $n$. For sample data, dividing by $n - 1$ corrects a bias that would otherwise make the variance too small. For grouped data with large $n$, where $n - 1 \approx n$, the variance can be computed directly from the frequency table:

$$s^2 = \frac{\sum_{k=1}^{K} f_k x_k^2}{\sum_{k=1}^{K} f_k} - \left(\frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k}\right)^2$$

Take the raw dataset $32, 28, 67, 39, 19, 48, 32, 44, 37, 24$, which has a mean of 37. First find the sum of the squared values:

$$32^2 + 28^2 + 67^2 + 39^2 + 19^2 + 48^2 + 32^2 + 44^2 + 37^2 + 24^2 = 15{,}388$$

Then the corrected sum of squares is $S_{xx} = 15{,}388 - 10 \times 37^2 = 15{,}388 - 13{,}690 = 1{,}698$. The variance is $s^2 = 1{,}698 / 9 = 188.7$, and the standard deviation is

$$s = \sqrt{188.7} = 13.74$$
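The same steps in Python, as a minimal sketch. It computes $S_{xx}$ both from the definition and from the computational form, then cross-checks the result against the standard library's `statistics.variance`, which also divides by $n - 1$. The variable names are chosen for this example.

```python
import math
import statistics

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
n = len(raw)
mean = sum(raw) / n

# Corrected sum of squares, two equivalent ways.
s_xx_definition = sum((x - mean) ** 2 for x in raw)
s_xx_shortcut = sum(x * x for x in raw) - n * mean ** 2

variance = s_xx_shortcut / (n - 1)  # sample variance, divisor n - 1
std_dev = math.sqrt(variance)

print(s_xx_definition, s_xx_shortcut)          # 1698.0 1698.0
print(round(variance, 1), round(std_dev, 2))   # 188.7 13.74
print(statistics.variance(raw))                # library cross-check
```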

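For grouped trading-volume data with $n = 29$, you extend the table with a column for $f_k x_k^2$:

| $x_k$ | $f_k$ | $f_k x_k$ | $f_k x_k^2$ |
|---|---|---|---|
| 125 | 1 | 125 | 15,625 |
| 135 | 4 | 540 | 72,900 |
| 145 | 5 | 725 | 105,125 |
| 155 | 6 | 930 | 144,150 |
| 165 | 7 | 1,155 | 190,575 |
| 175 | 5 | 875 | 153,125 |
| 185 | 1 | 185 | 34,225 |
| Total | 29 | 4,535 | 715,725 |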

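The mean is $4{,}535 / 29 \approx 156.38$, or 156.4 to one decimal place. Because the variance formula subtracts two large, nearly equal numbers, carry the unrounded mean through the calculation:

$$s^2 = \frac{715{,}725}{29} - \left(\frac{4{,}535}{29}\right)^2 = 24{,}680.17 - 24{,}454.49 = 225.7$$

and the standard deviation is $s = \sqrt{225.7} = 15.0$ million shares per week. Rounding the mean to 156.4 before squaring would shift the variance to about 219.2, so early rounding is worth avoiding here. The standard deviation is reported in the original units, which is exactly why it is more interpretable than the variance, whose units are squared and therefore harder to reason about.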
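A minimal Python sketch of the grouped formula follows, with a function name (`grouped_variance`) invented for this example. It keeps the mean at full precision. Since the formula subtracts two large, nearly equal numbers, a hand calculation that rounds the mean before squaring can land several units away from the value computed here.

```python
import math

def grouped_variance(midpoints, freqs):
    """Mean of squares minus square of the mean, weighted by class frequency."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    mean_of_squares = sum(f * x * x for x, f in zip(midpoints, freqs)) / n
    return mean_of_squares - mean ** 2

midpoints = [125, 135, 145, 155, 165, 175, 185]
freqs = [1, 4, 5, 6, 7, 5, 1]

volume_variance = grouped_variance(midpoints, freqs)
volume_sd = math.sqrt(volume_variance)
print(round(volume_variance, 1), round(volume_sd, 1))  # 225.7 15.0
```

Note that this version divides by $n$, matching the grouped formula in the text, rather than by $n - 1$.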

Cumulative Frequency Diagrams: Reading Quartiles off a Curve

The final tool is the cumulative frequency diagram, which plots the running total of frequencies against the upper boundary of each class. Its great use is that you can read the median and quartiles straight off the graph. A close relative, the cumulative relative frequency diagram or ogive, plots cumulative percentages instead, using the relative frequency of each class:

$$\text{relative frequency}_k = \frac{f_k}{n}$$

Cumulating these gives percentages that build up to 100%.

For the trading-volume data with $n = 29$, you build a cumulative relative frequency table:

| Interval | $f_k$ | Relative freq (%) | Cumulative rel freq (%) |
|---|---|---|---|
| $[120,130)$ | 1 | 3.45 | 3.4 |
| $[130,140)$ | 4 | 13.79 | 17.2 |
| $[140,150)$ | 5 | 17.24 | 34.5 |
| $[150,160)$ | 6 | 20.69 | 55.2 |
| $[160,170)$ | 7 | 24.14 | 79.3 |
| $[170,180)$ | 5 | 17.24 | 96.6 |
| $[180,190)$ | 1 | 3.45 | 100.0 |

You then plot a point at each upper class boundary, giving $(130, 3.4)$, $(140, 17.2)$, and so on up to $(190, 100)$. Start the curve at $(120, 0)$, the lower boundary of the first class, and connect the points with straight lines. To read the median, draw a horizontal line across at 50% and read off the corresponding value on the horizontal axis, which lands at roughly 157 million shares here. The same trick at 25% gives $Q_1$ and at 75% gives $Q_3$. This turns the diagram into a quick graphical calculator for any percentile you need.
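Reading a value off the straight-line curve is just linear interpolation, so it can be mimicked in Python. This is a sketch under stated assumptions: the function names `cumulative_points` and `read_percentile` are invented here, classes are `(lower, upper, frequency)` tuples, and the curve is assumed to start at 0% at the lower boundary of the first class.

```python
def cumulative_points(classes):
    """(boundary, cumulative %) points, starting at 0% at the first lower boundary."""
    n = sum(freq for _, _, freq in classes)
    points = [(classes[0][0], 0.0)]
    running = 0
    for _, upper, freq in classes:
        running += freq
        points.append((upper, 100 * running / n))
    return points

def read_percentile(points, pct):
    """Interpolate along the line segments to find the value at cumulative % pct."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= pct <= y1 and y1 > y0:
            return x0 + (x1 - x0) * (pct - y0) / (y1 - y0)
    raise ValueError("pct must lie between 0 and 100")

volume = [(120, 130, 1), (130, 140, 4), (140, 150, 5), (150, 160, 6),
          (160, 170, 7), (170, 180, 5), (180, 190, 1)]
points = cumulative_points(volume)

median_estimate = read_percentile(points, 50)
print(round(median_estimate, 1))  # 157.5
print(read_percentile(points, 25), read_percentile(points, 75))
```

The interpolated median of about 157.5 agrees with the graphical reading, which is only as precise as the scale of the plot allows.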

Bringing It Together

Descriptive statistics and visualisation are two views of the same thing. The charts, dot plots, histograms, stem-and-leaf diagrams, and boxplots, let you see the shape of a distribution, where it centres, how it spreads, and whether it leans one way. The numbers, mean, median, mode, range, IQR, and standard deviation, pin that shape down precisely. The two reinforce each other: a histogram suggests skewness, and comparing the mean to the median confirms it; a boxplot hints at spread, and the IQR quantifies it. Master both, and you can understand any dataset before you ever try to model it.

See you soon.
