Before you can model data, test a hypothesis, or build a dashboard, you need to understand what the data actually looks like. That understanding comes from two complementary toolkits: visualisations that reveal the shape of a distribution at a glance, and descriptive statistics that summarise it in numbers. This guide walks through both, from classifying variables to calculating spread, with worked examples throughout.
Knowing What Kind of Variable You Have
Everything starts with the type of variable you are dealing with, because it determines which charts and statistics make sense. Variables fall into two broad families. Measurable variables have a genuine numerical scale with a natural ordering. Categorical variables do not; they simply sort observations into groups.
Within measurable variables, there is a further split. Discrete variables are counts in whole numbers, like the number of passengers on a flight. Continuous variables can be measured to any precision, like height, weight, or time. Categorical variables split too. Ordinal categories have a natural order, like a satisfaction rating of poor, fair, or good. Nominal categories have no order at all, like nationality or gender.
| Type | Sub-type | Description | Examples |
|---|---|---|---|
| Measurable | Discrete | Count in whole numbers | Passengers on a flight |
| Measurable | Continuous | Measured to any precision | Height, weight, time |
| Categorical | Ordinal | Categories with a natural order | Satisfaction: poor / fair / good |
| Categorical | Nominal | Categories with no natural order | Gender, nationality, religion |
To see how this works in practice, consider four variables. The number of phone calls received today is something you count in whole numbers, $0, 1, 2, \ldots$, so it is discrete. Time spent on hold is measured in seconds to any decimal, so it is continuous. A customer satisfaction rating of dissatisfied, neutral, or satisfied has ordered categories but no numeric scale, so it is ordinal categorical. Country of birth has no ordering between values, so it is nominal categorical. Getting this classification right is the foundation everything else rests on.
Dot Plots: The Simplest View
For small datasets, a dot plot is often the quickest way to see what is going on. You place one dot per observation above a horizontal axis, stacking dots vertically where values repeat. The result gives an immediate sense of where values cluster, where the gaps are, and whether anything looks extreme.
There is no formula. You draw a horizontal axis spanning the range of the data, then place a dot at each observation’s value. Take the values $3, 5, 5, 7, 8, 8, 8, 10$. The axis runs from 3 to 10. You place one dot at 3, two at 5, one at 7, three at 8, and one at 10. The stack of three dots at 8 immediately marks it as the most common value, and the gap between 3 and 5 stands out as a region worth a second look.
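For readers who want to reproduce the picture without drawing it by hand, here is a minimal text-based sketch in Python. The helper name `dot_plot` is illustrative, and the sketch assumes the data are non-negative integers so that each axis position is one whole number.

```python
from collections import Counter

def dot_plot(values):
    """Return a text dot plot: one column per integer, one 'o' per observation."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    width = len(str(max(values))) + 1  # column width so labels line up
    rows = []
    # Build the rows from the tallest stack down to the axis.
    for level in range(max(counts.values()), 0, -1):
        cells = ("o" if counts[v] >= level else " " for v in axis)
        rows.append("".join(c.rjust(width) for c in cells).rstrip())
    rows.append("".join(str(v).rjust(width) for v in axis))
    return "\n".join(rows)

plot = dot_plot([3, 5, 5, 7, 8, 8, 8, 10])
print(plot)
```

The stack of three dots above 8 appears as the tallest column, and the empty columns at 4, 6, and 9 show where no observations fall.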
Histograms: Grouping into Bins
When datasets get larger, you group values into class intervals, or bins, and draw a histogram. The single most important rule of a histogram is that the area of each bar is proportional to the frequency of that interval:
$$\text{area of bar} = \text{height} \times \text{width} \propto \text{frequency}$$
This rule only becomes visible when bin widths are unequal. In that case, you must plot frequency density on the vertical axis rather than raw frequency:
$$\text{frequency density} = \frac{\text{frequency}}{\text{interval width}}$$
When all bins have the same width, the heights are proportional to the frequencies directly and you can plot frequency without worry. But the moment widths differ, plotting raw frequency would distort the picture, making wide intervals look more important than they are.
Consider weekly production data with unequal bin widths. The first step is to build a frequency table that includes the density:
| Interval | Width | Frequency | Frequency density |
|---|---|---|---|
| [300, 360) | 60 | 6 | 6/60 = 0.100 |
| [360, 380) | 20 | 14 | 14/20 = 0.700 |
| [380, 400) | 20 | 10 | 10/20 = 0.500 |
| [400, 420) | 20 | 4 | 4/20 = 0.200 |
| [420, 460) | 40 | 13 | 13/40 = 0.325 |
| [460, 500) | 40 | 3 | 3/40 = 0.075 |
You then draw adjacent bars over each interval, with the height equal to the frequency density. A quick sanity check confirms the logic: the area of the $[360, 380)$ bar is $0.700 \times 20 = 14$, which matches its frequency exactly. The bars sit flush against each other with no gaps, the vertical axis is frequency density rather than frequency, and as a rule of thumb you want around six or seven bins to balance detail against clarity.
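The density column and the area check can be reproduced with a short Python sketch. The tuple layout `(lower, upper, frequency)` and the function name `frequency_density` are illustrative choices, not part of any library.

```python
# (lower bound, upper bound, frequency) for the weekly production table
bins = [(300, 360, 6), (360, 380, 14), (380, 400, 10),
        (400, 420, 4), (420, 460, 13), (460, 500, 3)]

def frequency_density(lower, upper, freq):
    """Height of the histogram bar: frequency divided by interval width."""
    return freq / (upper - lower)

densities = [frequency_density(*b) for b in bins]
# Area of each bar = density * width, which should recover the frequency.
areas = [d * (hi - lo) for d, (lo, hi, _) in zip(densities, bins)]

for (lo, hi, f), d in zip(bins, densities):
    print(f"[{lo}, {hi})  width={hi - lo:<3} freq={f:<3} density={d:.3f}")
```

Because each area equals its frequency, the total area under the histogram equals the sample size of 50.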
Stem-and-Leaf Diagrams: Keeping the Raw Values
A stem-and-leaf diagram is a clever hybrid. It shows the shape of the distribution like a histogram, but unlike a histogram it preserves every original data value. The stem holds the leading digits and each leaf is the remaining digit. If you rotate the diagram 90 degrees anti-clockwise, the leaves form the same shape a histogram would.
To build one, sort the data, choose a stem unit such as the tens digit, then list each stem vertically with its leaves written horizontally beside it. Take the values 354, 358, 360, 362, 365, 371, 381, 393 with the tens as the stem:
| Stem (tens) | Leaves (units) |
|---|---|
| 35 | 4 8 |
| 36 | 0 2 5 |
| 37 | 1 |
| 38 | 1 |
| 39 | 3 |
The diagram labels itself. The first row reads as 354 and 358, and you can recover every original value exactly. Rotating it shows a distribution concentrated around the 360s. This ability to keep the raw numbers while still seeing the shape is what makes the stem-and-leaf diagram uniquely useful.
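The construction is mechanical enough to sketch in a few lines of Python. The function name `stem_and_leaf` is illustrative, and the sketch assumes non-negative integers with the final digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by their leading digits (stem), keeping the last digit (leaf)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return dict(groups)

diagram = stem_and_leaf([354, 358, 360, 362, 365, 371, 381, 393])
for stem, leaves in diagram.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")

# Every original value can be rebuilt from its stem and leaf.
recovered = [stem * 10 + leaf for stem, leaves in diagram.items() for leaf in leaves]
```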
The Mean: The Average Everyone Knows
Now we move from pictures to numbers, starting with measures of location, which describe the centre of the data. The sample mean, written $\bar{x}$, is the arithmetic average and the most commonly used measure of location. It uses every observation, which is its strength, but that also makes it sensitive to outliers.
For raw data, the mean is the sum of all values divided by how many there are:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
For grouped data, you weight each class midpoint by its frequency:
$$\bar{x} = \frac{\sum_{k} f_k x_k}{\sum_{k} f_k}$$
where $x_k$ is the midpoint of class $k$ and $f_k$ is its frequency.
Take the raw dataset $32, 28, 67, 39, 19, 48, 32, 44, 37, 24$. With $n = 10$, the values sum to 370, so the mean is $370 / 10 = 37$.
For grouped data, consider trading-volume data with seven equal classes of width 10. You extend the frequency table with a column for $f_k x_k$:
| Midpoint $x_k$ | Frequency $f_k$ | $f_k x_k$ |
|---|---|---|
| 125 | 1 | 125 |
| 135 | 4 | 540 |
| 145 | 5 | 725 |
| 155 | 6 | 930 |
| 165 | 7 | 1,155 |
| 175 | 5 | 875 |
| 185 | 1 | 185 |
| Total | 29 | 4,535 |
The mean is then:
$$\bar{x} = \frac{4{,}535}{29} \approx 156.4$$
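Both versions of the mean can be checked with a short Python sketch. `statistics.mean` is from the standard library, while `grouped_mean` is an illustrative helper written for this example.

```python
from statistics import mean

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
raw_mean = mean(raw)  # sum of values divided by n

# Class midpoints and frequencies from the trading-volume table
midpoints = [125, 135, 145, 155, 165, 175, 185]
freqs = [1, 4, 5, 6, 7, 5, 1]

def grouped_mean(midpoints, freqs):
    """Frequency-weighted average of the class midpoints."""
    return sum(f * x for x, f in zip(midpoints, freqs)) / sum(freqs)

g_mean = grouped_mean(midpoints, freqs)
print(raw_mean, round(g_mean, 1))
```

The grouped figure is an approximation, since every observation in a class is treated as if it sat at the midpoint.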
The Median: The Middle Value
The sample median, written $m$, is the middle value of the ordered data. Half the observations lie at or below it and half at or above it. Its great advantage over the mean is that it is barely affected by outliers, which makes it the better measure of location for skewed data.
To find it, sort the data first. If there is an odd number of values, the median is the single middle one. If there is an even number, it is the average of the two middle values:
- If $n$ is odd: $m = x_{((n+1)/2)}$
- If $n$ is even: $m = \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2}$
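For grouped data, you interpolate within the class that contains the middle observation:
$$m \approx L + \frac{r - F}{f} \times w$$
where $r$ is the target rank, $L$ is the lower boundary of the class containing it, $F$ is the cumulative frequency below that class, $f$ is the frequency of the class, and $w$ is its width.
Take the earlier dataset, ordered as $19, 24, 28, 32, 32, 37, 39, 44, 48, 67$. With $n = 10$, an even number, you average the 5th and 6th values, which are 32 and 37, giving a median of
$$m = \frac{32 + 37}{2} = 34.5$$
For grouped weekly production data with $n = 50$, the target rank is the 25.5th observation. The cumulative frequency reaches 20 after the $[360, 380)$ interval and 30 after $[380, 400)$, so the 25.5th value falls in $[380, 400)$. Interpolating gives
$$m \approx 380 + \frac{25.5 - 20}{10} \times 20 = 391$$
This is an estimate from the grouped table; the median of the underlying raw values can differ slightly.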
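The raw and grouped median calculations can be sketched in Python as follows. `statistics.median` is from the standard library; `grouped_median` is an illustrative helper that assumes observations are spread evenly within each class, which is the usual assumption behind linear interpolation.

```python
from statistics import median

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
raw_median = median(raw)  # sorts internally, averages the two middle values

def grouped_median(bins, rank):
    """Interpolate the value at a given rank.

    bins: ordered list of (lower, upper, frequency); rank: target position.
    """
    cumulative = 0
    for lower, upper, freq in bins:
        if cumulative + freq >= rank:
            # Fraction of the way through this class, scaled by its width.
            return lower + (rank - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("rank exceeds the total frequency")

production = [(300, 360, 6), (360, 380, 14), (380, 400, 10),
              (400, 420, 4), (420, 460, 13), (460, 500, 3)]
n = sum(f for _, _, f in production)
estimate = grouped_median(production, (n + 1) / 2)
print(raw_median, estimate)
```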
Skewness: Reading Asymmetry
Comparing the mean and median tells you something a single number cannot: how lopsided the distribution is. This is skewness. When the mean exceeds the median, the distribution is positively skewed, with a long tail stretching to the right. When they are equal, it is symmetric. When the mean is below the median, it is negatively skewed, with a long tail to the left.
| Condition | Name | Shape |
|---|---|---|
| $\bar{x} > m$ | Positively skewed (right-skewed) | Long tail to the right |
| $\bar{x} = m$ | Symmetric | No tail |
| $\bar{x} < m$ | Negatively skewed (left-skewed) | Long tail to the left |
No calculation is needed beyond comparing the two figures. For the weekly production data, the mean is 399.72 and the median is 392.5. Since the mean exceeds the median, the long tail is pulling the mean to the right, so the distribution is positively skewed. This is the intuition behind why the median is more robust: outliers in the tail drag the mean toward them but leave the median largely untouched.
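As a rough illustration, the comparison can be wrapped in a small Python function. The name `skew_direction` and the tolerance `tol` are assumptions made for this sketch, and the mean-versus-median rule is a rule of thumb rather than a formal test.

```python
from statistics import mean, median

def skew_direction(values, tol=1e-9):
    """Classify skewness by comparing the mean with the median."""
    diff = mean(values) - median(values)
    if diff > tol:
        return "positive"   # mean pulled above the median by a right tail
    if diff < -tol:
        return "negative"   # mean pulled below the median by a left tail
    return "symmetric"

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
direction = skew_direction(raw)
print(direction)
```

For this dataset the mean of 37 exceeds the median of 34.5, largely because of the single high value 67.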
The Mode: The Most Common Value
The mode is simply the value that occurs most often. A distribution can have more than one mode, making it bimodal or multimodal, and for grouped data you report the modal class, which is the interval with the highest frequency.
For the raw dataset $32, 28, 67, 39, 19, 48, 32, 44, 37, 24$, the value 32 appears twice while every other value appears once, so the mode is 32. For grouped weekly production data, scanning the frequency column shows that $[360, 380)$ has a frequency of 14, the highest of any interval, so that is the modal class.
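A minimal Python sketch of both cases follows. `statistics.multimode` (Python 3.8 and later) returns every value tied for most frequent, which handles bimodal data. The modal class is picked here by raw frequency, as in the text; with unequal bin widths some authors compare frequency densities instead, and in this table both approaches select the same interval.

```python
from statistics import multimode

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
modes = multimode(raw)  # list of all most-common values

# (lower bound, upper bound, frequency) for the weekly production table
production = [(300, 360, 6), (360, 380, 14), (380, 400, 10),
              (400, 420, 4), (420, 460, 13), (460, 500, 3)]
modal_class = max(production, key=lambda b: b[2])  # highest frequency
print(modes, modal_class[:2])
```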
The Range: The Simplest Measure of Spread
Having located the centre of the data, we now turn to measures of spread, which describe how dispersed it is. The simplest is the range, the gap between the largest and smallest values:
$$\text{range} = x_{(n)} - x_{(1)}$$
where $x_{(1)}$ and $x_{(n)}$ are the smallest and largest values in the ordered data.
For the dataset $19, 24, 28, 32, 32, 37, 39, 44, 48, 67$, the smallest value is 19 and the largest is 67, so the range is $67 - 19 = 48$. The range is trivial to compute but dangerously sensitive to outliers. If that 67 were replaced by 167, the range would leap to 148, even though only one value changed. That fragility is why we need a more robust measure.
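The sensitivity to a single value is easy to demonstrate in Python. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

data = [19, 24, 28, 32, 32, 37, 39, 44, 48, 67]
# Same data with the largest value replaced by an extreme one.
with_outlier = [167 if v == 67 else v for v in data]

print(value_range(data), value_range(with_outlier))
```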
The Interquartile Range: Spread of the Middle
The interquartile range, or IQR, measures the spread of the central 50% of the data, discarding the top and bottom quarters entirely. Because it ignores the extremes, it is robust to outliers in a way the range is not:
$$\text{IQR} = Q_3 - Q_1$$
Here $Q_1$ is the lower quartile, the 25th percentile, and $Q_3$ is the upper quartile, the 75th percentile. For a small sample with an even number of values, a clean method is to split the ordered data at the median into a lower half and an upper half, then take the median of each half.
For the ordered dataset $19, 24, 28, 32, 32, 37, 39, 44, 48, 67$, the lower half is $19, 24, 28, 32, 32$. It has five values, so its median is the third one, $Q_1 = 28$. The upper half is $37, 39, 44, 48, 67$, whose median is likewise the third value, $Q_3 = 44$. The IQR is therefore $44 - 28 = 16$. Compared with the range of 48, this is much smaller, precisely because it ignores the values in the outer quarters, including the extremes 19 and 67.
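Applying the split-at-the-median method literally can be sketched in Python as below. The helper `quartiles_by_halves` is illustrative and assumes at least two observations; for an odd sample size it leaves the median out of both halves, which is one common convention. Library routines such as `statistics.quantiles` interpolate between values instead, so they can return slightly different quartiles for the same data.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # first half
    upper = ordered[(n + 1) // 2 :]  # second half (skips the median if n is odd)
    return median(lower), median(upper)

data = [19, 24, 28, 32, 32, 37, 39, 44, 48, 67]
q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, q3, iqr)
```

Each half of this dataset holds five values, so each quartile is simply the third value of its half.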
Boxplots: Five Numbers in One Picture
A boxplot turns these quartile-based measures into a visual summary. It displays five statistics at once: the minimum within the whisker bounds, $Q_1$, the median, $Q_3$, and the maximum within the whisker bounds, with any outliers plotted as individual points.
To construct one, draw a box from $Q_1$ to $Q_3$ and mark the median with a line inside it. Extend whiskers out to the furthest data points that lie within $1.5 \times \text{IQR}$ of each quartile. Any points beyond the whiskers are plotted individually as outliers.
Reading a boxplot becomes second nature once you know what each feature means. The width of the box is the IQR, showing the spread of the central half. The position of the median line within the box hints at symmetry or skewness. The whisker lengths show how far the tails reach, and the dots beyond them flag extreme observations. A long lower tail or outliers below suggests negative skewness; a long upper tail or outliers above suggests positive skewness.
Suppose a boxplot shows $Q_1 \approx 63$, a median of about 74, $Q_3 \approx 77$, and many outlier dots below the lower whisker. The upper part of the box spans only $77 - 74 = 3$, while the lower part spans $74 - 63 = 11$, so the spread below the median is far larger. Combined with the outliers sitting below, both signals point clearly to negative, or left, skewness.
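The numbers behind a boxplot can be computed without any plotting library. The sketch below is a hedged illustration: `boxplot_stats` is a hypothetical helper, it takes the quartiles as medians of the lower and upper halves of the ordered data, and it uses the $1.5 \times \text{IQR}$ rule described above. Plotting packages may use a different quartile convention and so draw slightly different boxes.

```python
from statistics import median

def boxplot_stats(values):
    """Compute whisker ends, quartiles, median, and outliers for a boxplot."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[(n + 1) // 2 :])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the furthest data points still inside the fences.
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": median(ordered),
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }

data = [19, 24, 28, 32, 32, 37, 39, 44, 48, 167]
stats = boxplot_stats(data)
print(stats)
```

With 167 in place of 67, the upper fence sits at $44 + 1.5 \times 16 = 68$, so the upper whisker stops at 48 and 167 is drawn as a separate outlier point.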
Variance and Standard Deviation: The Workhorses of Spread
The most important measures of spread for formal statistics are the variance and standard deviation. Both quantify how far the data spreads around the mean, they use every observation, and they underpin most of the statistical inference you will go on to do.
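The starting point is the corrected sum of squares, which has a convenient computational form:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$
From this, the sample variance and standard deviation follow:
$$s^2 = \frac{S_{xx}}{n - 1}, \qquad s = \sqrt{s^2}$$
The detail that catches people out is the division by $n - 1$ rather than $n$. For sample data, dividing by $n - 1$ corrects a bias that would otherwise make the variance estimate too small on average. For grouped data with large $n$, where $n - 1 \approx n$, the variance can be computed directly from the frequency table:
$$s^2 \approx \frac{\sum_{k} f_k x_k^2}{n} - \bar{x}^2, \qquad n = \sum_{k} f_k$$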
Take the raw dataset $32, 28, 67, 39, 19, 48, 32, 44, 37, 24$, which has a mean of 37. First find the sum of the squared values:
$$\sum_{i=1}^{n} x_i^2 = 32^2 + 28^2 + \cdots + 24^2 = 15{,}388$$
Then the corrected sum of squares is $S_{xx} = 15{,}388 - 10 \times 37^2 = 15{,}388 - 13{,}690 = 1{,}698$. The variance is $s^2 = 1{,}698 / 9 \approx 188.7$, and the standard deviation is
$$s = \sqrt{188.7} \approx 13.7$$
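The raw-data calculation can be checked with a short Python sketch that follows the computational form step by step and compares the result with the standard library's `statistics.variance`, which also divides by $n - 1$. The helper name `sample_variance` is illustrative.

```python
from math import sqrt
from statistics import variance

def sample_variance(values):
    """Sample variance via S_xx = sum(x^2) - n * mean^2, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    s_xx = sum(x * x for x in values) - n * mean ** 2
    return s_xx / (n - 1)

raw = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
s2 = sample_variance(raw)
s = sqrt(s2)
print(round(s2, 1), round(s, 1))
```

The computational form is convenient by hand, but it subtracts two large numbers, so for data with very large values a library routine is usually the safer choice.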
For grouped trading-volume data with $n = 29$, you extend the table with a column for $f_k x_k^2$:
| $x_k$ | $f_k$ | $f_k x_k$ | $f_k x_k^2$ |
|---|---|---|---|
| 125 | 1 | 125 | 15,625 |
| 135 | 4 | 540 | 72,900 |
| 145 | 5 | 725 | 105,125 |
| 155 | 6 | 930 | 144,150 |
| 165 | 7 | 1,155 | 190,575 |
| 175 | 5 | 875 | 153,125 |
| 185 | 1 | 185 | 34,225 |
| Total | 29 | 4,535 | 715,725 |
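The mean is $4{,}535 / 29 \approx 156.38$. The variance is then:
$$s^2 \approx \frac{715{,}725}{29} - \left(\frac{4{,}535}{29}\right)^2 \approx 24{,}680.17 - 24{,}454.49 = 225.68$$
and the standard deviation is $s = \sqrt{225.68} \approx 15.0$ million shares per week. Keep the mean unrounded in this step: squaring the rounded value 156.4 instead shifts the variance to about 219.2, because the formula subtracts two large, nearly equal numbers. The standard deviation is reported in the original units, which is exactly why it is more interpretable than the variance, whose units are squared and therefore harder to reason about.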
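The grouped calculation can be sketched in Python straight from the frequency table. `grouped_variance` is an illustrative helper that divides by $n$, matching the large-sample approximation above. It keeps the mean at full precision, so the result can differ from a hand calculation that rounds the mean before squaring.

```python
from math import sqrt

# Class midpoints and frequencies from the trading-volume table
midpoints = [125, 135, 145, 155, 165, 175, 185]
freqs = [1, 4, 5, 6, 7, 5, 1]

def grouped_variance(midpoints, freqs):
    """Approximate variance: mean of squares minus square of the mean."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    mean_of_squares = sum(f * x * x for x, f in zip(midpoints, freqs)) / n
    return mean_of_squares - mean ** 2

g_var = grouped_variance(midpoints, freqs)
g_sd = sqrt(g_var)
print(round(g_var, 1), round(g_sd, 1))
```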
Cumulative Frequency Diagrams: Reading Quartiles off a Curve
The final tool is the cumulative frequency diagram, which plots the running total of frequencies against the upper boundary of each class. Its great use is that you can read the median and quartiles straight off the graph. A close relative, the cumulative relative frequency diagram or ogive, plots cumulative percentages instead, using the relative frequency of each class:
$$\text{relative frequency of class } k = \frac{f_k}{n} \times 100\%$$
Cumulating these gives percentages that build up to 100%.
For the trading-volume data with $n = 29$, you build a cumulative relative frequency table:
| Interval | $f_k$ | Relative freq (%) | Cumulative rel freq (%) |
|---|---|---|---|
| $[120,130)$ | 1 | 3.45 | 3.5 |
| $[130,140)$ | 4 | 13.79 | 17.2 |
| $[140,150)$ | 5 | 17.24 | 34.5 |
| $[150,160)$ | 6 | 20.69 | 55.2 |
| $[160,170)$ | 7 | 24.14 | 79.3 |
| $[170,180)$ | 5 | 17.24 | 96.6 |
| $[180,190)$ | 1 | 3.45 | 100.0 |
You then plot a point at each upper class boundary, giving $(130, 3.5)$, $(140, 17.2)$, and so on up to $(190, 100)$, and connect them with straight lines for grouped data. To read the median, draw a horizontal line across at 50% and read off the corresponding value on the horizontal axis, which lands at roughly 157 million here. The same trick at 25% gives $Q_1$ and at 75% gives $Q_3$. This turns the diagram into a quick graphical calculator for any percentile you need.
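Reading a percentile off the diagram is linear interpolation between plotted points, which can be sketched in Python. The helpers `ogive_points` and `read_percentile` are illustrative, and the sketch adds a starting point at the lowest class boundary with 0%, an assumption needed so the first class can be interpolated too.

```python
from itertools import accumulate

# Class boundaries and frequencies from the trading-volume table
bounds = [120, 130, 140, 150, 160, 170, 180, 190]
freqs = [1, 4, 5, 6, 7, 5, 1]

def ogive_points(bounds, freqs):
    """(boundary, cumulative %) pairs, starting at 0% on the lowest boundary."""
    n = sum(freqs)
    cumulative = [0, *accumulate(freqs)]
    return [(b, 100 * c / n) for b, c in zip(bounds, cumulative)]

def read_percentile(points, pct):
    """Interpolate along the straight-line segments to find the value at pct."""
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p0 <= pct <= p1 and p1 > p0:
            return x0 + (pct - p0) / (p1 - p0) * (x1 - x0)
    raise ValueError("percentile must lie between 0 and 100")

points = ogive_points(bounds, freqs)
q1, med, q3 = (read_percentile(points, p) for p in (25, 50, 75))
print(round(q1, 1), round(med, 1), round(q3, 1))
```

The interpolated median comes out at about 157.5, in line with the rough reading from the graph.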
Bringing It Together
Descriptive statistics and visualisation are two views of the same thing. The charts (dot plots, histograms, stem-and-leaf diagrams, and boxplots) let you see the shape of a distribution, where it centres, how it spreads, and whether it leans one way. The numbers (mean, median, mode, range, IQR, and standard deviation) pin that shape down precisely. The two reinforce each other: a histogram suggests skewness, and comparing the mean to the median confirms it; a boxplot hints at spread, and the IQR quantifies it. Master both, and you can understand any dataset before you ever try to model it.
See you soon.