Correlation and Linear Regression

Correlation measures the strength of a linear relationship; regression writes its equation. Learn scatterplots, the correlation coefficient, Spearman rank, least-squares fitting, and prediction, plus why correlation never proves causation and extrapolation is dangerous.

Correlation and regression carry the idea of association, first met for categorical variables through the chi-squared test, into the world of measurable variables. Where chi-squared could only tell you whether a relationship existed, these tools let you measure how strong it is and write down an equation for it, which is what makes prediction and decision-making possible. The division of labour is clean. Correlation measures the strength of a linear relationship, and regression provides a way of representing that relationship. This article works through both, from the first scatterplot to the cautions about extrapolation that separate a careful analyst from a careless one. Throughout, we deal with two variables at a time, though the same theory extends to many.

Start by Looking

The first thing to do with any paired data is plot it. A scatterplot of the points takes no calculation and reveals at a glance whether a relationship looks present, what shape it takes, and whether any points sit oddly apart from the rest. Consider data on twelve city areas recording the number of unemployed people against the monthly number of reported offences. Plotting offences against unemployment shows a positive, roughly linear pattern: as unemployment rises, offences tend to rise too. The points do not fall exactly on a straight line, so the relationship is approximate rather than perfect, which is precisely the situation correlation and regression are built to handle.

A scatterplot can show one of three patterns. Positive correlation slopes upward, with y increasing as x increases. Negative correlation slopes downward, with y falling as x rises. And zero correlation shows no clear linear trend at all. The crucial qualifier is that word linear. Zero correlation means no straight-line relationship, but a strong non-linear relationship, such as a parabola, can still be present and invisible to the correlation coefficient. As a rough intuition for what to expect, height and weight are positively correlated, rainfall and sunshine hours negatively, ice cream sales and sun cream sales positively, study hours and exam marks positively, while petrol consumption and goals scored have no relationship at all.

Correlation Is Not Causation

The single most important caution in the whole subject is that a correlation between two variables does not mean one causes the other. Very often a third, confounding variable is driving both. Average teacher salary and national alcohol consumption move together across countries, but neither causes the other; both reflect a flourishing economy, and wealth is the real driver. The stork population in Bavaria correlates with the human birth rate, but storks do not deliver babies; young families may simply favour the same scenic rural areas that storks do. The size of a local student population correlates with the number of juvenile offenders, but only because both rise with the number of young people in the area. Connecting such pairs directly is meaningless, and remembering that keeps you from drawing embarrassing conclusions.

Measuring the Strength

The population correlation coefficient, written as the Greek letter rho, captures the strength of the linear relationship between two random variables.

$$\rho = \frac{E(XY) - E(X)\,E(Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}$$

In practice we never know the population values, so we estimate rho from sample data using the sample correlation coefficient r, built from three corrected sums.

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

Each of those sums is a familiar quantity with the effect of the means removed.

$$S_{xx} = \sum_{i=1}^n x_i^2 - n\bar{x}^2$$
$$S_{yy} = \sum_{i=1}^n y_i^2 - n\bar{y}^2$$
$$S_{xy} = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}$$
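The three corrected sums translate almost line for line into code. What follows is a minimal Python sketch, not a reference implementation: the function names `corrected_sums` and `sample_r` and the five data pairs are invented for illustration and are not taken from any dataset in this article.

```python
from math import sqrt


def corrected_sums(xs, ys):
    """Return (S_xx, S_yy, S_xy) for paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    return s_xx, s_yy, s_xy


def sample_r(xs, ys):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    s_xx, s_yy, s_xy = corrected_sums(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data, chosen only so the arithmetic is easy to check by hand.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

s_xx, s_yy, s_xy = corrected_sums(xs, ys)
r = sample_r(xs, ys)
print(s_xx, s_yy, s_xy)  # 10.0 6.0 6.0
print(round(r, 4))       # 0.7746
```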

Returning to the unemployment and crime data, with twelve areas, a mean unemployment of 1,665 and a mean offence count of 5,567, the corrected sums work out to an S_xx of 3,431,759, an S_yy of 2,584,497, and an S_xy of 2,563,066. Substituting these gives the coefficient.

$$r = \frac{2{,}563{,}066}{\sqrt{3{,}431{,}759 \times 2{,}584{,}497}} = 0.8606$$

This is a strong, positive linear correlation, exactly as the upward-sloping scatterplot suggested.
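As a quick check of that substitution, the short Python sketch below plugs the three corrected sums quoted above into the formula for r. The constant names are arbitrary; only the three figures come from the text.

```python
from math import sqrt

# Corrected sums quoted in the text for the twelve city areas.
S_XX = 3_431_759
S_YY = 2_584_497
S_XY = 2_563_066

r = S_XY / sqrt(S_XX * S_YY)
print(round(r, 4))  # 0.8606
```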

The coefficient has several reassuring properties. It is independent of the scale and the origin of measurement, so converting pounds to pence or shifting a zero point leaves it unchanged. It is symmetric, so the correlation of X with Y equals the correlation of Y with X. And it is always bounded between minus one and plus one, with values near plus one signalling a strong positive linear relationship, values near minus one a strong negative one, and values near zero the absence of any linear relationship. That last point bears repeating with a concrete case. The curve y equals x times one minus x, over the range from zero to one, produces a correlation near zero because there is no linear trend, yet x and y are obviously related along a parabola. A sample from that curve might return a coefficient of around 0.148, a figure that badly understates a relationship the scatterplot makes plain, which is why you always look at the picture as well as the number.
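These properties are easy to check numerically. The Python sketch below uses a small invented dataset (the `pounds` and `ys` values are hypothetical) to show that rescaling, shifting, and swapping the variables leave r unchanged. It then evaluates the parabola on an evenly spaced grid, where the symmetry makes r essentially zero. A real sample that is not perfectly symmetric would give a small non-zero value instead.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical paired data, used only to illustrate the properties.
pounds = [1.5, 2.0, 3.5, 4.0, 6.0]
ys = [10.0, 14.0, 13.0, 18.0, 21.0]
r = pearson_r(pounds, ys)

# Scale: converting pounds to pence multiplies every x by 100.
r_pence = pearson_r([100 * p for p in pounds], ys)
# Origin: shifting the zero point adds a constant to every x.
r_shifted = pearson_r([p + 50 for p in pounds], ys)
# Symmetry: swap the roles of the two variables.
r_swapped = pearson_r(ys, pounds)

# Parabola y = x(1 - x) on an evenly spaced grid over [0, 1].
grid = [i / 100 for i in range(101)]
curve = [x * (1 - x) for x in grid]
r_parabola = pearson_r(grid, curve)

print(abs(r - r_pence) < 1e-12, abs(r - r_shifted) < 1e-12, abs(r - r_swapped) < 1e-12)
print(abs(r_parabola) < 1e-9)
```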

When Ranks Are Better

Sometimes the ordinary correlation coefficient is the wrong tool, either because the data contains outliers that distort it or because only ranks, not raw measurements, are available. In those cases the Spearman rank correlation steps in. You rank both variables in ascending order, find the difference in ranks for each pair, and feed the squared differences into a compact formula.

$$r_s = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2 - 1)}$$

Its limits are the same as for r, running from minus one to plus one. Suppose ten staff are ranked on both a sales aptitude test and their actual productivity. Taking the difference between each person’s two ranks, squaring those differences, and summing gives a total of 46. With ten people, the formula yields the result.

$$r_s = 1 - \frac{6 \times 46}{10 \times (100 - 1)} = 1 - \frac{276}{990} = 0.7212$$

This reasonably strong positive value indicates the aptitude test is a fairly good predictor of sales ability.
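The Spearman calculation can be sketched in a few lines of Python. The function names are illustrative, and the five-person rank lists are hypothetical; only the total of 46 and the group size of ten come from the example above. The compact formula assumes there are no tied ranks.

```python
def spearman_from_sum_d2(sum_d2, n):
    """Spearman rank correlation from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def spearman_from_ranks(ranks_a, ranks_b):
    """Spearman rank correlation from two lists of ranks (assumes no ties)."""
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return spearman_from_sum_d2(sum_d2, len(ranks_a))


# Figures quoted in the text: ten staff, squared rank differences sum to 46.
r_s = spearman_from_sum_d2(46, 10)
print(round(r_s, 4))  # 0.7212

# Hypothetical ranks for five people, to show the rank-list version.
test_ranks = [1, 2, 3, 4, 5]
output_ranks = [2, 1, 3, 5, 4]
r_s_small = spearman_from_ranks(test_ranks, output_ranks)
print(round(r_s_small, 4))  # 0.8
```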

From Strength to Equation

Regression goes a step beyond correlation by giving an actual equation for the relationship, so you can predict y for any given x. Simple linear regression involves exactly two variables, and it matters which is which. The dependent or response variable, Y, is what you are trying to explain, and the independent or explanatory variable, X, is what you believe influences it. The true population relationship is assumed to take a specific form.

$$y = \alpha + \beta x + \varepsilon$$

Here alpha is the y-intercept and beta the slope, both unknown population parameters, while epsilon is the error term, the random deviation of each individual observation from the underlying line. Written for a single data point, it becomes the following.

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

The model rests on three assumptions. Linearity says the true relationship really does follow that straight-line-plus-error form. Constant variance says the spread of the errors is the same at every value of x. And normality with independence says the errors are normally distributed with mean zero and are independent of one another.
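One way to get a feel for these assumptions is to simulate data that obeys them. The Python sketch below is illustrative only: the function `simulate_responses`, the parameter values, and the seed are all made up, since in real work alpha, beta and the error spread are unknown.

```python
import random


def simulate_responses(xs, alpha, beta, sigma, seed=0):
    """Draw y_i = alpha + beta * x_i + eps_i with independent eps_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical parameter values chosen for the illustration.
ALPHA, BETA, SIGMA = 2.0, 3.0, 0.5

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = simulate_responses(xs, ALPHA, BETA, SIGMA)

# Each error is the gap between an observation and the true line.
errors = [y - (ALPHA + BETA * x) for x, y in zip(xs, ys)]
print([round(e, 3) for e in errors])
```

The same sigma is used for every observation, which is the constant-variance assumption, and each draw is made independently of the others.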

Fitting the Line

We estimate the two parameters by the method of least squares, which finds the line that minimises the total of the squared vertical distances between the observed points and the line’s predictions. The fitted line uses lower-case letters to mark that these are estimates.

$$\hat{y} = a + bx$$

The slope comes first, because the intercept formula needs it.

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$
$$a = \bar{y} - b\bar{x}$$

A short calculation shows the rhythm. With ten observations where the sum of x is 32.7, the sum of y is 145.7, the sum of the products is 516.19, and the sum of the squared x values is 117.57, the means are 3.27 and 14.57. The slope follows.

$$b = \frac{516.19 - 10 \times 3.27 \times 14.57}{117.57 - 10 \times (3.27)^2} = 3.7356$$

Then the intercept.

$$a = 14.57 - 3.7356 \times 3.27 = 2.3546$$

So the estimated line is the following.

$$\hat{y} = 2.3546 + 3.7356x$$
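The same calculation can be sketched in Python from the summary sums alone. The function name `fit_from_sums` is illustrative. Because the code carries the slope at full precision instead of rounding it to 3.7356 first, the intercept comes out at 2.3544, a hair below the hand-calculated 2.3546.

```python
def fit_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least-squares intercept a and slope b from summary sums."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    # Slope first, because the intercept formula needs it.
    b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b


# Summary figures quoted in the text for the ten observations.
a, b = fit_from_sums(n=10, sum_x=32.7, sum_y=145.7, sum_xy=516.19, sum_x2=117.57)
print(round(b, 4))  # 3.7356
print(round(a, 4))  # 2.3544 (2.3546 if b is rounded before use)
```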

Predicting From the Line

Once you have the line, prediction is simply substituting a chosen value of x and reading off the result, always taking care that the units of your input match the units of the original data.

$$\hat{y} = a + bx_0$$

Take twelve weeks of advertising spend and sales, both measured in thousands of pounds. With the relevant sums supplied, the slope works out to 3.221 and the intercept to 343.7, giving a fitted line of y-hat equals 343.7 plus 3.221 x. To predict sales for £35,000 of advertising, note that the data is in thousands, so the input is 35, not 35,000.

$$\hat{y} = 343.7 + 3.221 \times 35 = 456.4$$

The predicted weekly sales are therefore £456,400, and attaching the right units to that answer is part of getting it right.
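The unit handling is the easiest part to get wrong, so it is worth making explicit. The Python sketch below wraps the fitted line in a hypothetical helper, `predict_sales_pounds`, that accepts and returns plain pounds and does the conversion to thousands internally. The intercept and slope are the ones quoted above.

```python
# Fitted line from the text; both variables are measured in thousands of pounds.
INTERCEPT = 343.7
SLOPE = 3.221


def predict_sales_pounds(advertising_pounds):
    """Predict weekly sales in pounds from advertising spend in pounds."""
    x_thousands = advertising_pounds / 1000        # convert to the data's units
    y_thousands = INTERCEPT + SLOPE * x_thousands  # apply the fitted line
    return y_thousands * 1000                      # convert back to pounds


prediction = predict_sales_pounds(35_000)
print(round(prediction))  # 456435
```

The unrounded figure of £456,435 agrees with the £456,400 above once the prediction is rounded to one decimal place in thousands.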

A fuller example makes the danger of over-reaching clear. For nine plots relating fertiliser in grams per square metre to crop yield in kilograms per hectare, the sums give means of 4 and 179, a slope of 3.05, and an intercept of 166.8.

$$b = \frac{6{,}627 - 9 \times 4 \times 179}{204 - 9 \times 16} = \frac{183}{60} = 3.05$$
$$\hat{y} = 166.8 + 3.05x$$

Predicting the yield at 3.5 grams per square metre is sound, because that value sits comfortably inside the observed range.

$$\hat{y} = 166.8 + 3.05 \times 3.5 = 177.475$$

But predicting at 10 grams per square metre would be reckless, because the data only reaches 8, and the scatterplot even hints that fertiliser above about 7 may start to reduce yield. Pushing the line beyond the data is extrapolation, and it is unreliable by nature.
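One defensive habit is to make the prediction function refuse inputs outside the observed range. The Python sketch below is one possible way to do that, using the fertiliser line. The helper `predict_within_range` is hypothetical, and while the upper limit of 8 comes from the text, the lower limit of 0 is an assumption made purely for the illustration.

```python
def predict_within_range(x0, a, b, x_min, x_max):
    """Return a + b * x0, refusing to extrapolate outside [x_min, x_max]."""
    if not x_min <= x0 <= x_max:
        raise ValueError(
            f"x0={x0} lies outside the observed range [{x_min}, {x_max}]; "
            "the prediction would be an extrapolation"
        )
    return a + b * x0


# Fitted fertiliser line from the text.
A, B = 166.8, 3.05
# Upper limit stated in the text; lower limit of 0 is an assumption.
X_MIN, X_MAX = 0, 8

print(round(predict_within_range(3.5, A, B, X_MIN, X_MAX), 3))  # 177.475

try:
    predict_within_range(10, A, B, X_MIN, X_MAX)
except ValueError as err:
    print("refused:", err)
```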

The Cautions That Matter

A few traps deserve permanent attention. A straight line captures only linear patterns, so any genuinely curved relationship, common in the natural sciences, will be missed entirely no matter how good the fit looks. The choice of dependent variable is not arbitrary either, because regressing y on x gives a different line from regressing x on y; common sense settles it by asking which variable reacts to the other, and in the advertising example it is sales that respond to spend, not the reverse.
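The difference between the two regressions shows up clearly in the unemployment and offences data. The Python sketch below reuses the corrected sums quoted earlier. The identity it checks at the end, that the two slopes multiply to r squared, is a standard result and is not derived in this article.

```python
# Corrected sums quoted earlier for the unemployment and offences data.
S_XX = 3_431_759
S_YY = 2_584_497
S_XY = 2_563_066

slope_y_on_x = S_XY / S_XX  # line for predicting y from x
slope_x_on_y = S_XY / S_YY  # line for predicting x from y

# Rearranged to express y in terms of x, the x-on-y line has slope 1 / slope_x_on_y.
implied_slope = 1 / slope_x_on_y

r_squared = S_XY ** 2 / (S_XX * S_YY)

print(round(slope_y_on_x, 4))   # 0.7469
print(round(implied_slope, 4))  # 1.0084
print(abs(slope_y_on_x * slope_x_on_y - r_squared) < 1e-12)  # True
```

The two lines would coincide only if r were exactly plus or minus one, which is why the choice of dependent variable has to be made deliberately.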

The distinction between interpolation and extrapolation is the one that catches people out most. Interpolation, predicting within the range of the observed x values, is generally reliable. Extrapolation, predicting outside that range, should be treated with extreme caution. A line estimated as y-hat equals 8 minus 0.6 x from women with five to eight years of education predicts a sensible five births at five years and a still-plausible two births at ten years, already beyond the data, but at fifteen years it predicts minus one birth, which is impossible, and at zero years it predicts eight, a figure the data simply does not support. A relationship observed over a narrow band cannot be assumed to hold beyond it.
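Written out, the four predictions from that line are as follows. Only the first uses a value inside the observed range of five to eight years.

$$\begin{aligned}
\hat{y}(5) &= 8 - 0.6 \times 5 = 5 \\
\hat{y}(10) &= 8 - 0.6 \times 10 = 2 \\
\hat{y}(15) &= 8 - 0.6 \times 15 = -1 \\
\hat{y}(0) &= 8 - 0.6 \times 0 = 8
\end{aligned}$$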

Finally, two subtler points connect correlation and regression. A large value of r-squared means the points lie close to the line, which gives a small standard error for the slope, while a small r-squared means the points scatter widely and the slope is estimated imprecisely. But a large absolute r does not mean the slope is steep. A strong correlation only says the points cluster tightly around their line; that line may climb sharply or gently. As a matter of exam technique, expect to be handed the five summary statistics, the sums of x, of x squared, of y, of y squared, and of the cross-products, to spare you the raw arithmetic, and always show your working, because an unsupported wrong answer earns nothing.
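Both points can be read off two standard identities, sketched here in the notation of the corrected sums. They are textbook results for simple linear regression, not formulas given earlier in this article, and n is the number of observations.

$$b = \frac{S_{xy}}{S_{xx}} = r\sqrt{\frac{S_{yy}}{S_{xx}}}, \qquad \text{se}(b) = \sqrt{\frac{(1 - r^2)\,S_{yy}}{(n - 2)\,S_{xx}}}$$

The slope is r scaled by the ratio of the spreads of y and x, so the same r can go with a steep or a gentle line, while the standard error of the slope shrinks as r-squared approaches one.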

Please see an article with 10 exercises for this lesson here: https://datalad.co.uk/correlation-and-regression-workbook-10-exercises-with-full-solutions/

Summary

Correlation and regression turn a scatterplot into numbers you can act on. Always plot the data first, then measure the strength of any linear relationship with the correlation coefficient r, remembering that it lives between minus one and plus one, that it sees only linear patterns, and above all that correlation never proves causation. When outliers or ranked data make r unreliable, switch to the Spearman rank correlation. To go from strength to prediction, fit a least-squares line by computing the slope first and the intercept second, then predict by substituting a value of x, but only within the range of your data, because extrapolation beyond it is where careful analysis goes to die. Keep those cautions in view and this pair of techniques becomes one of the most useful in all of applied statistics.

See you soon.
