The Simple Regression Model

The simple regression model is where econometrics begins. Learn the SLR equation, OLS estimation, R-squared, the four assumptions, why OLS is unbiased, and where its standard errors come from.

The simple regression model is the workhorse from which all of econometrics is built, and it earns that place by solving three problems at once. Whenever you try to explain one variable y in terms of another variable x, you immediately face three difficulties. First, y is never determined by x alone, so you have to do something with all the other influences. Second, you have to commit to a functional form, the actual mathematical shape of the relationship. Third, you need ceteris paribus, some way to ensure that the relationship you estimate reflects the effect of x on y holding everything else constant. Take class size and student performance as an example: teacher quality, family income, and prior ability all affect performance and become the “other factors,” the shape of the relationship has to be assumed, and smaller classes tend to sit in wealthier schools, which is the ceteris paribus threat. The simple linear regression model addresses all three together.

The Model and What Its Pieces Mean

The population relationship between two variables is written as a line plus an error.

$$y = \beta_0 + \beta_1 x + u$$

The two variables are not treated symmetrically; the goal is always to explain y in terms of x, never the reverse. Here y is the dependent or response variable, the thing being explained, and x is the regressor or explanatory variable, the thing doing the explaining. The intercept beta-zero is the baseline level of y when x is zero, the slope beta-one is the change in y per one-unit change in x holding all else fixed, and the error term u collects every other factor affecting y that x does not capture. The slope is precisely the ceteris paribus effect, because when u is held fixed, a change in x produces a proportional change in y.

$$\Delta y = \beta_1 \Delta x$$

For wages and education, the model says wage equals beta-zero plus beta-one times years of education plus an error containing ability, motivation, experience, and everything else.

$$\text{wage} = \beta_0 + \beta_1\,\text{educ} + u$$

Holding those other factors fixed, one more year of education changes the hourly wage by beta-one. But that ceteris paribus reading is only valid if u really is held constant, which is not automatic; it requires formal assumptions about how u relates to x.

Pinning Down the Error Term

Two assumptions tame the error. The first is that its expected value is zero.

$$E(u) = 0$$

This is an innocuous normalisation rather than a real restriction, because as long as the model includes an intercept, any non-zero mean of the error simply gets absorbed into beta-zero without touching beta-one. If the error had mean alpha-zero, you could define a new error with that mean subtracted out and a new intercept with it added in, and the equation would be unchanged except that the intercept shifts while the slope, the parameter you actually care about, stays exactly the same.
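The absorption argument can be written out in one line. This is a sketch that reuses the alpha-zero notation from the paragraph above; nothing beyond the model equation is assumed.

```latex
% Suppose E(u) = \alpha_0 \neq 0. Add and subtract \alpha_0:
y = \beta_0 + \beta_1 x + u
  = (\beta_0 + \alpha_0) + \beta_1 x + (u - \alpha_0),
\qquad E(u - \alpha_0) = 0 .
% New intercept: \beta_0 + \alpha_0. New error: u - \alpha_0. Slope: still \beta_1.
```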

The second assumption is the substantive one, and it concerns how the error relates to x. Zero correlation is not enough, because the error could be uncorrelated with x yet correlated with a function of x such as its square. The stronger condition needed is mean independence: the average of the error is the same for every value of x.

$$E(u \mid x) = E(u) \quad \text{for all } x$$

In words, knowing x tells you nothing about the expected value of u. Randomly assigned fertiliser satisfies this, because the amount applied is independent of land quality. Education almost certainly violates it, because higher-ability people tend to choose more schooling, so ability, which lives in the error, is correlated with education.

Combining the two assumptions gives the single condition that everything else rests on, the zero conditional mean assumption.

$$E(u \mid x) = 0 \quad \text{for all } x$$

This immediately produces the population regression function, the line on which the average of y sits for each value of x.

$$E(y \mid x) = \beta_0 + \beta_1 x$$

The derivation is short: take the conditional expectation of the model, the constant and the slope-times-x pass through unchanged, and the error term vanishes because its conditional mean is zero. So for any given x, the distribution of y is centred on beta-zero plus beta-one times x, with individual values scattered around that centre by the error. If, say, the population function for college GPA were 1.5 plus 0.5 times high-school GPA, then a student with a high-school GPA of 3.6 would have an expected college GPA of 3.3, but that is the average over all such students; any one of them will land above or below depending on their unobserved factors.
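Written as a chain of equalities, the derivation described above looks like this. It uses only the model equation and the zero conditional mean assumption, and the last comment repeats the GPA arithmetic from the text.

```latex
\begin{aligned}
E(y \mid x) &= E(\beta_0 + \beta_1 x + u \mid x) \\
            &= \beta_0 + \beta_1 x + E(u \mid x) \\
            &= \beta_0 + \beta_1 x
\end{aligned}
% GPA illustration: 1.5 + 0.5 \times 3.6 = 3.3
```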

Estimating the Line from a Sample

Population expectations cannot be computed from data, so OLS replaces them with their sample analogues. The two population conditions, that the error has mean zero and that x is uncorrelated with the error, become two sample equations in the two unknown estimates, and solving them yields the OLS estimators. The slope is the sample covariance of x and y divided by the sample variance of x.

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

The intercept then follows from forcing the line through the sample means.

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

The slope exists as long as x actually varies in the sample. A small worked example makes it concrete. With five points (1,3), (2,5), (3,4), (4,7), (5,8), the means are x-bar equal to 3 and y-bar equal to 5.4, the sum of cross-products is 12 and the sum of squared deviations of x is 10, so the slope is 1.2 and the intercept is 1.8.

$$\hat{y} = 1.8 + 1.2x$$

Each one-unit increase in x is associated with a predicted increase of 1.2 in y.
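The arithmetic of the five-point example can be reproduced with a few lines of plain Python. This is a minimal sketch of the closed-form formulas above, not a replacement for a statistics package; the function name `ols_fit` is an illustrative choice.

```python
# Five-point sample from the worked example.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 8]

def ols_fit(x, y):
    """Return (intercept, slope) from the closed-form OLS formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator of the slope: sum of cross-products of deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator of the slope: sum of squared deviations of x.
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    if sst_x == 0:
        raise ValueError("x does not vary in the sample; the slope is undefined")
    slope = s_xy / sst_x
    # The intercept forces the line through the point of sample means.
    intercept = y_bar - slope * x_bar
    return intercept, slope

b0, b1 = ols_fit(xs, ys)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")  # intercept = 1.80, slope = 1.20
```

The explicit check on `sst_x` mirrors the requirement that x must vary in the sample for the slope to exist.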

The name “ordinary least squares” comes from an equivalent route to the same answer: the estimates are the values that minimise the sum of squared residuals.

$$\text{SSR} = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$

Differentiating with respect to the two estimates and setting the derivatives to zero produces exactly the same equations as the method-of-moments approach, so the OLS line both satisfies the moment conditions and minimises prediction error. From the line you get fitted values and residuals.

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \qquad \hat{u}_i = y_i - \hat{y}_i$$

A positive residual means the line underpredicts that point, a negative one means it overpredicts. It is worth keeping the residual distinct from the error: the residual uses the estimated parameters, while the true error uses the unknown population parameters and can never be observed.
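As a quick numerical check of the least-squares idea, the sketch below computes fitted values and residuals for the five-point example and then nudges the estimates slightly to see whether the sum of squared residuals rises. The step size of 0.1 and the helper name `ssr` are arbitrary illustrative choices.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 8]
b0_hat, b1_hat = 1.8, 1.2  # OLS estimates from the worked example

def ssr(b0, b1):
    """Sum of squared residuals for a candidate line y = b0 + b1 * x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

fitted = [b0_hat + b1_hat * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

for x, y, f, r in zip(xs, ys, fitted, residuals):
    # A small tolerance keeps floating-point noise from mislabelling a zero residual.
    if r > 1e-12:
        verdict = "line underpredicts"
    elif r < -1e-12:
        verdict = "line overpredicts"
    else:
        verdict = "exact"
    print(f"x={x} y={y} fitted={f:.1f} residual={r:+.1f} ({verdict})")

# Moving either estimate away from the OLS values should only raise SSR.
ssr_ols = ssr(b0_hat, b1_hat)
nearby = [
    ssr(b0_hat + d0, b1_hat + d1)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.1, 0.0, 0.1)
    if (d0, d1) != (0.0, 0.0)
]
print(f"SSR at OLS = {ssr_ols:.2f}, smallest nearby SSR = {min(nearby):.2f}")
```

A grid of eight neighbours is only a spot check, not a proof; the first-order conditions are what guarantee the minimum.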

What OLS Guarantees by Construction

Four algebraic properties hold for any OLS fit, whether or not the model assumptions are true, simply because of how the estimates are defined. The residuals always sum to zero. The residuals are uncorrelated with the regressor, so the sum of x times the residuals is zero. The mean of the fitted values equals the mean of the actual y. And the fitted values are uncorrelated with the residuals. A direct consequence is that the point of sample means always lies on the regression line, which follows immediately from the intercept formula. In the five-point example, the residuals are 0, 0.8, -1.4, 0.4, and 0.2, which sum to zero, and weighting them by the x values also sums to zero, confirming the first two properties.
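The four properties, and the fact that the line passes through the sample means, can be verified directly on the five-point example. The sketch below recomputes the fit from scratch so that it stands alone; results are compared with zero only up to floating-point rounding.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 8]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

sum_resid = sum(resid)                                      # property 1
sum_x_resid = sum(x * r for x, r in zip(xs, resid))         # property 2
mean_fitted = sum(fitted) / n                               # property 3
sum_fitted_resid = sum(f * r for f, r in zip(fitted, resid))  # property 4
line_at_x_bar = b0 + b1 * x_bar                             # consequence

print(f"sum of residuals           : {sum_resid:.10f}")
print(f"sum of x times residuals   : {sum_x_resid:.10f}")
print(f"mean fitted vs mean y      : {mean_fitted:.4f} vs {y_bar:.4f}")
print(f"sum of fitted x residuals  : {sum_fitted_resid:.10f}")
print(f"line evaluated at x-bar    : {line_at_x_bar:.4f}")
```

Because the residuals sum to zero, a zero sum of products is the same thing as a zero sample covariance, which is why the code checks sums rather than correlations.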

How Well the Line Fits

To judge fit, the total variation in y is split into an explained part and an unexplained part.

$$\text{SST} = \text{SSE} + \text{SSR}$$

The total sum of squares is the overall variation in y, the explained sum of squares is the variation captured by the regression line, and the residual sum of squares is what is left over. The coefficient of determination, R-squared, is the fraction of total variation the model explains.

$$R^2 = \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\text{SSR}}{\text{SST}}$$

It lies between zero and one, with zero meaning no linear relationship and one meaning a perfect linear fit. In the five-point example the total, explained, and residual sums of squares are 17.2, 14.4, and 2.8, giving an R-squared of about 0.837, so x explains roughly 84% of the variation in y. A crucial caveat: a high R-squared signals good predictive power but says nothing about causality, and a low one does not invalidate a regression. In the classic wage-on-education regression, education alone explains only about 16.5% of wage variation, yet what matters is whether the slope is a reliable estimate of the ceteris paribus effect, not how much variation is explained.
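The decomposition and the R-squared for the five-point example can be reproduced as follows. This is a small sketch using the estimates 1.8 and 1.2 from the worked example; the variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 8]
b0_hat, b1_hat = 1.8, 1.2  # OLS estimates from the worked example

y_bar = sum(ys) / len(ys)
fitted = [b0_hat + b1_hat * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)               # total variation in y
sse = sum((f - y_bar) ** 2 for f in fitted)           # variation explained by the line
ssr = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # variation left in the residuals

r_squared = sse / sst

print(f"SST = {sst:.1f}, SSE = {sse:.1f}, SSR = {ssr:.1f}")
print(f"R-squared = {r_squared:.3f}")
print(f"1 - SSR/SST = {1 - ssr / sst:.3f}")
```

Both routes to R-squared agree here because the regression includes an intercept, which is what makes the decomposition hold.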

The Assumptions and the Properties They Buy

The statistical behaviour of OLS rests on a short list of assumptions, and it is worth knowing which property each one earns. SLR.1 states that the population model is linear in the parameters, which fixes the functional form and defines beta-zero and beta-one as the targets of estimation. SLR.2 states that the data is a random sample from that model, linking the sample to the population, and it reminds us that the unobserved error is not the same thing as the OLS residual. SLR.3 requires that x varies in the sample, since without variation the slope cannot be computed; this is almost always satisfied in practice. SLR.4 is the key substantive assumption, the zero conditional mean condition, which requires no systematic relationship between the error and x in the population.

$$E(u \mid x) = 0 \quad \text{for all } x$$

Under these four assumptions, the OLS estimators are unbiased, meaning their expected values equal the true parameters.

$$E(\hat{\beta}_1) = \beta_1 \qquad E(\hat{\beta}_0) = \beta_0$$

The proof for the slope runs in three moves. First, substituting the true model into the estimator gives a truth-plus-noise decomposition, the true slope plus a weighted sum of the errors, where the weights depend only on the x values.

$$\hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} w_i u_i$$
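The substitution behind this decomposition can be sketched in a few lines. The explicit weight formula is the standard one for this argument; SST_x here denotes the sum of squared deviations of x, the denominator of the slope estimator.

```latex
% Numerator of the slope, using \sum_i (x_i - \bar{x}) = 0 to drop \bar{y}:
\sum_{i=1}^{n}(x_i - \bar{x})\,y_i
  = \beta_0 \underbrace{\sum_{i=1}^{n}(x_i - \bar{x})}_{=\,0}
  + \beta_1 \underbrace{\sum_{i=1}^{n}(x_i - \bar{x})\,x_i}_{=\,\text{SST}_x}
  + \sum_{i=1}^{n}(x_i - \bar{x})\,u_i
% Dividing by SST_x:
\hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} w_i u_i,
\qquad w_i = \frac{x_i - \bar{x}}{\text{SST}_x}
```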

Second, conditioning on the x values, those weights are non-random, and by random sampling and zero conditional mean each weighted error has conditional expectation zero, so the conditional expectation of the slope estimate is exactly beta-one. Third, the law of iterated expectations carries that result through to the unconditional expectation, completing the proof. Two distinctions matter here. Unbiasedness is a property of the estimator, the recipe, not of any single number it produces; a particular estimate like 1.43 may sit far from the truth in one sample, because unbiasedness is about the average across many repeated samples. And if the error is correlated with x, SLR.4 fails and OLS is biased, which is the single most common source of bias in all of econometrics.
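Both distinctions can be illustrated with a small Monte Carlo experiment. The sketch below is built on an invented population: the true intercept of 1, the true slope of 2, the sample size, and the "ability" variable are all hypothetical choices made for illustration, not values from the text. It draws many samples, estimates the slope in each, and averages the estimates, once with the error independent of x and once with an unobserved factor that feeds into both x and the error.

```python
import random

def ols_slope(x, y):
    """Closed-form OLS slope: sample covariance over sample variance."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / sst_x

def average_slope(correlated_error, reps=2000, n=50,
                  beta0=1.0, beta1=2.0, seed=0):
    """Average the OLS slope over many samples from a known population."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x, y = [], []
        for _ in range(n):
            ability = rng.gauss(0, 1)        # unobserved factor (hypothetical)
            xi = 5 + rng.gauss(0, 1)
            if correlated_error:
                xi += ability                # x now depends on the unobserved factor
            u = ability + rng.gauss(0, 1)    # the error always contains that factor
            x.append(xi)
            y.append(beta0 + beta1 * xi + u)
        slopes.append(ols_slope(x, y))
    return sum(slopes) / reps

mean_ok = average_slope(correlated_error=False)
mean_biased = average_slope(correlated_error=True)

print(f"true slope                     : 2.0")
print(f"average estimate, SLR.4 holds  : {mean_ok:.3f}")
print(f"average estimate, SLR.4 fails  : {mean_biased:.3f}")
```

In this design the second average should settle near 2.5 rather than 2, because the covariance between x and the error is 1 and the variance of x is 2, so the slope picks up an extra one half. Individual estimates in either scenario still scatter around their average, which is the sense in which unbiasedness describes the recipe rather than any single number.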

Precision and Standard Errors

Unbiasedness says nothing about how precise the estimates are; for that, a fifth assumption is added. SLR.5, homoskedasticity, states that the error has the same variance for every value of x.

$$\text{Var}(u \mid x) = \sigma^2 > 0 \quad \text{for all } x$$

The conditional mean of y is allowed to move with x, that is the whole point of the regression, but the conditional variance does not. This is violated when, for instance, wealthier families show more variable saving behaviour, or when school-level averages are less variable for larger schools because they average over more students. Under all five assumptions, the variance of the slope estimate has a clean form.

$$\text{Var}(\hat{\beta}_1 \mid \mathbf{X}) = \frac{\sigma^2}{\text{SST}_x}$$

Three forces drive precision. More error variance raises the variance of the estimate, because more noise makes the slope harder to learn. More variation in x lowers it, because a wider spread of x carries more information about the slope. And a larger sample lowers it, since the total variation in x grows with n, so the variance shrinks roughly at rate one over n. The proof is short: the variance of the estimate is the variance of the weighted error sum, the weights come out squared because they are non-random given x, homoskedasticity makes every error variance equal to sigma-squared, and the weights simplify to give one over the total variation in x. Notice that homoskedasticity played no part in establishing unbiasedness; it is needed only for these variance formulas.
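The steps of that proof can be laid out as a chain of equalities. This sketch assumes the weights take the standard form, each deviation of x from its mean divided by SST_x. The second equality also relies on the errors being uncorrelated across observations, which random sampling provides.

```latex
\begin{aligned}
\text{Var}(\hat{\beta}_1 \mid \mathbf{X})
  &= \text{Var}\Big(\sum_{i=1}^{n} w_i u_i \,\Big|\, \mathbf{X}\Big),
     \qquad w_i = \frac{x_i - \bar{x}}{\text{SST}_x} \\
  &= \sum_{i=1}^{n} w_i^2 \,\text{Var}(u_i \mid \mathbf{X}) \\
  &= \sigma^2 \sum_{i=1}^{n} w_i^2 \\
  &= \sigma^2 \,\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\text{SST}_x^{\,2}}
   = \frac{\sigma^2}{\text{SST}_x}
\end{aligned}
```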

The error variance itself is unknown and must be estimated. Because the true errors are unobserved, you use the residuals, but a naive average of squared residuals is biased slightly downward, since OLS imposes two restrictions on the residuals and so consumes two degrees of freedom. The unbiased estimator divides by n minus two instead of n.

$$\hat{\sigma}^2 = \frac{\text{SSR}}{n-2}$$

Its square root, the standard error of the regression, estimates the standard deviation of the error in the population. In the five-point example, with a residual sum of squares of 2.8 and n equal to 5, the estimate is 2.8 over 3, about 0.933, so the standard error of the regression is about 0.966. Substituting this estimate into the variance formula gives the standard error of the slope.

$$\text{se}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\text{SST}_x}}$$
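For the five-point example, the degrees-of-freedom correction and the slope standard error can be computed as in the sketch below. It reuses the estimates 1.8 and 1.2 from the worked example; the variable names are illustrative.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 8]
b0_hat, b1_hat = 1.8, 1.2  # OLS estimates from the worked example
n = len(xs)

x_bar = sum(xs) / n
residuals = [y - (b0_hat + b1_hat * x) for x, y in zip(xs, ys)]

ssr = sum(r ** 2 for r in residuals)          # residual sum of squares
sst_x = sum((x - x_bar) ** 2 for x in xs)     # total variation in x

sigma2_hat = ssr / (n - 2)                    # divide by n - 2, not n
ser = math.sqrt(sigma2_hat)                   # standard error of the regression
se_b1 = ser / math.sqrt(sst_x)                # standard error of the slope

print(f"sigma-hat squared = {sigma2_hat:.3f}")
print(f"standard error of the regression = {ser:.3f}")
print(f"se(slope) = {se_b1:.3f}")
```

With only five observations the slope standard error comes out at roughly 0.31, large relative to the slope of 1.2, which is a reminder of how little precision so small a sample buys.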

Standard errors are always reported beside the coefficient estimates, usually in parentheses below them, because they measure the precision of the estimates and are the basis for confidence intervals and hypothesis tests. In the wage-on-education regression, the fitted line is roughly negative 5.12 plus 1.43 times education, with a standard error on the slope of about 0.053, meaning that across repeated samples the estimated return to a year of schooling would typically vary by about five cents an hour around the true slope.

A few exercises for you to practice here: https://datalad.co.uk/the-simple-regression-model-10-exercises-with-full-solutions/

Conclusion

The simple regression model writes y as a line in x plus an error, and almost everything follows from what you assume about that error. Setting its mean to zero is a harmless normalisation, but the zero conditional mean assumption, that the error is mean-independent of x, is the substantive condition that makes the slope a genuine ceteris paribus effect and makes OLS unbiased. OLS estimates the line either by matching sample moments or, equivalently, by minimising squared residuals, and several tidy algebraic properties hold automatically, including that the line passes through the sample means. R-squared measures fit but not causality. Adding homoskedasticity yields the variance formula that shows precision improving with more data and more variation in x, and a degrees-of-freedom correction gives an unbiased estimate of the error variance and hence the standard errors. The one assumption to guard above all others is the zero conditional mean: when an omitted factor in the error is correlated with x, the whole edifice tilts, and the estimate stops meaning what you want it to mean.

See you soon.
