The Logic of a Hypothesis Test
A hypothesis test is a formal procedure for choosing between two competing statements about a population parameter on the basis of sample evidence. Rather than accepting a claim without scrutiny, the test asks: if the default position were true, how likely is the data we actually observed?
The null hypothesis H₀ represents the default position — no effect, no difference, no change — and is always stated with an equality. The alternative hypothesis H₁ represents what the researcher is trying to find evidence for. It specifies either a direction (the parameter is greater than, or less than, the null value) or simply that the parameter differs from it.
When H₁ states only that the parameter differs without specifying a direction, the test is two-tailed. When a specific direction is anticipated, the test is upper-tailed or lower-tailed accordingly. The form of H₁ must be chosen before looking at the data; it reflects the research question, not the sample outcome.
To illustrate, suppose a researcher believes comprehensive financial planning raises average bank returns above 10.2%. The parameter of interest is μ, the mean return. The null hypothesis is H₀: μ = 10.2, and because the researcher expects an increase, the alternative is H₁: μ > 10.2, making this an upper-tailed test.
Test Statistics and Critical Regions
Given a null hypothesis, a test statistic is computed from the sample. Its general form is:
test statistic = (sample estimate − hypothesised value) / (standard error of the estimate)
Under the assumption that H₀ is true, the test statistic follows a known distribution, typically a standard normal or a t distribution. The critical region is the set of values extreme enough to cast doubt on H₀, defined by critical values that mark the boundary between retaining and rejecting it.
The decision rule is direct: reject H₀ if the test statistic falls in the critical region; do not reject otherwise. Failing to reject H₀ does not prove it is true — it means the data are consistent with it.
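The decision rule can be sketched in a few lines of Python. This is only an illustration: the function name `reject_null` and the `tail` labels are assumptions made for this example, and the critical value is passed in as a positive number.

```python
def reject_null(stat: float, critical: float, tail: str) -> bool:
    """Return True if the test statistic falls in the critical region.

    `critical` is the positive critical value (for example 1.960);
    `tail` is "two", "upper" or "lower" (illustrative labels).
    """
    if tail == "two":
        return abs(stat) > critical
    if tail == "upper":
        return stat > critical
    if tail == "lower":
        return stat < -critical
    raise ValueError(f"unknown tail: {tail!r}")


print(reject_null(-2.50, 1.960, "two"))   # True: |-2.50| > 1.960
print(reject_null(1.44, 1.645, "upper"))  # False: 1.44 < 1.645
```

A return value of False means only "do not reject H₀", never "H₀ is proven".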
Types of Error
Because a hypothesis test makes a decision under uncertainty, two types of error are possible. A Type I error occurs when H₀ is rejected even though it is true — a false positive — with probability α. A Type II error occurs when H₀ is not rejected even though it is false — a false negative — with probability β. The power of a test is 1 − β, the probability of correctly rejecting a false H₀.
In a drug trial where H₀ states the drug has no effect, a Type I error means concluding the drug works when it does not, wasting resources and raising false hope. A Type II error means concluding the drug does not work when it does, causing a genuine treatment to be overlooked. Type I errors are generally considered the more serious of the two, which is why the significance level α is fixed in advance to control them.
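One way to see what α means is to simulate many samples for which H₀ really is true and count how often a two-tailed z-test rejects it. The sketch below is illustrative only: the number of trials, the sample size, the seed, and the choice of a standard normal population are all arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_one_error_rate(trials=2000, n=25, alpha=0.05, seed=0):
    """Estimate how often a true H0 (mu = 0, sigma = 1 known) is rejected."""
    rng = random.Random(seed)
    sigma = 1.0
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]  # H0 is true here
        z = fmean(sample) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1  # a Type I error
    return rejections / trials


rate = type_one_error_rate()
print(f"observed Type I error rate: {rate:.3f}")
```

With enough trials the observed rate should settle close to the chosen α, which is the sense in which fixing α controls Type I errors.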
Significance Levels and the Decision Framework
The significance level α is the threshold for the Type I error probability. Common values are 10%, 5%, and 1%. A standard approach is to test at two levels sequentially: begin at 5%, then move to 1% if H₀ is rejected, or to 10% if it is not. This produces four possible conclusions. Rejection at 1% is highly significant; rejection at 5% but not 1% is moderately significant; rejection at 10% but not 5% is weakly significant; failure to reject at 10% is not significant.
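Because rejecting H₀ at level α is equivalent to the p-value falling below α, the four conclusions can also be read off a p-value directly. A minimal sketch, with `significance_label` as a hypothetical helper name:

```python
def significance_label(p_value: float) -> str:
    """Map a p-value to the four-way verdict used in this lesson."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value < 0.01:
        return "highly significant"      # rejected at 5% and at 1%
    if p_value < 0.05:
        return "moderately significant"  # rejected at 5% but not at 1%
    if p_value < 0.10:
        return "weakly significant"      # rejected at 10% but not at 5%
    return "not significant"             # not rejected even at 10%


print(significance_label(0.03))  # moderately significant
```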
The critical values for a standard normal test statistic at each level are:
| Level α | Two-tailed (±z_{α/2}) | Upper-tailed (z_α) | Lower-tailed (−z_α) |
|---|---|---|---|
| 10% | ±1.6449 | 1.2816 | −1.2816 |
| 5% | ±1.9600 | 1.6449 | −1.6449 |
| 1% | ±2.5758 | 2.3263 | −2.3263 |
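The tabulated values can be reproduced with the inverse CDF of the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); `critical_values` is a name chosen for this example.

```python
from statistics import NormalDist


def critical_values(alpha: float):
    """Return (two-tailed, upper-tailed) standard normal critical values."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    two_tailed = std_normal.inv_cdf(1 - alpha / 2)
    upper = std_normal.inv_cdf(1 - alpha)
    return two_tailed, upper


for alpha in (0.10, 0.05, 0.01):
    two_tailed, upper = critical_values(alpha)
    # The lower-tailed value is the upper-tailed value with its sign flipped.
    print(f"{alpha:.0%}: ±{two_tailed:.4f}  {upper:.4f}  {-upper:.4f}")
```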
p-Values
The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the observed value, under the assumption that H₀ is true. H₀ is rejected when the p-value falls below α.
For a two-tailed test the p-value doubles the tail probability beyond the observed statistic:
p = 2 × P(Z ≥ |z|)
For a lower-tailed test it is the left-tail probability:
p = P(Z ≤ z)
For an upper-tailed test it is the right-tail probability:
p = P(Z ≥ z)
As a worked example, for a lower-tailed test with z = −1.82 and Z ∼ N(0, 1), the p-value is P(Z ≤ −1.82) = 1 − Φ(1.82) = 1 − 0.9656 = 0.0344. Since 0.01 < 0.0344 < 0.05, the result is moderately significant.
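The three p-value definitions for a standard normal test statistic can be sketched as follows. `z_p_value` and its `tail` labels are names assumed for this example; Φ is taken from `statistics.NormalDist`.

```python
from statistics import NormalDist


def z_p_value(z: float, tail: str) -> float:
    """p-value of a standard normal test statistic for the given tail."""
    phi = NormalDist().cdf  # Φ, the standard normal CDF
    if tail == "lower":
        return phi(z)                  # P(Z <= z)
    if tail == "upper":
        return 1 - phi(z)              # P(Z >= z)
    if tail == "two":
        return 2 * (1 - phi(abs(z)))   # 2 × P(Z >= |z|)
    raise ValueError(f"unknown tail: {tail!r}")


print(f"{z_p_value(-1.82, 'lower'):.4f}")  # 0.0344, as in the worked example
```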
Test for a Single Mean: σ Known
When the population standard deviation σ is known, the test statistic under H₀: μ = μ₀ is:
z = (x̄ − μ₀) / (σ/√n)
Suppose n = 100, x̄ = 1570, and σ = 120, testing H₀: μ = 1600 against H₁: μ ≠ 1600. The test statistic is z = (1570 − 1600) / (120/√100) = −30/12 = −2.50. At the 5% level the critical values are ±1.960 and |−2.50| > 1.960, so H₀ is rejected. At the 1% level the critical values are ±2.5758 and |−2.50| < 2.5758, so H₀ is not rejected at this stricter threshold. The result is moderately significant: there is evidence that μ ≠ 1600. The p-value confirms this: 2 × P(Z > 2.50) = 2 × 0.0062 = 0.0124, which lies between 0.01 and 0.05.
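The same calculation in Python, as a check on the arithmetic. `one_sample_z` is a hypothetical helper name; the inputs are the figures from the example.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(xbar: float, mu0: float, sigma: float, n: int) -> float:
    """z statistic for H0: mu = mu0 when sigma is known."""
    return (xbar - mu0) / (sigma / sqrt(n))


z = one_sample_z(xbar=1570, mu0=1600, sigma=120, n=100)
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_two_tailed:.4f}")  # z = -2.50, p = 0.0124
```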
Test for a Single Mean: σ Unknown
When σ is unknown and estimated by the sample standard deviation s, the test statistic follows a t distribution with n − 1 degrees of freedom:
t = (x̄ − μ₀) / (s/√n)
Returning to the bank returns example with n = 26, x̄ = 10.5, and s = 0.714, testing H₀: μ = 10.2 against H₁: μ > 10.2, the test statistic is t = (10.5 − 10.2) / (0.714/√26) = 0.3/0.1400 = 2.14 on 25 degrees of freedom. At the 5% level t_{0.05, 25} = 1.708 and 2.14 > 1.708, so H₀ is rejected. At the 1% level t_{0.01, 25} = 2.485 and 2.14 < 2.485, so H₀ is not rejected. The result is moderately significant: there is evidence that comprehensive planning raises returns above 10.2%.
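A short sketch of this calculation follows. The Python standard library has no t distribution, so the critical values are simply copied from the text rather than computed; `one_sample_t` and `T_CRIT_25` are names assumed for the example.

```python
from math import sqrt


def one_sample_t(xbar: float, mu0: float, s: float, n: int):
    """Return the t statistic and its degrees of freedom for H0: mu = mu0."""
    return (xbar - mu0) / (s / sqrt(n)), n - 1


# Upper-tail critical values for 25 degrees of freedom, taken from the text.
T_CRIT_25 = {0.05: 1.708, 0.01: 2.485}

t, df = one_sample_t(xbar=10.5, mu0=10.2, s=0.714, n=26)
print(f"t = {t:.2f} on {df} degrees of freedom")
for alpha, crit in T_CRIT_25.items():
    print(f"reject at {alpha:.0%}: {t > crit}")
```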
Test for a Single Proportion
When testing a claim about a population proportion π, the standard error is computed using the null value π₀, not the sample proportion p. Under H₀: π = π₀ the test statistic is:
z = (p − π₀) / √(π₀(1 − π₀)/n)
Suppose 68 out of 150 customers favour mobile banking and the claim is that π = 0.40. The sample proportion is p = 68/150 = 0.4533 and the test statistic is z = (0.4533 − 0.40) / √(0.40 × 0.60 / 150) = 0.0533/0.0400 = 1.333. For the two-tailed test H₁: π ≠ 0.40, the 5% critical values are ±1.960 and 1.333 < 1.960, so H₀ is not rejected. Moving to the 10% level, the critical values are ±1.6449 and 1.333 < 1.6449, so H₀ is again not rejected. The result is not significant: the data are consistent with π = 0.40.
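The calculation can be checked in code. Working from the unrounded proportion 68/150 gives z ≈ 1.33; rounding p to three decimals before dividing gives a slightly smaller value, and the conclusion is the same either way. `one_proportion_z` is a hypothetical helper name.

```python
from math import sqrt


def one_proportion_z(successes: int, n: int, pi0: float) -> float:
    """z statistic for H0: pi = pi0, with the standard error built from pi0."""
    p = successes / n
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)


z = one_proportion_z(successes=68, n=150, pi0=0.40)
print(f"z = {z:.3f}")
print("reject at 5%: ", abs(z) > 1.960)
print("reject at 10%:", abs(z) > 1.6449)
```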
Test for the Difference Between Two Proportions
To compare two population proportions under H₀: π₁ = π₂, the common unknown proportion π is estimated by pooling both samples:
p̂ = (n₁p₁ + n₂p₂) / (n₁ + n₂)
The test statistic is:
z = (p₁ − p₂) / √(p̂(1 − p̂)(1/n₁ + 1/n₂))
In the advertising awareness example, p₂ = 68/150 = 0.4533 before the campaign and p₁ = 65/120 = 0.5417 after. The pooled proportion is (65 + 68)/(120 + 150) = 133/270 = 0.4926, and the test statistic is z = (0.5417 − 0.4533) / √(0.4926 × 0.5074 × (1/120 + 1/150)) = 0.0884/0.0612 = 1.44. For the upper-tailed test H₁: π₁ > π₂, the 5% critical value is 1.645 and 1.44 < 1.645, so H₀ is not rejected. At 10%, the critical value is 1.282 and 1.44 > 1.282, so H₀ is rejected. The result is weakly significant: there is some, but not conclusive, evidence that the campaign increased awareness.
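A minimal sketch of the pooled two-proportion calculation, working from the raw counts so that no intermediate rounding is involved. `two_proportion_z` is a name assumed for this example.

```python
from math import sqrt


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z statistic for H0: pi1 = pi2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # total successes over total sample size
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_error


# Sample 1: after the campaign (65 of 120). Sample 2: before (68 of 150).
z = two_proportion_z(x1=65, n1=120, x2=68, n2=150)
print(f"z = {z:.2f}")                     # z = 1.44
print("reject at 5%: ", z > 1.645)        # upper-tailed test
print("reject at 10%:", z > 1.282)
```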
Test for the Difference Between Two Means: Variances Known
For two independent populations with known variances, the test statistic under H₀: μ₁ = μ₂ is:
z = (x̄₁ − x̄₂) / √(σ₁²/n₁ + σ₂²/n₂)
With n₁ = 40, x̄₁ = 52, σ₁ = 6 and n₂ = 50, x̄₂ = 48, σ₂ = 4, and testing against the two-tailed alternative H₁: μ₁ ≠ μ₂, the test statistic is z = (52 − 48) / √(36/40 + 16/50) = 4/1.105 = 3.62. This exceeds the critical values at both 5% (±1.96) and 1% (±2.576), making the result highly significant: there is strong evidence of a difference between the means. When variances are unknown but both sample sizes exceed 30, replace σ₁² and σ₂² with s₁² and s₂² and continue using standard normal critical values, justified by the central limit theorem.
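The two-sample z calculation above can be sketched as follows; `two_sample_z` is a hypothetical helper name. Under the large-sample approximation, the same function could be called with s₁ and s₂ in place of σ₁ and σ₂.

```python
from math import sqrt


def two_sample_z(xbar1, sd1, n1, xbar2, sd2, n2):
    """z statistic for H0: mu1 = mu2 with known standard deviations."""
    std_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (xbar1 - xbar2) / std_error


z = two_sample_z(xbar1=52, sd1=6, n1=40, xbar2=48, sd2=4, n2=50)
print(f"z = {z:.2f}")  # z = 3.62
print("reject at 1% (two-tailed):", abs(z) > 2.5758)
```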
Test for the Difference Between Two Means: Equal Variances (Pooled t)
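When population variances are unknown but assumed equal, pool the two sample variances:
s_p² = ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2)
The test statistic on n₁ + n₂ − 2 degrees of freedom is:
t = (x̄₁ − x̄₂) / √(s_p²(1/n₁ + 1/n₂))
For Company A with n₁ = 12, x̄₁ = 8.5, s₁ = 3.6 and Company B with n₂ = 10, x̄₂ = 4.8, s₂ = 2.1, the pooled variance on 20 degrees of freedom is:
s_p² = (11 × 3.6² + 9 × 2.1²) / 20 = (142.56 + 39.69) / 20 = 9.1125
The test statistic is:
t = (8.5 − 4.8) / √(9.1125 × (1/12 + 1/10)) = 3.7/1.2925 = 2.86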
At the 5% level t_{0.025, 20} = 2.086, and at the 1% level t_{0.005, 20} = 2.845. The test statistic exceeds both, giving a highly significant result: Company B reacts faster on average.
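The pooled calculation for the two companies can be checked with a short sketch. `pooled_t` is a name assumed for this example, and the critical values are the ones quoted in the text rather than computed.

```python
from math import sqrt


def pooled_t(xbar1, s1, n1, xbar2, s2, n2):
    """Return (pooled variance, t statistic, degrees of freedom)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    t = (xbar1 - xbar2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return pooled_var, t, df


pooled_var, t, df = pooled_t(xbar1=8.5, s1=3.6, n1=12, xbar2=4.8, s2=2.1, n2=10)
print(f"pooled variance = {pooled_var:.4f}, t = {t:.2f}, df = {df}")
print("reject at 5%:", abs(t) > 2.086)  # t_{0.025, 20} from the text
print("reject at 1%:", abs(t) > 2.845)  # t_{0.005, 20} from the text
```

The statistic clears the 1% critical value only narrowly, so rounding intermediate results too early could change the verdict.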
Test for the Difference Between Two Means: Paired Samples
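When the same individuals are measured twice, compute differences d_i = x_i − y_i and apply a one-sample t-test on those differences. Under H₀: μ_d = 0, with x̄_d and s_d the mean and standard deviation of the n differences, the test statistic on n − 1 degrees of freedom is:
t = x̄_d / (s_d/√n)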
In the diet study with eight participants, the before-minus-after differences were 5, 10, −2, 7, 6, 9, 12, 1, giving x̄_d = 6 and s_d = 4.66. The test statistic is t = 6/(4.66/√8) = 6/1.648 = 3.64 on 7 degrees of freedom. Testing H₁: μ_d > 0, the 5% critical value is t_{0.05, 7} = 1.895 and the 1% critical value is t_{0.01, 7} = 2.998. The test statistic exceeds both, giving a highly significant result: there is strong evidence that the diet reduces weight on average.
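The paired calculation can be reproduced from the eight differences using only the standard library. `paired_t` is a hypothetical helper name; `statistics.stdev` computes the sample standard deviation with an n − 1 divisor, which is what s_d requires.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(differences):
    """Return (mean difference, s_d, t statistic, degrees of freedom)."""
    n = len(differences)
    d_bar = mean(differences)
    s_d = stdev(differences)  # sample standard deviation (n - 1 divisor)
    return d_bar, s_d, d_bar / (s_d / sqrt(n)), n - 1


# Before-minus-after weight differences from the diet study.
diffs = [5, 10, -2, 7, 6, 9, 12, 1]
d_bar, s_d, t, df = paired_t(diffs)
print(f"mean = {d_bar}, s_d = {s_d:.2f}, t = {t:.2f}, df = {df}")
print("reject at 1%:", t > 2.998)  # t_{0.01, 7} from the text
```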
Quick Reference
| Situation | σ known? | Test statistic | Distribution |
|---|---|---|---|
| Single mean | Yes | (x̄ − μ₀) / (σ/√n) | N(0,1) |
| Single mean | No | (x̄ − μ₀) / (s/√n) | t_{n−1} |
| Single proportion | — | (p − π₀) / √(π₀(1−π₀)/n) | N(0,1) |
| Two means, unpaired | Known | (x̄₁−x̄₂) / √(σ₁²/n₁+σ₂²/n₂) | N(0,1) |
| Two means, unpaired | Unknown, unequal, large n | (x̄₁−x̄₂) / √(s₁²/n₁+s₂²/n₂) | N(0,1) |
| Two means, unpaired | Unknown, equal | (x̄₁−x̄₂) / √(s_p²(1/n₁+1/n₂)) | t_{n₁+n₂−2} |
| Two means, paired | — | x̄_d / (s_d/√n) | t_{n−1} |
| Two proportions | — | (p₁−p₂) / √(p̂(1−p̂)(1/n₁+1/n₂)) | N(0,1) |
A workbook of 10 exercises for you to practise with: https://datalad.co.uk/hypothesis-testing-workbook-10-exercises-with-full-solutions/
See you soon.