Hypothesis Testing

A hypothesis test evaluates competing statements about a population based on sample data, using a null hypothesis and an alternative hypothesis, while considering error types, significance levels, and test statistics.

The Logic of a Hypothesis Test

A hypothesis test is a formal procedure for choosing between two competing statements about a population parameter on the basis of sample evidence. Rather than accepting a claim without scrutiny, the test asks: if the default position were true, how likely is the data we actually observed?

The null hypothesis H₀ represents the default position — no effect, no difference, no change — and is always stated with an equality. The alternative hypothesis H₁ represents what the researcher is trying to find evidence for. It specifies either a direction (the parameter is greater than, or less than, the null value) or simply that the parameter differs from it.

When H₁ states only that the parameter differs without specifying a direction, the test is two-tailed. When a specific direction is anticipated, the test is upper-tailed or lower-tailed accordingly. The form of H₁ must be chosen before looking at the data; it reflects the research question, not the sample outcome.

To illustrate, suppose a researcher believes comprehensive financial planning raises average bank returns above 10.2%. The parameter of interest is μ, the mean return. The null hypothesis is H₀: μ = 10.2, and because the researcher expects an increase, the alternative is H₁: μ > 10.2, making this an upper-tailed test.

Test Statistics and Critical Regions

Given a null hypothesis, a test statistic is computed from the sample. Its general form is:

$$\text{test statistic} = \frac{\text{point estimate} - \text{hypothesised value}}{\text{(estimated) standard error}}$$

Under the assumption that H₀ is true, the test statistic follows a known distribution, typically a standard normal or a t distribution. The critical region is the set of values extreme enough to cast doubt on H₀, defined by critical values that mark the boundary between retaining and rejecting it.

The decision rule is direct: reject H₀ if the test statistic falls in the critical region; do not reject otherwise. Failing to reject H₀ does not prove it is true — it means the data are consistent with it.
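The decision rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the test statistic is standard normal under H₀; the function names and the numbers at the bottom are made up for the example and use only the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist


def standardised_statistic(point_estimate, hypothesised_value, standard_error):
    """(point estimate - hypothesised value) / standard error."""
    return (point_estimate - hypothesised_value) / standard_error


def in_critical_region(stat, alpha, tail):
    """Return True if `stat` lies in the critical region of a N(0,1) test.

    `tail` is "two", "upper" or "lower" and must match the form of H1.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        return abs(stat) > std_normal.inv_cdf(1 - alpha / 2)
    if tail == "upper":
        return stat > std_normal.inv_cdf(1 - alpha)
    if tail == "lower":
        return stat < std_normal.inv_cdf(alpha)
    raise ValueError("tail must be 'two', 'upper' or 'lower'")


# Hypothetical numbers, for illustration only.
stat = standardised_statistic(point_estimate=52.0, hypothesised_value=50.0,
                              standard_error=1.0)
reject = in_critical_region(stat, alpha=0.05, tail="upper")
print(stat, "reject H0" if reject else "do not reject H0")
```

Note that the function only reports whether H₀ is rejected; a `False` result means "not rejected", never "proved true".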

Types of Error

Because a hypothesis test makes a decision under uncertainty, two types of error are possible. A Type I error occurs when H₀ is rejected even though it is true — a false positive — with probability α. A Type II error occurs when H₀ is not rejected even though it is false — a false negative — with probability β. The power of a test is 1 − β, the probability of correctly rejecting a false H₀.

In a drug trial where H₀ states the drug has no effect, a Type I error means concluding the drug works when it does not, wasting resources and raising false hope. A Type II error means concluding the drug does not work when it does, causing a genuine treatment to be overlooked. Type I errors are generally considered the more serious of the two, which is why the significance level α is fixed in advance to control them.
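To see how α, β and power trade off, here is a small sketch for one specific case: an upper-tailed z test with known σ. The formula power = 1 − Φ(z_α − (μ₁ − μ₀)/(σ/√n)) follows from the rejection rule Z > z_α when the true mean is μ₁. The values of μ₀, μ₁, σ and n below are hypothetical and chosen only to make the pattern visible.

```python
from math import sqrt
from statistics import NormalDist


def power_upper_tailed_z(mu0, mu1, sigma, n, alpha):
    """Power of an upper-tailed z test of H0: mu = mu0 when the true mean is mu1."""
    std_normal = NormalDist()
    se = sigma / sqrt(n)
    z_alpha = std_normal.inv_cdf(1 - alpha)  # critical value
    shift = (mu1 - mu0) / se                 # true effect in standard-error units
    return 1 - std_normal.cdf(z_alpha - shift)


# Hypothetical scenario: true mean 103 when H0 claims 100.
for alpha in (0.10, 0.05, 0.01):
    power = power_upper_tailed_z(mu0=100, mu1=103, sigma=10, n=50, alpha=alpha)
    beta = 1 - power  # Type II error probability
    print(f"alpha={alpha:.2f}  beta={beta:.3f}  power={power:.3f}")
```

Lowering α makes Type I errors rarer but raises β for the same sample size, which is why α is fixed first and power is then improved through a larger n.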

Significance Levels and the Decision Framework

The significance level α is the threshold for the Type I error probability. Common values are 10%, 5%, and 1%. A standard approach is to test at two levels sequentially: begin at 5%, then move to 1% if H₀ is rejected, or to 10% if it is not. This produces four possible conclusions. Rejection at 1% is highly significant; rejection at 5% but not 1% is moderately significant; rejection at 10% but not 5% is weakly significant; failure to reject at 10% is not significant.
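The sequential procedure can be written as a short function. This is only a sketch: `sequential_conclusion` and `label_upper_tailed_z` are names invented here, and the helper assumes an upper-tailed test with a standard normal statistic, using the usual critical values 1.2816, 1.6449 and 2.3263.

```python
# Upper-tailed critical values of the standard normal distribution.
UPPER_Z = {0.10: 1.2816, 0.05: 1.6449, 0.01: 2.3263}


def sequential_conclusion(rejects_at):
    """Apply the 5% -> 1% or 5% -> 10% procedure.

    `rejects_at(alpha)` must return True when H0 is rejected at level alpha.
    """
    if rejects_at(0.05):
        return "highly significant" if rejects_at(0.01) else "moderately significant"
    return "weakly significant" if rejects_at(0.10) else "not significant"


def label_upper_tailed_z(z):
    """Label an upper-tailed z statistic (assumed N(0,1) under H0)."""
    return sequential_conclusion(lambda alpha: z > UPPER_Z[alpha])


# Illustrative statistics, one per possible conclusion.
for z in (3.0, 2.0, 1.44, 1.0):
    print(z, "->", label_upper_tailed_z(z))
```

Passing the rejection rule in as a function keeps the labelling logic independent of the test being used, so the same function works for t tests or two-tailed tests.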

The critical values for a standard normal test statistic at each level are:

| Level α | Two-tailed (±z_{α/2}) | Upper-tailed (z_α) | Lower-tailed (−z_α) |
|---|---|---|---|
| 10% | ±1.6449 | 1.2816 | −1.2816 |
| 5% | ±1.9600 | 1.6449 | −1.6449 |
| 1% | ±2.5758 | 2.3263 | −2.3263 |
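These values need not be memorised. As a quick check, they can be recomputed from the inverse CDF of the standard normal distribution, here with `statistics.NormalDist` from the Python standard library (the variable names are just for this sketch).

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

critical_values = {}
for alpha in (0.10, 0.05, 0.01):
    two_tailed = std_normal.inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    upper = std_normal.inv_cdf(1 - alpha)           # z_alpha
    critical_values[alpha] = (two_tailed, upper, -upper)
    print(f"{alpha:>4.0%}  ±{two_tailed:.4f}  {upper:.4f}  {-upper:.4f}")
```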

p-Values

The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the observed value, under the assumption that H₀ is true. H₀ is rejected when the p-value falls below α.

For a two-tailed test the p-value doubles the tail probability beyond the observed statistic:

$$p\text{-value} = 2 \times P(X \geq |x|)$$

For a lower-tailed test it is the left-tail probability:

$$p\text{-value} = P(X \leq x)$$

For an upper-tailed test it is the right-tail probability:

$$p\text{-value} = P(X \geq x)$$

As a worked example, for a lower-tailed test with z = −1.82 and Z ∼ N(0, 1), the p-value is P(Z ≤ −1.82) = 1 − Φ(1.82) = 1 − 0.9656 = 0.0344. Since 0.01 < 0.0344 < 0.05, the result is moderately significant.
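The three p-value formulas translate directly into code when the statistic is standard normal. The sketch below assumes exactly that; `z_p_value` is a name chosen for this example, and Φ comes from `statistics.NormalDist().cdf`.

```python
from statistics import NormalDist


def z_p_value(z, tail):
    """p-value for a standard normal test statistic.

    `tail` is "lower", "upper" or "two", matching the form of H1.
    """
    phi = NormalDist().cdf  # standard normal CDF
    if tail == "lower":
        return phi(z)
    if tail == "upper":
        return 1 - phi(z)
    if tail == "two":
        return 2 * (1 - phi(abs(z)))
    raise ValueError("tail must be 'lower', 'upper' or 'two'")


# The lower-tailed worked example with z = -1.82.
p = z_p_value(-1.82, "lower")
print(f"p-value = {p:.4f}")
```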

Test for a Single Mean: σ Known

When the population standard deviation σ is known, the test statistic under H₀: μ = μ₀ is:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1) \quad \text{under } H_0$$

Suppose n = 100, x̄ = 1570, and σ = 120, testing H₀: μ = 1600 against H₁: μ ≠ 1600. The test statistic is z = (1570 − 1600) / (120/√100) = −30/12 = −2.50. At the 5% level the critical values are ±1.960 and |−2.50| > 1.960, so H₀ is rejected. At the 1% level the critical values are ±2.5758 and |−2.50| < 2.5758, so H₀ is not rejected at this stricter threshold. The result is moderately significant: there is evidence that μ ≠ 1600. The p-value confirms this: 2 × P(Z > 2.50) = 2 × 0.0062 = 0.0124, which lies between 0.01 and 0.05.
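For readers who prefer to check the arithmetic in code, here is a minimal sketch of the same two-tailed z test. The function name `one_sample_z` is an assumption of this example, not a library function.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(x_bar, mu0, sigma, n):
    """Return (z, two-tailed p-value) for H0: mu = mu0 with sigma known."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = one_sample_z(x_bar=1570, mu0=1600, sigma=120, n=100)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")

# Compare |z| with the two-tailed critical values at 5% and 1%.
print("reject at 5%:", abs(z) > 1.9600)
print("reject at 1%:", abs(z) > 2.5758)
```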

Test for a Single Mean: σ Unknown

When σ is unknown and estimated by the sample standard deviation s, the test statistic follows a t distribution with n − 1 degrees of freedom:

$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1} \quad \text{under } H_0$$

Returning to the bank returns example with n = 26, x̄ = 10.5, and s = 0.714, testing H₀: μ = 10.2 against H₁: μ > 10.2, the test statistic is t = (10.5 − 10.2) / (0.714/√26) = 0.3/0.1400 = 2.14 on 25 degrees of freedom. At the 5% level t_{0.05, 25} = 1.708 and 2.14 > 1.708, so H₀ is rejected. At the 1% level t_{0.01, 25} = 2.485 and 2.14 < 2.485, so H₀ is not rejected. The result is moderately significant: there is evidence that comprehensive planning raises returns above 10.2%.
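The t statistic itself needs nothing beyond basic arithmetic. The sketch below recomputes it and compares it with the upper-tail critical values quoted above, which are hard-coded because the Python standard library has no t distribution. If SciPy happens to be installed, `scipy.stats.t.ppf(0.95, 25)` should return the 5% value, but that dependency is not assumed here.

```python
from math import sqrt

# Bank returns example: H0: mu = 10.2 against H1: mu > 10.2.
n, x_bar, s, mu0 = 26, 10.5, 0.714, 10.2

standard_error = s / sqrt(n)
t = (x_bar - mu0) / standard_error
df = n - 1

# Upper-tail critical values for 25 degrees of freedom, taken from the text.
T_CRIT_UPPER = {0.05: 1.708, 0.01: 2.485}

print(f"t = {t:.2f} on {df} degrees of freedom")
for alpha, crit in T_CRIT_UPPER.items():
    print(f"alpha = {alpha}: reject H0 -> {t > crit}")
```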

Test for a Single Proportion

When testing a claim about a population proportion π, with null value π₀, the standard error is computed using π₀, not the sample proportion p. The test statistic is:

$$Z = \frac{P - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \sim N(0,1) \quad \text{approximately, for large } n$$

Suppose 68 out of 150 customers favour mobile banking and the claim is that π = 0.40, so H₀: π = 0.40. The sample proportion is p = 68/150 = 0.4533 and the test statistic is z = (0.4533 − 0.40) / √(0.40 × 0.60 / 150) = 0.0533/0.0400 = 1.333. For the two-tailed test H₁: π ≠ 0.40, the 5% critical values are ±1.960 and 1.333 < 1.960, so H₀ is not rejected. Moving to the 10% level, the critical values are ±1.6449 and 1.333 < 1.6449, so H₀ is again not rejected. The result is not significant: the data are consistent with π = 0.40.
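A short sketch of the same calculation, carrying the unrounded sample proportion through so that no precision is lost along the way. The function name `one_proportion_z` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(successes, n, pi0):
    """Return (p, z) for H0: pi = pi0, with the standard error built from pi0."""
    p = successes / n
    standard_error = sqrt(pi0 * (1 - pi0) / n)
    return p, (p - pi0) / standard_error


p, z = one_proportion_z(successes=68, n=150, pi0=0.40)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed

print(f"p = {p:.4f}, z = {z:.3f}, p-value = {p_value:.3f}")
print("reject at 5%: ", abs(z) > 1.9600)
print("reject at 10%:", abs(z) > 1.6449)
```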

Test for the Difference Between Two Proportions

To compare two population proportions under H₀: π₁ = π₂, the common unknown proportion π is estimated by pooling both samples:

$$\hat{p} = \frac{r_1 + r_2}{n_1 + n_2}$$

The test statistic is:

$$Z = \frac{P_1 - P_2}{\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}} \sim N(0,1) \quad \text{approximately}$$

In the advertising awareness example, p₂ = 68/150 = 0.4533 before the campaign and p₁ = 65/120 = 0.5417 after. The pooled proportion is (68 + 65)/(150 + 120) = 133/270 = 0.4926, and the test statistic is z = (0.5417 − 0.4533) / √(0.4926 × 0.5074 × (1/150 + 1/120)) = 0.0884/0.0612 = 1.44. For the upper-tailed test H₁: π₁ > π₂, the 5% critical value is 1.645 and 1.44 < 1.645, so H₀ is not rejected. At 10%, the critical value is 1.282 and 1.44 > 1.282, so H₀ is rejected. The result is weakly significant: there is some, but not conclusive, evidence that the campaign increased awareness.
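The pooled calculation is easy to get wrong by hand, so here is a hedged sketch that reproduces it. `two_proportion_z` is a name invented for this example; group 1 is "after the campaign" and group 2 is "before", matching the labels in the text.

```python
from math import sqrt


def two_proportion_z(r1, n1, r2, n2):
    """Return (pooled proportion, z) for H0: pi1 = pi2."""
    p1, p2 = r1 / n1, r2 / n2
    pooled = (r1 + r2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return pooled, (p1 - p2) / standard_error


pooled, z = two_proportion_z(r1=65, n1=120, r2=68, n2=150)
print(f"pooled = {pooled:.4f}, z = {z:.2f}")

# Upper-tailed test: compare with z_0.05 and z_0.10.
print("reject at 5%: ", z > 1.6449)
print("reject at 10%:", z > 1.2816)
```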

Test for the Difference Between Two Means: Variances Known

For two independent populations with known variances, the test statistic under H₀: μ₁ = μ₂ is:

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \sim N(0,1)$$

With n₁ = 40, x̄₁ = 52, σ₁ = 6 and n₂ = 50, x̄₂ = 48, σ₂ = 4, the test statistic is z = (52 − 48) / √(36/40 + 16/50) = 4/1.105 = 3.62. For the two-tailed test H₁: μ₁ ≠ μ₂, this exceeds the critical values at both 5% (±1.96) and 1% (±2.576), making the result highly significant: there is strong evidence of a difference between the means. When variances are unknown but both sample sizes exceed 30, replace σ₁² and σ₂² with s₁² and s₂² and continue using standard normal critical values, justified by the central limit theorem.
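As a sketch, the same statistic in code. The function name `two_mean_z` is an assumption of this example; passing sample standard deviations instead of σ₁ and σ₂ gives the large-sample version described above.

```python
from math import sqrt


def two_mean_z(x_bar1, sd1, n1, x_bar2, sd2, n2):
    """z statistic for H0: mu1 = mu2 with independent samples.

    sd1 and sd2 are the known population standard deviations, or the
    sample standard deviations when both samples are large.
    """
    standard_error = sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (x_bar1 - x_bar2) / standard_error


z = two_mean_z(x_bar1=52, sd1=6, n1=40, x_bar2=48, sd2=4, n2=50)
print(f"z = {z:.2f}")
print("reject at 5%:", abs(z) > 1.9600)
print("reject at 1%:", abs(z) > 2.5758)
```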

Test for the Difference Between Two Means: Equal Variances (Pooled t)

When population variances are unknown but assumed equal, pool the two sample variances:

$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$$

The test statistic on n₁ + n₂ − 2 degrees of freedom is:

$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2(1/n_1 + 1/n_2)}} \sim t_{n_1+n_2-2} \quad \text{under } H_0$$

For Company A with n₁ = 12, x̄₁ = 8.5, s₁ = 3.6 and Company B with n₂ = 10, x̄₂ = 4.8, s₂ = 2.1, the pooled variance on 20 degrees of freedom is:

$$s_p^2 = \frac{11(3.6)^2 + 9(2.1)^2}{20} = \frac{142.56 + 39.69}{20} = 9.1125$$

The test statistic is:

$$t = \frac{8.5 - 4.8}{\sqrt{9.1125(1/12 + 1/10)}} = \frac{3.7}{1.2925} = 2.86$$

At the 5% level the two-tailed critical value is t_{0.025, 20} = 2.086, and at the 1% level it is t_{0.005, 20} = 2.845. The test statistic exceeds both, giving a highly significant result: there is strong evidence that the means differ, with Company B reacting faster on average.
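The pooled-variance calculation can be sketched as follows. `pooled_t` is a name chosen for this example, and the critical values are the ones quoted in the text, hard-coded because the standard library does not provide t quantiles.

```python
from math import sqrt


def pooled_t(x_bar1, s1, n1, x_bar2, s2, n2):
    """Return (pooled variance, t, degrees of freedom), assuming equal variances."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    t = (x_bar1 - x_bar2) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return sp2, t, df


# Company A (group 1) and Company B (group 2).
sp2, t, df = pooled_t(x_bar1=8.5, s1=3.6, n1=12, x_bar2=4.8, s2=2.1, n2=10)
print(f"pooled variance = {sp2:.4f}, t = {t:.2f} on {df} df")

# Two-tailed critical values for 20 degrees of freedom, from the text.
print("reject at 5%:", abs(t) > 2.086)
print("reject at 1%:", abs(t) > 2.845)
```

The margin at the 1% level is narrow, so rounding the intermediate values too early could change the conclusion; keeping full precision until the final comparison avoids that.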

Test for the Difference Between Two Means: Paired Samples

When the same individuals are measured twice, compute differences d_i = x_i − y_i and apply a one-sample t-test on those differences. Under H₀: μ_d = 0:

$$T = \frac{\bar{X}_d}{S_d/\sqrt{n}} \sim t_{n-1} \quad \text{under } H_0$$

In the diet study with eight participants, the before-minus-after differences were 5, 10, −2, 7, 6, 9, 12, 1, giving x̄_d = 6 and s_d = 4.66. The test statistic is t = 6/(4.66/√8) = 6/1.648 = 3.64 on 7 degrees of freedom. Testing H₁: μ_d > 0, the 5% critical value is t_{0.05, 7} = 1.895 and the 1% critical value is t_{0.01, 7} = 2.998. The test statistic exceeds both, giving a highly significant result: there is strong evidence that the diet reduces weight on average.
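Because a paired test is just a one-sample t test on the differences, the whole calculation fits in a few lines. This sketch uses `statistics.mean` and `statistics.stdev` (the sample standard deviation, with divisor n − 1) and hard-codes the critical values quoted above.

```python
from math import sqrt
from statistics import mean, stdev

# Before-minus-after weight differences for the eight participants.
differences = [5, 10, -2, 7, 6, 9, 12, 1]

n = len(differences)
d_bar = mean(differences)
s_d = stdev(differences)  # sample standard deviation (divisor n - 1)
t = d_bar / (s_d / sqrt(n))
df = n - 1

# Upper-tail critical values for 7 degrees of freedom, from the text.
T_CRIT_UPPER = {0.05: 1.895, 0.01: 2.998}

print(f"mean difference = {d_bar}, s_d = {s_d:.2f}, t = {t:.2f} on {df} df")
for alpha, crit in T_CRIT_UPPER.items():
    print(f"alpha = {alpha}: reject H0 -> {t > crit}")
```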

Quick Reference

| Situation | σ known? | Test statistic | Distribution |
|---|---|---|---|
| Single mean | Yes | (x̄ − μ₀) / (σ/√n) | N(0,1) |
| Single mean | No | (x̄ − μ₀) / (s/√n) | t_{n−1} |
| Single proportion | | (p − π₀) / √(π₀(1−π₀)/n) | N(0,1) |
| Two means, unpaired | Known | (x̄₁−x̄₂) / √(σ₁²/n₁+σ₂²/n₂) | N(0,1) |
| Two means, unpaired | Unknown, unequal, large n | (x̄₁−x̄₂) / √(s₁²/n₁+s₂²/n₂) | N(0,1) |
| Two means, unpaired | Unknown, equal | (x̄₁−x̄₂) / √(s_p²(1/n₁+1/n₂)) | t_{n₁+n₂−2} |
| Two means, paired | | x̄_d / (s_d/√n) | t_{n−1} |
| Two proportions | | (p₁−p₂) / √(p̂(1−p̂)(1/n₁+1/n₂)) | N(0,1) |

A workbook of 10 exercises for you to practice: https://datalad.co.uk/hypothesis-testing-workbook-10-exercises-with-full-solutions/

See you soon.
