Chi-Squared Test Workbook: 10 Exercises with Full Solutions

Ten chi-squared exercises with full worked solutions, from contingency tables and expected frequencies to degrees of freedom, tests of association, and goodness-of-fit. Try them, then check your method against the standard one.

This workbook accompanies the Contingency Tables and the Chi-Squared Test lesson and is built for practice with a calculator and a chi-squared table. The ten exercises follow the lesson’s order, starting with choosing the right test and completing a contingency table, then computing expected frequencies, the test statistic, and degrees of freedom, before finishing with goodness-of-fit tests, the validity condition, and a full test of association. Each exercise isolates one skill so you can see exactly where your method needs sharpening.

Attempt every exercise before reading the solutions, and write out each step rather than skipping to the answer. The solutions name the formula, show the arithmetic, and state the decision at the conventional significance levels, so the point is to compare your reasoning against the standard approach. The chi-squared critical values you need are supplied inside each exercise. Remember throughout that the chi-squared test is always upper-tailed, so you reject the null hypothesis only when the statistic exceeds the critical value.

Part One: The Exercises

Exercise 1 (Choosing the test). For each study, name the appropriate method and explain why. (a) A researcher records whether each patient smokes (yes or no) and whether they developed lung disease (yes or no). (b) A researcher records each person’s height and weight. (c) A researcher records each student’s exam score and their study method (online or in person). For part (a), also state the null and alternative hypotheses.

Exercise 2 (Completing a contingency table). A study of 120 students cross-tabulates study method against exam result. The cell counts are: online and pass 45, online and fail 15, in person and pass 30, in person and fail 30. Construct the contingency table and complete all row totals, column totals, and the grand total.

Exercise 3 (Expected frequencies). Using the table from Exercise 2, compute the expected frequency for each of the four cells under the assumption of independence.

Exercise 4 (Degrees of freedom). State the degrees of freedom for a chi-squared test of association on each of the following table sizes: 2 by 2, 2 by 4, 3 by 5, and 4 by 6.

Exercise 5 (Test statistic and decision). Using the observed counts from Exercise 2 and the expected frequencies from Exercise 3, compute the chi-squared test statistic. Then decide at the 5% and 1% levels, using the critical values 3.841 at 5% and 6.635 at 1% on 1 degree of freedom.

Exercise 6 (Interpreting the association). Given the result of Exercise 5, identify which cells are over-represented and which are under-represented, and state in plain language what the association means.

Exercise 7 (Goodness-of-fit for a fair die). A six-sided die is rolled 60 times, producing the counts: face 1 appears 8 times, face 2 appears 12 times, face 3 appears 9 times, face 4 appears 11 times, face 5 appears 10 times, and face 6 appears 10 times. Test whether the die is fair, using the critical value 11.070 at the 5% level on 5 degrees of freedom.

Exercise 8 (The validity condition). A spinner divided into six equal sectors is spun 24 times, and you wish to test whether all six sectors are equally likely. Explain why the chi-squared goodness-of-fit test as initially set up violates the validity condition, and describe how to fix it. State the new expected frequencies and degrees of freedom after your fix.

Exercise 9 (Full test of association). A trial compares a drug against a placebo across 200 patients, recording the outcome as improved, no change, or worse. The drug group of 100 patients had 50 improved, 30 with no change, and 20 worse. The placebo group of 100 patients had 30 improved, 40 with no change, and 30 worse. Carry out the test of association, using the critical values 5.991 at 5% and 9.210 at 1% on 2 degrees of freedom, and interpret the result.

Exercise 10 (Degrees of freedom and decision). A 3 by 4 contingency table produces a chi-squared statistic of 14.2. Find the degrees of freedom and decide at the 5% and 1% levels, using the critical values 12.592 at 5% and 16.812 at 1%.

Part Two: Worked Solutions

Solution 1. In part (a), both variables are categorical, so the method is a chi-squared test of association on a 2 by 2 table. In part (b), both variables are measurable (quantitative), so the method is correlation or regression. In part (c), one variable is measurable and one is categorical, so you demote the exam score into categories, for example pass and fail, and then apply a chi-squared test of association to the resulting table. For part (a), the null hypothesis is that there is no association between smoking and lung disease, equivalently that the two are independent, and the alternative is that an association exists.

Solution 2. Entering the counts and totalling each row and column gives the completed table.

| Method | Pass | Fail | Row total |
|---|---|---|---|
| Online | 45 | 15 | 60 |
| In person | 30 | 30 | 60 |
| Column total | 75 | 45 | 120 |

The row totals are 60 and 60, the column totals are 75 and 45, and both sets sum to the grand total of 120.
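If you would like to check the totals by machine rather than by hand, here is a minimal Python sketch. The names `observed` and `margins` are illustrative choices made for this example, not part of the lesson.

```python
# Observed counts from Exercise 2 (rows: online, in person; columns: pass, fail).
observed = [
    [45, 15],
    [30, 30],
]


def margins(table):
    """Return (row totals, column totals, grand total) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]  # zip(*table) transposes the rows
    return row_totals, col_totals, sum(row_totals)


row_totals, col_totals, grand_total = margins(observed)
print(row_totals, col_totals, grand_total)  # [60, 60] [75, 45] 120
```

A quick sanity check is that the row totals and the column totals both add up to the same grand total.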

Solution 3. Each expected frequency is the row total times the column total divided by the grand total.

$$E_{ij} = \frac{R_i \times C_j}{n}$$

Applying this to all four cells gives the following.

$$E(\text{Online, Pass}) = \frac{60 \times 75}{120} = 37.5$$
$$E(\text{Online, Fail}) = \frac{60 \times 45}{120} = 22.5$$
$$E(\text{In person, Pass}) = \frac{60 \times 75}{120} = 37.5$$
$$E(\text{In person, Fail}) = \frac{60 \times 45}{120} = 22.5$$
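The same row-total-times-column-total rule can be sketched in a few lines of Python. This is one possible way to write it, and the function name `expected_frequencies` is an assumption made for illustration.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Observed counts from Exercise 2 (rows: online, in person; columns: pass, fail).
observed = [[45, 15], [30, 30]]
expected = expected_frequencies(observed)
print(expected)  # [[37.5, 22.5], [37.5, 22.5]]
```

Because both row totals are 60, the two rows of expected counts are identical, which matches the hand calculation.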

Solution 4. The degrees of freedom for an r by c table are (r minus 1) times (c minus 1).

$$\nu = (r-1)(c-1)$$

For 2 by 2 this is 1 times 1, which is 1. For 2 by 4 this is 1 times 3, which is 3. For 3 by 5 this is 2 times 4, which is 8. For 4 by 6 this is 3 times 5, which is 15.
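As a quick check of the four answers, the rule can be written as a one-line function. The name `degrees_of_freedom` is an illustrative choice.

```python
def degrees_of_freedom(rows, cols):
    """Degrees of freedom for a test of association on a rows-by-cols table."""
    return (rows - 1) * (cols - 1)


for shape in [(2, 2), (2, 4), (3, 5), (4, 6)]:
    print(shape, degrees_of_freedom(*shape))
# (2, 2) 1
# (2, 4) 3
# (3, 5) 8
# (4, 6) 15
```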

Solution 5. The test statistic sums the standardised squared differences over all cells.

$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

The four contributions are as follows.

$$\frac{(45 - 37.5)^2}{37.5} = \frac{56.25}{37.5} = 1.5$$
$$\frac{(15 - 22.5)^2}{22.5} = \frac{56.25}{22.5} = 2.5$$
$$\frac{(30 - 37.5)^2}{37.5} = \frac{56.25}{37.5} = 1.5$$
$$\frac{(30 - 22.5)^2}{22.5} = \frac{56.25}{22.5} = 2.5$$

Summing them gives the statistic.

$$\chi^2 = 1.5 + 2.5 + 1.5 + 2.5 = 8.0$$

With 1 degree of freedom, the statistic 8.0 exceeds the 5% critical value of 3.841, so the null hypothesis is rejected at 5%. It also exceeds the 1% critical value of 6.635, so it is rejected at 1%. The result is highly significant, giving strong evidence of an association between study method and exam result.
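To confirm the arithmetic, here is a small Python sketch that sums the cell contributions and compares the result with the critical values supplied in the exercise. The function name `chi_squared_statistic` is an assumption made for this example.

```python
def chi_squared_statistic(observed, expected):
    """Sum (O - E)^2 / E over every cell of a two-way table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


observed = [[45, 15], [30, 30]]              # counts from Exercise 2
expected = [[37.5, 22.5], [37.5, 22.5]]      # expected frequencies from Exercise 3
statistic = chi_squared_statistic(observed, expected)

# Critical values on 1 degree of freedom, as given in Exercise 5.
critical = {"5%": 3.841, "1%": 6.635}
for level, value in critical.items():
    verdict = "reject H0" if statistic > value else "do not reject H0"
    print(f"{level}: statistic {statistic:.1f} vs critical {value} -> {verdict}")
```

The comparison uses "greater than" only, reflecting the upper-tailed nature of the test.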

Solution 6. A cell is over-represented when the observed count exceeds the expected, and under-represented when it falls short. The online and pass cell has an observed count of 45 against an expected 37.5, so it is over-represented, and correspondingly the online and fail cell at 15 against 22.5 is under-represented. For the in-person group, the fail cell at 30 against 22.5 is over-represented, and the pass cell at 30 against 37.5 is under-represented. In plain language, studying online is associated with passing and studying in person is associated with failing, at least in this sample.
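The over- and under-representation labels come straight from the sign of observed minus expected, which the following sketch makes explicit. The helper name `representation` and the label lists are illustrative assumptions.

```python
row_labels = ["online", "in person"]
col_labels = ["pass", "fail"]
observed = [[45, 15], [30, 30]]
expected = [[37.5, 22.5], [37.5, 22.5]]


def representation(o, e):
    """Classify a cell by comparing its observed count with its expected count."""
    if o > e:
        return "over-represented"
    if o < e:
        return "under-represented"
    return "as expected"


summary = {}
for i, row in enumerate(row_labels):
    for j, col in enumerate(col_labels):
        summary[(row, col)] = representation(observed[i][j], expected[i][j])
        print(f"{row}, {col}: {observed[i][j]} vs {expected[i][j]} -> {summary[(row, col)]}")
```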

Solution 7. Under the null hypothesis of a fair die, every face is equally likely, so each expected frequency is the number of rolls divided by the number of faces.

$$E_i = \frac{n}{k} = \frac{60}{6} = 10$$

All expected frequencies are 10, comfortably above 5, so the test is valid. The statistic sums the contributions across the six faces.

$$\chi^2 = \frac{(8-10)^2 + (12-10)^2 + (9-10)^2 + (11-10)^2 + (10-10)^2 + (10-10)^2}{10}$$

Evaluating the numerator gives 4 plus 4 plus 1 plus 1 plus 0 plus 0, which is 10.

$$\chi^2 = \frac{10}{10} = 1.0$$

With 5 degrees of freedom, the statistic 1.0 is far below the 5% critical value of 11.070, so the null hypothesis is not rejected. There is no evidence that the die is unfair.
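The goodness-of-fit calculation can be checked with a short Python sketch. It assumes a uniform null hypothesis, as in this exercise, and the variable names are illustrative.

```python
# Observed counts for faces 1 to 6 from Exercise 7.
counts = [8, 12, 9, 11, 10, 10]

n = sum(counts)          # total number of rolls
k = len(counts)          # number of categories (faces)
expected = n / k         # same expected frequency for every face under a fair die

statistic = sum((o - expected) ** 2 / expected for o in counts)
df = k - 1
critical_5 = 11.070      # 5% critical value on 5 degrees of freedom, from the exercise

print(expected, df, round(statistic, 3))
print("reject H0" if statistic > critical_5 else "do not reject H0")
```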

Solution 8. The validity condition is that every expected frequency must be at least 5 for the chi-squared approximation to hold. With six equally likely sectors and 24 spins, each expected frequency is 24 divided by 6, which is 4, and because 4 is below 5, the test as set up is invalid. The fix is to merge adjacent categories until the expected counts are large enough, adding together the observed counts of the sectors you merge. Combining the six sectors into three pairs gives three categories, each with an expected frequency of 24 divided by 3.

$$E_i = \frac{24}{3} = 8$$

Now every expected frequency is 8, which satisfies the condition, and with k equal to 3 categories the degrees of freedom become k minus 1, which is 2.
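The before-and-after check can be sketched in Python as follows. The helper names `uniform_expected` and `is_valid` are assumptions made for this example, and the threshold of 5 is the rule stated in this solution.

```python
def uniform_expected(n_trials, n_categories):
    """Expected frequency per category when all categories are equally likely."""
    return n_trials / n_categories


def is_valid(expected_counts, minimum=5):
    """True when every expected frequency reaches the minimum."""
    return all(e >= minimum for e in expected_counts)


before = [uniform_expected(24, 6)] * 6   # six sectors, expected 4.0 each
after = [uniform_expected(24, 3)] * 3    # three merged pairs, expected 8.0 each

print(before[0], is_valid(before))  # 4.0 False
print(after[0], is_valid(after))    # 8.0 True
print(len(after) - 1)               # degrees of freedom after merging: 2
```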

Solution 9. The null hypothesis is no association between treatment and outcome, against an alternative that an association exists. The column totals are 80 improved, 70 no change, and 50 worse. Each expected frequency is the row total of 100 times the column total divided by 200, so the expected counts are 40 for each improved cell, 35 for each no-change cell, and 25 for each worse cell, all comfortably above 5. The contributions for the drug group are as follows.

$$\frac{(50-40)^2}{40} = 2.5 \qquad \frac{(30-35)^2}{35} = 0.714 \qquad \frac{(20-25)^2}{25} = 1.0$$

The contributions for the placebo group mirror them.

$$\frac{(30-40)^2}{40} = 2.5 \qquad \frac{(40-35)^2}{35} = 0.714 \qquad \frac{(30-25)^2}{25} = 1.0$$

Summing all six gives the statistic.

$$\chi^2 = 2.5 + 0.714 + 1.0 + 2.5 + 0.714 + 1.0 = 8.43$$

With 2 degrees of freedom, the statistic 8.43 exceeds the 5% critical value of 5.991, so the null hypothesis is rejected at 5%. It does not exceed the 1% critical value of 9.210, so it is not rejected at 1%, making the result moderately significant. Reading the cells, the drug group is over-represented among the improved (50 against 40) and the placebo group is over-represented among the worse (30 against 25), so the drug is associated with better outcomes.

Solution 10. The degrees of freedom for a 3 by 4 table are (3 minus 1) times (4 minus 1).

$$\nu = (3-1)(4-1) = 6$$

The statistic 14.2 exceeds the 5% critical value of 12.592, so the null hypothesis is rejected at 5%. It does not exceed the 1% critical value of 16.812, so it is not rejected at 1%. The result is therefore moderately significant, with evidence of an association at the 5% level but not at the 1% level.
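The decision step on its own can be sketched as a small helper that reports, for each significance level, whether the null hypothesis is rejected. The name `decide` is hypothetical, and the critical values are those given in Exercise 10.

```python
def decide(statistic, critical_values):
    """Map each significance level to True when H0 is rejected (upper-tailed test)."""
    return {level: statistic > value for level, value in critical_values.items()}


df = (3 - 1) * (4 - 1)
decisions = decide(14.2, {"5%": 12.592, "1%": 16.812})
print(df, decisions)  # 6 {'5%': True, '1%': False}
```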

How to Get the Most From This Workbook

Notice the single procedure underneath all of these. Whether you are testing association in a two-way table or goodness-of-fit against a uniform distribution, you compute an expected frequency for every cell, sum the standardised squared differences to get the statistic, work out the degrees of freedom from the table shape or the number of categories, and compare against an upper-tail critical value. Only two things change between problem types: how you compute the expected frequencies and how you count the degrees of freedom. Drill that shared rhythm, keep the rule that every expected frequency should reach 5, and remember to read the largest cell contributions whenever you reject the null, so you can say not just that a relationship exists but exactly where it lives.

See you soon.
