Chi-square

© 2004, S. D. Cochran. All rights reserved.

Categorical data analysis is a very common situation in research.

	Expected Frequencies
Freshmen	25
Sophomores	25
Juniors	25
Seniors	25

We decided to divide the 100 students evenly among the four cells. In the absence of any information, our best bet is an equal probability for each cell. But we can also base expected frequencies on external information. For example, if we knew from the college registrar that 20% of students are freshmen, 20% sophomores, 25% juniors, and 35% seniors, we could create the following expected frequencies:

	Expected Frequencies
Freshmen	20
Sophomores	20
Juniors	25
Seniors	35

We still have not collected any data. We are only specifying what we would expect to see.

General rules for creating expected frequencies

If we know the distribution of categories in the population, we can use that to generate expected frequencies.

If we don't, which is the more likely scenario, then we can estimate the expected frequencies by giving each category the same probability of occurrence.
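Both rules can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the registrar percentages are the hypothetical ones used above.

```python
def expected_from_proportions(n, proportions):
    """Rule 1: expected count = sample size * known population proportion."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return {category: n * p for category, p in proportions.items()}


def expected_equal(n, categories):
    """Rule 2: with no outside information, give every category the same share."""
    return {category: n / len(categories) for category in categories}


years = ["Freshmen", "Sophomores", "Juniors", "Seniors"]
# Hypothetical registrar percentages from the example above.
registrar = {"Freshmen": 0.20, "Sophomores": 0.20, "Juniors": 0.25, "Seniors": 0.35}

equal = expected_equal(100, years)
known = expected_from_proportions(100, registrar)

for year in years:
    print(f"{year:<12}{equal[year]:>6.1f}{known[year]:>6.1f}")
```

Either way, the expected frequencies add up to the sample size, because the probabilities add up to 1.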

Under what conditions will we observe the expected frequencies (excluding for the moment the issue of chance error)? When students are just as likely to use the student store no matter what year in school they are. In that case, we would expect the distribution of categories among the student store users we sample to equal the distribution in the population of students.

Why wouldn't we see what we expect to see? For the same reasons as before--it could be due to bias or true difference; it could also be due to chance variation.

With both the z-test and the t-test, we found that if we took the difference between what we observed and what we expected, and then divided that difference by the expected chance variation, we generated values that could be compared to probability density distributions (the normal distribution and the t-distributions). This let us find the probability of obtaining a value as large or larger when the null hypothesis was true--that is, when the observed difference was solely due to chance.
With frequency data we can do the same thing, but the probability density curve we use is the chi-square distribution.
1. Like the t-distribution, the chi-square distribution varies depending upon the degrees of freedom.
2. As with both the z-distribution and the t-distributions, the total area under the curve is 100%. If we choose a certain chi-square value, the area to the right of it is the probability of obtaining this value or greater.
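To make the "area to the right" idea concrete, here is a minimal sketch that computes it with only the Python standard library. The function name is an assumption made for this illustration. It uses a standard series for the incomplete gamma function, which is one of several ways to get this area and is not necessarily how the table in the book was produced.

```python
import math


def chi_square_right_tail(x, df):
    """Area to the right of x under the chi-square curve with df degrees of freedom."""
    if x <= 0:
        return 1.0  # the whole curve lies to the right of zero
    a, t = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, t):
    # P(a, t) = t^a * e^(-t) / Gamma(a) * sum_n t^n / (a * (a+1) * ... * (a+n))
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= t / (a + n)
        total += term
    area_to_left = total * math.exp(-t + a * math.log(t) - math.lgamma(a))
    return 1.0 - area_to_left


# The same chi-square value has a different right-tail area for each df.
for df in (1, 3, 10):
    print(f"df = {df:>2}: area to the right of 7.82 = {chi_square_right_tail(7.82, df):.4f}")
```

The loop shows why the degrees of freedom matter: the same chi-square value corresponds to a different probability under each curve.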
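The formula for a chi-square test is:

$$\chi^2 = \sum \frac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}}$$

or more formally:

$$\chi^2 = \sum \frac{(f_O - f_E)^2}{f_E}$$

where f_O is the observed frequency in a cell, f_E is the expected frequency in that cell, and the sum is taken over all cells.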
How do we use the chi-square test?
1. Example: we want to see if students' use of the student store varies by their year in school, so we sample the first 100 students entering the student union last Monday at noon.
2. What is our research hypothesis?

Students' use of the student store varies by their year in school

What are our statistical hypotheses?

Null hypothesis (H₀): f_O - f_E = 0, that is, the observed frequencies in the cells equal the expected frequencies

Alternative hypothesis (H₁): f_O - f_E ≠ 0, that is, the observed frequencies in the cells are not equal to the expected frequencies

We asked each student entering the student union their year in school and then assigned each to one of four categories. We obtained the data below:

	Observed Frequencies
Freshmen	36
Sophomores	27
Juniors	19
Seniors	18

Note that the numbers here are counts--36 of our 100 subjects were freshmen.

Now we need to make our table to calculate the chi-square value:

	f_O	f_E	f_O - f_E	(f_O - f_E)²	(f_O - f_E)²/f_E
Freshmen	36	25	11	121	4.84
Sophomores	27	25	2	4	0.16
Juniors	19	25	-6	36	1.44
Seniors	18	25	-7	49	1.96
Sum	100	100	0		8.40 = χ²

Notice again that these are counts, not percentages. Note, too, that the raw differences sum to zero. By squaring them, we get a measure of the distance from expectation in which, unlike the differences in the z-test and t-test, every deviation counts as positive. If the null hypothesis is in fact true and we could measure with no chance variation, the chi-square would be zero.
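The arithmetic in the table can be reproduced with a short Python sketch. The variable names are chosen for this illustration. The observed counts are the ones we collected, and the expected counts are the equal split of 25 per cell.

```python
observed = {"Freshmen": 36, "Sophomores": 27, "Juniors": 19, "Seniors": 18}
expected = {year: 25 for year in observed}  # equal split of 100 students

chi_square = 0.0
total_difference = 0

print(f"{'':<12}{'f_O':>5}{'f_E':>5}{'diff':>6}{'diff^2':>8}{'diff^2/f_E':>12}")
for year, f_o in observed.items():
    f_e = expected[year]
    diff = f_o - f_e                 # raw difference, can be negative
    contribution = diff ** 2 / f_e   # squared difference scaled by expectation
    total_difference += diff
    chi_square += contribution
    print(f"{year:<12}{f_o:>5}{f_e:>5}{diff:>6}{diff ** 2:>8}{contribution:>12.2f}")

print(f"Sum of raw differences: {total_difference}")
print(f"Chi-square: {chi_square:.2f}")
```

Each row's last column is that cell's contribution to the chi-square, so the table also shows which categories drive the result (here, freshmen contribute more than half of the total).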

Now we need to use the chi-square table in the back of the book to figure out the P-value of obtaining a chi-square this large or larger under the condition that the null hypothesis is true.

As with the t-table, we have to decide two things:

The degrees of freedom (df) for our analysis--in this instance they are the number of categories - 1, or 4 - 1 = 3. That is, with a sample of 100 subjects, we were absolutely free in deciding the counts in 3 of the cells, but having done that, the fourth cell had to be whatever summed to 100.

What P-value we wish to use as our cutoff (the significance level). Commonly, 5% is chosen (referred to generally as p = .05).

From the table, for df = 3 and p = .05, we need to have a chi-square value of 7.82 or greater to reject our null hypothesis.

Our obtained χ² = 8.40. This is greater than the critical value we needed to reject our null hypothesis at the p = .05 level. Therefore, we REJECT the null hypothesis that the difference in frequencies between what we observed and what we expected was due to chance alone.
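The table only tells us whether we passed the cutoff. As a rough check, the sketch below also computes the P-value itself. It relies on a closed-form expression for the right-tail area that holds only for df = 3, and the function name is an assumption made for this illustration.

```python
import math


def right_tail_df3(x):
    """Right-tail area of the chi-square distribution, valid for df = 3 only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


chi_square = 8.40       # obtained from our table
critical_value = 7.82   # from the chi-square table, df = 3, p = .05

p_value = right_tail_df3(chi_square)
reject = chi_square >= critical_value

print(f"P-value for chi-square = {chi_square:.2f}, df = 3: {p_value:.3f}")
print("Reject the null hypothesis" if reject else "Do not reject the null hypothesis")
```

The P-value comes out at about .038, which is below .05, so comparing the P-value to the cutoff and comparing the chi-square to the critical value lead to the same decision.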

When choosing a P = .05 level, we are saying that, when the null hypothesis is true, 5% of the time we will still obtain a chi-square value of 7.82 or greater simply due to random variation, even though in truth there is no difference between what we observe and what we expect. So in rejecting the null hypothesis, we could be wrong.
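One way to see this 5% is to simulate it. The sketch below is an illustration that was not part of the original analysis. It repeatedly draws 100 students from a population in which all four years really are equally likely, and counts how often the chi-square still reaches 7.82. The function name, the number of trials, and the random seed are arbitrary choices.

```python
import random


def simulate_false_rejections(trials=5000, n=100, categories=4,
                              critical_value=7.82, seed=0):
    """Share of samples that reject the null even though the null is true."""
    rng = random.Random(seed)
    expected = n / categories
    rejections = 0
    for _trial in range(trials):
        counts = [0] * categories
        for _student in range(n):
            counts[rng.randrange(categories)] += 1  # every year equally likely
        chi_square = sum((c - expected) ** 2 / expected for c in counts)
        if chi_square >= critical_value:
            rejections += 1
    return rejections / trials


rate = simulate_false_rejections()
print(f"Share of samples rejecting a true null hypothesis: {rate:.3f}")
```

The share should land near 5%, though not exactly on it, because the chi-square curve is an approximation for count data and the simulation itself has chance variation.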

Because we have rejected the null hypothesis as unlikely to be true, we accept the alternative hypothesis that there is a difference between what we observed and what we expected.

Therefore we conclude that use of the student store does vary by students' year in school. It appears that freshmen are most likely to use the store.

What are some of the problems with this conclusion?

The expected frequency may be wrong

Perhaps we sampled in a way that was biased toward observing one class of student over another
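To illustrate the first problem, the sketch below recomputes the chi-square for the same observed counts under two sets of expected frequencies: the equal split we used, and the hypothetical registrar percentages from the earlier example. The function name is an assumption made for this illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum of (f_O - f_E)^2 / f_E over all cells."""
    return sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))


# Order: freshmen, sophomores, juniors, seniors.
observed = [36, 27, 19, 18]
equal_expected = [25, 25, 25, 25]
registrar_expected = [20, 20, 25, 35]  # hypothetical registrar percentages

chi_equal = chi_square_statistic(observed, equal_expected)
chi_registrar = chi_square_statistic(observed, registrar_expected)

print(f"Equal expected frequencies:     chi-square = {chi_equal:.2f}")
print(f"Registrar expected frequencies: chi-square = {chi_registrar:.2f}")
```

With these numbers both versions exceed 7.82, but the size of the chi-square and the cells that contribute most to it change with the expectation, so the conclusion is only as good as the expected frequencies behind it.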

Finally, note that the chi-square test is nondirectional. The alternative hypothesis only states that the observed frequencies are not the same as the expected frequencies.
