© 2004, S. D. Cochran. All rights reserved.

CHI-SQUARE

  1. Categorical data analysis is a very common situation in research

    1. Observations are categorized into one of several mutually exclusive groupings called cells

      1. Example: The grades students get in school this year can be sorted into cells labeled A's, B's, & C's. The type of housing students live in can be categorized as dorms, apartments, or houses.

      2. Notice that each observation is classified into one and only one category or cell.

    2. After we have sorted observations into categories, the measurement we use is the frequency, or count, within each cell or category. This is called the observed frequency. So far in this class, we have already used counting and classifying to keep running tallies of counts within cells when the data had a binary distribution. Now we're going to see how you can keep track of situations where there are more than two categories.

    3. Even without collecting any data, the categories exist. We can fill these cells with what are called expected frequencies, that is, the counts we would expect in the absence of any knowledge about the observations.

      1. Example: Let's say we wanted to conduct a study of college undergraduates’ use of the student store. We have four categories of students: Freshmen, Sophomores, Juniors, Seniors. In the absence of any information, if we asked the first 100 students walking through the door their year in school, what would we expect to see in the cells?

 

 

                        Freshmen   Sophomores   Juniors   Seniors
Expected Frequencies       25          25          25        25

  1. We decided to divide the 100 students evenly among the four cells. In the absence of any information, our best bet is an equal probability for each cell. But we can also have expected frequencies based on external information. For example, if we knew from the college registrar that 20% of students are Freshmen, 20% Sophomores, 25% Juniors, and 35% Seniors, we could create the following expected frequencies:

 

                        Freshmen   Sophomores   Juniors   Seniors
Expected Frequencies       20          20          25        35

    We still have not collected any data. We are only specifying what we would expect to see.
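The two ways of filling the cells can be sketched in a few lines of Python. This is only an illustration: the function name `expected_counts` and the variable names are assumptions introduced here, and the percentages are the ones quoted above.

```python
def expected_counts(total, percentages):
    """Turn category percentages into expected counts for a sample of size `total`."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {category: total * pct / 100 for category, pct in percentages.items()}

years = ["Freshmen", "Sophomores", "Juniors", "Seniors"]

# No outside information: give every category the same probability.
equal = expected_counts(100, {year: 100 / len(years) for year in years})

# Outside information: the registrar percentages from the example above.
registrar = expected_counts(100, dict(zip(years, [20, 20, 25, 35])))

print(equal)
print(registrar)
```

In both cases the expected counts add up to the sample size, because no data have been collected yet and we are only distributing the 100 students across the cells.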

  1. General rules for creating expected frequencies

    1. If we know the distribution of categories in the population, we can use that to generate expected frequencies

    2. If we don't, which is the more likely scenario, then we can estimate the expected frequencies by giving each category the same probability of occurrence.

  2. Under what conditions will we observe the expected frequencies (excluding for the moment the issue of chance error)? If students are just as likely to use the student store no matter what year in school they are, then we would expect the distribution of categories among the student store users we sample to equal the distribution in the population of students.

  3. Why wouldn't we see what we expect to see? For the same reasons as before--it could be due to bias or true difference; it could also be due to chance variation.

  1. With both the z-test and the t-test, we found that if we looked at the difference between what we observed and what we expected, and then divided that to weight for the expected chance variation, we generated values that could be compared to probability density distributions (the normal distribution and t-distributions) in order to find the probability of obtaining a value as large or larger when the null hypothesis was true--that is when the observed difference was solely due to chance.

  2. With frequency data we can do the same thing, but the probability density curve we use is the Chi-square distribution

    1. Like the t-distribution, the chi-square distribution varies depending upon the degrees of freedom

    2. As with both the z-distribution and the t-distributions, the total area under the curve is 100%. If we choose a certain chi-square value, the area to the right is the probability of obtaining this value or greater
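To make the "area to the right" concrete, here is a minimal sketch that computes it with the Python standard library alone. The function name `chi2_upper_tail` is an assumption made for this illustration, and the series used for the incomplete gamma function is one standard way to get the area, not the method behind the table in the book.

```python
import math

def chi2_upper_tail(x, df):
    """Area to the right of x under the chi-square curve with `df` degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    # Series for the lower incomplete gamma function:
    # sum over n of half_x**n / (a * (a + 1) * ... * (a + n))
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    lower_area = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return 1.0 - lower_area

# The same chi-square value leaves a different area to its right
# depending on the degrees of freedom.
for df in (1, 3, 5):
    print(df, round(chi2_upper_tail(7.82, df), 3))
```

The loop prints the tail area for one chi-square value at three different degrees of freedom, which shows why the degrees of freedom must be settled before a table can be read.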

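  3. The formula for a chi-square test is:

    \chi^2 = \text{sum of } \frac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}}

or more formally:

    \chi^2 = \sum \frac{(f_O - f_E)^2}{f_E}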

  1. How do we use the chi-square test?

    1. Example: we want to see if students' use of the student store varies by their year in school, so we sample the first 100 students entering the student union last Monday at noon.

    2. What is our research hypothesis?

Students' use of the student store varies by their year in school

  1. What are our statistical hypotheses?

    1. Null hypothesis (H0): fO - fE = 0, that is, the observed frequencies in the cells equal the expected frequencies

    2. Alternative hypothesis (H1): fO - fE ≠ 0, that is, the observed frequencies in the cells are not equal to the expected frequencies

  2. We ask each student, as he or she enters the student union, their year in school and then assign them to one of four categories. We obtained the data below:

     

 

                        Freshmen   Sophomores   Juniors   Seniors
Observed Frequencies       36          27          19        18

Note that the numbers here are counts--36 of our 100 subjects were freshmen.

  1. Now we need to make our table to calculate the chi-square value

 

              fO     fE    fO - fE   (fO - fE)^2   (fO - fE)^2/fE
Freshmen      36     25       11         121            4.84
Sophomores    27     25        2           4            0.16
Juniors       19     25       -6          36            1.44
Seniors       18     25       -7          49            1.96
sum          100    100        0                        8.40 = χ²

    Notice again these are counts, not percentages. Note, too, that the raw differences sum to zero. By squaring them, we get a measure of the distance from expectation in which, unlike the z-test and t-test statistics, every term is zero or positive. If the null hypothesis is in fact true and we could measure with no chance variation, the chi-square would be zero.
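The table calculation can be reproduced with a short Python sketch. The function name `chi_square_statistic` is an assumption introduced for illustration. The observed counts are the ones we collected, and the expected counts are the equal split of 25 per cell.

```python
def chi_square_statistic(observed, expected):
    """Sum of (fO - fE)^2 / fE over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of cells")
    return sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))

years = ["Freshmen", "Sophomores", "Juniors", "Seniors"]
observed = [36, 27, 19, 18]
expected = [25, 25, 25, 25]

# One row per cell: fO, fE, fO - fE, and the cell's contribution to chi-square.
for year, f_o, f_e in zip(years, observed, expected):
    print(year, f_o, f_e, f_o - f_e, round((f_o - f_e) ** 2 / f_e, 2))

raw_difference_sum = sum(f_o - f_e for f_o, f_e in zip(observed, expected))
chi_square = chi_square_statistic(observed, expected)
print("sum of raw differences:", raw_difference_sum)
print("chi-square:", round(chi_square, 2))
```

Dividing each squared difference by its expected frequency is what lets cells with different expected counts be added together on a common scale.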

  1. Now we need to use the chi-square table in the back of the book to figure out the P-value of obtaining a chi-square this large or larger under the condition that the null hypothesis is true

    1. Like using the t-table we have to decide two things:

      1. The degrees of freedom (df) for our analysis: in this instance they are the number of categories - 1, or 4 - 1 = 3. That is, with a sample of 100 subjects, we were absolutely free in deciding the counts in 3 of the cells, but having done that, the fourth cell had to be whatever made the total sum to 100.

      2. What P-value cutoff (significance level) we wish to use. Commonly, a cutoff of 5% is chosen (referred to generally as p = .05)

    2. From the table, for df = 3 and p = .05, we need to have a chi-square value of 7.82 or greater to reject our null hypothesis

  2. Our obtained χ² = 8.40. This is greater than the critical value of 7.82 we needed to reject our null hypothesis at the p = .05 level. Therefore, we REJECT the null hypothesis that the difference in frequencies between what we observed and what we expected was due to chance alone

    1. When choosing a P = .05 level, we are saying that even when in truth there is no difference between what we observe and what we expect, 5% of the time we will obtain a chi-square value of 7.82 or greater simply due to random variation. So in rejecting the null hypothesis, we could be wrong

    2. Because we have rejected the null hypothesis as unlikely to be true, we accept the alternative hypothesis that there is a difference between what we observed and what we expected

    3. Therefore we conclude that use of the student store does vary by students' year in school, and that it appears Freshmen are most likely to use the store
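The decision steps above (find the degrees of freedom, look up the critical value, compare) can be sketched as follows. The function name `chi_square_decision` and the small lookup table are assumptions made for this illustration. Only the df = 3 entry (7.82) comes from these notes; the other entries are standard chi-square table values at p = .05, included to show how the lookup depends on df.

```python
# Critical chi-square values at p = .05, keyed by degrees of freedom.
# Only the df = 3 value is quoted in the notes; the rest are standard table values.
CRITICAL_05 = {1: 3.84, 2: 5.99, 3: 7.82, 4: 9.49}

def chi_square_decision(chi_square, n_categories, table=CRITICAL_05):
    """Return (df, critical value, whether to reject the null hypothesis)."""
    df = n_categories - 1          # one cell is fixed once the others are chosen
    critical = table[df]
    return df, critical, chi_square >= critical

df, critical, reject = chi_square_decision(8.40, n_categories=4)
print(f"df = {df}, critical value = {critical}, reject H0: {reject}")
```

A chi-square below the critical value would lead us to retain the null hypothesis instead, since differences that small are expected more than 5% of the time by chance alone.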

  3. What are some of the problems with this conclusion?

    1. The expected frequency may be wrong

    2. Perhaps we sampled in a way that was biased toward observing one class of student over another
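To see how much the first problem matters, we can recompute the chi-square with a different set of expected frequencies. The sketch below reuses the registrar percentages from earlier in these notes, which were hypothetical ("if we knew from the college registrar"), so the second result is an illustration only and not a finding. The function and variable names are assumptions.

```python
def chi_square_statistic(observed, expected):
    """Sum of (fO - fE)^2 / fE over all cells."""
    return sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))

observed = [36, 27, 19, 18]            # Freshmen, Sophomores, Juniors, Seniors
equal_expected = [25, 25, 25, 25]      # equal probability for each cell
registrar_expected = [20, 20, 25, 35]  # hypothetical registrar percentages

chi2_equal = chi_square_statistic(observed, equal_expected)
chi2_registrar = chi_square_statistic(observed, registrar_expected)

print("equal split:    ", round(chi2_equal, 2))
print("registrar split:", round(chi2_registrar, 2))
```

The same observed counts give a different chi-square, and different cells drive it, depending on what we decided to expect, so the conclusion is only as good as the expected frequencies behind it.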

  4. Finally, note that the Chi-square test is nondirectional. The alternative hypothesis only states that the observed frequencies are not the same as the expected frequencies