Analysis of Categorical Data
Up until now, almost all of the data we have analyzed have been numerical. Categorical data are quite common, however, particularly in the social and medical sciences. The outcome variable is usually in the form of counts: how many people fell into a certain category? And the analysis is usually concerned with comparing the patterns into which the counts fall.
There are three main types of analyses here: "goodness of fit", "homogeneity", and "independence." They are very easy to confuse, partly because, no matter what you call them, the same statistical test is often your best bet for answering the research question. Roughly put, goodness of fit is concerned with whether an observed distribution is well described by a particular theoretical distribution. Homogeneity is concerned with whether distributions observed in different groups are actually the same. Independence is concerned with whether one variable is statistically independent of another.
Example I: Is Sanditon II a good imitation of Jane Austen's style?
When she died, Austen had written only a few chapters of a novel called Sanditon. An admirer completed the novel. Shown below are word counts for Chapters 1 and 3 of Sense and Sensibility, Chapters 1, 2, and 3 of Emma, Chapters 1 and 6 of Sanditon (written by Austen), and Chapters 12 and 24 of Sanditon (written by the ghostwriter).
| Word | Sense & Sensibility | Emma | Sanditon I | Sanditon II |
|---|---|---|---|---|
| a | 147 | 186 | 101 | 83 |
| an | 25 | 26 | 11 | 29 |
| this | 32 | 39 | 15 | 15 |
| that | 94 | 105 | 37 | 22 |
| with | 59 | 74 | 28 | 43 |
| without | 18 | 10 | 10 | 4 |
| Total | 375 | 440 | 202 | 196 |
Example II: Is there a relation between marital status and educational level?
From a study of 1436 women listed in an edition of Who's Who:
| Education | Married Once | Married More than Once | Total |
|---|---|---|---|
| College | 550 | 61 | 611 |
| No College | 681 | 144 | 825 |
| Total | 1231 | 205 | 1436 |
Example III: Gender bias in promotion
Is there evidence that managers show bias in promotion decisions? Forty-eight supervisors were each given a personnel file and asked to determine whether the person deserved promotion. All files were identical, but 24 were labeled as belonging to males and 24 to females. The files were randomly assigned to the supervisors.
| | Male | Female |
|---|---|---|
| Promote | 21 | 14 |
| Hold File | 3 | 10 |
Example IV: Is this die fair?
I tossed a die 100 times. Below is a table showing the counts in each cell. Is the die fair?
| Outcome | Frequency |
|---|---|
| 1 | 17 |
| 2 | 18 |
| 3 | 15 |
| 4 | 18 |
| 5 | 15 |
| 6 | 17 |
The Chi-Square Statistic
The statistic of choice for these studies is the chi-square statistic (though it is not the only one). The theory goes something like this:
If assumptions are met, then X² = Σ (observed − expected)²/expected, summed over all cells, is a random variable whose distribution is approximately chi-square with the appropriate degrees of freedom. "Expected" means the number of observations we would expect to fall in a category according to the null hypothesis. Sometimes this is not known, and must be estimated. I emphasize that this is only an approximation.
The assumptions are simple:
a) the expected count in each category is at least five (observed 0's are okay), and
b) observations are independent.
The latter assumption is often overlooked, but is extremely important, as a later example will show.
The formula for the degrees of freedom is, in most textbooks, different for each of these tests. However, as I'll show you, there's really only one formula: df = Number of independent observations - number of independent parameters estimated. More on this later.
Let's put off the mechanics of chi-squared testing for the moment, and examine these data sets in more detail.
Example 4:
This is the simplest: a goodness of fit test. We are asked to determine whether the data support our hypothesis that this is a uniform distribution. It could have been any distribution that we were testing, though. Perhaps we had a histogram and wanted to know if it looked "normal." The procedure is the same.
a. Calculations
The chi-square statistic is the sum over all cells of (observed − expected)²/expected.
The "observed" counts are in the table. To calculate the expected, we need to refer to the null hypothesis model. Here, the null hypothesis says the die is fair, and therefore the probability of an outcome is 1/6. So the expected count in each column is (1/6)*100 = 16.66667, approximately.
The observed value of the chi-square statistic is, therefore, (17 − 100/6)²/(100/6) + ... + (17 − 100/6)²/(100/6) = 0.56.
Under the null hypothesis, the observed chi-square statistic will be close to zero. Large values are evidence against the null hypothesis. How large is large? To answer this we need to know the sampling distribution of the statistic (approximately chi-square, as noted above), which means we need to know the degrees of freedom.
b. Degrees of Freedom
The general rule is that the degrees of freedom equal the number of independent observations minus the number of independent parameters estimated from the data. In simple regression, for example, the degrees of freedom for the t-statistic that tests the slope are n-2: there are n independent observations (if you collected your data correctly!) and you estimate 2 parameters, a slope and an intercept. And when you are doing a t-test on a single group of data, you have n independent observations and you estimate one parameter (the SD -- you hypothesize that the mean is known), so there are n-1 df.
When dealing with categorical data such as these, rather than use this rule directly, we do it slightly differently: df = number of independent cells minus the number of independent parameters estimated from the data. (In other words, the cell counts are our observations.)
Note that in this problem, we have fixed the number of observations at 100. (The die was tossed 100 times.) This means that although there are 6 cells, only 5 of them are independent. (Knowing the counts of any five of the cells determines the counts of the sixth.) No parameters were estimated, and so the degrees of freedom is 5-0 = 5.
We therefore compare our observed chi-square statistic to a chi-square distribution with 5 degrees of freedom. A computer package can find the p-value, or a table will give us the critical value. For example, for a 5% cut-off, the critical value is 11.07: we reject for observed chi-square statistics greater than or equal to 11.07. So in this case we do not reject.
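If you'd rather let software do the arithmetic, here is a minimal sketch, assuming Python with scipy installed (the library calls are my illustration, not part of the original analysis):

```python
# Goodness-of-fit test for the die, assuming scipy is available.
from scipy.stats import chisquare, chi2

observed = [17, 18, 15, 18, 15, 17]   # counts from the table above

# With no expected counts supplied, chisquare tests against a uniform
# distribution, i.e. an expected count of 100/6 in each of the 6 cells.
stat, p = chisquare(observed)
print(stat, p)                        # X² ≈ 0.56, p ≈ 0.99

# The 5% critical value for a chi-square with 5 degrees of freedom:
print(chi2.ppf(0.95, df=5))           # ≈ 11.07
```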
This theme of considering one set of numbers as "fixed" will be a helpful way of thinking about these problems. In the last problem we considered the total number of counts as fixed since we tossed the die 100 times.
Example 1: Jane Austen
Consider these questions:
a) What are the 'units' that were studied?
b) What are the variables?
c) What's the null hypothesis? The alternative?
d) Which values should be considered "fixed"?
a) The "units" are the books/chapters themselves. The researchers chose these books, they could have chosen others. Or they could have chosen other chapters. Once these are studied, there are attributes of these that we want to know.
b) The variables, or attributes, we measured are the frequencies of certain words. (You might think of "word" as an independent variable, and "count" as the response variable.)
c) The null hypothesis is that the frequencies of words follow the same distribution in each book. This is analogous to looking at four histograms (or, more precisely, four bar-charts) and asking whether they come from the same population. More precisely, the null hypothesis says that the probability that an observation will fall in the "an" bin is the same for each book, and the same can be said for each value of "Word".
An intuitive estimate of each probability, then, would be the number of observations falling in that row divided by the total number of observations.
The alternative hypothesis is that the distributions are different. (More precisely, that at least one of them differs from the others.)
d) In this example, the column totals are fixed according to the null hypothesis. The reason is that we are not conjecturing about the total number of words in each book, merely their distribution. So we accept that S&S has 375 words (or more accurately that the total count of the words we care about is 375), but are contesting whether they had to play out exactly the way we observed.
This is an example of a homogeneity test.
Example 2
Let's answer the same questions. In this example, the units that were sampled, if that's the word, were the women in Who's Who. Once the researchers chose to use Who's Who, the rest was determined for them. There was no way of knowing in advance how many of the women would have gone to college, or how many would have married more than once. (These are the two attributes.)
The null hypothesis is that there is no relation between the two variables. An observation that occurs in the first column is just as likely to appear in the first row as the second. And the same could be said for observations that appear in the second column. The probability of occurring in a row is unrelated to which column it appears in. In other words, the variable Marriage (the columns) is independent of the variable Education (the rows).
Under this null hypothesis, only the total number of observations is fixed. Outcomes are assigned to the cells according to the relationship of independence. Thus, the probability of falling into the cell (Married Once AND College) is equal to the probability of Married Once times the probability of College. In general, the probability of landing in a cell is the probability of landing in that row times the probability of landing in that column.
This is a test for independence.
Example 3
Here there are two variables (gender and promotion). There is one sample of 48 supervisors. But this study, although it would seem to fit into the "independence" mold, is different from either of the last two examples. The reason is that both the columns and the rows can be considered fixed. The column totals (number of M and F) were fixed by the experimenters. The row totals were fixed by the subjects. (We have to treat this as fixed because we can't question their decision whether or not a file is to be promoted. We merely question whether the ratio of male to female promotions is what it should have been.)
The null hypothesis is that, with no gender bias, there would still have been 35 promotions, but they would have been equitably distributed between the two groups.
In this case we don't need an approximate test like the chi-square, because the exact distribution is known: the hypergeometric. Let N11 represent the number of outcomes that land in cell (1,1) under the null hypothesis (that is, the number of males promoted). N11 is distributed as the number of "successes" when we draw 24 times without replacement from a population with 35 successes and 13 failures. We can calculate this distribution for all values:
P(N11 = x) = (35 choose x)(13 choose 24 − x) / (48 choose 24),

that is, (number of promotions choose x) times (number of holds choose 24 − x), divided by (total number of files choose number of male files).
This is called Fisher's Exact Test. "Exact" because our p-values are exact, and not large-sample approximations.
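Here is a sketch of this calculation in Python, assuming scipy is available (scipy.stats.hypergeom and scipy.stats.fisher_exact are standard functions, but their use here is my illustration):

```python
# Fisher's exact test for the promotion data, assuming scipy.
from scipy.stats import hypergeom, fisher_exact

# N11 = number of males promoted. Under the null, it behaves like the
# number of successes in 24 draws, without replacement, from a pool of
# 48 files containing 35 promotions and 13 holds.
dist = hypergeom(M=48, n=35, N=24)
print(dist.pmf(21))                  # P(N11 = 21)

# fisher_exact sums these hypergeometric probabilities for us:
table = [[21, 14],                   # Promote:   male, female
         [3, 10]]                    # Hold File: male, female
oddsratio, p = fisher_exact(table)   # two-sided by default
print(p)
```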
Assumptions:
Why is the assumption of independence important? A study by Vianna, Greenwald, and Davies found a relationship between having tonsils and getting Hodgkin's Disease. The possession of tonsils seemed to be protective. A study by Johnson & Johnson, however, used a paired design: they considered Hodgkin's victims with siblings of nearly the same age and the same gender. (So they controlled for gender, age, and to some extent genetics.)
Here are their data:
| | Tonsillectomy | No Tonsillectomy |
|---|---|---|
| Hodgkin's | 41 | 44 |
| Healthy | 33 | 52 |
They calculated a chi-squared statistic (homogeneity? independence?) and found no relation, thus contradicting the previous study.
Note, however, that in this study the rows are not independent: the observations are "paired." An analysis that takes this into account uses the following table:
| | Sibling: No Tonsillectomy | Sibling: Tonsillectomy |
|---|---|---|
| Patient: No Tonsillectomy | 37 | 7 |
| Patient: Tonsillectomy | 15 | 26 |
The null hypothesis is now that the probability of falling in the first row is the same as the probability of falling in the first column, AND the probability of falling in the second row is the same as falling in the second column. This can be worked into something called McNemar's test (another chi-squared approximation), and in this case the conclusion is that there is an effect.
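Here is a minimal by-hand sketch of McNemar's statistic, assuming scipy for the chi-square tail probability (statsmodels also provides a ready-made version, statsmodels.stats.contingency_tables.mcnemar):

```python
# McNemar's test on the paired table, assuming scipy.
from scipy.stats import chi2

# Only the discordant pairs carry information about a difference:
b = 7    # patient: no tonsillectomy, sibling: tonsillectomy
c = 15   # patient: tonsillectomy,    sibling: no tonsillectomy

stat = (b - c) ** 2 / (b + c)   # McNemar's chi-square statistic
p = chi2.sf(stat, df=1)         # upper tail of chi-square with 1 df
print(stat, p)
```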
Calculations
Now let's compute these chi-square statistics. First, some theory about the chi-square distribution. Remember that although the test statistic is called the chi-square statistic, its distribution is only approximately chi-square.
Here are the important features:
1) the shape depends on a single parameter called the "degrees of freedom"
2) the density has the value 0 for x < 0.
3) The density is right-skewed, but becomes more symmetric as the degrees of freedom increases.
Where does this density come from?
Take a standard normal random variable, Z. Now square it. What's the distribution of Z²? It turns out that it is a chi-square distribution with 1 degree of freedom.

Now take n such standard normal variables and add up their squares: Z₁² + Z₂² + ... + Zₙ². What will the result be? The probability density of this sum is a chi-square density with n degrees of freedom.
This means that the chi-square density often comes into play when dealing with random variables that are sums of squared terms built from underlying normal random variables.
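A quick simulation makes this concrete. This is a sketch, assuming numpy and scipy; the sample size and seed are arbitrary choices of mine:

```python
# Sums of n squared standard normals vs. the chi-square(n) distribution.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 5
sums = (rng.standard_normal((100_000, n)) ** 2).sum(axis=1)

# The simulated quantiles should track the chi-square(5) quantiles:
for q in (0.25, 0.50, 0.75, 0.95):
    print(q, np.quantile(sums, q), chi2.ppf(q, df=n))
```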
Our test statistic X² does have a sum of squared terms in it. But the underlying variables are NOT normal. (They are counts and therefore integer-valued, and the normal distribution deals with continuous random variables.) In fact, they are multinomial random variables. These are generalizations of the binomial that arise when there are more than two possible outcomes (but still a finite number) for a random variable.
And just as the binomial distribution can be approximated by the normal distribution if it is reasonably symmetric (usually this means np > 10 and n(1-p) >10), so can the multinomial. Here a good rule of thumb is that the expected count in each cell needs to be >= 5.
There are two important conditions that must hold for the chi-square density to be a good approximation of the X² statistic's density:
1) the trials that produce the observations must be independent
2) the expected count in each cell must be 5 or greater.
The degrees of freedom can be determined using the following "formula":

df = (number of independent bins or cells) − (number of independent parameters estimated from the data)

We'll explain this formula at the very end. More traditionally, you have probably seen two different formulas, each used in a different context:

For goodness of fit: df = n − (number of estimated parameters) − 1, where n is the number of bins. (In Yates, Moore, and McCabe, the number of estimated parameters is always 0, so df = n − 1.)

For homogeneity and independence: df = (I − 1)(J − 1), where I is the number of rows and J the number of columns.
Applying the Chi-square test
Rather than teach formulas, I prefer to teach a general strategy for problem solving in these contexts:
1) Determine the null hypothesis
2) Use the null hypothesis to find expected counts for each cell (which means first finding the probabilities associated with observing an outcome in each cell.)
3) Calculate the test statistic
4) Compare the observed value to the appropriate sampling distribution:
a) Is the chi-square approximation valid?
b) If so, find the df.
5) Find the p-value
6) Make a decision.
Let's consider our examples.
Homogeneity: Jane Austen
Essentially, we have four bar-charts, one for each book. And we want to know if they are the "same." Of course, they are not exactly the same. But differences, we argue, may be due to "chance."
I think you would agree that it is not fair to compare the raw counts from book to book. After all, longer books would be expected to have more of each of the words in them. Thus, we really need to compare the rate at which the words occur in each book. This means dividing each count by the total number of "target" words in that book.
| Word | Sense & Sensibility | Emma | Sanditon I | Sanditon II |
|---|---|---|---|---|
| a | 147 (39.2%) | 186 (42.3%) | 101 (50%) | 83 (42.3%) |
| an | 25 (6.67%) | 26 (5.9%) | 11 (5.4%) | 29 (14.8%) |
| this | 32 (8.53%) | 39 (8.9%) | 15 (7.4%) | 15 (7.6%) |
| that | 94 (25.1%) | 105 (23.9%) | 37 (18.3%) | 22 (11.2%) |
| with | 59 (15.7%) | 74 (16.8%) | 28 (13.9%) | 43 (21.9%) |
| without | 18 (4.8%) | 10 (2.3%) | 10 (5.0%) | 4 (2.0%) |
| Total | 375 | 440 | 202 | 196 |
Analysis I: Are Austen's own works consistent? Do her three books come from the same distribution? Let's just compare the first three novels.
Step 1: Expected Values
The first step is to find the expected counts under the null hypothesis. The null hypothesis says that the rate at which the words occur is the same for each novel. So S&S gets to choose 375 words, and a proportion p1 of them will be "a"s. Emma gets to choose 440 words, and the same proportion p1 will be "a"s. Likewise, each book gets a proportion p2 of "an"s, etc.
What are these proportions? The null hypothesis doesn't tell us their values, only that they should be the same across each row. We use the data to estimate them. Considering all three books together, there are 375 + 440 + 202 = 1017 words. Of these, 434, or a proportion of 0.426745, are "a"s. So a good estimate for p1 is 0.426745.
Now S&S gets to choose 375 words, and we expect a proportion 0.426745 of them to be "a"s, so the expected count is 0.426745 * 375 = 160.029. Emma gets 0.426745 * 440 = 187.768.
We proceed this way to get the next table:
Each cell shows the observed count followed by the expected count. (Sanditon II is not part of this comparison.)

| Word | Sense & Sensibility | Emma | Sanditon I |
|---|---|---|---|
| a | 147 / 160.03 | 186 / 187.77 | 101 / 86.20 |
| an | 25 / 22.86 | 26 / 26.82 | 11 / 12.31 |
| this | 32 / 31.71 | 39 / 37.21 | 15 / 17.08 |
| that | 94 / 87.02 | 105 / 102.10 | 37 / 46.88 |
| with | 59 / 59.37 | 74 / 69.66 | 28 / 31.98 |
| without | 18 / 14.01 | 10 / 16.44 | 10 / 7.55 |
| Total | 375 / 375 | 440 / 440 | 202 / 202 |

(The expected counts in each column sum to that column's total, up to rounding.)
Step 2: Calculate test statistic
Now for each cell we calculate (observed − expected)²/expected and then sum them up. For example, the first cell is (147 − 160.03)²/160.03 = 1.06.
The total is X²(observed) = 12.27.
Step 3: Compare to the Sampling Distribution
The expected counts are all above 5. We can treat the outcomes as independent if we assume that Austen was not keeping track of how many of each word she had used, and was therefore not "rationing" or budgeting herself.
The chi-square distribution with (I − 1)(J − 1) = 5 * 2 = 10 degrees of freedom is therefore an acceptable approximation.
Step 4: P-value
The p-value is P(X² > 12.27) = 0.267.
Step 5: Decision
This tells us that this is a "typical" value, according to the null hypothesis, and there is no evidence to conclude that these books differ in their distributions for these words. Do not reject the null hypothesis.
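As a check, assuming scipy is available, chi2_contingency carries out exactly this homogeneity calculation (expected counts from the row and column totals, then the X² sum):

```python
# Homogeneity test for the three Austen samples, assuming scipy.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: a, an, this, that, with, without.
# Columns: Sense & Sensibility, Emma, Sanditon I.
counts = np.array([[147, 186, 101],
                   [ 25,  26,  11],
                   [ 32,  39,  15],
                   [ 94, 105,  37],
                   [ 59,  74,  28],
                   [ 18,  10,  10]])

stat, p, dof, expected = chi2_contingency(counts)
print(stat, p, dof)   # X² ≈ 12.27, p ≈ 0.27, df = 10
```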
Exercise:
Now you decide if Sanditon II belongs with the other Austen novels. (A code sketch for checking your answers follows the steps below.)
1. Combine the three columns that belong to Austen and treat them as one big book. So we will have only two columns to compare: Austen and non-Austen.
2. State the null and alternative hypotheses.
3. Calculate the probabilities for landing in each cell.
4. Calculate the expected values for landing in each cell.
5. Calculate the test statistic.
6. Calculate the degrees of freedom.
7. Calculate the p-value.
8. What do you conclude?
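Here is the promised sketch, assuming scipy; the combined Austen column is just the row sums of the first three columns:

```python
# Exercise check: Austen (combined) vs. Sanditon II, assuming scipy.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: a, an, this, that, with, without.
austen    = np.array([434, 62, 86, 236, 161, 38])  # S&S + Emma + Sanditon I
sanditon2 = np.array([ 83, 29, 15,  22,  43,  4])

table = np.column_stack([austen, sanditon2])
stat, p, dof, expected = chi2_contingency(table)
print(stat, p, dof)   # df should come out to (6-1)(2-1) = 5
```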
Independence: Women and College
Step 1: Expected Counts
We approach the question "Is there a relationship?" by asking "Are the two variables Marriage and Education independent?" If yes, then there is no relation. If not, then there is.
The null hypothesis is that they are independent.
Now what you want to imagine happening is that each of the 1436 women is assigned a value on each of the two variables. But under the null hypothesis, her assignment on one variable has nothing to do with her assignment on the other.
That is, P(she has College AND Married Once) = P(College) * P(Married Once).
This is the definition of independence. And this is true for each cell in the table.
If pij = P(in row i AND column j), then pij = P(in row i) * P(in column j) = pi * qj.
So this tells us how to estimate the pij: first estimate pi and qj, and then, if the null hypothesis is true, their product is a good estimate.
Now p1 = P(college) = 611/1436
p2 = P(no college) = 825/1436
q1 = P(married once) = 1231/1436
q2 = P(married more than once) = 205/1436
So:

p11 = P(college and married once) = (611/1436) * (1231/1436)

p12 = P(college and married more than once) = (611/1436) * (205/1436)

etc.
Then the expected count in cell (ij) is n * pij.
So E11 = 1436 * p11 = 1436 * (611/1436) * (1231/1436) = 523.8, approximately.
We can make this table:
| Education | Married Once | Married > Once |
|---|---|---|
| College | 550 / 523.8 | 61 / 87.2 |
| No College | 681 / 707.2 | 144 / 117.8 |

(Each cell shows observed / expected.)
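In matrix form, the whole expected-count table is just an outer product of the margins divided by the grand total; a sketch assuming numpy:

```python
# Expected counts under independence, assuming numpy.
import numpy as np

observed = np.array([[550, 61],
                     [681, 144]])
n = observed.sum()                          # 1436
row_totals = observed.sum(axis=1)           # 611, 825
col_totals = observed.sum(axis=0)           # 1231, 205
expected = np.outer(row_totals, col_totals) / n
print(expected)                             # [[523.8, 87.2], [707.2, 117.8]] approximately
```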
Step 2: Calculate Test Statistic
X² = (550 − 523.8)²/523.8 + ... = 16.01
Step 3: Calculate Degrees of Freedom
(I-1)(J-1) = (2-1)*(2-1) = 1
Step 4: Calculate p-value
P(X² > 16.01) = 0.000063
Note that we could also have done a (two-sided) z-test using z = sqrt(16.01) = 4.0.
Step 5: Conclusion
The deviations from the expected counts are too great to be accounted for solely by chance. We reject the null hypothesis that the variables are independent.
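The same test in one call, assuming scipy. Note correction=False: for 2×2 tables chi2_contingency applies Yates' continuity correction by default, which would give a slightly different statistic than the hand calculation above:

```python
# Independence test for marriage vs. education, assuming scipy.
from scipy.stats import chi2_contingency

stat, p, dof, expected = chi2_contingency([[550, 61],
                                           [681, 144]],
                                          correction=False)
print(stat, p, dof)   # X² ≈ 16.01, p ≈ 0.000063, df = 1
```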
A Unified Approach to Degrees of Freedom
Although we didn't make use of it, the formula I gave you earlier for the degrees of freedom was:
degrees of freedom = (number of independent cells) − (number of independent estimated parameters)
This works for ALL circumstances. Let's see why.
Example 1 (Goodness of Fit)
YM&M says df = n-1, where n is the number of bins. We had 6 bins. However, only 5 of them were independent, because once you knew that the die had been tossed 100 times, and once you knew the counts in any 5 of the bins, you knew the count in the sixth. So there were 5, or n-1, independent bins. We estimated 0 parameters from the data, so df = 5.
In the logozone example, there were 4-1 = 3 independent bins. We estimated two parameters, and so df = 3-2=1.
Example 2 (Homogeneity)
There were 3 columns and 6 rows, so 18 different cells. However, because the count within each book (column) was "fixed", once you knew 5 of the cells in a column, you knew the sixth. So each column contributed only 5 independent cells. Thus, the number of independent cells was J*(I-1).
The null hypothesis was that the probability of falling in a particular row was the same for each column. So we had only to estimate the probability of falling into each row. There were therefore I estimated parameters. However, the sum of these probabilities must be 1, so there were only (I-1) independent parameters. Thus, the degrees of freedom is
J(I-1) − (I-1) = IJ − J − I + 1 = (I-1)(J-1)
Example 3 (Independence)
There were only 4 cells (IJ = 2*2). But because the total count was fixed, once you know three cells, you know the fourth. Thus there are, in general, IJ-1 independent cells.
We needed an estimate for each cell, but this required estimating only the probability of landing in each row and the probability of landing in each column. Thus there were I + J parameters estimated in total. However, they were not all independent: both the column probabilities and the row probabilities must each add to 1. So there were (I-1) + (J-1) independent parameters estimated. So the degrees of freedom are:
IJ − 1 − [(I-1) + (J-1)] = IJ − 1 − I + 1 − J + 1 = IJ − I − J + 1 = (I-1)(J-1) again.