Hypothesis Testing Review Sheet
General Framework: There exists a population of wildly exciting observations, but, alas, the population is too large to observe them all. The probability distribution of this population can be summarized with one (or maybe two or three) parameters, and we wish to know what these parameters are. In hypothesis testing, we are actually more interested in what values the parameters aren't. For example, if I'm running for president, I want to know that my support is NOT less than 50%. To test our hypotheses about the values of these parameters, we take a random sample of independent observations and test our hypothesis on this sample. Because the sample is random (and because the population has a variety of values), we can't count on getting the exact same values if we repeat our experiment. So we need to understand how our sample will vary.
1. Single Sample
We've collected data on a single variable which we assume has a normal distribution with mean mu and SD sigma. We are interested in the value of the mean of the population, mu. The conventional wisdom says that mu is a particular value, let's call it mu0. (For example, conventional wisdom claims that the mean is 0.) We believe instead that something else is true. We have three options: 1) mu is not mu0; 2) mu is greater than mu0; 3) mu is less than mu0. (Note that our options are much more vague than the conventional wisdom. This gives us an advantage over the conventional wisdom that some would say is unfair in some contexts. But that's a fine point for later.) The "conventional wisdom" is called the Null Hypothesis. Our alternative is called the "Alternative Hypothesis." Because the null hypothesis represents conventional wisdom, it gets the benefit of the doubt, and will be overturned only with exceptional evidence to the contrary.
Because we want to know about the mean of the population, we estimate it using one of our favorite estimators, the average Xbar (sum of the observations divided by the number of observations). We know, however, that Xbar is NOT equal to mu; it is merely an estimate that is unbiased (so it likes to "hang out" around mu) and has a relatively small standard error (so it doesn't stray too far from mu -- and in fact, the larger the sample size, the closer Xbar tends to stay to mu). But we can be even more precise: because we assumed the population was normal, Xbar is also normal with mean mu and SD equal to sigma/sqrt(n). (The SD of an estimator, like Xbar, is called the Standard Error, or SE.) So we can compute, for example, that the probability that Xbar is within 1 SE of mu is roughly 68%, within 2 SEs is roughly 95%, and within 3 is roughly 99.7%.
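A quick simulation makes this concrete. This is just a sketch: the values mu = 0, sigma = 3, and n = 25 are made up for illustration, not taken from anywhere in this sheet.

```python
import math
import random

# Simulate many samples of size n from Normal(mu, sigma) and check how
# often Xbar lands within 1, 2, and 3 SEs of mu.  The numbers mu = 0,
# sigma = 3, n = 25 are made-up illustration values.
random.seed(0)
mu, sigma, n, reps = 0.0, 3.0, 25, 20_000
se = sigma / math.sqrt(n)  # SE of Xbar = sigma/sqrt(n)

xbars = [sum(random.gauss(mu, sigma) for _ in range(n)) / n
         for _ in range(reps)]

def within(k):
    """Fraction of simulated Xbars within k standard errors of mu."""
    return sum(abs(x - mu) <= k * se for x in xbars) / reps

print(within(1), within(2), within(3))  # roughly .68, .95, .997
```

Try changing n: the SE shrinks like 1/sqrt(n), so Xbar hugs mu more tightly as the sample grows.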
So here's our approach:
1) Choose a significance level -- the probability that we will reject the null hypothesis even though it's true (oops!). 5% is a popular choice, as are 1% and even 10%, if you can live with yourself knowing you are wrong 10% of the time.
2) Calculate your test statistic. In the opinion of the Null hypothesis, is this test statistic "extreme"?
3) Calculate the p-value: what's the probability of getting a test statistic as extreme or MORE extreme than the one you got?
4) If the p-value is less than or equal to the significance level -- reject the null hypothesis. When this happens, it means you just saw an event that happens very rarely if the null hypothesis is true-- so maybe it isn't true! If I said I could predict the outcome of 100 flips of a coin, and I got 52 of them right, you probably wouldn't be too impressed, because your null hypothesis expected this. But if I got 90 of them correct, this would be so unusual that you would have to change your assumptions! (Which doesn't mean I'm psychic of course -- I might have cheated.)
5) If the p-value is NOT less than or equal to the significance level -- we do not say that we "accept" the null hypothesis. (Never admit defeat!) The reason for this is that our alternative might still be right. Perhaps there's just too much variability in our estimator to get a clear picture of what's going on. (The technical term for this is "power". The power is the probability that you correctly reject the null hypothesis when the null hypothesis is false -- a good thing. Power increases with sample size, so if you don't reject the null hypothesis today, go back and collect more data to increase your power.)
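To see step 3 in action, here's the coin-prediction story above worked out with exact binomial probabilities. (The function name is mine, not part of the sheet.)

```python
from math import comb

# P-value for the coin-prediction story: the probability of getting at
# least k of n = 100 fair-coin predictions right by luck alone.
def p_at_least(k, n=100, p=0.5):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

print(round(p_at_least(52), 2))  # about 0.38 -- happens all the time
print(p_at_least(90))            # astronomically small -- reject!
```

Guessing 52 right is exactly the kind of thing a fair coin produces; 90 right almost never happens under the null, so seeing it forces you to abandon the null.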
Details:
If sigma is a known number, like 3, then it's easy to calculate the p-value:
a) you know that Xbar has a normal distribution with mean mu (which the null hypothesis says is equal to mu0) and SE = sigma/sqrt(n). So
b) Z = (Xbar - mu0)/(sigma/sqrt(n)) is a standard normal random variable, so you can use the table in the back for standard normal random variables to calculate probabilities.
c) The p-value is
i) 2*P(Z > |observed test stat|) if our alternative hyp. is that mu does NOT equal mu0. (Values far out in either tail count as extreme, so we double the one-tail probability.)
ii) P(Z > observed test stat) if the alternative is that mu is greater than mu0.
iii) P(Z < observed test stat) if the alternative is that mu is less than mu0.
This is called the z-test, and uses Z as a test statistic.
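Here's one way the z-test might look in code. The helper name and the sample numbers (xbar = 0.9, mu0 = 0, sigma = 3, n = 36) are made up for illustration.

```python
from statistics import NormalDist

# Sketch of the z-test, covering cases (i)-(iii) above.
def z_pvalue(xbar, mu0, sigma, n, alternative):
    z = (xbar - mu0) / (sigma / n ** 0.5)  # standardize Xbar
    phi = NormalDist().cdf
    if alternative == "not equal":         # case (i): both tails
        return z, 2 * (1 - phi(abs(z)))
    if alternative == "greater":           # case (ii): upper tail
        return z, 1 - phi(z)
    return z, phi(z)                       # case (iii): lower tail

z, p = z_pvalue(0.9, 0.0, 3.0, 36, "greater")
print(round(z, 2), round(p, 4))  # z = 1.8, p = 0.0359
```

At the 5% level this example rejects (p = .0359 <= .05); at the 1% level it does not.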
If you don't know sigma, then you must estimate it and pay a slight penalty. To calculate the p-value:
a) estimate sigma with s = the square root of: sum of the squared deviations divided by (n-1). A deviation is the difference between an observation and the average Xbar.
b) Now you standardize Xbar, but the result no longer has a normal distribution, because now you are dividing by another random variable, s:
T = (Xbar - mu0)/(s/sqrt(n)).
T follows a distribution very similar to the normal distribution. This is the "t distribution with n-1 degrees of freedom." This is the penalty you pay for not knowing sigma. You can see this by calculating the p-value two ways. First, the right way: use the table for the t distribution. Then, the wrong way: recalculate the p-value using the normal table. The normal table will give you a smaller p-value -- it's friendlier towards the alternative hypothesis. Basically, your lack of knowledge about sigma means you increased the "noise" or "error" or variability in your test statistic, and so you are now a little less sure of what you see.
This is called the t-test and uses T as a test statistic.
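A sketch of the t-test on a small made-up sample, testing mu0 = 0 against a two-sided alternative at the 5% level. The data and the table cutoffs are the only outside ingredients: 2.365 is the standard two-sided 5% cutoff for a t with 7 degrees of freedom (versus 1.96 for the normal -- there's the penalty).

```python
import math

# Made-up sample of n = 8 observations; null hypothesis mu0 = 0.
data = [1.2, -0.4, 0.8, 2.1, 0.3, 1.5, -0.2, 0.9]
n = len(data)
xbar = sum(data) / n

# step (a): estimate sigma with s (divide squared deviations by n-1)
s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

# step (b): standardize Xbar using s in place of sigma
t = (xbar - 0.0) / (s / math.sqrt(n))

# From a t table: two-sided 5% cutoff with n-1 = 7 df is 2.365.
print(round(t, 3), abs(t) > 2.365)  # t is about 2.585 -- reject
```

Note that |t| = 2.585 clears the t cutoff 2.365 only barely, while it would clear the normal cutoff 1.96 comfortably: the normal table overstates the evidence, exactly as described above.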
Confidence Intervals:
The same thinking helps make confidence intervals. You have an estimate of the mean, Xbar, and you want to give the range over which it varies. Think of it this way: I'm thinking of a number, any number, between 0 and 10000000. You win a prize if you guess it correctly. Would you rather guess a single number, or an interval? Obviously, the larger an interval you give, the more likely you are to be correct.
Well, in our game we have a little more information than that. We know the population follows a normal distribution, and we know a lot about the behavior of Xbar, our estimate of mu. If we know sigma, then we can standardize Xbar and use the fact that the standardized form of Xbar is a standard normal RV to form a confidence interval. If we don't know sigma and instead replace it with an estimate, we can still standardize Xbar, but now the distribution is a t with n-1 degrees of freedom.
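As a sketch, here is the sigma-known case with made-up numbers (xbar = 0.9, sigma = 3, n = 36):

```python
import math
from statistics import NormalDist

# 95% confidence interval for mu when sigma is known:
#   Xbar +/- z* (sigma/sqrt(n)), where z* cuts off 2.5% in each tail.
xbar, sigma, n = 0.9, 3.0, 36
zstar = NormalDist().inv_cdf(0.975)     # about 1.96
half = zstar * sigma / math.sqrt(n)
lo, hi = xbar - half, xbar + half
print(lo, hi)  # roughly (-0.08, 1.88)
```

If sigma were unknown, we'd swap zstar for the corresponding cutoff from a t table with n-1 degrees of freedom, which widens the interval -- the same penalty as in the t-test.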
2. Two-samples
Now there's an added complication: two groups of data. Each group is a sample from a normal distribution with unknown mean and possibly unknown variance. For simplicity, we assume that the variances of each group are the same. Important assumptions: observations in each group are independent, and the groups are independent of each other.
The mathematical notation is too complex for this word processor, but the approach is basically the same as above.
Conventional wisdom says that the difference between the two population means is 0 (so the means are equal). We say instead that 1) the difference is not equal to 0, or 2) the difference is positive, or 3) the difference is negative.
We use Xbar minus Ybar to estimate the difference in means. If sigma is known, then we can standardize this estimate and we have a Z statistic. If sigma is NOT known, then we estimate it with something called the "pooled" estimate and then standardize. Only now the T statistic follows a t-distribution with (n+m-2) degrees of freedom. (n is the sample size of the X group, m of the Y group.) Look at the lecture notes or the book for details. But everything else is the same as in the one-sample case.
3. Binomial Approximations
Everything we did with the normal distribution can be used with binomial random variables if we approximate the binomial distribution with the normal distribution. As a rule of thumb, this is a pretty good approximation if np > 10 and n(1-p) > 10.
For example, if X represents the number of heads in n tosses of a coin, then X follows a binomial distribution with mean np and variance np(1-p). Typically, the null hypothesis will be about p. If the coin is fair, then p is 1/2. You can build your test around either X, which then has mean n*.5, or around phat (X/n), which has a mean of .5. If the null hypothesis is correct, that is. Otherwise, when we carry out our experiment, we'll see something very different for X or phat.
To know what is meant by "very different", we approximate the distribution of X as normal with mean n*.5 and variance n*.5*.5, or we approximate the distribution of phat as normal with mean .5 and variance .5*.5/n. This approximation comes to us, by the way, through the good graces of the Central Limit Theorem. The reason is that X is actually a sum of Bernoulli random variables (random variables that take on the values 1 or 0 only), and so phat is just like an average.
You can now carry out tests exactly like we did above with the normal case.
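As a sketch with made-up numbers: suppose we saw x = 60 heads in n = 100 tosses, and the alternative is p > 1/2.

```python
import math
from statistics import NormalDist

# Normal approximation to the binomial: null p0 = 1/2, alternative
# p > 1/2, with x = 60 heads in n = 100 tosses (made-up numbers).
x, n, p0 = 60, 100, 0.5
phat = x / n
se = math.sqrt(p0 * (1 - p0) / n)   # SE of phat under the null
z = (phat - p0) / se
p_value = 1 - NormalDist().cdf(z)   # upper tail: alternative p > p0
print(z, round(p_value, 4))         # z = 2.0, p about 0.0228
```

Note that np = n(1-p) = 50 here, comfortably past the rule of thumb, so the normal approximation is trustworthy.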
Of course, coin flips are not too interesting. But X could also represent the number of cards a self-proclaimed psychic can identify, or the number of free throws a basketball player can shoot, or the number of votes a candidate will get.
Small Sample Size:
What if the np > 10 rule doesn't hold? Well, you can still get by using the binomial distribution itself. Here's an example from real life.
A psychiatric program is instituted under the theory that psychiatric intervention for the moderately depressed in an elderly population can lead to decreased medical costs. A randomized study is carried out at 9 hospitals around the country. At each hospital, subjects are randomly assigned to either the treatment group (special psychiatric diagnoses and care) or the control group (usual care). At the end of 1 year, costs are calculated at each hospital for each group of patients. Does the treatment program improve the probability that a hospital's costs will decrease?
As evidence, at each hospital it was recorded whether the treatment group was cheaper than the control group. So we can let X be a random variable that is 1 if the treatment group is cheaper, 0 otherwise. If the treatment has no effect on cost, then p, the probability that the treatment group will be cheaper at a hospital, is 1/2, and we should see about half of the hospitals with the treatment group costing the least. If the treatment works, then p > .5.
At the end of a year, 7 out of 9 hospitals had the treatment group cost less, that is x=7. (The lower-case x means that this was an observed value. Upper-case X means it's a random variable and unobserved.)
The null hypothesis is p = .5. The alternative is p > .5. The test statistic is X. The p-value, then, is P(X >= 7) = P(X = 7) + P(X = 8) + P(X = 9). You can calculate this with a calculator, or use tables in the back of the book. From the table (Table II), we get
P(X >= 7) = 1 - P(X <= 6) = 1 - .9102 = .0898.
(Note: the table only gives P(X <= x), not P(X >= x).)
Should we reject? If our significance level is 10%, then we would reject. If it were 5%, then we would not.
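The table lookup above can also be checked directly with the binomial formula:

```python
from math import comb

# Exact binomial p-value for the hospital example:
# P(X >= 7) with X ~ Binomial(9, 1/2) under the null.
p_value = sum(comb(9, k) * 0.5 ** 9 for k in range(7, 10))
print(round(p_value, 4))  # 0.0898, matching the table
```

With only 9 hospitals there's no need for tables or approximations -- the exact sum is three terms.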