Mann-Whitney

 

Advantages: This is a class of tests that do not require any assumptions on the distribution of the population. They are therefore used when you do not know, and are not willing to assume, what the shape of the distribution is. If you DO know, then you should use this information and bypass the nonparametric test.

Disadvantages: These tests have a lower power than parametric tests. This means that, if there really is a difference between two groups, these tests are less likely to find it. This should make intuitive sense, since there is always a penalty for ignorance (in this case, ignorance of the distribution), and that penalty usually makes things harder to estimate.

Best Used When: Because these tests are less powerful than parametric tests, and because the Central Limit Theorem lets us assume normality in most cases if the sample size is large, they are best used for small data sets. In fact, as you're about to see, they can be downright unwieldy when applied to large data sets.

Data: Let's set aside our risk data set for a moment. It's large enough (as we saw last time) for the CLT to take effect. Instead we'll use a data set from Chatfield, called cutshoots.ftm.

A botanist is comparing two methods for growing broad-bean plants. As a marker for the success of these methods, he's measuring the concentration of a certain chemical in the plants. 10 plants are raised from cuttings, and 10 are planted in soil and "rooted". Our question: does the chemical concentration differ?

Assumptions:

Data Collection: The observations are independent and randomly sampled from the population. (This last assumption is not so important for the test itself, but is important for being able to generalize the results to the population.)

Population: none.

Ingredients:

We will use a rather unusual test statistic: the sum of the ranks from the group with the smaller number of observations. (This is explained below.) We'll also need a table giving the probability distribution of this statistic.

Reference:

Mathematical Statistics and Data Analysis, John A. Rice, Duxbury Press, 1995.

The Statistical Sleuth, Ramsey and Schafer, Duxbury Press, 1997.

Logic:

Assign ranks to all of the observations pooled together. The smallest observation in the pooled data gets rank 1 and the largest gets rank N. (For the moment, suppose there are no ties.) In our data, N = 20. If the observations come from populations with the same distribution, then we should expect the "big" numbers, that is, the high ranks, to be spread throughout both groups, and the same for the small numbers. If the populations were very different -- for example, suppose that one group had no overlap with the other at all -- then all of the low ranks would be in one group and all of the high ranks in the other.

Now focus our attention on just one group. Usually it's easiest to choose the sample with the smaller number of observations, but if both have the same size, then it doesn't matter which group you choose. If we look at just that group (let's say it has n observations), then the null hypothesis says ranks were just randomly assigned to these observations. It is as if we reached into a bag of numbers 1 to N, pulled n out at random, and assigned them to our group. Once we've done this, we add the ranks together. Because there are only (N choose n) possible ways of doing this, we can calculate the probability of every possible sum. (At least in theory. In practice N choose n might be very big. For our data 20 choose 10 is 184,756. So we'd better use a computer.)
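To make this concrete, here is a minimal Python sketch (not part of the Fathom workflow) that enumerates all (N choose n) assignments and tabulates the exact distribution of the rank sum; the observed value 142 used at the end is the one computed in the Application section below.

from itertools import combinations
from math import comb

N, n = 20, 10                          # total number of ranks, size of the chosen group
counts = {}                            # counts[s] = number of subsets whose ranks sum to s
for subset in combinations(range(1, N + 1), n):
    s = sum(subset)
    counts[s] = counts.get(s, 0) + 1

total = comb(N, n)                     # 184,756 equally likely assignments
observed = 142                         # rank sum found in the Application section below
p_upper = sum(c for s, c in counts.items() if s >= observed) / total
print(total, p_upper)                  # exact upper-tail probability of a sum >= 142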

By looking at this distribution (the distribution of sums of ranks for a group of size n when there are N total ranks to be assigned), we know which values are "typical" and which are "atypical." The p-value is the probability of getting a sum as large as, or larger than, the one we observed.

In practice, we use tables to decide what is typical and atypical, but some computer packages also calculate this. (There are two other alternatives discussed below.)

The table I'm showing you is valid only for groups up to size 20. This table uses a modification of the test statistic. Let R be the sum of the ranks of the sample with the smaller sample size, and let this size be n. Let R' = n(N+1) - R. Then let R* = min(R, R'). Use R* as your test statistic. Reject if R* is smaller than the critical value given in the table.
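As a quick check of this recipe, here is a small Python sketch using the rank sum R = 68 computed for our data in the Application section below (with n = 10 and N = 20):

N, n = 20, 10                 # total observations and size of the chosen group
R = 68                        # sum of the ranks in that group (computed below)
R_prime = n * (N + 1) - R     # 10 * 21 - 68 = 142
R_star = min(R, R_prime)      # 68; compare this to the table's critical value
print(R_prime, R_star)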

If you have ties, then replace them with the "average" rank. (For example, if your data were 0.25, 0.25, 2.0, the ranks would be 1.5, 1.5, 3.) This gives approximate significance levels as long as there are not too many ties.
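If you want to check the averaged ranks outside Fathom, scipy's rankdata function uses this "average" convention for ties by default; a small sketch, assuming scipy is installed:

from scipy.stats import rankdata

print(rankdata([0.25, 0.25, 2.0]))    # tied values share the average rank: [1.5 1.5 3.]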


 

Application:

In many packages, this is an easy test to do. It is simply a matter of pushing the button that says "Rank-Sum Test," "Wilcoxon Rank-Sum Test," or "Mann-Whitney Rank-Sum Test." (Many names for the same thing.) DataDesk has a "Mann-Whitney U" button that does a very similar test. Fathom does not do this, and so we have to do it ourselves.
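(If you are working in Python rather than one of these packages, scipy offers the same test under the Mann-Whitney U name; here is a minimal sketch, where the argument names cuttings and rooted are placeholders for the two lists of concentration measurements.)

from scipy.stats import mannwhitneyu

def rank_sum_test(cuttings, rooted):
    # cuttings, rooted: the concentration measurements for the two groups
    # mannwhitneyu reports the U statistic, which is equivalent to the rank sum
    result = mannwhitneyu(cuttings, rooted, alternative='two-sided')
    return result.statistic, result.pvalue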

Step 1: Calculate the ranks of the entire sample

Open up the data "inspector" so that you can view the cases. You will see two attributes: cutting and concentration. We want to calculate a new attribute, which will be the ranks of the concentration values. To do this, click on the "new" button in the list of attributes. Give it a name. I suggest "ranks". Then double-click on the "formula" space to open the formula editor, and type

rank(concentration)

and hit "OK."

Step 2: Calculate the test statistic

Click on the "measures" tab. (Fathom calls all statistics "measures".) Click on the "new" button and give the measure a name. I suggest "ranksum". Double click on the formula space to open the formula editor. From within the formula editor, type

sum(ranks, cutting=1)

This sums up the values of the ranks, but only for those cases whose value for the 'cutting' attribute is 1.
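For readers not using Fathom, both steps can be sketched in Python with pandas; this assumes a DataFrame with the same two columns, cutting and concentration, and that the cutting group is coded 1 as in the Fathom formula.

import pandas as pd

def compute_ranksum(df):
    # df has columns 'cutting' (1 for the cutting group) and 'concentration'
    df = df.copy()
    df['ranks'] = df['concentration'].rank()           # average ranks for ties, as above
    return df.loc[df['cutting'] == 1, 'ranks'].sum()   # rank sum for the cutting group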

Result: In this case, you'll see that our ranksum is 142. Is this a large value, or a small value? We can get some insight into this question by asking what the biggest possible value is. The sum of all 20 ranks is 1 + 2 + ... + 20 = 210.

And if the two groups were very different, then the lower ten ranks, 1, 2, ..., 10, would be in one group and the upper ten, 11, 12, ..., 20, in the other. So the sum for a group of ten can never be bigger than 11 + 12 + ... + 20 = 155 and never smaller than 1 + 2 + ... + 10 = 55. Our observed value of 142 is quite close to the largest possible value.
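These bounds are easy to verify (a trivial Python check):

print(sum(range(1, 21)))     # 210, the sum of all twenty ranks
print(sum(range(1, 11)))     # 55, the smallest possible rank sum for a group of 10
print(sum(range(11, 21)))    # 155, the largest possible rank sum for a group of 10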

Finding the p-value: There are three methods. The first is to use a published table of the probability distribution of this statistic. If you can't find such a table, you have two choices: (1) simulation and (2) approximation.

Simulation:

We can get a good sense of how frequently certain sums occur in a random trial by simulating this experiment. Basically, take the numbers 1, 2, 3, ..., 20 and select 10 of them WITHOUT replacement. Add them up. This is your test statistic. Repeat this many times, and we'll see an approximation of the distribution that is calculated exactly in the published tables.
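If Fathom isn't handy, the same simulation can be sketched in a few lines of Python (a rough approximation of the sampling distribution, not the exact table):

import random

def simulate_ranksums(N=20, n=10, trials=10000):
    # repeatedly draw n of the ranks 1..N without replacement and record their sum
    return [sum(random.sample(range(1, N + 1), n)) for _ in range(trials)]

sums = simulate_ranksums()
p_hat = sum(s >= 142 for s in sums) / len(sums)   # estimated chance of a sum at least 142
print(p_hat)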

To do this in Fathom, select "sample cases" from the "Analyze" menu. Double-click on the collection that appears, and change it so that the animation is off, sampling is done without replacement, and the sample size is 10.

Note that Fathom automatically collected our test statistic (click on the "measures" tab). We need only repeat this operation 100 times or so to get a picture of the sampling distribution.

To speed this up, select "Collect Measures" under the "Analyze" menu. This will repeat the sampling experiment 5 times, saving the ranksum measure each time. Edit this so that the animation is off and the experiment is repeated 100 times. Also, set up a graph (a histogram) of the ranksum measure.

When you've repeated this experiment, you'll see that the distribution of the rank sum has a symmetric shape and that 142 is unusually large. (In fact, it occurs fairly rarely, so you might not even see any values that large.) We conclude that our observed value is unusually large, and therefore unlikely to have occurred by chance alone.

Normal Approximation: This works if both samples have at least 5 observations and there are few ties. Our test statistic is R: the sum of the ranks in the group with the smaller number of observations. The mean of the rank-sum statistic is the average of all N ranks times the size of the smaller group. The SD is the sample SD of all N ranks (both groups combined) multiplied by

sqrt (n1*n2/(n1 + n2)).

Then

Z = (R - mean)/SD

is approximately standard normal.

For our data, R = 68, AVG(all ranks) = 10.5, and SD(all ranks) = 5.91608. So:

Mean(R) = 10.5*10 = 105, SD = 5.91608*sqrt(10*10/20) = 13.228757

So Z = (R - 105)/13.229 is approximately a standard normal random variable. We observed z = (68 - 105)/13.229 = -2.797, and the probability of a value this far below zero is approximately p = 0.0026. (The null hypothesis is cast into doubt by values of Z far from zero in either direction; the two-sided p-value is about 0.005.)
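These numbers are easy to check in Python (a small sketch using scipy's normal distribution):

from math import sqrt
from scipy.stats import norm

N, n1, n2 = 20, 10, 10
R = 68                                # rank sum of the chosen group
mean_R = n1 * (N + 1) / 2             # 10 * 10.5 = 105
sd_R = sqrt(n1 * n2 * (N + 1) / 12)   # 13.2288..., the same as 5.91608 * sqrt(10*10/20)
z = (R - mean_R) / sd_R               # about -2.80
print(z, norm.cdf(z))                 # one-sided p-value, about 0.0026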

We conclude that the chemical concentration differs between the two growing methods.