
Homework 3 and 4
Due Friday, February 2, 2001

1. Measurement error is often modelled as following a normal distribution whose mean, mu, equals the "true" value of the thing being measured. That is, we let X be a random variable representing the observed outcome, and X has distribution N(mu, sigma). (Another way of writing the same model is X = mu + E, where mu is a constant equal to the true value and E is a random variable, usually written as a lower-case "epsilon", with distribution N(0, sigma).) Suppose we're measuring the length of something and X is N(mu, 0.1 mm).

a) Find the probability that your measurement will be within .05 mm of the truth.

b) Suppose you take 3 independent measurements and compute the average. Find the probability that the average will be within .05 mm of the truth.

c) How many measurements must you take to be 95% certain that the average will be within .05 mm of the truth?

d) Suppose we take the following measurements:
24.65, 24.78, 24.86, 24.60, 24.81
Find a 90% CI for the true length. Find a 95% CI.
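
Parts (a) through (c) all rest on one fact: the average of n independent N(mu, sigma) measurements is itself N(mu, sigma/sqrt(n)). The Python sketch below (Python is not part of the course software, and the function names are made up for illustration) shows one way to turn that fact into numbers. It assumes sigma = 0.1 mm is known, as stated in the problem, so the interval in (d) is a known-sigma "z" interval.

```python
from math import sqrt
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # the standard normal, N(0, 1)

def prob_within(tol, sigma, n=1):
    """P(|average of n measurements - mu| < tol) when each is N(mu, sigma)."""
    z = tol / (sigma / sqrt(n))  # tolerance expressed in standard errors
    return 2 * STD_NORMAL.cdf(z) - 1

def smallest_n(tol, sigma, certainty):
    """Smallest n for which prob_within(tol, sigma, n) >= certainty."""
    n = 1
    while prob_within(tol, sigma, n) < certainty:
        n += 1
    return n

def z_interval(data, sigma, level):
    """Confidence interval for mu, treating sigma as known."""
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    half_width = z * sigma / sqrt(len(data))
    centre = mean(data)
    return centre - half_width, centre + half_width

measurements = [24.65, 24.78, 24.86, 24.60, 24.81]
print(prob_within(0.05, 0.1, n=1))
print(smallest_n(0.05, 0.1, 0.95))
print(z_interval(measurements, 0.1, 0.90))
```

Searching upward from n = 1 is the simplest approach. Solving z * sigma / sqrt(n) <= tol for n by hand gives the same result.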

2. A "spinner" is a dial, like a clock-face, with a single loose hand that can be spun around and comes to rest at random. Imagine spinning such a spinner, and let Theta represent the angle, measured clockwise, that the hand makes with the vertical ("12 o'clock") direction after it comes to rest. (So if the hand points to "3 o'clock", then Theta would be 90 degrees.)
a) What's the probability distribution of Theta?
b) What's the expected value of Theta?
c) What's the probability Theta will be between 260 and 265 degrees?
d) Spin the spinner 9 times. What's the approximate probability the average of the Thetas will be between 260 and 265 degrees?
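
Part (d) is a central limit theorem question: the average of several spins is approximately normal even though a single spin is not. Here is a rough Python sketch of that idea (not course software; the function names are invented). It assumes the spinner is equally likely to stop at any angle from 0 to 360 degrees, and it uses a different interval (170 to 190 degrees) from the one in the problem, purely as an illustration. A small simulation is included so you can see how good the approximation is.

```python
import random
from math import sqrt
from statistics import NormalDist

def clt_prob_mean_between(lo, hi, n, a=0.0, b=360.0):
    """Normal approximation to P(lo < mean of n Uniform(a, b) draws < hi)."""
    mu = (a + b) / 2          # mean of a single uniform draw
    sd = (b - a) / sqrt(12)   # standard deviation of a single draw
    approx = NormalDist(mu, sd / sqrt(n))
    return approx.cdf(hi) - approx.cdf(lo)

def simulated_prob_mean_between(lo, hi, n, trials=50_000, seed=1):
    """Monte Carlo estimate of the same probability, for comparison."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        avg = sum(rng.uniform(0.0, 360.0) for _ in range(n)) / n
        if lo < avg < hi:
            hits += 1
    return hits / trials

print(clt_prob_mean_between(170, 190, n=9))
print(simulated_prob_mean_between(170, 190, n=9))
```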

3. Download the "ozone" data file. We'll learn more about this data set later. For now, we need to know only that this consists of about 140 measurements of ozone taken on different days in Upland, CA. (There are other variables in the data set as well.)
(Hint: this link is a "lisp" file. To download it, save it to your hard drive: choose "Save As..." from the Netscape menu (the steps are slightly different in Internet Explorer). Be sure to save it as a text file. Then, from ViSta, choose "File: New Data" and select the file you saved. NOTE: all versions of the data, including the "raw" versions, are held in www.stat.ucla.edu/~rgould/datasets)

a) Describe the distribution of the ozone observations. Which would be a more suitable summary of the center of the distribution and why: the sample mean or the sample median?
b) ViSta provides a confidence interval for the mean. Calculate a 95% confidence interval. (Under "Analyze: Univariate Analysis", select the variable you are interested in from the upper left corner of the box, and choose the confidence level in the lower right corner. To actually see the result, you must then select "Models: Report Model" from the menu bar.)

It's worth noting that, because we know nothing about how the sample was collected, we do not know what population we are making inferences about. So when asked to calculate the mean ozone level, it's worth asking "mean of what?" For example, if we intend to estimate the mean annual ozone level, then our sample should consist of measurements taken at random times throughout the year. But for the sake of learning how to calculate confidence intervals, let's just assume that there is some population from which these observations are a random sample.

c) Use the bootstrap to compute a 95% confidence interval for the mean. How does it compare to the answer in (b)?
Hint: You can download the lisp code to do a simple bootstrap procedure. Just open it, "copy" it to your clipboard, then "paste" it into the Listener window in ViSta. See the class homepage for a link that explains basic bootstrapping in a little more detail.

d) Use the bootstrap to compute a 95% confidence interval for the median.

e) Create a new variable as follows: (def lozone (log ozone)). Compare its distribution to the distribution of ozone. Find a 95% confidence interval for the mean of log(ozone).
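
For parts (c) and (d), the bootstrap idea is: resample the data with replacement many times, recompute the statistic each time, and read the interval off the middle 95% of the recomputed values. The Python sketch below illustrates this "percentile" version. It is not the lisp code from the class page and may differ from it in detail. The function name is invented, and the data are a made-up, right-skewed stand-in, NOT the real ozone measurements.

```python
import random
from statistics import mean, median

def bootstrap_ci(data, statistic, level=0.95, reps=2000, seed=0):
    """Percentile bootstrap confidence interval for `statistic`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic(rng.choices(data, k=n))  # resample n values with replacement
        for _ in range(reps)
    )
    alpha = 1 - level
    lower = estimates[int(reps * alpha / 2)]
    upper = estimates[int(reps * (1 - alpha / 2)) - 1]
    return lower, upper

# Hypothetical stand-in for the ozone column (about 140 skewed values).
data_rng = random.Random(42)
fake_ozone = [data_rng.lognormvariate(2.0, 0.6) for _ in range(140)]

print(bootstrap_ci(fake_ozone, mean))
print(bootstrap_ci(fake_ozone, median))
```

The same function handles the mean and the median because the statistic is passed in as an argument. That flexibility is the main attraction of the bootstrap here.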

4. For over 100 years, conventional wisdom held that "normal" body temperature was 98.6 degrees Fahrenheit. This was based on the research of one Carl Wunderlich. Download the bodytemp data set, which consists of a number of measurements of body temperature on a sample of men and women. These data are derived from a dataset presented in Mackowiak, P. A., Wasserman, S. S., and Levine, M. M. (1992), "A Critical Appraisal of 98.6 Degrees F, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich," _Journal of the American Medical Association_, 268, 1578-1580. The data were constructed to match as closely as possible the histograms and summary statistics presented in that article.

VARIABLE DESCRIPTIONS:
Columns    Variable
1 - 5      Body temperature (degrees Fahrenheit)
9          Gender (1 = male, 2 = female)
14 - 15    Heart rate (beats per minute)

a) In your opinion, do these data refute the hypothesis that "normal" is 98.6 F? Explain, and in your explanation include the following: the average of the data is NOT 98.6. But in a random sample we wouldn't expect it to be 98.6. So if you think that normal temperature is not 98.6, why do you think the difference is due to more than just sampling variation? Support your reasoning with graphics and appropriate summary statistics whenever possible.

b) Extra Credit: Were these data originally recorded on the Fahrenheit or the Celsius scale? Why?

5. (From Chance Encounters, Wild & Seber): Were Cavendish's determinations of the mean density of the earth biased? Here are 23 measurements from one particular phase of his study:

5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.34, 5.79, 5.10, 5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.75, 5.68, 5.85
The currently accepted mean density of the earth is 5.517 g/cubic cm. The average of these data is 5.4835 and the sample standard deviation is 0.1904.

Is there evidence that Cavendish's measurements were biased? Choose a significance level and state your conclusion.
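
The usual starting point for a question like this is a one-sample t statistic, which measures how many standard errors the sample average lies from the hypothesized value. The short Python sketch below (not course software; the function name is invented) computes it from the summary statistics quoted above. Choosing the significance level and looking up the matching t critical value or p-value is left to you.

```python
from math import sqrt

def one_sample_t(xbar, s, n, mu0):
    """t statistic for H0: mean = mu0, computed from summary statistics."""
    standard_error = s / sqrt(n)
    return (xbar - mu0) / standard_error

# Summary statistics quoted in the problem statement.
t_stat = one_sample_t(xbar=5.4835, s=0.1904, n=23, mu0=5.517)
print(f"t = {t_stat:.3f} on {23 - 1} degrees of freedom")
```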

6. For the Cavendish study:
a) Find the power of the hypothesis test you used. You can do so using a web-based "calculator" found at http://home.stat.ucla.edu/calculators/powercalc/ and choosing the one-sample, normal test. It will ask you for the "mean under the alternative hypothesis", among other things. This "mean" requires some imagination: imagine that, for an instant, the truth was revealed to you, and you knew that the truth was different from what the null hypothesis said, and you knew what value it was. Be sure to say which value you used. Try it with at least two different values, and note that the power increases as the distance between the mean of the null hypothesis and the mean of the alternative hypothesis increases.
b) Interpret this result for an imaginary client who does not know much statistics.
c) What sample size do you need to get a power of 80% for the value of the alternative mean you used in part (a)? (Use one of the calculators, again.)
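
If you are curious what a power calculator does internally, here is a rough Python sketch for a one-sample normal ("z") test. It is not the code behind the web calculator, and the function names are invented. It assumes a two-sided test, treats the sample standard deviation 0.1904 as if it were the known sigma, and uses 5.4 as a purely hypothetical "mean under the alternative hypothesis".

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal

def power_two_sided_z(mu0, mu_alt, sigma, n, alpha=0.05):
    """Power of the two-sided one-sample z test of H0: mean = mu0."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    # Where the test statistic is centred if the true mean is mu_alt.
    shift = (mu_alt - mu0) / (sigma / sqrt(n))
    # Reject when the statistic falls below -z_crit or above +z_crit.
    return Z.cdf(-z_crit - shift) + (1 - Z.cdf(z_crit - shift))

def n_for_power(mu0, mu_alt, sigma, target=0.80, alpha=0.05):
    """Smallest sample size whose power reaches `target`."""
    n = 2
    while power_two_sided_z(mu0, mu_alt, sigma, n, alpha) < target:
        n += 1
    return n

# 5.4 is a hypothetical alternative mean, chosen only for illustration.
print(power_two_sided_z(5.517, 5.4, 0.1904, 23))
print(power_two_sided_z(5.517, 5.3, 0.1904, 23))
print(n_for_power(5.517, 5.4, 0.1904))
```

Moving the alternative mean further from 5.517 raises the power, which is the pattern part (a) asks you to notice.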

7. (From Chance Encounters): It is generally believed that red-cockaded woodpeckers require or preferentially select old-age pine trees and stands for constructing nest and roost cavities. This hypothesis is important since the need for suitable habitat for nesting and roosting is vital to the continued existence of the species. The information for this exercise came from part of a study conducted by DeLotelle and Epting (1988). Trees were sampled from stands of trees occupied by colonies of woodpeckers. Trees with current or abandoned woodpecker holes were termed cavity trees. Untouched trees in a defined neighborhood of cavity trees were called colony trees. The ages (in years) of cavity trees and colony trees were compared, giving these summary statistics:
Cavity: n = 54, xbar = 104.1, standard-deviation = 24.1
Colony: n = 143, xbar = 83.6, standard-deviation = 38.3

This is a different structure from the tests we discussed in class: we are comparing two independent samples. You should read about such tests in either an introductory book or pp. 122-125 of Statistical Methods in the Atmospheric Sciences, Wilks. If you have, instead, the multivariate geostatistics book, and you know who you are, then you should probably borrow a book from me.

a) Is there evidence that the mean age is different for the two types of trees?
b) State all assumptions made to carry out the test in (a).
c) Note that there are two types of such tests: one assumes the variances of both groups are equal, the other does not. Which one would you expect to give the smaller p-value? Why? Carry out the "other one" (the one you did not do in (a)) and see.
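
The two versions of the test in (c) differ only in how the standard error and the degrees of freedom are computed. The Python sketch below (not course software; the function names are invented) computes both test statistics from the summary statistics given above. Turning a statistic into a p-value still requires a t table or software, and the interpretation is up to you.

```python
from math import sqrt

def welch_t(m1, s1, n1, m2, s2, n2):
    """Unequal-variance (Welch) t statistic and approximate degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2  # squared standard error of each mean
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Equal-variance (pooled) t statistic and degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    t = (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# (mean, standard deviation, sample size) from the problem statement.
cavity = (104.1, 24.1, 54)
colony = (83.6, 38.3, 143)
print(welch_t(*cavity, *colony))
print(pooled_t(*cavity, *colony))
```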

8. (From Statistics, Freedman, Pisani, Purves)
In 1969, Dr. Spock came to trial in the Boston Federal Court House. The charge was conspiracy to violate the Military Service Act. "Of all defendants, Dr. Spock, who had given wise and welcome advice on childrearing to millions of mothers, would have liked women on his jury", one source reported. The jury was drawn from a panel of 350 persons selected by the clerk. This panel included only 102 women, although a majority of the eligible jurors in the district were female. At the next stage in selecting the jury to hear the case, the judge chose 100 potential jurors out of these 350 persons. His choices included 9 women.
a) 350 people are chosen at random from a large population, which is over 50% female. How likely is it that the sample includes 102 women or fewer?
b) Extra Credit: 100 people are chosen at random (without replacement) from a group consisting of 102 women and 248 men. How likely is it that the sample includes 9 women or fewer?
c) What do you conclude?
Hint: for (b), the distribution of X, the number of females in a sample chosen without replacement from N objects, each of which is either female or not, is the hypergeometric:
P(X = k) = (r choose k) * (N-r choose m-k) / (N choose m).
The phrase "(n choose m)" means n!/((n-m)! m!). Here N is the number of objects in the population, r is the number of females in the population, m is the number of objects chosen for the sample, and k is the number of females in the sample.
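
The hypergeometric formula in the hint is tedious by hand but easy for a computer, because the factorials involved are enormous. As an illustration, here is a small Python sketch (not course software; the function names are invented) that evaluates the formula exactly as written and sums it to get "k or fewer". The 10-object example is made up as a check you can verify by hand.

```python
from math import comb

def hypergeom_pmf(k, N, r, m):
    """P(X = k): k females in a sample of m drawn without replacement
    from N objects, r of which are female."""
    return comb(r, k) * comb(N - r, m - k) / comb(N, m)

def hypergeom_cdf(k, N, r, m):
    """P(X <= k), summing the pmf over the attainable counts up to k."""
    lowest = max(0, m - (N - r))  # sample cannot hold more non-females than exist
    highest = min(k, m)           # sample cannot hold more than m females
    return sum(hypergeom_pmf(j, N, r, m) for j in range(lowest, highest + 1))

# Hand-checkable toy case: 10 objects, 4 female, sample of 3.
print(hypergeom_pmf(0, N=10, r=4, m=3))  # (6 choose 3) / (10 choose 3)

# The setting of part (b): 102 women among 350 people, sample of 100.
print(hypergeom_cdf(9, N=350, r=102, m=100))
```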

9. Return to the "ozone" data set. (You can view the data set here, or you can look at the xlisp-object here.) The variables have the following interpretations. (And all you atmospheric scientists feel free to correct my misunderstandings):
ozone: (ppm), reading in Upland, CA on one day in the year.
temp: inversion base temperature, degrees Fahrenheit. This is, I believe, the temperature at the altitude at which the inversion layer begins.
inversionht: (feet) altitude of inversion base
pressure: Daggett pressure gradient (mm Hg). I believe this is the difference in air pressure between Daggett Air Force Base, in the desert, and Upland, CA.
visibility: miles
height: Vandenberg 400 millibar height (meters). No idea what this is.
humidity: percent
temp2: Sandburg Air Force Base temp, measured in degrees C and converted to Fahrenheit. (Warning: it's possible that the two temperature columns got switched, but I think this is correct.)
windspeed (mph)

Plot each variable against ozone (ozone on the y-axis) and interpret. Does it make sense? Do the trends look linear?
One command that might help: (scatterplot-matrix (list ozone temp inversionht pressure)). You can make the list longer. This gives a matrix of scatterplots for all possible pairs of the listed variables.