Name:
SID:
Write clearly and show all necessary work. All questions are worth 5 points, unless otherwise marked.
1. An article in the LA Times in 1997 reports that: "A new study in the May 1 New England Journal of Medicine provides some of the strongest evidence yet that regular exercise helps protect women from breast cancer. The research, conducted in Norway, found that women who exercise at least four hours a week have a breast cancer risk about one-third lower than usual."
a. Is this most likely a controlled experiment or an observational study? Whichever your choice, be sure to explain why this study has the characteristics that make it a controlled experiment or an observational study.
This is most likely an observational study, because the
chief characteristic of an observational study is that the patients or
subjects sort themselves into treatment groups. Here the treatment
group is the group that exercises 4 or more hours per week, and the "control"
group is the group who exercises less. It's unlikely this exercise
regimen could be enforced for long periods of time, so most likely the
researchers just observed the exercise habits of the subjects.
b. A woman in Norway reads this and decides to exercise at least four hours a week to lower her breast cancer risk. Is it valid to conclude from this study that women who exercise four hours a week will see their breast cancer risk decrease?
No. This was an observational study, so we cannot eliminate
the possibility that confounding factors might have caused the difference
in the two groups. For example, women who exercise frequently might
also take particular care with their diet. And this diet might protect
them from cancer. (Put slightly differently, women concerned with
their diet might also exercise frequently.)
2. So many times, it seems that scientists tell us that something we enjoy is bad for us. That's why the story involving wine and mortality is so refreshing. Finally, perhaps one of our vices will actually be good for us! Below is a boxplot of the number of liters of wine consumed per person per year for 18 "developed" countries. Full details can be found in
A. S. St. Leger, A. L. Cochrane, and F. Moore, "Factors associated with cardiac mortality in developed countries with particular reference to the consumption of wine," Lancet (June 16, 1979): 1017-20.
In case you are thinking of moving, the outlier up around 80 is actually two outliers. France and Italy tied for first place at 79.9 liters per person per year.
a) Warm Up Question: 79.9 liters per person per year: How many liters per day is that? (365 days/year.)
79.9/365 = .219 liters per day
b) Approximately what is the median wine consumption?
The true median is 5.9, but you would only know this
from looking at the data. From the box plot, anything in the range
of 5 to 8 would be okay.
c) Is the average wine consumption greater than, less than, or about the same as the median wine consumption for these countries? Explain.
The presence of the outliers pulls the average up, so
the average would be greater than the median. Also, even were the
outliers removed, the right-skewed shape of the distribution would mean
that the average would be higher than the median. In fact, the average
is 16.4.
d) Sketch what the histogram for wine consumption could look like.
This is hard to do on the web. Basically, we were looking for a histogram that had about 50% of its area between 0 and 7 or so, and then tapered off quickly, with a small bump out near 80.
e) If we removed France and Italy from the data set, how would the average be affected? Will it change more or less or about the same amount as the median?
Removing them changes the average a lot, making it quite
a bit smaller. The median is affected only slightly and drops just
a little bit.
(The average changes from about 16 to about 8.5.
The median changes from 5.9 to about 5.5.)
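As a quick arithmetic check, the average without France and Italy can be backed out from the summary figures alone. This is a minimal sketch that assumes the reported average of 16.4 over 18 countries is accurate enough to recover the total; the variable names are illustrative and not part of the exam. The median cannot be recomputed this way, since it depends on the individual data values.

```python
# Summary figures quoted in the answers above.
n_countries = 18
mean_all = 16.4          # reported average, liters per person per year
outlier_value = 79.9     # France and Italy are tied at this value
n_outliers = 2

# Back out the total, drop the two outliers, and re-average.
total_all = mean_all * n_countries
total_without = total_all - n_outliers * outlier_value
mean_without = total_without / (n_countries - n_outliers)

print(f"average without the two outliers: {mean_without:.1f}")
```

Two countries out of 18 account for more than half of the total, which is why the average moves so much while the median barely changes.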
3. Same data set. Below is a scatterplot of heart mortality and wine consumption.
a) Does this graph suggest a relationship between heart mortality and wine consumption? Describe it in words.
Yes. The relation looks like exponential decay: mortality
declines very steeply at low levels of wine consumption, and then
the curve gradually flattens out.
b) The correlation between heart mortality and wine consumption is -0.7456. Interpret this.
This one was a little tricky. The correlation is NOT a good measurement of linearity. What I mean by this is that you can't use the correlation to decide whether or not a relationship is a linear one. There are some non-linear relationships that produce scatterplots with higher correlations than other linear relationships do. But, assuming the relationship IS linear, the correlation roughly speaking measures the tendency of the points on the scatterplot to stick close to a straight line. Another way of thinking about it is that the correlation is measuring the extent of a linear component in the data. (You can think of a quadratic y = a + bx + cx^2 as a linear component y = a + bx plus a quadratic component cx^2. If the linear component is "strong", you might get a high correlation.)
The best answer here is that the negative sign picks up
on the fact that most countries with above average wine consumption tend
to have below average mortality, hence a negative association. But
the value of the correlation doesn't really help us much, since the relation
is non-linear.
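The point that correlation does not measure linearity can be illustrated with made-up numbers. The sketch below is only an illustration: the data are hypothetical (not the wine data), and `pearson_r` is a helper written here for the purpose. It compares a perfectly curved relationship with a linear relationship that has a lot of scatter.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = list(range(1, 11))

# Hypothetical curved relationship: y = x^2 exactly, no scatter at all.
curved = [x ** 2 for x in xs]

# Hypothetical linear relationship y = x with large alternating errors.
errors = [4 if x % 2 else -4 for x in xs]
noisy_linear = [x + e for x, e in zip(xs, errors)]

r_curved = pearson_r(xs, curved)
r_noisy_linear = pearson_r(xs, noisy_linear)

print(f"curved, no scatter:   r = {r_curved:.2f}")
print(f"linear, much scatter: r = {r_noisy_linear:.2f}")
```

The curved data give a correlation of about 0.97 and the noisy linear data about 0.48, so a high correlation by itself does not show that a straight line is the right description.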
Shown below is a graph with the Log of mortality and the Log of wine consumption plotted instead of their actual values. (The computer just took the log of each observation.) (The computer mislabels the axes. Should say "Log of Mortality" and "Log of Wine".)
And here are the summary statistics for the log of these values:
Data set = Wines, Summary Statistics
Variable         N    Average   Std. Dev   Minimum   Median   Maximum
log[Wine]        18   2.1716    1.0476     1.0296    1.775    4.3294
log[Mortality]   18   1.7833    0.43351    0.74194   1.8707   2.3224
Data set = Wines, Sample Correlations
                 log[Wine]   log[Mortality]
log[Wine]        1.0000      -0.8593
log[Mortality]   -0.8593     1.0000
Questions appear on the next page.
d) Find the regression line for log(Mortality) and log(Wine)
b = r(sy/sx) = (-0.8593)*(0.43351/1.0476) = -0.3556
a = ybar - b xbar = 1.7833 - (-0.3556)*(2.1716) = 2.5555
yhat = 2.5555 - 0.3556 x, where x = log(Wine) and yhat = predicted log(Mortality)
Note that the correlation is much stronger now that we've transformed the data to make it more linear.
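The slope and intercept can be checked numerically from the summary table. This is a minimal sketch using the standard formulas b = r(sy/sx) and a = ybar - b xbar, with sx and sy taken from the Std. Dev column; the function name `predict_log_mortality` is an illustrative choice, not something from the exam.

```python
# Summary statistics from the table: x = log[Wine], y = log[Mortality].
r = -0.8593
x_bar, s_x = 2.1716, 1.0476
y_bar, s_y = 1.7833, 0.43351

# Least-squares slope and intercept from the summary statistics.
slope = r * s_y / s_x
intercept = y_bar - slope * x_bar

def predict_log_mortality(log_wine):
    """Predicted log(Mortality) for a country with the given log(Wine)."""
    return intercept + slope * log_wine

print(f"slope     = {slope:.4f}")
print(f"intercept = {intercept:.4f}")
```

A useful sanity check is that the line passes through the point of averages: plugging in x_bar returns y_bar.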
e) Interpret the regression line.
Be careful here! Each point represents a country!
So your interpretation should reflect this. The slope means that
countries with higher wine consumption tend to have lower heart-related
mortality rates. A country whose log(wine consumption) is one unit
higher than another country's will, on average, have a log(mortality) lower
by 0.3556 units. You can NOT say that a country that increases its wine
consumption will tend to see its mortality decrease, because the data did not
follow any country as its consumption changed. (Similarly, when
you have weight on the y axis and height on the x axis, it doesn't make
sense to talk about "increasing your height.")
f) The United States has a high heart mortality rate. The Wine Is Nutritional Organization (WINO) is launching an advertising campaign designed to increase wine consumption across the US. By how much would heart-related mortality be lowered if we increased our per-capita consumption by log(10 liters per person per year)? Explain.
Almost (but not quite) everyone fell for this. There
is no evidence that if the US changes its wine consumption it will
have an effect on heart mortality. This was an observational study (it
couldn't be anything else), and so the relation could be explained by a
variety of confounding factors. This question cannot be answered
from the data presented.
MORE
4. Four balls are put into a box. Three balls are red and have the numbers
1, -2, 4 on them. The fourth ball is blue and has the number -2 on it. A ball is selected at random. You win the amount on the ball. (So if -2 is selected, you lose 2 dollars.) Let A be the event that you get a positive number. Let B be the event that the ball is red.
a) Are A and B independent? Show it.
b) Let X represent the amount you win. Write the
pdf for X. (You should put it in a table.)
c) What's the expected value of X?
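No worked answers are given for question 4 here. One way to check an answer is to enumerate the four equally likely balls directly. The sketch below is only a sketch: it assumes each ball is equally likely to be drawn, as the question states, and the variable names are illustrative.

```python
from fractions import Fraction

# The four balls as (color, number); each is drawn with probability 1/4.
balls = [("red", 1), ("red", -2), ("red", 4), ("blue", -2)]
p_each = Fraction(1, len(balls))

# A = positive number, B = red ball.
p_a = sum(p_each for color, value in balls if value > 0)
p_b = sum(p_each for color, value in balls if color == "red")
p_a_and_b = sum(p_each for color, value in balls
                if value > 0 and color == "red")

# A and B are independent exactly when P(A and B) = P(A) * P(B).
independent = (p_a_and_b == p_a * p_b)

# Probability table for X (the amount won), then its expected value.
table = {}
for _, value in balls:
    table[value] = table.get(value, 0) + p_each
expected_value = sum(value * p for value, p in table.items())

print("P(A) =", p_a, " P(B) =", p_b, " P(A and B) =", p_a_and_b)
print("independent:", independent)
print("table for X:", dict(sorted(table.items())))
print("E[X] =", expected_value)
```

Using exact fractions instead of floating-point numbers makes the independence comparison reliable.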
END