Homework 1 Discussion
Rather than line-by-line solutions, I'm going to discuss some common
themes that came up while grading your papers.
First, one thing I hope this class begins to teach -- and that your
undergraduate education will teach -- is curiosity. It is my
belief that curiosity is not necessarily a natural emotion, but needs
to be taught, and learning about a variety of ways of seeing and
understanding the world is one way of learning it. All that by
way of saying that although sometimes the book is rather terse and
says "make a plot", I hope you will also look at it more closely and
wonder what it is that this plot teaches us.
In general: (1) Write full, complete sentences. Try to
avoid words like "this" and "they" unless it is very clear what you are
referring to. For example, "They are normally distributed" is not
nearly as good as "The possums' ages are normally distributed."
(2) Put comments and graphs together. It is very hard to follow
your interpretations if the graphs are many pages away.
Possums data: The earconch distribution is bimodal.
(Remember, when looking at histograms or distributions, look for shape,
center, and spread). When we see bimodality, we try to find
reasons, and one is often that we've combined two populations
together. It is natural, therefore, to separate by gender.
However, the histograms of earconch by gender are still bimodal.
Very few people commented on this fact. The bimodality within each
gender suggests that some other variable is responsible. Also note:
boxplots do NOT give you a sense of the number of modes. They merely
tell you the median and the spread and some crude notions of shape.
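To see why a boxplot cannot show the number of modes, here is a minimal sketch in Python (standard library only). Both data sets are invented for illustration and are not the possum data; the function names are my own. The two sets share the same five-number summary, which is all a boxplot draws, yet one has a single peak and the other has two.

```python
from collections import Counter
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot draws."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

def text_histogram(values):
    """Print one row of stars per whole-number value."""
    counts = Counter(values)
    for v in range(min(values), max(values) + 1):
        print(f"{v:2d} | " + "*" * counts[v])

# Invented data (NOT the possum measurements).
one_peak = [1, 3] + [4] * 3 + [5] * 7 + [6] * 3 + [7, 9]
two_peaks = [1] + [4] * 6 + [5] * 3 + [6] * 6 + [9]

for name, data in [("one_peak", one_peak), ("two_peaks", two_peaks)]:
    print(name, "five-number summary:", five_number_summary(data))
    text_histogram(data)
```

The summaries printed for the two sets are identical, so their boxplots would be identical too; only the histograms reveal the dip in the middle of the second set.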
O-rings: There's some history here which was not given in the HW
problem but makes this more interesting. The first space shuttle
to be launched with a civilian was to be launched in colder than normal
temperatures. NASA wanted evidence that it was safe. The
first plot you were asked to make (with the subset of the full data
set) was the plot the engineers considered in order to make the
decision. You'll note that this plot suggests that safety and
temperature do not seem to be related. Tragically, had they
looked at the full data set, they would have seen that failures were
associated almost exclusively with lower temperatures. The launch
was approved and the shuttle exploded after take-off, killing all on
board. You should provide both plots and comment on what
the plots say about safety. Many people commented on the relation
between temperature and failure, but of course failure is a way of
assessing safety. Also, many people gave "left to right"
interpretations: "as temperatures increase, the chance of failure
decreases." Technically, this is true, but as you can see from
the historical context, what we really want to know is more
"right-to-left": what happens at low temperatures?
Density Curves vs. histograms: I think most of you saw the main
issue. The shape of a histogram depends on where the cut-points
are for the bins. This is why we often look at several different
histograms of the same data, to get a sense for what different shapes
are plausible. The density curve estimates a "smooth"
histogram. In many ways it provides more detail in that it helps
us see how relative frequencies change in a continuous fashion, while a
histogram suggests that there's no change at all within a bin.
However, as some of you pointed out, sometimes there's value in knowing
the counts in a bin, or in treating even continuous variables as if
they were categorical and comparing one bin to another. One word
of caution, though: this estimated density curve also depends on
several "tuning parameters", one of which is called the bandwidth and
is directly analogous to the bin width of a histogram. Change the
bandwidth and the shape of the density curve will change. It is a bit
of an art, trying to decide which bandwidth is "best".
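Both effects can be sketched in a few lines of Python (standard library only). The data are invented, the function names are my own, and the density estimate assumes a Gaussian kernel, which is one common choice rather than necessarily what your software uses. The first part moves the histogram cut-points by half a bin; the second counts the peaks of the density curve at a narrow and a wide bandwidth.

```python
import math

def bin_counts(values, start, width, nbins):
    """Count values in bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * nbins
    for v in values:
        k = math.floor((v - start) / width)
        if 0 <= k < nbins:
            counts[k] += 1
    return counts

def gaussian_kde(values, x, bandwidth):
    """Density estimate at x: the average of one normal bump per data point."""
    n = len(values)
    total = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
    return total / (n * bandwidth * math.sqrt(2 * math.pi))

def count_peaks(values, bandwidth, lo, hi, steps=400):
    """Count local maxima of the density curve on an evenly spaced grid."""
    ys = [gaussian_kde(values, lo + (hi - lo) * i / steps, bandwidth)
          for i in range(steps + 1)]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

# Invented data with two clusters (NOT homework data).
data = [1.6, 1.8, 2.0, 2.2, 2.4, 4.6, 4.8, 5.0, 5.2, 5.4]

print("bins starting at 0.0:", bin_counts(data, 0.0, 1.0, 7))
print("bins starting at 0.5:", bin_counts(data, 0.5, 1.0, 7))
print("peaks, bandwidth 0.3:", count_peaks(data, 0.3, 0.0, 7.0))
print("peaks, bandwidth 2.0:", count_peaks(data, 2.0, 0.0, 7.0))
```

Same data, same bin width, and the histogram still changes shape when the cut-points move. Likewise the narrow bandwidth shows two peaks while the wide one smooths them into a single peak.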
Heights of Stats 11 men: I made a big mistake. I warned you
about outliers that did not exist. This was because I was looking
at the women's heights when in fact I should have looked at the
men. Two women reported their heights as 300", or 25 feet.
This was clearly a mistake (either on their part or the typist's
part). Had there been 25-foot-tall people in the class, I would
have noticed. Hence, any comments about the distribution of
women's heights should be made after removing these outliers.
(This is one of the few cases in which it is clear that the outliers
should be removed.) The men's heights, however, did not
have outliers. Many of you said the distribution was normal, despite
showing histograms that looked left-skewed. When making claims
about normality, also use the qqnorm function, which is a better way
of assessing normality than eyeballing a histogram. In fact, the
men's histogram has another interesting feature which is best seen if
you make the bins only 1" wide. Note that there are very few 71"
tall men, and quite a few 72" tall men. Why? Well, 72/12 =
6, and so it seems there's a tendency to round up to six feet if
you're a man and close to six feet. In fact, in the default
histogram (that has slightly wider bin widths), there's a peak at six
feet with only two observations over it.
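A tally by whole inch is the same thing as a histogram with 1" bins. Here is a minimal Python sketch of that tally; the heights are hypothetical values I made up to mimic the pattern described above, not the class data, and the function name is my own.

```python
from collections import Counter

def tally_by_inch(heights_in):
    """Count how many reported heights fall at each whole inch."""
    counts = Counter(round(h) for h in heights_in)
    return {inch: counts[inch] for inch in range(min(counts), max(counts) + 1)}

# Hypothetical reported heights in inches (NOT the class data).
reported = [66, 67, 68, 68, 69, 69, 69, 70, 70, 70, 70,
            71, 72, 72, 72, 72, 72, 73, 74]

for inch, n in tally_by_inch(reported).items():
    feet, rest = divmod(inch, 12)  # 72 inches is 6 ft 0 in
    print(f"{inch} in = {feet} ft {rest:2d} in | " + "*" * n)
```

With bins wider than 1", the 71" and 72" counts can land in the same bar, and the gap next to the six-foot spike disappears.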
Speed-of-light: the assumptions of a t-test in this context
are that the data are normally distributed and independent. The
null hypothesis is that the true speed of light is 734.5 (in
appropriate units for these data). The alternative is that it is
NOT 734.5. If you simply put all of the data into the
hopper, you will reject the null hypothesis. Next, however, the
problem tells you to consider the data in "batches", in which each
batch consists of ordered trials. Now you see that the first
batch, in particular, has extremely large variance and a much higher
median than the other batches. This suggests that there might have
been some sort of learning curve going on with the experimenters, and
so perhaps the first batch is not as trustworthy as subsequent
batches. (In other words, the independence assumption could be
violated; knowledge of which batch the data come from tells us
something about the result.) So what do you think would happen if
we eliminated the first batch from the t-test? One further
note: It will NEVER be sufficient to state assumptions and then
plow on regardless of whether they are true. Once you've said
that you're assuming normality, CHECK it. (Many of you did, but
some did not.)
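The mechanics of the comparison can be sketched in Python (standard library only). The batches below are invented numbers, so the output says nothing about what happens with the real data; only the null value 734.5 comes from the problem. The sketch computes the one-sample t statistic, t = (mean - 734.5) / (s / sqrt(n)), once with every batch and once with the first batch left out. Turning t into a p-value needs the t distribution with n - 1 degrees of freedom, which is not done here.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(values, mu0):
    """One-sample t statistic: (sample mean - mu0) / (s / sqrt(n))."""
    n = len(values)
    return (mean(values) - mu0) / (stdev(values) / sqrt(n))

# Invented batches (NOT the homework data); batch 1 is made noisier on purpose.
batches = {
    1: [650, 1000, 980, 880, 1070],
    2: [800, 830, 790, 810, 820],
    3: [780, 800, 810, 790, 820],
}
MU0 = 734.5  # null-hypothesis value from the problem

all_values = [v for batch in batches.values() for v in batch]
without_first = [v for k, batch in batches.items() if k != 1 for v in batch]

# Look at each batch separately before pooling anything.
for k, batch in batches.items():
    print(f"batch {k}: mean = {mean(batch):.1f}, sd = {stdev(batch):.1f}")

print(f"t with all batches:  {t_statistic(all_values, MU0):.2f}")
print(f"t without batch 1:   {t_statistic(without_first, MU0):.2f}")
```

Printing the per-batch mean and standard deviation first is the point: it is the check on the assumptions, done before the test rather than after.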
Writing Functions: Many of you were a bit shaky about whether
your function was "correct". The nice thing about computers is
that you can always check. Run the function and see if it does what you
want it to!
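One way to check is to run the function on a case small enough to work by hand, and then compare it with a built-in you already trust. A minimal Python sketch, where `my_sd` is a hypothetical stand-in for whatever function you wrote:

```python
from math import isclose, sqrt
from statistics import stdev

def my_sd(values):
    """Sample standard deviation, written out by hand."""
    n = len(values)
    m = sum(values) / n
    return sqrt(sum((v - m) ** 2 for v in values) / (n - 1))

# Check 1: a case small enough to do by hand.
# For 2, 4, 6 the mean is 4, the squared deviations sum to 8,
# and 8 / (3 - 1) = 4, so the standard deviation is 2.
assert isclose(my_sd([2, 4, 6]), 2.0)

# Check 2: agree with a trusted built-in on other data.
data = [3.1, 4.7, 5.0, 6.2, 9.9]
assert isclose(my_sd(data), stdev(data))

print("my_sd passed both checks")
```

If either check fails, the assert stops the script at that line, which tells you where to start looking.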
qqnorm: this is a very handy function, but I
understand that some of you might not know what it does. By all
means ask if you want to learn about what a quantile is and why this
function works. A difficulty of this class is that you have
varied backgrounds, and so I can't go into all of the "footnotes" of
the class during class time, but I can do so individually.
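For those who want the short version now: a normal quantile plot pairs each sorted data value with the value a standard normal distribution would put at the same rank, and roughly normal data give points close to a straight line. The Python sketch below builds those pairs by hand. It assumes plotting positions of (i - 0.5) / n, which is one common choice; qqnorm may use slightly different positions. The sample is hypothetical, not the class data, and the function names are my own.

```python
from statistics import NormalDist, mean

def qq_points(values):
    """Pair each sorted data value with the standard normal quantile for its rank."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting position (i - 0.5) / n for the i-th smallest value.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(values)))

def correlation(pairs):
    """Pearson correlation of the pairs; near 1 means nearly a straight line."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical left-skewed sample (NOT the class data).
left_skewed = [60, 66, 68, 69, 70, 70, 71, 71, 71, 72, 72, 72]

for z, x in qq_points(left_skewed):
    print(f"normal quantile {z:6.2f}   data value {x}")
print(f"correlation of the pairs: {correlation(qq_points(left_skewed)):.3f}")
```

The correlation is only a crude one-number summary of straightness. The plot itself is more informative, because the way the points bend away from the line tells you which tail is the problem.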
Problems loading/viewing/seeing data: whenever you have a
problem viewing or loading data, contact me immediately. Tell me
what you tried and what the result was. Do NOT wait until after
the homework is due. Then it will be too late. This class
requires that you upload/download lots of data, and it's important you
be able to do this. Also, it is quite possible that I will make a
mistake in posting the data, and so we need to catch any such mistake
as soon as possible.