Homework 1 Discussion

Rather than line-by-line solutions, I'm going to discuss some common themes that came up while grading your papers.

First, one thing I hope this class begins to teach -- and that your undergraduate education will teach -- is curiosity.  It is my belief that curiosity is not necessarily a natural emotion, but needs to be taught, and learning about a variety of ways of seeing and understanding the world is one way of learning it.  All that by way of saying that although the book is sometimes rather terse and says "make a plot", I hope you will also look at the plot more closely and wonder what it teaches us.

In general:  (1) Write full, complete sentences.  Try to avoid words like "this" and "they" unless it is very clear what you are referring to.  For example, "They are normally distributed" is not nearly as good as "The possums' ages are normally distributed."
(2) Put comments and graphs together.  It is very hard to follow your interpretations if the graphs are many pages away.

Possums data:   The earconch distribution is bimodal.  (Remember, when looking at histograms or distributions, look for shape, center, and spread).  When we see bimodality, we try to find reasons, and one is often that we've combined two populations.  It is natural, therefore, to separate by gender.  However, the histograms of earconch by gender are still bimodal.  Very few people commented on this.  Because the bimodality persists within each gender, there could be some other source for it.   Also note:  boxplots do NOT give you a sense of the number of modes.  They merely tell you the median and the spread and some crude notions of shape.

O-rings:  There's some history here which was not given in the HW problem but makes this more interesting.  The first space shuttle to be launched with a civilian was to be launched in colder-than-normal temperatures.  NASA wanted evidence that it was safe.  The first plot you were asked to make (with the subset of the full data set) was the plot the engineers considered in order to make the decision.  You'll note that this plot suggests that safety and temperature do not seem to be related.  Tragically, had they looked at the full data set, they would have seen that failures were associated almost exclusively with lower temperatures.  The launch was approved and the shuttle exploded after take-off, killing all on board.   You should provide both plots and comment on what the plots say about safety.  Many people commented on the relation between temperature and failure, but of course failure is a way of assessing safety.  Also, many people gave "left-to-right" interpretations:  "as temperatures increase, the chance of failure decreases."  Technically, this is true, but as you can see from the historical context, what we really want to know is more "right-to-left":  what happens at low temperatures?

Density Curves vs. histograms:  I think most of you saw the main issue.  The shape of a histogram depends on where the cut-points are for the bins.  This is why we often look at several different histograms of the same data, to get a sense for what different shapes are plausible.  The density curve estimates a "smooth" histogram.  In many ways it provides more detail in that it helps us see how relative frequencies change in a continuous fashion, while a histogram suggests that there's no change at all within a bin.  However, as some of you pointed out, sometimes there's value in knowing the counts in a bin, or in treating even continuous variables as if they were categorical and comparing one bin to another.  One word of caution, though:  this estimated density curve also depends on several "tuning parameters", one of which is called the bandwidth and is directly analogous to the bin width of a histogram.  Change the bandwidth and the shape of the density curve will change. It is a bit of an art, trying to decide which bandwidth is "best".
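
To make both points concrete, here is a minimal sketch in Python (standard library only; in R you would simply call the plotting functions).  The function names bin_counts, kde, and count_modes are hypothetical helpers written for this illustration, and the numbers are made up rather than taken from the homework data.  The first part counts the same values with the same bin width but two different starting cut-points; the second part uses a Gaussian kernel (one common choice, assumed here) and counts the peaks of the estimated density at two bandwidths.

```python
import math
from collections import Counter

def bin_counts(data, width, start):
    """Count observations in bins [start + k*width, start + (k+1)*width)."""
    counts = Counter(math.floor((x - start) / width) for x in data)
    return {start + k * width: counts[k] for k in sorted(counts)}

def kde(data, x, bandwidth):
    """Gaussian kernel density estimate at the point x."""
    total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
    return total / (len(data) * bandwidth * math.sqrt(2 * math.pi))

def count_modes(data, bandwidth, lo, hi, steps=400):
    """Count local maxima of the estimated density on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    dens = [kde(data, x, bandwidth) for x in grid]
    return sum(1 for i in range(1, steps) if dens[i - 1] < dens[i] > dens[i + 1])

# Made-up values for illustration only; these are not the homework data.
values = [1.9, 2.1, 2.9, 3.1, 3.9, 4.1, 4.9, 5.1]
print(bin_counts(values, width=2, start=1))  # bins starting at 1, 3, 5
print(bin_counts(values, width=2, start=0))  # same width, bins starting at 0, 2, 4

# Two made-up clusters, one near 1 and one near 3.
clusters = [0.8, 0.9, 1.0, 1.1, 1.2, 2.8, 2.9, 3.0, 3.1, 3.2]
print(count_modes(clusters, bandwidth=0.3, lo=-2, hi=6))  # narrow bandwidth
print(count_modes(clusters, bandwidth=2.0, lo=-2, hi=6))  # wide bandwidth
```

With the first set of cut-points the counts pile up on the left, and with the second they pile up on the right, even though the values never changed.  Likewise, the narrow bandwidth keeps the two clusters as separate peaks while the wide one smooths them into a single peak.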

Heights of Stats 11 men:  I made a big mistake.  I warned you about outliers that did not exist.  This was because I was looking at the women's heights when in fact I should have looked at the men's.  Two women reported their heights as 300", or 25 feet.  This was clearly a mistake (either on their part or the typist's part).  Had there been 25-foot-tall people in the class, I would have noticed.  Hence, any comments about the distribution of women's heights should be made after removing these outliers.  (This is one of the few cases in which it is clear that the outliers should be removed.)   The men's heights, however, did not have outliers.  Many of you said the distribution was normal, despite showing histograms that looked left-skewed.  When making claims about normality, also check a normal quantile plot with the qqnorm function, which is a better way of assessing normality than eyeballing a histogram.  In fact, the men's histogram has another interesting feature which is best seen if you make the bins only 1" wide.  Note that there are very few 71" tall men, and quite a few 72" tall men.  Why?  Well, 72/12 = 6, and so it seems there's a tendency to round up to six feet if you're a man and close to six feet.  In fact, in the default histogram (that has slightly wider bin widths), there's a peak at six feet with only two observations over it.

Speed of light:  the assumptions of a t-test in this context are that the data are normally distributed and independent.  The null hypothesis is that the true speed of light is 734.5 (in appropriate units for these data).  The alternative is that it is NOT 734.5.   If you simply put all of the data into the hopper, you will reject the null hypothesis.  Next, however, the problem tells you to consider the data in "batches", in which each batch consists of ordered trials.  Now you see that the first trial, in particular, has extremely large variance and a much higher median than the other trials.  This suggests that there might have been some sort of learning curve going on with the experimenters, and so perhaps the first batch is not as trustworthy as subsequent batches.  (In other words, the independence assumption could be violated;  knowledge of which trial the data comes from tells us something about the result).  So what do you think would happen if we eliminated the first trial from the t-test?   One further note:  It will NEVER be sufficient to state assumptions and then plow on regardless of whether they are true.  Once you've said that you're assuming normality, CHECK it.  (Many of you did, but some did not.)
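
For reference, the one-sample t statistic is (sample mean - hypothesized value) / (s / sqrt(n)), and the batch comparison amounts to computing a median and a variance for each batch.  Below is a minimal sketch in Python (standard library only).  The helper names t_statistic and batch_summaries are hypothetical, and the numbers, including the hypothesized value of 800, are invented for illustration and are NOT the speed-of-light data.  Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom, which statistical software will do for you.

```python
import math
import statistics

def t_statistic(sample, mu0):
    """One-sample t statistic: (sample mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error

def batch_summaries(batches):
    """Median and sample variance for each batch, in order."""
    return [(statistics.median(b), statistics.variance(b)) for b in batches]

# Hypothetical numbers for illustration only; NOT the speed-of-light data.
batches = [
    [900, 650, 1000, 760, 930],  # a noisy first batch
    [800, 790, 810, 805, 795],
    [798, 802, 800, 797, 803],
]
all_values = [x for b in batches for x in b]

print(batch_summaries(batches))             # compare batches before pooling them
print(t_statistic(all_values, mu0=800))     # pooled test against a made-up value
```

Looking at the per-batch summaries before pooling is the code equivalent of the side-by-side plots the problem asks for.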

Writing Functions:  Many of you were a bit shaky about whether your function was "correct".  The nice thing about computers is that you can always check. Run the function and see if it does what you want it to!
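
One way to check is to run the function on a small input whose answer you can work out on paper, and then compare against a built-in that does the same job.  Here is a minimal sketch in Python (the same idea applies in R); my_sd is a hypothetical example function, not one from the homework.

```python
import math
import statistics

def my_sd(values):
    """Sample standard deviation, written out by hand."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

# A small input whose answer can be worked out on paper:
# the mean is 5, the squared deviations sum to 32, so the variance is 32 / 7.
check = [2, 4, 4, 4, 5, 5, 7, 9]

# Check against the hand calculation, then against the built-in.
assert math.isclose(my_sd(check), math.sqrt(32 / 7))
assert math.isclose(my_sd(check), statistics.stdev(check))
print("my_sd agrees with the hand calculation and the built-in")
```

If either comparison fails, the function stops with an error at that line, which tells you exactly which expectation was not met.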

qqnorm:    this is a very handy function, but I understand that some of you might not know what it does.  By all means ask if you want to learn about what a quantile is and why this function works.  A difficulty of this class is that you have varied backgrounds, and so I can't go into all of the "footnotes" of the class during class time, but I can do so individually.
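
In rough terms, a normal quantile plot sorts the data and pairs each observation with the value a standard normal distribution would put at the same position; if the data are close to normal, the pairs fall near a straight line.  The sketch below illustrates that idea in Python (standard library only) and is not the actual source of qqnorm.  The helper names are hypothetical, the plotting positions (i - 0.5) / n are one common convention assumed here, and the two samples are constructed for illustration.

```python
import math
from statistics import NormalDist

def normal_qq_points(sample):
    """Pair each sorted observation with a standard normal quantile."""
    n = len(sample)
    # Plotting positions (i - 0.5) / n: one common convention, assumed here.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(sample)))

def correlation(pairs):
    """Pearson correlation of (x, y) pairs; near 1 means nearly a straight line."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Constructed samples for illustration only.
z = [NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
bell_shaped = [70 + 3 * q for q in z]      # normal-shaped by construction
right_skewed = [math.exp(q) for q in z]    # long right tail

print(correlation(normal_qq_points(bell_shaped)))   # points lie on a line
print(correlation(normal_qq_points(right_skewed)))  # points curve away from a line
```

The correlation is only a numerical stand-in for what you would judge by eye on the plot: curvature in the points signals skewness, and a few points far off the line signal outliers.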

problems loading/viewing/seeing data:   whenever you have a problem viewing or loading data, contact me immediately.  Tell me what you tried and what the result was.  Do NOT wait until after the homework is due.  Then it will be too late.  This class requires that you upload/download lots of data, and it's important you be able to do this.  Also, it is quite possible that I make a mistake in posting the data, and so we need to catch such mistakes as soon as possible.