Homework 2 Solutions

2a) Describe the distribution of income of LA residents in 2000.

About half of the families make less than $43K, but a small percentage make substantially more.  As one might expect, the distribution is very skewed to the right, with a small handful of families over $200K.  Incomes range from near $0 to over $500,000.

Your description should discuss (a) the center, (b) the spread, and (c) the shape, all in the context of the data.  The above paragraph does all three. The first phrase ("about half...make less than ...") discusses the median (and by saying "half of the families" instead of "half of the data" it's in context). The next sentence describes the shape, and the last the spread.

The "context test".  What do we mean by "discuss in context"?  The test I use is I ask myself, if all of the labels were stripped off the graph, and the student were given NO information about what the data were and where they came from, would their description still work?  If so, then their description must not have any context and this is a bad thing.

Another note about the histogram:  many of you commented that the histogram appears to show incomes of less than 0 dollars.  Obviously this is not possible unless some data were mis-entered or there's a funny new definition of income being used.  But there's a more plausible explanation: the software creates bins.  In this case it's hard to tell precisely where the bins begin and end, but they're about 14K wide. (There are 6 full bins and two half-bins between 0 and 100K, so the equivalent of 7 bins has to cover the range, and 100K/7 is about 14K.) The first bin is centered on 0, and therefore includes every reported income from -7K to +7K.   Even if there are no negative incomes, there probably are some in the 0 to 7K range, and that's why this bin makes it look as if negative incomes were reported, even though they were not.
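If you want to check the bin-width reasoning with a few lines of code, here is a minimal sketch in Python (not one of the packages used in this course, just a convenient calculator). The bin counts are my reading of the histogram, so treat the numbers as approximate.

```python
# Assumption (read off the histogram): 6 full bins plus two half-bins
# cover the range from 0 to 100K, i.e. the equivalent of 7 full bins.
span = 100_000
equivalent_bins = 6 + 2 * 0.5
bin_width = span / equivalent_bins      # a bit over 14K

# The first bin is centered on 0, so it reaches half a width either side.
first_bin = (-bin_width / 2, bin_width / 2)

print(f"bin width is about {bin_width:,.0f}")
print(f"first bin runs from about {first_bin[0]:,.0f} to {first_bin[1]:,.0f}")
```

The lower edge of the first bin comes out negative, which is why the bar appears to cover negative incomes.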

2b) The easiest way to do this is to recognize that the entire histogram must add to 100%.  The median line at 43,200 is the 50% mark; there are about 4 more bins until you get to 100K, and these represent 11%+6%+5%+6% = 28%, so approximately 50%+28% = 78% of the incomes are below $100K.  
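The same bookkeeping as a short Python sketch, assuming the bar heights I read off the histogram (yours may differ by a percent or two):

```python
# Percent of families below the median line, by definition of the median.
below_median = 50

# Approximate heights (in percent) of the bars between the median and 100K,
# as read off the histogram. These are eyeballed values, not exact ones.
bars_to_100k = [11, 6, 5, 6]

approx_below_100k = below_median + sum(bars_to_100k)
print(f"approximately {approx_below_100k}% of incomes are below $100K")
```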

Understand that the purpose of this problem is not to get exactly the true percentage.  This is impossible given the histogram.  But you can get a ballpark.  What *is* important is that your answer state that this is not the true answer (so use "approximately" or "about") or that, if you come up with a method for giving an exact answer, you state the assumptions needed to make it true.

2c) The first bar is about 7%, the next is 18%, so the 10th percentile is somewhere in the second bin. We can't say where, exactly, so at best we can say it's somewhere in that bin, which (with bins about 14K wide and the first one ending at about $7,000) runs from roughly $7,000 to $21,000.
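Finding which bin holds a percentile is just a running total of the bar heights. A minimal Python sketch, using the two approximate bar heights quoted above:

```python
from itertools import accumulate

# Approximate heights (percent) of the first two bars, read off the histogram.
bar_heights = [7, 18]
cumulative = list(accumulate(bar_heights))   # running total of percent

target = 10   # we want the 10th percentile

# Index of the first bin whose running total reaches the target.
bin_index = next(i for i, total in enumerate(cumulative) if total >= target)
print(f"the {target}th percentile falls in bin number {bin_index + 1}")
```

The histogram tells us which bin, but nothing about where inside the bin the percentile sits.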

2d) I can't sketch here, but your boxplot should have these features:

Median-line at 43K.  Bottom of box (Q1) at about 25K. Top of box (Q3) at about 100K.  The IQR is therefore 75K, and 1.5*75 = 112.5K.
Now 25K-112.5K puts us below 0 and therefore below the lowest possible observation, and so your lower whisker extends to 0.    The upper whisker extends to 100K + 112.5K = 212.5K or to the maximum, whichever produces the shorter whisker.  In this case there are observations above 212.5K, so the whisker stops at 212.5K and you should indicate outside values with dots.  Of course I wouldn't expect you to know precisely how many dots are needed and where exactly they should go, but you should indicate that there are some out there, and at the very least should indicate the maximum value at about 550K.
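The whisker arithmetic can be laid out as a short Python sketch. The quartiles and the maximum are the approximate values quoted above (in thousands of dollars), and the minimum is taken to be 0 as in the text.

```python
# Approximate five-number-summary pieces, in thousands of dollars.
q1, median, q3 = 25, 43, 100
lowest_possible, approx_max = 0, 550

iqr = q3 - q1                     # interquartile range
lower_fence = q1 - 1.5 * iqr      # falls below 0
upper_fence = q3 + 1.5 * iqr

# Each whisker stops at the fence or at the extreme value,
# whichever gives the shorter whisker.
lower_whisker = max(lower_fence, lowest_possible)
upper_whisker = min(upper_fence, approx_max)

print(f"IQR = {iqr}K, fences at {lower_fence}K and {upper_fence}K")
print(f"whiskers end at {lower_whisker}K and {upper_whisker}K")
print("outside values above the upper whisker:", approx_max > upper_fence)
```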

2e) True.  The histogram is right-skewed, and so the average will be greater than the median. This means "most families" (more than half) will be below average.
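To see the effect on a small scale, here is a Python sketch with ten made-up incomes (in thousands). These are NOT the LA data, just a hypothetical right-skewed list invented for illustration.

```python
from statistics import mean, median

# Hypothetical right-skewed incomes in thousands of dollars (invented values).
toy_incomes = [10, 20, 30, 40, 43, 50, 60, 80, 200, 550]

avg = mean(toy_incomes)
med = median(toy_incomes)

# Fraction of the toy families whose income is below the average.
share_below_avg = sum(x < avg for x in toy_incomes) / len(toy_incomes)

print(f"mean = {avg}, median = {med}")
print(f"share below average = {share_below_avg:.0%}")
```

The two large values drag the average well above the median, so well over half of the toy families end up below average.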

2f) Don't look at the data. Think about it.  The minimum values will still be 0, and the max may or may not change (and could be higher or lower).  But in general, incomes are higher now than then (due to inflation, if nothing else).  This means that we would still expect a right-skewed histogram, but with lower median and average, which means it might be more "bunched up" towards 0 than the 2000 histogram.

NOTE that you can answer all of these questions without looking at the data.  The purpose of this exercise was to explore how much information you can (or cannot) get out of  a histogram.


Wild and Seber, p. 72 #2
Remember that you're welcome to use any statistical software/calculator package you like. The data are available on-line (see the top of the Homework web page), and so Stata or Statcrunch can load it automatically if you're on-line, or you can download it and load it into Stata later.

2a) Average male coyote length is 92.0", give or take 6.7"  (the sample standard deviation is 6.7").  The average male length is greater than the average female length (89.2") but the standard deviation for females, 6.5", is about the same.  Since standard deviation measures variability, both data sets (male and female) have about the same variability, though the males may vary slightly more.

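In Stata:
* load the coyote data directly from the web
insheet using "http://www.stat.auckland.ac.nz/~7Ewild/ChanceEnc/WSdata/Ch02data/coyote.txt"
* sort by gender, then summarize length separately for each gender
sort gender
by gender: summarize length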
There's quite a bit of overlap in these distributions: we know that about 68% of the male coyotes in the sample fell within 92 +/- 6.7, and the male and female averages are only about 3" apart.  So if we were to randomly select a male and a female, more often than not the male would be longer than the female, but not necessarily and certainly not every time.
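To put a rough number on "more often than not", here is a Python sketch. It rests on two assumptions that go beyond the summary statistics: that lengths in each group are roughly normal, and that the male and the female are picked independently. Under those assumptions the difference (male minus female) is also normal.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in 2a.
male_mean, male_sd = 92.0, 6.7
female_mean, female_sd = 89.2, 6.5

# ASSUMPTION: both groups roughly normal and the two picks independent,
# so the difference has mean = difference of means and
# variance = sum of the variances.
difference = NormalDist(male_mean - female_mean,
                        sqrt(male_sd**2 + female_sd**2))

# Chance that the randomly chosen male is longer than the female.
p_male_longer = 1 - difference.cdf(0)
print(f"P(male longer) is roughly {p_male_longer:.2f}")
```

The result is above one half but nowhere near 1, which matches the "more often than not, but certainly not every time" description.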

2b) The rule predicts 68% lie between 92-6.7 = 85.3" and 92+6.7 = 98.7".  In fact, there are 29 coyotes in this interval, which is 29/43*100% = 67.4%.   The rule predicts 95% of the coyotes lie between 92-2(6.7) = 78.6" and 92+2(6.7) = 105.4". There were 42 coyotes in this group, and 42/43*100% = 97.7%.
The easiest way to do this is to sort the data and count.  Stata makes this easy by displaying data in groups of 5.  Here's what I did. (You don't have to know how to do this in Stata, but if you're interested...)

sort length
drop if gender=="female"
list if length > 85.3 & length < 98.7

Then you count. There's probably an easier way, but this is what came first to mind.
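Once you have the counts, the comparison with the rule is simple arithmetic. A Python sketch using the mean, SD, and counts reported above (the counts come from my Stata listing, not from this code):

```python
# Values for the male coyotes, as reported in the solution.
mean, sd, n = 92.0, 6.7, 43
counts = {1: 29, 2: 42}        # coyotes counted within 1 SD and 2 SDs
predicted = {1: 68, 2: 95}     # percentages the rule predicts

observed = {}
for k in (1, 2):
    low, high = mean - k * sd, mean + k * sd
    observed[k] = 100 * counts[k] / n
    print(f"within {k} SD ({low:.1f} to {high:.1f}): "
          f"observed {observed[k]:.1f}%, rule predicts {predicted[k]}%")
```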

2c) Again, life is easier if you use a calculator/computer to find this. The summarize, detail command in Stata will provide you with this information.

Q1=87 Q3=96  so IQR = 96-87 = 9
The rule of thumb says that the sample SD should be about (3/4)*9 = 6.75, which is pretty darn close to the actual sample SD of 6.7.   Can you intuitively justify this rule of thumb (using the 68/95/99.7 rule)?
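One way to check the 3/4 factor numerically, as opposed to intuitively, is to ask how wide the IQR is for a normal curve. The Python sketch below assumes the data are roughly normal, which is the same assumption behind the 68/95/99.7 rule.

```python
from statistics import NormalDist

# ASSUMPTION: a normal model. For a standard normal curve (SD = 1),
# the IQR is the distance between the 25th and 75th percentiles.
z = NormalDist()
iqr_in_sd_units = z.inv_cdf(0.75) - z.inv_cdf(0.25)

# Turning that around gives the SD as a fraction of the IQR.
sd_per_iqr = 1 / iqr_in_sd_units

q1, q3 = 87, 96                            # quartiles from the solution
rule_of_thumb_sd = 0.75 * (q3 - q1)        # the (3/4)*IQR rule
normal_based_sd = sd_per_iqr * (q3 - q1)   # what the normal model suggests

print(f"IQR is about {iqr_in_sd_units:.2f} SDs, so SD is about "
      f"{sd_per_iqr:.2f} of the IQR")
print(f"rule of thumb: {rule_of_thumb_sd}, normal model: {normal_based_sd:.2f}")
```

The factor comes out a little under 3/4, so the rule of thumb is a convenient rounding and not an exact identity.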