Homework 2 Solutions
2a) Describe the distribution of income of LA residents in 2000.
About half of the families make less than $43K, but a small percentage make
substantially more. As one might expect, the distribution is very skewed
to the right, with a small handful of families over $200K. Incomes
range from near $0 to over $500,000.
Your description should discuss (a) the center, (b) the spread, and (c) the shape,
all in the context of the data. The above paragraph does all three. The first phrase ("about
half...make less than ...") discusses the median (and by saying "half of
the families" instead of "half of the data" it's in context). The next sentence
describes the shape, and the last describes the spread.
The "context test". What do we mean by "discuss in context"? The
test I use is to ask myself: if all of the labels were stripped off the graph,
and the student were given NO information about what the data were and where
they came from, would their description still work? If so, then their
description has no context, and that is a bad thing.
Another note about the histogram: many of you commented that the histogram
appears to show that there are incomes of less than 0 dollars. Obviously
this is not possible unless some data were mis-entered or there's a funny
new definition of income being used. But there's a more plausible explanation;
the software creates bins. In this case it's hard to tell precisely
where the bins begin and end, but they're about 14K wide. (There are 6 full
bins and two half-bins between 0 and 100K, so 7 bins have to cover the range.)
The first bin is centered on 0, and therefore includes every reported income
from -7K to +7K. Even if there are no negative incomes,
there probably are some in the 0 to 7K range, and that's why this bin makes
it look as if negative incomes were reported, even though they were not.
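If you'd like to see the binning effect for yourself, here is a minimal Python sketch. The 14K bin width is the rough estimate from the histogram; the five incomes are made up for illustration and are not the LA data.

```python
# Hypothetical incomes in dollars (illustration only, not the LA data).
incomes = [0, 1500, 6900, 7000, 20000]

BIN_WIDTH = 14_000            # rough estimate read off the histogram
first_left = -BIN_WIDTH / 2   # first bin is centered on 0, so it starts at -7K


def bin_index(income):
    """Return the bin number (0 = first bin); each bin covers [left, right)."""
    return int((income - first_left) // BIN_WIDTH)


# Every income that lands in the bar drawn from -7K to +7K.
first_bin = [x for x in incomes if bin_index(x) == 0]
print(first_bin)
```

None of the incomes in the first bin is negative, yet the bar that holds them is drawn across the whole range from -7K to +7K.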
2b) The easiest way to do this is to recognize that the entire histogram
must add to 100%. The median line at 43,200 is the 50% mark; there
are about 4 more bins until you get to 100K, and these represent 11%+6%+5%+6%
= 28%, so approximately 50%+28% = 78% of the incomes are below $100K.
Understand that the purpose of this problem is not to get exactly the true
percentage. This is impossible given the histogram. But you can
get a ballpark. What *is* important is that your answer state that
this is not the true answer (so use "approximately" or "about") or that,
if you come up with a method for giving an exact answer, you state the assumptions
needed to make it true.
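The arithmetic in 2b can be written out as a short Python sketch. The bar heights are the eyeballed values quoted above, so the result is only a ballpark figure.

```python
# Bar heights (percent) read off the histogram between the median and $100K.
bars_median_to_100k = [11, 6, 5, 6]
below_median = 50  # half the incomes lie below the median, by definition

approx_below_100k = below_median + sum(bars_median_to_100k)
print(f"about {approx_below_100k}% of incomes are below $100K")
```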
2c) The first bar is about 7%, the next is 18%, so the 10th percentile is
somewhere in the second bin. We can't say where, exactly, so at best we can
say it's somewhere between $7000 and $28000.
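The same reasoning, running up the cumulative percentages until they pass 10%, looks like this as a rough Python sketch. Only the two bar heights quoted above are used, and they are approximate readings.

```python
from itertools import accumulate

bar_percents = [7, 18]                       # first two bars, read off the histogram
cumulative = list(accumulate(bar_percents))  # running total of the bar heights

target = 10  # we want the 10th percentile
# Number (1 = first bar) of the first bin whose running total reaches the target.
bin_number = next(i for i, c in enumerate(cumulative, start=1) if c >= target)
print(bin_number)
```

This only tells us which bin holds the 10th percentile, not where inside the bin it falls.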
2d) I can't sketch here, but your boxplot should have these features:
Median-line at 43K. Bottom of box (Q1) at about 25K. Top of box (Q3)
at about 100K. The IQR is therefore 75K, and 1.5*75 = 112.5K.
Now 25K-112.5K puts us below 0 and therefore below the lowest possible observation,
and so your lower whisker extends to 0. The upper whisker extends
to 100K + 112.5K = 212.5K or to the maximum, whichever produces the shorter
whisker. In this case there are observations above 212.5K, so the whisker
stops at 212.5K and you should indicate outside values with dots. Of
course I wouldn't expect you to know precisely how many dots are needed and
where exactly they should go, but you should indicate that there are some
out there, and at the very least should indicate the maximum value at about
550K.
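Here is a minimal Python sketch of the whisker arithmetic. It works only from the approximate summary values read off the histogram (all in thousands of dollars), so, as in the text, the whisker is placed at the 1.5*IQR limit itself rather than at an actual observation.

```python
q1, q3 = 25.0, 100.0              # approximate quartiles, in thousands
data_min, data_max = 0.0, 550.0   # approximate smallest and largest incomes

iqr = q3 - q1
step = 1.5 * iqr

# A whisker never extends past the data, and never more than 1.5*IQR from the box.
lower_whisker = max(data_min, q1 - step)
upper_whisker = min(data_max, q3 + step)

# Anything beyond the upper limit gets drawn as a dot.
has_high_outliers = data_max > q3 + step

print(iqr, step, lower_whisker, upper_whisker, has_high_outliers)
```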
2e) True. The histogram is right-skewed, and so the average will be
greater than the median. This means "most families" (more than half) will
be below average.
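A tiny made-up example shows the effect. The seven incomes below are hypothetical (in thousands of dollars) and chosen only to be right-skewed; they are not the LA data.

```python
from statistics import mean, median

# Hypothetical right-skewed incomes, in thousands of dollars.
incomes = [10, 20, 30, 40, 50, 60, 500]

avg = mean(incomes)
mid = median(incomes)

# Fraction of families whose income is below the average.
share_below_avg = sum(x < avg for x in incomes) / len(incomes)

print(avg > mid, share_below_avg > 0.5)
```

The one very large income drags the average well above the median, so most of the families end up below average.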
2f) Don't look at the data. Think about it. The minimum values will
still be 0, and the max may or may not change (and could be higher or lower).
But in general, incomes are higher now than then (due to inflation,
if nothing else). This means that we would still expect a right-skewed
histogram, but with lower median and average, which means it might be more
"bunched up" towards 0 than the 2000 histogram.
NOTE that you can answer all of these questions without looking at the data.
The purpose of this exercise was to explore how much information you
can (or cannot) get out of a histogram.
Wild and Seber, p. 72 #2
Remember that you're welcome to use any statistical software/calculator package
you like. The data are available on-line (see the top of the Homework web
page), and so Stata or Statcrunch can load it automatically if you're on-line,
or you can download it and load it into Stata later.
2a) Average male coyote length is 92.0", give or take 6.7" (the sample
standard deviation is 6.7"). The average male length is greater than
the average female length (89.2") but the standard deviation for females,
6.5", is about the same. Since standard deviation measures variability,
both data sets (male and female) have about the same variability, though the
males might be slightly more variable.
In Stata:
* load the coyote data straight from the web
insheet using "http://www.stat.auckland.ac.nz/~wild/ChanceEnc/WSdata/Ch02data/coyote.txt"
* summarize length separately for each gender
sort gender
by gender: summarize length
There's quite a bit of overlap in these distributions: we know that about
68% of the male coyotes in the sample fell within 92 +/- 6.7, and the male and
female averages are only about 3" apart. So if we were to randomly
select a male and a female, more often than not the male would be longer
than the female, but certainly not every time.
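To put a rough number on "more often than not", here is a Python sketch. It assumes (my assumption, not something the problem states) that male and female lengths are approximately normal and independent, using the sample means and standard deviations from above.

```python
from statistics import NormalDist

# Assumed normal models built from the sample summaries.
male = NormalDist(mu=92.0, sigma=6.7)
female = NormalDist(mu=89.2, sigma=6.5)

# Difference of two independent normals is again normal.
difference = male - female

# Chance that a randomly chosen male is longer than a randomly chosen female.
p_male_longer = 1 - difference.cdf(0)
print(round(p_male_longer, 2))
```

Under these assumptions the male is longer in roughly 6 out of 10 pairs, which is better than a coin flip but far from a sure thing.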
2b) The rule predicts 68% lie between 92-6.7 = 85.3" and 92+6.7 = 98.7". In
fact, there are 29 coyotes in this interval, which is 29/43*100% = 67.4%.
The rule predicts 95% of the coyotes lie between 92-2(6.7) = 78.6" and 92+2(6.7) = 105.4".
There were 42 coyotes in this interval, and 42/43*100% = 97.7%.
The easiest way to do this is to sort the data and count. Stata makes
this easy by displaying data in groups of 5. Here's what I did. (You
don't have to know how to do this in Stata, but if you're interested...)
* put the lengths in order
sort length
* keep only the males
drop if gender=="female"
* show the males within one SD of the average
list if length > 85.3 & length < 98.7
Then you count. There's probably an easier way, but this is what came first
to mind.
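If you would rather let the computer do the counting, here is a minimal Python sketch. The helper name percent_within and the short list of lengths are hypothetical, made up for illustration; only the counts 29, 42 and 43 come from the coyote data.

```python
def percent_within(values, low, high):
    """Percent of the values lying strictly between low and high."""
    inside = sum(low < v < high for v in values)
    return 100 * inside / len(values)


# Hypothetical lengths, just to show the helper at work (not the coyote data).
sample = [80.0, 86.0, 90.5, 93.0, 97.0, 99.0, 104.0]
print(percent_within(sample, 85.3, 98.7))

# Percentages from the counts reported above for the 43 male coyotes.
one_sd_percent = round(100 * 29 / 43, 1)
two_sd_percent = round(100 * 42 / 43, 1)
print(one_sd_percent, two_sd_percent)
```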
2c) Again, life is easier if you use a calculator/computer to find this.
The "summarize, detail" command in Stata will provide you with this information.
Q1 = 87 and Q3 = 96, so IQR = 96 - 87 = 9.
The rule of thumb says that the sample SD should be about (3/4)*9 = 6.75,
which is pretty darn close. Can you intuitively justify this rule
of thumb (using the 68/95/99.7 rule)?
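One way to check the rule of thumb numerically is sketched below in Python. It assumes the data are roughly normal (an assumption on my part), in which case the quartiles sit a fixed number of SDs on either side of the mean.

```python
from statistics import NormalDist

q1, q3 = 87, 96
iqr = q3 - q1
rule_of_thumb_sd = 0.75 * iqr  # the (3/4)*IQR rule of thumb

# For a standard normal, the 75th percentile is this many SDs above the mean.
z75 = NormalDist().inv_cdf(0.75)

# The IQR spans 2*z75 SDs, so SD / IQR is 1 / (2*z75) for normal data.
normal_ratio = 1 / (2 * z75)

print(rule_of_thumb_sd, round(normal_ratio, 2))
```

For normal data the ratio of SD to IQR comes out close to 3/4, which is where the rule of thumb comes from.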