In-class Quiz 2: Friday, March 12
Explain why it is better to have a random sample of size 10 than one
of size 5 for estimating the mean of a population.
The main reason: there is less variability in an estimate based on
a sample of size 10 than there is for one of size 5. This means the
estimator based on n=10 is more precise and is likely to be closer to the
population mean. For example, if you use the average to estimate the
mean, the standard error when n = 10 is sigma/sqrt(10), and this standard
error is roughly 70% the size of the standard error you'd get if
you used only 5 observations.
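A quick simulation makes the 70% figure concrete. Here is a minimal
sketch in Python with NumPy (the standard normal population is just a
convenient choice for illustration, not part of the problem):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma, reps = 1.0, 100_000   # population SD and number of simulated samples

    # Draw many samples of each size and record each sample's average.
    means_5 = rng.normal(0.0, sigma, size=(reps, 5)).mean(axis=1)
    means_10 = rng.normal(0.0, sigma, size=(reps, 10)).mean(axis=1)

    print(means_5.std())                   # close to sigma/sqrt(5)  ~ 0.447
    print(means_10.std())                  # close to sigma/sqrt(10) ~ 0.316
    print(means_10.std() / means_5.std())  # close to sqrt(5/10)     ~ 0.707

The last line is the 70% ratio: the n=10 average wanders about 30%
less from sample to sample.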
Note that both estimators are unbiased. Many of you said that the estimator
with n=10 would be "more accurate". But accuracy means "tends to hit the
target" and they both tend to do that. The difference is that the n=10 estimator
tends to be closer.
Remember, being a statistician means never having to say you're certain.
In fact, claiming certainty is often a bad thing. So beware of writing something
like "the estimator for n=10 will be closer to the true population
mean than the estimator for n=5". It is possible, and in fact not highly
unlikely, for the estimator based on n=5 to produce an estimate closer to
the true value than the one based on n=10. However, it is more likely to happen
the other way around. Look at it like this: If the population
SD is sigma=1, then in 95% of all samples of size 5, the n=5 estimator
will produce a result that is within 1.96*(1/sqrt(5))= .88 units of
the true value. On the other hand, in 95% of all samples of size 10, the
estimate will be within 1.96*(1/sqrt(10)) = .62 units of the true value.
The n=5 estimator tends to stray a little bit further from home.
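If you want to check those numbers, a short simulation can estimate both
coverage fractions, and also how often the n=5 estimate happens to beat
the n=10 estimate (a sketch assuming a normal population with mu = 0 and
sigma = 1; that specific population is my choice for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, reps = 0.0, 1.0, 100_000

    means_5 = rng.normal(mu, sigma, size=(reps, 5)).mean(axis=1)
    means_10 = rng.normal(mu, sigma, size=(reps, 10)).mean(axis=1)

    # Fraction of samples whose average lands within 1.96*sigma/sqrt(n) of mu.
    print(np.mean(np.abs(means_5 - mu) <= 0.88))    # about 0.95
    print(np.mean(np.abs(means_10 - mu) <= 0.62))   # about 0.95

    # How often is the n=5 estimate the closer one? Under these assumptions,
    # roughly 39% of the time: a clear minority, but hardly a rare event.
    print(np.mean(np.abs(means_5 - mu) < np.abs(means_10 - mu)))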
Some of you mentioned the Central Limit Theorem. The CLT becomes a
factor to consider when we're computing probabilities. However, the two
main facts are true regardless of the probability distribution of the
population: (1) the average is an unbiased estimator of the true mean
whether n=5 or n=10, and (2) the average is more precise if n=10 than if n=5.
Still, it is worth noting that if the population is NOT normal, then the
normal distribution will provide a better approximation for the estimator
based on n=10 than for n=5 when doing probability calculations (for example,
determining the width of a confidence interval or calculating a p-value).
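To see this effect, you can simulate from a clearly skewed population and
watch the sampling distribution of the average straighten out as n grows.
This sketch uses an Exponential(1) population (my choice for illustration;
its skewness is 2) and estimates the skewness of the sample average, which
shrinks like 2/sqrt(n):

    import numpy as np

    rng = np.random.default_rng(2)
    reps = 200_000

    for n in (5, 10):
        # Sampling distribution of the average of n Exponential(1) draws.
        xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
        z = (xbar - xbar.mean()) / xbar.std()
        print(n, np.mean(z**3))   # skewness: ~0.89 for n=5, ~0.63 for n=10

Less skewness means the normal curve is a better stand-in, so normal-based
confidence intervals and p-values are more trustworthy at n=10.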
If I were grading this, you'd get full credit if you mentioned only that
the standard error is smaller, and you'd get partial credit if the only thing
you mentioned was that the central limit theorem says the sampling distribution
is better approximated by a normal distribution when n=10. (The reason
this is only worth partial credit is that it is only useful when (a) the
population distribution is not normal and (b) you need to calculate probabilities
concerning your estimator.)
Is a large, non-random sample of n=100 better than one of
size 10 for estimating the mean of a population?
No, it's not. Most of you got this one right. If the sample is
non-random, then it could be biased. And getting a large biased sample
is not helpful at all. One of you said it better than I could: "quality
is more important than quantity."
Some of you said that if the population is small, say close to 100, then
the n=100 estimator would be preferred since you're seeing almost all of the
population. This is true ONLY if the population size is exactly 100. Otherwise,
you can still get a biased sample. Think of it this way: since
the sample is non-random, it could exclude any minority group. So even
if you included 90% of the population, if you excluded that 10%, you'd have
a biased view of the population. Suppose you decided to take a survey
of whether people liked hamburgers for dinner, but deliberately excluded
vegetarians. Even if you asked ALL non-vegetarians, you'd have a biased
view of how the population felt about hamburgers.
Some of you gave an answer that was even further off the mark: bigger means
you're seeing more of the population and therefore it is always better. But
for the same reason described above, this is not true if the sample is biased.
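Here is a made-up numerical version of the hamburger story (all the
proportions and the population size are invented for illustration). The
non-random sample of 100 that excludes vegetarians stays biased no matter
how big it is, while the random sample of 10 is noisy but centered on the
truth:

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical population of 10,000 people; 20% are vegetarians.
    # Non-vegetarians like hamburgers with probability 0.80, vegetarians 0.05.
    vegetarian = rng.random(10_000) < 0.20
    likes = rng.random(10_000) < np.where(vegetarian, 0.05, 0.80)
    print(likes.mean())   # true proportion, around 0.65

    # Large non-random sample: 100 people, drawn only from non-vegetarians.
    nonveg = np.flatnonzero(~vegetarian)
    print(likes[rng.choice(nonveg, size=100, replace=False)].mean())  # near 0.80

    # Small random sample: 10 people from the whole population.
    print(likes[rng.choice(10_000, size=10, replace=False)].mean())   # noisy

Averaged over many repetitions, the random sample of 10 centers on the true
proportion while the non-random sample of 100 never does: quality beats
quantity.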