Lecture 2
Last time we summarized the 100 songs on the list I gave you.
Summary of playing times (seconds), n = 100:

            min     Q1      median  mean    Q3      max      s
Classical   16.00   75.25   217.0   208.20  270.80  573.00   146.4
Rock        48.0    187.0   239.0   241.5   266.0   1111.0   130.4
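A quick sketch of how summaries like these could be computed in R. The data frame below is hypothetical (the real song list isn't reproduced here); `time` and `genre` are made-up column names.

```r
# Hypothetical data: a few made-up playing times (seconds) by genre
songs <- data.frame(
  time  = c(16, 75, 217, 271, 573, 48, 187, 239, 266, 1111),
  genre = rep(c("Classical", "Rock"), each = 5)
)
tapply(songs$time, songs$genre, summary)  # min, Q1, median, mean, Q3, max per genre
tapply(songs$time, songs$genre, sd)       # standard deviation per genre
```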
The list has a new meaning, however, if I tell you that it is a random
sample from one of my iTunes libraries. (I have 3.) This leads
to another source for the variation in times -- variation due to random sampling.
This leads to several questions about our conclusions:
do classical and rock playing times differ in the population as well?
Or is this difference just a trick of the random sampling?
Are the frequencies within categories truly representative of the population?
About how different might the actual values be?
Note: I'm not sure whether this is a truly random sample. I have
access to the entire population, and so a secondary question would be to
investigate whether the iTunes random sample is really "random".
Probability theory gives us a way of answering these questions, as long as
we
(a) know the distribution of the population and
(b) know the probability mechanism through which the sample was taken
First some talk about populations.
What is a population? Seems pretty concrete when we're talking about
people. But suppose you wanted to know the mean height of adults in
the US. Adults when? At what time? And here's a more abstract one:
suppose you want to know the probability that a coin lands heads.
What's the population? Does it matter whether you're talking about
one particular coin, or all coins?
Populations are fairly abstract items. In principle we can give the
exact distribution for any variable in a population, but in interesting populations
this isn't possible.
Therefore, we often apply a model: a set of assumptions and descriptions
about a population. The most common model involves something about
the distribution of the variable we're studying.
If the variable is continuous (an assumption that is itself part of the model),
we often assume it follows a normal distribution.
The mathematical function looks like this:
f(x) = (1 / (sigma * sqrt(2 pi))) * exp( -(x - mu)^2 / (2 sigma^2) )
But more importantly, the picture looks like this: the familiar bell curve,
symmetric about mu.
How can we know if this assumption is the right one? Well, there might
be some sort of theory that tells us this is so. But we can often use
our sample to see if it is consistent with this assumption.
Note that if we take a simple random sample from this population, then this
distribution not only describes frequencies (or relative frequencies), but
also gives us probabilities. So if we learn that 68% of the population
is between 70" and 73" in height, then the probability that a randomly
selected person falls in that range is also 68%.
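A hedged illustration of that last point: if heights were normal with mean 71.5" and SD 1.5" (made-up numbers chosen so that 70"-73" is the mean plus or minus 1 SD), R's pnorm gives the familiar 68%:

```r
# Normal model for heights: mean and SD below are hypothetical
p <- pnorm(73, mean = 71.5, sd = 1.5) - pnorm(70, mean = 71.5, sd = 1.5)
p  # about 0.683, the probability of the 70"-73" range under this model
```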
Checking whether a distribution is truly normal is tricky, and turns out
to be fairly important in regression. So let's take a moment to consider
what a sample from a normal distribution might look like.
Do some random samples of n = 10, 20, 100, 1000 and look at histograms
and qq-plots.
par(mfrow=c(2,3))
test <- matrix(rnorm(60), nrow=6)   # 6 samples of size 10
hist(test[1,])                      # repeat for rows 2, ..., 6
qqnorm(test[1,])
qqline(test[1,])
Obviously the normal model doesn't work for all situations. If we're
modeling income in a population, for example, we can be pretty certain the
distribution is skewed right, and so the normal model won't hold. If
we're sampling 1/0's, as in an opinion poll, then probably a
Bernoulli/binomial model is needed.
But the normal distribution still remains useful and powerful because of
the central limit theorem. Roughly speaking, if you take a random sample
of size n from ANY distribution and add the numbers together, or take
their average, the distribution of that sum or average will be
approximately normal when n is large.
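A quick R sketch of the theorem in action: the exponential distribution is strongly right-skewed, yet averages of n = 30 draws from it pile up in a bell shape. (The sample sizes and number of replications here are arbitrary choices for illustration.)

```r
set.seed(1)
means <- replicate(2000, mean(rexp(30)))  # 2000 sample means, each from n = 30 skewed draws
hist(means)                               # roughly bell-shaped
qqnorm(means); qqline(means)              # points fall close to the line
```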
Random Variables and their Expectations and Variances
A random variable is a mathematical model of this action:
reach into a population and take a (numerical) measurement.
Random variables are numerical variables whose values are determined by
chance.
Here's a very simple example:
Place a dollar bet that black will come up on roulette. If it does, you
win $1; if not, you lose it.
X=1 or -1. These are the values.
Their probabilities are 18/38 and 20/38.
Probabilities are long-run frequencies, so the interpretation of these numbers
is that if we play very many times, this is the fraction of times we'll see
a 1 and a -1. This is called the probability distribution of X.
Note that X is NOT a number. It is "both" numbers. Strictly speaking,
X is a function and not a "missing" or "unknown" value as it is in algebra.
Think of it as a verb. Every time you see an "X", it means the roulette wheel
has been spun and a bet placed.
We can plot this distribution. And we can treat it like all others and find
the mean and SD.
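For the roulette bet, the mean is each value weighted by its probability: E[X] = (1)(18/38) + (-1)(20/38) = -2/38, about -0.053. A sketch of the same arithmetic in R:

```r
x <- c(1, -1)           # possible values of X
p <- c(18/38, 20/38)    # their probabilities
mu <- sum(x * p)                    # mean: -2/38, about -0.053
sigma <- sqrt(sum((x - mu)^2 * p))  # SD: just under 1
```

So on average you lose about a nickel per dollar bet, with an SD of nearly $1.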