Lecture 2

Last time we summarized the 100 songs on the list I gave you.
n = 100

            min     Q1      median  mean    Q3      max       s
Classical   16.00   75.25   217.0   208.20  270.80   573.00   146.4
Rock        48.0    187.0   239.0   241.5   266.0   1111.0    130.4
The list has a new meaning, however, if I tell you that it is a random sample from one of my iTunes libraries.  (I have 3.)  This introduces another source for the variation in times -- variation due to random sampling.  That leads to several questions about our conclusions:

Do classical and rock playing times differ in the population as well?  Or is this difference just an artifact of the random sampling?

Are the frequencies within categories truly representative of the population?  About how different might the actual values be?

Note:  I'm not sure whether this is a truly random sample.  I have access to the entire population, and so a secondary question would be to investigate whether the iTunes "random" sample is really random.

Probability theory gives us a way of answering these questions, as long as we
(a) know the distribution of the population and
(b) know the probability mechanism through which the sample was taken

First some talk about populations.

What is a population?  It seems pretty concrete when we're talking about people.  But suppose you wanted to know the mean height of adults in the US.  Adults when?  At what time?  Here's a more abstract example:  suppose you want to know the probability that a coin lands heads.  What's the population?  Does it matter whether you're talking about one particular coin, or all coins?

Populations are fairly abstract items.  In principle we can give the exact distribution for any variable in a population, but in interesting populations this isn't possible.

Therefore, we often apply a model:  a set of assumptions and descriptions about a population.  The most common model involves an assumption about the distribution of the variable we're studying.

If the variable is continuous (an assumption that is itself part of the model), we often assume it is normally distributed.

The mathematical function is the normal density:

    f(x) = (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))

But more importantly, the picture is the familiar bell curve: symmetric, centered at mu, with spread controlled by sigma.
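As a quick check, R's built-in dnorm() evaluates this density, and we can draw the bell curve directly -- a minimal sketch:

```r
# Plot the standard normal density (mu = 0, sigma = 1)
curve(dnorm(x), from = -4, to = 4,
      ylab = "density", main = "Standard normal curve")

# dnorm agrees with the formula: at x = 0 it equals 1/sqrt(2*pi)
dnorm(0)
```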

How can we know if this assumption is the right one?  Well, there might be some sort of theory that tells us this is so.  But we can often use our sample to see if it is consistent with this assumption.

Note that if we take a simple random sample from this population, then this distribution not only describes frequencies (or relative frequencies), but also gives us probabilities.  So if we learn that 68% of the population is between 70" and 73" in height, then the probability that a random selection gives us someone in that height range is also 68%.
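We can compute such probabilities from the normal model with pnorm().  A sketch, assuming (hypothetically -- these numbers are chosen only so that 70" and 73" sit one SD from the mean) heights are normal with mean 71.5" and SD 1.5":

```r
# Hypothetical normal model for heights: mean 71.5", SD 1.5"
mu <- 71.5
sigma <- 1.5

# P(70" < height < 73") under this model -- about 0.68,
# since 70 and 73 are one SD below and above the mean
pnorm(73, mu, sigma) - pnorm(70, mu, sigma)
```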

Checking whether a distribution is truly normal is tricky, and turns out to be fairly important in regression.  So let's take a moment to consider what a sample from a normal distribution might look like.

Do some random samples of n = 10, 20, 100, 1000 and look at histograms and normal qq-plots.  In R:

par(mfrow = c(2, 3))                 # 2 x 3 grid of plots
test <- matrix(rnorm(60), nrow = 6)  # six samples of size 10
for (i in 1:6) hist(test[i, ])       # histogram of each sample
for (i in 1:6) {
  qqnorm(test[i, ])                  # normal quantile-quantile plot
  qqline(test[i, ])                  # add reference line
}


Obviously the normal model doesn't work for all situations.  If we're modeling income in a population, for example, we can be pretty certain the distribution is skewed right, and so the normal model won't hold.  If we're sampling 1/0's, as in an opinion poll, then probably a Bernoulli/binomial model is needed.
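To illustrate the 1/0 case, here is a sketch with made-up numbers (a true population proportion of 0.55 and a poll of n = 100 -- both hypothetical):

```r
# Hypothetical opinion poll: each response is 1 ("yes") or 0 ("no"),
# with an assumed true population proportion of 0.55
p <- 0.55
n <- 100

# Probability of seeing exactly 55 yeses in the sample:
dbinom(55, size = n, prob = p)

# Probability the sample proportion lands between 0.50 and 0.60:
sum(dbinom(50:60, size = n, prob = p))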

But the normal distribution still remains useful and powerful because of the central limit theorem.  Roughly speaking, if you take a random sample of size n from ANY distribution and add the numbers together, or take their average, the distribution of that sum or average will be approximately normal when n is large.
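A quick simulation makes this concrete.  The exponential distribution is strongly right-skewed, yet averages of samples from it look roughly bell-shaped -- a sketch:

```r
# CLT demo: averages of samples from a skewed (exponential) population
set.seed(1)                               # for reproducibility
means <- replicate(2000, mean(rexp(50)))  # 2000 sample means, n = 50 each

hist(means)                  # roughly bell-shaped, despite the skewed source
qqnorm(means); qqline(means) # points track the reference line
```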

Random Variables and their Expectations and Variances

A random variable is a mathematical model of this action:
reach into a population and take a (numerical) measurement.

Random variables are numerical variables whose values are determined by a random process.

Here's a very simple example:
Place a dollar bet that black will come up on a roulette spin.  If it does, you win $1; if not, you lose it.

X = 1 or -1.  These are the values.
Their probabilities are 18/38 and 20/38.  (An American wheel has 18 black, 18 red, and 2 green pockets.)

Probabilities are long-run frequencies, so the interpretation of these numbers is that if we play very many times, this is the fraction of times we'll see a 1 and a -1.  This is called the probability distribution of X.
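We can see this long-run interpretation by simulating many bets -- a sketch:

```r
# Simulate many roulette bets: X = 1 (win) with prob 18/38,
# X = -1 (lose) with prob 20/38
set.seed(1)   # for reproducibility
x <- sample(c(1, -1), size = 10000, replace = TRUE,
            prob = c(18/38, 20/38))

mean(x == 1)  # long-run fraction of wins, close to 18/38 (about 0.474)
```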

Note that X is NOT a number.  It is "both" numbers.  Strictly speaking, X is a function, not a "missing" or "unknown" value as it is in algebra.  Think of it as a verb: every time you see an "X", it means the roulette wheel has been spun and a bet placed.

We can plot this distribution.  And we can treat it like any other distribution and find its mean and SD.
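The mean and SD of X come straight from the values and their probabilities -- a minimal sketch:

```r
# Probability distribution of X for the roulette bet
vals  <- c(1, -1)
probs <- c(18/38, 20/38)

mu    <- sum(vals * probs)                 # mean (expected value): -2/38
sigma <- sqrt(sum((vals - mu)^2 * probs))  # SD

mu     # about -0.053: lose about a nickel per dollar bet, on average
sigma  # about 0.999

barplot(probs, names.arg = vals)  # plot of the distribution
```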