Lecture 1
Descriptive Statistics
A collection of objects has been measured on one or more variables.
Variables can be numerical, categorical, etc. The goal is to describe
the sample.
If all values are the same, there's no problem. But usually there's variability,
and the challenge is to summarize and represent that variability. Often
we have to answer over-simplified questions: did students do well on
a test? Are people in that part of town poor? Are people in this
city healthier than in another? Did women score higher than men?
These questions all require that we somehow reduce the variability to some
essence and make a comparison.
Overview, to be followed by details shown with Data
The fundamental tool is the distribution. A distribution is a function
that for each value of the variable, reports the frequency. In this
context, this is called the empirical distribution, because the frequencies
are the observed frequencies.
All analyses are about data, so we should take a moment to say what data
are:
data are numbers in context (but not always numbers).
The role of context is crucial. It tells us what the numbers *mean*,
and the meaning guides what we're looking for.
For example, which of these two distributions is heights and which is incomes?
(Draw a symmetric and right-skewed, with no labels.) Your knowledge
of the context guides you in what to expect for the distribution of a variable.
Every analysis should begin with a picture. Usually this means a picture
of the distribution of our sample, but it could also mean pictures of several
samples.
Pictures should show us the values of the variables and their frequencies
(counts) or relative frequencies (proportions). The most famous of
these is the histogram, but other popular choices are stemplots and dotplots.
The distributions can themselves be summarized with numbers. But to
do so requires knowing what features to look for. In general there
are two: center and spread. Center is the "typical" value of a distribution.
Spread is a measure of variability. A third, more difficult to quantify,
is the shape. This is vague, but means several things:
is it symmetric, or skewed?
unimodal?
interesting features
Later, we'll ask: does it look like a representative of
a known mathematical function?
Numerical Summaries
The center of a distribution is most commonly measured by the mean or the
median, also called the "sample mean" and the "sample median". The
sample mean is just the average, and it essentially gives the center of mass
of the distribution. The median is the value that sits right in the
middle if you were to sort the data. Visually, it cuts a distribution in
half.
Let's consider an example from Friday.
3. The number of issues of a particular monthly magazine read by 20 people
in a year:
0,1,11,0,0,0,2,12,0,0,
12,1,0,0,0,0,12,0,12,0
What can you tell me about the shape of this distribution? It already
looks a bit unusual. But there's nothing to prevent us from
calculating a mean and a median:
mean = 3.15
median = 0
The two are drastically different, and that's not so strange when you consider
that the distribution is bimodal. Sorted:
0 0 0 0 0 0 0 0 0 0
0 0 1 1 2 11 12 12 12 12
Sorting helps you see the distribution better. The context is again key:
it tells us what we should report -- and maybe neither figure is good.
It's probably better to report the percentages in each of the two categories,
"2 or fewer" and "11 or more".
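As a quick check of the two figures (a sketch in Python rather than the R used later in the lecture, since the arithmetic is the same), compute the mean and median of the sorted values above:

```python
import statistics

# The 20 sorted magazine counts from the example above.
data = [0] * 12 + [1, 1, 2, 11, 12, 12, 12, 12]

print(statistics.mean(data))    # → 3.15
print(statistics.median(data))  # → 0.0
```

The median is the average of the 10th and 11th sorted values, both 0, while the mean is pulled up by the handful of heavy readers.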
But still, what if we change the question slightly: I'm going to pick
a number at random from these numbers. You need to predict which
number I pick. The amount of money you win depends on how close
you get. Also, you will play this game many times, so you want a strategy
that will do well in the long run.
What strategies can you come up with?
A good strategy here is to choose 0: the most frequent response. But
this isn't necessarily a good strategy if the game penalizes you a lot
for errors. If the 11 does come up, it could wipe you out, for example.
Gauss was one who used what we now call a loss function to think about describing
distributions. Here's a loss function:
L(guess) = sum (data - guess)^2
This is called squared-error loss. You will be penalized a lot if
your guess is far from the observations, and only a little if it's close. You want
to choose the number that makes this loss as small as possible.
What number is this? It turns out it's the average.
If L(guess) = sum |data - guess|, then the best guess is the median.
This gives a new interpretation to the median and mean: they minimize their
loss functions. If we want to make a prediction, they each give
us the best prediction, in the sense that they come close to the known data --
depending on what we mean by "close".
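This claim is easy to check numerically (a Python sketch; the grid search over candidate guesses is just for illustration, not how you'd compute these in practice):

```python
import statistics

# The magazine counts, sorted.
data = [0] * 12 + [1, 1, 2, 11, 12, 12, 12, 12]

def squared_loss(guess):
    return sum((x - guess) ** 2 for x in data)

def absolute_loss(guess):
    return sum(abs(x - guess) for x in data)

# Try a fine grid of candidate guesses from 0 to 12.
grid = [i / 100 for i in range(1201)]
best_sq = min(grid, key=squared_loss)
best_abs = min(grid, key=absolute_loss)

print(best_sq)   # → 3.15, the mean: it minimizes squared-error loss
print(best_abs)  # → 0.0, the median: it minimizes absolute loss
```

So the choice of summary amounts to a choice of loss function: under squared-error loss the best single prediction is the mean, under absolute loss it's the median.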
The spread is measured by the SD/variance or by the IQR. The SD makes
more sense, I hope, after you've seen the LS context. If we think of
"x - xbar" as our "residual" or error, then the variance is the average squared
error. In any case, the SD tells us the amount of spread with respect
to the mean.
The IQR, on the other hand, tells us the distance occupied by the middle
50% of the data.
Resistance:
the median and IQR have the property of being "resistant", which
means, in effect, they're not strongly affected by outliers or skew.
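To see resistance in action, here is a Python sketch: `pstdev` computes the "average squared error" version of the SD, and note that quartile conventions vary from package to package, so other software may give a slightly different IQR. Adding one wild outlier shows which summaries move:

```python
import statistics

# The magazine counts, sorted.
data = [0] * 12 + [1, 1, 2, 11, 12, 12, 12, 12]

def iqr(xs):
    # Q3 - Q1; Python's default "exclusive" quartile method is
    # one of several conventions in use.
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

print(statistics.pstdev(data))  # SD: spread with respect to the mean
print(iqr(data))                # distance occupied by the middle 50%

# Now add a single huge outlier:
wild = data + [1000]
print(statistics.median(data), statistics.median(wild))  # doesn't move
print(statistics.mean(data), statistics.mean(wild))      # jumps dramatically
```

The median stays at 0 while the mean leaps from about 3 to over 50; the mean and SD chase the outlier, the median and IQR do not.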
Now let's look at some data.
For a start, load "songlist" into R. We're going to cheat and not look at
a picture.
Give me some numerical summaries about time. What's the typical
song-length? What's the range/spread of songlengths? Tell me
what you want to do, and I'll give you the R command.
I deliberately put off making a picture. What do you think the distribution
will look like?
Let's make a picture.
There's a fair amount of variation. Everything from 16 seconds to 17
minutes. The mean and median are both just under 4 minutes. (Note
that it's easy to convert from seconds to minutes!)
What accounts for this variation? What other variables might we explore?
***************
R commands
# load data, assuming that the file songlist.txt is in the current working directory.
music <- read.table("songlist.txt", header=T, sep="\t")
#if not in working directory, need to specify the full path name.
attach(music)
# attach lets you use the variable names contained in the "data table" named music.
names(music)
# to view the names of the variables
summary(time)
# gives summary statistics of the time variable.
table(genre)
#gives frequencies of the categorical variable genre.
hist(time)
times.class <- time[genre=="Classical"]
#create new vector containing only times of classical tracks
times.rock <- time[genre=="Rock"]
summary(times.class)
summary(times.rock)
boxplot(times.class, times.rock)