Lecture 1
Descriptive Statistics
A collection of objects has been measured on one or more variables.
Variables can be numerical, categorical, etc. The goal is to describe
the sample.
If all values are the same, there's no problem. But usually there's variability,
and the challenge is to summarize and represent that variability. Often
we have to answer over-simplified questions: did students do well on
a test? Are people in that part of town poor? Are people in this
city healthier than in another? Did women score higher than men?
These questions all require that we somehow reduce the variability to some
essence and make a comparison.
Overview, to be followed by details shown with Data
The fundamental tool is the distribution. A distribution is a function
that for each value of the variable, reports the frequency. In this
context, this is called the empirical distribution, because the frequencies
are the observed frequencies.
All analyses are about data, so we should take a moment to say what data
are:
data are numbers in context (but not always numbers).
The role of context is crucial. It tells us what the numbers *mean*,
and the meaning guides what we're looking for.
For example, which of these two distributions is heights and which is incomes?
(Draw a symmetric and right-skewed, with no labels.) Your knowledge
of the context guides you in what to expect for the distribution of a variable.
Every analysis should begin with a picture. Usually this means a picture
of the distribution of our sample, but it could also mean pictures of several
samples.
Pictures should show us the values of the variables and their frequencies
(counts) or relative frequencies (proportions). The most famous of
these is the histogram, but other popular choices are stemplots and dotplots.
The distributions can themselves be summarized with numbers. But to
do so requires knowing what features to look for. In general there
are two: center and spread. Center is the "typical" value of a distribution.
Spread is a measure of variability. A third, more difficult to quantify,
is the shape. This is vague, but means several things:
is it symmetric, or skewed?
unimodal?
interesting features
Later, we'll ask: does it look like a representative of
a known mathematical function?
Numerical Summaries
The center of a distribution is most commonly measured by the mean or the
median, also called the "sample mean" and the "sample median". The
sample mean is just the average, and it essentially gives the center of mass
of the distribution. The median is the value that sits right in the
middle if you were to sort the data. Visually, it cuts a distribution in
half.
Let's consider an example from Friday.
3. The number of issues of a particular monthly magazine read by 20 people
in a year:
0,1,11,0,0,0,2,12,0,0,
12,1,0,0,0,0,12,0,12,0
What can you tell me about the shape of this distribution? It already
looks a bit unusual. But there's nothing to prevent us from
calculating a mean and a median:
mean = 3.15
median = 0
The two are drastically different, and that's not so strange when you consider
that the distribution is bimodal. Sorted:
0 0 0 0 0 0 0 0 0 0
0 0 1 1 2 11 12 12 12 12
Sorting helps you see the distribution better. The context is again key:
it tells us what we should report -- and maybe neither figure is good.
It's probably better to report the percentages in each of the two categories,
"2 or fewer" and "11 or more".
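As a quick check of the two figures (a sketch in Python rather than the R used later in the lecture, since the arithmetic is the same), compute the mean and median of the sorted values above:

```python
import statistics

# The 20 sorted magazine counts from the example above.
data = [0] * 12 + [1, 1, 2, 11, 12, 12, 12, 12]

print(statistics.mean(data))    # → 3.15
print(statistics.median(data))  # → 0.0
```

The median is the average of the 10th and 11th sorted values, both 0, while the mean is pulled up by the handful of heavy readers.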
But still, what if we change the question slightly: I'm going to pick
a number at random from these numbers. You need to predict which
number I pick. The amount of money you win depends on how close
you get. Also, you will play this game many times, so you want a strategy
that will do well in the long run.
What strategies can you come up with?
A good strategy here is to choose 0: the most frequent response. But
this isn't necessarily a good strategy if the game penalizes you a lot
for errors. If the 11 does come up, it could wipe you out, for example.
Gauss was one who used what we now call a loss function to think about describing
distributions. Here's a loss function:
L(guess) = sum (data - guess)^2
This is called squared-error loss. You will be penalized a lot if
your guess is far from the observations, and only a little if it's close. You want
to choose the number that makes this loss as small as possible.
What number is this? It turns out it's the average.
If L(guess) = sum |data - guess|, then the best guess is the median.
This gives a new interpretation to the median and mean: they minimize their
loss functions. If we want to make a prediction, they each give
us the best prediction, in the sense that they come close to the known data --
depending on what we mean by "close".
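This claim is easy to check numerically (a Python sketch; the grid search over candidate guesses is just for illustration, not how you'd compute these in practice):

```python
import statistics

# The magazine counts, sorted.
data = [0] * 12 + [1, 1, 2, 11, 12, 12, 12, 12]

def squared_loss(guess):
    return sum((x - guess) ** 2 for x in data)

def absolute_loss(guess):
    return sum(abs(x - guess) for x in data)

# Try a fine grid of candidate guesses from 0 to 12.
grid = [i / 100 for i in range(1201)]
best_sq = min(grid, key=squared_loss)
best_abs = min(grid, key=absolute_loss)

print(best_sq)   # → 3.15, the mean: it minimizes squared-error loss
print(best_abs)  # → 0.0, the median: it minimizes absolute loss
```

So the choice of summary amounts to a choice of loss function: under squared-error loss the best single prediction is the mean, under absolute loss it's the median.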
The spread is measured by the SD/variance or by the IQR. The SD makes
more sense, I hope, after you've seen the LS context. If we think of
"x - xbar" as our "residual" or error, then the variance is the average squared
error. In any case, the SD tells us the amount of spread with respect
to the mean.
The IQR, on the other hand, tells us the distance occupied by the middle
50% of the data.
Resistance:
the median and IQR have the property of being "resistant", which
means, in effect, they're not strongly affected by outliers or skew.
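To see resistance in action, here is a Python sketch: `pstdev` computes the "average squared error" version of the SD, and note that quartile conventions vary from package to package, so other software may give a slightly different IQR. Adding one wild outlier shows which summaries move:

```python
import statistics

# The magazine counts, sorted.
data = [0] * 12 + [1, 1, 2, 11, 12, 12, 12, 12]

def iqr(xs):
    # Q3 - Q1; Python's default "exclusive" quartile method is
    # one of several conventions in use.
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

print(statistics.pstdev(data))  # SD: spread with respect to the mean
print(iqr(data))                # distance occupied by the middle 50%

# Now add a single huge outlier:
wild = data + [1000]
print(statistics.median(data), statistics.median(wild))  # doesn't move
print(statistics.mean(data), statistics.mean(wild))      # jumps dramatically
```

The median stays at 0 while the mean leaps from about 3 to over 50; the mean and SD chase the outlier, the median and IQR do not.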
Now let's look at some data.
For a start, load "songlist" into R. We're going to cheat and not look at
a picture.
Give me some numerical summaries about time. What's the typical
song-length? What's the range/spread of songlengths? Tell me
what you want to do, and I'll give you the R command.
I deliberately put off making a picture. What do you think the distribution
will look like?
Let's make a picture.
There's a fair amount of variation. Everything from 16 seconds to 17
minutes. The mean and median are both just under 4 minutes. (Note
that it's easy to convert from seconds to minutes!)
What accounts for this variation? What other variables might we explore?
***************
R commands
# load data, assuming that the file songlist.txt is in the current working directory.
music <- read.table("songlist.txt", header=T, sep="\t")
#if not in working directory, need to specify the full path name.
attach(music)
# attach lets you use the variable names contained in the "data table" named music.
names(music)
# to view the names of the variables
summary(time)
# gives summary statistics of the time variable.
table(genre)
#gives frequencies of the categorical variable genre.
hist(time)
times.class <- time[genre=="Classical"]
#create new vector containing only times of classical tracks
times.rock <- time[genre=="Rock"]
summary(times.class)
summary(times.rock)
boxplot(times.class, times.rock)