Statistics M11/Economics 40 Lecture 5

DESCRIBING DATA NUMERICALLY (2.3, 2.7)

A. Overview

The two numbers usually used to summarize a distribution are the "center" [the "typical" value] and the "spread" [how close together or far apart the data are].

B. Reviewing "Center"

1. Median: the median is the "middle number" of a sorted list
e.g. 8.2, 9.3, 10.1, 11.5, 12.7
The median is 10.1
2. Mean: the mean is $\bar{x} = (\sum x_i)/n$ <--- recall what this means
e.g. 8.2, 9.3, 10.1, 11.5, 12.7 the mean is 10.36
3. Outliers: suppose these are the returns instead
e.g. 0.2, 9.3, 10.1, 11.5, 12.7 the median is still 10.1 but the mean has changed to 8.76
(you could stemplot this to see what I mean by outlier)
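
If you want to check the numbers above yourself, here is a minimal sketch using Python's standard `statistics` module. The variable names `returns` and `with_outlier` are just illustrative labels for the two lists in this section.

```python
from statistics import mean, median

# The two lists from this section; only the first value differs.
returns = [8.2, 9.3, 10.1, 11.5, 12.7]
with_outlier = [0.2, 9.3, 10.1, 11.5, 12.7]

for label, data in [("original", returns), ("with outlier", with_outlier)]:
    # The median looks only at the middle of the sorted list,
    # while the mean uses every value, including the outlier.
    print(f"{label}: median = {median(data):.2f}, mean = {mean(data):.2f}")
```

The median is the same for both lists, while the mean drops once the outlier is included.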

C. "Spread"

1. Minimum, Maximum, Range, Percentiles and the IQR

a. The lowest and highest values of a variable are the minimum and maximum. The range is maximum-minimum.
b. Definition: a number _y_ is the nth PERCENTILE for the data if n% of the data are less than or equal to _y_.
c. The QUARTILES are the 25th, 50th, and 75th percentiles and are denoted Q1, Q2, and Q3. (Another name for Q2 is the median.)
d. The INTER-QUARTILE RANGE, or IQR, is defined as IQR = Q3 - Q1; the IQR measures the spread of the middle 50% of the data.
e. BOXPLOT: another way of graphically summarizing data. It uses the Min, Max, Median, Q1, and Q3. Know how to recognize one and how to construct one. (p.43)
THOSE ARE NICE ROBUST MEASURES (i.e. relatively resistant to extreme observations), GOOD FOR GETTING AN IDEA OF WHAT THE DISTRIBUTION LOOKS LIKE (e.g. center, spread),
BUT NOT TOO COMMONLY USED FOR ANALYSIS. Instead...

2. The Standard Deviation (SD) (2.7)

a. The usual measure of spread is the STANDARD DEVIATION, written as SD or as a lowercase "s".
b. The SD is defined as follows: given a list of _n_ numbers x₁, x₂, ... , x_n,
$$
s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}}
  = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}
$$

where $\bar{x}$ is the average of the _n_ numbers.


	    An equivalent, easier formula for the SD is

$$
s = \sqrt{\frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n}}
$$

Example: Here are a first-year MBA student's grades:

93, 90, 81, 80, 77

The sum of these is 421
n = 5 (she took 5 courses)
the mean ($\bar{x}$) is 84.2 (421 divided by 5)
the SD is:
calculate the first chunk
(93² + 90² + 81² + 80² + 77²)=(8649 + 8100 + 6561 + 6400 + 5929) = 35639
calculate the second chunk
421² divided by 5 = 35,448.2
subtract the second chunk from the first
35639 - 35448.2 = 190.80
divide the result by 5 (i.e. n):
190.80 divided by 5 = 38.16
and take the square root:
SQRT(38.16) = 6.1774

A Table may help you to see what is going on:

Original Score    Deviation from Average    Squared Deviation from Average
93                 8.8                      77.44
90                 5.8                      33.64
81                -3.2                      10.24
80                -4.2                      17.64
77                -7.2                      51.84
Sum = 421         Sum = 0                   Sum = 190.80
Average = 84.2                              SD = SQRT(190.80/5) = 6.1774

c. Note that the standard deviation is in the same units as the data. In her case, it's points. This is why the standard deviation involves the square root: it allows for easier interpretation than, for example, points-squared.
d. The SD measures how close the numbers in the list are to the average. Not all numbers are equal to the mean, and the SD is a measure of the "average" distance between each point and the average. The SD is tied to the mean, so people usually talk about them together.
Remember the 2 funds A & B. The SD for A is 1.5895; for B it is 38.1657.
e. Another thing about the standard deviation: for many datasets, about 68% of the entries on a list will fall within one SD of the average, and about 95% will fall within two SDs of the average. But we'll talk about this some more after the midterm.
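
The two SD formulas can be checked against each other on the MBA student's grades. This is only a sketch; the names `sd_definition` and `sd_shortcut` are made up here to label the two versions.

```python
from math import sqrt

grades = [93, 90, 81, 80, 77]
n = len(grades)
mean = sum(grades) / n  # 421 / 5 = 84.2

# Definition: square root of the average squared deviation from the mean.
sd_definition = sqrt(sum((x - mean) ** 2 for x in grades) / n)

# Shortcut: needs only the sum and the sum of squares.
sum_x = sum(grades)                   # 421
sum_x2 = sum(x * x for x in grades)   # 35639
sd_shortcut = sqrt((sum_x2 - sum_x ** 2 / n) / n)

print(f"mean = {mean:.1f}")
print(f"SD (definition) = {sd_definition:.4f}")
print(f"SD (shortcut)   = {sd_shortcut:.4f}")
```

Both versions divide by n, as in these notes. If you use a library instead, note that Python's `statistics.pstdev` divides by n, while `statistics.stdev` divides by n - 1 and will give a slightly larger number.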

D. Numerical Examples

1. Given the list 1,2,3,4,5,6,7,8,9,10,11,12:

a. n=12
b. median = 6.5; Q1 = 3.5, Q3 = 9.5
c. mean = 78/12 = 6.5
d. range = 12-1 = 11
e. IQR = Q3-Q1 = 9.5-3.5 = 6
f. $\sum (x_i - \bar{x})^2 = 143$; $\sum x_i = 78$; $\sum x_i^2 = 650$;
SD = sqrt(143/12) = sqrt((650-78²/12)/12) = 3.4521

2. Given the list 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22

a. n=12
b. median = 16.5; Q1 = 13.5, Q3 = 19.5
c. mean = 198/12 = 16.5
d. range = 22-11 = 11
e. IQR = Q3-Q1 = 19.5-13.5 = 6
f. $\sum (x_i - \bar{x})^2 = 143$; $\sum x_i = 198$; $\sum x_i^2 = 3410$; SD = sqrt(143/12) = sqrt((3410-198²/12)/12) = 3.4521
NOTE: The mean, median, and quartiles increased by 10. The range, IQR and SD stayed the same.
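
The quartiles in examples 1 and 2 can be reproduced with a short sketch. It assumes the "median of each half" convention for Q1 and Q3, which matches the values quoted above; textbooks and software packages use several slightly different conventions, so other tools may give other answers. The function name `quartiles` is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) by the median-of-halves rule (needs at least 2 values)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # smallest half of the sorted list
    upper = data[len(data) - half:]   # largest half of the sorted list
    # For an odd number of values, the middle value is left out of both halves.
    return median(lower), median(data), median(upper)

list_1 = list(range(1, 13))    # 1, 2, ..., 12
list_2 = list(range(11, 23))   # 11, 12, ..., 22

for name, data in [("list 1", list_1), ("list 2", list_2)]:
    q1, q2, q3 = quartiles(data)
    print(f"{name}: Q1 = {q1}, median = {q2}, Q3 = {q3}, "
          f"range = {max(data) - min(data)}, IQR = {q3 - q1}")
```

The same function can be pointed at the list in example 3 once you have worked it by hand.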

3. Given the list 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36

a. Try this at home.

E. Properties of center and spread

1. Suppose we are given a list of numbers
2. Suppose we construct a new list by adding a constant "a" to each number in the old list.
a. Picture: shifted histogram.
b. The median and the mean both go up by a.
c. The range, IQR, and SD are unchanged.
3. Suppose we construct a new list by multiplying each number by some constant "b".
a. Picture: stretched histogram
b. The median and the mean are both multiplied by b.
c. The range, IQR, and SD are also multiplied by b (by |b| if b is negative, since a spread cannot be negative).
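
These properties can be seen numerically. The sketch below reuses the MBA student's grades from section C; the constants `a = 10` and `b = 2` are arbitrary choices for illustration, and `summary` is a hypothetical helper. The IQR is left out to keep the example short.

```python
from statistics import mean, median, pstdev

def summary(values):
    """Return (mean, median, range, SD), with the divide-by-n SD."""
    return mean(values), median(values), max(values) - min(values), pstdev(values)

a, b = 10, 2                          # illustrative constants
base = [93, 90, 81, 80, 77]           # grades from section C
shifted = [x + a for x in base]       # add a to every number
scaled = [x * b for x in base]        # multiply every number by b

for label, data in [("original", base), ("plus a", shifted), ("times b", scaled)]:
    m, med, rng, sd = summary(data)
    print(f"{label:9s} mean={m:.2f} median={med:.2f} range={rng} SD={sd:.4f}")
```

Adding a moves the mean and median but leaves the range and SD alone; multiplying by b changes all four.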

Some well known transformed scores:

Mean    SD     Scale Name
500     100    SAT; GRE; GMAT
20      5      ACT
100     15     Wechsler IQ Test
100     16     Stanford-Binet IQ Test
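
One way to read this table is through the shift-and-stretch rules in section E: a list can be put on any chosen mean and SD by a linear change of scale. The sketch below assumes exactly that simple linear rescaling; the actual scoring procedures for these tests are more involved, and the function name `rescale` is hypothetical.

```python
from statistics import mean, pstdev

def rescale(values, target_mean, target_sd):
    """Linearly map values so the result has the given mean and (divide-by-n) SD."""
    m, s = mean(values), pstdev(values)
    # Subtract the old mean, divide by the old SD, then stretch and shift.
    return [target_mean + target_sd * (x - m) / s for x in values]

raw = [93, 90, 81, 80, 77]         # the MBA grades from section C
scaled = rescale(raw, 500, 100)    # put them on a 500/100 scale

print([round(x) for x in scaled])
print(round(mean(scaled), 6), round(pstdev(scaled), 6))
```

The order of the scores and their relative distances from the average are unchanged; only the units are different.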

F. HOMEWORK (due 10/15/99)

p. 63 #6 (all), p.75 #12