Statistics 50 Lecture 5

Statistics 10/50
Lecture 5

DESCRIBING DATA NUMERICALLY (4.5-4.6)

A. Overview

The usual two numbers summarizing a distribution are the "center" [the "typical" value] and the "spread" [how close the data are to each other].

B. "Spread"

Minimum, Maximum, Range, and Percentiles
a. The lowest and highest values of a variable are the minimum and maximum. The range is maximum-minimum.
b. Definition: a number y is the nth PERCENTILE for the data if n% of the data are less than or equal to y.
c. The QUARTILES are the 25th, 50th, and 75th percentiles and are denoted Q1, Q2, and Q3. (Another name for Q2 is the median.)
NOTE: These are robust measures (i.e., relatively resistant to extreme observations) that are good for getting an idea of what the distribution looks like, but they are not commonly used for analysis. The standard deviation is used instead.
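To make the definitions above concrete, here is a minimal Python sketch. The list `scores` and the helper name `percent_at_or_below` are made up for illustration; the helper simply applies the percentile definition in (b) by counting how much of the data is less than or equal to y.

```python
def percent_at_or_below(data, y):
    """Percent of the values in data that are less than or equal to y."""
    return 100 * sum(1 for x in data if x <= y) / len(data)


scores = [2, 4, 4, 5, 7, 8, 9, 10]  # hypothetical list, not from the lecture

low, high = min(scores), max(scores)   # minimum and maximum
value_range = high - low               # range = maximum - minimum

# 4 of the 8 values are <= 5, so by definition (b) 5 is a 50th percentile.
pct = percent_at_or_below(scores, 5)
print(low, high, value_range, pct)
```

Statistical software often interpolates between data values when it reports percentiles, so its answers can differ slightly from this counting definition.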

The Standard Deviation (SD)

The Standard Deviation may be thought of as the average size of the deviations of the individual members of a list from the average of the list.
In general, most numbers in a list are within one standard deviation of the average. A few numbers will be between one and two standard deviations away, and very few will deviate beyond that.
a. STANDARD DEVIATION is abbreviated as SD or as a lowercase "s".
b. The SD is defined as follows: given a list of n numbers x₁, x₂, ... , x_n,

\[
s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}}
  = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}
\]

where \bar{x} (x-bar) is the average of the n numbers.


	    An equivalent, easier formula for the SD is

\[
s = \sqrt{\frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n}}
\]
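For readers who want to see why the two formulas agree, here is a short derivation sketch. It uses only the definition of the average, x-bar = (sum x_i)/n, and expands the squared deviations:

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x} \sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - \frac{2\left(\sum x_i\right)^2}{n} + \frac{\left(\sum x_i\right)^2}{n} \\
  &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}
\end{aligned}
```

Dividing both sides by n and taking the square root gives the second formula for s.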

Example: Remember the First-year Law Student's grades?

93, 90, 81, 80, 77

The sum of these is 421

n = 5 (she took 5 courses)

the mean (x-bar) is 84.2 (421 divided by 5)

the SD is:

calculate the first chunk
(93² + 90² + 81² + 80² + 77²)=(8649 + 8100 + 6561 + 6400 + 5929) = 35639

calculate the second chunk
421² divided by 5 = 177241 divided by 5 = 35448.2

subtract the second chunk from the first
35639 - 35448.2 = 190.80

divide the result by 5 (i.e. n):
190.80 divided by 5 = 38.16

and take the square root:
SQRT(38.16) = 6.1774

A Table may help you see what is going on:

Original Score    Deviation from Average    Squared Deviation from Average
93                 8.8                      77.44
90                 5.8                      33.64
81                -3.2                      10.24
80                -4.2                      17.64
77                -7.2                      51.84
Sum = 421         Sum = 0                   Sum = 190.80
Average = 84.2                              SD = SQRT(190.80/5) = 6.1774
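The same calculation can be checked in a few lines of Python. This is only a sketch of the arithmetic above; the variable names are chosen for illustration. It computes the SD both from the definition and from the shortcut formula, dividing by n in each case as these notes do.

```python
import math

grades = [93, 90, 81, 80, 77]
n = len(grades)
mean = sum(grades) / n  # 421 / 5

# Definition: square root of the average squared deviation (dividing by n).
squared_deviations = [(x - mean) ** 2 for x in grades]
sd_definition = math.sqrt(sum(squared_deviations) / n)

# Shortcut formula: uses only the sum of x and the sum of x squared.
sum_x = sum(grades)
sum_x2 = sum(x * x for x in grades)
sd_shortcut = math.sqrt((sum_x2 - sum_x ** 2 / n) / n)

# Print one row per score: score, deviation, squared deviation.
for x, sq in zip(grades, squared_deviations):
    print(f"{x:>5} {x - mean:>8.1f} {sq:>8.2f}")
print(f"mean = {mean:.1f}, SD = {sd_definition:.4f}, shortcut SD = {sd_shortcut:.4f}")
```

If you use Python's standard library instead, note that `statistics.pstdev` divides by n (matching these notes), while `statistics.stdev` divides by n - 1 and gives a slightly larger answer.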

c. Note that the standard deviation is in the same units as the data. In her case, that is points. This is why the standard deviation involves the square root: it is easier to interpret than points-squared, for example.
d. As with the mean, it is not necessary to know HOW MANY items are in a list when computing a standard deviation, only the relative frequency of the values in the list.
e. The SD measures how close the numbers in the list are to the average. Not all numbers are equal to the mean, and the SD is a measure of the "average" distance between each point and the average.
f. For many datasets, roughly 68% of the entries on a list will fall within one SD of the average, and roughly 95% will fall within two SDs of the average. We'll talk about this more in Chapter 5.
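As a rough check of point (f), the sketch below counts what percent of the law student's grades fall within one and two SDs of the average. The helper name `percent_within` is an illustrative choice. With only five numbers the percentages cannot come out at exactly 68% and 95%, so treat this as a demonstration of the counting, not of the rule itself.

```python
import statistics


def percent_within(data, k):
    """Percent of the values in data lying within k SDs of the average."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # divides by n, as in these notes
    inside = [x for x in data if abs(x - mean) <= k * sd]
    return 100 * len(inside) / len(data)


grades = [93, 90, 81, 80, 77]
print(percent_within(grades, 1))  # 90, 81, and 80 are within one SD of 84.2
print(percent_within(grades, 2))  # all five grades are within two SDs
```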

Numerical Examples

    A.  Given the list 1,2,3,4,5,6,7,8,9,10,11,12:

        a.  n=12
        b.  median = 6.5; Q1 = 3.5, Q3 = 9.5
        c.  mean = 78/12 = 6.5

        d.  range = 12-1 = 11
        e.  sum (x-xbar)² = 143; sum x = 78, sum x² = 650;
            SD = sqrt(143/12) = sqrt((650-78²/12)/12) = 3.452

    B.  Given the list 1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,
                       10,10,11,11,12,12

        a.  n=24
        b.  median = 6.5; Q1 = 3.5, Q3 = 9.5 
        c.  mean = 156/24 = 6.5

        d.  range = 12-1 = 11
        e.  sum (x-xbar)² = 286; sum x = 156, sum x² = 1300;
            SD = sqrt(286/24) = sqrt((1300-156²/24)/24) = 3.452

    NOTE: the median, quartiles, mean, range, and SD all stayed
          the same, even though n doubled.  What matters here is
          the relative frequency of each value.
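The two examples can be reproduced with the sketch below. One assumption to flag: the notes do not say how the quartiles are computed, so the hypothetical helper `quartiles` takes Q1 and Q3 to be the medians of the lower and upper halves of the sorted list (leaving out the middle value when n is odd). That convention gives the values quoted above; other conventions can give slightly different quartiles.

```python
import statistics


def quartiles(data):
    """Return (Q1, median, Q3), taking Q1 and Q3 as medians of the two halves.

    Assumes the list has at least two values.
    """
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]    # smallest half of the data
    upper = values[-half:]   # largest half of the data
    return (statistics.median(lower),
            statistics.median(values),
            statistics.median(upper))


def summary(data):
    q1, median, q3 = quartiles(data)
    return {
        "median": median,
        "Q1": q1,
        "Q3": q3,
        "mean": statistics.mean(data),
        "range": max(data) - min(data),
        "sd": round(statistics.pstdev(data), 3),  # SD dividing by n
    }


list_a = list(range(1, 13))   # 1, 2, ..., 12
list_b = sorted(list_a * 2)   # each value appears twice

print(summary(list_a))
print(summary(list_b))
```

Both calls print the same summary, because doubling every value's count leaves the relative frequencies unchanged.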

Properties of center and spread
A. Suppose we are given a list of numbers x₁, x₂, ... , x_n
B. Suppose we construct a new list by adding a constant "a" to each number in the old list.
a. Picture: shifted histogram.
b. The median and the mean both go up by a.
c. The range and SD are unchanged!

C. Suppose we construct a new list by multiplying each number x_i by some constant "b".
a. Picture: stretched histogram
b. The median and the mean are both multiplied by b.
c. The range and SD are also multiplied by b (by the absolute value of b, if b is negative).
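These properties are easy to check numerically. The sketch below reuses the law student's grades; the constants a = 5 and b = 2 are arbitrary values picked for illustration.

```python
import math
import statistics


def center_and_spread(data):
    """Return (mean, median, range, SD), with the SD dividing by n."""
    return (statistics.mean(data),
            statistics.median(data),
            max(data) - min(data),
            statistics.pstdev(data))


grades = [93, 90, 81, 80, 77]
a, b = 5, 2  # arbitrary constants for the demonstration

mean, median, rng, sd = center_and_spread(grades)
shifted = center_and_spread([x + a for x in grades])
scaled = center_and_spread([x * b for x in grades])

# Adding a: the center moves up by a, the spread is unchanged.
print(math.isclose(shifted[0], mean + a), shifted[1] == median + a)
print(shifted[2] == rng, math.isclose(shifted[3], sd))

# Multiplying by b: center and spread are all multiplied by b.
print(math.isclose(scaled[0], mean * b), scaled[1] == median * b)
print(scaled[2] == rng * b, math.isclose(scaled[3], sd * b))
```

Every comparison prints True for this list.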
Summary of definitions for "spread"
A. Range: the largest value less the smallest value
B. Standard Deviation (SD): s = sqrt((sum[x_i - xbar]²)/n)
= sqrt((sum x_i² - ((sum x_i)²/n))/n)

C. Problem Set 1 Due Oct 16, 1998 in Lecture

Chapter 2 Exercise Set A: 7, 14 (on page 22 and on page 24)
Chapter 2 Review Exercises: 2, 4, 9, and 11 (on pages 24-27)
Chapter 3 Review Exercises: 4a, 4c, 4d, 7, 8a, 8c (on pages 51-53)
Chapter 4 Exercise Set B: 1, 2 (p. 65)
Chapter 4 Review Exercises: 1, 4, 6a, 6b, 7 (pp. 74-75)

Last Update: 11 October 1998 by VXL
