Statistics 10/50
Lecture 5


DESCRIBING DATA NUMERICALLY (4.5-4.6)

A. Overview

The usual two numbers summarizing a distribution are the "center" [the "typical" value] and the "spread" [how close the data are to each other].

B. "Spread"

  1. Minimum, Maximum, Range, and Percentiles
    a. The lowest and highest values of a variable are the minimum and maximum. The range is maximum-minimum.

    b. Definition: a number y is the nth PERCENTILE for the data if n% of the data are less than or equal to y.

    c. The QUARTILES are the 25th, 50th, and 75th percentiles and are denoted Q1, Q2, and Q3. (Another name for Q2 is the median.)
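
The percentile definition above can be checked directly by counting. The sketch below is only an illustration: the helper name `percent_at_or_below` and the list `scores` are made up for this example and are not part of the course material.

```python
def percent_at_or_below(data, y):
    """Return the percentage of values in data that are <= y."""
    count = sum(1 for value in data if value <= y)
    return 100 * count / len(data)


# Illustrative list (an assumption for this sketch): the integers 1 through 12.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

print(percent_at_or_below(scores, 3.5))  # 25.0, so 3.5 qualifies as Q1
print(percent_at_or_below(scores, 6.5))  # 50.0, so 6.5 qualifies as Q2 (the median)
print(percent_at_or_below(scores, 9.5))  # 75.0, so 9.5 qualifies as Q3
```

Under this definition a percentile need not be unique: any y from 3 up to (but not including) 4 has exactly 25% of this list at or below it.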

    PERCENTILES AND QUARTILES ARE NICE ROBUST MEASURES (i.e., relatively resistant to extreme observations), GOOD FOR GETTING AN IDEA OF WHAT THE DISTRIBUTION LOOKS LIKE,

    BUT THEY ARE NOT COMMONLY USED FOR ANALYSIS. INSTEAD...

  2. The Standard Deviation (SD)
    The Standard Deviation may be thought of as the average size of the deviations of the individual members of a list from the average of the list.

    In general, most numbers in a list are within one standard deviation of the average. A few numbers will be between one and two standard deviations away, and very few will deviate beyond that.

    a. STANDARD DEVIATION is abbreviated as SD or as a lowercase "s".

    b. The SD is defined as follows: given a list of n numbers x1, x2, ... , xn,

    \[ s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}}
         = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]

    where \bar{x} is the average of the n numbers.
    
    
    	    An equivalent, easier formula for the SD is
    
    \[ s = \sqrt{\frac{\sum x_i^2 - (\sum x_i)^2 / n}{n}} \]
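
To see why the two formulas agree, expand the squared deviations and use the fact that x-bar is (sum of the xi)/n. This is a short derivation added for clarity, not taken from the text:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x} \sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - 2\,\frac{(\sum x_i)^2}{n} + \frac{(\sum x_i)^2}{n} \\
  &= \sum x_i^2 - \frac{(\sum x_i)^2}{n}
\end{align*}
```

Dividing both sides by n and taking the square root gives the easier formula.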
    
    
    
    Example: Remember the First-year Law Student's grades?

    93, 90, 81, 80, 77

    The sum of these is 421

    n = 5 (she took 5 courses)

    the mean (x-bar) is 84.2 (421 divided by 5)

    the SD is:

    calculate the first chunk (the sum of the squared grades)
    (93^2 + 90^2 + 81^2 + 80^2 + 77^2) = (8649 + 8100 + 6561 + 6400 + 5929) = 35639

    calculate the second chunk (the square of the sum, divided by n)
    421^2 divided by 5 = 177241 divided by 5 = 35448.2

    subtract the second chunk from the first
    35639 - 35448.2 = 190.80

    divide the result by 5 (i.e. n):
    190.80 divided by 5 = 38.16

    and take the square root:
    SQRT(38.16) = 6.1774

    A Table may help you see what is going on:
    Original Score   Deviation from Average   Squared Deviation from Average
    93                8.8                     77.44
    90                5.8                     33.64
    81               -3.2                     10.24
    80               -4.2                     17.64
    77               -7.2                     51.84
    Sum = 421        Sum = 0                  Sum = 190.80

    Average = 84.2   SD = SQRT(190.80/5) = 6.1774
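
The same calculation can be sketched in Python. Both functions divide by n, as in the lecture's definition; the function names `sd_definition` and `sd_shortcut` are made up for this illustration.

```python
import math


def sd_definition(values):
    """SD from the definition: root of the average squared deviation (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def sd_shortcut(values):
    """SD from the shortcut formula: sqrt((sum x^2 - (sum x)^2 / n) / n)."""
    n = len(values)
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return math.sqrt((total_of_squares - total ** 2 / n) / n)


grades = [93, 90, 81, 80, 77]

print(round(sd_definition(grades), 4))  # 6.1774
print(round(sd_shortcut(grades), 4))    # 6.1774
```

Python's standard-library `statistics.pstdev` also divides by n and can serve as a cross-check; `statistics.stdev` divides by n - 1 and would give a different answer.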

    c. Note that the standard deviation is in the same units as the data; in her case, points. This is why the standard deviation involves the square root: it is easier to interpret than a quantity measured in points-squared.

    d. As with the mean, it is not necessary to know HOW MANY items are in a list when computing a standard deviation, only the relative frequency of the values in the list.

    e. The SD measures how close the numbers in the list are to the average. Unless every number equals the mean, the SD is positive; it is a measure of the "average" distance between each point and the average.

    f. One more property of the standard deviation: for many datasets, roughly 68% of the entries on a list fall within one SD of the average, and roughly 95% fall within two SDs of the average. We'll talk about this more in Chapter 5.
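
As a rough check of the "within one SD / within two SDs" idea, the sketch below counts how many of the law student's grades fall inside each band. The helper name `fraction_within` is hypothetical, and the SD again uses divisor n.

```python
import math


def fraction_within(values, k):
    """Fraction of values lying within k SDs of the mean (SD uses divisor n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(1 for x in values if abs(x - mean) <= k * sd) / n


grades = [93, 90, 81, 80, 77]

print(fraction_within(grades, 1))  # 0.6 (the grades 90, 81, and 80)
print(fraction_within(grades, 2))  # 1.0 (all five grades)
```

With only five numbers the fractions cannot land exactly on 68% and 95%; those figures are approximations that describe many larger datasets.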

  3. Numerical Examples
        A.  Given the list 1,2,3,4,5,6,7,8,9,10,11,12:
    
            a.  n=12
            b.  median = 6.5; Q1 = 3.5, Q3 = 9.5
            c.  mean = 78/12 = 6.5
    
            d.  range = 12-1 = 11
            e.  sum (x-xbar)^2 = 143; sum x = 78, sum x^2 = 650;
                SD = sqrt(143/12) = sqrt((650-78^2/12)/12) = 3.452
    
        B.  Given the list 1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,
                           10,10,11,11,12,12
    
            a.  n=24
            b.  median = 6.5; Q1 = 3.5, Q3 = 9.5
            c.  mean = 156/24 = 6.5

            d.  range = 12-1 = 11
            e.  sum (x-xbar)^2 = 286; sum x = 156, sum x^2 = 1300;
                SD = sqrt(286/24) = sqrt((1300-156^2/24)/24) = 3.452
    
        NOTE: every summary (median, quartiles, mean, range, SD)
              stayed the same even though n doubled.  What matters
              here is the relative frequency of a value.
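
The two examples can be reproduced with a short script. This is a sketch under one stated assumption: the lecture does not say which quartile convention it uses, so the code takes Q1 and Q3 to be the medians of the lower and upper halves of the sorted list, which matches the values 3.5 and 9.5 above. All function names are made up for this illustration.

```python
import math
from collections import Counter


def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def summarize(values):
    """Median, quartiles, mean, range, and SD (divisor n) of a list."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    # Assumed convention: quartiles are the medians of the two halves.
    lower, upper = xs[:half], xs[n - half:]
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return {
        "n": n,
        "median": median(xs),
        "Q1": median(lower),
        "Q3": median(upper),
        "mean": mean,
        "range": xs[-1] - xs[0],
        "SD": round(sd, 3),
    }


def sd_from_relative_frequencies(values):
    """SD computed only from each value's relative frequency, never from n."""
    n = len(values)
    freqs = {v: c / n for v, c in Counter(values).items()}
    mean = sum(p * v for v, p in freqs.items())
    return math.sqrt(sum(p * (v - mean) ** 2 for v, p in freqs.items()))


list_a = list(range(1, 13))   # 1, 2, ..., 12
list_b = sorted(list_a * 2)   # each value appears twice

print(summarize(list_a))
print(summarize(list_b))
print(round(sd_from_relative_frequencies(list_a), 3))
print(round(sd_from_relative_frequencies(list_b), 3))
```

Apart from n, the two summaries are identical, because each value has the same relative frequency (1/12) in both lists.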
         
  4. Properties of center and spread
    A. Suppose we are given a list of numbers x1, x2, ... , xn

    B. Suppose we construct a new list by adding a constant "a" to each number in the old list.

    a. Picture: shifted histogram.
    b. The median and the mean both go up by a.
    c. The range and SD are unchanged!

    C. Suppose we construct a new list by multiplying each number xi by some constant "b".

    a. Picture: stretched histogram
    b. The median and the mean are both multiplied by b.
    c. The range and SD are also multiplied by b (for a negative b, by its absolute value, since neither can be negative).
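
These shift and scale properties are easy to confirm numerically. The sketch below reuses the law student's grades; the constants a = 10 and b = 2 and the helper names are arbitrary choices for illustration only.

```python
import math
import statistics


def sd(values):
    """Standard deviation with divisor n, as defined in this lecture."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def summary(values):
    """Return (median, mean, range, SD) for a list."""
    return (
        statistics.median(values),
        sum(values) / len(values),
        max(values) - min(values),
        sd(values),
    )


xs = [93, 90, 81, 80, 77]
a, b = 10, 2  # hypothetical constants, chosen only for this example

shifted = [x + a for x in xs]   # add a to every number
scaled = [b * x for x in xs]    # multiply every number by b

print("original:", summary(xs))
print("shifted: ", summary(shifted))  # median and mean up by a; range and SD unchanged
print("scaled:  ", summary(scaled))   # all four multiplied by b
```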

  5. Summary of definitions for "spread"
    A. Range: the largest value less the smallest value

    B. Standard Deviation (SD): s = sqrt((sum (xi - xbar)^2)/n)

    = sqrt((sum xi^2 - ((sum xi)^2/n))/n)

C. Problem Set 1 Due Oct 16, 1998 in Lecture

Chapter 2 Exercise Set A: 7, 14 (on page 22 and on page 24)
Chapter 2 Review Exercises: 2, 4, 9, and 11 (on pages 24-27)
Chapter 3 Review Exercises: 4a, 4c, 4d, 7, 8a, 8c (on pages 51-53)
Chapter 4 Exercise Set B: 1, 2 (p. 65)
Chapter 4 Review Exercises: 1, 4, 6a, 6b, 7 (pp. 74-75)



Last Update: 11 October 1998 by VXL