Statistics 50 Lecture 4

DESCRIBING DATA NUMERICALLY

A. Overview

The usual two numbers summarizing a distribution are the "center" [the "typical" value] and the "spread" [how close the data are to each other].

B. "Center"

The Mean

	a.  The usual measure of "center" is the MEAN, often called the 
	    AVERAGE.
	b.  The mean is denoted \bar{x} (read as "x-bar").

        c.  The mean is computed as follows:  given a list of n numbers 
	    x₁, x₂, ... , x_n, the mean is
 

		\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}


        d.  Example:  A first-year law student takes 5 courses.
            These are her grades at the end of the first year:

       
            93, 90, 81, 80, 77 
 
      
            The sum of these is 421 (sum x_i above).
            n = 5 (she took 5 courses).
            The mean (x-bar) is 421/5 = 84.2.
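
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the names grades and mean are chosen here for illustration and are not part of the lecture.

```python
# Grades from the law-student example.
grades = [93, 90, 81, 80, 77]


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


total = sum(grades)   # sum x_i
n = len(grades)       # number of courses
x_bar = mean(grades)  # x-bar

print(total, n, x_bar)
```

The three printed numbers correspond to the sum, n, and x-bar in the example.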

The Median
a. The median is the "middle point" of a list: half of the data are larger than (or equal to) the median, and half of the data are smaller than (or equal to) the median.
b. The median is computed as follows: given a list of n numbers x₁, x₂, ... , x_n, sort all the numbers and pick the middle number from the list. (If the list has an even number of elements, take the average of the two middle numbers.)
c. Example:
The sorted law school grades: 77, 80, 81, 90, 93
The median (M) of this list is 81
If she had taken SIX classes instead of FIVE:

77, 80, 81, 90, 93, 97
Take the average of the middle two numbers (81 and 90): (81 + 90)/2 = 85.5
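
The sort-and-pick-the-middle rule can be written out directly. The function below is a minimal sketch of that rule, not a library routine; the name median is chosen here for illustration.

```python
def median(values):
    """Median by the rule in the notes: sort, then take the middle value
    (or the average of the two middle values when n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle two


five_courses = [93, 90, 81, 80, 77]
six_courses = [93, 90, 81, 80, 77, 97]

print(median(five_courses))  # 81
print(median(six_courses))   # 85.5
```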
Remarks
a. The mean is the "balancing point" of a histogram; the median simply divides the data in half.
b. For a symmetric distribution, the mean equals the median.
c. The mean is sensitive to outliers and long tails! The median is not:

e.g., the list "77, 80, 81, 90, 93" has mean 84.2 and median 81;
if the list were changed to "17, 80, 81, 90, 93", the mean would be 72.2, but the median would still be 81.
HINT: TRY THIS AT HOME...
d. It is not necessary to know HOW MANY numbers are in a list, only the RELATIVE FREQUENCY of the values; e.g., the list "77,77,80,80,81,81,90,90,93,93" has mean 84.2, as does any list that has 20% x1's, 20% x2's, etc.
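
One way to "try this at home" is with Python's standard statistics module. The sketch below reuses the lists from remarks c and d; the variable names are chosen here for illustration.

```python
from statistics import mean, median

grades = [77, 80, 81, 90, 93]
with_outlier = [17, 80, 81, 90, 93]  # 77 replaced by an outlier, 17
doubled = sorted(grades * 2)         # every value appears twice

# Remark c: the outlier drags the mean down but leaves the median alone.
print(mean(grades), median(grades))
print(mean(with_outlier), median(with_outlier))

# Remark d: same relative frequencies, so the same mean and median.
print(mean(doubled), median(doubled))
```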

C. "Spread"

Minimum, Maximum, Range, Percentiles and the IQR
a. The lowest and highest values of a variable are the minimum and maximum. The range is maximum-minimum.
b. Definition: a number y is the pth PERCENTILE for the data if p% of the data are less than or equal to y. (We write p here because n already stands for the number of items in the list.)
c. The QUARTILES are the 25th, 50th, and 75th percentiles and are denoted Q1, Q2, and Q3. (Another name for Q2 is the median.)
d. The INTER-QUARTILE RANGE, or IQR, is defined as IQR = Q3 - Q1; the IQR measures the spread of the middle 50% of the data.
e. BOXPLOT: another way of graphically summarizing data. It uses the Min, Max, Median, Q1, and Q3. Know how to recognize one and how to construct one.
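
The five numbers behind a boxplot, plus the range and IQR, can be computed as in the sketch below. It assumes one particular convention for the quartiles: Q1 and Q3 are the medians of the lower and upper halves of the sorted list, leaving out the middle value when n is odd. Other conventions exist and can give slightly different quartiles. The function names are chosen here for illustration.

```python
def median(values):
    """Middle value of the sorted list (average of the middle two if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves convention.
    Assumes the list has at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value belongs to neither half (an assumed convention).
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return {
        "min": ordered[0],
        "Q1": median(lower),
        "median": median(ordered),
        "Q3": median(upper),
        "max": ordered[-1],
    }


data = list(range(1, 13))  # 1, 2, ..., 12
summary = five_number_summary(data)
data_range = summary["max"] - summary["min"]
iqr = summary["Q3"] - summary["Q1"]

print(summary)
print(data_range, iqr)
```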
THE MEDIAN, QUARTILES AND IQR ARE NICE ROBUST MEASURES (i.e., relatively resistant to extreme observations), GOOD FOR GETTING AN IDEA OF WHAT THE DISTRIBUTION LOOKS LIKE (e.g. center, spread),
BUT NOT TOO COMMONLY USED FOR ANALYSIS. Instead...

The Standard Deviation (SD)

        a.  The usual measure of spread is the STANDARD DEVIATION, written
            as SD or as a lowercase "s".

        b.  The SD is defined as follows:  given a list of n numbers  
            x₁, x₂, ... , x_n, 
		s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}}

		  = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}

            where \bar{x} is the average of the n numbers.


	    An equivalent, easier formula for the SD is

		s = \sqrt{\frac{\sum x_i^2 - (\sum x_i)^2 / n}{n-1}}


            Example:  the list  "0, 2, 3, 3, 4, 6"  has s = 2.0.


        c.  As with the mean, the SD depends mainly on the relative
	    frequency of the values in the list, not on HOW MANY items
	    are in it.  Because of the n-1 divisor this is only
	    approximately true:  the SD changes slightly with n, and the
	    effect becomes negligible for large n.


        d.  The SD measures how close the numbers in the list are to the
            average:  since not all numbers are equal to the mean, the SD
	    is a measure of the "average" distance between each point and
	    the mean.
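
The two SD formulas in item b can be compared numerically. The sketch below implements both with the n-1 divisor and applies them to the example list "0, 2, 3, 3, 4, 6"; the function names are chosen here for illustration.

```python
import math


def sd_definition(values):
    """SD from the definition: squared deviations from x-bar, divided by n-1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = sum((x - x_bar) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


def sd_shortcut(values):
    """SD from the equivalent formula using sum x_i and sum x_i^2."""
    n = len(values)
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return math.sqrt((total_of_squares - total ** 2 / n) / (n - 1))


data = [0, 2, 3, 3, 4, 6]
print(sd_definition(data), sd_shortcut(data))
```

Both calls give s = 2.0 for this list. The shortcut formula is convenient by hand, but on a computer it can lose precision when the values are large relative to their spread, so the definition is usually the safer one to code.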

Numerical Examples

    1.  Given the list 1,2,3,4,5,6,7,8,9,10,11,12:

        a.  n=12
        b.  median = 6.5; Q1 = 3.5, Q3 = 9.5
        c.  mean = 78/12 = 6.5

        d.  range = 12-1 = 11
        e.  IQR = Q3-Q1 = 9.5-3.5 = 6
        f.  sum (x-xbar)² = 143; sum x = 78, sum x² = 650;
            SD = sqrt(143/11) = sqrt((650-78²/12)/11) = 3.6056

    2.  Given the list 1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,
                       10,10,11,11,12,12

        a.  n=24
        b.  median = 6.5; Q1 = 3.5, Q3 = 9.5 
        c.  mean = 156/24 = 6.5

        d.  range = 12-1 = 11
        e.  IQR = Q3-Q1 = 9.5-3.5 = 6
        f.  sum (x-xbar)² = 286; sum x = 156, sum x² = 1300;
            SD = sqrt(286/23) = sqrt((1300-156²/24)/23) = 3.53

    NOTE: the standard deviation got slightly smaller (because of the
          n-1 divisor), but the mean, median, quartiles, IQR and range
          stayed the same.  What matters for those is the relative
          frequency of a value.
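
Both numerical examples can be reproduced with Python's standard statistics module. In this sketch, stdev divides by n-1 (the SD used in these notes) and pstdev divides by n; the second is included only for comparison and is not the formula from the lecture.

```python
from statistics import mean, median, stdev, pstdev

list1 = list(range(1, 13))  # 1, 2, ..., 12
list2 = sorted(list1 * 2)   # every value appears twice

for data in (list1, list2):
    print(
        len(data),
        mean(data),
        median(data),
        round(stdev(data), 4),   # divides by n-1
        round(pstdev(data), 4),  # divides by n
    )
```

The n-1 version differs between the two lists (about 3.6056 versus 3.5263), while the divide-by-n version is identical for both, since it depends only on the relative frequencies.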

Properties of center and spread
1. Suppose we are given a list of numbers x₁, x₂, ... , x_n
2. Suppose we construct a new list by adding a constant "a" to each number in the old list.
a. Picture: shifted histogram.
b. The median and the mean both go up by a.
c. The range, IQR, and SD are unchanged!

3. Suppose we construct a new list by multiplying each number
x_i by some constant "b".
a. Picture: stretched histogram
b. The median and the mean are both multiplied by b.
c. The range, IQR, and SD are also multiplied by b (by |b| if b is negative, since a spread is never negative)!
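
The shift and scale properties can be checked on the law-school grades. In the sketch below the constants a = 10 and b = -2 are hypothetical values picked for illustration; b is negative on purpose, to show that measures of spread scale by the absolute value of b.

```python
from statistics import mean, median, stdev


def summary(values):
    """Return (mean, median, range, SD) for a list of numbers."""
    return (mean(values), median(values),
            max(values) - min(values), stdev(values))


grades = [77, 80, 81, 90, 93]
a = 10   # hypothetical constant added to every value
b = -2   # hypothetical constant multiplying every value

shifted = [x + a for x in grades]
scaled = [b * x for x in grades]

print(summary(grades))
print(summary(shifted))  # center moves up by a, spread unchanged
print(summary(scaled))   # center times b, spread times |b|
```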
Summary of definitions for "center" and "spread"
1. Median: the median is the "middle number" of a list
e.g. 77, 80, 81, 90, 93 the median is 81
2. Mean: the mean is x-bar = (sigma x_i)/n <--- recall what this means
e.g. 77, 80, 81, 90, 93 the mean is 84.2
3. Outliers: suppose these are the grades instead
e.g. 17, 80, 81, 90, 93
the median is still 81 but the mean has changed to 72.2
(you could stemplot this to see what I mean by outlier)
4. Range: the largest value less the smallest value
5. Inter-Quartile Range (IQR): IQR = Q3-Q1, where Q3 is the third quartile (the 75th percentile) and Q1 is the first quartile (the 25th percentile). One way of finding Q1 and Q3 is to find the medians of the lower and upper halves of the sorted list.
6. Standard Deviation (SD): s = sqrt((sum[x_i - xbar]²)/(n-1))
= sqrt((sum x_i² - (sum x_i)²/n)/(n-1))


Last Update: 1 October 1997 by VXL