Statistics 50
Lecture 4


Do-Over from the Last Lecture: Measures of "Spread"

  1. Minimum, Maximum, Range, Percentiles and the IQR
    a. The lowest and highest values of a variable are the minimum and maximum. The range is maximum - minimum.

    b. Definition: a number y is the nth PERCENTILE for the data if n% of the data are less than or equal to y.

    c. The QUARTILES are the 25th, 50th, and 75th percentiles and are denoted Q1, Q2, and Q3. (Another name for Q2 is the median.)

    d. The INTER-QUARTILE RANGE, or IQR, is defined as IQR = Q3 - Q1; the IQR measures the spread of the middle 50% of the data. NOTE: Q3 - Q1 does not necessarily equal Q2.

    e. BOXPLOT: another way of graphically summarizing data. It uses the Min, Max, Median, Q1, and Q3. Know how to recognize one and how to construct one.
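
As a minimal sketch of definitions 1a through 1d, the Python below computes the minimum, maximum, range, quartiles, and IQR for a small list. The data and the helper name `percentile` are illustrative assumptions, not part of the course material. The helper applies definition 1b literally (the smallest data value with at least n% of the data at or below it); textbooks and software often interpolate between values instead, so their quartiles may differ slightly.

```python
import math

def percentile(data, n):
    """Smallest data value y such that at least n% of the data are <= y."""
    ordered = sorted(data)
    # Count of values that must be <= y; at least one value is always needed.
    k = max(1, math.ceil(n / 100 * len(ordered)))
    return ordered[k - 1]

# Hypothetical data, chosen only for illustration.
data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

minimum, maximum = min(data), max(data)
data_range = maximum - minimum
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
iqr = q3 - q1

print("min, max, range:", minimum, maximum, data_range)
print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
```

These five numbers (min, Q1, median, Q3, max) are exactly what a boxplot draws. In this list Q3 - Q1 is 7 while Q2 is 12, which illustrates the note in 1d.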

  2. The Standard Deviation (SD)
    a. The usual measure of spread is the STANDARD DEVIATION, written as SD or as a lowercase "s".

    b. s = sqrt( sum[(xi - xbar)^2] / (n-1) )    [**]

         = sqrt( (sum[xi^2] - (sum[xi])^2/n) / (n-1) )
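
The two forms of the SD formula can be checked against each other with a short Python sketch. The function names and the data are assumptions made for illustration; the first function follows the definition (squared deviations from the mean), the second follows the calculator form (only the sum of x and the sum of x squared).

```python
import math

def sd_definition(xs):
    """SD from the definition: squared deviations from the mean, over n-1."""
    n = len(xs)
    xbar = sum(xs) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))

def sd_shortcut(xs):
    """SD from the calculator form, using only sum(x) and sum(x^2)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

xs = [2, 3, 5, 8, 13]  # hypothetical data, for illustration only

print("definition form:", sd_definition(xs))
print("calculator form:", sd_shortcut(xs))
```

Both functions should print the same value up to rounding error, since the second form is an algebraic rearrangement of the first.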

BIVARIATE DATA AND REGRESSION

We move from talking about a single variable or single list to relationships between two variables or lists.

Here we ask questions like "How is SAT score related to GPA?" or "How is height related to weight?"

Please read the definitions of explanatory and response variables on page 94.

A. Basic Definitions

1. Scatterplot

A scatterplot is a two-dimensional plot of data; the horizontal dimension is called X, and the vertical dimension is called Y. Each point on a scatterplot shows two values, an X value and a Y value; each point represents a single case. Generate your own scatterplot.

See page 97 of your text for a nice definition of scatterplot.

Once you have a scatterplot, the first thing to look for is a pattern in the plot.

2. Positive and negative relationships

There is a POSITIVE relationship if above-average values of X are associated with above-average values of Y; conversely, there is a NEGATIVE relationship if above-average values of X are associated with below-average values of Y.

3. Warning! Scatter diagrams show only association, and association is not causation (example: wine drinking and age at death).

*** REMEMBER ASSOCIATION IS NOT CAUSATION ***

CORRELATION

A. The correlation coefficient

  1. Idea: the CORRELATION COEFFICIENT, denoted r, measures how close the data are to a straight line. The shape and form of a scatter plot can give you an idea of the strength and the direction of the relationship between two variables. Here, we have a number which can do the same thing.
  2. Formula
    r = (1/(n-1)) sum[ ((x - xbar)/SDx) ((y - ybar)/SDy) ]    [**]

      = ( sum[xy] - (1/n)(sum[x])(sum[y]) ) / ( (n-1) SDx SDy )    (know how to use calculator formulas)

  3. An Internet Correlation Calculator
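
The two forms of the formula in A.2 can be sketched in Python as follows. The function names and the data are illustrative assumptions. The first function averages the products of standardized X and Y values (dividing by n-1); the second uses the calculator form built from sum(xy), sum(x), and sum(y).

```python
import math

def sd(values):
    """Standard deviation with the n-1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def r_standardized(xs, ys):
    """r as the sum of products of standardized values, divided by n-1."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sdx, sdy = sd(xs), sd(ys)
    products = (((x - xbar) / sdx) * ((y - ybar) / sdy) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

def r_calculator(xs, ys):
    """r from the calculator form: sum(xy), sum(x), sum(y), and the two SDs."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - sum(xs) * sum(ys) / n) / ((n - 1) * sd(xs) * sd(ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print("standardized form:", r_standardized(xs, ys))
print("calculator form:  ", r_calculator(xs, ys))
```

For this list both forms give r of about 0.77, a fairly strong positive relationship.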

B. Properties of r

  1. Basic facts
    a. -1 <= r <= 1

    b. If r is close to 1 or -1, the data are close to a line

    c. If r is close to 0, the data are not close to a line

    d. Pictures! (see page 115)

  2. Interpreting correlation as a measure of linearity
    a. The correlation r measures how close the data are to a line

    b. r does NOT describe curved relationships, no matter how strong they are (i.e. even if the points are tightly spaced around a curve, r will not describe the relationship well)
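
A small sketch makes point 2b concrete. The data are an assumed example: seven points that lie exactly on the parabola y = x^2, a perfect but curved relationship. The `correlation` helper is an illustrative name; it uses the form Sxy / sqrt(Sxx * Syy), which is algebraically the same as the formula for r because the (n-1) factors cancel.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r, written as Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # every point lies exactly on the curve y = x^2

r_curve = correlation(xs, ys)
print("r for points on a parabola:", r_curve)
```

Here r comes out to 0 even though Y is completely determined by X, so a value of r near 0 says only that the points are not close to a line.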

  3. Advanced mathematical facts
    a. The correlation between X and Y is the same as the correlation between Y and X

    b. Invariant under addition: if some constant "a" is added to ALL of the X or Y values, the correlation is unchanged.

    c. Invariant under multiplication: if all of the X or the Y values are multiplied by some positive constant "b", the correlation is unchanged.

    d. Nonrobustness: the correlation can change very dramatically if only ONE of the data points is changed.
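
The facts in 3a through 3d can be tried out numerically. The sketch below uses assumed illustrative data and a hypothetical helper named `correlation`. The last transformation, multiplying by a negative constant, is not covered by 3c; it is included only to show why the constant must be positive (the size of r stays the same but its sign flips).

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r, written as Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)

r_swapped = correlation(ys, xs)                     # 3a: swap X and Y
r_shifted = correlation([x + 100 for x in xs], ys)  # 3b: add a constant to X
r_scaled = correlation(xs, [3 * y for y in ys])     # 3c: multiply Y by a positive constant
r_negated = correlation(xs, [-3 * y for y in ys])   # negative constant: sign flips

# 3d: change only the last point, from (5, 5) to (5, -10).
r_changed = correlation(xs, [2, 4, 5, 4, -10])

print("original:", r)
print("swapped, shifted, scaled:", r_swapped, r_shifted, r_scaled)
print("multiplied by -3:", r_negated)
print("one point changed:", r_changed)
```

With these numbers r is about 0.77 before the change and about -0.61 after it, so a single point is enough to reverse the direction of the relationship.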

C. Examples

  1. Given the five points {(2,7), (3,3), (5,1), (8,4), (13,2)}, find r.

    Answer: r = -0.47.

    If you change the five points to:
    {(2,7), (3,3), (5,2), (8,4), (13,1)}, r will now be -.67 .
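
The two answers in Example 1 can be reproduced with a few lines of Python. This is a sketch with an assumed helper name, `correlation`; the points are the ones given in the example.

```python
import math

def correlation(points):
    """Correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    syy = sum((y - ybar) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

original = [(2, 7), (3, 3), (5, 1), (8, 4), (13, 2)]
modified = [(2, 7), (3, 3), (5, 2), (8, 4), (13, 1)]

r_original = correlation(original)
r_modified = correlation(modified)

print("original points: r =", round(r_original, 2))
print("modified points: r =", round(r_modified, 2))
```

Only two Y values differ between the lists, each by 1, yet r moves from -0.47 to -0.67.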

  2. The dataset on the left has a correlation of r = 0.415. Find the correlation for the dataset on the right.
    	 X	 Y		 X	 Y
    	---	---		---	---
    	 1	 2		 4	-12
    	 1	 3		 6	-12
    	 2	 6		12	-11
    	 3	 5		10	-10
    	 5	 9		18	 -8
    	 7	 8		16	 -6
    	11	 8		16	 -2
    	13	 4		 8	  0
    	13	 7		14	  0
    
    Since the new list is just a transformation of the old list (i.e.,
    the "new" X = 2Y, and the "new" Y = X-13), the correlation is
    the same as in the previous list:  r=0.415.
    
    If you modify only one of the lists (either X or Y) by
    adding a constant or multiplying by a positive constant,
    the correlation does not change. (Multiplying one list by
    a negative constant reverses the sign of r.)
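
Example 2 can be checked directly. The sketch below (helper name `correlation` is an assumption) builds the right-hand dataset from the left-hand one using new X = 2Y and new Y = X - 13, confirms that this reproduces the table, and computes r for both.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r, written as Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Left-hand dataset from the table.
left_x = [1, 1, 2, 3, 5, 7, 11, 13, 13]
left_y = [2, 3, 6, 5, 9, 8, 8, 4, 7]

# Right-hand dataset, built from the stated transformation.
right_x = [2 * y for y in left_y]   # new X = 2Y
right_y = [x - 13 for x in left_x]  # new Y = X - 13

# Right-hand dataset as printed in the table, to confirm the transformation.
table_x = [4, 6, 12, 10, 18, 16, 16, 8, 14]
table_y = [-12, -12, -11, -10, -8, -6, -2, 0, 0]

r_left = correlation(left_x, left_y)
r_right = correlation(right_x, right_y)

print("transformation matches table:", right_x == table_x and right_y == table_y)
print("left r =", round(r_left, 3), " right r =", round(r_right, 3))
```

The swap of X and Y, the doubling, and the shift by 13 all leave r at 0.415, as the properties in section B predict.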
    
  3. For six-year-olds, the correlation between height and weight is r=0.6.

    Daily high temperatures in June 1985 in New York and Boston had r=0.7.

    Among men age 25 to 34, education and income had r = 0.34.

    Husbands' and wives' heights have r=0.25; their ages, r=0.95.

D. Recap of Correlation

  1. Idea: the correlation coefficient r is a measure of how close the data are to a straight line.
  2. Formula:
    r = (1/(n-1)) sum[ ((x - xbar)/SDx) ((y - ybar)/SDy) ]
      = ( sum[xy] - (1/n)(sum[x])(sum[y]) ) / ( (n-1) SDx SDy )
  3. Basic facts
    a. -1 <= r <= 1

    b. If r is close to 1 or -1, the data are close to a line

    c. If r is close to 0, the data are not close to a line

  4. Advanced mathematical facts
    a. The correlation between X and Y is the same as the correlation between Y and X

    b. Invariance: if some constant "a" is added to ALL of the X or Y values, or if ALL of the X or Y values are multiplied by some positive constant "b", the correlation is unchanged.

    c. Nonrobustness: the correlation can change very dramatically if only ONE of the data points is changed.



Last Update: 7 October 1996 by VXL