a. The lowest and highest values of a variable are the minimum and maximum. The range is maximum - minimum.
b. Definition: a number y is the nth PERCENTILE for the data if n% of the data are less than or equal to y.
c. The QUARTILES are the 25th, 50th, and 75th percentiles and are denoted Q1, Q2, and Q3. (Another name for Q2 is the median.)
d. The INTER-QUARTILE RANGE, or IQR, is defined as IQR = Q3 - Q1; the IQR measures the spread of the middle 50% of the data. NOTE Q3 - Q1 does not necessarily equal Q2.
e. BOXPLOT: another way of graphically summarizing data. It uses the Min, Max, Median, Q1, and Q3. Know how to recognize one and how to construct one.
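As a rough illustration of items a through e, the sketch below computes the five numbers a boxplot is built from. The data list is made up for the example. The quartiles come from Python's statistics.quantiles with method="inclusive", which is only one of several quartile conventions; your text or calculator may give slightly different values of Q1 and Q3 for the same data.

```python
import statistics

# Illustrative data only (nine values).
data = [1, 1, 2, 3, 5, 7, 11, 13, 13]

minimum = min(data)
maximum = max(data)
data_range = maximum - minimum

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
# The "inclusive" method is an assumption; other conventions exist.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Min, Q1, Median, Q3, Max:", minimum, q1, q2, q3, maximum)
print("Range:", data_range, " IQR:", iqr)
```

With this data and this convention, Q3 - Q1 = 9 while Q2 = 5, which matches the note that the IQR need not equal Q2.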
a. The usual measure of spread is the STANDARD DEVIATION, written as SD or as a lowercase "s".
b. s = sqrt( (sum (xi - xbar)^2) / (n-1) )
   [**]= sqrt( (sum xi^2 - (sum xi)^2/n) / (n-1) )
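The two forms of the SD formula can be checked against each other numerically. The sketch below uses made-up data and assumes nothing beyond the formulas above; the variable names are chosen just for this example.

```python
import math
import statistics

# Illustrative data only.
x = [2, 3, 6, 5, 9, 8, 8, 4, 7]
n = len(x)
xbar = sum(x) / n

# Definition: s = sqrt( sum (xi - xbar)^2 / (n-1) )
s_definition = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))

# Calculator form: s = sqrt( (sum xi^2 - (sum xi)^2 / n) / (n-1) )
s_calculator = math.sqrt((sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1))

print(round(s_definition, 3), round(s_calculator, 3))
```

Both forms should agree up to rounding, and both should match statistics.stdev(x), which also uses the n-1 divisor.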
Here we ask questions like "How is SAT score related to GPA?" or
"How is height related to weight?"
Please read the definition for explanatory and response variables
on page 94
1. Scatterplot
A scatterplot is a two-dimensional plot of data; the horizontal dimension is called X, and the vertical dimension is called Y. Each point on a scatterplot shows two values, an X value and a Y value; each point represents a single case. Generate your own scatterplot.
See page 97 of your text for a nice definition of scatterplot.
Once you have a scatterplot, the first thing to look for is a pattern in the plot.
2. Positive and negative relationships
There is a POSITIVE relationship if above-average values of X are associated with above-average values of Y; conversely, there is a NEGATIVE relationship if above-average values of X are associated with below-average values of Y.
3. Warning! Scatter diagrams only show association, and association is not causation (for example, wine drinking and age at death).
*** REMEMBER ASSOCIATION IS NOT CAUSATION ***
r = (1/(n-1)) sum{ ((x - xbar)/SDx) ((y - ybar)/SDy) }
  [**]= ((sum xy) - (1/n)(sum x)(sum y)) / ((n-1) SDx SDy)
(know how to use calculator formulas)
a. -1 <= r <= 1
b. If r is close to 1 or -1, the data are close to a line
c. If r is close to 0, the data are not close to a line
d. Pictures! (see page 115)
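To see the two formulas for r and items a through c in action, here is a minimal sketch. All data are made up, and the function names (sd, r_definition, r_calculator) are just labels for this example.

```python
import math

def sd(values):
    """Sample standard deviation with the n-1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def r_definition(x, y):
    """r = (1/(n-1)) * sum of products of standardized x and y values."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sdx, sdy = sd(x), sd(y)
    total = sum(((a - xbar) / sdx) * ((b - ybar) / sdy) for a, b in zip(x, y))
    return total / (n - 1)

def r_calculator(x, y):
    """Calculator form: (sum xy - (1/n)(sum x)(sum y)) / ((n-1) SDx SDy)."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    return (sum_xy - sum(x) * sum(y) / n) / ((n - 1) * sd(x) * sd(y))

# Illustrative data only.
x = [1, 2, 3, 4, 5]
y_up = [3, 5, 7, 9, 11]      # exactly on the line y = 2x + 1
y_down = [7, 4, 1, -2, -5]   # exactly on the line y = 10 - 3x
y_loose = [4, 1, 5, 2, 3]    # no clear line

for label, y in [("up", y_up), ("down", y_down), ("loose", y_loose)]:
    print(label, round(r_definition(x, y), 3), round(r_calculator(x, y), 3))
```

The points on the rising line give r = 1 and the points on the falling line give r = -1 (up to rounding), while the loose points give r close to 0. The two formulas agree in every case.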
a. The correlation r measures how close the data are to a line
b. r does NOT describe curved relationships, no matter how strong they are (i.e. even if the points are tightly spaced around a curve, r will not describe the relationship well)
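A small made-up example of item b: the points below lie exactly on the curve y = x^2, so Y is completely determined by X, yet r comes out as 0. The helper corr is a name chosen for this sketch; it divides the sum of cross-deviations by the square root of the product of the two sums of squared deviations, which is algebraically the same as the formula for r (the n-1 factors cancel).

```python
import math

def corr(x, y):
    """Correlation r; equivalent to the (1/(n-1)) sum-of-products formula."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data only: a perfect curved relationship.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]   # 4, 1, 0, 1, 4

r_curve = corr(x, y)
print(r_curve)
```

A scatterplot of these points shows a clear U shape, which is why you should always look at the plot and not just at r.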
a. The correlation between X and Y is the same as the correlation between Y and X
b. Invariant under addition: if some constant "a" is added to ALL of the X or Y values, the correlation is unchanged.
c. Invariant under multiplication: if all of the X or the Y values are multiplied by some positive constant "b", the correlation is unchanged.
d. Nonrobustness: the correlation can change very dramatically if only ONE of the data points is changed.
Answer: r = -0.47.
If you change the five points to:
{(2,7), (3,3), (5,2), (8,4), (13,1)}, r will now be -0.67.
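The value r = -0.67 for these five points can be checked directly, and the same sketch shows how sensitive r is to a single point. The original five points are not listed here, so the code starts from the modified set; the further change of (13,1) to (13,10) is a hypothetical one made up for this illustration.

```python
import math

def corr(x, y):
    """Correlation r; equivalent to the (1/(n-1)) sum-of-products formula."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# The five points given in the notes.
points = [(2, 7), (3, 3), (5, 2), (8, 4), (13, 1)]
x = [p[0] for p in points]
y = [p[1] for p in points]
r_given = corr(x, y)

# Hypothetical change: move only the last point from (13,1) to (13,10).
y_changed = y[:-1] + [10]
r_changed = corr(x, y_changed)

print(round(r_given, 2), round(r_changed, 2))
```

Moving that one point takes r from about -0.67 to about 0.58, so the sign of the correlation flips even though four of the five points are untouched.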
 Old list      New list
  X    Y        X    Y
 ---  ---      ---  ---
   1    2        4  -12
   1    3        6  -12
   2    6       12  -11
   3    5       10  -10
   5    9       18   -8
   7    8       16   -6
  11    8       16   -2
  13    4        8    0
  13    7       14    0

Since the new list is just a transformation of the old list (i.e., the "new" X = 2Y, and the "new" Y = X-13), the correlation is the same as in the previous list: r=0.415. Likewise, if you modify only one of the lists (either X or Y) by adding a constant or multiplying by a positive constant, the correlation will not change.
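The claim about the two lists can be verified with a short sketch. The data are the old list from the table; the new list is built from it using the stated transformation. The last line goes one step beyond the notes and multiplies Y by -1 to show why the multiplier needs to be positive: a negative multiplier reverses the sign of r.

```python
import math

def corr(x, y):
    """Correlation r; equivalent to the (1/(n-1)) sum-of-products formula."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Old list from the table.
old_x = [1, 1, 2, 3, 5, 7, 11, 13, 13]
old_y = [2, 3, 6, 5, 9, 8, 8, 4, 7]

# New list: new X = 2 * (old Y), new Y = (old X) - 13.
new_x = [2 * b for b in old_y]
new_y = [a - 13 for a in old_x]

r_old = corr(old_x, old_y)
r_new = corr(new_x, new_y)

# Extra check, not in the notes: a negative multiplier flips the sign.
r_negated = corr(old_x, [-b for b in old_y])

print(round(r_old, 3), round(r_new, 3), round(r_negated, 3))
```

The first two values agree (about 0.415), which combines symmetry with both kinds of invariance in one example.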
Daily high temperatures in June 1985 in New York and Boston had r=0.7.
Among men age 25 to 34, education and income had r = 0.34.
Husbands' and wives' heights have r=0.25; their ages have r=0.95.
r = [1/(n-1)] sum[ ((x - xbar)/SDx) ((y - ybar)/SDy) ]
= ((sum xy) - (1/n)(sum x)(sum y)) / ((n-1) SDx SDy)
a. -1 <= r <= 1
b. If r is close to 1 or -1, the data are close to a line
c. If r is close to 0, the data are not close to a line
a. The correlation between X and Y is the same as the correlation between Y and X
b. Invariance: if some constant "a" is added to ALL of the X or Y values, or if ALL of the X or Y values are multiplied by some positive constant "b", the correlation is unchanged.
c. Nonrobustness: the correlation can change very dramatically if only ONE of the data points is changed.
Last Update: 7 October 1996 by VXL