The usual two numbers summarizing a distribution are the "center" [the "typical" value] and the "spread" [how close the data are to each other].
a. The usual measure of "center" is the MEAN, often called the AVERAGE.
b. The mean is denoted as \bar{x} (read as "x-bar").
c. The mean is computed as follows: given a list of n numbers x1, x2, ... , xn, the mean is
$$ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n} $$
d. Example: A first-year law student takes 5 courses. These are her grades at the end of the first year:
93, 90, 81, 80, 77
The sum of these is 421 (sum xi above) and n = 5 (she took 5 courses), so the mean (x-bar) is 421/5 = 84.2
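The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the function name `mean` and the variable `grades` are illustrative choices, not part of the notes.

```python
def mean(values):
    """Return the arithmetic mean: the sum of the values divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty list is undefined")
    return sum(values) / n

grades = [93, 90, 81, 80, 77]
print(sum(grades))   # 421
print(mean(grades))  # 84.2
```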
a. The median is the "middle point" of a list: half of the data are larger than (or equal to) the median, and half of the data are smaller than (or equal to) the median.
b. The median is computed as follows: given a list of n numbers x1, x2, ... , xn, sort all the numbers and pick the middle number from the list. (If the list has an even number of elements, take the average of the two middle numbers.)
c. Example:
The sorted law school grades: 77, 80, 81, 90, 93
The median (M) of this list is 81
If she had taken SIX classes instead of FIVE:
77, 80, 81, 90, 93, 97
Take the average of the middle two numbers (81 and 90), that is, (81 + 90)/2 = 85.5
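The sort-and-pick-the-middle rule can be sketched in Python as follows. The function name `median` is an illustrative choice; the two calls reuse the five-course and six-course grade lists from the example.

```python
def median(values):
    """Sort the list and return the middle value.

    For an even number of elements, return the average of the two
    middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average of two

print(median([93, 90, 81, 80, 77]))      # 81
print(median([93, 90, 81, 80, 77, 97]))  # 85.5
```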
a. The mean is the "balancing point" of a histogram; the median simply divides the data in half.
b. For a symmetric distribution, the mean equals the median.
c. The mean is sensitive to outliers and long tails! The median is not:
e.g., the list "77, 80, 81, 90, 93" has mean 84.2 and median 81;
if the list were changed to "17, 80, 81, 90, 93", the mean would be 72.2, but the median would still be 81.
HINT: TRY THIS AT HOME...
d. It is not necessary to know HOW MANY numbers are in a list, only the RELATIVE FREQUENCY of the values; e.g., the list "77,77,80,80,81,81,90,90,93,93" has mean 84.2, as does any list that has 20% x1's, 20% x2's, etc.
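One way to "try this at home" is with Python's standard `statistics` module. The sketch below replaces 77 with 17 to show the effect of an outlier, then repeats every grade twice to show that only the relative frequencies matter. The variable names are illustrative.

```python
from statistics import mean, median

grades = [77, 80, 81, 90, 93]
with_outlier = [17, 80, 81, 90, 93]   # 77 replaced by an extreme low value
doubled = sorted(grades * 2)          # every value appears twice

# The outlier drags the mean down but leaves the median alone.
print(mean(grades), median(grades))              # mean 84.2, median 81
print(mean(with_outlier), median(with_outlier))  # mean 72.2, median 81

# Same relative frequencies, so the same mean and median.
print(mean(doubled), median(doubled))
```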
a. The lowest and highest values of a variable are the minimum and maximum. The range is maximum-minimum.
b. Definition: a number y is the nth PERCENTILE for the data if n% of the data are less than or equal to y.
c. The QUARTILES are the 25th, 50th, and 75th percentiles and are denoted Q1, Q2, and Q3. (Another name for Q2 is the median.)
d. The INTER-QUARTILE RANGE, or IQR, is defined as IQR = Q3 - Q1; the IQR measures the spread of the middle 50% of the data.
e. BOXPLOT: another way of graphically summarizing data. It uses the Min, Q1, Median, Q3, and Max. Know how to recognize one, and know how to construct one.
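The five numbers behind a boxplot can be computed with a short Python sketch. It assumes one particular convention for the quartiles: Q1 and Q3 are the medians of the lower and upper halves of the sorted list, with the middle value left out of both halves when n is odd. Other textbooks and software use slightly different rules, so results may differ a little. The function names are illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot draws."""
    q1, q2, q3 = quartiles(values)
    return min(values), q1, q2, q3, max(values)

data = list(range(1, 13))        # the list 1, 2, ..., 12
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)                # 3.5 6.5 9.5
print(q3 - q1)                   # IQR: 6.0
print(max(data) - min(data))     # range: 11
print(five_number_summary(data))
```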
THOSE ARE NICE ROBUST MEASURES (i.e., relatively resistant to extreme observations), GOOD FOR GETTING AN IDEA OF WHAT THE DISTRIBUTION LOOKS LIKE (e.g., center, spread),
BUT NOT TOO COMMONLY USED FOR ANALYSIS. Instead...
a. The usual measure of spread is the STANDARD DEVIATION, written as SD or as a lowercase "s".
b. The SD is defined as follows: given a list of n numbers x1, x2, ... , xn,
$$ s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} $$
where \bar{x} is the average of the n numbers. An equivalent, easier formula for the SD is
$$ s = \sqrt{\frac{\sum x_i^2 - (\sum x_i)^2/n}{n-1}} $$
Example: the list "0, 2, 3, 3, 4, 6" has s = 2.0.
c. As with the mean, the SD depends mainly on the relative frequency of the values in the list. Because of the n-1 divisor, however, it also depends slightly on HOW MANY items are in the list; this dependence fades as n grows.
d. The SD measures how close the numbers in the list are to the average: unless all the numbers equal the mean, they are scattered around it, and the SD is a measure of the "average" distance between each point and the average.
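Both SD formulas can be written out directly in Python and compared on the example list. This is a minimal sketch with illustrative function names. The shortcut formula is convenient by hand, but in floating-point arithmetic it can lose accuracy when the values are large compared with their spread, so the definition is usually the safer one to program.

```python
import math

def sd_definition(values):
    """SD from the definition: root of summed squared deviations over n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    xbar = sum(values) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))

def sd_shortcut(values):
    """SD from the shortcut: (sum of x^2 minus (sum of x)^2 / n) over n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

data = [0, 2, 3, 3, 4, 6]
print(sd_definition(data))  # 2.0
print(sd_shortcut(data))    # 2.0
```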
1. Given the list 1,2,3,4,5,6,7,8,9,10,11,12:
   a. n = 12
   b. median = 6.5; Q1 = 3.5, Q3 = 9.5
   c. mean = 78/12 = 6.5
   d. range = 12-1 = 11
   e. IQR = Q3-Q1 = 9.5-3.5 = 6
   f. sum (x-xbar)^2 = 143; sum x = 78, sum x^2 = 650; SD = sqrt(143/11) = sqrt((650 - 78^2/12)/11) = 3.6056
2. Given the list 1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12:
   a. n = 24
   b. median = 6.5; Q1 = 3.5, Q3 = 9.5
   c. mean = 156/24 = 6.5
   d. range = 12-1 = 11
   e. IQR = Q3-Q1 = 9.5-3.5 = 6
   f. sum (x-xbar)^2 = 286; sum x = 156, sum x^2 = 1300; SD = sqrt(286/23) = sqrt((1300 - 156^2/24)/23) = 3.53
NOTE: the standard deviation got slightly smaller (because of the n-1 divisor), but the mean, median, quartiles, IQR and range stayed the same. What matters for those is the relative frequency of each value.
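The two worked lists can be compared using the standard `statistics` module. The sketch below also prints `pstdev`, which divides by n instead of n-1. That version is not the SD defined in these notes; it is included only to show that the small change in the SD comes entirely from the n-1 divisor.

```python
from statistics import mean, median, stdev, pstdev

once = list(range(1, 13))   # 1, 2, ..., 12
twice = sorted(once * 2)    # every value appears twice

# Mean and median depend only on relative frequencies.
print(mean(once), mean(twice))      # both 6.5
print(median(once), median(twice))  # both 6.5

# The SD with the n-1 divisor shrinks slightly when the list is doubled.
print(round(stdev(once), 4))    # 3.6056
print(round(stdev(twice), 4))   # 3.5263

# Dividing by n instead gives the same value for both lists.
print(round(pstdev(once), 4), round(pstdev(twice), 4))  # 3.4521 3.4521
```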
1. Suppose we are given a list of numbers x1, x2, ... , xn.
2. Suppose we construct a new list by adding a constant "a" to each number in the old list.
a. Picture: shifted histogram.
b. The median and the mean both go up by a.
c. The range, IQR, and SD are unchanged!
3. Suppose we construct a new list by multiplying each number xi by some constant "b".
a. Picture: stretched histogram
b. The median and the mean are both multiplied by b.
c. The range, IQR, and SD are also multiplied by b (by |b| if b is negative, since a measure of spread cannot be negative)!
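The shifting and stretching rules can be checked numerically. In this sketch the constants a = 10 and b = 3 are arbitrary values chosen for illustration, `summarize` is a hypothetical helper, and the IQR is left out to keep the example short. A negative multiplier is included to show that the spreads scale by its absolute value.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Return (mean, median, range, SD) for a list of numbers."""
    return (mean(values), median(values),
            max(values) - min(values), stdev(values))

data = [0, 2, 3, 3, 4, 6]
a, b = 10, 3  # arbitrary constants, chosen only for illustration

original = summarize(data)
shifted = summarize([x + a for x in data])   # add a to every value
scaled = summarize([x * b for x in data])    # multiply every value by b
flipped = summarize([x * -b for x in data])  # multiply by a negative constant

print(original)  # mean, median, range, SD of the original list
print(shifted)   # mean and median go up by a; range and SD unchanged
print(scaled)    # all four are multiplied by b
print(flipped)   # mean and median times -b; range and SD times |b|
```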
1. Median: the median is the "middle number" of a list
e.g. 77, 80, 81, 90, 93 the median is 81
2. Mean: the mean is x-bar = (sigma xi)/n <--- recall what this means
e.g. 77, 80, 81, 90, 93 the mean is 84.2
3. Outliers: suppose these are the grades instead
e.g. 17, 80, 81, 90, 93
the median is still 81 but the mean has changed to 72.2
(you could stemplot this to see what I mean by outlier)
4. Range: the largest value minus the smallest value
5. Inter-Quartile Range (IQR): IQR = Q3-Q1, where Q3 is the third quartile (the 75th percentile) and Q1 is the first quartile (the 25th percentile). One way of finding Q1 and Q3 is to take the medians of the lower and upper halves of the sorted list.
6. Standard Deviation (SD): s = sqrt((sum [xi - xbar]^2)/(n-1)) [**]
= sqrt((sum xi^2 - (sum xi)^2/n)/(n-1))
Last Update: 1 October 1997 by VXL