How do you start to work with data? Some data sets are simply too large to make sense of by looking at the list of numbers they contain. For example, I work on problems in which the data consist of millions of numbers. Making sense of such large data sets requires condensing the information into a smaller collection of numbers or into pictures.
If data are like letters, then we're learning the mechanics of statistics. The next two chapters talk about summarizing or describing data.
a. VARIABLE: a characteristic of a person, animal, place, or thing that can be expressed as a number. For example: height, weight, race, gender.
b. VALUE: the actual number that describes the height, weight, race, or gender of a particular person.
c. TYPES OF VARIABLES
(i) Quantitative - values are numbers with a "natural" ordering, like height, weight, or grade level (1st grade, 2nd grade, 3rd grade).
(ii) Categorical - values are labels for groups, like gender, race, yes/no.
Definition: the DISTRIBUTION of a variable is its "pattern of variation"; i.e., how many large numbers, how many small numbers, etc.
A. A HISTOGRAM is a graph that shows percentages by area. The "class intervals" are the intervals on the horizontal axis between the vertical sides of the rectangles. The rectangles themselves are called "bins." The key to a histogram is that it is the area of the bin, not the height of the bin, that is important. The area of the bin is proportional to the relative frequency of observations in the bin: (#observations in the bin)/(total number of observations).
The horizontal axis needs a scale with units. The vertical axis has units of percent per unit of the horizontal axis, and a scale is automatically imposed by the fact that the area of the histogram must be 100% (all the data fall somewhere on the plot). We also need an "endpoint convention" to be able to draw a histogram: if an observation falls on the boundary between two class intervals, to which one should we associate it? The two standard choices are always to include the left boundary and exclude the right, except for the rightmost bin, or always to include the right boundary and exclude the left, except for the leftmost bin.
To plot a histogram, we first need to sort the data into increasing order, and pick the class intervals. We then count the number of data that fall in each class interval, and plot rectangles with the areas proportional to the relative frequencies with which the data fall in each class interval. There are no hard-and-fast rules for determining appropriate class intervals, and the impression one gets of how the data are distributed depends on the number and location of the intervals.
B. The total area under the histogram is 100%; histograms are usually used for large datasets.
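The effect of the endpoint convention can be seen in a small example. The sketch below is only an illustration: the function name bin_counts, the closed parameter, and the sample data are assumptions made here, not part of the course material. It counts observations per class interval under each of the two standard conventions.

```python
def bin_counts(data, edges, closed="left"):
    """Count observations in each class interval.

    closed="left":  include the left boundary, exclude the right,
                    except that the rightmost bin includes both.
    closed="right": include the right boundary, exclude the left,
                    except that the leftmost bin includes both.
    Assumes every observation lies between edges[0] and edges[-1].
    """
    last = len(edges) - 2          # index of the rightmost bin
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            lo, hi = edges[i], edges[i + 1]
            if closed == "left":
                inside = lo <= x < hi or (i == last and x == hi)
            else:
                inside = lo < x <= hi or (i == 0 and x == lo)
            if inside:
                counts[i] += 1
                break
    return counts


# Hypothetical data: two observations sit exactly on the boundary at 10.
data = [0, 5, 10, 10, 15, 20]
edges = [0, 10, 20]

left_counts = bin_counts(data, edges, closed="left")    # [2, 4]
right_counts = bin_counts(data, edges, closed="right")  # [4, 2]
print(left_counts, right_counts)
```

With the same data and the same class intervals, the two conventions give different counts, so the convention should be stated whenever observations can fall on a boundary.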
C. To construct a histogram: [cf. pp. 14-21; also problem 1.6(c)]
i. Divide the data into CLASS INTERVALS.
ii. Determine what percentage of the data falls within each class interval.
iii. Determine the appropriate height by dividing the percentage in the class interval by the width of the class interval.
iv. Draw the histogram.
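Steps i through iii can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the function name histogram_heights, the sample data, and the class intervals are made up for the example, and the left-boundary endpoint convention is assumed. The drawing in step iv is left out.

```python
def histogram_heights(data, edges):
    """Return the height of each bin in percent per horizontal unit.

    Assumed endpoint convention: each class interval includes its left
    boundary and excludes its right one, except the rightmost interval,
    which includes both. Assumes all data lie within the edges.
    """
    n = len(data)
    n_bins = len(edges) - 1

    # Steps i-ii: sort the data into class intervals and count.
    counts = [0] * n_bins
    for x in data:
        for i in range(n_bins):
            left, right = edges[i], edges[i + 1]
            is_last = i == n_bins - 1
            if left <= x < right or (is_last and x == right):
                counts[i] += 1
                break

    # Step iii: height = (percentage in the interval) / (interval width).
    heights = []
    for i, count in enumerate(counts):
        percent = 100.0 * count / n
        width = edges[i + 1] - edges[i]
        heights.append(percent / width)
    return heights


# Hypothetical data with class intervals of unequal width.
data = [1, 2, 2, 3, 5, 6, 8, 10]
edges = [0, 2, 4, 10]

heights = histogram_heights(data, edges)

# The areas (height times width) should add up to 100%.
total_area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
print(heights, total_area)
```

The third interval holds half of the data but is three times as wide as the others, so its bar is shorter than the second one. This is why the area, not the height, carries the percentage.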
1. Center: what is the "typical" score?
2. Symmetry or skewness: are the data evenly divided, or is there a tail?
3. Spread: are the data near to each other or far apart?
4. Exceptions ("outliers"): are there points that don't fit the general distribution?
5. Remark: a histogram may be bell shaped, but not necessarily! Histograms can take all kinds of shapes.