Data, in this class, is information about people, places, companies, or almost anything else. Some vocabulary:
VARIABLE: a characteristic of a person, animal, place, or thing that can be expressed as a number.
VALUE: A value is the actual number associated with the variable.
TYPES OF VARIABLES
(1) Quantitative - have a "natural" ordering. They may be discrete, like the spots on a single die, or continuous, like time or temperature.
(2) Qualitative - do not have a "natural" ordering, like major, occupation, or name brand.
The Distribution (pattern) of Data
Statisticians love examining distributions of variables. A graphical representation of a distribution can answer questions such as: how many values are large, how many are small, how many fall between two numbers, and what is the most common number?
2. Bar Charts, Pie Charts & Stemplots
Do not worry about them for this class. You should probably know what they are, but you won't be tested on them.
3. The Histogram -- a way to examine distributions
A HISTOGRAM is a graph that shows percentages by area. The rectangles are called "bins." The key to a histogram is that it is the AREA of the bin, not the height of the bin, that is important. The area of the bin is proportional to the relative frequency of observations in the bin:
(#observations in the bin)/(total number of observations).
The horizontal axis must have a scale with units. Your text uses frequency as the scale for the vertical axis and calls it a "frequency histogram". Technically, the vertical axis should have units of percentage per unit of the horizontal axis. Moore and McCabe try to state this in the second paragraph of page 16, but are not entirely successful. They make a distinction between their "frequency histogram" and "a histogram of relative frequencies".
For a proper histogram (a histogram of relative frequencies) the scale of the vertical axis is automatically imposed by the fact that the area of the histogram must be 100% (all the data fall somewhere on the plot).
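To see why area, not height, carries the information, here is a minimal sketch in Python (not from the text). The bin edges and counts are made up for illustration, and the function name density_heights is hypothetical. Each height is the bin's percentage divided by the bin's width, so the areas add up to 100% even when the bins have different widths.

```python
def density_heights(counts, edges):
    """Return bin heights in percent per unit of the horizontal axis."""
    total = sum(counts)
    heights = []
    for i, count in enumerate(counts):
        width = edges[i + 1] - edges[i]
        rel_freq_pct = 100.0 * count / total  # percentage of observations in the bin
        heights.append(rel_freq_pct / width)  # height = area / width
    return heights


# Hypothetical example: three bins of unequal width.
edges = [0, 10, 20, 50]
counts = [5, 10, 5]

heights = density_heights(counts, edges)
areas = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]

print(heights)     # the wide third bin is much shorter than the first
print(sum(areas))  # total area is 100 (percent), up to rounding
```

Note that the first and third bins hold the same number of observations, so they have the same area, but the third bin is three times as wide and therefore one third as tall.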
4. Building a Histogram
To plot a histogram, we first sort the data into increasing order and pick the class intervals. We then count the number of observations that fall in each class interval, and plot rectangles whose areas are proportional to the relative frequencies with which the data fall in each class interval.
There are no standard rules for determining appropriate class intervals, and the impression one gets of how the data are distributed depends on the number and location of the intervals.
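The steps above can be sketched in Python as follows. This is only an illustration: the data values and both sets of class intervals are made up, the function names are hypothetical, and the convention that each interval includes its left endpoint but not its right one is an assumption (texts differ on this).

```python
from bisect import bisect_right


def bin_counts(data, edges):
    """Count observations in each class interval [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in sorted(data):  # sorting mirrors the first step in the notes
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
    return counts


def relative_frequencies(counts):
    """Convert counts to fractions of the total number of observations."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical data set of ten observations.
data = [8, 2, 12, 1, 9, 5, 2, 7, 3, 8]

coarse = bin_counts(data, [0, 5, 10, 15])       # three wide intervals
fine = bin_counts(data, [0, 3, 6, 9, 12, 15])   # five narrow intervals

print(coarse, relative_frequencies(coarse))
print(fine, relative_frequencies(fine))
```

The same ten numbers look like one central hump with the coarse intervals and like two separate bumps with the fine ones, which is the sense in which the impression depends on the choice of intervals.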
Histograms are usually used for large datasets.
5. Things to be aware of with respect to histograms
Center: what is the "typical" value?
Symmetry or skewness: are the data evenly divided, or is there a tail? Are there bumps?
Spread: are the data near to each other or far apart?
Exceptions ("outliers"): are there points that don't fit the general distribution?
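These questions are meant to be answered by looking at the histogram, but a few numbers can back up what the eye sees. The sketch below is one possible way to do that in Python, not a method from the text. The data are made up, the function name summarize is hypothetical, and flagging points more than 1.5 interquartile ranges beyond the quartiles is a common convention that is assumed here.

```python
import statistics


def summarize(data):
    """Rough numeric answers to the center / spread / outlier questions."""
    values = sorted(data)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1                                  # spread of the middle half
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # assumed 1.5 * IQR fences
    return {
        "median": statistics.median(values),       # a "typical" value
        "mean": statistics.mean(values),
        "iqr": iqr,
        "outliers": [x for x in values if x < low or x > high],
    }


# Hypothetical data: mostly small values plus one point far to the right.
summary = summarize([2, 3, 3, 4, 4, 5, 5, 6, 30])
print(summary)
```

A mean well above the median, as in this example, is one hint of a tail to the right; the histogram itself is still the better guide to shape.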
Remark: the histogram can take all kinds of shapes.
The whole point of this exercise is really to help you learn how to convey information in a meaningful way using graphics.
According to A.C. Nielsen, this site rates as one of the most visited on the internet. They are also capable of producing some very misleading graphics. While this may seem harmless, consider two graphics related to the Challenger Disaster 15 years ago.
7. Time Plots (optional)
A time plot is simply a plot with some variable representing time (e.g. order of observation, years, minutes) on the horizontal axis. Points are generally connected in their time order (see Figure 1.7 on page 20). Economic data is frequently organized in time series. Time plots can help you identify patterns in the data:
A. Seasonal variation or seasonality
A pattern in a time plot that repeats itself regularly is evidence of seasonality. For example, consumer expenditures tend to be high in the month of December (holiday shopping) and drop in January (bad weather and post-holiday cash problems).
B. Trend
A persistent pattern, either a long-term rise or a long-term fall.
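Both patterns can be illustrated with a small Python sketch. Everything here is made up for the example: the series is a hypothetical two years of monthly spending, built with a steady rise, a December bump, and a January dip. Averaging by calendar month exposes the seasonality, and comparing the yearly averages exposes the trend.

```python
def make_series():
    """Hypothetical monthly spending: rising trend plus a December/January pattern."""
    series = []
    for t in range(24):              # 24 consecutive months
        month = t % 12 + 1           # calendar month, 1..12
        value = 100 + 2 * t          # long-term rise
        if month == 12:
            value += 50              # holiday shopping
        if month == 1:
            value -= 20              # post-holiday drop
        series.append((month, value))
    return series


series = make_series()

# Seasonality: average the values that share a calendar month.
by_month = {}
for month, value in series:
    by_month.setdefault(month, []).append(value)
month_means = {m: sum(v) / len(v) for m, v in by_month.items()}
peak_month = max(month_means, key=month_means.get)
low_month = min(month_means, key=month_means.get)

# Trend: compare the average of year 1 with the average of year 2.
year1_mean = sum(v for _, v in series[:12]) / 12
year2_mean = sum(v for _, v in series[12:]) / 12

print(peak_month, low_month)
print(year2_mean - year1_mean)
```

On real data the two patterns are rarely this clean, so a time plot remains the first thing to look at.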