1. About Data
Data, in this class, is information on people, places, companies, almost anything. Some vocabulary:
VARIABLE: a characteristic of a person, animal, place, or thing that can be expressed as a number.
VALUE: the actual number associated with the variable.
TYPES OF VARIABLES
(1) Quantitative - have a "natural" ordering. They may be discrete, like the spots on a single die, or continuous, like time or temperature.
(2) Qualitative - do not have a "natural" ordering, like major, occupation, or name brand.
The Distribution (pattern) of Data
Statisticians love examining distributions of variables. And a graphical representation of a distribution can answer questions like how many are large, how many are small, how many fall between two numbers, what is the most common number?
2. The Histogram -- a way to examine distributions
A histogram is a way of examining distributions. It can quickly summarize an enormous amount of information on a single variable and it makes use of your natural ability to recognize patterns. Examples are on pages 30 & 31.
A HISTOGRAM is a graph that shows percentages by area. The rectangles are called "bins." The key to a histogram is that it is the AREA of the bin, not the height of the bin, that is important. The area of the bin is proportional to the relative frequency of observations in the bin:
- A histogram represents data observations by AREA for different class intervals, not height. The area of each "block" is proportional to the number of observations in the class interval. Specifically, the area of a block is (# observations in the bin)/(total number of observations), so the total area MUST be 100%.
- It has class intervals on its horizontal axis. This axis must be scaled.
- It does not require a scale on its vertical axis. The vertical axis is scaled in percent per unit of the horizontal axis, and a scale is automatically imposed by the fact that the area of the histogram must be 100%.
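The relation between area and height can be sketched in a few lines of Python. This is only an illustration: the function name bin_heights and the class intervals and counts are made up, not taken from the course examples. Each block's height is its area (the percent of observations in the bin) divided by its width, which is why the vertical axis is in percent per unit of the horizontal axis.

```python
def bin_heights(edges, counts):
    """Return the height of each block, in percent per horizontal unit."""
    total = sum(counts)
    heights = []
    for left, right, count in zip(edges[:-1], edges[1:], counts):
        percent = 100.0 * count / total           # area of the block, in percent
        heights.append(percent / (right - left))  # height = area / width
    return heights


# Hypothetical class intervals (note the unequal widths) and counts.
edges = [0, 10, 20, 50]
counts = [20, 50, 30]

heights = bin_heights(edges, counts)
areas = [h * (r - l) for h, l, r in zip(heights, edges[:-1], edges[1:])]

print(heights)     # height of each block, in percent per unit
print(sum(areas))  # total area of the histogram, in percent
```

In this made-up example the third bin holds more observations than the first, yet its block is shorter because the bin is three times as wide. Reading the height alone would mislead; the area does not.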
We also need an "endpoint convention" to be able to draw a histogram: if an observation falls on the boundary between two class intervals, to which one should we assign it? The two standard choices are always to include the left boundary and exclude the right, except for the rightmost bin, which includes both boundaries, or always to include the right boundary and exclude the left, except for the leftmost bin, which includes both boundaries.
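The two conventions can be written out as a small Python function. This is a sketch under assumptions: the name find_bin, the closed parameter, and the class intervals are hypothetical choices made for illustration.

```python
def find_bin(x, edges, closed="left"):
    """Return the index of the class interval containing x, or None if x lies outside."""
    if closed not in ("left", "right"):
        raise ValueError("closed must be 'left' or 'right'")
    last = len(edges) - 2  # index of the rightmost bin
    for i in range(last + 1):
        lo, hi = edges[i], edges[i + 1]
        if closed == "left":
            # Include the left boundary, exclude the right,
            # except for the rightmost bin, which keeps its right boundary.
            if lo <= x < hi or (i == last and x == hi):
                return i
        else:
            # Include the right boundary, exclude the left,
            # except for the leftmost bin, which keeps its left boundary.
            if lo < x <= hi or (i == 0 and x == lo):
                return i
    return None


edges = [0, 10, 20, 30]  # hypothetical class intervals

# An observation of exactly 10 sits on the boundary between the first two bins.
print(find_bin(10, edges, closed="left"))   # goes to the second bin (index 1)
print(find_bin(10, edges, closed="right"))  # goes to the first bin (index 0)
```

Whichever convention is chosen, it should be applied to every boundary in the same histogram so that each observation is counted exactly once.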
4. Building a Histogram
To plot a histogram, we first need to sort the data into increasing order and pick the class intervals. We then count the number of observations that fall in each class interval, and plot rectangles with areas proportional to the relative frequencies with which the data fall in each class interval.
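The counting steps above can be sketched in Python as follows. The data and class intervals are made up, the function name relative_frequencies is hypothetical, and the sketch assumes the "include the left boundary" convention with the rightmost bin also including its right boundary. It stops at the relative frequencies and does not draw the rectangles.

```python
from bisect import bisect_left, bisect_right


def relative_frequencies(data, edges):
    """Sort the data, count per class interval, and return relative frequencies."""
    ordered = sorted(data)  # step 1: sort into increasing order
    n = len(ordered)
    freqs = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        start = bisect_left(ordered, lo)      # first position holding a value >= lo
        if i == len(edges) - 2:
            stop = bisect_right(ordered, hi)  # rightmost bin includes its right boundary
        else:
            stop = bisect_left(ordered, hi)   # other bins exclude their right boundary
        freqs.append((stop - start) / n)      # step 3: count, then divide by the total
    return freqs


data = [3, 7, 10, 12, 15, 20, 21, 28, 30, 5]  # made-up observations
edges = [0, 10, 20, 30]                       # step 2: chosen class intervals

freqs = relative_frequencies(data, edges)
for lo, hi, f in zip(edges[:-1], edges[1:], freqs):
    print(f"{lo:>2} to {hi:>2}: {100 * f:.0f}%")
```

Sorting first is what makes the counting cheap here: once the data are in order, the observations in a class interval form one contiguous run, so the count is just the distance between two positions.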
There are no standard rules for determining appropriate class intervals, and the impression one gets of how the data are distributed depends on the number and location of the intervals.
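To see how much the choice of class intervals matters, here is a small Python sketch that bins the same made-up data two ways and prints a crude text histogram for each. The data, the intervals, and the helper name counts_per_bin are all hypothetical. Within each binning the intervals have equal width, so the length of each row of # marks is proportional to the area of the block.

```python
def counts_per_bin(data, edges):
    """Count observations per class interval (left boundary included;
    the rightmost bin also includes its right boundary)."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                counts[i] += 1
                break
    return counts


data = [8, 9, 9, 10, 11, 11, 12, 28, 29, 30, 31, 32]  # made-up observations

# Same data, same bin width, different starting points.
for edges in ([0, 10, 20, 30, 40], [5, 15, 25, 35]):
    print("class intervals:", edges)
    for lo, hi, c in zip(edges[:-1], edges[1:], counts_per_bin(data, edges)):
        print(f"  {lo:>2} to {hi:>2}: " + "#" * c)
```

With the first set of intervals the data look spread fairly evenly across the range; with the second, the same twelve observations appear as two separate clusters with an empty interval between them.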
Histograms are usually used for large datasets.
5. Things to be aware of with respect to histograms
Center: what is the "typical" value?
Symmetry or skewness: are the data evenly divided, or is there a tail? Are there bumps?
Spread: are the data near to each other or far apart?
Exceptions ("outliers"): are there points that don't fit the general distribution?
Remark: the histogram can take all kinds of shapes.
The whole point of this exercise is really to help you learn how to convey information in a meaningful way using graphics.
6. Bad graphical examples
According to A.C. Nielsen, this site rates as one of the most visited on the internet. They are also capable of producing some very misleading graphics. While this may seem harmless, consider two graphics related to the Challenger Disaster 15 years ago.