© 2004, S. D. Cochran. All rights reserved.
SUMMARIZING DATA GRAPHICALLY
- Researchers begin by recording observations and then must organize them so that the observations can tell their story
- Researchers don't record all they see, but only those events in the domains that they are interested in--these are called variables
- Example: We are interested in whether people who are taller are also heavier than people who are shorter. There are two domains or variables of interest we record: height and weight.
- Each observation has two components
- Where or who it comes from--this is referred to as the case
- What is observed--this is called the value
The value is the observation we record for the case in a specific domain or variable
- Example: For the 1st person in our height-weight study, Ming, we record his height as 5'5" and his weight as 140 pounds. For the variable HEIGHT, this case (Ming) has a value of 65 (65 inches). Because HEIGHT can take on many values, that is, because its value can vary, HEIGHT is called a variable.
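One way to picture cases, variables, and values is as a small lookup table. The sketch below is only an illustration: the helper name feet_inches_to_inches and the dictionary layout are assumptions, and Ming's height and weight are the only data taken from the notes.

```python
def feet_inches_to_inches(feet, inches):
    """Convert a height recorded as feet and inches to total inches."""
    return feet * 12 + inches

# Each case (a person) maps to its value on each variable.
data = {
    "Ming": {
        "HEIGHT": feet_inches_to_inches(5, 5),  # 5'5" recorded in inches
        "WEIGHT": 140,                          # pounds
    },
}

# Look up the value of the variable HEIGHT for the case Ming.
print(data["Ming"]["HEIGHT"])  # prints 65
```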
- Statisticians make distinctions among types of variables, because different types of variables can be analyzed in different ways.
- The first distinction is whether the values a variable takes on are qualitative or quantitative
Qualitative variables are also called nominal (or name) variables because they are nominally scaled
- Example: A sample of students might look like this: Under the variable, NAME, are the values SHARIKA, VIVIAN, BRAD, JORGE.
Qualitative variables are also called categorical variables because at most we can sort them into categories
- Example: In our sample of students, for the variable, GENDER, are the values FEMALE, FEMALE, MALE, MALE
We could sort them into categories like this:
Category   Count
Females    2
Males      2
Qualitative variables are also called discrete variables because observations can only take on certain fixed values. In the NAME example above, you are either SHARIKA or BRAD; there is no in-between point.
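Sorting a categorical variable into categories amounts to counting how often each value occurs. As a minimal sketch, the GENDER values from the example can be tallied with Python's standard-library Counter; the variable names here are chosen only for illustration.

```python
from collections import Counter

# Values of the variable GENDER for the four students in the example.
gender = ["FEMALE", "FEMALE", "MALE", "MALE"]

# Tally how many cases fall into each category.
counts = Counter(gender)

for category, count in counts.items():
    print(category, count)
```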
The precision with which we can order values of variables differs
- Some variables have values that can only be put in order--these are ordinal variables (they have ordinal scaling)--we can rank them from first to last, smallest to biggest, closest to farthest but we can't say how far apart the values are
- Example: With our sample of 4 students we might ask them how much they like pizza. The variable, PIZZA, might have values HATE, OK, LOVE, LOVE. Notice these responses are categorical (like nominal variables) although we can put them in order and the responses are discrete (like nominal variables). So we might put them in a table like this:
Evaluation   Count   %
Hate pizza   1       25%
It's OK      1       25%
Love pizza   2       50%
Notice that we can't tell how far the value OK is from HATE. Is OK really closer to LOVE than HATE?
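The counts and percentages for the PIZZA example can be computed as in the sketch below. The list ORDER is an assumption used to encode the ranking of the categories; it records only which value comes before which, not how far apart they are.

```python
from collections import Counter

# The ordinal ranking of the categories, lowest to highest.
ORDER = ["HATE", "OK", "LOVE"]

# Values of the variable PIZZA for the four students.
pizza = ["HATE", "OK", "LOVE", "LOVE"]

counts = Counter(pizza)
n = len(pizza)

# One row per category, in rank order: (category, count, percent of cases).
table = [(level, counts[level], 100 * counts[level] / n) for level in ORDER]

for level, count, percent in table:
    print(f"{level:<6}{count:>3}{percent:>6.0f}%")
```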
- Some variables can be put in order and we can also say something about how far apart the values are from each other--these are called interval variables (they have interval scaling)
- Example: In our sample of 4 students, let's say we measure their intention to eat pizza in the following week on a 5-point scale below where they circle the number that best corresponds to their intentions:
1 (Definitely won't)   2   3   4   5 (Definitely will)
We find on this new variable, INTENT, the observations had the following values: 1, 4, 5, 5.
Notice that now we can say that the value 1 is farther from 5 than the value 4 is from 5; that is, we can say something about the interval distance, though not with much precision.
Interval-scaled variables are said to be continuous. In the example above, our students could conceivably have intentions that fell between any two of the numbers. Put another way, we could have used a 10-point scale or a 100-point scale. The point is that with a continuous variable you can never measure the value exactly--there is always some finer measurement you could have made. The values of the variable are said to have a continuous distribution.
- For a very few variables, we can put them in order, we can say how far apart the values are (how wide the interval is) and we can say where 0 is--these are called ratio variables (they have a ratio scale)
- Example: A ruler has a ratio scale. We can measure in a sample of lines the variable, LENGTH, and record as values the inches in length that we observe.
When a variable has ratio scaling we can say something meaningful about the ratio of any two values.
- Example: A line that is 2 inches long is twice as long as a line that is 1 inch long.
- When a variable has an underlying continuous distribution, we can step down in the hierarchy and treat the values we measure as discrete, but we can't go the other way
- Example: We can measure AGE, which has a continuous underlying distribution ranging from 0 on up, exactly or in whole years. Whole years, such as 18, 19, 20 years, are discrete values. If I know your birthday and today's date, I can code your age as, for example, 20.7 years or 20.712 years, carrying the precision out as many decimal places as I choose. I also know that you are 20. But if I know only that you are 20 years old, I can't specify any more precisely what your age is. You will be like any other 20 year old, even though your birthdays are not the same.
- Example: We cannot measure MONTH, which has an underlying discrete distribution including January, February, and so on, in a continuous way--it is either January or February; one stops and the next begins at a distinct, measurable point
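The AGE example can be sketched in code. The birthday and the current date below are hypothetical, and dividing by 365.25 days per year is a simplifying assumption. Note that the whole-year age can be computed from the dates, but the dates cannot be recovered from the whole-year age.

```python
from datetime import date

def age_decimal_years(birthday, today):
    """Approximate age in decimal years, assuming 365.25 days per year."""
    return (today - birthday).days / 365.25

def age_whole_years(birthday, today):
    """Age in completed years (a discrete value)."""
    years = today.year - birthday.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years

# Hypothetical dates, chosen only for illustration.
birthday = date(1984, 1, 15)
today = date(2004, 10, 1)

precise_age = age_decimal_years(birthday, today)
discrete_age = age_whole_years(birthday, today)

print(round(precise_age, 1), round(precise_age, 3), discrete_age)
```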
After we collect data, any variable can always be rescaled to a lower level of precision in the hierarchy of scaling. That is, ratio-scaled variables can be recoded as interval, ordinal, or nominal variables. But we cannot go in the other direction.
- Example: We can measure the number of pizza slices eaten by a random sample of students, a ratio-scaled variable, or we can use an interval scale (1 (didn't eat any), 2, 3, 4 (ate a medium amount), 5, 6 (ate the most I've ever eaten)), an ordinal scale (none, one, more than one), or a nominal scale (no pizza eaten, yes pizza eaten). If we record the exact number of slices, we can later recode it to any lower level of precision. If we record only one of the lower-level scales, we can never know exactly how many slices someone ate.
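Recoding from a higher to a lower level of scaling can be written as a simple function of the recorded value. In the sketch below, the slice counts are hypothetical and the function names are assumptions; the category labels follow the example above. No function can map the recoded labels back to the original counts.

```python
def to_ordinal(slices):
    """Recode a slice count to the ordinal scale: none, one, more than one."""
    if slices == 0:
        return "none"
    if slices == 1:
        return "one"
    return "more than one"

def to_nominal(slices):
    """Recode a slice count to the nominal scale: no pizza eaten / yes pizza eaten."""
    return "yes pizza eaten" if slices > 0 else "no pizza eaten"

# Hypothetical ratio-scaled data: slices eaten by four students.
slices_eaten = [0, 1, 3, 5]

ordinal = [to_ordinal(s) for s in slices_eaten]
nominal = [to_nominal(s) for s in slices_eaten]

print(ordinal)
print(nominal)
```

The students who ate 3 and 5 slices both become "more than one", so the difference between them is lost once the data are recoded.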
- Definition: A distribution is the pattern of variation of a variable, that is, the set of values that the variable takes on.
- Distributions are commonly organized by grouping values into classes.
- Each class has a boundary or limit on each side that separates it from other classes. These are called the left-sided and right-sided boundaries
- Each observation falls into a unique class. By convention, classes include observations that have the left-sided boundary value
- Example: If we divide AGE into two classes, 0-18 years and 18 and above, the '18' is the right-sided boundary for the first class interval and the left-sided boundary for the second class interval. By convention, 18 year olds are included in the second class, unless specifically stated otherwise.
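The left-sided boundary convention can be expressed as a small lookup. This is a sketch under the assumption that each class includes its left boundary and excludes its right boundary; the function name class_index is hypothetical, and bisect_right comes from Python's standard library.

```python
from bisect import bisect_right

def class_index(value, left_boundaries):
    """Return the index of the class containing value.

    left_boundaries must be sorted. Class i runs from left_boundaries[i]
    (included) up to the next boundary (excluded); the last class is open-ended.
    """
    return bisect_right(left_boundaries, value) - 1

# Two AGE classes: 0 up to (but not including) 18, and 18 and above.
left_boundaries = [0, 18]

print(class_index(17.9, left_boundaries))  # prints 0 (first class)
print(class_index(18, left_boundaries))    # prints 1 (second class)
```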
- A histogram is a graphical display of data translating frequency or counts of values into percentages of area
- The total area under the histogram is 100%. Remember the formula for the area of a rectangle is base × height and for a triangle is 1/2 × base × height
- Steps to making a histogram
- Group data into class intervals
- Determine what percentage of the responses fall into the interval
- Determine the height of the rectangle by dividing the percentage by the width of the class interval, so that the rectangle's area (base × height) equals the percentage
- Draw the histogram
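The steps above can be sketched numerically. The ages and class boundaries below are hypothetical, and the function name histogram_heights is an assumption. Each class includes its left boundary, and every value is assumed to fall inside the range covered by the boundaries, so the areas add up to 100%.

```python
def histogram_heights(values, boundaries):
    """Return the rectangle height (percent per unit) for each class interval.

    Class i runs from boundaries[i] (included) to boundaries[i + 1] (excluded).
    """
    n = len(values)
    heights = []
    for left, right in zip(boundaries, boundaries[1:]):
        count = sum(1 for v in values if left <= v < right)
        percent = 100 * count / n                 # percentage of cases in the class
        heights.append(percent / (right - left))  # height = percentage / width
    return heights

# Hypothetical data: ages of ten people, grouped into three class intervals.
ages = [5, 12, 17, 18, 20, 25, 29, 35, 40, 45]
boundaries = [0, 18, 30, 50]

heights = histogram_heights(ages, boundaries)

# Area of each rectangle = base x height; the areas should total 100%.
areas = [h * (right - left)
         for h, left, right in zip(heights, boundaries, boundaries[1:])]

print(heights)
print(sum(areas))
```

Because the classes have unequal widths, the tallest rectangle is not necessarily the class with the most cases; it is the class where the cases are most crowded per year of age.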