Graphical and Numerical Descriptions of Data

Objectives:

(1) summarizing datasets graphically and numerically

(2) exploring relationships between two variables

(3) reading a description of data

(4) learning some new STATA commands

STATA commands that you will find helpful today and in the future (STATA is case-sensitive, so type commands in lowercase):

graph varname1, box produces a boxplot

tabulate varname1 produces a one-way frequency table; tabulate varname1 varname2 produces a two-way table

graph varname1 varname2 produces a scatterplot

tabulate varname1, plot produces a bar chart of relative frequencies in a one-way table

graph varname1 varname2, symbol() changes the symbols used to plot the points in a scatterplot

graph varname1 varname2, pen() changes the color of graphs and data points

graph varname1, title() adds a title to a graph

summarize varname1, detail produces detailed summary statistics, including the mean, standard deviation, and percentiles

Activity: Ants

Go to http://www.stat.ucla.edu/projects/datasets and read the description of the “Ant Study,” a research project carried out by a UCLA faculty member. Review the codebook and identify each variable as quantitative or qualitative (categorical). In this lab, we will focus on the “Thatch Ant” data.

1. Which variables are quantitative?

The Thatch Ant data are available as a STATA dataset at http://www.stat.ucla.edu/projects/datasets/thatch-ant.dta

Let’s explore the distributions of “mass” and “headwidth” graphically and numerically. We’ll explore them separately first and then look at their relationship.

For each variable, do the following.

2. Create a boxplot, title it using the STATA title() option, and print it.

To title a plot, in this case a boxplot, type

graph varname1, box title(title)

3. Determine the median, the first quartile, and the third quartile, and write them on the boxplot. Identify the outliers in each boxplot, if there are any, and the colony each outlier belongs to. Is there anything systematic about which colonies the outliers come from?

To identify observations in a graph (in this case, the outliers), type

graph varname1, box symbol([colony])

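4. Use the 1.5 x IQR rule to determine whether there are outliers: a value is flagged as an outlier if it lies more than 1.5 x IQR below the first quartile or more than 1.5 x IQR above the third quartile. Do your calculations agree with the boxplot?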
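If you want to double-check your hand arithmetic for the 1.5 x IQR rule, the short Python sketch below may help. It is only an illustration outside STATA: the function names are made up for this example, and the quartiles and values are hypothetical numbers, not taken from the Thatch Ant data. Substitute the quartiles you read from your own output.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs for the k x IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside the k x IQR fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical quartiles and observations (NOT the ant data).
lower, upper = iqr_fences(q1=40.0, q3=60.0)
print(lower, upper)  # 10.0 90.0
print(flag_outliers([5.0, 35.0, 50.0, 95.0], q1=40.0, q3=60.0))  # [5.0, 95.0]
```

Any observation printed by flag_outliers should correspond to a point drawn separately from the whiskers on your boxplot, assuming the plot uses the same 1.5 x IQR convention.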

5. Now determine the sample average and standard deviation.

6. If the distribution is roughly bell-shaped, approximately 68% of the data should fall within one standard deviation of the mean, that is, between ______________ and ____________.

7. What is the actual percentage of data that falls between those two values? You probably want to use the tabulate command here.
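The comparison in questions 6 and 7 can be sketched in a few lines of Python, shown here only as a way to check the logic by hand. The function names and the eight data values are hypothetical (they are not the ant measurements), and the sketch assumes the sample standard deviation with the n-1 divisor.

```python
import statistics


def one_sd_interval(values):
    """Return (mean - SD, mean + SD) using the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample SD: divides by n - 1
    return m - s, m + s


def percent_within(values, low, high):
    """Return the percentage of values lying in [low, high]."""
    inside = sum(1 for v in values if low <= v <= high)
    return 100.0 * inside / len(values)


# Hypothetical data (NOT the ant data): mean is 5, sample SD is about 2.14.
data = [2, 4, 4, 4, 5, 5, 7, 9]
low, high = one_sd_interval(data)
print(round(low, 2), round(high, 2))    # 2.86 7.14
print(percent_within(data, low, high))  # 75.0
```

In this toy example 75% of the values fall within one standard deviation of the mean rather than 68%, which is a reminder that the 68% figure is only an approximation for bell-shaped data.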

8. If you square the standard deviation, what value do you get? What is the square of the standard deviation called?

9. For each distribution, what is the value that 90% of the data falls above?
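Once you have a candidate answer for question 9, one way to sanity-check it is to count how much of the data lies above it. The Python sketch below illustrates the idea with a made-up function name and hypothetical data (the integers 1 through 20, not the ant data).

```python
def percent_above(values, cutoff):
    """Return the percentage of values strictly greater than cutoff."""
    above = sum(1 for v in values if v > cutoff)
    return 100.0 * above / len(values)


# Hypothetical data (NOT the ant data): the integers 1 through 20.
data = list(range(1, 21))
print(percent_above(data, 2))  # 90.0
```

With real data the result will rarely be exactly 90%, because of ties and because software packages use slightly different conventions for computing percentiles.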

10. Now create a scatterplot of mass and headwidth, title it using the STATA title() option, and print it. Treat mass as the explanatory variable and headwidth as the response variable (although the choice is probably arbitrary in this case). Remember that the explanatory variable goes on the x-axis and the response variable goes on the y-axis. Describe their relationship. Does it look linear?

STAT 13

Instructor: Ivo Dinov, Asst. Prof.

Lab 4

Thursday, Oct. 16, 2001
