STAT 13

(Sec. 1a-1c)

Introduction to Statistical Methods for the Life and Health Sciences

Instructor: Ivo Dinov, Asst. Prof.

Departments of Statistics & Neurology
 http://www.stat.ucla.edu/~dinov/

Lab 4

Also see this for an additional tutorial.

Graphical and Numerical Descriptions of Data

Objectives:

(1)    summarizing datasets graphically and numerically

(2)    exploring relationships between two variables

(3)    reading a description of data

(4)    learning some new STATA commands

STATA commands that you will find helpful today and in the future:

graph varname1, box                  produces a boxplot
tabulate varname1                    produces a one-way frequency table (add a second variable for a two-way table)
graph varname1 varname2              produces a scatterplot
tabulate varname1, plot              produces a bar chart of relative frequencies in a one-way table
graph varname1 varname2, symbol()    sets the plotting symbol for the points in a scatterplot; symbol([varname]) labels each point with its value of varname
graph varname1 varname2, pen()       changes the color of graphs and data points
graph varname1, title()              adds a title to the graph
summarize varname1, detail           produces detailed summary statistics, including the mean, standard deviation, and percentiles

Activity: Ants

Go to http://www.stat.ucla.edu/projects/datasets and read the description of the “Ant Study,” a research project carried out by a UCLA faculty member.  Review the codebook and identify each variable as quantitative or qualitative (categorical).  In this lab, we will focus on the “Thatch Ant” data.

1.  Which variables are quantitative?

The Thatch Ant data file (Stata format) is at http://www.stat.ucla.edu/projects/datasets/thatch-ant.dta
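One way to get the data into Stata is sketched below. It assumes your copy of Stata can read files directly over the web; if it cannot, download the file first and use the local copy instead.

```stata
* Load the Thatch Ant data straight from the course site
* (assumes Stata has web access; otherwise point use at a downloaded copy)
use "http://www.stat.ucla.edu/projects/datasets/thatch-ant.dta", clear

* List variable names, storage types, and labels to compare with the codebook
describe
```

The describe output may help with question 1, since it shows how each variable is stored.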

Let’s explore the distributions of “mass” and “headwidth” graphically and numerically.  We’ll explore them separately first and then look at their relationship.

For each variable, do the following.

2.  Create a boxplot, title it using the Stata title() option, and print it.

To title a plot, in this case a boxplot, type

graph varname1, box title(your title here)

3.  Determine the median, the first quartile, and the third quartile, and write them on the boxplot.  Identify the outliers in each boxplot, if there are any, and the colony each outlier belongs to.  Is there anything systematic about which colonies the outliers come from?

To identify observations in a graph (in this case, the outliers), type

graph varname1, box symbol([colony])

4.  Use the 1.5 x IQR rule to determine whether there are outliers.  Do your calculations agree with the boxplot?
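The 1.5 x IQR check can also be done in Stata rather than by hand. The lines below are only a sketch: the scalar names (q1, q3, iqr, lofence, hifence) are made up for illustration and are assumed not to clash with any variable name in the dataset. Stata's quartiles may also differ slightly from ones computed by hand.

```stata
* Sketch for mass; repeat with headwidth in place of mass
quietly summarize mass, detail

* Store the quartiles, then build the fences from the IQR
scalar q1  = r(p25)
scalar q3  = r(p75)
scalar iqr = q3 - q1
scalar lofence = q1 - 1.5*iqr
scalar hifence = q3 + 1.5*iqr
display "lower fence = " lofence "   upper fence = " hifence

* List the observations outside the fences, with their colony
* (mass < . keeps missing values out of the list)
list colony mass if (mass < lofence | mass > hifence) & mass < .
```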

5.  Now determine the sample average and standard deviation.

6.  Approximately 68% of the data should fall between ______________  and ____________.

7.  What is the actual percentage of data that falls between those two values?  You probably want to use the tabulate command here.
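Questions 5 through 7 can be checked with a few lines of Stata. This sketch uses count instead of tabulate, which is one possible approach rather than the required one. The scalar names (nobs, lo68, hi68) are hypothetical and assumed not to clash with any variable name.

```stata
* Sketch for mass; repeat with headwidth in place of mass
quietly summarize mass
scalar nobs = r(N)
scalar lo68 = r(mean) - r(sd)
scalar hi68 = r(mean) + r(sd)
display "mean - sd = " lo68 "   mean + sd = " hi68

* Count the observations inside the interval and convert to a percentage
* (missing values fail the mass <= hi68 test, so they are not counted)
quietly count if mass >= lo68 & mass <= hi68
display "percent within one sd of the mean = " 100*r(N)/nobs
```

The 68% figure is a rule of thumb for roughly bell-shaped data, so the actual percentage need not match it exactly.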

8.  If you square the standard deviation, what value do you get?  What is the square of the standard deviation called?
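As a quick check on question 8, Stata can square the standard deviation for you. This is a small sketch using the results that summarize leaves behind.

```stata
* Sketch for mass; repeat with headwidth in place of mass
quietly summarize mass

* Square the standard deviation returned by summarize
display "sd squared = " r(sd)^2

* Compare with the value summarize stores separately
display "r(Var)     = " r(Var)
```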

9.  For each distribution, above what value does 90% of the data fall?
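The value with 90% of the data above it is the 10th percentile, which Stata reports in two ways. The sketch below shows both; the two commands may interpolate slightly differently, so small discrepancies are possible.

```stata
* Sketch for mass; repeat with headwidth in place of mass
quietly summarize mass, detail
display "10th percentile of mass = " r(p10)

* centile reports the same cutoff, along with a confidence interval
centile mass, centile(10)
```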

10.  Now create a scatterplot of mass and headwidth, title it using the Stata title() option, and print it.  Treat mass as the explanatory variable and headwidth as the response variable (the choice is probably arbitrary in this case).  Remember that the explanatory variable goes on the x-axis and the response variable on the y-axis.  Describe the relationship between the two variables.  Does it look linear?
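A sketch of the scatterplot command follows. It assumes the older graph syntax used throughout this handout, in which the last variable named is placed on the x-axis. The title text is only a placeholder.

```stata
* Older graph syntax: response (y) variable first, explanatory (x) variable last
graph headwidth mass, title(Headwidth versus mass)

* Stata 8 and later draw the same plot with the scatter command:
* scatter headwidth mass, title("Headwidth versus mass")
```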

You may want to try some additional examples and compare your work to the template solution (it is one possible solution, not the only one).
Ivo D. Dinov, Ph.D., Departments of Statistics and Neurology, UCLA School of Medicine