Introduction to Statistical Methods for the Life and Health Sciences
Objectives:
(1) summarizing datasets graphically and numerically
(2) exploring relationships between two variables
(3) reading a description of data
(4) learning some new Stata commands
Stata commands that you will find helpful today and in the future (type them in lowercase; Stata is case-sensitive):
graph varname1, box                   produces a boxplot
tabulate varname1                     produces a one-way frequency table; add a second variable name for a two-way table
graph varname1 varname2               produces a scatterplot
tabulate varname1, plot               produces a bar chart of relative frequencies in a one-way table
graph varname1 varname2, symbol()     changes the symbol (or label) used to plot the data points in a scatterplot
graph varname1 varname2, pen()        changes the color of graphs and data points
graph varname1, title()               adds a title to a graph
summarize varname1, detail            produces detailed summary statistics, including percentiles
Go to http://www.stat.ucla.edu/projects/datasets and read the description of the “Ant Study,” a research project carried out by a UCLA faculty member. Review the codebook and identify each variable as quantitative or qualitative (categorical). In this lab, we will focus on the “Thatch Ant” data.
1. Which variables are quantitative?
The Stata data file is at http://www.stat.ucla.edu/projects/datasets/thatch-ant.dta
Let’s explore the distributions of “mass” and “headwidth” graphically and numerically. We’ll explore them separately first and then look at their relationship.
For each variable, do the following.
2. Create a boxplot, title it using the Stata title() option, and print it.
To title a plot, in this case a boxplot, type
graph varname1, box title(title)
3. Determine the median, the first quartile, and the third quartile, and write them on the boxplot. Identify any outliers in each boxplot, and identify which colony each outlier belongs to. Is there anything systematic about which colony the outliers belong to?
To identify observations in a graph (in this case, the outliers) by colony, type
graph varname1, box symbol([colony])
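4. Use the 1.5 × IQR rule to determine whether there are outliers: an observation is flagged if it lies more than 1.5 × IQR below the first quartile or above the third quartile. Do your calculations agree with the boxplot?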
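If you want to check your hand calculation, the 1.5 × IQR rule can be sketched in a few lines of code. The example below is written in Python rather than Stata, and it uses a small made-up sample (not the thatch ant data), so the numbers are for illustration only. Software packages use slightly different conventions for computing quartiles, so the quartiles Stata reports may differ a little from the ones computed this way.

```python
import statistics

# Made-up sample for illustration only (NOT the thatch ant data).
values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 25]

# Quartiles. statistics.quantiles uses the "exclusive" method by default,
# which may differ slightly from the percentile convention Stata uses.
q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Fences: anything outside these limits is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

For this made-up sample the only flagged value is 25, which lies above the upper fence of 17.125.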
5. Now determine the sample average and standard deviation.
6. If the distribution is roughly bell-shaped, approximately 68% of the data should fall within one standard deviation of the mean, that is, between ______________ and ____________.
7. What is the actual percentage of data that falls between those two values? You probably want to use the tabulate command here.
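The comparison in questions 6 and 7 (the interval that should hold about 68% of the data versus the percentage it actually holds) can be sketched as follows. This is a Python illustration on a made-up sample, not the thatch ant data and not a Stata command; in the lab you would read the counts off the tabulate output instead.

```python
import statistics

# Made-up sample for illustration only (NOT the thatch ant data).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)

# One standard deviation on either side of the mean.
low, high = mean - sd, mean + sd

# Count how many observations actually fall inside that interval.
inside = [v for v in values if low <= v <= high]
pct_inside = 100 * len(inside) / len(values)

print(f"mean = {mean}, sd = {sd:.3f}")
print(f"interval: {low:.3f} to {high:.3f}")
print(f"{pct_inside:.1f}% of the values fall inside the interval")
```

Here 6 of the 8 made-up values fall inside the interval, or 75%, which shows that the actual percentage need not be exactly 68%, especially in a small or skewed sample.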
8. If you square the standard deviation, what value do you get? What is the square of the standard deviation called?
9. For each distribution, what is the value above which 90% of the data fall?
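The value above which 90% of the data fall is the 10th percentile. As a rough check on the idea, here is a Python sketch (not Stata) on a made-up sample, not the thatch ant data. Percentile conventions vary between packages, so a percentile reported by Stata may differ slightly from one computed this way.

```python
import statistics

# Made-up sample for illustration only (NOT the thatch ant data).
values = [70, 10, 40, 100, 20, 90, 30, 60, 50, 80]

# quantiles(..., n=10) returns the nine cut points between the deciles;
# the first cut point is the 10th percentile.
p10 = statistics.quantiles(values, n=10)[0]

# Check: what share of the observations lies above that value?
above = [v for v in values if v > p10]
pct_above = 100 * len(above) / len(values)

print("10th percentile:", p10)
print("percent of values above it:", pct_above)
```

In this sample the 10th percentile is 11.0, and 9 of the 10 values (90%) lie above it.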
10. Now create a scatterplot of mass and headwidth, title it using the Stata title() option, and print it. Assign mass as the explanatory variable and headwidth as the response variable (although the choice is probably arbitrary in this case). Remember that the explanatory variable goes on the x-axis and the response variable goes on the y-axis. Describe their relationship. Does it look linear or not?