STAT 13
Introduction to Statistical Methods for the Life and Health Sciences
Laboratory 5 - Example - College Professors Salaries
Overview: The data that we’re working with contains
information about the salaries and numbers of faculty members at colleges
and universities across the United States (including here at UCLA). To get
the data into Stata, type the command:
use http://www.stat.ucla.edu/labs/datasets/profsal.dta
With this lab many questions you might have had about the payscale of your
professors can be answered. Maybe you’ve wondered how much the salary for
a full professor influences the number of full professors at a school. Maybe
you’ve wondered if the number of associate professors at a school has any
relationship to the number of assistant professors. These are both questions
that can be answered with regression analysis. What do preliminary scatterplots
tell us about these questions? Maybe you’re interested in how much more full
professors make than associate professors. To do this we’d have to generate
a new variable:
generate saldif=avesalfull-avesalaso
then a boxplot would be nice to illustrate the range of this difference.
What does the histogram look like? To get a nice histogram, the graph command
with the xline and bin options can be used.
help graph
Using this help menu create a histogram with 20 bins and a reference line
at x=0 (zero). What is interesting about this histogram?
We can also look at how two states in different regions of the country compare.
As an example, we’ll look here at how Illinois and Georgia compare
in certain ways. To eliminate all observations that are from states other
than Illinois and Georgia, the “keep” command is used. For more information
on the format of the “keep” command, type
help keep
So, in this example we would type “keep if state=="IL" | state=="GA"” into our command line since these are the only states we are interested in. Note
that the “|” mark is a pipe, not a number one or the letter “l”. Once we
have only the data of interest in our dataset we can type
by state: summarize
to get a better idea of what the data looks like. Notice that the summary
reports that there are no observations present for the first three variables;
state, type, and schoolname. This is because these are character rather than
numerical variables and no numerical summaries of them are possible.
Don’t worry, the data is there. Now consider in what ways you’d like to compare
colleges in the two states. Maybe you’re interested in the size of schools
and see the number of faculty at an insitution as a good measurement of this.
To do side by side boxplots of the numfaculty variable as shown below for
this data, type in the command:
box numfaculty state
What do we see here about school sizes in Gerogia (the box on the left) compared to school size in Illinois?
Maybe you’re more interested in what percent of faculty are full professors.
To answer this question a new variable will need to be generated. Using the
“gen” command we’ve seen in earlier labs (or the equivalent “generate” command)
we can create a variable that shows the percent of faculty that
hold the rank of full professor at each school.
generate pctfull=numfull/numfaculty
And then side by side boxplots by state can be created similarly to how we created the ones for number of faculty.
What is interesting about these box plots? What else might be interesting
to see? Here are side by side histograms of the pctfull variable. Remember
that the “bin()” option allows you to generate histograms with more than
the default five groups.
Assignment:
This lab handout has mentioned only a few of the many questions that can
be asked about this data set and shown you only a few examples of applicable
commands you can use to answer those questions. Your job is to come up with
questions of your own, answer them with Stata, and type up a report describing
your findings. You should consider questions about the entire dataset, as
well as questions about how two or more states compare (choose states other
than Illinois and Georgia). Your TA will be available for questions about
this assignment today in lab, in lab next week, and office hours in between,
but will answer no questions after 4pm next Thursday. Remember that the “help”
and “search” commands can be very useful. Your report should be at least
three pages in length, with all graphs and/or charts you referrence attached
at the end. This lab project is due two weeks from today.
Last modified on
by
.
Ivo D. Dinov, Ph.D., Departments of Statistics and Neurology,
UCLA School of Medicine