Economics 40/Statistics M11 Lab 6:
Scatterplots & Regression
Due March 16, 2001
Purpose: In this lab you will use Stata to generate scatterplots, compute
correlations, run a simple regression model (chapter 2), run a multiple
regression model (chapter 11), and test a hypothesis (chapter 10).
Data: In Part I we will examine box office receipts from stage productions on Broadway in New York City:
use http://www.stat.ucla.edu/~vlew/stat11/labs/broadway
. describe

Contains data from http://www.stat.ucla.edu/~vlew/stat11/labs/broadway.dta
  obs:            18
 vars:             7                          5 Mar 2001 15:43
 size:           882 (99.5% of memory free)
------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------
show            str28  %28s                   Broadway Show Name
receipts        long   %12.0g                 Average Box Office Receipt
attendnc        int    %8.0g                  Average Attendance
capacity        int    %8.0g                  Total Capacity
type            byte   %8.0g                  Type of Production 1=other 2=musical
ratio           float  %9.0g                  Average Proportion Seats Filled
musical         float  %9.0g                  Musical? yes=1 no=0
------------------------------------------------------------------------
Sorted by:
I would like you to generate scatterplots and correlations for this dataset using the variable receipts as the "y-variable" and any of the other variables as your x-variable. Generate at least two scatterplots.
The scatterplot command (for example) is:
graph receipts attendnc
and if you want to reveal the identity of any unusual observations, issue this command:
graph receipts attendnc, symbol([show])
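The two graph commands above use the graphics syntax of the Stata release that was current when this lab was written. If you happen to be running Stata 8 or later (an assumption about your setup, not a requirement of the lab), the same two plots can be sketched roughly as follows:

```stata
* Assumes the broadway data are already loaded with the use command above.
* Scatterplot with receipts on the y-axis and attendnc on the x-axis
twoway scatter receipts attendnc

* Same plot, with each point labeled by the show name so that
* unusual observations can be identified
twoway scatter receipts attendnc, mlabel(show)
```

In both versions the y-variable is listed first and the x-variable second.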
The correlation command is:
correlate variable1 variable2 variable3 variable4
(Stata commands are case-sensitive, so type correlate in lowercase.) A correlation matrix will be generated. In general, you should only have continuous quantitative variables in your correlations. In other words, you can’t correlate a variable like “type”. There are four variables which I would classify as continuous quantitative variables. I leave it to you to decide which ones they are.
Finally, run a simple regression trying to “predict” box office receipts from one other variable in the dataset. Choose the one that yields the highest R-squared for your model. Both variables must be quantitative. Write out the regression equation in the form:
y = a + bx
and clearly identify the y variable, the x variable, the slope (b), and the intercept (a). Please interpret the value of the slope and the value of the intercept in plain English.
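As a rough sketch of how the pieces of y = a + bx can be read off in Stata, the commands below fit one possible model and then display its intercept and slope. The choice of attendnc as the x-variable simply mirrors the graph example earlier; it is not necessarily the variable with the highest R-squared. The attendance value of 8000 is a made-up number used only to show how a prediction is computed.

```stata
* Illustration only: attendnc is one candidate x-variable, not the answer
regress receipts attendnc

* After regress, _b[_cons] holds the intercept (a)
* and _b[attendnc] holds the slope (b)
display "intercept a = " _b[_cons]
display "slope b     = " _b[attendnc]

* Predicted receipts for a hypothetical show with average attendance 8000:
* y = a + b*8000
display _b[_cons] + _b[attendnc]*8000
```

The slope is the predicted change in receipts for a one-unit increase in x, and the intercept is the predicted value of receipts when x equals 0.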
ASSIGNMENT PART I: RECAP
1. Produce at least 2 scatterplots, both involving receipts as the "y-variable" and any other variable as the "x-variable".
2. Generate a correlation matrix (half-triangle actually) of all of the quantitative variables in the dataset. What is your interpretation of the correlations?
3. Run a simple regression which tries to “predict” box office receipts using one other variable. Write out the resulting regression equation and identify its components. Interpret the slope and intercept.
IN PART II WE WILL ANALYZE SOME MOVIE INDUSTRY DATA
To get the data, issue the command:
use http://www.stat.ucla.edu/~vlew/stat11/labs/movies
There should be 907 observations; each one is a different movie released between 1996 and 2000. Use the
"summarize" command to generate summary statistics for each of the
variables in the dataset (you don’t need to print that out for me). Unless you
are extremely clever, you can only use the variables which have valid means and
standard deviations as shown by the summarize command. The variables are, for
your information:
              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------
title           str40  %40s                   Title
totgross        long   %12.0g                 Total Gross Receipts
grossopn        long   %12.0g                 Gross on Opening weekend
openday         int    %8.0g                  Opening number of screens
widest          int    %8.0g                  Widest number of screens
firstwk         float  %9.0g                  1st Wk % of total
perthtr         long   %12.0g                 Per Theater receipts
dist            str18  %18s                   Distributor
opendate        float  %dn/d/y                Opening Date
cldate          float  %dn/d/y                Closing Date
totdays         float  %9.0g                  Total Days Opened
wednesdy        float  %9.0g                  Opened on a Wednesday 1=yes 0=no
winter          float  %9.0g                  Opened during the winter 1=yes 0=no
zper            float  %9.0g                  Standardized values of (perthtr)
zopenday        float  %9.0g                  Standardized values of (openday)
The last two variables, zper and zopenday, are z-scores for the variables perthtr and
openday, so they give you some idea of how unusual or how “average” a
particular movie was compared with the mean for all movies in the dataset. These “standardized” variables are useful
because their interpretation is easy.
Remember, z-scores, or standard units, are just the number of standard
deviations a given value is away from its mean.
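To make the definition concrete, here is a minimal sketch that rebuilds a z-score by hand and compares it with the supplied variable. The name zcheck is made up for this illustration, and the comparison assumes zper was computed from the same mean and standard deviation that summarize reports, which the dataset description does not state.

```stata
* Assumes the movies data are already loaded.
* summarize leaves the mean in r(mean) and the variance in r(Var)
summarize perthtr

* z-score = (value - mean) / standard deviation
* zcheck is a hypothetical variable name
generate zcheck = (perthtr - r(mean)) / sqrt(r(Var))

* Compare the hand-built z-score with the supplied one
summarize zper zcheck
list title perthtr zper zcheck in 1/5
```

A standardized variable should have a mean very close to 0 and a standard deviation very close to 1, which the second summarize lets you check.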
If you tabulate or list any variable, you will have a fairly good idea of its interpretation. Ask for assistance if you cannot figure it out from the description given above and from a listing of the variable itself.
ASSIGNMENT PART II
The goal of the assignment is to model the variable totgross (in other words, treat this variable
as the y-variable or the dependent variable) using some of the other variables
as predictors. There is no one right way of doing this; what is important is to
treat this as a learning "adventure" and to put scatterplots,
correlations, and regression to work for you.
1. Generate some scatterplots. You can do this with the graph command. For example,
graph totgross grossopn
will generate a scatterplot with the opening weekend gross on the x-axis and the total gross
receipts (after the movie ended its run in the theaters) on the y-axis. A scatterplot can
give you a sense of the relationship between two variables.
Please note that you can only generate scatterplots for numeric variables in Stata. You cannot
create a scatterplot of total gross receipts by distributor.
Include at least 2 scatterplots in your lab assignment. The choice of variables is yours, but totgross
must be the y-variable in both. For each one, please tell me what the
scatterplot tells you about the relationship between x and y.
2. Generate some correlations. Ask
yourself, what is the interpretation of the correlation coefficient and how is
it related to a regression model? Again, totgross must be included in the
correlations.
Please include a
correlation matrix of variables that you used in the regression models (see #3
below) with your lab. Which variable
has the strongest correlation with totgross?
What is the direction of the correlation for the strongest one? Make sure you use that one in your
regression model as an x-variable.
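A minimal sketch of this step follows. The variable list in the first command is only an illustration; choose your own predictors. The remaining commands show one way to see the link between a correlation and a simple regression: with a single x-variable, the square of the correlation coefficient equals the R-squared of the regression.

```stata
* Correlation matrix; the variables after totgross are an arbitrary example
correlate totgross grossopn openday widest totdays

* With two variables, correlate leaves the coefficient r in r(rho)
correlate totgross grossopn
display r(rho)^2

* The simple regression of totgross on grossopn stores R-squared in e(r2);
* it should equal the squared correlation displayed above
regress totgross grossopn
display e(r2)
```

The sign of the correlation also matches the sign of the slope in the simple regression, so a positive correlation goes with an upward-sloping fitted line.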
3. Construct regression models that predict totgross from (A) one other variable
and then (B) at least two of the other variables in the dataset. For example:
regress totgross grossopn
And then issue the command:
regress totgross grossopn wednesdy
The results will give you regression coefficients, test results, and other statistics. Please
include a copy of your regression output for each model and then write an
interpretation of the resulting coefficients and the result of the t-test. What
do the other variables tell you about predicting the total gross receipts
of a movie? In other words, what is the relationship (using the example above)
between the gross receipts on the opening weekend and the total gross receipts?
How did this relationship change (or did it change at all) when the
variable wednesdy (opened on a Wednesday, generally the signal of a holiday
weekend) was added?
Come up with your own model for total gross receipts and your own interpretation. Remember, there
is no "correct" answer; this is just a learning exercise. Who knows,
maybe you will become an entertainment industry analyst…
Extras/Advanced
The variables wednesdy and winter are unusual in that they only have two values: 0=no and 1=yes. They are known as “dummy” or “indicator” variables, and while they really are not continuous quantitative variables, they behave a bit like them. Here is a model:
. regress totgross grossopn wednesdy

      Source |       SS       df       MS              Number of obs =     907
-------------+------------------------------           F(  2,   904) = 1163.05
       Model |  1.5047e+18     2  7.5234e+17           Prob > F      =  0.0000
    Residual |  5.8477e+17   904  6.4687e+14           R-squared     =  0.7201
-------------+------------------------------           Adj R-squared =  0.7195
       Total |  2.0895e+18   906  2.3062e+15           Root MSE      = 2.5e+07

------------------------------------------------------------------------------
    totgross |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    grossopn |   4.156319   .0868341    47.87   0.000     3.985899    4.326739
    wednesdy |    6026951    2545001     2.37   0.018      1032152    1.10e+07
       _cons |  -414296.5    1128741    -0.37   0.714     -2629553     1800960
------------------------------------------------------------------------------
The interpretation of the coefficient for the dummy variable wednesdy is as follows: moving from 0 to 1 (that is, comparing a movie released on a Wednesday with one released on any other day), the predicted total gross receipts are higher by $6,026,951 on average, holding the opening weekend gross constant. This difference is statistically significant (t = 2.37, p = 0.018).
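The arithmetic behind that interpretation can be checked with Stata's display command, using only the coefficients printed in the output. The $10,000,000 opening weekend is a made-up value chosen for illustration; any other value would give the same gap between the two predictions.

```stata
* Coefficients copied from the regression output:
*   _cons = -414296.5, grossopn = 4.156319, wednesdy = 6026951

* Predicted total gross for a hypothetical film with a $10,000,000
* opening weekend that did NOT open on a Wednesday (wednesdy = 0)
display -414296.5 + 4.156319*10000000 + 6026951*0

* Same film, but opening on a Wednesday (wednesdy = 1)
display -414296.5 + 4.156319*10000000 + 6026951*1

* The t statistic is the coefficient divided by its standard error
display 6026951/2545001
```

The two predictions come to roughly $41.1 million and $47.2 million, and they differ by exactly the wednesdy coefficient. The last line gives about 2.37, matching the t column for wednesdy.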
This is advanced material and I don’t want to see this regression in your lab write up. You will see it in advanced statistics courses in the economics department.