Statistics 110A Lab 3: Scatterplots & Regression

Economics 40/Statistics M11 Lab 6: Scatterplots & Regression

Due March 16, 2001

Purpose: The purpose of this lab is to use Stata to learn how to generate a scatter plot, perform some correlations, run a simple regression model (chapter 2), run a multiple regression model (chapter 11), and test a hypothesis (chapter 10).

Data: In Part I we will examine box office receipts from stage productions on Broadway in New York City:

use http://www.stat.ucla.edu/~vlew/stat11/labs/broadway

. describe

Contains data from http://www.stat.ucla.edu/~vlew/stat11/labs/broadway.dta
  obs:            18
 vars:             7                          5 Mar 2001 15:43
 size:           882 (99.5% of memory free)
------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------
show            str28  %28s                   Broadway Show Name
receipts        long   %12.0g                 Average Box Office Receipt
attendnc        int    %8.0g                  Average Attendance
capacity        int    %8.0g                  Total Capacity
type            byte   %8.0g                  Type of Production 1=other
                                                2=musical
ratio           float  %9.0g                  Average Proportion Seats
                                                Filled
musical         float  %9.0g                  Musical? yes=1 no=0
------------------------------------------------------------------------
Sorted by:

I would like you to generate scatterplots and correlations for this dataset using the variable receipts as the "y-variable" and any of the other variables as your x-variable. Generate at least two scatterplots.

The scatterplot command (for example) is:

graph receipts attendnc

and if you want to reveal the identity of any unusual observations, issue this command:

graph receipts attendnc, symbol([show])
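The two graph commands above use the syntax of the Stata release this lab was written for. As a rough sketch, assuming you are working in Stata 8 or later (an assumption, since the lab does not say which version you have), the same two plots could be requested like this:

```stata
* Assumes Stata 8 or later; the lab's own commands use the older graph syntax.
use http://www.stat.ucla.edu/~vlew/stat11/labs/broadway, clear

* Scatterplot with receipts on the y-axis and attendnc on the x-axis
twoway scatter receipts attendnc

* Same plot, with each point labelled by the name of the show
twoway scatter receipts attendnc, mlabel(show)
```

The variable listed first goes on the y-axis and the variable listed last goes on the x-axis, just as in the original commands.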

The correlation command is:

correlate variable1 variable2 variable3 variable4

A correlation matrix will be generated. In general, you should only have continuous quantitative variables in your correlations. In other words, you can’t correlate a variable like “type”. There are four variables which I would classify as continuous quantitative variables. I leave it to you to decide which ones they are.
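For instance, a minimal sketch using only the two variables already plotted above (deciding which other variables belong in the matrix is still up to you):

```stata
use http://www.stat.ucla.edu/~vlew/stat11/labs/broadway, clear

* Correlation matrix (lower triangle) for two variables
correlate receipts attendnc

* correlate leaves the correlation of the first two variables in r(rho)
display "r = " r(rho)
```

Adding more variable names to the correlate command adds more rows and columns to the matrix.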

Finally, run a simple regression trying to “predict” box office receipts from one other variable in the dataset. Choose the one that yields the highest r-square for your model. Both variables must be quantitative. Write out the regression equation in the form:

y = a + bx

and clearly identify the y variable, the x variable, the slope and the intercept. Please interpret the value of the slope and the value of the intercept in plain English.
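As a sketch of where a and b come from, the commands below fit one simple regression and then display the intercept and slope that Stata stored. The choice of attendnc as x is only for illustration and may not be the variable with the highest r-square, and yhat is a made-up name for the new variable of fitted values.

```stata
use http://www.stat.ucla.edu/~vlew/stat11/labs/broadway, clear

* attendnc is only an illustrative x-variable; compare r-square across your candidates
regress receipts attendnc

* a (intercept) and b (slope) as stored by regress
display "intercept a = " _b[_cons]
display "slope b     = " _b[attendnc]
display "r-square    = " e(r2)

* yhat (hypothetical name) holds the fitted values a + b*x for every show
predict yhat
list show receipts yhat
```

The slope is the predicted change in y for a one-unit increase in x, and the intercept is the predicted value of y when x is 0.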

ASSIGNMENT PART I: RECAP

1. Produce at least 2 scatterplots, both involving receipts as the "y-variable" and any other variable as the "x-variable"

2. Generate a correlation matrix (half-triangle actually) of all of the quantitative variables in the dataset. What is your interpretation of the correlations?

3. Run a simple regression which tries to “predict” box office receipts using one other variable. Write out the resulting regression equation and identify its components. Interpret the slope and intercept.

IN PART II WE WILL ANALYZE SOME MOVIE INDUSTRY DATA

To get the data, issue the command:

use http://www.stat.ucla.edu/~vlew/stat11/labs/movies

There should be 907 observations; each one is a different movie released between 1996 and 2000. Use the "summarize" command to generate summary statistics for each of the variables in the dataset (you don’t need to print that out for me). Unless you are extremely clever, you can only use the variables which have valid means and standard deviations as shown by the summarize command. The variables are, for your information:

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------
title           str40  %40s                   Title
totgross        long   %12.0g                 Total Gross Receipts
grossopn        long   %12.0g                 Gross on Opening weekend
openday         int    %8.0g                  Opening number of screens
widest          int    %8.0g                  Widest number of screens
firstwk         float  %9.0g                  1st Wk % of total
perthtr         long   %12.0g                 Per Theater receipts
dist            str18  %18s                   Distributor
opendate        float  %dn/d/y                Opening Date
cldate          float  %dn/d/y                Closing Date
totdays         float  %9.0g                  Total Days Opened
wednesdy        float  %9.0g                  Opened on a Wednesday
                                                1=yes 0=no
winter          float  %9.0g                  Opened during the winter
                                                1=yes 0=no
zper            float  %9.0g                  Standardized values of (perthtr)
zopenday        float  %9.0g                  Standardized values of (openday)

The last two variables zper and zopenday are z-scores for the variables perthtr and openday, so they give you some idea of how unusual or how “average” a particular movie was compared with the mean for all other movies. These “standardized” variables are useful because their interpretation is easy. Remember, Z-scores, or standard units, are just the number of standard deviations a given value is away from its mean.
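To see the definition at work, here is a minimal sketch that rebuilds a z-score for perthtr by hand. The name zcheck is made up for this illustration, and the comparison assumes zper was computed from the same 907 movies.

```stata
use http://www.stat.ucla.edu/~vlew/stat11/labs/movies, clear

* summarize leaves the mean and standard deviation in r(mean) and r(sd)
summarize perthtr

* zcheck (hypothetical name): (value - mean) / standard deviation
generate zcheck = (perthtr - r(mean)) / r(sd)

* Both should have a mean near 0 and a standard deviation near 1
summarize zper zcheck
```

A movie with a z-score of 2, for example, took in per-theater receipts two standard deviations above the mean for all movies in the dataset.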

If you tabulate or list any variable, you will have a fairly good idea of their interpretation. Ask for assistance if you cannot figure it out from the description given above and from a listing of the variable itself.
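A short sketch of the kind of exploring meant here; the particular variables and the first ten rows are arbitrary choices for illustration:

```stata
use http://www.stat.ucla.edu/~vlew/stat11/labs/movies, clear

* Summary statistics; the string variables (title, dist) show no mean or standard deviation
summarize

* Frequency table for a 0/1 variable
tabulate wednesdy

* Look at a few raw values
list title totgross grossopn in 1/10
```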

ASSIGNMENT PART II

The goal of the assignment is to model the variable totgross (in other words, treat this variable as the y-variable or the dependent variable) using some of the other variables as predictors. There is no one right way of doing this; what is important is to treat this as a learning "adventure" and to put scatterplots, correlations, and regression to work for you.

1. Generate some scatterplots. You can do this with the graph command, for example,

graph totgross grossopn

This will generate a scatterplot with the opening weekend gross on the x-axis and the total gross receipts (after the movie ended its run in the theaters) on the y-axis. A scatterplot can give you a sense of the relationship between two variables.

Please note that you can only generate scatterplots for numeric variables in Stata. You cannot create a scatterplot of total gross receipts by distributor, because dist is a string variable.

Include at least 2 scatterplots in your lab assignment. The choice of variables is yours, but totgross must be the y-variable in both. For each one, please tell me what the scatterplot tells you about the relationship between x and y.

2. Generate some correlations. Ask yourself, what is the interpretation of the correlation coefficient and how is it related to a regression model? Again, totgross must be included in the correlations.

Please include a correlation matrix of variables that you used in the regression models (see #3 below) with your lab. Which variable has the strongest correlation with totgross? What is the direction of the correlation for the strongest one? Make sure you use that one in your regression model as an x-variable.

3. Construct regression models that predict totgross from (A) at least one other variable and then (B) at least two of the other variables in the dataset. For example:

regress totgross grossopn

And then issue the command:

regress totgross grossopn wednesdy

The results will give you regression coefficients, test results, and other statistics. Please include a copy of your regression output for each model and then write an interpretation of the resulting coefficients and the result of the t-test. What do the other variables tell you about predicting the total gross receipts of a movie? In other words, what is the relationship (using the example above) between the gross receipts on the opening weekend and the total gross receipts? How did this relationship change (or did it change at all) when I added the variable wednesdy (opened on a Wednesday, generally the signal of a holiday weekend)?

Come up with your own model for total gross receipts and your own interpretation. Remember, there is no "correct" answer; this is just a learning exercise. Who knows, maybe you will become an entertainment industry analyst…
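One way to see whether the relationship changed is to keep the grossopn slope from each model and compare the two. This is only a sketch, and slopeA and slopeB are made-up scalar names:

```stata
use http://www.stat.ucla.edu/~vlew/stat11/labs/movies, clear

* Model A: one predictor
regress totgross grossopn
scalar slopeA = _b[grossopn]

* Model B: add the wednesdy indicator
regress totgross grossopn wednesdy
scalar slopeB = _b[grossopn]

* Compare the grossopn slope across the two models
display "grossopn slope, model A: " slopeA
display "grossopn slope, model B: " slopeB
display "change: " slopeB - slopeA
```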

Extras/ Advanced

The variables wednesdy and winter are unusual in that they only have two values: 0=no and 1=yes. They are known as “dummy” or “indicator” variables, and while they are not really continuous quantitative variables, they behave a bit like them. Here is a model:

. regress totgross grossopn wednesdy

      Source |       SS       df       MS              Number of obs =     907
-------------+------------------------------           F(  2,   904) = 1163.05
       Model |  1.5047e+18     2  7.5234e+17           Prob > F      =  0.0000
    Residual |  5.8477e+17   904  6.4687e+14           R-squared     =  0.7201
-------------+------------------------------           Adj R-squared =  0.7195
       Total |  2.0895e+18   906  2.3062e+15           Root MSE      =  2.5e+07

------------------------------------------------------------------------------
    totgross |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    grossopn |   4.156319   .0868341    47.87   0.000     3.985899    4.326739
    wednesdy |    6026951    2545001     2.37   0.018      1032152    1.10e+07
       _cons |  -414296.5    1128741    -0.37   0.714     -2629553     1800960
------------------------------------------------------------------------------

The interpretation of the coefficient for the dummy variable wednesdy is as follows: moving from 0 to 1 (that is, comparing a movie released on a Wednesday with one released on any other day), the predicted total gross receipts are higher by $6,026,951 on average, holding the opening weekend gross constant. This difference is statistically significant (t = 2.37, p = 0.018).
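To make this concrete, here is a small worked example using the coefficients from the output above. The $10,000,000 opening weekend is a hypothetical film invented for the arithmetic, and b0, b1, b2, open10m, other and wed are made-up scalar names.

```stata
* Coefficients copied from the regression output above
scalar b0 = -414296.5
scalar b1 = 4.156319
scalar b2 = 6026951

* Hypothetical film with a $10,000,000 opening weekend
scalar open10m = 10000000

* Predicted total gross if it opened on any other day (wednesdy = 0)
scalar other = b0 + b1*open10m

* Predicted total gross if it opened on a Wednesday (wednesdy = 1)
scalar wed = b0 + b1*open10m + b2

display other
display wed
display wed - other
```

The two predictions come to about $41.1 million and $47.2 million. The gap between them is the wednesdy coefficient, $6,026,951, whatever opening weekend gross you plug in.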

This is advanced material and I don’t want to see this regression in your lab write-up. You will see it in advanced statistics courses in the economics department.