Economics 40/Statistics M11 Lab 6: Scatterplots & Regression

Due March 16, 2001

Purpose: In this lab you will use Stata to generate a scatterplot, compute some correlations, run a simple regression model (chapter 2), run a multiple regression model (chapter 11), and test a hypothesis (chapter 10).

Data: In Part I we will examine box office receipts from stage productions on Broadway in New York City:

use http://www.stat.ucla.edu/~vlew/stat11/labs/broadway

. describe

 

Contains data from http://www.stat.ucla.edu/~vlew/stat11/labs/broadway.dta

  obs:            18                         

 vars:             7                          5 Mar 2001 15:43

 size:           882 (99.5% of memory free)

------------------------------------------------------------------------

              storage  display     value

variable name   type   format      label      variable label

------------------------------------------------------------------------

show            str28  %28s                   Broadway Show Name

receipts        long   %12.0g                 Average Box Office Receipt

attendnc        int    %8.0g                  Average Attendance

capacity        int    %8.0g                  Total Capacity

type            byte   %8.0g                  Type of Production 1=other

                                                2=musical

ratio           float  %9.0g                  Average Proportion Seats      

                                              Filled

musical         float  %9.0g                  Musical? yes=1 no=0

------------------------------------------------------------------------

Sorted by: 

I would like you to generate scatterplots and correlations for this dataset using the variable receipts as the "y-variable" and any of the other variables as your x-variable.  Generate at least two scatterplots. 

The scatterplot command (for example) is:

graph receipts attendnc

and if you want to reveal the identity of any unusual observations, issue this command:

graph receipts attendnc, symbol([show])

The correlation command is:

correlate variable1 variable2 variable3 variable4

A correlation matrix will be generated.  In general, you should include only continuous quantitative variables in your correlations.  In other words, it makes no sense to correlate a categorical variable like “type”.  There are four variables which I would classify as continuous quantitative variables.  I leave it to you to decide which ones they are.

Finally, run a simple regression trying to “predict” box office receipts from one other variable in the dataset.  Choose the one that yields the highest R-squared for your model. Both variables must be quantitative.  Write out the regression equation in the form:

            y = a + bx

and clearly identify the y variable, the x variable, the slope and the intercept.  Please interpret the value of the slope and the value of the intercept in plain English.
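As a rough sketch of how the pieces of y = a + bx can be read off in Stata, the commands below fit a simple regression and then display the intercept, the slope, one predicted value, and the R-squared. The choice of attendnc as the x-variable and the attendance value of 8000 are assumptions made only for illustration; the x-variable with the highest R-squared may well be a different one.

```stata
* Sketch only: attendnc is an arbitrary choice of x-variable
use http://www.stat.ucla.edu/~vlew/stat11/labs/broadway, clear
regress receipts attendnc

* a = intercept, b = slope, taken from the fitted model
display _b[_cons]
display _b[attendnc]

* predicted receipts at a hypothetical attendance of 8000
display _b[_cons] + _b[attendnc]*8000

* R-squared of this model, for comparing candidate x-variables
display e(r2)
```

Rerunning the regress and display e(r2) lines with a different x-variable is one way to compare candidate models.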

ASSIGNMENT PART I: RECAP

1. Produce at least 2 scatterplots, each with receipts as the "y-variable" and any other variable as the "x-variable".

2. Generate a correlation matrix (actually a lower triangle) of all of the quantitative variables in the dataset.  What is your interpretation of the correlations?

3. Run a simple regression which tries to “predict” box office receipts using one other variable.  Write out the resulting regression equation and identify its components.  Interpret the slope and intercept.

IN PART II WE WILL ANALYZE SOME MOVIE INDUSTRY DATA

To get the data, issue the command:

use  http://www.stat.ucla.edu/~vlew/stat11/labs/movies

There should be 907 observations; each one is a different movie released between 1996 and 2000. Use the "summarize" command to generate summary statistics for each of the variables in the dataset (you don’t need to print that out for me). Unless you are extremely clever, you can use only the variables that have valid means and standard deviations as shown by the summarize command. For your information, the variables are:
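A minimal sketch of that check is below. String variables such as title and dist are reported by summarize with 0 observations and no mean, which is how you can tell they are not usable directly in a correlation or regression. The detail option is shown only as an optional extra.

```stata
use http://www.stat.ucla.edu/~vlew/stat11/labs/movies, clear

* all variables at once; string variables such as title and dist
* show 0 observations and no mean
summarize

* more detail (median, percentiles) for two numeric variables
summarize totgross grossopn, detail
```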

              storage  display     value

variable name   type   format      label      variable label

------------------------------------------------------------------------

title           str40  %40s                   Title

totgross        long   %12.0g                 Total Gross Receipts

grossopn        long   %12.0g                 Gross on Opening weekend

openday         int    %8.0g                  Opening number of screens

widest          int    %8.0g                  Widest number of screens

firstwk         float  %9.0g                  1st Wk % of total

perthtr         long   %12.0g                 Per Theater receipts

dist            str18  %18s                   Distributor

opendate        float  %dn/d/y                Opening Date

cldate          float  %dn/d/y                Closing Date

totdays         float  %9.0g                  Total Days Opened

wednesdy        float  %9.0g                  Opened on a Wednesday        

                                              1=yes 0=no

winter          float  %9.0g                  Opened during the winter

                                              1=yes 0=no

zper            float  %9.0g                  Standardized values of (

                                                perthtr)    

zopenday        float  %9.0g                  Standardized values of (

                                                openday) 

 

The last two variables zper and zopenday are z-scores for the variables perthtr and openday, so they give you some idea of how unusual or how “average” a particular movie was compared with the mean for all other movies.  These “standardized” variables are useful because their interpretation is easy.  Remember, Z-scores, or standard units, are just the number of standard deviations a given value is away from its mean.
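To make the definition concrete, here is a sketch of two ways a z-score could be computed in Stata. The names zcheck and zmanual are hypothetical, and the comparison with zper assumes that zper was built from the same mean and standard deviation, which the handout does not state.

```stata
* Sketch only: zcheck and zmanual are hypothetical variable names
* z = (value - mean) / standard deviation
egen zcheck = std(perthtr)

* the same calculation by hand, using the results saved by summarize
summarize perthtr
generate zmanual = (perthtr - r(mean)) / r(sd)

* each should have a mean near 0 and a standard deviation near 1
summarize zper zcheck zmanual
```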

 

If you tabulate or list any variable, you will have a fairly good idea of its interpretation. Ask for assistance if you cannot figure it out from the description given above and from a listing of the variable itself.

ASSIGNMENT PART II

The goal of the assignment is to model the variable totgross (in other words, treat this variable as the y-variable or the dependent variable) using some of the other variables as predictors. There is no one right way of doing this. What is important is to treat this as a learning "adventure" and to put scatterplots, correlations, and regression to work for you.

1. Generate some scatterplots. You can do this with the graph command. For example,

graph totgross grossopn

will generate a scatterplot with the opening weekend gross on the x-axis and the total gross receipts (after the movie ended its run in the theaters) on the y-axis. A scatterplot can give you a sense of the relationship between two variables.

Please note that you can only generate scatterplots for numeric variables in Stata. You cannot create a scatterplot of total gross receipts by distributor, because dist is a string variable.

Include at least 2 scatterplots in your lab assignment. The choice of variables is yours, but totgross must be the y-variable in both. For each one, please tell me what the scatterplot tells you about the relationship between x and y.

2. Generate some correlations. Ask yourself, what is the interpretation of the correlation coefficient and how is it related to a regression model? Again, totgross must be included in the correlations.

Please include a correlation matrix of variables that you used in the regression models (see #3 below) with your lab.  Which variable has the strongest correlation with totgross?  What is the direction of the correlation for the strongest one?  Make sure you use that one in your regression model as an x-variable.

3. Construct regression models that predict totgross from (A) at least one other variable and then (B) at least two of the other variables in the dataset. For example:

regress totgross grossopn

And then issue the command:

regress totgross grossopn wednesdy

The results will give you regression coefficients, test results, and other statistics. Please include a copy of your regression output for each model and then write an interpretation of the resulting coefficients and the result of the t-test. What do the other variables tell you about predicting the total gross receipts of a movie? In other words, what is the relationship (using the example above) between the gross receipts on the opening weekend and the total gross receipts? How did this relationship change (or did it change at all) when you added the variable wednesdy (opened on a Wednesday, generally the signal of a holiday weekend)?
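One possible way to line up the correlation and the two example models is sketched below. It uses only the handout's example predictors, grossopn and wednesdy, and relies on the results that recent versions of Stata save after correlate (r(rho)) and regress (e(r2), _b[], _se[]); treat the availability of r(rho) in older versions as an assumption. In a simple regression with one x-variable, the R-squared equals the square of the correlation between x and y, so the first two displayed numbers should agree.

```stata
* Sketch only: grossopn and wednesdy are just the handout's example predictors

* correlation between totgross and grossopn, and its square
correlate totgross grossopn
display r(rho)^2

* Model A: the R-squared should match the squared correlation
regress totgross grossopn
display e(r2)
display _b[grossopn]

* Model B: has the grossopn coefficient moved?
regress totgross grossopn wednesdy
display _b[grossopn]

* t statistic for wednesdy = coefficient / standard error
display _b[wednesdy] / _se[wednesdy]
```

Comparing the two displayed values of _b[grossopn] is one way to see whether adding wednesdy changed the relationship between opening weekend gross and total gross.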

Come up with your own model for total gross receipts and your own interpretation. Remember, there is no "correct" answer; this is just a learning exercise. Who knows, maybe you will become an entertainment industry analyst…

Extras/Advanced

The variables wednesdy and winter are unusual in that they have only two values: 0=no and 1=yes.  They are known as “dummy” or “indicator” variables, and while they are not really continuous quantitative variables, they behave a bit like them.  Here is a model:

. regress  totgross grossopn wednesdy

 

      Source |       SS       df       MS              Number of obs =     907

-------------+------------------------------           F(  2,   904) = 1163.05

       Model |  1.5047e+18     2  7.5234e+17           Prob > F      =  0.0000

    Residual |  5.8477e+17   904  6.4687e+14           R-squared     =  0.7201

-------------+------------------------------           Adj R-squared =  0.7195

       Total |  2.0895e+18   906  2.3062e+15           Root MSE      =  2.5e+07

 

------------------------------------------------------------------------------

    totgross |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

    grossopn |   4.156319   .0868341    47.87   0.000     3.985899    4.326739

    wednesdy |    6026951    2545001     2.37   0.018      1032152    1.10e+07

       _cons |  -414296.5    1128741    -0.37   0.714     -2629553     1800960

------------------------------------------------------------------------------

The interpretation of the coefficient for the dummy variable wednesdy is as follows: “as we move from 0 to 1, meaning that a movie was released on a Wednesday as opposed to any other day, the predicted total gross receipts are higher by $6,026,951 on average, holding the opening weekend gross constant.”  This difference is statistically significant at the 5% level (t = 2.37, p = 0.018).
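The arithmetic behind that interpretation can be checked with Stata's display command, as in the sketch below. The coefficients, standard error, and sums of squares are copied from the output above; the opening weekend gross of 10,000,000 is a hypothetical value chosen only for illustration. The two predictions differ by exactly the wednesdy coefficient, and the last two lines should reproduce the reported t statistic (about 2.37) and R-squared (about 0.72) up to rounding.

```stata
* coefficients copied from the regression output above
* an opening weekend gross of 10,000,000 is a hypothetical value

* predicted total gross, not opened on a Wednesday (wednesdy = 0)
display -414296.5 + 4.156319*10000000 + 6026951*0

* same opening gross, opened on a Wednesday (wednesdy = 1)
display -414296.5 + 4.156319*10000000 + 6026951*1

* t statistic for wednesdy: coefficient / standard error
display 6026951 / 2545001

* R-squared: Model SS / Total SS
display 1.5047e+18 / 2.0895e+18
```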

This is advanced material and I don’t want to see this regression in your lab write-up. You will see it in advanced statistics courses in the economics department.