Syllabus. Spring 2010.
Statistics 101c: Introduction to Regression and Data Mining.
Prof. Rick Paik Schoenberg.
Lectures: MW 3:00-3:50pm, Boelter 9413.
Office hours: Mondays, 4-5pm, MS 8965.
email: frederic@stat.ucla.edu
Course webpage:
http://www.stat.ucla.edu/~frederic/101c/S10
Required Text:
"A Modern Approach to Regression with R" by Simon Sheather.
The text is available online from a UCLA computer or using a UCLA Proxy at
http://www.springerlink.com/content/q443t1.
Optional readings:
"Linear Models with R" by J. Faraway.
"Extending the Linear Model with R" by J. Faraway.
Discussion sections: Thursday, 4-4:50pm in Boelter 9413 with Sean
Wang.
Description: Applied regression analysis, with emphasis on multiple
regression, kernel regression, and generalized linear models
(e.g. logistic and Poisson regression). Special attention to modern extensions
of regression, including regression diagnostics,
graphical procedures, clustering and multivariate analysis, and
point process regression.
Grading: Homework (15%), Midterm (25%), Project (20%), Final exam
(40%).
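As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function name and the example scores are made up for illustration, and it assumes each component is first converted to a percentage out of 100; the actual computation of final grades may differ.

```python
# Weights taken from the grading breakdown above (in percent).
WEIGHTS = {"homework": 15, "midterm": 25, "project": 20, "final": 40}

def course_score(percentages):
    """Weighted average of component scores, each on a 0-100 scale.

    `percentages` is a hypothetical dict with the same keys as WEIGHTS.
    """
    total = sum(WEIGHTS[name] * percentages[name] for name in WEIGHTS)
    return total / 100

# Made-up example scores, for illustration only.
example = {"homework": 90, "midterm": 80, "project": 85, "final": 75}
print(course_score(example))
```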
The midterm is Wednesday, May 12, 3:00-3:50pm, in Boelter 2444.
The project is due by email by 11pm on Sat, June 5.
The final exam is Thursday, June 10, 3-6pm, in MS 6229.
No class Wednesday, March 31 because I will be away,
and no class Monday, May 31 for Memorial Day.
Homeworks must be handed in at the beginning of class, or may be slipped under
my office door (MS 8965) any time before class. Each homework assignment
is graded out of 10 points.
Homeworks handed in between 5 and 10 minutes after class has begun
will be given a one-point deduction.
Those handed in between 10 and 20 minutes late will be given a two-point deduction.
Homeworks handed in between 20 minutes late and the end of class
will be given a three-point deduction.
Homeworks submitted after lecture is over will not be accepted. Homeworks
must be submitted in hard copy, rather than by email or fax.
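The late policy above can be summarized in a short Python sketch. Two points are assumptions, not stated in the policy: homework handed in less than 5 minutes late receives no deduction, and a time falling exactly on a boundary (e.g. exactly 10 minutes) goes to the larger deduction. The 50-minute class length comes from the lecture time listed above; the function name is hypothetical.

```python
def late_deduction(minutes_late, class_length=50):
    """Points deducted (out of 10) for homework handed in `minutes_late`
    minutes after class begins; None means the homework is not accepted."""
    if minutes_late > class_length:
        return None   # after lecture is over: not accepted
    if minutes_late < 5:
        return 0      # assumption: under 5 minutes late, no deduction
    if minutes_late < 10:
        return 1      # between 5 and 10 minutes late
    if minutes_late < 20:
        return 2      # between 10 and 20 minutes late
    return 3          # between 20 minutes late and the end of class

for m in (0, 7, 15, 30, 60):
    print(m, late_deduction(m))
```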
Rough Outline:
Week 1: Variable selection in multiple regression.
Week 2: Logistic regression.
Week 3: Logistic regression diagnostics.
Week 4: Poisson regression.
Week 5: Kernel regression.
Week 6: Generalized linear models and the exponential family.
Week 7: Review and midterm.
Week 8: Serial correlation and Generalized Least Squares.
Week 9: Robust regression.
Week 10: Projects, review.
hw1. 7.2 and 7.3 from Sheather, due Friday, April 9.
hw2. 8.2 from Sheather, due Fri Apr 16.
hw3. 3.5 from "Extending the Linear Model with R" by Faraway, due Fri Apr 23. See
day9.txt.
hw4. 4ab from "Extending the Linear Model with R" by Faraway, due Fri May 7
by email in plain text or pdf to seanwang@ucla.edu.
hw5. Problem 2 of Chapter 9 in Sheather, pp. 328-329, due Fri, by email in
pdf to seanwang@ucla.edu or on paper to me at or before the beginning of
class.
For the projects, find a dataset and analyze it using the methods we have
discussed in class. Your response variable, Y, should be non-negative-integer-valued.
You should also have at least 2 explanatory variables, and at least 20
observations (n). Your topic may be non-academic, and should be based on
something you are genuinely interested in, such as a hobby or
extra-curricular activity of yours. Examples of response variables might
be:
-- the number of points scored by LeBron James, per game.
-- the number of votes, per political candidate.
-- the number of Americans who saw a movie, per movie.
-- the number of Facebook friends, per person.
Analyze your data using OLS, Poisson regression, kernel regression,
and at least one other method (binomial regression, least
trimmed squares, GLS, WLS, M-estimation, ridge regression).
For each of these methods, show your residuals and analyze goodness-of-fit
as appropriate.
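To give a sense of what one of these fits involves, here is a minimal sketch of Poisson regression with a log link, fit by Newton-Raphson, together with Pearson residuals. It is written in plain Python only so that it is self-contained; in R, the language of the course text, the same fit would normally be obtained with glm(y ~ x1 + x2, family = poisson). The function names and the toy data are invented for illustration and are not course data.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fitted_means(X, beta):
    """Fitted Poisson means exp(x'beta) for each row of X."""
    return [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]

def fit_poisson(X, y, iters=25):
    """Newton-Raphson for log-link Poisson regression.
    Each row of X must start with 1 (the intercept column)."""
    p = len(X[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(y) / len(y))   # start at the intercept-only fit
    for _ in range(iters):
        mu = fitted_means(X, beta)
        # Score vector X'(y - mu) and information matrix X' diag(mu) X.
        grad = [sum(r[j] * (yi - mi) for r, yi, mi in zip(X, y, mu))
                for j in range(p)]
        info = [[sum(r[j] * r[k] * mi for r, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Toy data, invented for illustration: 8 observations, 2 explanatory variables.
x1 = [0, 1, 2, 3, 4, 5, 6, 7]
x2 = [1, 0, 1, 0, 1, 0, 1, 0]
y = [1, 2, 2, 4, 5, 7, 11, 14]
X = [[1.0, a, b] for a, b in zip(x1, x2)]

beta = fit_poisson(X, y)
mu = fitted_means(X, beta)
pearson = [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]
print("coefficients:", beta)
print("Pearson residuals:", pearson)
```

Plotting the Pearson residuals against the fitted means is one simple way to start examining goodness-of-fit for a Poisson model.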
Your written report should be 3-5 pages of written text, followed by as
many figures and tables as you'd like at the end, in an appendix. Do not
include the figures in your text -- instead, just have all the figures at
the end. There is no need to explain in your text what the different
methods you use are. Instead, focus on interpreting your results. Your
report should contain an introduction (1/2 to 1 page) explaining why your
data are interesting or important, a results section (2-3 pages),
in which you comment on each of your figures, tables, and results, and
a conclusion (1/2 to 1 page), in which you summarize your main findings
and describe problems with your dataset and analysis. In the results section,
be sure to explain the main interesting features of your figures. In your
conclusion, you are encouraged to speculate in interpreting your results and,
in particular, on how any problems with your data collection or analysis may
have influenced your results.