Introduction to Statistical Models and Data Mining

CCLE link

Course information

Instructor: Vivian Lew
Teaching assistant: Yibiao Zhao
Lecture: MWF, 10:00-10:50am, GEOLOGY 3656
Discussion: T, 3-3:50pm (4-4:50pm), PAB 1749
TA office hours: Th, 3-4pm, BH 9406

Course Description

Designed for juniors/seniors. Applied regression analysis, with emphasis on the general linear model (e.g., multiple regression) and the generalized linear model (e.g., logistic regression). Special attention to modern extensions of regression, including regression diagnostics, graphical procedures, and bootstrapping for statistical inference. Enforced requisite: course 101B.


An Introduction to Statistical Learning with Applications in R
by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

Discussion 1 (March 31): ggplot2

- ggplot2 slides
- ggplot2 document

Discussion 2 (April 7): R Markdown and knitr

- Markdown slides
- R Markdown slides
- knitr slides

Discussion 3 (April 14): Generalized Linear Models and Machine Learning Recipe

Family objects for GLM
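As a small illustration (not part of the linked course materials), a family object such as binomial() tells glm() which link and variance function to use; the data below are simulated:

```r
# A family object bundles the link function and variance function that
# glm() uses during iteratively reweighted least squares.
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(1 + 2 * x))   # simulated binary response

fit <- glm(y ~ x, family = binomial)      # same as binomial(link = "logit")
coef(fit)                                 # estimated intercept and slope

# Other common families: gaussian() (ordinary least squares),
# poisson() for counts with a log link, Gamma() for positive
# continuous responses.
```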

Andrew Ng's talk about Deep Learning

Discussion 4 (April 21): Practice Midterm

CCLE link

Discussion 5 (April 28): Review of classification methods

- Naive Bayes
- Logistic regression
- Linear discriminant analysis
- LDA vs. QDA and PCA
- K-nearest neighbors (1 nearest neighbor and 15 nearest neighbors)
- Error rate vs. degrees of freedom
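A quick sketch of two of these classifiers on R's built-in iris data (an illustrative train/test split, not from the course materials; accuracies vary with the seed):

```r
library(class)  # knn(); a recommended package shipped with R
library(MASS)   # lda(); also a recommended package

set.seed(1)
idx   <- sample(nrow(iris), 100)   # 100 training rows, 50 test rows
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]
truth <- iris$Species[-idx]

# k = 1 memorizes the training set (flexible, high variance);
# k = 15 averages over a neighborhood (smoother decision boundary).
pred1  <- knn(train, test, cl, k = 1)
pred15 <- knn(train, test, cl, k = 15)
c(k1 = mean(pred1 == truth), k15 = mean(pred15 == truth))

# LDA assumes Gaussian classes with a shared covariance matrix.
lda_pred <- predict(lda(Species ~ ., data = iris[idx, ]), iris[-idx, ])$class
mean(lda_pred == truth)
```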

Discussion 6 (May 5): Loss function and Regularization

Review of cross-validation: recall the machine learning recipe as the criterion for the training/testing partition.
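The recipe can be sketched as a hand-rolled 5-fold cross-validation on R's built-in mtcars data (an illustrative example, not from the course materials):

```r
# 5-fold cross-validation by hand: every observation is held out exactly
# once, so the averaged error estimates out-of-sample performance.
set.seed(1)
n     <- nrow(mtcars)
folds <- sample(rep(1:5, length.out = n))  # random fold assignment

cv_mse <- sapply(1:5, function(k) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != k, ])  # train
  pred <- predict(fit, newdata = mtcars[folds == k, ])    # test
  mean((mtcars$mpg[folds == k] - pred)^2)
})
mean(cv_mse)  # cross-validated mean squared error
```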
Loss function and regularization:

L1 regularization on least squares (Lasso): minimize ||y - Xβ||² + λ Σ_j |β_j|, or equivalently minimize ||y - Xβ||² subject to Σ_j |β_j| ≤ t.

L2 regularization on least squares (ridge regression): minimize ||y - Xβ||² + λ Σ_j β_j².

The Lp ball in three dimensions: as the value of p decreases, the corresponding Lp ball shrinks.

The difference between L1 and L2: in graph (a), the black square represents the feasible region of the L1 constraint, while graph (b) shows the feasible region of the L2 constraint. The contours in the plots represent different values of the loss (for the unconstrained regression model). The feasible point that minimizes the loss is more likely to fall on a coordinate axis in graph (a) than in graph (b), because the L1 region is angular. This effect amplifies as the number of coefficients increases, e.g., from 2 to 200.

The implication is that L1 regularization gives you sparse estimates: in a high-dimensional space you get mostly zeros and a small number of non-zero coefficients. This is huge, since it builds variable selection into the modeling problem. In addition, if you have to score a large sample with your model, you can save a lot of computation, since you never need to compute the features (predictors) whose coefficients are 0.
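The sparsity effect can be seen directly in the orthonormal-design special case, where both penalized estimators have closed forms applied coordinate-wise to the least-squares coefficients (the values of b and lambda below are made up for illustration):

```r
# For an orthonormal design, the penalized solutions act coordinate-wise
# on the least-squares coefficients b:
#   ridge: b / (1 + lambda)                     -- shrinks, never exactly 0
#   lasso: sign(b) * pmax(abs(b) - lambda, 0)   -- soft-threshold, exact 0s
b      <- c(3, 0.4, -2, 0.1, 0.05)   # hypothetical least-squares estimates
lambda <- 0.5

ridge <- b / (1 + lambda)
lasso <- sign(b) * pmax(abs(b) - lambda, 0)

ridge  # every entry shrunk toward 0, but none exactly 0
lasso  # 2.5, 0, -1.5, 0, 0 -- small coefficients dropped entirely
```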

Discussion board