# Introduction to Statistical Models and Data Mining

## Course information

Instructor: Vivian Lew (vlew@stat.ucla.edu)
Teaching assistant: Yibiao Zhao (ybzhao@ucla.edu)
Lecture: MWF, 10:00-10:50am, GEOLOGY 3656
Discussion: T, 3-3:50pm (4-4:50pm), PAB 1749
TA office hours: Th, 3-4pm, BH 9406

## Course Description

Designed for juniors/seniors. Applied regression analysis, with emphasis on the general linear model (e.g., multiple regression) and the generalized linear model (e.g., logistic regression). Special attention to modern extensions of regression, including regression diagnostics, graphical procedures, and bootstrapping for statistical inference. Enforced requisite: course 101B.

## Textbook

An Introduction to Statistical Learning with Applications in R
by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

## Discussion 5 (April 28): Review of classification methods

Naive Bayes

Logistic Regression

Linear discriminant analysis

LDA vs. QDA and PCA

K-nearest neighbors (1 nearest neighbor vs. 15 nearest neighbors)

Error rate vs. degrees of freedom
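
To make the 1-NN vs. 15-NN comparison above concrete, here is a minimal sketch of k-nearest-neighbor classification on hypothetical two-class data (the data, the blob locations, and the function name `knn_predict` are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: two Gaussian blobs (hypothetical example).
n = 100
X = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(2, 1, (n, 2))])
y = np.array([0] * n + [1] * n)

def knn_predict(X_train, y_train, X_query, k):
    """Classify each query point by majority vote among its k nearest neighbors."""
    preds = []
    for q in X_query:
        dists = np.sum((X_train - q) ** 2, axis=1)  # squared Euclidean distances
        nearest = np.argsort(dists)[:k]             # indices of the k closest points
        preds.append(np.bincount(y_train[nearest]).argmax())  # majority label
    return np.array(preds)

# 1-NN reproduces the training labels exactly (zero training error, high variance);
# 15-NN averages over a neighborhood, giving a smoother boundary at the cost of
# some training error -- the bias/variance trade-off behind "error rate vs.
# degrees of freedom".
for k in (1, 15):
    err = np.mean(knn_predict(X, y, X, k) != y)
    print(f"k={k}: training error = {err:.3f}")
```

The training error for k=1 is always zero, which is exactly why training error alone cannot be used to choose k; held-out (test) error must be used instead.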

## Discussion 6 (May 5): Loss function and Regularization

Review of cross-validation: recall the machine-learning recipe of partitioning the data into training and test sets, and using held-out error as the criterion for model selection.
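
As a reminder of how that recipe works in k-fold form, here is a sketch of 5-fold cross-validation for ordinary least squares on hypothetical data (the data, the fold count, and the helper name `kfold_mse` are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data for a least-squares fit.
X = rng.normal(size=(60, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=60)

def kfold_mse(X, y, n_folds=5):
    """Average held-out MSE of ordinary least squares over k folds."""
    idx = rng.permutation(len(y))                 # shuffle before splitting
    folds = np.array_split(idx, n_folds)
    errors = []
    for f in folds:
        train = np.setdiff1d(idx, f)              # everything outside the held-out fold
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[f] - X[f] @ beta                # evaluate on the held-out fold only
        errors.append(np.mean(resid ** 2))
    return np.mean(errors)

print(f"5-fold CV estimate of test MSE: {kfold_mse(X, y):.4f}")
```

The key point is that each observation is used for testing exactly once, and the model being scored never saw the fold it is scored on.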
Loss function and Regularization:
L1 regularization on least squares (Lasso):
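
The equation for this item appears to be missing (likely an image that did not survive). A standard formulation of the Lasso objective, with response vector y, design matrix X, coefficient vector β, and tuning parameter λ ≥ 0, is:

```latex
\hat{\beta}^{\text{lasso}}
  = \operatorname*{arg\,min}_{\beta}\;
    \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1,
\qquad
\lVert \beta \rVert_1 = \sum_{j=1}^{p} \lvert \beta_j \rvert
```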

L2 regularization on least squares (Ridge regression):
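
The equation here also appears to be missing. The standard ridge-regression objective, in the same notation (y, X, β, λ as above), is:

```latex
\hat{\beta}^{\text{ridge}}
  = \operatorname*{arg\,min}_{\beta}\;
    \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2,
\qquad
\lVert \beta \rVert_2^2 = \sum_{j=1}^{p} \beta_j^2
```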

The Lp ball in three dimensions. As the value of p decreases, the corresponding Lp ball shrinks and becomes more concentrated along the coordinate axes.

The difference between L1 and L2. In graph (a), the black square represents the feasible region of the L1 regularization, while graph (b) shows the feasible region for L2 regularization. The contours in the plots represent different loss values (for the unconstrained regression model). The feasible point that minimizes the loss is more likely to fall on a coordinate axis in graph (a) than in graph (b), since the L1 region is more angular (it has corners). This effect amplifies as the number of coefficients increases, e.g., from 2 to 200.

The implication of this is that L1 regularization gives you sparse estimates: in a high-dimensional space, you get mostly zeros and only a small number of non-zero coefficients. This matters because it builds variable selection into the modeling problem. In addition, if you have to score a large sample with your model, you can save a lot of computation, since features (predictors) whose coefficients are 0 never need to be computed.
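
The sparsity contrast can be demonstrated in a setting where both estimators have closed forms. With an orthonormal design matrix, ridge uniformly shrinks the OLS coefficients, while the Lasso soft-thresholds them, setting small ones exactly to zero. This sketch uses hypothetical data (the coefficient vector, noise level, and λ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Orthonormal design via QR; true model has 3 signal and 5 null coefficients.
n, p = 200, 8
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta_true = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = Q @ beta_true + rng.normal(scale=0.1, size=n)

beta_ols = Q.T @ y                # OLS solution under an orthonormal design
lam = 1.0

# Ridge: proportional shrinkage of every coefficient toward zero.
beta_ridge = beta_ols / (1.0 + lam)

# Lasso: soft-thresholding -- coefficients below lam/2 in magnitude become exactly 0.
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

print("nonzero ridge coefficients:", np.count_nonzero(beta_ridge))
print("nonzero lasso coefficients:", np.count_nonzero(beta_lasso))
```

Ridge keeps all eight coefficients nonzero (just smaller), whereas the Lasso zeroes out the null ones, which is the variable-selection behavior described above.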