Statistics 202a.
Statistics programming. Prof. Rick Paik Schoenberg.
F22
Lectures: Mon Wed 12:30-1:45pm in Young 1044.
I am not maintaining the CCLE or Canvas site. The main course website is
http://www.stat.ucla.edu/~frederic/202a/F22 , and course materials, including the syllabus and lecture notes, will be there.
Texts:
1. Introduction to Data Science (2020) by Rafael Irizarry.
https://rafalab.github.io/dsbook .
2. Automate the Boring Stuff with Python, 2nd edition (2019) by Al Sweigart.
https://automatetheboringstuff.com .
3. R Programming for Data Science (2020) by Roger Peng.
https://bookdown.org/rdpeng/rprogdatascience .
4. The C Programming Language, 2nd edition (1988) by B. W. Kernighan and D. M. Ritchie.
We will mostly be using the first two books.
The first three books are free. The 4th book is more for reference and is completely optional.
Office hours: Wed 11-11:45am, 3873 Slichter.
email: frederic@stat.ucla.edu
Course Website: http://www.stat.ucla.edu/~frederic/202a/F22 .
Statistics 202a will explore computational statistics and will focus especially
on computing in Python, R, and C.
The course is designed for graduate students
with solid mathematical and statistical backgrounds.
A preliminary outline of the class is given below, though the order may change.
1. Managing input and output in R, tidyverse, programming basics.
Peng ch4-6, Peng ch8, Irizarry ch3-5.
2. Subsetting R objects, managing dataframes, dplyr, join, bind, data visualization.
Peng ch9, 12, Irizarry ch6-10, 22.
3. Functions, regular expressions, debugging, profiling, web scraping, stringr, text mining.
Peng ch14,17,18,19, Irizarry ch23, 24, 26.
4. Simulation, parallel computation. Python basics, functions, methods.
Peng ch20,21. Sweigart ch1-4.
5. Machine learning, smoothing. Python dictionaries, string manipulations, regular expressions, reading and writing to files, web scraping, and time.
Irizarry ch27,28. Sweigart ch5-9, 12, 17.
6. Cross validation, caret, classification, regression trees, random forests.
Irizarry ch29-31.
7. Large datasets.
Irizarry ch33.
8. Compiling C, functions on matrices and dataframes,
kernel density estimation in R, C basics.
9. Functions and loops in C, using C in R.
10. Nonparametric regression in R, generalized additive models in R.
11. Variables, vectors, matrices, arrays, structures, strings, and pointers in C.
12. Managing input and output in C, calling C functions from C,
running C from terminal.
13. Optimization in R.
14. Calling R from C.
15. MLE in general, and for Hawkes point processes.
Quasi-Newton optimization for the MLE using optim().
16. Building R packages.
Grading:
Homeworks (80%), written project (15%),
oral presentation/participation (5%).
Homeworks will be assigned on the main course website.
Homeworks may be turned in by email to statgrader@stat.ucla.edu . Each homework is graded out of 10 points. Any homework submitted after the first five minutes of class will be marked late, and 1 point will be taken off for every 5 minutes after the beginning of class.
Attendance on Nov 23, Nov 28, and Nov 30 is mandatory for all students, for the oral reports.
Attendance for lectures, on all other days, is generally not mandatory and not
counted as part of the grade.
However, if you cannot attend, please contact another student to find out what you missed rather than asking me to fill you in.
Late homeworks will not be accepted at all.
There will be no extensions for the project or presentation.
Students who are unable to make these dates or otherwise fulfill
the course requirements must consult with the instructor in advance, if possible.
Students with learning disabilities must consult with the instructor by the 2nd
week of class if special arrangements are required.
Written Project: due Fri Dec 9, 11:59pm, by email to frederic@stat.ucla.edu.
Oral presentations: Nov 23, Nov 28, and Nov 30.
No final exam.
Description of Written Project:
(to come)
Oral presentations of project results will take place during lecture.
Each presentation should be a clear, concise, and very brief summary of
some data using some of the methods we discuss in class. At least one method should be performed in C. More description will be given in later lectures.