Statistics 202a.
Statistics programming. Prof. Rick Schoenberg.
F24
Lectures: Tue Thu 9:30-10:45am in Bunche 3164.
I am not maintaining the CCLE or Canvas site. The main course website is
http://www.stat.ucla.edu/~frederic/202a/F24 , and course materials, including the syllabus and lecture notes, will be there.
Texts:
1. Introduction to Data Science (2020) by Rafael Irizarry.
https://rafalab.github.io/dsbook .
2. Automate the Boring Stuff with Python, 2nd edition (2019) by Al Sweigart.
https://automatetheboringstuff.com .
3. R Programming for Data Science (2020) by Roger Peng.
https://bookdown.org/rdpeng/rprogdatascience .
4. The C Programming Language, 2nd edition (1988) by BW Kernighan and DM Ritchie.
We will mostly be using the first two books.
The first three books are free. The 4th book is more for reference and is completely optional.
Office hours: Thu 10:50-11:30am, 3873 Slichter.
email: frederic@stat.ucla.edu
There will be no class on Thu Oct31!!!
Also no class on Thanksgiving, Thu Nov28.
Statistics 202a will explore computational statistics and will focus especially
on computing in Python, R, and C.
The course is designed for graduate students
with solid mathematical and statistical backgrounds.
A preliminary outline of the class is given below, though the order may change.
1. Managing input and output in R, tidyverse, programming basics.
Peng ch4-6, Peng ch8, Irizarry ch3-5.
2. Subsetting R objects, managing dataframes, dplyr, join, bind, data visualization.
Peng ch9, 12, Irizarry ch6-10, 22.
3. Functions, regular expressions, debugging, profiling, web scraping, stringr, text mining.
Peng ch14,17,18,19, Irizarry ch23, 24, 26.
4. Simulation, parallel computation. Python basics, functions, methods.
Peng ch20,21. Sweigart ch1-4.
5. Machine learning, smoothing. Python dictionaries, string manipulations, regular expressions, reading and writing to files, web scraping, and time.
Irizarry ch27,28. Sweigart ch5-9, 12, 17.
6. Cross validation, caret, classification, regression trees, random forests.
Irizarry ch29-31.
7. Large datasets.
Irizarry ch33.
8. Compiling C, functions on matrices and dataframes,
kernel density estimation in R, C basics.
9. Functions and loops in C, using C in R.
10. Nonparametric regression in R, generalized additive models in R.
11. Variables, vectors, matrices, arrays, structures, strings, and pointers in C.
12. Managing input and output in C, calling C functions from C,
running C from terminal.
13. Optimization in R.
14. Calling R from C.
15. MLE in general, and for Hawkes point processes.
Newton-Raphson and quasi-Newton (e.g. BFGS) optimization for the MLE, using optim().
16. Building R packages.
Grading:
Homeworks (80%), written project (15%),
oral presentation/participation (5%).
Homeworks will be assigned on the main course website.
Homeworks may be turned in by email to statgrader@stat.ucla.edu .
Each homework is graded out of 10 points.
Attendance on Nov26, Dec3, and Dec5 is mandatory for all students, for the oral reports.
Attendance for lectures, on all other days, is generally not mandatory and not
counted as part of the grade.
However, if you cannot attend, please contact another student to find out what you missed rather than asking me to fill you in.
Late homeworks will not be accepted at all.
There will be no extensions for the project or presentation.
Students who are unable to make these dates or otherwise fulfill
the course requirements must consult with the instructor in advance, if possible.
Students with learning disabilities must consult with the instructor by the 2nd
week of class if special arrangements are required.
Written Project: due Dec10 11:59pm, by email to frederic@stat.ucla.edu.
Oral presentations: Nov26, Dec3 and Dec5.
No final exam.
Description of Written Project:
(to come)
Oral presentations of project results will take place during lecture.
Each presentation should give a clear, concise, and very brief summary of
an analysis of some data using some of the methods we discuss in class. At least one method should be performed in C. More description will be given in later lectures.