Instructor: Yitzchak Elchanan Solomon (水神恩)

Office Hours: By appointment

TA: Taozheng Zhu (taozheng.zhu@duke.edu)

TA Office Hours: Mondays 9am-11am, IB2001, and by appointment

Fall Term 2, Session 2, 2021

MoTuWeTh 8:10pm–9:10pm

There are no mandatory textbooks, but I will make use of the following free texts:

1. An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani)

2. Pattern Recognition and Machine Learning (Bishop)

3. Neural Networks and Deep Learning (Nielsen)

What is this course about?

Stats 302 is an introductory course designed to teach students about the major concepts, themes, and techniques used in contemporary machine learning. The emphasis of this course is on practical understanding.

The course is divided into a number of modules. The first module focuses on the broad goals and challenges of machine learning: regression and classification, the bias-variance tradeoff, sampling, accuracy vs. interpretability, supervised vs. unsupervised learning, etc. The second module covers linear models — a large class of simple and useful techniques for making predictions with data. The third module transitions to the world of non-linear models by introducing decision trees, as well as a variety of techniques for combining models, like bagging and boosting. The next two modules set aside prediction and focus on unsupervised goals: dimensionality reduction, visualization, clustering, and outlier detection. In the final two modules, we will introduce neural networks and study how they are built and trained, with applications to image analysis, time series analysis, and representations of challenging data types, e.g. text.
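To give a taste of the first module's themes, here is a minimal NumPy sketch of the bias-variance tradeoff and validation (the data set is made up, and the polynomial degrees are just illustrative): flexible models fit the training data ever more closely, but error on held-out data tells a different story.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised-learning problem: y = sin(x) + noise
x = rng.uniform(0, 2 * np.pi, 60)
y = np.sin(x) + rng.normal(0, 0.2, 60)

# Hold out part of the data to estimate test error (validation)
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

def train_test_mse(degree):
    # Fit a polynomial of the given degree; higher degree = more flexibility
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in [1, 3, 12]:
    tr, te = train_test_mse(degree)
    print(f"degree {degree:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

A degree-1 line underfits the sine curve (high bias); a very high degree chases the noise (high variance), so its training error keeps dropping even when its test error does not.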

Students should be familiar with the principles of multivariable calculus, linear algebra, and probability theory, as covered in an undergraduate course. The course will also have a coding component (in Python), so familiarity with programming will be important. The following links may be useful for learning Python or brushing up your skills:

– https://www.learnpython.org/

– https://numpy.org/learn/ (NumPy is an important Python library for numerical computing)
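If you are new to NumPy, the following minimal sketch shows the two habits the labs will rely on most — vectorized arithmetic and basic linear algebra (the numbers are just illustrative):

```python
import numpy as np

# Vectorized arithmetic: operations apply elementwise to whole arrays
v = np.array([1.0, 2.0, 3.0])
print(v * 2)      # [2. 4. 6.]
print(v.mean())   # 2.0

# Matrices and linear algebra, the workhorses of this course
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
b = np.array([4.0, 9.0])
x = np.linalg.solve(A, b)  # solve the linear system A @ x = b
print(x)                   # [2. 3.]
```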

How will this course be organized?

The course will consist of daily lectures, over Zoom. The lectures will be recorded and uploaded to YouTube for asynchronous students to access, as well as for synchronous students to review. Each week will be accompanied by a coding “lab”: an iPython notebook with some draft code that explores the concepts of that week's module and how they can be implemented on data. There will be weekly graded homework containing both theoretical problems and coding challenges related to the iPython notebook. Homework will be due on the Wednesday of the week after it is assigned. The iPython notebook does most of the “heavy lifting” in terms of preparing the data, visualizing it, and running the analysis, but you will need to adjust the code or add your own code to solve the challenges. As the course progresses, you will have to read some Python documentation (links will be provided) to learn how to implement new techniques — this is a very important skill in applied machine learning.

The midterm will be a data set analysis project, where you will be put in a group and assigned a data set to analyze using the techniques of the first four weeks, producing a report that will be graded. The final will also be a written report, prepared in a group, where you will pick an advanced topic to research and explain.

Grading in more detail

The grading scheme is: Homework 40%, Midterm 30%, Final 30%. There will be one homework assignment each week, excluding the final week, and the lowest homework grade will be dropped. Because the lowest grade is dropped, no late homework will be accepted outside of emergency situations.

The midterm will be a typed-up document or iPython notebook containing an explanation of your data analysis project: the data you analyzed, the problems you tried to solve, the techniques you applied (and why you chose them), your results, and a conclusion. **Midterm Grading Rubric: Graded out of 20, 5 points each for the following components:** (1) Explain the data and conduct some exploratory data analysis, (2) Decide on a learning task, explain what model you will use and why, (3) Implement and tune a model/models, and explain the process of optimizing/developing the model, (4) Explain the output of the model, its implication for your learning task, conclusions, next steps, etc.

The final will be a group written report, on an advanced topic taken from a list of suggested topics, or a topic of your choice that has my approval. **Final Grading Rubric: Graded out of 20, 5 points for each of the following components:** (1) Goal: What is the learning task that motivates this advanced technique? (~1-2 pages), (2) The Model: Explain the model used to address the learning task. Explain how the model works, both conceptually and mathematically. (~2-4 pages), (3) Training: Explain how the model is trained, i.e. what the loss function is, what the optimization scheme looks like, etc. (~1-3 pages), (4) Application: Implement the model on a data set and showcase the results. (~2-3 pages).

LECTURE NOTES

LECTURE RECORDINGS

| Week | Topic | ipynb code | Homework |
| --- | --- | --- | --- |
| 1 | Basic Concepts of ML (Statistical Learning, Accuracy vs. Interpretability, Supervised vs. Unsupervised, Regression vs. Classification, Bias-Variance Tradeoff, Curse of Dimensionality, Validation, Resampling) | BiasVariance, BiasVarianceSols | Hwk 1, Hwk 1 Sols |
| 2 | Linear Models (Decision Boundaries, Linear Regression, Logistic Regression, Discriminant Functions, Generative Models, Geometry of Least Squares, Subset Selection) | BostonOLS, BostonOLSsols | Hwk 2, Hwk 2 Sols |
| 3 | Tree-Based Models (Decision Trees, Classification Trees, CART, Pruning, Bagging, Boosting, Random Forests), kNN and Generative Models | Tree Models, Tree Models Sols | Hwk 3, Hwk 3 Sols |
| 4 | Dimensionality Reduction (PCA, MDS, Isomap, Kernel PCA) | Dim Reduction, Dim Reduction Sols, ChineseMNIST | Hwk 4, Hwk 4 Sols |
| 5 | Clustering and Outlier Analysis (K-Means, K-Medoids, Hierarchical Clustering, Gaussian Mixtures, LOF, Mahalanobis Distance) | Clustering, Clustering Sols | Hwk 5 (+ Midterm due end of week), Hwk 5 Sols |
| 6 | Introduction to Neural Networks (Neural Networks, Sigmoids, Activations, Hidden Layers, Backpropagation, Cross-Entropy, Softmax, Regularization, Universal Approximation, Optimization) | NNandGD, NNandGDsols | Hwk 6, Hwk 6 Sols |
| 7 | CNNs, RNNs, Representation Learning (Convolutional Layers, Padding, Pooling, Segmentation, Detection, Recurrent Neurons, Backprop Through Time, NLP, Word2Vec) | Convolution | Hwk 7 (+ Final due beginning of week 8) |