This course is designed for graduate and advanced undergraduate students who wish to learn the fundamentals of data science and machine learning in the context of real-world applications. An emphasis will be placed on problems encountered by companies such as Amazon, Booking.com, Netflix, Uber/Lyft, The New York Times, and others. Despite its focus on applications, the course will be mathematically rigorous; each tool will be motivated by a concrete problem arising in industry. The course will follow an online IPython notebook in which students can try out the various algorithms as we go through the material.
There will be no midterms or exams; instead, assignments will be handed in periodically throughout the term.
Update: While in previous years students were free to select their own projects, for various reasons I have decided to have everyone work with the same dataset this year. Due to the growing size of the class, this will allow me to answer questions more efficiently and to focus on the relevant data science concepts. The project will be announced during the first few lectures of the class.
Exposure to undergraduate-level probability, statistics, calculus, programming, and linear algebra.
- 50% Assignments
- 50% Final Project
- Problems that arise in industry involving data.
- Introduction to regression, classification, clustering. Model training and evaluation.
- Regression: Linear Regression, Random Forest, Gradient Boosting. Examples: ETA prediction for taxis, real estate price prediction, newspaper demand forecasting.
- Classification: Logistic Regression, Random Forest, Gradient Boosting. Examples: User Churn, Acquisition and Conversion.
- Model selection and feature selection. Regularization. Real world performance evaluation and monitoring.
- Examples from publishing, ride sharing, online commerce and more.
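To make the regression topic concrete, here is a minimal sketch of ordinary least squares with a single feature, in the spirit of the taxi ETA example above. The data is hypothetical: trip distance in kilometers against ETA in minutes.

```python
# A minimal sketch of one-feature ordinary least squares (OLS).
# The distances/ETAs below are made-up illustrative numbers.

def fit_ols(xs, ys):
    """Closed-form OLS for y = a*x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x          # slope: minutes of ETA per km
    b = mean_y - a * mean_x     # intercept: fixed pickup overhead
    return a, b

# Hypothetical trips: ETA grows about 3 min/km plus a 5 min pickup time.
distances = [1.0, 2.0, 4.0, 6.0, 8.0]
etas = [8.0, 11.0, 17.0, 23.0, 29.0]
a, b = fit_ols(distances, etas)
print(round(a, 2), round(b, 2))  # → 3.0 5.0
```

In the course notebook the same fit would typically be done with a library model (e.g. scikit-learn), which also handles multiple features; the closed-form version above just exposes the arithmetic.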
- Clustering: K-means, DBSCAN, Gaussian Mixture Models and Expectation-Maximization.
- Correlation of features. Principal Component Analysis. The curse of dimensionality.
- LDA and topic modeling.
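As a preview of the clustering material, the following is a bare-bones sketch of Lloyd's algorithm for K-means on 2-D points. The two "blobs" of points are hypothetical toy data, not from the course dataset.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 2-D points (illustrative sketch only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                  + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

# Two well-separated toy blobs around (0.1, 0.07) and (5.1, 5.07).
pts = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
       (5.0, 5.1), (5.1, 5.0), (5.2, 5.1)]
centers = sorted(kmeans(pts, 2))
print(centers)
```

DBSCAN and Gaussian mixtures address weaknesses of this scheme (choosing k, non-spherical clusters), which is exactly why the course covers all three.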
- A/B experiments. Causal inference introduction.
- Offline and Online policy discovery.
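A common first calculation in the A/B experiments unit is comparing two conversion rates. Below is a sketch of the pooled two-proportion z-statistic; the counts are hypothetical, not results from any real experiment.

```python
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for comparing two conversion rates (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: variant B converts 120/1000 vs. A's 100/1000.
z = two_proportion_z(100, 1000, 120, 1000)
print(round(z, 2))  # → 1.43, below the usual 1.96 two-sided threshold
```

A z of about 1.43 would not clear the conventional 5% significance bar, which motivates the later discussion of sample sizing and of causal inference beyond simple A/B splits.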
- MapReduce. SQL.
- Feature engineering: Testing out new features and verifying their predictive power.
- The basics of API building.
- Collaborative Filtering: Matrix Factorization, Neighborhood Models and Graph Diffusion.
- Content Filtering: Topic Modeling, Regression, Classification.
- Cold Starts. Continuous Cold Starts. Warm Starts. Performance Comparison and Analysis.
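The matrix factorization approach to collaborative filtering can be sketched in a few lines: learn a latent vector per user and per item so that their dot product approximates the observed rating, fitting by stochastic gradient descent. The (user, item, rating) triples below are hypothetical.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02,
              epochs=200, seed=0):
    """SGD matrix factorization: rating ≈ dot(P[u], Q[i])."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# Hypothetical (user, item, rating) triples on a tiny 3x3 catalogue.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 4), (2, 2, 2)]
P, Q = factorize(ratings, 3, 3)
pred = sum(P[0][f] * Q[0][f] for f in range(2))
print(round(pred, 1))  # should land close to the observed rating of 5
```

Note what this sketch cannot do: a brand-new user has no latent vector at all, which is precisely the cold-start problem listed above and a reason to combine collaborative and content filtering.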
- Introduction to Bayesian statistics. Bayesian vs. Frequentist approach.
- Multi-armed Bandits. Thompson Sampling. LinUCB.
- Markov Decision Processes.
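Thompson Sampling for Bernoulli bandits fits in a short sketch: keep a Beta posterior per arm, sample a plausible success rate from each, and play the arm with the highest sample. The three arm probabilities below are invented (think of three candidate headlines with unknown click-through rates).

```python
import random

def thompson_run(true_rates, rounds=2000, seed=1):
    """Bernoulli Thompson Sampling with Beta(1, 1) priors on each arm."""
    rng = random.Random(seed)
    n = len(true_rates)
    wins, losses, pulls = [0] * n, [0] * n, [0] * n
    for _ in range(rounds):
        # Sample a plausible rate for each arm from its Beta posterior,
        # then play the arm whose sampled rate is highest.
        samples = [rng.betavariate(wins[a] + 1, losses[a] + 1)
                   for a in range(n)]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

# Hypothetical arms: three headlines with unknown click-through rates.
pulls = thompson_run([0.05, 0.11, 0.20])
print(pulls)  # the 0.20 arm should receive most of the pulls
```

The exploration here is automatic: uncertain arms produce widely dispersed posterior samples and so still get played occasionally, while clearly inferior arms fade out; LinUCB achieves a similar effect with confidence bounds instead of posterior sampling.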
- When and why? The problem of hype surrounding deep learning.
- Image and sound signal processing.
- Embeddings.
These are references to deepen your understanding of material presented in lecture. The list is by no means exhaustive.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning, Springer, 2013.
Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning, 2nd ed., Springer, 2009.
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
Cameron Davidson-Pilon, Bayesian Methods for Hackers, https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers