Make it run, make it right, make it fast.
- Offering STA 633: Statistical computing and computation in Spring 2015
- Schedule is 2 lectures (75 mins) and 1 lab (75 mins) per week
- This is a 2nd course in statistical computing - pre-req is Colin Rundel's class, STA 523
- Quite fast-paced - recommended books are
- Advanced R - Wickham
- R Packages - Wickham
- Will cover Unix shell, make, git, markdown and programming in R
- We will have a pretest to determine eligibility if students have not taken STA 523
- Basically teach all the computing that we would personally like to see in a PhD student or postdoc working with us
- Comfortable using both high-level (Python/Julia?/R) and low-level (C/C++) languages
- Understand data management and use of relational databases
- Hands-on exercise building a normalized database from a spreadsheet and querying it via SQL
- Working with "bad" data
- Examples?
- Can build reproducible data analysis pipelines (testing + make + literate programming)
- Can convert a statistical model (e.g. from manuscript or textbook) into a numerical algorithm
- Understanding of basic algorithms for optimization, simulation and smoothing
- Building blocks for large classes of statistical algorithms
- What algorithms should students know?
- Pragmatic usage of libraries for established numerical routines
- Recommendations for C/C++ libraries
- Can write code that is correct
- How much and what kind of testing is appropriate?
- How to test code with stochastic elements
- Can write code that runs fast
- Trade-off between computation and programmer time (premature optimization)
- Some understanding of complexity trade-offs for algorithms and data structures
- Benchmarking and profiling
- JIT compilation
- Writing native code
- Exploiting multiple cores (threading, multiprocessing, OpenMP)
- Exploiting multiple machines (MPI)
- Exploiting GPUs (CUDA, maybe OpenCL)
- Working with really big data (MapReduce)
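On the question above of how to test code with stochastic elements, one common approach is to fix the random seed so results are exactly reproducible, and to compare estimates with a known answer using a tolerance based on the Monte Carlo standard error. The sketch below is only an illustration: the function `mc_pi`, the seeds and the 5-standard-error tolerance are assumptions, not part of the course plan.

```python
import math
import random


def mc_pi(n, seed):
    """Estimate pi by sampling n points uniformly in the unit square."""
    rng = random.Random(seed)  # local generator, so tests do not share state
    hits = sum(1 for _ in range(n) if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / n


def test_reproducible():
    # The same seed must give exactly the same estimate.
    assert mc_pi(10_000, seed=42) == mc_pi(10_000, seed=42)


def test_within_tolerance():
    n = 100_000
    # Standard error of the estimator 4 * Binomial(n, pi/4) / n.
    p = math.pi / 4
    se = 4.0 * math.sqrt(p * (1 - p) / n)
    # 5 standard errors: a false failure is very unlikely, a real bug is not.
    assert abs(mc_pi(n, seed=1) - math.pi) < 5 * se


test_reproducible()
test_within_tolerance()
```

The first test checks determinism given a seed; the second checks statistical correctness, which is usually the harder property to pin down.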
- Unit 1: Reproducible analysis and introducing Python as a glue language (10%)
- Unit 2: Working with data - data munging and relational databases (10%)
- Unit 3: Exploratory data analysis and visualization (10%)
- Unit 4: Core statistical algorithms and libraries (40%)
- Unit 5: C bootcamp, code profiling and writing native code (15%)
- Unit 6: Parallel computing and working with big data (15%)
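As an illustration of the kind of exercise Unit 2 could include, the sketch below normalizes a small spreadsheet-like table into two relations and queries it with SQL through Python's built-in `sqlite3` module. The rows, table names and column names are all made up for illustration.

```python
import sqlite3

# Hypothetical spreadsheet rows: (patient, clinic, clinic_city, visit_date, weight_kg).
# Clinic details are repeated on every row, which is what normalization removes.
rows = [
    ("ann", "North", "Durham", "2015-01-10", 61.5),
    ("ann", "North", "Durham", "2015-02-10", 60.9),
    ("bob", "South", "Raleigh", "2015-01-12", 82.0),
    ("cat", "North", "Durham", "2015-01-15", 55.2),
]

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE clinic (
        clinic_id INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL,
        city TEXT NOT NULL
    );
    CREATE TABLE visit (
        visit_id INTEGER PRIMARY KEY,
        patient TEXT NOT NULL,
        clinic_id INTEGER NOT NULL REFERENCES clinic(clinic_id),
        visit_date TEXT NOT NULL,
        weight_kg REAL
    );
""")

for patient, clinic, city, date, weight in rows:
    # Store each clinic once; later rows reuse the existing record.
    con.execute("INSERT OR IGNORE INTO clinic (name, city) VALUES (?, ?)", (clinic, city))
    (clinic_id,) = con.execute(
        "SELECT clinic_id FROM clinic WHERE name = ?", (clinic,)
    ).fetchone()
    con.execute(
        "INSERT INTO visit (patient, clinic_id, visit_date, weight_kg) VALUES (?, ?, ?, ?)",
        (patient, clinic_id, date, weight),
    )

# Join the two tables back together to summarize visits per city.
summary = con.execute("""
    SELECT c.city, COUNT(*) AS n_visits, AVG(v.weight_kg) AS mean_weight
    FROM visit AS v JOIN clinic AS c USING (clinic_id)
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
print(summary)
```

Splitting clinics into their own table means a clinic's city is stored once, so a correction touches one row rather than every visit.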
- Overall course objectives?
- Overall course content?
- Are there useful classes of topics we have left out?
- Within each topic, what content should students learn?
- How can programming be taught effectively?
- Every good programmer I know is self-taught ...
- MCQs for rapid sanity check on level of understanding each week
- Less talking, more doing - mini-project after each unit
- Individual or group work?
- What statistical algorithms should students know?
- Know the theory and how to use a good implementation
- Teach understanding with toy example
- Use library to solve more realistic problem
- Examples
- Linear algebra - e.g. projection, normal equations
- Optimization - e.g. Newton, IRLS, multivariate gradient descent, EM
- Simulation - e.g. resampling methods, Monte Carlo, MCMC
- Others? Smoothing, interpolation etc.
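The "toy example first, library second" idea might look like the following for Newton's method: a hand-written Newton-Raphson fit of a one-parameter logistic regression (no intercept). The data, function names and stopping rule are illustrative assumptions; for a more realistic problem students would call an established library routine instead.

```python
import math

# Made-up toy data: one predictor, binary response, not perfectly separable.
x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(beta):
    """First derivative of the log-likelihood with respect to beta."""
    return sum(xi * (yi - sigmoid(beta * xi)) for xi, yi in zip(x, y))


def information(beta):
    """Negative second derivative of the log-likelihood."""
    return sum(xi**2 * sigmoid(beta * xi) * (1.0 - sigmoid(beta * xi)) for xi in x)


def newton(beta=0.0, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        step = score(beta) / information(beta)
        beta += step  # Newton update: beta - l'(beta) / l''(beta)
        if abs(step) < tol:
            return beta
    raise RuntimeError("Newton's method did not converge")


beta_hat = newton()
print(beta_hat, score(beta_hat))
```

At the fitted value the score should be numerically zero, which gives students a simple check that the iteration found a stationary point of the log-likelihood.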
- What are good data sets and problems to use for teaching?
- Bad data
- Big data
- Slow and fast versions
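For the "slow and fast versions" idea, a teaching problem can pair a deliberately naive implementation with an improved one, check that they agree, and then time both. The sketch below is a made-up example (not taken from the course) that computes a running mean two ways and compares them with the standard-library `timeit` module.

```python
import timeit


def running_mean_slow(xs):
    """O(n^2): recompute the sum of the prefix at every position."""
    return [sum(xs[: i + 1]) / (i + 1) for i in range(len(xs))]


def running_mean_fast(xs):
    """O(n): carry a running total forward."""
    out, total = [], 0.0
    for i, x in enumerate(xs, start=1):
        total += x
        out.append(total / i)
    return out


data = [float(i % 7) for i in range(2000)]

# Check correctness before speed: both versions must agree.
slow, fast = running_mean_slow(data), running_mean_fast(data)
assert all(abs(a - b) < 1e-9 for a, b in zip(slow, fast))

t_slow = timeit.timeit(lambda: running_mean_slow(data), number=5)
t_fast = timeit.timeit(lambda: running_mean_fast(data), number=5)
print(f"slow: {t_slow:.4f}s  fast: {t_fast:.4f}s")
```

The order of the steps mirrors the course motto: the slow version makes it run, the agreement check makes it right, and only then is the fast version timed.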