Syllabus for STA 663

Instructor: Cliburn Chan cliburn.chan@duke.edu
Instructor: Janice McCarthy janice.mccarthy@duke.edu
TA: Matt Johnson mcj15@stat.duke.edu

Overview


These pages are no longer maintained. Please use the current version.

The goal of STA 663 is to learn statistical programming - how to write code to solve statistical problems. In general, statistical problems have to do with the estimation of some characteristic derived from data - this can be a point estimate, an interval, or an entire function. Almost always, solving such statistical problems involves writing code to collect, organize, explore, analyze and present the data. For obvious reasons, we would like to write good code that is readable, correct and efficient, preferably without reinventing the wheel.
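As a rough illustration of the three kinds of estimate mentioned above, the sketch below computes a point estimate (the sample mean), an interval (a percentile bootstrap interval) and an entire function (the empirical CDF) from synthetic data. The helper names `bootstrap_interval` and `ecdf`, the simulated sample, and the choice of the percentile bootstrap are all assumptions made for this example and are not taken from the course materials.

```python
import bisect
import random
import statistics


def bootstrap_interval(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(data) (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def ecdf(data):
    """Return the empirical CDF of data as a callable (hypothetical helper)."""
    xs = sorted(data)
    n = len(xs)

    def F(x):
        # Fraction of observations less than or equal to x.
        return bisect.bisect_right(xs, x) / n

    return F


# Synthetic data: 200 draws from a normal distribution (an assumption).
rng = random.Random(663)
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]

point = statistics.mean(sample)      # a point estimate
lo, hi = bootstrap_interval(sample)  # an interval
F = ecdf(sample)                     # an entire function

print("point estimate:", point)
print("95% bootstrap interval:", (lo, hi))
print("ECDF at the point estimate:", F(point))
```

Only the standard library is used here to keep the sketch self-contained. In practice, array libraries would likely be preferred for speed.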

This course will cover general ideas relevant for high-performance code (data structures, algorithms, code optimization including parallelization) as well as specific numerical methods important for data analysis (computer numbers, matrix decompositions, linear and nonlinear regression, numerical optimization, function estimation, Monte Carlo methods). We will mostly assume that you are comfortable with basic programming concepts (functions, classes, loops), have good habits (literate programming, testing, version control) and have a decent mathematical and statistical background (linear algebra, calculus, probability).
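As a small taste of the "computer numbers" topic listed above, the following sketch shows why floating-point results are usually compared with a tolerance rather than with exact equality, and how rounding error accumulates in a running sum. The variable names are illustrative only, and the example is not drawn from the course notes.

```python
import math

# 0.1, 0.2 and 0.3 have no exact binary representation,
# so the sum differs from 0.3 in the last bits.
a = 0.1 + 0.2
naive_equal = (a == 0.3)
close_equal = math.isclose(a, 0.3)

# Accumulating left to right lets rounding errors build up.
tenths = [0.1] * 10
running = 0.0
for x in tenths:
    running += x

# math.fsum tracks the lost low-order bits and returns
# the correctly rounded sum.
careful_total = math.fsum(tenths)

print(naive_equal, close_equal)
print(running == 1.0, careful_total == 1.0)
```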

To solve statistical problems, you will typically need to (1) have the basic skills to collect, organize, explore and present the data, (2) apply specific numerical methods to analyze the data and (3) optimize the code to make it run acceptably fast (increasingly important in this era of “big data”). STA 663 is organized in 3 parts to reflect these stages of statistical programming - basics (20%), numerical methods (60%) and high performance computing (20%).

Learning objectives

The course will focus on the development of various algorithms for optimization and simulation, the workhorses of much of computational statistics. The emphasis is on computation for statistics - how to prototype, optimize and develop high performance computing (HPC) algorithms in Python and C/C++. A variety of algorithms and data sets of gradually increasing complexity (1 dimension \(\rightarrow\) many dimensions, serial \(\rightarrow\) parallel \(\rightarrow\) massively parallel, small data \(\rightarrow\) big data) will allow students to develop and practise the following skills:

Pre-requisites

Review the following if you are not familiar with them

The course will cover the basics of Python at an extremely rapid pace. Unless you are an experienced programmer, you should probably review basic Python programming skills from the Think Python book. This is also useful as a reference when doing assignments.

Another very useful reference is the official Python tutorial.

Grading

Computing Platform

Each student will be provided with access to a virtual machine image running Ubuntu - URLs for individual students will be provided on the first day. For GPU computing and map-reduce examples, we will be using the Amazon Web Services (AWS) cloud platform. Again, details on how to access these resources will be provided when appropriate.

All code developed for the course should be in a personal Github repository called sta-663-firstname-lastname. Make the instructors and TA collaborators so that we have full access to your code. We trust that you can figure out how to do this on your own.

Lecture 1

Computer lab 1

See Lab01/Exercises01.ipynb in the course Github repository.

Lecture 2

Lecture 3

Computer lab 2

See GitLab/GitExercises.ipynb in the course Github repository.

Lecture 4

Computer lab 3

See Lab02/Exercises02.ipynb in the course Github repository.

Lecture 5

Lecture 6

Computer lab 4

See Lab03/Exercises03.ipynb in the course Github repository.

Lecture 7

Lecture 8

Computer lab 5

See Lab04/Exercises04.ipynb in the course Github repository.

Lecture 9

Lecture 10

Computer lab 6

See Lab05/Exercises05.ipynb in the course Github repository.

Lecture 11

Lecture 12

Computer lab 7

See Lab06/Exercises06-Revised.ipynb in the course Github repository.

Lecture 13 - Mid-term exams

Lecture 14

Computer lab 8

See Lab07/Exercises07.ipynb in the course Github repository.

Lecture 15

Lecture 16

Computer lab 9

See Lab08/Exercises08.ipynb in the course Github repository.

Lecture 17

Lecture 18

Final Project

See the Projects folder in the course Github repository.

Lecture 19

Lecture 20

Lecture 21

Lecture 22

Lecture 23

Supplementary Material

SM 1

SM 2