Computational Statistics in Python
In statistics, we apply probability theory to real-world data in order to make informed guesses. For any practical analysis, the use of computers is essential. This course (book) is designed for graduate research students who need to analyze complex data sets and/or implement efficient statistical algorithms from the literature. In particular, we focus on the following computational skills necessary for modern data analysis:
- Exploratory data analysis: Real-world data may come in many forms, some of which are not directly amenable to statistical modeling (e.g. binary data, images, raw text). After obtaining the data, we need to preprocess it into a form that is convenient for analysis - sometimes numerical arrays suffice, but usually we will want more structured data frames. Once the data is in a tractable format, an essential first step is simply to play with it to get a feel for its properties (e.g. types, missing information, duplication, outliers, clumps in the distribution). To do so, we peek at samples of the data, generate summaries, and eyeball the data with visualizations (a minimal sketch of this workflow follows this list).
- Statistical model fitting: Once we have a feel for how the data is distributed, we can construct probabilistic models to capture the aspects we are interested in. Unlike deterministic mathematical models, statistical models must also characterize the uncertainty in the data, and are thus formulated as a probability distribution or a family of probability distributions. Our models (probability distributions) typically have tunable parameters for location, scale, or other more complex properties. We want to find parameter values such that the model describes the data adequately - this is known as model fitting, and the two main algorithmic approaches for doing so are optimization and simulation (see the model-fitting sketch after this list).
- Model scaling: With real-world data being generated at an ever-increasing clip, we also need to be concerned with computational performance, so that we can complete our calculations in a reasonable time or handle data that is too large to fit into memory. To do so, we need to learn how to evaluate the performance of different data structures and algorithms, use language-specific idioms for efficient data processing, compile to native code, and exploit resources for parallel or distributed computing (see the vectorization sketch after this list).
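As a minimal sketch of the exploratory workflow described in the first item (assuming `pandas` is available and a hypothetical file `data.csv`):

```python
import pandas as pd

# Load the raw data into a structured data frame
df = pd.read_csv("data.csv")  # hypothetical file name

# Peek at a sample of the data and its types
print(df.head())
print(df.dtypes)

# Summarize: missing information, duplication, and basic distributional summaries
print(df.isnull().sum())
print(df.duplicated().sum())
print(df.describe())

# Eyeball the distributions for outliers and clumps (requires matplotlib)
df.hist(figsize=(8, 6))
```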
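For the second item, a minimal sketch of the optimization approach to model fitting: the location and scale of a normal model are chosen to minimize the negative log-likelihood with `scipy.optimize`. The data are simulated here purely for illustration:

```python
import numpy as np
from scipy import optimize, stats

# Simulated data; in practice this would be the preprocessed data set
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Negative log-likelihood of a normal model, parameterized by (mu, log_sigma)
# so that the scale stays positive without explicit constraints
def neg_log_lik(params):
    mu, log_sigma = params
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

# Model fitting by optimization
res = optimize.minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # should be close to 5.0 and 2.0
```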
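And for the third item, a small illustration of one language-specific idiom, vectorization: the same reduction written as a pure Python loop and as a `numpy` expression whose loop runs in compiled code. The exact speedup will vary by machine:

```python
import numpy as np
from timeit import timeit

xs = np.random.random(1_000_000)

# Pure Python loop: every element passes through the interpreter
def sum_of_squares_loop(xs):
    total = 0.0
    for x in xs:
        total += x * x
    return total

# Vectorized version: the loop runs inside numpy's native code
def sum_of_squares_vec(xs):
    return float(np.sum(xs * xs))

print(timeit(lambda: sum_of_squares_loop(xs), number=1))  # slow
print(timeit(lambda: sum_of_squares_vec(xs), number=1))   # typically orders of magnitude faster
```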
Prerequisites
The ideal student has prior programming experience (not necessarily in Python), is aware of basic data structures and algorithms, has taken courses in linear algebra and multivariable calculus, and is familiar with probability theory and statistical modeling. The typical participant in this course is a graduate student in statistics or biostatistics, but motivated students from engineering, environmental science, and the social sciences have also successfully completed it. Most of our students know R; Python is used in this course to provide exposure to a second general-purpose language widely used in industry and in the computational and data science communities.
Setup
Exploratory Data Analysis
- Getting Started with Python
- Live Demo of Jupyter Features
- Using Markdown in Jupyter for Literate Programming
- Elements of Python
- Indexing a container
- Conversion between types
- Generator objects
- Controlling program flow
- Using Libraries
- Working with vectors and arrays
- Input and output
- Getting comfortable with error messages
- Exercises
- Version information
- Functions
- Strings
- Get “Through the Looking Glass”
- Slice to get Jabberwocky
- Find palindromic words in the poem, if any
- Top 10 most frequent words
- Words that appear exactly twice
- Trigrams
- Find words in the poem that are over-represented
- Encode and decode poem using a Caesar cipher
- Using Regular Expressions
- Natural language processing
- String formatting
- I/O
- Classes
- Using `numpy`
- Resources
- Array creation
- Array manipulation
- Array indexing
- Calculations and broadcasting
- Combining and splitting arrays
- Reductions
- Example: Calculating a pairwise distance matrix using broadcasting and vectorization (see the sketch at the end of this outline)
- Example: Constructing leave-one-out arrays
- Generalized ufuncs
- Saving and loading NDArrays
- Version information
- Symbolic Algebra with `sympy`
- Combining and reshaping data
- Manipulating and querying data
- Getting Started With Graphics
- Types of Plots
- Customizing Plots
- Data
- SQL
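As a taste of the broadcasting example listed above, a minimal sketch (not the course's own code) of computing a pairwise Euclidean distance matrix without explicit loops:

```python
import numpy as np

pts = np.random.random((5, 3))  # 5 points in 3 dimensions

# Insert axes so that subtraction broadcasts to a (5, 5, 3) array of
# pairwise differences, then reduce over the coordinate axis
diffs = pts[:, None, :] - pts[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))

print(dists.shape)                  # (5, 5)
print(np.allclose(dists, dists.T))  # True: a distance matrix is symmetric
```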
Statistical Modeling
- Linear Algebra and Linear Systems
- Review of solving simultaneous linear equations (HS)
- Matrix Decompositions
- Linear Algebra Examples
- Applications of Linear Algebra: PCA
- Optimization and Root Finding
- Algorithms for Optimization and Root Finding for Multivariate Problems
- Expectation Maximization
- Using optimization routines from `scipy` and `statsmodels`
- Machine Learning with `sklearn`
- Random Numbers
- Resampling and Monte Carlo Simulations
- Numerical Evaluation of Integrals
- Metropolis and Gibbs Sampling (a minimal Metropolis sketch follows this outline)
- Using Auxiliary Variables in MCMC proposals
- Using PyMC3
- PyStan
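As a preview of the sampling material above, a minimal random-walk Metropolis sketch (not the course's own code) targeting a standard normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized log-density of the standard normal target
def log_target(x):
    return -0.5 * x ** 2

# Random-walk Metropolis: propose a Gaussian step and accept it
# with probability min(1, target ratio)
x, samples = 0.0, []
for _ in range(10_000):
    proposal = x + rng.normal(scale=0.5)
    if np.log(rng.random()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

print(np.mean(samples), np.std(samples))  # close to 0 and 1
```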
Model Scaling
- C Crash Course
- Compiled vs Interpreted Languages
- What you should know about C
- What you should know about C++
- C++
- Code Optimization
- Foreign Function Interface
- Just-in-time compilation (JIT)
- Cython
- Making Python faster
- Interfacing with compiled languages
- From Python → Compiled languages
- From Compiled languages → Python
- Bake-off
- Summary
- Using `pybind11`
- Parallel Programming
- Multi-Core Parallelism
- Vanilla Python
- Using `numba` to speed up computation
- Using `cython` to speed up computation
- The `concurrent.futures` module
- Using processes in parallel with `ProcessPoolExecutor`
- Using threads in parallel with `ThreadPoolExecutor`
- Turning off the GIL in `cython`
- Using threads in parallel with `ThreadPoolExecutor` and `nogil`
- Using `multiprocessing`
- Simplifying use of `multiprocessing` with the `deco` package
- Common issues with use of shared memory in parallel programs
- Using `ipyparallel`
- Biggish Data
- Efficient storage of data in memory
- Introduction to Spark (see the word-count sketch at the end of this outline)
- Using Spark Efficiently
- Spark SQL
- Spark MLLib
- Spark on Cloud
- Azure
- AWS
- Know your AWS public and private access keys
- Know your AWS EC2 key-pair
- Install AWS command line client
- Configure the AWS command line client
- Create a cluster
- Get information about the cluster
- Connect to the cluster via `ssh`
- Note the IP address that is returned
- Run `pyspark`
- Run the `Zeppelin` notebook
- Connect to the `Zeppelin` notebook
- Create a notebook and run Spark within it
- Terminate the cluster
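As a preview of the Spark material above, a minimal word-count sketch (assuming `pyspark` is installed and a hypothetical text file `poem.txt`; it runs locally, no cluster required):

```python
from pyspark.sql import SparkSession

# Start (or attach to) a Spark session; runs locally if no cluster is configured
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

# Read the text file line by line; the DataFrame has a single "value" column
lines = spark.read.text("poem.txt")  # hypothetical path

# Classic word count expressed as RDD transformations
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

print(counts.takeOrdered(10, key=lambda wc: -wc[1]))  # 10 most frequent words
spark.stop()
```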