Computational Statistics in Python
In statistics, we apply probability theory to real-world data in order to make informed guesses. For any practical analysis, the use of computers is essential. This course (book) is designed for graduate research students who need to analyze complex data sets and/or implement efficient statistical algorithms from the literature. In particular, we focus on the following computational skills necessary for modern data analysis:
- Exploratory data analysis: Real-world data may come in many forms, some of which are not directly amenable to statistical modeling (e.g. binary data, images, raw text). After obtaining the data, we need to preprocess it into a form that is convenient for analysis - sometimes numerical arrays suffice, but usually we will want more structured data frames. Once the data is in a tractable format, an essential first step is simply to play with it to get a feel for its properties (e.g. types, missing information, duplication, outliers, clumps in the distribution). To do so, we peek at samples of the data, generate summaries, and eyeball the data with visualizations (a minimal sketch of this workflow follows this list).
- Statistical model fitting: Once we have a feel for how the data is distributed, we can construct probabilistic models to capture the aspects we are interested in. Unlike deterministic mathematical models, statistical models must also characterize the uncertainty in the data, and are thus formulated as a probability distribution or a family of probability distributions. Our models (probability distributions) typically have tunable parameters for location, scale, or other more complex properties. We want to find parameter values such that the model describes the data adequately - this is known as model fitting, and the two main algorithmic approaches for doing so are optimization and simulation (see the model-fitting sketch after this list).
- Model scaling: With real-world data being generated at an ever-increasing clip, we also need to be concerned with computational performance, so that we can complete our calculations in a reasonable time or handle data that is too large to fit into memory. To do so, we need to learn how to evaluate the performance of different data structures and algorithms, use language-specific idioms for efficient data processing, compile to native code, and exploit resources for parallel or distributed computing (see the vectorization sketch after this list).
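As a minimal sketch of the exploratory workflow described in the first item (assuming `pandas` is available and a hypothetical file `data.csv`):

```python
import pandas as pd

# Load the raw data into a structured data frame
df = pd.read_csv("data.csv")  # hypothetical file name

# Peek at a sample of the data and its types
print(df.head())
print(df.dtypes)

# Summarize: missing information, duplication, and basic distributional summaries
print(df.isnull().sum())
print(df.duplicated().sum())
print(df.describe())

# Eyeball the distributions for outliers and clumps (requires matplotlib)
df.hist(figsize=(8, 6))
```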
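For the second item, a minimal sketch of the optimization approach to model fitting: the location and scale of a normal model are chosen to minimize the negative log-likelihood with `scipy.optimize`. The data are simulated here purely for illustration:

```python
import numpy as np
from scipy import optimize, stats

# Simulated data; in practice this would be the preprocessed data set
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Negative log-likelihood of a normal model, parameterized by (mu, log_sigma)
# so that the scale stays positive without explicit constraints
def neg_log_lik(params):
    mu, log_sigma = params
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

# Model fitting by optimization
res = optimize.minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # should be close to 5.0 and 2.0
```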
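And for the third item, a small illustration of one language-specific idiom, vectorization: the same reduction written as a pure Python loop and as a `numpy` expression whose loop runs in compiled code. The exact speedup will vary by machine:

```python
import numpy as np
from timeit import timeit

xs = np.random.random(1_000_000)

# Pure Python loop: every element passes through the interpreter
def sum_of_squares_loop(xs):
    total = 0.0
    for x in xs:
        total += x * x
    return total

# Vectorized version: the loop runs inside numpy's native code
def sum_of_squares_vec(xs):
    return float(np.sum(xs * xs))

print(timeit(lambda: sum_of_squares_loop(xs), number=1))  # slow
print(timeit(lambda: sum_of_squares_vec(xs), number=1))   # typically orders of magnitude faster
```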
Prerequisites
The ideal student has prior programming experience (not necessarily in Python), is aware of basic data structures and algorithms, has taken courses in linear algebra and multivariable calculus, and is familiar with probability theory and statistical modeling. The typical participant in this course is a graduate student in statistics or biostatistics, but motivated students from engineering, environmental science, and the social sciences have also successfully completed it. Most of our students know R; Python is used in this course to provide exposure to a second general-purpose language widely used in industry and in the computational and data science communities.
Setup
Exploratory Data Analysis
- Getting Started with Python
- Live Demo of Jupyter Features
- Using Markdown in Jupyter for Literate Programming
- Elements of Python
- Indexing a container
- Conversion between types
- Generator objects
- Controlling program flow
- Using Libraries
- Working with vectors and arrays
- Input and output
- Getting comfortable with error messages
- Exercises
- Version information
- Functions
- Strings
- Get “Through the Looking Glass”
- Slice to get Jabberwocky
- Find palindromic words in the poem, if any
- Top 10 most frequent words
- Words that appear exactly twice
- Trigrams
- Find words in the poem that are over-represented
- Encode and decode poem using a Caesar cipher
- Using Regular Expressions
- Natural language processing
- String formatting
- I/O
- Classes
- Using `numpy`
- Resources
- Array creation
- Array manipulation
- Array indexing
- Calculations and broadcasting
- Combining and splitting arrays
- Reductions
- Example: Calculating a pairwise distance matrix using broadcasting and vectorization (see the sketch at the end of this outline)
- Example: Constructing leave-one-out arrays
- Generalized ufuncs
- Saving and loading NDArrays
- Version information
- Symbolic Algebra with `sympy`
- Combining and reshaping data
- Manipulating and querying data
- Getting Started With Graphics
- Types of Plots
- Customizing Plots
- Data
- SQL
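As a taste of the broadcasting example listed above, a minimal sketch (not the course's own code) of computing a pairwise Euclidean distance matrix without explicit loops:

```python
import numpy as np

pts = np.random.random((5, 3))  # 5 points in 3 dimensions

# Insert axes so that subtraction broadcasts to a (5, 5, 3) array of
# pairwise differences, then reduce over the coordinate axis
diffs = pts[:, None, :] - pts[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))

print(dists.shape)                  # (5, 5)
print(np.allclose(dists, dists.T))  # True: a distance matrix is symmetric
```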
Statistical Modeling
- Linear Algebra and Linear Systems
- Review of solving simultaneous linear equations (HS)
- Matrix Decompositions
- Linear Algebra Examples
- Applications of Linear Algebra: PCA
- Optimization and Root Finding
- Algorithms for Optimization and Root Finding for Multivariate Problems
- Expectation Maximization
- Using optimization routines from `scipy` and `statsmodels`
- Machine Learning with `sklearn`
- Random Numbers
- Resampling and Monte Carlo Simulations
- Numerical Evaluation of Integrals
- Metropolis and Gibbs Sampling (a minimal Metropolis sketch follows this outline)
- Using Auxiliary Variables in MCMC proposals
- Using PyMC3
- PyStan
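As a preview of the sampling material above, a minimal random-walk Metropolis sketch (not the course's own code) targeting a standard normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized log-density of the standard normal target
def log_target(x):
    return -0.5 * x ** 2

# Random-walk Metropolis: propose a Gaussian step and accept it
# with probability min(1, target ratio)
x, samples = 0.0, []
for _ in range(10_000):
    proposal = x + rng.normal(scale=0.5)
    if np.log(rng.random()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

print(np.mean(samples), np.std(samples))  # close to 0 and 1
```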
Model Scaling
- C Crash Course
- Compiled vs Interpreted Languages
- What you should know about C
- What you should know about C++
- C++
- Code Optimization
- Foreign Function Interface
- Just-in-time compilation (JIT)
- Cython
- Making Python faster
- Interfacing with compiled languages
- From Python → Compiled languages
- From Compiled languages → Python
- Bake-off
- Summary
- Using `pybind11`
- Parallel Programming
- Multi-Core Parallelism
- Vanilla Python
- Using `numba` to speed up computation
- Using `cython` to speed up computation
- The `concurrent.futures` module
- Using processes in parallel with `ProcessPoolExecutor`
- Using threads in parallel with `ThreadPoolExecutor`
- Turning off the GIL in `cython`
- Using threads in parallel with `ThreadPoolExecutor` and `nogil`
- Using `multiprocessing`
- Simplifying use of `multiprocessing` with the `deco` package
- Common issues with use of shared memory in parallel programs
- Using `ipyparallel`
- Biggish Data
- Efficient storage of data in memory
- Introduction to Spark (see the word-count sketch at the end of this outline)
- Using Spark Efficiently
- Spark SQL
- Spark MLLib
- Spark on Cloud
- Azure
- AWS
- Know your AWS public and private access keys
- Know your AWS EC2 key-pair
- Install AWS command line client
- Configure the AWS command line client
- Create a cluster
- Get information about the cluster
- Connect to the cluster via `ssh`
- Note the IP address that is returned
- Run `pyspark`
- Run the `Zeppelin` notebook
- Connect to the `Zeppelin` notebook
- Create a notebook and run Spark within it
- Terminate the cluster
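As a preview of the Spark material above, a minimal word-count sketch (assuming `pyspark` is installed and a hypothetical text file `poem.txt`; it runs locally, no cluster required):

```python
from pyspark.sql import SparkSession

# Start (or attach to) a Spark session; runs locally if no cluster is configured
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

# Read the text file line by line; the DataFrame has a single "value" column
lines = spark.read.text("poem.txt")  # hypothetical path

# Classic word count expressed as RDD transformations
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

print(counts.takeOrdered(10, key=lambda wc: -wc[1]))  # 10 most frequent words
spark.stop()
```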