Computational Statistics in Python
0.1
Introduction to Python
Variables
Operators
Iterators
Conditional Statements
Functions
Strings and String Handling
Lists, Tuples, Dictionaries
Lists
Dictionaries
Classes
Modules
The standard library
Keeping the Anaconda distribution up-to-date
Exercises
Getting started with Python and the IPython notebook
Cells
Markdown
Code Cells
We will redo these examples in Python
Functions are first class objects
Function arguments
Call by “object reference”
Binding of default arguments occurs at function definition
Higher-order functions
Anonymous functions
Pure functions
Recursion
Iterators
Generators
Generators and comprehensions
Utilities - enumerate, zip and the ternary if-else operator
Decorators
The operator module
The functools module
The itertools module
The toolz, fn and funcy modules
Exercises
Data science is OSEMN
Obtaining data
Remote data
Plain text files
Delimited files
JSON files
Web scraping
HDF5
Relational databases
Scrubbing data
Exercises
Working with text
String methods
Splitting and joining strings
The string module
Regular expressions
The NLTK toolkit
Exercises
Preprocessing textual data
Example: Counting words in a document
Using SQLite3
Working example dataset
Creating and populating a table
SQL queries
Working with multiple tables in SQL
Basic concepts of database normalization
Using HDF5
Interfacing with Pandas
References
Example
Numerical computing in Python
Exercises
Using Pandas
Series
DataFrame
Panels
Split-Apply-Combine
Using statsmodels
Using R from IPython
Using Rmagic
Using R from pandas
Computational problems in statistics
Textbook example - is coin fair?
Using binomial test
Using z-test approximation with continuity correction
Using simulation to estimate null distribution
Maximum likelihood estimate of pcoin
Using bootstrap to estimate confidence intervals for pcoin
Bayesian approach
Comment
Computer numbers and mathematics
Some examples of numbers behaving badly
Finite representation of numbers
Using arbitrary precision libraries
From numbers to Functions: Stability and conditioning
Exercises
Algorithmic complexity
Profiling and benchmarking
Measuring algorithmic complexity
Comparing complexity of \(\mathcal{O}(n^2)\) (e.g. bubble sort) and \(\mathcal{O}(n \log n)\) (e.g. merge sort)
Ranking of common Big O complexity classes
Complexity of common operations on Python data structures
Space complexity
How much space do I need?
Linear Algebra and Linear Systems
Simultaneous Equations
Exercises
Large Linear Systems
Example: Netflix Competition (circa 2006-2009)
Matrix Decompositions
LU Decomposition and Gaussian Elimination
Cholesky Decomposition
Matrix Decompositions for PCA and Least Squares
Eigendecomposition
QR decomposition
Singular Value Decomposition
Stability and Condition Number
Exercises
Variance and covariance
Eigendecomposition of the covariance matrix
PCA
Data matrices that have zero mean for all feature vectors
Change of basis via PCA
We can transform the original data set so that the eigenvectors are the basis vectors and find the new coordinates of the data points with respect to this new basis
Linear algebra review for change of basis
Graphical illustration of change of basis
Dimension reduction via PCA
Using Singular Value Decomposition (SVD) for PCA
Exercises
Optimization and Non-linear Methods
Example: Maximum Likelihood Estimation (MLE)
Bisection Method
Secant Method
Newton-Raphson Method
Gauss-Newton
Inverse Quadratic Interpolation
Brent’s Method
Finding roots
Univariate roots and fixed points
Multivariate roots and fixed points
Optimization Primer
Is the function convex?
Are there any constraints that the solution must meet?
Using scipy.optimize
Local and global minima
Gradient descent
Newton’s method and variants
Constrained optimization
Some applications of optimization
Finding parameters for ODE models
Optimization of standard statistical models
Logistic regression as optimization
Solving as a GLM with IRLS
Solving as logistic model with bfgs
Home-brew logistic regression using a generic minimization function
Resources
Algorithms for Optimization and Root Finding for Multivariate Problems
Optimizers
Solvers
GLM Estimation and IRLS
Outline
Jensen’s inequality
Maximum likelihood with complete information
Coin toss example from What is the expectation maximization algorithm?
Exact solution
Numerical estimate
Incomplete information
Gaussian mixture models
Using EM
Vectorized version
Vectorization with Einstein summation notation
Comparison of EM routines
Topics
Pseudorandom number generators (PRNG)
Generating standard uniform random numbers
Using a random number generator
Monte Carlo integration
Example
Monitoring variance
Monte Carlo swindles (Variance reduction techniques)
Variance reduction by change of variables
Importance sampling
Quasi-random numbers
Outline
Resampling
Simulations
Setting the random seed
Resampling
Sampling with and without replacement
Bootstrap
Leave some-out resampling
Calculation of Cook’s distance
Permutation resampling
Monte Carlo Simulations
Design of simulation experiments
Example: Simulations to estimate power
Check with R
Characterizing Monte Carlo samples
Estimating the CDF
Estimating the PDF
Kernel density estimation
Multivariate kernel density estimation
Outline of topics for MCMC
Bayesian Data Analysis
Motivating example
Markov Chain Monte Carlo (MCMC)
Intuition
The Gibbs sampler
Motivating example
Slice sampling
Hierarchical models
LaTeX for Markov chain diagram
PyMC2
Examples
Coin toss
Estimating mean and standard deviation of normal distribution
Estimating parameters of a linear regression model
Estimating parameters of a logistic model
Using a hierarchical model
PyMC3
Examples
Coin toss
Estimating mean and standard deviation of normal distribution
Estimating parameters of a linear regression model
Alternative formulation using GLM formulas
Simple Logistic model
There is no convergence!
Estimating parameters of a logistic model
Using a hierarchical model
PyStan
Useful links
Examples
Coin toss
Estimating mean and standard deviation of normal distribution
Estimating parameters of a linear regression model
Simple Logistic model
Estimating parameters of a logistic model
Using a hierarchical model
C Crash Course
Hello world
A tutorial example - coding a Fibonacci function in C
C Basics
Types in C
Operators
Control of program flow
Arrays and pointers
Functions
Function pointers
Using make to compile C programs
Exercise
Debugging programs (understanding compiler warnings and errors)
Why not C?
Learning Obfuscated C
Code Optimization Overview
Profiling
Using the timeit module
Using cProfile
Using the line profiler
Using the memory profiler
Using better algorithms and data structures
I/O Bound problems
Problem set for optimization
Matrix Multiplication
Pairwise distance matrix
Word count
The Fibonacci Sequence
Using clang and bitey
Using gcc and ctypes
Using Cython
Benchmark
From compiled code to Python
Calling a C function
Using bitey and clang
Using Cython
Wrapping a C++ function
Calling a Fortran function
Benchmarking
Wrapping a function from a C library for use in Python
Wrapping functions from a C++ library for use in Python
Defining a function in Julia
Using it in Python
Using Python libraries in Julia
Benchmarking
Fibonacci
Matrix multiplication
Pairwise distance matrix
Python
Profiling code
Numba
Cython
Comparison with optimized C from scipy
Interfacing with compiled languages
Make up some test data for use later
“Pure” Python version
From Python \(\rightarrow\) Compiled languages
Numpy version
Numexpr version
Numba version
NumbaPro version
Parakeet version
Cython version
From Compiled languages \(\rightarrow\) Python
C version
C++ version
Fortran
Bake-off
Summary
Recommendations for optimizing Python code
Analysis of problems for parallelism
Concepts
Embarrassingly parallel programs
Estimating \(\pi\) using Monte Carlo integration
Using parallel execution
Other parallel programming approaches not covered
References
Why GPU Programming?
Programming GPUs
GPU Architecture
CPU versus GPU
Inside a GPU
The streaming multiprocessor
The CUDA Core
Memory Hierarchy
Processing flow
CUDA Kernels
CUDA execution model
CUDA threads
Memory access levels
Generations
Hardware
Device memory types
Thread scheduling model
Programming model
Performance tuning
Getting Started with CUDA
How do we find out the unique global thread identity?
Version 1 of the kernel
Launching the kernel
Version 2 of the kernel
Launching the kernel
Low level cuda.jit requires correct instantiation of the kernel with blockspergrid and threadsperblock
Using vectorize
Pure Python
Numba
CUDA
Matrix multiplication with cublas
Random numbers with curand
FFT and IFFT
Extras
Kernel function (no shared memory)
Using shared memory by using tiling to exploit locality
Kernel function (with shared memory)
Benchmark