STA-663-2017

Contents:

Notes on using Jupyter
- Keyboard Shortcuts
- Magic commands
- Using shell commands
- Load extension for R
- Getting help
- Further exploration
Introduction to Python
- Resources
- Overview
- Hello, world
- Types
- Operators
- Names, assignment and identity
- Naming conventions
- Collections
- Sets
- Dictionary
- Control Structures
- Functions
Functions
- What’s wrong with this code?
- Refactoring to use functions
- Function is re-usable
- Functions can be passed in as arguments
- Functions can also be returned by functions
- Decorators
- Functools
- Where does Python search for modules?
- Creating your own module
Classes
- Defining a new class
- Docstrings
- Creating an instance of a class
- Class inheritance
Strings
- Get “Through the Looking Glass”
- Slice to get Jabberwocky
- Find palindromic words in poem if any
- Top 10 most frequent words
- Words that appear exactly twice
- Trigrams
- Find words in poem that are over-represented
- Encode and decode poem using a Caesar cipher
- Using Regular Expressions
- Natural language processing
- String formatting
Using numpy
- Resources
- Array creation
- Array manipulation
- Array indexing
- Calculations and broadcasting
- Combining and splitting arrays
- Reductions
- Example: Calculating pairwise distance matrix using broadcasting and vectorization
- Example: Constructing leave-one-out arrays
- Generalized ufuncs
- Saving and loading NDArrays
- Version information
Graphics in Python
- Resources
- Matplotlib
- Types of plots
- Colors
- Plot layouts
- Seaborn
- Using ggplot as an alternative to seaborn
Data
- Resources
- Working with Series
- DataFrame
- Data conversions
SQL
- SQL via pandas DataFrames
- Joins
- User defined functions
Machine Learning with sklearn
- Resources
- Example
- Exploratory data analysis
- Preprocessing
- Dimension reduction
- Classification
- Using a Pipeline
Code Optimization
- Testing code
- Timing and profiling code
- Data structures and algorithms
Just-in-time compilation (JIT)
- Using numexpr
- Using numba
- Using numba vectorize and guvectorize
- Parallelization with vectorize and guvectorize
Cython
- Resources
- Using Cython annotations to identify bottlenecks
- Using Cython cdefs and directives
- Parallel execution with Cython
Parallel Programming
- Embarrassingly parallel programs
- Executing parallel code
- Coming highlights
Multi-Core Parallelism
- Vanilla Python
- Using numba to speed up computation
- Using cython to speed up computation
- The concurrent.futures module
- Using processes in parallel with ProcessPoolExecutor
- Using threads in parallel with ThreadPoolExecutor
- Turning off the GIL in cython
- Using threads in parallel with ThreadPoolExecutor and nogil
- Using multiprocessing
- Common issues with use of shared memory in parallel programs
Using ipyparallel
- Starting engines
- Basic concepts of ipyparallel
- Working with compiled code
- Using parallel magic commands
Using C++
- The compilation process
- Arrays, pointers and dereferencing
- Loops in C++
- Functions in C++
- Anonymous functions
- Templates
- Function pointers
- Using a numeric library
Using pybind11
- Resources
- A first example of using pybind11
- Using cppimport
- Vectorizing functions for use with numpy arrays
- Using numpy arrays as function arguments and return values
- More on working with numpy arrays
- Using the C++ eigen library to calculate matrix inverse and determinant
- Using pybind11 with openmp
Linear Algebra Review
Linear Algebra and Linear Systems
- Motivation - Simultaneous Equations
- Vector Spaces
- Matrices and Linear Transformations
- What does all this have to do with linear systems?
- More Properties of Vectors, Vector Spaces and Matrices
- Inner Products
- Exercises
Matrix Decompositions
- LU Decomposition and Gaussian Elimination
- Cholesky Decomposition
- Matrix Decompositions for PCA and Least Squares
- Eigendecomposition
- QR decomposition
- Singular Value Decomposition
- Stability and Condition Number
- Exercises
Linear Algebra Examples
- Resources
- Exact solution of linear system of equations
- Basic information about a matrix
- Least-squares solution
- Matrix Decompositions
Applications of Linear Algebra: PCA
- Variance and covariance
- Eigendecomposition of the covariance matrix
- Covariance matrix as a linear transformation
- PCA
- Eigendecomposition of the covariance matrix
- Change of basis via PCA
- Graphical illustration of change of basis
- Dimension reduction via PCA
Sparse Matrices
- Creating a sparse matrix
- Application: Confusion matrix
- Application: PageRank
Optimization and Root Finding
- Example: Maximum Likelihood Estimation (MLE)
- Example: Linear Least Squares
- Main Issues in Root Finding in One Dimension
- Bisection Method
- Secant Method
- Newton-Raphson Method
- Gauss-Newton
- Inverse Quadratic Interpolation
- Brent’s Method
Algorithms for Optimization and Root Finding for Multivariate Problems
- Optimization/Roots in n Dimensions - First Some Calculus
- Convexity
- Line Search Methods
- Steepest Descent
- Newton’s Method
- Coordinate Descent
- Solvers
- GLM Estimation and IRLS
Using optimization routines from scipy and statsmodels
- Finding roots
- Optimization Primer
- Using scipy.optimize
- Gradient descent
- Optimization of standard statistical models
Random numbers and probability models
- Python analog of R random number functions
- Why are random numbers useful?
- Where do random numbers in the computer come from?
- Rejection sampling
- Using the numpy.random and scipy.stats PRNGs
Resampling and Monte Carlo Simulations
- Setting the random seed
- Adjusting p-values for multiple testing
- Jackknife estimate of parameters
- Leave one out cross validation (LOOCV)
- Check with R
- Estimating the CDF
- Estimating the PDF
Numerical Evaluation of Integrals
- Quadrature
- Monte Carlo integration
- Variance Reduction
- Quasi-random numbers
Probabilistic Graphical Models with pgmpy
Working with large data sets
- Lazy evaluation, pure functions and higher order functions
- Working with out-of-core memory
- Probabilistic data structures
- Small-scale distributed programming
Biggish Data
- 1 billion numbers
- Using Dask
- Using Blaze
Efficient storage of data in memory
- Selective retrieval from disk-based storage
- Storing numbers
- Storing strings
- Data Sketches
Working with large data sets
- Small-scale distributed programming
Using Spark
- Cheat sheet
- Local installation of Spark
- Start a spark session
- SparkContext
- Actions and transforms with parallelized collections
Using Spark Efficiently
- Resources
- Accumulators
- Broadcast Variables
- The Spark Shuffle and Partitioning
- Piping to External Programs
Spark MLLib
- MLLib Pipeline
- Unsupervised Learning
- Supervised Learning
- Using the newer ml pipeline
- Spark MLLib and sklearn integration
Spark SQL
- Resources
- DataFrame from pandas
- DataFrame from CSV files
- DataFrame from JSON files
- DataFrame from SQLite3
- DataSets
Spark Streaming
- Resources
- Streaming using sockets
- Using Spark Streaming
Spark on Cloud
- Azure
- AWS
- Know your AWS public and private access keys
- Know your AWS EC2 key-pair
- Install AWS command line client
- Configure the AWS command line client
- Create a cluster
- Get information about the cluster
- Connect to the cluster via ssh
- Note the IP address that is returned
- Run pyspark
- Run the Zeppelin notebook
- Connect to Zeppelin notebook
- Create notebook and run Spark within it
- Terminate the cluster
Using PyMC3
- Part 1 - A tutorial example: Estimating coin bias
- Part 2: Gallery
PyStan
- Useful links
- Coin toss
- Fit model
- MAP
- Estimating mean and standard deviation of normal distribution
- Estimating parameters of a linear regression model
- Simple Logistic model
Metropolis and Gibbs Sampling
- Island hopping
- Bayesian Data Analysis
- Hierarchical models
Using Auxiliary Variables in MCMC proposals
- Slice sampling
- A simple slice sampler example
- Hamiltonian Monte Carlo (HMC)
- Hamiltonian systems
- Finite difference methods
- From Hamiltonians to probability distributions
TensorFlow and Edward
- TensorFlow
- TensorFlow Examples
- Edward
- Edward examples
Bonus Material: The Humble For Loop
- (1) Filling a list
- (2) Filling a list conditionally
- (3) Writing as functions
- (4) Timing
- (5) Compilation
- (6) Vectorizing scalar functions
- (7) With numba and JIT compilation, we need to reconsider the old taboo against looping.
Bonus Material: Word count
- Convert to list of words
- Slower version without translate
- Using a regular dictionary
- Using a default dictionary
- Using a Counter
- Using third party function
- Counting without dictionaries
- Vectorized version
Symbolic Algebra with sympy
- Basics
- Simplify
- Calculus
- Working with matrices
- Solving Algebraic and Differential Equations
- Numerics
- Statistics

Indices and tables