What you should know and learn more about

Statistical foundations

  • Experimental design
    • Usually want to isolate a main effect from confounders
    • Can we use a randomized experimental design?
      • Beware of batch effects
  • Replication
    • Essential for science
  • Exploratory data analysis
    • Always eyeball the data (see the sketch after this list)
    • Facility with graphics libraries is essential
    • Even better are interactive graphics libraries (IPython notebook is ideal)
      • Bokeh
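
A minimal eyeballing sketch, assuming pandas and matplotlib and a hypothetical data.csv with a numeric column x:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("data.csv")   # hypothetical file
    print(df.describe())           # quick numeric summary
    df["x"].hist(bins=50)          # distribution of one variable
    plt.xlabel("x")
    plt.show()
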
  • As the amount of data grows
    • Simple algorithms may perform better than complex ones
    • Non-parametric models may perform better than parametric ones
    • But big data can often be interpreted as many pieces of small data (see the sketch below)
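
One minimal sketch of "many pieces of small data": stream a large file in chunks rather than loading it whole, assuming a hypothetical big.csv with a numeric column value:

    import pandas as pd

    # Stream the file 100,000 rows at a time instead of loading it whole.
    total, count = 0.0, 0
    for chunk in pd.read_csv("big.csv", chunksize=100_000):
        total += chunk["value"].sum()
        count += len(chunk)
    print("mean:", total / count)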

Computing foundations

  • Polyglot programming
    • R and/or SAS (for statistical libraries)
    • Python (for glue and data munging)
    • C/C++ (for high performance)
    • Command line tools and Unix philosophy
    • SQL (for managing data)
    • Scala (for Spark)
  • Need for concurrency
    • Functional style is increasingly important
      • Prefer immutable data structures
      • Prefer pure functions (see the sketch below)
        • Same input always gives same output
        • Does not cause any side effects
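
A minimal sketch of the contrast (function names are illustrative):

    # Impure: mutates its argument, so callers see a side effect.
    def scale_inplace(xs, k):
        for i in range(len(xs)):
            xs[i] *= k
        return xs

    # Pure: same input always gives the same output, touches nothing else.
    def scale(xs, k):
        return tuple(x * k for x in xs)   # new, immutable result
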
  • With big data, lazy evaluation can be helpful
    • Prefer generators to lists
    • Look at the itertools module in the standard library (see the sketch below)
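
A minimal sketch of lazy evaluation with generators and itertools:

    from itertools import count, islice

    # A generator yields values on demand; nothing is materialized up front.
    def squares():
        for n in count():            # lazy infinite stream 0, 1, 2, ...
            yield n * n

    # Take the first five squares without ever building a full list.
    print(list(islice(squares(), 5)))    # [0, 1, 4, 9, 16]
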
  • Composability for maintainability and extensibility
    • Small pieces, loosely joined
    • Combinator pattern: build complex behavior by composing small functions (see the sketch below)
    • Again, all of this was part of the original Unix philosophy
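
A minimal combinator-style sketch; the compose helper is illustrative, not a standard library function:

    from functools import reduce

    def compose(*fs):
        # Chain functions right to left: compose(f, g)(x) == f(g(x)).
        return reduce(lambda f, g: lambda x: f(g(x)), fs)

    # Small pieces ...
    strip = str.strip
    lower = str.lower
    words = str.split

    # ... loosely joined.
    tokenize = compose(words, lower, strip)
    print(tokenize("  Hello World  "))   # ['hello', 'world']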

Mathematical foundations

  • Core: probability and linear algebra
  • Calculus is important but secondary
  • Graphs and networks increasingly relevant

Statistical algorithms

  • Numbers are leaky abstractions (e.g. floating point arithmetic only approximates real arithmetic)
  • Don’t just use black boxes
    • Make an effort to understand what each algorithm you call is doing
    • At minimum, can you explain what the algorithm is doing in plain English?
    • Can you implement a simple version from the ground up? (see the sketch below)
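
For example, a from-scratch sketch of ordinary least squares via the normal equations (fine for small, well-conditioned problems) before reaching for a library routine:

    import numpy as np

    # Fit y = b0 + b1 * x by solving the normal equations (X'X) b = X'y.
    def ols(x, y):
        X = np.column_stack([np.ones(len(x)), x])   # add intercept column
        return np.linalg.solve(X.T @ X, X.T @ y)

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)
    print(ols(x, y))   # close to [2, 3]
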
  • Categories of algorithms
    • Big matrix manipulations (matrix decomposition is key)
    • Continuous optimization, using derivative information of order 0, 1, or 2 (function values, gradients, Hessians)
    • EM algorithm has wide applicability in both frequentist and Bayesian domains
    • Monte Carlo methods, MCMC and simulations (see the sketch below)
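
A classic minimal Monte Carlo sketch: estimate pi by sampling the unit square:

    import numpy as np

    # Fraction of random points in the unit square that land inside
    # the quarter circle estimates pi / 4.
    rng = np.random.default_rng(0)
    n = 1_000_000
    xy = rng.random((n, 2))
    inside = (xy ** 2).sum(axis=1) < 1.0
    print(4 * inside.mean())   # approaches pi as n grows
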
  • Making code fast
    • Make it run, make it right, make it fast
    • Python has amazing profiling tools - use them
    • For profiling C code, try gperftools
    • Compilation: try numba or Cython in preference to writing raw C/C++ (see the numba sketch below)
    • Parallel programming
      • The GIL means CPython threads do not run Python bytecode in parallel; use processes for CPU-bound work
      • Use a Queue (queue module for threads, multiprocessing for processes) to build a pipeline (see the sketch below)
      • Skip OpenMP (except within Cython) and MPI
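
A minimal numba sketch (assuming numba is installed): a plain-Python loop is JIT-compiled to machine code on first call:

    import numpy as np
    from numba import njit

    @njit                        # compiled to machine code on first call
    def dot(xs, ys):
        total = 0.0
        for i in range(len(xs)):
            total += xs[i] * ys[i]
        return total

    a = np.arange(1e6)
    print(dot(a, a))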
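
And a minimal pipeline sketch with a Queue and a sentinel value (threads shown; multiprocessing.Queue works the same way for processes):

    from queue import Queue
    from threading import Thread

    q = Queue(maxsize=16)        # bounded queue applies backpressure

    def producer():
        for i in range(10):
            q.put(i)
        q.put(None)              # sentinel: end of stream

    def consumer():
        while True:
            item = q.get()
            if item is None:
                break
            print(item * item)

    threads = [Thread(target=producer), Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
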
  • Big data
    • Spark is the killer app (see the sketch below)
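
A minimal PySpark sketch (the classic word count), assuming a local Spark installation and a hypothetical data.txt:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")
    counts = (sc.textFile("data.txt")              # hypothetical input
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.take(5))
    sc.stop()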