What you should know and learn more about
Statistical foundations
- Experimental design
- Usually want to isolate a main effect from confounders
- Can we use a randomized experimental design?
- Replication
- Exploratory data analysis
- Always eyeball the data
- Facility with graphics libraries is essential
- Even better are interactive graphics libraries (the IPython notebook is ideal); see the sketch below
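A minimal sketch of a first look at a dataset with pandas and matplotlib, assuming a hypothetical file data.csv with a numeric column x:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("data.csv")   # hypothetical input file
    print(df.head())               # eyeball the first few rows
    print(df.describe())           # summary statistics for numeric columns
    df["x"].hist(bins=30)          # quick look at one distribution
    plt.xlabel("x")
    plt.show()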
- As amount of data grows
- Simple algorithms may perform better than complex ones
- Non-parametric models may perform better than parametric ones
- But big data can often be interpreted as many pieces of small data
Computing foundations
- Polyglot programming
- R and/or SAS (for statistical libraries)
- Python (for glue and data munging)
- C/C++ (for high performance)
- Command line tools and the Unix philosophy
- SQL (for managing data)
- Scala (for Spark)
- Need for concurrency
- Functional style is increasingly important
- Prefer immutable data structures
- Prefer pure functions
- Same input always gives same output
- Does not cause any side effects
- With big data, lazy evaluation can be helpful
- Prefer generators to lists
- Look at the itertools module in the Python standard library (see the sketch below)
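A small sketch of the functional points above: a pure function over immutable data, and a lazy pipeline built from generators and itertools (the functions and values are illustrative):

    import itertools

    def scale(xs, factor):
        # Pure: the same input always gives the same output, and the
        # input tuple is immutable, so nothing is modified in place.
        return tuple(x * factor for x in xs)

    print(scale((1, 2, 3), 10))        # (10, 20, 30)

    # Lazy evaluation: nothing is computed until values are requested.
    naturals = itertools.count(1)               # infinite generator
    squares = (n * n for n in naturals)         # generator, not a list
    print(list(itertools.islice(squares, 5)))   # [1, 4, 9, 16, 25]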
- Composability for maintainability and extensibility
- Small pieces, loosely joined
- Combinator pattern
- Again, all this was in the original Unix philosophy
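One way to make the combinator pattern concrete: a generic compose helper that joins small functions into bigger ones (compose is written here for illustration, not taken from the standard library):

    from functools import reduce

    def compose(*funcs):
        # Right-to-left composition: compose(f, g)(x) == f(g(x)).
        return reduce(lambda f, g: lambda x: f(g(x)), funcs)

    # Small pieces, loosely joined:
    clean = compose(str.lower, str.strip)
    print(clean("  Hello World  "))    # 'hello world'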
Mathematical foundations
- Core: probability and linear algebra
- Calculus is important but secondary
- Graphs and networks increasingly relevant
Statistical algorithms
- Numbers as leaky abstractions
- Don’t just use black boxes
- Make an effort to understand what each algorithm you call is doing
- At minimum, can you explain what the algorithm is doing in plain English?
- Can you implement a simple version from the ground up?
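As one example of building a simple version from the ground up, here is a sketch of ordinary least squares via the normal equations, checked against numpy's black-box solver (the simulated data are arbitrary):

    import numpy as np

    def ols(X, y):
        # Normal equations: solve (X^T X) beta = X^T y directly,
        # avoiding an explicit matrix inverse.
        return np.linalg.solve(X.T @ X, X.T @ y)

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(100), rng.normal(size=100)])
    y = X @ np.array([2.0, 3.0]) + rng.normal(scale=0.1, size=100)
    print(ols(X, y))                             # close to [2, 3]
    print(np.linalg.lstsq(X, y, rcond=None)[0])  # black-box check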
- Categories of algorithms
- Big matrix manipulations (matrix decomposition is key; see the SVD sketch after this list)
- Continuous optimization - order 0, 1, 2
- The EM algorithm has wide applicability in both frequentist and Bayesian domains
- Monte Carlo methods, MCMC and simulations (see the Metropolis sketch after this list)
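For the matrix decomposition point, a short SVD sketch: factor a matrix and rebuild the best low-rank approximation (the matrix size and rank here are arbitrary):

    import numpy as np

    A = np.random.default_rng(0).normal(size=(6, 4))
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                    # keep the top-2 singular values
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    # By Eckart-Young, A_k is the best rank-k approximation in Frobenius norm.
    print(np.linalg.norm(A - A_k))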
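And for the Monte Carlo point, a minimal random-walk Metropolis sampler targeting a standard normal; the proposal width and step count are arbitrary choices:

    import numpy as np

    def metropolis(log_target, x0, n_steps, step=0.5, seed=0):
        # Propose a Gaussian step; accept with probability
        # min(1, target(x_new) / target(x)), comparing on the log scale.
        rng = np.random.default_rng(seed)
        x, samples = x0, []
        for _ in range(n_steps):
            x_new = x + rng.normal(scale=step)
            if np.log(rng.uniform()) < log_target(x_new) - log_target(x):
                x = x_new
            samples.append(x)
        return np.array(samples)

    samples = metropolis(lambda x: -0.5 * x ** 2, x0=0.0, n_steps=10000)
    print(samples.mean(), samples.std())   # roughly 0 and 1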
- Making code fast
- Make it run, make it right, make it fast
- Python has amazing profiling tools - use them
- For profiling C code, try gperftools
- Compilation: try numba or Cython in preference to writing raw C/C++ (see the sketch below)
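A sketch covering both points: profile first with the standard library's cProfile, and compile the hot loop with numba's @jit (the pairwise-sum function is a made-up hot spot):

    import cProfile
    import numpy as np
    from numba import jit

    @jit(nopython=True)       # compiled to machine code on first call
    def pairwise_dist_sum(xs):
        # Deliberately loop-heavy: slow in pure Python, fast once compiled.
        total = 0.0
        for i in range(len(xs)):
            for j in range(len(xs)):
                total += abs(xs[i] - xs[j])
        return total

    xs = np.random.default_rng(0).normal(size=2000)
    cProfile.run("pairwise_dist_sum(xs)")   # see where the time goes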
- Parallel programming
- The Python GIL limits CPU-bound parallelism with threads
- Use Queue from threading or multiprocessing to build a pipeline (see the sketch after this list)
- Skip OpenMP (except within Cython) and MPI
- Big data
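A minimal pipeline sketch for the Queue point above, using multiprocessing (which sidesteps the GIL); the producer/worker stages and the squaring job are illustrative:

    from multiprocessing import Process, Queue

    def producer(q_out):
        for i in range(5):
            q_out.put(i)         # feed work into the pipeline
        q_out.put(None)          # sentinel: no more work

    def worker(q_in, q_out):
        while True:
            item = q_in.get()
            if item is None:     # pass the sentinel along and stop
                q_out.put(None)
                break
            q_out.put(item * item)

    if __name__ == "__main__":
        q1, q2 = Queue(), Queue()
        Process(target=producer, args=(q1,)).start()
        Process(target=worker, args=(q1, q2)).start()
        while True:
            result = q2.get()
            if result is None:
                break
            print(result)        # 0, 1, 4, 9, 16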
Libraries worth knowing about after numpy, scipy and matplotlib