Comparison of optimization approaches¶
Interfacing with compiled languages¶
Python is a high-level interpreted language, which greatly reduces the time taken to prototype and develop useful statistical programs. However, the trade-off is that pure Python programs can be orders of magnitude slower than programs in compiled languages such as C/C++ or Fortran. Hence most numerical and statistical packages include interfaces to compiled code (e.g. numpy, which is written in C) or, more recently, are just-in-time compiled to native machine code (e.g. numba, pymc3). Fortunately, it is relatively easy to write custom modules that compile to native machine code and call them from Python, an important factor in the popularity of Python as a language for scientific and statistical computing.
We will use the example of calculating the pairwise Euclidean distance between all pairs of points to illustrate the various methods of interfacing with native code.
Adapted and extended from http://nbviewer.ipython.org/url/jakevdp.github.io/downloads/notebooks/NumbaCython.ipynb
In [36]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from numba import jit, njit
import numexpr as ne
%precision 2
Out[36]:
u'%.2f'
Make up some test data for use later¶
In [2]:
A = np.array([[0.0,0.0],[3.0,4.0]])
n = 1000
p = 3
xs = np.random.random((n, p))
“Pure” Python version¶
In [3]:
def pdist_python(xs):
    n, p = xs.shape
    D = np.empty((n, n), np.float)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(p):
                tmp = xs[i,k] - xs[j,k]
                s += tmp * tmp
            D[i, j] = s**0.5
    return D
In [4]:
print pdist_python(A)
%timeit -n 1 pdist_python(xs)
[[ 0. 5.]
[ 5. 0.]]
1 loop, best of 3: 3.55 s per loop
From Python \(\rightarrow\) Compiled languages¶
Numpy version¶
The numpy version makes use of advanced broadcasting. To follow the code below, we will have to understand numpy's broadcasting rules a little better. Here is the gist, from http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#numpy.newaxis:
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when
- they are equal, or
- one of them is 1
Arrays do not need to have the same number of dimensions. When either of the dimensions compared is one, the larger of the two is used. In other words, the smaller of two axes is stretched or “copied” to match the other.
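These rules can be checked before doing any arithmetic: np.broadcast reports the shape two arrays would broadcast to without computing anything. A minimal sketch (the shapes are chosen to mirror the vector example further below):

a = np.ones((5, 1, 2))
b = np.ones((5, 2))
# trailing axes: 2 vs 2 match; next, the 1 is stretched to 5; then the
# missing leading axis of b is treated as 1 and stretched to 5
print np.broadcast(a, b).shape   # (5, 5, 2)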
Distance between scalars¶
In [5]:
x = np.arange(10)
x
Out[5]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [6]:
# if we insert an extra dimension into x with np.newaxis
# we get a (10, 1) matrix
x[:, np.newaxis].shape
Out[6]:
(10, 1)
Comparing shapes:

x[:, None] = 10 x 1
x          =     10

When we subtract the two arrays, the broadcasting rules first match the trailing axis to 10 (so x[:, None] is stretched to be (10, 10)), and then, matching the next axis, x is stretched to also be (10, 10).
In [8]:
# The matrix of pairwise differences
# (for scalars, the Euclidean distance is just the absolute difference)
x[:, None] - x
Out[8]:
array([[ 0, -1, -2, -3, -4, -5, -6, -7, -8, -9],
[ 1, 0, -1, -2, -3, -4, -5, -6, -7, -8],
[ 2, 1, 0, -1, -2, -3, -4, -5, -6, -7],
[ 3, 2, 1, 0, -1, -2, -3, -4, -5, -6],
[ 4, 3, 2, 1, 0, -1, -2, -3, -4, -5],
[ 5, 4, 3, 2, 1, 0, -1, -2, -3, -4],
[ 6, 5, 4, 3, 2, 1, 0, -1, -2, -3],
[ 7, 6, 5, 4, 3, 2, 1, 0, -1, -2],
[ 8, 7, 6, 5, 4, 3, 2, 1, 0, -1],
[ 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]])
Distance between vectors¶
In [9]:
# Suppose we have a collection of vectors of dimension 2
# In the example below, there are 5 such 2-vectors
# We want to calculate the Euclidean distance
# for all pair-wise comparisons in a 5 x 5 matrix
x = np.arange(10).reshape(5,2)
print x.shape
print x
(5, 2)
[[0 1]
[2 3]
[4 5]
[6 7]
[8 9]]
In [10]:
x[:, None, :].shape
Out[10]:
(5, 1, 2)
Comparing shapes:

x[:, None, :] = 5 x 1 x 2
x             =     5 x 2
From the rules of broadcasting, we expect the result of subtraction to be a 5 x 5 x 2 array. To calculate Euclidean distance, we need to find the square root of the sum of squares for the 5 x 5 collection of 2-vectors.
In [11]:
delta = x[:, None, :] - x
pdist = np.sqrt((delta**2).sum(-1))
pdist
Out[11]:
array([[ 0. , 2.83, 5.66, 8.49, 11.31],
[ 2.83, 0. , 2.83, 5.66, 8.49],
[ 5.66, 2.83, 0. , 2.83, 5.66],
[ 8.49, 5.66, 2.83, 0. , 2.83],
[ 11.31, 8.49, 5.66, 2.83, 0. ]])
Finally, we come to the anti-climax: a one-liner function!¶
In [12]:
def pdist_numpy(xs):
    return np.sqrt(((xs[:,None,:] - xs)**2).sum(-1))
In [13]:
print pdist_numpy(A)
%timeit pdist_numpy(xs)
[[ 0. 5.]
[ 5. 0.]]
10 loops, best of 3: 97 ms per loop
Numexpr version¶
In [14]:
def pdist_numexpr(xs):
    a = xs[:, np.newaxis, :]
    return np.sqrt(ne.evaluate('sum((a-xs)**2, axis=2)'))
In [15]:
print pdist_numexpr(A)
%timeit pdist_numexpr(xs)
[[ 0. 5.]
[ 5. 0.]]
10 loops, best of 3: 31.1 ms per loop
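numexpr wins here mainly because it evaluates the expression in cache-sized chunks instead of allocating a full-size temporary array for each intermediate result, the way numpy does. A minimal illustration of the equivalence (the arrays a and b exist only for this example):

a = np.random.random(10**6)
b = np.random.random(10**6)
# numpy builds temporaries for 2*a and 3*b; numexpr streams over chunks
print np.allclose(2*a + 3*b, ne.evaluate('2*a + 3*b'))   # True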
Numba version¶
In [39]:
pdist_numba = jit(pdist_python)
In [40]:
print pdist_numba(A)
%timeit pdist_numba(xs)
[[ 0. 5.]
[ 5. 0.]]
100 loops, best of 3: 11.3 ms per loop
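Equivalently, numba can be applied as a decorator at definition time. Here is a sketch using njit (imported above), which compiles in nopython mode and raises an error rather than silently falling back to slower object mode; pdist_numba2 is just an illustrative name:

@njit
def pdist_numba2(xs):
    n, p = xs.shape
    D = np.empty((n, n), np.float64)   # an explicit dtype keeps nopython mode happy
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(p):
                tmp = xs[i, k] - xs[j, k]
                s += tmp * tmp
            D[i, j] = s ** 0.5
    return D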
Cython version¶
For more control over the translation to C, most Python scientific developers use the Cython package. Essentially, this is a language that resembles Python with type annotations. The Cython code is then compiled into native code transparently. The great advantages of Cython over other approaches are:
- A Python program is also a valid Cython program, so optimization can occur incrementally
- Fine-grained control over the degree of optimization
- Easy to use: it handles the details of the C compiler and shared library generation
- The cython magic extension comes built into the IPython notebook
- Can run parallel code by releasing the GIL with nogil (a parallel sketch follows at the end of this section)
- Fully optimized code runs at the same speed as C in most cases
In [18]:
%load_ext cython
In [19]:
%%cython
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pdist_cython(double[:, ::1] xs):
    cdef int n = xs.shape[0]
    cdef int p = xs.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((n, n), dtype=np.float)
    for i in range(n):
        for j in range(n):
            d = 0.0
            for k in range(p):
                tmp = xs[i, k] - xs[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
In [20]:
print pdist_cython(A)
%timeit pdist_cython(xs)
[[ 0. 5.]
[ 5. 0.]]
100 loops, best of 3: 9.03 ms per loop
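As a sketch of the parallel option mentioned above, the outer loop can be released from the GIL and distributed over threads with prange. This assumes an OpenMP-capable compiler (hence the extra flags), and any speedup depends on the number of cores:

%%cython --compile-args=-fopenmp --link-args=-fopenmp
import numpy as np
cimport cython
from cython.parallel import prange
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pdist_cython_parallel(double[:, ::1] xs):
    cdef int n = xs.shape[0]
    cdef int p = xs.shape[1]
    cdef int i, j, k
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((n, n), dtype=np.float64)
    # d, tmp, j and k are assigned inside the loop body, so each
    # thread gets its own private copy
    for i in prange(n, nogil=True, schedule='static'):
        for j in range(n):
            d = 0.0
            for k in range(p):
                tmp = xs[i, k] - xs[j, k]
                d = d + tmp * tmp   # plain assignment keeps d private, not a reduction
            D[i, j] = sqrt(d)
    return np.asarray(D)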
From Compiled langauges \(\rightarrow\) Python¶
C version¶
There are many ways to wrap C code for Python, such as Cython, SWIG or Boost.Python with numpy. However, the standard library comes with ctypes, a foreign function library that can be used to wrap C functions for use in pure Python. This involves a little more work than the other approaches, as shown below.
In [21]:
%%file pdist_c.c
#include <math.h>
void pdist_c(int n, int p, double xs[n*p], double D[n*n]) {
    for (int i=0; i<n; i++) {
        for (int j=0; j<n; j++) {
            double s = 0.0;
            for (int k=0; k<p; k++) {
                double tmp = xs[i*p+k] - xs[j*p+k];
                s += tmp*tmp;
            }
            D[i*n+j] = sqrt(s);
        }
    }
}
Overwriting pdist_c.c
In [22]:
# Compile to a shared library
# Mac
! gcc -O3 -bundle -undefined dynamic_lookup pdist_c.c -o pdist_c.so
# Linux:
# ! gcc -O3 -fPIC -shared -std=c99 -lm pdist_c.c -o pdist_c.so
In [23]:
from ctypes import CDLL, c_int, c_void_p
def pdist_c(xs):
    # Use ctypes to load the shared library
    lib = CDLL('./pdist_c.so')
    # We need to give the argument and return types explicitly
    lib.pdist_c.argtypes = [c_int, c_int,
                            np.ctypeslib.ndpointer(dtype=np.float),
                            np.ctypeslib.ndpointer(dtype=np.float)]
    lib.pdist_c.restype = c_void_p
    n, p = xs.shape
    D = np.empty((n, n), np.float)
    lib.pdist_c(n, p, xs, D)
    return D
In [24]:
print pdist_c(A)
%timeit pdist_c(xs)
[[ 0. 5.]
[ 5. 0.]]
100 loops, best of 3: 10 ms per loop
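One caveat: the wrapper above hands xs straight to C, assuming it is a C-contiguous float64 array. That happens to hold for our test data but is never checked. A sketch of a stricter wrapper (pdist_c_safe is just an illustrative name) uses ndpointer flags to enforce the layout and copies only when necessary:

from ctypes import CDLL, c_int, c_void_p
def pdist_c_safe(xs):
    lib = CDLL('./pdist_c.so')
    # flags='C_CONTIGUOUS' makes ctypes reject non-contiguous arrays outright
    lib.pdist_c.argtypes = [c_int, c_int,
                            np.ctypeslib.ndpointer(dtype=np.float64, flags='C_CONTIGUOUS'),
                            np.ctypeslib.ndpointer(dtype=np.float64, flags='C_CONTIGUOUS')]
    lib.pdist_c.restype = c_void_p
    xs = np.ascontiguousarray(xs, dtype=np.float64)   # copies only if needed
    n, p = xs.shape
    D = np.empty((n, n), np.float64)
    lib.pdist_c(n, p, xs, D)
    return D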
C++ version¶
Using C++ is almost the same as using C. Just add an extern "C" declaration and use an appropriate C++ compiler.
In [25]:
%%file pdist_cpp.cpp
#include <cmath>
extern "C"
// Variable length arrays are OK for C99 but not legal in C++
// void pdist_cpp(int n, int p, double xs[n*p], double D[n*n]) {
void pdist_cpp(int n, int p, double *xs, double *D) {
    for (int i=0; i<n; i++) {
        for (int j=0; j<n; j++) {
            double s = 0.0;
            for (int k=0; k<p; k++) {
                double tmp = xs[i*p+k] - xs[j*p+k];
                s += tmp*tmp;
            }
            D[i*n+j] = sqrt(s);
        }
    }
}
Overwriting pdist_cpp.cpp
In [26]:
# Compile to a shared library
! g++ -O3 -bundle -undefined dynamic_lookup pdist_cpp.cpp -o pdist_cpp.so
# Linux:
# ! g++ -O3 -fPIC -shared pdist_cpp.cpp -o pdist_cpp.so
In [27]:
from ctypes import CDLL, c_int, c_void_p
def pdist_cpp(xs):
    # Use ctypes to load the shared library
    lib = CDLL('./pdist_cpp.so')
    # We need to give the argument and return types explicitly
    lib.pdist_cpp.argtypes = [c_int, c_int,
                              np.ctypeslib.ndpointer(dtype=np.float),
                              np.ctypeslib.ndpointer(dtype=np.float)]
    lib.pdist_cpp.restype = c_void_p
    n, p = xs.shape
    D = np.empty((n, n), np.float)
    lib.pdist_cpp(n, p, xs, D)
    return D
In [28]:
print pdist_cpp(A)
%timeit pdist_cpp(xs)
[[ 0. 5.]
[ 5. 0.]]
100 loops, best of 3: 10 ms per loop
Fortran¶
In [29]:
%%file pdist_fortran.f90
subroutine pdist_fortran (n, p, A, D)

    integer, intent(in) :: n
    integer, intent(in) :: p
    real(8), intent(in), dimension(n,p) :: A
    real(8), intent(inout), dimension(n,n) :: D

    integer :: i, j, k
    real(8) :: s, tmp

    ! note that the order of indices is different from C
    do j = 1, n
        do i = 1, n
            s = 0.0
            do k = 1, p
                tmp = A(i, k) - A(j, k)
                s = s + tmp*tmp
            end do
            D(i, j) = sqrt(s)
        end do
    end do

end subroutine
Overwriting pdist_fortran.f90
In [30]:
! f2py -c -m flib pdist_fortran.f90 > /dev/null
In [31]:
import flib
print flib.pdist_fortran.__doc__
pdist_fortran(a,d,[n,p])
Wrapper for ``pdist_fortran``.
Parameters
----------
a : input rank-2 array('d') with bounds (n,p)
d : in/output rank-2 array('d') with bounds (n,n)
Other Parameters
----------------
n : input int, optional
Default: shape(a,0)
p : input int, optional
Default: shape(a,1)
In [32]:
def pdist_fortran(xs):
    import flib
    n, p = xs.shape
    xs = np.array(xs, order='F')
    D = np.empty((n, n), np.float, order='F')
    flib.pdist_fortran(xs, D)
    return D
In [33]:
print pdist_fortran(A)
%timeit pdist_fortran(xs)
[[ 0. 5.]
[ 5. 0.]]
100 loops, best of 3: 9.33 ms per loop
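The order='F' arguments in the wrapper matter: numpy arrays are row-major (C order) by default, while Fortran assumes column-major storage, so without the explicit copies f2py would quietly make its own. A quick check (a sketch):

xs_c = np.random.random((3, 2))    # row-major (C order) by default
xs_f = np.asfortranarray(xs_c)     # column-major copy for Fortran
print np.isfortran(xs_c), np.isfortran(xs_f)   # False True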
Bake-off¶
In [34]:
# Final bake-off
w = 10
print 'Python'.ljust(w),
%timeit pdist_python(xs)
print 'Numpy'.ljust(w),
%timeit pdist_numpy(xs)
print 'Numexpr'.ljust(w),
%timeit pdist_numexpr(xs)
print 'Numba'.ljust(w),
%timeit pdist_numba(xs)
print 'Cython'.ljust(w),
%timeit pdist_cython(xs)
print 'C'.ljust(w),
%timeit pdist_c(xs)
print 'C++'.ljust(w),
%timeit pdist_cpp(xs)
print 'Fortran'.ljust(w),
%timeit pdist_fortran(xs)
from scipy.spatial.distance import pdist as pdist_scipy
print 'Scipy'.ljust(w),
%timeit pdist_scipy(xs)
Python 1 loop, best of 3: 3.66 s per loop
Numpy 10 loops, best of 3: 97.2 ms per loop
Numexpr 10 loops, best of 3: 30.8 ms per loop
Numba 100 loops, best of 3: 11.3 ms per loop
Cython 100 loops, best of 3: 9.02 ms per loop
C 100 loops, best of 3: 9.98 ms per loop
C++ 100 loops, best of 3: 9.97 ms per loop
Fortran 100 loops, best of 3: 9.27 ms per loop
Scipy 100 loops, best of 3: 4.37 ms per loop
Final optimization: scipy only calculates distances for i < j, since the pairwise distance matrix is symmetric, and hence takes about half the time of our solutions. Can you modify our pdist_X functions to also exploit symmetry? One possible approach is sketched below.
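A sketch of one answer for the numpy version (pdist_numpy_symmetric is just an illustrative name): compute only the i < j pairs and mirror the result. Whether this actually beats the full broadcast version depends on the cost of the fancy indexing, so profile before adopting it.

def pdist_numpy_symmetric(xs):
    n = xs.shape[0]
    D = np.zeros((n, n))
    i, j = np.triu_indices(n, k=1)    # index pairs with i < j only
    D[i, j] = np.sqrt(((xs[i] - xs[j])**2).sum(-1))
    D[j, i] = D[i, j]                 # mirror into the lower triangle
    return D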
Summary¶
- Using C, C++ or Fortran gives essentially identical performance
- Of the approaches that compile from within Python:
  - Cython is the fastest, but needs the extra work of type annotations
  - numba is almost as fast and the simplest to use - just write jit(function)
  - numexpr is slightly slower and works best for small numpy expressions, but is also very convenient
- A pure numpy solution also performs reasonably and will be the shortest solution (a one-liner in this case)
- The pure Python approach is very slow, but serves as a useful template for conversion to a native language, either directly or via a JIT compiler
- Note that the fastest alternatives are several hundred times faster than the pure Python version for the test problem with n=1000 and p=3
Recommendations for optimizing Python code¶
- Does a reliable fast implementation already exist? If so, consider using that
- Start with a numpy/Python prototype - if this is fast enough, stop
- See if better use of vectorization via numpy will help
- Moving to native code:
  - Most Python developers will use Cython as the tool of choice. Cython can also be used to access/wrap C/C++ code
  - JIT compilation with numba is improving fast and may become competitive with Cython in the near future
  - If the expression is “minimal”, it is usually worth considering numexpr because there is almost no work to be done
  - Use C/C++/Fortran if you are fluent in those languages - you have seen how to call these functions from Python
- If appropriate, consider parallelization (covered in a later session)
- As you optimize your code, remember:
  - Check that it is giving correct results!
  - Profile often - it is very hard to predict the effect of an optimization in general
  - Remember that your time is precious - stop when it is fast enough
  - If getting a bigger, faster machine will solve the problem, that is sometimes the best solution
In [35]:
%load_ext version_information
%version_information numpy, scipy, numexpr, numba, cython, f2py,
Out[35]:
Software | Version
--- | ---
Python | 2.7.11 64bit [GCC 4.2.1 (Apple Inc. build 5577)]
IPython | 4.1.2
OS | Darwin 15.4.0 x86_64 i386 64bit
numpy | 1.10.4
scipy | 0.17.0
numexpr | 2.4.6
numba | 0.23.1
cython | 0.23.4
f2py | The 'f2py' distribution was not found and is required by the application

Fri Apr 01 09:10:36 2016 EDT