Using numpy

The foundation for numerical computation in Python is the numpy package, and essentially all scientific libraries in Python build on this - e.g. scipy, pandas, statsmodels, scikit-learn, cv2 etc. The basic data structure in numpy is the NDArray, and it is essential to become familiar with how to slice and dice this object.

Numpy also has the random, and linalg modules that we will discuss in later lectures.

Resources

NDArray

The base structure in numpy is ndarray, used to represent vectors, matrices and higher-dimensional arrays. Each ndarray has the following attributes:

  • dtype = corresponds to data types in C
  • shape = dimensions of array
  • strides = number of bytes to step in each direction when traversing the array
x = np.array([1,2,3,4,5,6])
print(x)
print('dytpe', x.dtype)
print('shape', x.shape)
print('strides', x.strides)
[1 2 3 4 5 6]
dytpe int64
shape (6,)
strides (8,)
x.shape = (2,3)
print(x)
print('dytpe', x.dtype)
print('shape', x.shape)
print('strides', x.strides)
[[1 2 3]
 [4 5 6]]
dytpe int64
shape (2, 3)
strides (24, 8)
x = x.astype('complex')
print(x)
print('dytpe', x.dtype)
print('shape', x.shape)
print('strides', x.strides)
[[ 1.+0.j  2.+0.j  3.+0.j]
 [ 4.+0.j  5.+0.j  6.+0.j]]
dytpe complex128
shape (2, 3)
strides (48, 16)

Array creation

np.array([1,2,3])
array([1, 2, 3])
np.array([1,2,3], np.float64)
array([ 1.,  2.,  3.])
np.arange(3)
array([0, 1, 2])
np.arange(3, 6, 0.5)
array([ 3. ,  3.5,  4. ,  4.5,  5. ,  5.5])
np.array([[1,2,3],[4,5,6]])
array([[1, 2, 3],
       [4, 5, 6]])
np.ones(3)
array([ 1.,  1.,  1.])
np.zeros((3,4))
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])
np.eye(4)
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])
np.diag([1,2,3,4])
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])
np.fromfunction(lambda i, j: i**2+j**2, (4,5))
array([[  0.,   1.,   4.,   9.,  16.],
       [  1.,   2.,   5.,  10.,  17.],
       [  4.,   5.,   8.,  13.,  20.],
       [  9.,  10.,  13.,  18.,  25.]])

Array manipulation

x = np.fromfunction(lambda i, j: i**2+j**2, (4,5))
x
array([[  0.,   1.,   4.,   9.,  16.],
       [  1.,   2.,   5.,  10.,  17.],
       [  4.,   5.,   8.,  13.,  20.],
       [  9.,  10.,  13.,  18.,  25.]])
x.shape
(4, 5)
x.size
20
x.dtype
dtype('float64')
x.astype(np.int64)
array([[ 0,  1,  4,  9, 16],
       [ 1,  2,  5, 10, 17],
       [ 4,  5,  8, 13, 20],
       [ 9, 10, 13, 18, 25]])
x.T
array([[  0.,   1.,   4.,   9.],
       [  1.,   2.,   5.,  10.],
       [  4.,   5.,   8.,  13.],
       [  9.,  10.,  13.,  18.],
       [ 16.,  17.,  20.,  25.]])
x.reshape(2,-1)
array([[  0.,   1.,   4.,   9.,  16.,   1.,   2.,   5.,  10.,  17.],
       [  4.,   5.,   8.,  13.,  20.,   9.,  10.,  13.,  18.,  25.]])

Array indexing

x
array([[  0.,   1.,   4.,   9.,  16.],
       [  1.,   2.,   5.,  10.,  17.],
       [  4.,   5.,   8.,  13.,  20.],
       [  9.,  10.,  13.,  18.,  25.]])
x[0]
array([  0.,   1.,   4.,   9.,  16.])
x[0,:]
array([  0.,   1.,   4.,   9.,  16.])
x[:,0]
array([ 0.,  1.,  4.,  9.])
x[-1]
array([  9.,  10.,  13.,  18.,  25.])
x[1,1]
2.0
x[:, 1:3]
array([[  1.,   4.],
       [  2.,   5.],
       [  5.,   8.],
       [ 10.,  13.]])

Boolean indexing

x >= 2
array([[False, False,  True,  True,  True],
       [False,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)
x[x > 2]
array([  4.,   9.,  16.,   5.,  10.,  17.,   4.,   5.,   8.,  13.,  20.,
         9.,  10.,  13.,  18.,  25.])

Fancy indexing

x[0, [1,2]]
array([ 1.,  4.])

Calculations and broadcasting

Broadcasting refers to the set of rules that numpy uses to perfrom operations on arrays with different shapes. See official documentation for a clear explanation of the rules. Array shapes can be manipulated using the reshape method or by inserting a new axis with np.newaxis. Note that np.newaxis is an alias for None, which I sometimes use in my examples.

x = np.fromfunction(lambda i, j: i**2+j**2, (2,3))
x
array([[ 0.,  1.,  4.],
       [ 1.,  2.,  5.]])
x * 5
array([[  0.,   5.,  20.],
       [  5.,  10.,  25.]])
x + x
array([[  0.,   2.,   8.],
       [  2.,   4.,  10.]])
x @ x.T
array([[ 17.,  22.],
       [ 22.,  30.]])
x.T @ x
array([[  1.,   2.,   5.],
       [  2.,   5.,  14.],
       [  5.,  14.,  41.]])
np.log1p(x)
array([[ 0.        ,  0.69314718,  1.60943791],
       [ 0.69314718,  1.09861229,  1.79175947]])
np.exp(x)
array([[   1.        ,    2.71828183,   54.59815003],
       [   2.71828183,    7.3890561 ,  148.4131591 ]])

Combining and splitting arrays

x
array([[ 0.,  1.,  4.],
       [ 1.,  2.,  5.]])
np.r_[x, x]
array([[ 0.,  1.,  4.],
       [ 1.,  2.,  5.],
       [ 0.,  1.,  4.],
       [ 1.,  2.,  5.]])
np.vstack([x, x])
array([[ 0.,  1.,  4.],
       [ 1.,  2.,  5.],
       [ 0.,  1.,  4.],
       [ 1.,  2.,  5.]])
np.concatenate([x, x], axis=0)
array([[ 0.,  1.,  4.],
       [ 1.,  2.,  5.],
       [ 0.,  1.,  4.],
       [ 1.,  2.,  5.]])
np.c_[x,x]
array([[ 0.,  1.,  4.,  0.,  1.,  4.],
       [ 1.,  2.,  5.,  1.,  2.,  5.]])
np.hstack([x, x])
array([[ 0.,  1.,  4.,  0.,  1.,  4.],
       [ 1.,  2.,  5.,  1.,  2.,  5.]])
np.concatenate([x,x], axis=1)
array([[ 0.,  1.,  4.,  0.,  1.,  4.],
       [ 1.,  2.,  5.,  1.,  2.,  5.]])
y = np.r_[x, x]
y
array([[ 0.,  1.,  4.],
       [ 1.,  2.,  5.],
       [ 0.,  1.,  4.],
       [ 1.,  2.,  5.]])
a, b, c = np.hsplit(y, 3)
a
array([[ 0.],
       [ 1.],
       [ 0.],
       [ 1.]])
b
array([[ 1.],
       [ 2.],
       [ 1.],
       [ 2.]])
c
array([[ 4.],
       [ 5.],
       [ 4.],
       [ 5.]])
np.vsplit(y, [3])
[array([[ 0.,  1.,  4.],
        [ 1.,  2.,  5.],
        [ 0.,  1.,  4.]]), array([[ 1.,  2.,  5.]])]
np.split(y, [3], axis=0)
[array([[ 0.,  1.,  4.],
        [ 1.,  2.,  5.],
        [ 0.,  1.,  4.]]), array([[ 1.,  2.,  5.]])]
np.hstack(np.hsplit(y, 3))
array([[ 0.,  1.,  4.],
       [ 1.,  2.,  5.],
       [ 0.,  1.,  4.],
       [ 1.,  2.,  5.]])

Reductions

y
array([[ 0.,  1.,  4.],
       [ 1.,  2.,  5.],
       [ 0.,  1.,  4.],
       [ 1.,  2.,  5.]])
y.sum()
26.0
y.sum(0) # column sum
array([  2.,   6.,  18.])
y.sum(1) # row sum
array([ 5.,  8.,  5.,  8.])

Standardize by column mean and standard deviation

z = (y - y.mean(0))/y.std(0)
z
array([[-1., -1., -1.],
       [ 1.,  1.,  1.],
       [-1., -1., -1.],
       [ 1.,  1.,  1.]])
z.mean(0), z.std(0)
(array([ 0.,  0.,  0.]), array([ 1.,  1.,  1.]))

Standardize by row mean and standard deviation

z = (y - y.mean(1)[:,None])/y.std(1)[:,None]
z
array([[-0.98058068, -0.39223227,  1.37281295],
       [-0.98058068, -0.39223227,  1.37281295],
       [-0.98058068, -0.39223227,  1.37281295],
       [-0.98058068, -0.39223227,  1.37281295]])
z.mean(1), z.std(1)
(array([ -7.40148683e-17,   7.40148683e-17,  -7.40148683e-17,
          7.40148683e-17]), array([ 1.,  1.,  1.,  1.]))

Example: Calculating pairwise distance matrix using broadcasting and vectorization

Calculate the pairwise distance matrix between the following points

  • (0,0)
  • (4,0)
  • (4,3)
  • (0,3)
def distance_matrix_py(pts):
    """Returns matrix of pairwise Euclidean distances. Pure Python version."""
    n = len(pts)
    p = len(pts[0])
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(p):
                s += (pts[i,k] - pts[j,k])**2
            m[i, j] = s**0.5
    return m
def distance_matrix_np(pts):
    """Returns matrix of pairwise Euclidean distances. Vectorized numpy version."""
    return np.sum((pts[None,:] - pts[:, None])**2, -1)**0.5
pts = np.array([(0,0), (4,0), (4,3), (0,3)])
pts
array([[0, 0],
       [4, 0],
       [4, 3],
       [0, 3]])
pts.shape
(4, 2)
n = pts.shape[0]
p = pts.shape[1]
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        s = 0
        for k in range(p):
            s += (pts[i, k] - pts[j, k])**2
        dist[i, j] = np.sqrt(s)
dist
array([[ 0.,  4.,  5.,  3.],
       [ 4.,  0.,  3.,  5.],
       [ 5.,  3.,  0.,  4.],
       [ 3.,  5.,  4.,  0.]])

Using broadcasting

pts[None, :].shape
(1, 4, 2)
pts[:, None].shape
(4, 1, 2)
m = pts[None, :] - pts[:, None]
m
array([[[ 0,  0],
        [ 4,  0],
        [ 4,  3],
        [ 0,  3]],

       [[-4,  0],
        [ 0,  0],
        [ 0,  3],
        [-4,  3]],

       [[-4, -3],
        [ 0, -3],
        [ 0,  0],
        [-4,  0]],

       [[ 0, -3],
        [ 4, -3],
        [ 4,  0],
        [ 0,  0]]])
m**2
array([[[ 0,  0],
        [16,  0],
        [16,  9],
        [ 0,  9]],

       [[16,  0],
        [ 0,  0],
        [ 0,  9],
        [16,  9]],

       [[16,  9],
        [ 0,  9],
        [ 0,  0],
        [16,  0]],

       [[ 0,  9],
        [16,  9],
        [16,  0],
        [ 0,  0]]])
(m**2).shape
(4, 4, 2)

We want to end up with a 4 by 4 matrix, so sum over the axis with dimension 2. This is axis=2, or axis=-1 since it is the first axis from the end.

np.sum((pts[None, :] - pts[:, None])**2, -1)
array([[ 0, 16, 25,  9],
       [16,  0,  9, 25],
       [25,  9,  0, 16],
       [ 9, 25, 16,  0]])

Basically, the distance matrix can be calculated in one line of numpy code

np.sqrt(np.sum((pts[None, :] - pts[:, None])**2, -1))
array([[ 0.,  4.,  5.,  3.],
       [ 4.,  0.,  3.,  5.],
       [ 5.,  3.,  0.,  4.],
       [ 3.,  5.,  4.,  0.]])

Let’s put them in functions and compare the time.

def pdist1(pts):
    n = pts.shape[0]
    p = pts.shape[1]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(p):
                s += (pts[i, k] - pts[j, k])**2
            dist[i, j] = s
    return np.sqrt(dist)
def pdist2(pts):
    return np.sqrt(np.sum((pts[None, :] - pts[:, None])**2, -1))

Check that the outputs are the same

np.alltrue(pdist1(pts) == pdist2(pts))
True
pts = np.random.random((1000, 2))
%timeit pdist1(pts)
1 loops, best of 3: 3.26 s per loop
%timeit pdist2(pts)
10 loops, best of 3: 77.3 ms per loop

But don’t give up on loops yet

from numba import njit
@njit
def pdist3(pts):
    n = pts.shape[0]
    p = pts.shape[1]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(p):
                s += (pts[i, k] - pts[j, k])**2
            dist[i, j] = s
    return np.sqrt(dist)
%timeit pdist3(pts)
The slowest run took 27.17 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 16.1 ms per loop

What is going on?

This is 3-5 times faster than the broadcasting version! We have just performed Just In Time (JIT) compilation of a function, which will be discussed in a later lecture.

Example: Consructing leave-one-out arrays

Another example of numpy trickery is to construct a leave-one-out matrix of a vector of length k. In the matrix, each row is a vector of length k-1, with a different vector component dropped each time. This can be used for LOOCV to evalaute the out-of-sample accuracy of a predictive model.

For example, suppose you have data points [(1,4), (2,7), (3,11), (4,9), (5,15)] that you want to perfrom LOOCV on for a simple regression model. For each cross-validation, you use one point for testing, and the remaining 4 points for training. In other words, you want the training set to be:

[(2,7), (3,11), (4,9), (5,15)]
[(1,4), (3,11), (4,9), (5,15)]
[(1,4), (2,7),  (4,9), (5,15)]
[(1,4), (2,7), (3,11), (5,15)]
[(1,4), (2,7), (3,11), (4,9)]

Here is one way to do create the training set using numpy tricks.

Create a triangular matrix with N rows, N-1 columns and offset from diagnonal by -1

N = 5
np.tri(N)
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.]])
np.tri(N, N-1)
array([[ 1.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])
np.tri(N, N-1, -1)
array([[ 0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.]])

Use broadcasting to create a new index matrix

np.arange(1, N)
array([1, 2, 3, 4])
np.arange(1, N) - np.tri(N, N-1, -1)
array([[ 1.,  2.,  3.,  4.],
       [ 0.,  2.,  3.,  4.],
       [ 0.,  1.,  3.,  4.],
       [ 0.,  1.,  2.,  4.],
       [ 0.,  1.,  2.,  3.]])
idx = np.arange(1, N) - np.tri(N, N-1, -1).astype('int')
data = np.array([(1,4), (2,7), (3,11), (4,9), (5,15)])
data
array([[ 1,  4],
       [ 2,  7],
       [ 3, 11],
       [ 4,  9],
       [ 5, 15]])
data[idx]
array([[[ 2,  7],
        [ 3, 11],
        [ 4,  9],
        [ 5, 15]],

       [[ 1,  4],
        [ 3, 11],
        [ 4,  9],
        [ 5, 15]],

       [[ 1,  4],
        [ 2,  7],
        [ 4,  9],
        [ 5, 15]],

       [[ 1,  4],
        [ 2,  7],
        [ 3, 11],
        [ 5, 15]],

       [[ 1,  4],
        [ 2,  7],
        [ 3, 11],
        [ 4,  9]]])

All but one

R uses negative indexing to mean delete the component at that index. Because Python uses negative indexing to mean count from the end, we have to do a little more work to get the same effect. Here are two ways of deleting one item from a vector.

def f1(a, k):
    idx = np.ones_like(a).astype('bool')
    idx[k] = 0
    return a[idx]
def f2(a, k):
    return np.r_[a[:k], a[k+1:]]
a = np.arange(100)
k = 50
%timeit f1(a, k)
The slowest run took 6.39 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 12.4 µs per loop
%timeit f2(a, k)
10000 loops, best of 3: 47.6 µs per loop

Universal functions (Ufuncs)

Functions that work on both scalars and arrays are known as ufuncs. For arrays, ufuncs apply the function in an element-wise fashion. Use of ufuncs is an esssential aspect of vectorization and typically much more computationally efficient than using an explicit loop over each element.

import matplotlib.pyplot as plt
%matplotlib inline

xs = np.linspace(0, 2*np.pi, 100)
ys = np.sin(xs) # np.sin is a universal function
plt.plot(xs, ys);
_images/05_Numbers_123_0.png

Generalized ufuncs

A universal function performs vectorized looping over scalars. A generalized ufunc performs looping over vectors or arrays. Currently, numpy only ships with a single generalized ufunc. However, they play an important role for JIT compilation with numba, a topic we will cover in future lectures.

from numpy.core.umath_tests import matrix_multiply

print(matrix_multiply.signature)
(m,n),(n,p)->(m,p)
us = np.random.random((5, 2, 3)) # 5 2x3 matrics
vs = np.random.random((5, 3, 4)) # 5 3x4 matrices
us
array([[[ 0.45041464,  0.73156889,  0.20199586],
        [ 0.07597661,  0.13069672,  0.57386177]],

       [[ 0.10830059,  0.26695388,  0.44054188],
        [ 0.57974703,  0.17978862,  0.52472549]],

       [[ 0.40794462,  0.35751635,  0.36870809],
        [ 0.63494551,  0.11960905,  0.51381859]],

       [[ 0.49510212,  0.46783668,  0.07856113],
        [ 0.28401281,  0.47199107,  0.54560703]],

       [[ 0.79458848,  0.43756637,  0.06759583],
        [ 0.40228528,  0.50838122,  0.56375008]]])
vs
array([[[  5.12266952e-02,   8.40637879e-01,   2.38644940e-01,
           4.35712252e-01],
        [  7.81217918e-01,   6.77203544e-01,   7.28623630e-01,
           8.70980358e-02],
        [  9.53278422e-01,   6.57880491e-01,   4.45387233e-01,
           5.54732924e-02]],

       [[  6.52946888e-01,   2.29030756e-01,   5.91273241e-01,
           8.29711164e-06],
        [  5.25251656e-01,   5.06625358e-01,   1.87996526e-01,
           3.57795468e-02],
        [  6.45830540e-01,   7.67992588e-01,   3.53764451e-01,
           8.93663158e-01]],

       [[  7.05223217e-01,   5.68551438e-01,   2.21699577e-01,
           3.66118249e-01],
        [  3.59834031e-01,   8.86827366e-01,   8.90595276e-01,
           9.32417623e-01],
        [  9.20454420e-01,   8.21082903e-01,   9.46367477e-01,
           6.79992096e-02]],

       [[  5.17735631e-01,   7.57666718e-01,   9.76354847e-01,
           8.23073506e-01],
        [  9.12873687e-01,   7.01160089e-01,   9.41861241e-01,
           5.75122377e-01],
        [  4.29768204e-01,   6.52651201e-01,   4.52733332e-01,
           7.76359616e-01]],

       [[  3.29293395e-01,   6.01805518e-01,   8.82878227e-01,
           9.58704548e-01],
        [  1.57352491e-01,   8.34085919e-01,   2.60621284e-01,
           1.61751295e-01],
        [  1.81993190e-01,   3.98928140e-01,   1.31517889e-01,
           4.99537484e-03]]])
# perform matrix multiplication for each of the 5 sets of matrices
ws = matrix_multiply(us, vs)
ws.shape
(5, 2, 4)
ws
array([[[ 0.78714627,  1.00694579,  0.73049393,  0.27117477],
        [ 0.65304469,  0.52990956,  0.36895086,  0.07632137]],

       [[ 0.4954479 ,  0.49838267,  0.2700697 ,  0.40324843],
        [ 0.81186204,  0.62685067,  0.56221777,  0.47536541]],

       [[ 0.75571755,  0.85173269,  0.75777686,  0.50778237],
        [ 0.96376431,  0.88895942,  0.73355161,  0.37892998]],

       [[ 0.71717088,  0.75442382,  0.95959983,  0.73756047],
        [ 0.81239634,  0.90221944,  0.96886187,  0.92880331]],

       [[ 0.34280688,  0.87012156,  0.82445404,  0.83289018],
        [ 0.31506361,  0.89102688,  0.5618071 ,  0.47072019]]])

Saving and loading NDArrays

Saving to and loading from text files

x1 = np.arange(1,10).reshape(3,3)
x1
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
np.savetxt('../data/x1.txt', x1)
!cat ../data/x1.txt
1.000000000000000000e+00 2.000000000000000000e+00 3.000000000000000000e+00
4.000000000000000000e+00 5.000000000000000000e+00 6.000000000000000000e+00
7.000000000000000000e+00 8.000000000000000000e+00 9.000000000000000000e+00
x2 = np.loadtxt('../data/x1.txt')
x2
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.]])

Saving to and loading from binary files (much faster and also preserves dtype)

np.save('../data/x1.npy', x1)
!cat ../data/x1.npy
�NUMPYF{'descr': '<i8', 'fortran_order': False, 'shape': (3, 3), }

x3 = np.load('../data/x1.npy')
x3
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Version information

%load_ext version_information
%version_information numpy, numba, matplotlib
SoftwareVersion
Python3.5.1 64bit [GCC 4.2.1 (Apple Inc. build 5577)]
IPython4.0.1
OSDarwin 15.2.0 x86_64 i386 64bit
numpy1.10.2
numba0.22.1
matplotlib1.5.1
Fri Jan 15 13:34:06 2016 EST