Using numpy
¶
The foundation for numerical computation in Python is the numpy
package, and essentially all scientific libraries in Python build on
this - e.g. scipy
, pandas
, statsmodels
, scikit-learn
,
cv2
etc. The basic data structure in numpy
is the NDArray, and
it is essential to become familiar with how to slice and dice this
object.
Numpy also has the random
, and linalg
modules that we will
discuss in later lectures.
Resources¶
- Numpy for R users
- NumPy: creating and manipulating numerical data
- Advanced Numpy
- 100 Numpy Exercises
NDArray¶
The base structure in numpy
is ndarray
, used to represent
vectors, matrices and higher-dimensional arrays. Each ndarray
has
the following attributes:
- dtype = corresponds to data types in C
- shape = dimensions of array
- strides = number of bytes to step in each direction when traversing the array
x = np.array([1,2,3,4,5,6])
print(x)
print('dytpe', x.dtype)
print('shape', x.shape)
print('strides', x.strides)
[1 2 3 4 5 6]
dytpe int64
shape (6,)
strides (8,)
x.shape = (2,3)
print(x)
print('dytpe', x.dtype)
print('shape', x.shape)
print('strides', x.strides)
[[1 2 3]
[4 5 6]]
dytpe int64
shape (2, 3)
strides (24, 8)
x = x.astype('complex')
print(x)
print('dytpe', x.dtype)
print('shape', x.shape)
print('strides', x.strides)
[[ 1.+0.j 2.+0.j 3.+0.j]
[ 4.+0.j 5.+0.j 6.+0.j]]
dytpe complex128
shape (2, 3)
strides (48, 16)
Array creation¶
np.array([1,2,3])
array([1, 2, 3])
np.array([1,2,3], np.float64)
array([ 1., 2., 3.])
np.arange(3)
array([0, 1, 2])
np.arange(3, 6, 0.5)
array([ 3. , 3.5, 4. , 4.5, 5. , 5.5])
np.array([[1,2,3],[4,5,6]])
array([[1, 2, 3],
[4, 5, 6]])
np.ones(3)
array([ 1., 1., 1.])
np.zeros((3,4))
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
np.eye(4)
array([[ 1., 0., 0., 0.],
[ 0., 1., 0., 0.],
[ 0., 0., 1., 0.],
[ 0., 0., 0., 1.]])
np.diag([1,2,3,4])
array([[1, 0, 0, 0],
[0, 2, 0, 0],
[0, 0, 3, 0],
[0, 0, 0, 4]])
np.fromfunction(lambda i, j: i**2+j**2, (4,5))
array([[ 0., 1., 4., 9., 16.],
[ 1., 2., 5., 10., 17.],
[ 4., 5., 8., 13., 20.],
[ 9., 10., 13., 18., 25.]])
Array manipulation¶
x = np.fromfunction(lambda i, j: i**2+j**2, (4,5))
x
array([[ 0., 1., 4., 9., 16.],
[ 1., 2., 5., 10., 17.],
[ 4., 5., 8., 13., 20.],
[ 9., 10., 13., 18., 25.]])
x.shape
(4, 5)
x.size
20
x.dtype
dtype('float64')
x.astype(np.int64)
array([[ 0, 1, 4, 9, 16],
[ 1, 2, 5, 10, 17],
[ 4, 5, 8, 13, 20],
[ 9, 10, 13, 18, 25]])
x.T
array([[ 0., 1., 4., 9.],
[ 1., 2., 5., 10.],
[ 4., 5., 8., 13.],
[ 9., 10., 13., 18.],
[ 16., 17., 20., 25.]])
x.reshape(2,-1)
array([[ 0., 1., 4., 9., 16., 1., 2., 5., 10., 17.],
[ 4., 5., 8., 13., 20., 9., 10., 13., 18., 25.]])
Array indexing¶
x
array([[ 0., 1., 4., 9., 16.],
[ 1., 2., 5., 10., 17.],
[ 4., 5., 8., 13., 20.],
[ 9., 10., 13., 18., 25.]])
x[0]
array([ 0., 1., 4., 9., 16.])
x[0,:]
array([ 0., 1., 4., 9., 16.])
x[:,0]
array([ 0., 1., 4., 9.])
x[-1]
array([ 9., 10., 13., 18., 25.])
x[1,1]
2.0
x[:, 1:3]
array([[ 1., 4.],
[ 2., 5.],
[ 5., 8.],
[ 10., 13.]])
Boolean indexing¶
x >= 2
array([[False, False, True, True, True],
[False, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]], dtype=bool)
x[x > 2]
array([ 4., 9., 16., 5., 10., 17., 4., 5., 8., 13., 20.,
9., 10., 13., 18., 25.])
Calculations and broadcasting¶
Broadcasting refers to the set of rules that numpy uses to perfrom
operations on arrays with different shapes. See official
documentation
for a clear explanation of the rules. Array shapes can be manipulated
using the reshape
method or by inserting a new axis with
np.newaxis
. Note that np.newaxis
is an alias for None
, which
I sometimes use in my examples.
x = np.fromfunction(lambda i, j: i**2+j**2, (2,3))
x
array([[ 0., 1., 4.],
[ 1., 2., 5.]])
x * 5
array([[ 0., 5., 20.],
[ 5., 10., 25.]])
x + x
array([[ 0., 2., 8.],
[ 2., 4., 10.]])
x @ x.T
array([[ 17., 22.],
[ 22., 30.]])
x.T @ x
array([[ 1., 2., 5.],
[ 2., 5., 14.],
[ 5., 14., 41.]])
np.log1p(x)
array([[ 0. , 0.69314718, 1.60943791],
[ 0.69314718, 1.09861229, 1.79175947]])
np.exp(x)
array([[ 1. , 2.71828183, 54.59815003],
[ 2.71828183, 7.3890561 , 148.4131591 ]])
Combining and splitting arrays¶
x
array([[ 0., 1., 4.],
[ 1., 2., 5.]])
np.r_[x, x]
array([[ 0., 1., 4.],
[ 1., 2., 5.],
[ 0., 1., 4.],
[ 1., 2., 5.]])
np.vstack([x, x])
array([[ 0., 1., 4.],
[ 1., 2., 5.],
[ 0., 1., 4.],
[ 1., 2., 5.]])
np.concatenate([x, x], axis=0)
array([[ 0., 1., 4.],
[ 1., 2., 5.],
[ 0., 1., 4.],
[ 1., 2., 5.]])
np.c_[x,x]
array([[ 0., 1., 4., 0., 1., 4.],
[ 1., 2., 5., 1., 2., 5.]])
np.hstack([x, x])
array([[ 0., 1., 4., 0., 1., 4.],
[ 1., 2., 5., 1., 2., 5.]])
np.concatenate([x,x], axis=1)
array([[ 0., 1., 4., 0., 1., 4.],
[ 1., 2., 5., 1., 2., 5.]])
y = np.r_[x, x]
y
array([[ 0., 1., 4.],
[ 1., 2., 5.],
[ 0., 1., 4.],
[ 1., 2., 5.]])
a, b, c = np.hsplit(y, 3)
a
array([[ 0.],
[ 1.],
[ 0.],
[ 1.]])
b
array([[ 1.],
[ 2.],
[ 1.],
[ 2.]])
c
array([[ 4.],
[ 5.],
[ 4.],
[ 5.]])
np.vsplit(y, [3])
[array([[ 0., 1., 4.],
[ 1., 2., 5.],
[ 0., 1., 4.]]), array([[ 1., 2., 5.]])]
np.split(y, [3], axis=0)
[array([[ 0., 1., 4.],
[ 1., 2., 5.],
[ 0., 1., 4.]]), array([[ 1., 2., 5.]])]
np.hstack(np.hsplit(y, 3))
array([[ 0., 1., 4.],
[ 1., 2., 5.],
[ 0., 1., 4.],
[ 1., 2., 5.]])
Reductions¶
y
array([[ 0., 1., 4.],
[ 1., 2., 5.],
[ 0., 1., 4.],
[ 1., 2., 5.]])
y.sum()
26.0
y.sum(0) # column sum
array([ 2., 6., 18.])
y.sum(1) # row sum
array([ 5., 8., 5., 8.])
Standardize by column mean and standard deviation¶
z = (y - y.mean(0))/y.std(0)
z
array([[-1., -1., -1.],
[ 1., 1., 1.],
[-1., -1., -1.],
[ 1., 1., 1.]])
z.mean(0), z.std(0)
(array([ 0., 0., 0.]), array([ 1., 1., 1.]))
Standardize by row mean and standard deviation¶
z = (y - y.mean(1)[:,None])/y.std(1)[:,None]
z
array([[-0.98058068, -0.39223227, 1.37281295],
[-0.98058068, -0.39223227, 1.37281295],
[-0.98058068, -0.39223227, 1.37281295],
[-0.98058068, -0.39223227, 1.37281295]])
z.mean(1), z.std(1)
(array([ -7.40148683e-17, 7.40148683e-17, -7.40148683e-17,
7.40148683e-17]), array([ 1., 1., 1., 1.]))
Example: Calculating pairwise distance matrix using broadcasting and vectorization¶
Calculate the pairwise distance matrix between the following points
- (0,0)
- (4,0)
- (4,3)
- (0,3)
def distance_matrix_py(pts):
"""Returns matrix of pairwise Euclidean distances. Pure Python version."""
n = len(pts)
p = len(pts[0])
m = np.zeros((n, n))
for i in range(n):
for j in range(n):
s = 0
for k in range(p):
s += (pts[i,k] - pts[j,k])**2
m[i, j] = s**0.5
return m
def distance_matrix_np(pts):
"""Returns matrix of pairwise Euclidean distances. Vectorized numpy version."""
return np.sum((pts[None,:] - pts[:, None])**2, -1)**0.5
pts = np.array([(0,0), (4,0), (4,3), (0,3)])
pts
array([[0, 0],
[4, 0],
[4, 3],
[0, 3]])
pts.shape
(4, 2)
n = pts.shape[0]
p = pts.shape[1]
dist = np.zeros((n, n))
for i in range(n):
for j in range(n):
s = 0
for k in range(p):
s += (pts[i, k] - pts[j, k])**2
dist[i, j] = np.sqrt(s)
dist
array([[ 0., 4., 5., 3.],
[ 4., 0., 3., 5.],
[ 5., 3., 0., 4.],
[ 3., 5., 4., 0.]])
Using broadcasting¶
pts[None, :].shape
(1, 4, 2)
pts[:, None].shape
(4, 1, 2)
m = pts[None, :] - pts[:, None]
m
array([[[ 0, 0],
[ 4, 0],
[ 4, 3],
[ 0, 3]],
[[-4, 0],
[ 0, 0],
[ 0, 3],
[-4, 3]],
[[-4, -3],
[ 0, -3],
[ 0, 0],
[-4, 0]],
[[ 0, -3],
[ 4, -3],
[ 4, 0],
[ 0, 0]]])
m**2
array([[[ 0, 0],
[16, 0],
[16, 9],
[ 0, 9]],
[[16, 0],
[ 0, 0],
[ 0, 9],
[16, 9]],
[[16, 9],
[ 0, 9],
[ 0, 0],
[16, 0]],
[[ 0, 9],
[16, 9],
[16, 0],
[ 0, 0]]])
(m**2).shape
(4, 4, 2)
We want to end up with a 4 by 4 matrix, so sum over the axis with dimension 2. This is axis=2, or axis=-1 since it is the first axis from the end.
np.sum((pts[None, :] - pts[:, None])**2, -1)
array([[ 0, 16, 25, 9],
[16, 0, 9, 25],
[25, 9, 0, 16],
[ 9, 25, 16, 0]])
Basically, the distance matrix can be calculated in one line of numpy code¶
np.sqrt(np.sum((pts[None, :] - pts[:, None])**2, -1))
array([[ 0., 4., 5., 3.],
[ 4., 0., 3., 5.],
[ 5., 3., 0., 4.],
[ 3., 5., 4., 0.]])
Let’s put them in functions and compare the time.
def pdist1(pts):
n = pts.shape[0]
p = pts.shape[1]
dist = np.zeros((n, n))
for i in range(n):
for j in range(n):
s = 0
for k in range(p):
s += (pts[i, k] - pts[j, k])**2
dist[i, j] = s
return np.sqrt(dist)
def pdist2(pts):
return np.sqrt(np.sum((pts[None, :] - pts[:, None])**2, -1))
Check that the outputs are the same¶
np.alltrue(pdist1(pts) == pdist2(pts))
True
pts = np.random.random((1000, 2))
%timeit pdist1(pts)
1 loops, best of 3: 3.26 s per loop
%timeit pdist2(pts)
10 loops, best of 3: 77.3 ms per loop
But don’t give up on loops yet¶
from numba import njit
@njit
def pdist3(pts):
n = pts.shape[0]
p = pts.shape[1]
dist = np.zeros((n, n))
for i in range(n):
for j in range(n):
s = 0
for k in range(p):
s += (pts[i, k] - pts[j, k])**2
dist[i, j] = s
return np.sqrt(dist)
%timeit pdist3(pts)
The slowest run took 27.17 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 16.1 ms per loop
What is going on?¶
This is 3-5 times faster than the broadcasting version! We have just performed Just In Time (JIT) compilation of a function, which will be discussed in a later lecture.
Example: Consructing leave-one-out arrays¶
Another example of numpy trickery is to construct a leave-one-out matrix of a vector of length k. In the matrix, each row is a vector of length k-1, with a different vector component dropped each time. This can be used for LOOCV to evalaute the out-of-sample accuracy of a predictive model.
For example, suppose you have data points [(1,4), (2,7), (3,11), (4,9), (5,15)] that you want to perfrom LOOCV on for a simple regression model. For each cross-validation, you use one point for testing, and the remaining 4 points for training. In other words, you want the training set to be:
[(2,7), (3,11), (4,9), (5,15)]
[(1,4), (3,11), (4,9), (5,15)]
[(1,4), (2,7), (4,9), (5,15)]
[(1,4), (2,7), (3,11), (5,15)]
[(1,4), (2,7), (3,11), (4,9)]
Here is one way to do create the training set using numpy tricks.
Create a triangular matrix with N rows, N-1 columns and offset from diagnonal by -1
N = 5
np.tri(N)
array([[ 1., 0., 0., 0., 0.],
[ 1., 1., 0., 0., 0.],
[ 1., 1., 1., 0., 0.],
[ 1., 1., 1., 1., 0.],
[ 1., 1., 1., 1., 1.]])
np.tri(N, N-1)
array([[ 1., 0., 0., 0.],
[ 1., 1., 0., 0.],
[ 1., 1., 1., 0.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]])
np.tri(N, N-1, -1)
array([[ 0., 0., 0., 0.],
[ 1., 0., 0., 0.],
[ 1., 1., 0., 0.],
[ 1., 1., 1., 0.],
[ 1., 1., 1., 1.]])
Use broadcasting to create a new index matrix
np.arange(1, N)
array([1, 2, 3, 4])
np.arange(1, N) - np.tri(N, N-1, -1)
array([[ 1., 2., 3., 4.],
[ 0., 2., 3., 4.],
[ 0., 1., 3., 4.],
[ 0., 1., 2., 4.],
[ 0., 1., 2., 3.]])
idx = np.arange(1, N) - np.tri(N, N-1, -1).astype('int')
data = np.array([(1,4), (2,7), (3,11), (4,9), (5,15)])
data
array([[ 1, 4],
[ 2, 7],
[ 3, 11],
[ 4, 9],
[ 5, 15]])
data[idx]
array([[[ 2, 7],
[ 3, 11],
[ 4, 9],
[ 5, 15]],
[[ 1, 4],
[ 3, 11],
[ 4, 9],
[ 5, 15]],
[[ 1, 4],
[ 2, 7],
[ 4, 9],
[ 5, 15]],
[[ 1, 4],
[ 2, 7],
[ 3, 11],
[ 5, 15]],
[[ 1, 4],
[ 2, 7],
[ 3, 11],
[ 4, 9]]])
All but one¶
R uses negative indexing to mean delete the component at that index. Because Python uses negative indexing to mean count from the end, we have to do a little more work to get the same effect. Here are two ways of deleting one item from a vector.
def f1(a, k):
idx = np.ones_like(a).astype('bool')
idx[k] = 0
return a[idx]
def f2(a, k):
return np.r_[a[:k], a[k+1:]]
a = np.arange(100)
k = 50
%timeit f1(a, k)
The slowest run took 6.39 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 12.4 µs per loop
%timeit f2(a, k)
10000 loops, best of 3: 47.6 µs per loop
Universal functions (Ufuncs)¶
Functions that work on both scalars and arrays are known as ufuncs. For arrays, ufuncs apply the function in an element-wise fashion. Use of ufuncs is an esssential aspect of vectorization and typically much more computationally efficient than using an explicit loop over each element.
import matplotlib.pyplot as plt
%matplotlib inline
xs = np.linspace(0, 2*np.pi, 100)
ys = np.sin(xs) # np.sin is a universal function
plt.plot(xs, ys);
Generalized ufuncs¶
A universal function performs vectorized looping over scalars. A
generalized ufunc performs looping over vectors or arrays. Currently,
numpy only ships with a single generalized ufunc. However, they play an
important role for JIT compilation with numba
, a topic we will cover
in future lectures.
from numpy.core.umath_tests import matrix_multiply
print(matrix_multiply.signature)
(m,n),(n,p)->(m,p)
us = np.random.random((5, 2, 3)) # 5 2x3 matrics
vs = np.random.random((5, 3, 4)) # 5 3x4 matrices
us
array([[[ 0.45041464, 0.73156889, 0.20199586],
[ 0.07597661, 0.13069672, 0.57386177]],
[[ 0.10830059, 0.26695388, 0.44054188],
[ 0.57974703, 0.17978862, 0.52472549]],
[[ 0.40794462, 0.35751635, 0.36870809],
[ 0.63494551, 0.11960905, 0.51381859]],
[[ 0.49510212, 0.46783668, 0.07856113],
[ 0.28401281, 0.47199107, 0.54560703]],
[[ 0.79458848, 0.43756637, 0.06759583],
[ 0.40228528, 0.50838122, 0.56375008]]])
vs
array([[[ 5.12266952e-02, 8.40637879e-01, 2.38644940e-01,
4.35712252e-01],
[ 7.81217918e-01, 6.77203544e-01, 7.28623630e-01,
8.70980358e-02],
[ 9.53278422e-01, 6.57880491e-01, 4.45387233e-01,
5.54732924e-02]],
[[ 6.52946888e-01, 2.29030756e-01, 5.91273241e-01,
8.29711164e-06],
[ 5.25251656e-01, 5.06625358e-01, 1.87996526e-01,
3.57795468e-02],
[ 6.45830540e-01, 7.67992588e-01, 3.53764451e-01,
8.93663158e-01]],
[[ 7.05223217e-01, 5.68551438e-01, 2.21699577e-01,
3.66118249e-01],
[ 3.59834031e-01, 8.86827366e-01, 8.90595276e-01,
9.32417623e-01],
[ 9.20454420e-01, 8.21082903e-01, 9.46367477e-01,
6.79992096e-02]],
[[ 5.17735631e-01, 7.57666718e-01, 9.76354847e-01,
8.23073506e-01],
[ 9.12873687e-01, 7.01160089e-01, 9.41861241e-01,
5.75122377e-01],
[ 4.29768204e-01, 6.52651201e-01, 4.52733332e-01,
7.76359616e-01]],
[[ 3.29293395e-01, 6.01805518e-01, 8.82878227e-01,
9.58704548e-01],
[ 1.57352491e-01, 8.34085919e-01, 2.60621284e-01,
1.61751295e-01],
[ 1.81993190e-01, 3.98928140e-01, 1.31517889e-01,
4.99537484e-03]]])
# perform matrix multiplication for each of the 5 sets of matrices
ws = matrix_multiply(us, vs)
ws.shape
(5, 2, 4)
ws
array([[[ 0.78714627, 1.00694579, 0.73049393, 0.27117477],
[ 0.65304469, 0.52990956, 0.36895086, 0.07632137]],
[[ 0.4954479 , 0.49838267, 0.2700697 , 0.40324843],
[ 0.81186204, 0.62685067, 0.56221777, 0.47536541]],
[[ 0.75571755, 0.85173269, 0.75777686, 0.50778237],
[ 0.96376431, 0.88895942, 0.73355161, 0.37892998]],
[[ 0.71717088, 0.75442382, 0.95959983, 0.73756047],
[ 0.81239634, 0.90221944, 0.96886187, 0.92880331]],
[[ 0.34280688, 0.87012156, 0.82445404, 0.83289018],
[ 0.31506361, 0.89102688, 0.5618071 , 0.47072019]]])
Saving and loading NDArrays¶
Saving to and loading from text files¶
x1 = np.arange(1,10).reshape(3,3)
x1
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
np.savetxt('../data/x1.txt', x1)
!cat ../data/x1.txt
1.000000000000000000e+00 2.000000000000000000e+00 3.000000000000000000e+00
4.000000000000000000e+00 5.000000000000000000e+00 6.000000000000000000e+00
7.000000000000000000e+00 8.000000000000000000e+00 9.000000000000000000e+00
x2 = np.loadtxt('../data/x1.txt')
x2
array([[ 1., 2., 3.],
[ 4., 5., 6.],
[ 7., 8., 9.]])
Saving to and loading from binary files (much faster and also preserves dtype)¶
np.save('../data/x1.npy', x1)
!cat ../data/x1.npy
�NUMPYF{'descr': '<i8', 'fortran_order': False, 'shape': (3, 3), }
x3 = np.load('../data/x1.npy')
x3
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Version information¶
%load_ext version_information
%version_information numpy, numba, matplotlib
Software | Version |
---|---|
Python | 3.5.1 64bit [GCC 4.2.1 (Apple Inc. build 5577)] |
IPython | 4.0.1 |
OS | Darwin 15.2.0 x86_64 i386 64bit |
numpy | 1.10.2 |
numba | 0.22.1 |
matplotlib | 1.5.1 |
Fri Jan 15 13:34:06 2016 EST |