Algorithms and Data Structures¶
If you will be coding a lot, take an Algorithms and Data Structures course from a CS department or a MOOC; it will make you a better programmer. Here we only cover the basic concepts needed to write efficient Python code.
Data structures¶
Sequence containers¶
Searching a sequence container such as a list takes linear time, since in the worst case a full scan is required.
In [1]:
n = 100
xs = list(range(n))
%timeit -r3 -n10 n in xs
1.77 µs ± 32 ns per loop (mean ± std. dev. of 3 runs, 10 loops each)
In [2]:
n = 10000
xs = list(range(n))
%timeit -r3 -n10 n in xs
176 µs ± 7.8 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
Mapping containers¶
Searching takes (on average) constant time, since Python's sets and dicts are implemented with hash tables.
In [3]:
n = 100
xs = set(range(n))
%timeit -r3 -n10 n in xs
114 ns ± 20.3 ns per loop (mean ± std. dev. of 3 runs, 10 loops each)
In [4]:
n = 10000
xs = set(range(n))
%timeit -r3 -n10 n in xs
111 ns ± 16.4 ns per loop (mean ± std. dev. of 3 runs, 10 loops each)
Heap¶
A heap is a data structure often used to build efficient priority queues; the heapq module maintains a binary min-heap inside an ordinary list.
In [5]:
import heapq
In [6]:
h = []
xs = [5,1,3,2,4]
for x in xs:
    heapq.heappush(h, x)
    print(h)
[5]
[1, 5]
[1, 5, 3]
[1, 2, 3, 5]
[1, 2, 3, 5, 4]
In [7]:
heapq.nlargest(3, h)
Out[7]:
[5, 4, 3]
In [8]:
heapq.nsmallest(3, h)
Out[8]:
[1, 2, 3]
In [9]:
while h:
    x = heapq.heappop(h)
    print(x)
1
2
3
4
5
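Because heappush and heappop keep the smallest item at index 0, a heap gives a simple priority queue. A minimal sketch, using made-up (priority, task) tuples that compare by priority first:

import heapq

pq = []
heapq.heappush(pq, (2, 'write report'))
heapq.heappush(pq, (1, 'fix bug'))
heapq.heappush(pq, (3, 'refactor'))

while pq:
    priority, task = heapq.heappop(pq)   # lowest priority number comes out first
    print(priority, task)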
Collections¶
The collections module includes a few useful data structures.
named tuple¶
A named tuple is essentially as memory-efficient as a plain tuple, but more convenient because fields can be accessed by name.
In [10]:
import collections as c
In [11]:
Point = c.namedtuple('Point', ['x', 'y', 'z'])
In [12]:
p = Point(1,2,3)
In [13]:
p
Out[13]:
Point(x=1, y=2, z=3)
In [14]:
p.x, p.y, p.z
Out[14]:
(1, 2, 3)
In [15]:
Rec = c.namedtuple('Record', ['first', 'last', 'email'])
In [16]:
tom = Rec(email='john.doe@example.com', first='John', last='Doe')
In [17]:
tom[-1]
Out[17]:
'john.doe@example.com'
In [18]:
tom.email
Out[18]:
'john.doe@example.com'
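Named tuples also come with a few helper methods, for example _asdict to convert to a dictionary and _replace to build a modified copy (a short sketch using the record above):

tom._asdict()                        # dictionary view of the fields
tom2 = tom._replace(first='Tom')     # namedtuples are immutable, so _replace returns a new record
tom2.first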
counter¶
A Counter is a convenience class for counting hashable objects.
In [19]:
dna = 'AATTTAACCAAGGGATT'
dna_counter = c.Counter(dna)
dna_counter
Out[19]:
Counter({'A': 7, 'T': 5, 'C': 2, 'G': 3})
In [20]:
words = 'this old man he played one he played knick knack on my thumb'.split()
word_counter = c.Counter(words)
In [21]:
word_counter.most_common(2)
Out[21]:
[('he', 2), ('played', 2)]
deque¶
A double-ended queue, with constant-time appends and pops at both ends.
In [22]:
deq = c.deque([5,1,3,2,4])
In [23]:
deq
Out[23]:
deque([5, 1, 3, 2, 4])
In [24]:
deq.append(99)
deq
Out[24]:
deque([5, 1, 3, 2, 4, 99])
In [25]:
deq.appendleft(99)
deq
Out[25]:
deque([99, 5, 1, 3, 2, 4, 99])
In [26]:
deq.pop(), deq.pop()
Out[26]:
(99, 4)
In [27]:
deq.popleft(), deq.popleft()
Out[27]:
(99, 5)
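Because both ends support constant-time operations, a deque is the natural container for a FIFO queue, whereas popping from the front of a list is linear since the remaining items must be shifted. A rough comparison sketch (the helper functions are just for illustration; exact timings depend on the machine):

def drain_deque(n):
    q = c.deque(range(n))
    while q:
        q.popleft()          # O(1) per pop

def drain_list(n):
    q = list(range(n))
    while q:
        q.pop(0)             # O(n) per pop: the rest of the list is shifted left

%timeit -r3 -n1 drain_deque(100000)
%timeit -r3 -n1 drain_list(100000)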
The deque is also useful as a ring buffer when the maxlen argument is given.
In [28]:
ring = c.deque(maxlen=4)
In [29]:
for i in range(10):
    ring.appendleft(i)
    print(ring)
deque([0], maxlen=4)
deque([1, 0], maxlen=4)
deque([2, 1, 0], maxlen=4)
deque([3, 2, 1, 0], maxlen=4)
deque([4, 3, 2, 1], maxlen=4)
deque([5, 4, 3, 2], maxlen=4)
deque([6, 5, 4, 3], maxlen=4)
deque([7, 6, 5, 4], maxlen=4)
deque([8, 7, 6, 5], maxlen=4)
deque([9, 8, 7, 6], maxlen=4)
defaultdict¶
Provides a default value, produced by a factory function, when a key is not found.
In [30]:
d = c.defaultdict(list)
In [31]:
d['foo'].append(2)
In [32]:
d['bar'].extend([1,2,3])
In [33]:
d
Out[33]:
defaultdict(list, {'foo': [2], 'bar': [1, 2, 3]})
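A common use of defaultdict(list) is grouping items without first checking whether a key exists. For example, grouping words by their first letter (a small made-up sketch):

words = 'apple avocado banana blueberry cherry'.split()
groups = c.defaultdict(list)
for word in words:
    groups[word[0]].append(word)   # a missing key gets a fresh empty list automatically
dict(groups)                       # {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}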
ChainMap¶
A ChainMap lets you combine multiple dictionaries and access keys as though they were a single dictionary. If the same key appears in more than one dictionary, only the first occurrence is accessible.
In [34]:
d1 = dict(a=1, b=2, c=3)
d2 = dict(d=1, e=2, f=3)
d3 = dict(g=4, h=4, i=4)
In [35]:
cm = c.ChainMap(d1, d2, d3)
In [36]:
cm
Out[36]:
ChainMap({'a': 1, 'b': 2, 'c': 3}, {'d': 1, 'e': 2, 'f': 3}, {'g': 4, 'h': 4, 'i': 4})
In [37]:
cm['a'], cm['e'], cm['h']
Out[37]:
(1, 2, 4)
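A common use of ChainMap is layering configuration, where earlier maps override later ones. A small sketch with made-up settings:

defaults = {'color': 'blue', 'user': 'guest'}
overrides = {'user': 'alice'}
config = c.ChainMap(overrides, defaults)   # lookups try overrides first, then defaults
config['user'], config['color']            # ('alice', 'blue')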
Example¶
What numbers are present in both xs and ys?
In [38]:
import numpy as np
In [39]:
xs = np.random.randint(0, int(1e6), 10000)
ys = np.random.randint(0, int(1e6), 10000)
In [40]:
%%time
common = []
for x in xs:              # O(n^2): every x is compared against every y
    for y in ys:
        if y == x:
            common.append(y)
print(set(common))
{584714, 102410, 200726, 347162, 921120, 25637, 445477, 912432, 751154, 658485, 809027, 747592, 981072, 706129, 12888, 929378, 527460, 506983, 233580, 299632, 453744, 509045, 259703, 922744, 998011, 244861, 319103, 635520, 136839, 232073, 670864, 935057, 549520, 402582, 64663, 798361, 215195, 971419, 219811, 932523, 703153, 734386, 208564, 81589, 133307, 836284, 985290, 96975, 116950, 312540, 200926, 472289, 505571, 447204, 323310, 716545, 5898, 564495, 183065, 109850, 197406, 442655, 940837, 543526, 394536, 959274, 495403, 152364, 696111, 106803, 350517, 980800, 883523, 113988, 280396, 256338, 448346, 645978, 274781, 753504, 351074, 868724, 776577, 427908, 27013, 586119, 673682, 588179, 731542, 987032, 522138, 111518, 937888, 810407, 294828, 258992, 613818, 646081, 436164, 670667, 947665, 110037, 426473, 224747, 281580, 735225}
CPU times: user 16.6 s, sys: 14.4 ms, total: 16.6 s
Wall time: 16.6 s
In [41]:
%%time
common = []
for y in ys:
    if y in xs:           # still a linear scan of xs for each y, but executed in C by NumPy
        common.append(y)
print(set(common))
{102410, 584714, 200726, 347162, 921120, 25637, 445477, 912432, 751154, 658485, 809027, 747592, 981072, 706129, 12888, 929378, 527460, 506983, 233580, 299632, 453744, 509045, 259703, 922744, 998011, 244861, 319103, 635520, 136839, 232073, 549520, 670864, 935057, 402582, 64663, 798361, 971419, 215195, 219811, 932523, 703153, 734386, 208564, 81589, 133307, 836284, 985290, 96975, 116950, 312540, 200926, 472289, 505571, 447204, 323310, 716545, 5898, 564495, 183065, 109850, 197406, 442655, 940837, 543526, 394536, 959274, 495403, 152364, 696111, 106803, 350517, 980800, 883523, 113988, 280396, 256338, 448346, 645978, 274781, 753504, 351074, 868724, 776577, 427908, 27013, 586119, 673682, 588179, 731542, 987032, 522138, 111518, 937888, 810407, 294828, 258992, 613818, 646081, 436164, 670667, 947665, 110037, 426473, 224747, 281580, 735225}
CPU times: user 186 ms, sys: 2.5 ms, total: 188 ms
Wall time: 186 ms
In [42]:
%%time
common = set(xs).intersection(ys)
print(common)
{102410, 584714, 200726, 347162, 921120, 25637, 445477, 912432, 751154, 658485, 809027, 747592, 981072, 706129, 12888, 929378, 527460, 506983, 233580, 299632, 453744, 509045, 259703, 922744, 998011, 244861, 319103, 635520, 136839, 232073, 549520, 670864, 935057, 402582, 64663, 798361, 971419, 215195, 219811, 932523, 703153, 734386, 208564, 81589, 133307, 836284, 985290, 96975, 116950, 312540, 200926, 472289, 505571, 447204, 323310, 716545, 5898, 564495, 183065, 109850, 197406, 442655, 940837, 543526, 394536, 959274, 495403, 152364, 696111, 106803, 350517, 980800, 883523, 113988, 280396, 256338, 448346, 645978, 274781, 753504, 351074, 868724, 776577, 427908, 27013, 586119, 673682, 588179, 731542, 987032, 522138, 111518, 937888, 810407, 294828, 258992, 613818, 646081, 436164, 670667, 947665, 110037, 426473, 224747, 281580, 735225}
CPU times: user 2.73 ms, sys: 195 µs, total: 2.93 ms
Wall time: 2.91 ms
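Since xs and ys are NumPy arrays here, np.intersect1d is another fast option; it sorts internally, so it runs in roughly n log n time (a sketch):

common_np = np.intersect1d(xs, ys)   # sorted array of values present in both
set(common_np) == common             # should agree with the set-based result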
Algorithms¶
Big O complexity¶
A function \(f(n)\) has Big O complexity \(g(n)\) if \(\vert f(n) \vert \le M g(n)\) for some constant \(M\) and all sufficiently large \(n\). Common classes for \(g(n)\) in increasing order of complexity are
logarithmic \(\log n\)
linear \(n\)
\(n \log n\)
polynomial \(n^k\)
exponential \(e^n\)
factorial \(n!\)
Notes
Note 1: parallel processing in most cases gives at best a linear speedup.
Note 2: the constant factor can be important, especially for small to moderate values of \(n\).
In [43]:
from functools import reduce
In [44]:
def factorial(n):
    return reduce(lambda a, b: a * b, range(1, n+1))
In [45]:
for n in (5, 20, 50):
    print('n =', n)
    print('-'*40)
    print('log    ', int(np.log2(n)))
    print('linear ', n)
    print('n log n', int(n*np.log2(n)))
    print('n^2    ', n**2)
    print('n^3    ', n**3)
    print('e^n    ', int(np.exp(n)))
    print('n!     ', factorial(n))
    print()
n = 5
----------------------------------------
log 2
linear 5
n log n 11
n^2 25
n^3 125
e^n 148
n! 120
n = 20
----------------------------------------
log 4
linear 20
n log n 86
n^2 400
n^3 8000
e^n 485165195
n! 2432902008176640000
n = 50
----------------------------------------
log 5
linear 50
n log n 282
n^2 2500
n^3 125000
e^n 5184705528587072045056
n! 30414093201713378043612608166064768844377641568960512000000000000
Example¶
If you have to search a sequence container repeatedly, it is more efficient to first sort, then use a bisection algorithm.
The initial sort takes \(n \log n\) time
Each subsequent search takes \(\log n\)
For \(k\) searches, the total time is \(n \log n + k \log n\), versus roughly \(k \times n/2\) for repeated linear scans
In [46]:
testx = np.random.randint(0, 2*n, 1000)  # n is still 50 from the loop above, so every test value is below 100
In [47]:
%%time
n = 10000
xs = list(range(n))
hits = 0
for x in testx:
    if x in xs:
        hits += 1
print(hits)
1000
CPU times: user 8.5 ms, sys: 496 µs, total: 8.99 ms
Wall time: 8.66 ms
In [48]:
import bisect
In [49]:
%%time
n = 10000
xs = list(range(n))
xs.sort()
hits = 0
for x in testx:
    i = bisect.bisect_left(xs, x)     # index where x would be inserted to keep xs sorted
    if i != n and xs[i] == x:         # x is present only if the element at that index equals x
        hits += 1
print(hits)
1000
CPU times: user 3.68 ms, sys: 688 µs, total: 4.37 ms
Wall time: 3.87 ms
Recursion¶
You should know how to code recursive functions. You should also be aware of how inefficient they can be, and how you can use memoization (caching) to speed up some recursive functions (e.g. using functools.lru_cache). Try to implement the following simple recursive functions.
In [50]:
def rfac(n):
    """Find the nth factorial recursively."""
In [51]:
def rsum(xs):
    """Sum the values in xs recursively."""
In [52]:
def rfib(n):
    """Find the nth Fibonacci number recursively."""
In [53]:
def rmap(f, xs):
    """Implement a map function recursively.

    f is a function that takes a single argument.
    xs is a list."""
What is the Big O complexity of rfac with and without caching?
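For reference, a minimal sketch of memoization with functools.lru_cache, applied here to a recursive Fibonacci: without the cache the call tree grows exponentially, with it each value is computed only once.

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """Return the nth Fibonacci number, caching intermediate results."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

fib(100)   # fast, since each fib(k) is computed only once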
Sorting algorithms¶
Generally, use the sort functions provided by the language (e.g. list.sort, sorted). However, writing sort functions is a useful way to illustrate algorithmic concepts such as in-place editing, recursion and algorithmic complexity, and you should know how to write simple sort functions.
How much memory does the sort use?
What is its big O complexity class?
Is it iterative or recursive? (Note: every recursive algorithm has an iterative equivalent, but some, e.g. quicksort, are easier to think of recursively.)
Bubble sort¶
In [54]:
def bubblesort(xs):
    """Bubble sort (simplified variant).

    Swaps xs[i] with any later element that is smaller; the classic
    formulation swaps adjacent pairs instead (see the sketch below).
    """
    n = len(xs)
    for i in range(n):
        print(xs)
        for j in range(i+1, n):
            if xs[i] > xs[j]:
                xs[i], xs[j] = xs[j], xs[i]
In [55]:
xs = [5,1,3,4,2]
bubblesort(xs)
xs
[5, 1, 3, 4, 2]
[1, 5, 3, 4, 2]
[1, 2, 5, 4, 3]
[1, 2, 3, 5, 4]
[1, 2, 3, 4, 5]
Out[55]:
[1, 2, 3, 4, 5]
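For comparison, a sketch of the classic bubble sort, which repeatedly swaps adjacent out-of-order pairs so that the largest remaining value "bubbles" to the end on each pass:

def bubblesort_classic(xs):
    """Classic bubble sort: swap adjacent out-of-order pairs."""
    n = len(xs)
    for i in range(n):
        for j in range(n - 1 - i):          # the last i items are already in place
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]

xs = [5, 1, 3, 4, 2]
bubblesort_classic(xs)
xs   # [1, 2, 3, 4, 5]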
Selection sort¶
In [56]:
def selectionsort(xs):
    """Selection sort."""
    n = len(xs)
    for i in range(n):
        best = xs[i]
        idx = i
        print(xs)
        for j in range(i+1, n):
            if xs[j] < best:
                best = xs[j]
                idx = j
        xs[i], xs[idx] = xs[idx], xs[i]
In [57]:
xs = [5,1,3,4,2]
selectionsort(xs)
xs
[5, 1, 3, 4, 2]
[1, 5, 3, 4, 2]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
Out[57]:
[1, 2, 3, 4, 5]
Quicksort¶
In [58]:
def quicksort(xs):
    """Quicksort."""
    if len(xs) < 2:
        return xs
    else:
        pivot = xs[0]
        left = [x for x in xs[1:] if x <= pivot]
        right = [x for x in xs[1:] if x > pivot]
        print(pivot, left, right)
        return quicksort(left) + [pivot] + quicksort(right)
In [59]:
xs = [5,1,3,4,2]
quicksort(xs)
5 [1, 3, 4, 2] []
1 [] [3, 4, 2]
3 [2] [4]
Out[59]:
[1, 2, 3, 4, 5]
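Because the pivot is always the first element, this version degrades to quadratic time on already-sorted input, where every partition is maximally unbalanced. Choosing a random pivot is one common fix; a hedged sketch of that variation:

import random

def quicksort_random(xs):
    """Quicksort with a random pivot to avoid the worst case on sorted input."""
    if len(xs) < 2:
        return xs
    pivot = random.choice(xs)
    left = [x for x in xs if x < pivot]
    mid = [x for x in xs if x == pivot]     # keep duplicates of the pivot together
    right = [x for x in xs if x > pivot]
    return quicksort_random(left) + mid + quicksort_random(right)

quicksort_random([5, 1, 3, 4, 2])   # [1, 2, 3, 4, 5]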
Memory usage¶
In [60]:
import sys
In [61]:
xs = np.random.randint(0, 10, (100,100))
In [62]:
sys.getsizeof(xs)
Out[62]:
80112
In [63]:
xs.nbytes
Out[63]:
80000
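The difference of about 112 bytes is the array object's header; nbytes counts only the data buffer. Note also that for plain Python containers sys.getsizeof reports only the container itself, not the objects it references, so for a list you may want to add up the elements as well (a rough sketch):

big = list(range(1000))
sys.getsizeof(big)                                          # the list object and its array of pointers only
sys.getsizeof(big) + sum(sys.getsizeof(x) for x in big)     # rough total including the int objects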
Timing¶
In [64]:
from time import sleep
In [65]:
%time sleep(0.1)
CPU times: user 8.4 ms, sys: 2.76 ms, total: 11.2 ms
Wall time: 104 ms
In [66]:
%%time
sleep(0.1)
CPU times: user 1.29 ms, sys: 1.25 ms, total: 2.54 ms
Wall time: 101 ms
In [67]:
%timeit sleep(0.1)
102 ms ± 702 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [68]:
%timeit -r 3 -n 10 sleep(0.1)
103 ms ± 673 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
In [69]:
from timeit import timeit
In [70]:
t = timeit('from time import sleep; sleep(0.1)', number=1)
t
Out[70]:
0.1012009109999994
In [71]:
t = timeit(lambda: sleep(0.1), number=1)
t
Out[71]:
0.10307493699999881
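Outside IPython, time.perf_counter is a simple way to time a block of code by hand:

from time import perf_counter, sleep

start = perf_counter()
sleep(0.1)
elapsed = perf_counter() - start   # elapsed wall-clock time in seconds
elapsed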