Algorithms and Data Structures

If you will be coding a lot, take an Algorithms and Data Structures course from a CS department or a MOOC; it will make you a better programmer. Here we only cover the basic concepts needed to write efficient Python code.

Data structures

Sequence containers

Searching a sequence container such as a list takes linear time, since a full scan may be required. In the examples below, n is not in list(range(n)) (which contains 0 to n-1), so the scan is the worst case.

In [1]:
n = 100
xs  = list(range(n))
%timeit -r3 -n10 n in xs
1.77 µs ± 32 ns per loop (mean ± std. dev. of 3 runs, 10 loops each)
In [2]:
n = 10000
xs  = list(range(n))
%timeit -r3 -n10 n in xs
176 µs ± 7.8 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)

Mapping and set containers

Membership testing takes constant time on average, since Python's sets and dictionaries are implemented as hash tables.

In [3]:
n = 100
xs  = set(range(n))
%timeit -r3 -n10 n in xs
114 ns ± 20.3 ns per loop (mean ± std. dev. of 3 runs, 10 loops each)
In [4]:
n = 10000
xs  = set(range(n))
%timeit -r3 -n10 n in xs
111 ns ± 16.4 ns per loop (mean ± std. dev. of 3 runs, 10 loops each)

Heap

A heap is a binary tree in which each parent is no larger than its children (heapq implements a min-heap on top of a list); it is often used to build efficient priority queues. A small priority-queue sketch follows the demo below.

In [5]:
import heapq
In [6]:
h = []
xs = [5,1,3,2,4]
for x in xs:
    heapq.heappush(h, x)
    print(h)
[5]
[1, 5]
[1, 5, 3]
[1, 2, 3, 5]
[1, 2, 3, 5, 4]
In [7]:
heapq.nlargest(3, h)
Out[7]:
[5, 4, 3]
In [8]:
heapq.nsmallest(3, h)
Out[8]:
[1, 2, 3]
In [9]:
while h:
    x = heapq.heappop(h)
    print(x)
1
2
3
4
5
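
As a concrete illustration of the priority-queue use mentioned above, here is a minimal sketch that pushes (priority, task) tuples onto a heap and pops them back in priority order; the task names are made up for the example.

import heapq

pq = []
for item in [(2, 'write report'), (1, 'fix bug'), (3, 'refactor')]:
    heapq.heappush(pq, item)            # tuples compare on priority first

while pq:
    priority, task = heapq.heappop(pq)  # smallest priority comes out first
    print(priority, task)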

Collections

The collections module includes a few useful data structures.

named tuple

A named tuple has essentially the same memory footprint as a regular tuple, but is more convenient because fields can be accessed by name as well as by index.

In [10]:
import collections as c
In [11]:
Point = c.namedtuple('Point', ['x', 'y', 'z'])
In [12]:
p = Point(1,2,3)
In [13]:
p
Out[13]:
Point(x=1, y=2, z=3)
In [14]:
p.x, p.y, p.z
Out[14]:
(1, 2, 3)
In [15]:
Rec = c.namedtuple('Record', ['first', 'last', 'email'])
In [16]:
john = Rec(email='john.doe@example.com', first='John', last='Doe')
In [17]:
john[-1]
Out[17]:
'john.doe@example.com'
In [18]:
john.email
Out[18]:
'john.doe@example.com'
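
Named tuples are immutable, but the namedtuple API also provides _replace to build a modified copy and _asdict to convert to a dictionary; a short self-contained sketch:

import collections as c

Point = c.namedtuple('Point', ['x', 'y', 'z'])
p = Point(1, 2, 3)
print(p._replace(z=10))   # new Point with z changed; p itself is unchanged
print(p._asdict())        # mapping of field names to values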

counter

Counter is a dict subclass that provides a convenient way to count hashable items.

In [19]:
dna = 'AATTTAACCAAGGGATT'
dna_counter = c.Counter(dna)
dna_counter
Out[19]:
Counter({'A': 7, 'T': 5, 'C': 2, 'G': 3})
In [20]:
words = 'this old man he played one he played knick knack on my thumb'.split()
word_counter = c.Counter(words)
In [21]:
word_counter.most_common(2)
Out[21]:
[('he', 2), ('played', 2)]
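
Counters also support update and simple arithmetic (adding two counters adds the counts elementwise); a small sketch:

import collections as c

tally = c.Counter('AATT')
tally.update('TTGG')                          # add counts from another iterable
print(tally)
print(c.Counter('AATT') + c.Counter('ATGC'))  # elementwise sum of counts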

deque

A double-ended queue supports appends and pops from both ends in constant time.

In [22]:
deq = c.deque([5,1,3,2,4])
In [23]:
deq
Out[23]:
deque([5, 1, 3, 2, 4])
In [24]:
deq.append(99)
deq
Out[24]:
deque([5, 1, 3, 2, 4, 99])
In [25]:
deq.appendleft(99)
deq
Out[25]:
deque([99, 5, 1, 3, 2, 4, 99])
In [26]:
deq.pop(), deq.pop()
Out[26]:
(99, 4)
In [27]:
deq.popleft(), deq.popleft()
Out[27]:
(99, 5)

A deque also works as a ring buffer (a fixed-size window that discards the oldest items) when the maxlen argument is given.

In [28]:
ring = c.deque(maxlen=4)
In [29]:
for i in range(10):
    ring.appendleft(i)
    print(ring)
deque([0], maxlen=4)
deque([1, 0], maxlen=4)
deque([2, 1, 0], maxlen=4)
deque([3, 2, 1, 0], maxlen=4)
deque([4, 3, 2, 1], maxlen=4)
deque([5, 4, 3, 2], maxlen=4)
deque([6, 5, 4, 3], maxlen=4)
deque([7, 6, 5, 4], maxlen=4)
deque([8, 7, 6, 5], maxlen=4)
deque([9, 8, 7, 6], maxlen=4)
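
A common use of a fixed-length deque is a sliding window, for example a moving average over the most recent few values; a minimal sketch:

import collections as c

window = c.deque(maxlen=3)   # keeps only the 3 most recent values
for x in [5, 1, 3, 2, 4, 6]:
    window.append(x)
    print(list(window), sum(window) / len(window))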

defaultdict

A defaultdict supplies a default value, produced by a factory function, whenever a key is not found.

In [30]:
d = c.defaultdict(list)
In [31]:
d['foo'].append(2)
In [32]:
d['bar'].extend([1,2,3])
In [33]:
d
Out[33]:
defaultdict(list, {'foo': [2], 'bar': [1, 2, 3]})
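
A typical use of defaultdict(list) is grouping items by some key without checking first whether the key exists; a short sketch grouping words by their first letter:

import collections as c

words = ['apple', 'banana', 'avocado', 'blueberry', 'cherry']
groups = c.defaultdict(list)
for w in words:
    groups[w[0]].append(w)   # a missing key gets a fresh empty list
print(dict(groups))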

ChainMap

A ChainMap combines multiple dictionaries so that they can be searched as though they were a single dictionary. For duplicate keys, the value from the first mapping that contains the key is used.

In [34]:
d1 = dict(a=1, b=2, c=3)
d2 = dict(d=1, e=2, f=3)
d3 = dict(g=4, h=4, i=4)
In [35]:
cm = c.ChainMap(d1, d2, d3)
In [36]:
cm
Out[36]:
ChainMap({'a': 1, 'b': 2, 'c': 3}, {'d': 1, 'e': 2, 'f': 3}, {'g': 4, 'h': 4, 'i': 4})
In [37]:
cm['a'], cm['e'], cm['h']
Out[37]:
(1, 2, 4)
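
One common use of ChainMap is layered configuration, where user settings override defaults because lookups search the maps in order; a minimal sketch with made-up settings:

import collections as c

defaults = dict(color='blue', size=10)
user = dict(size=14)
config = c.ChainMap(user, defaults)   # user settings are consulted first
print(config['color'], config['size'])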

Example

What numbers are present in both xs and ys? We compare three approaches below: a nested loop, a linear scan with a membership test against the array, and set intersection.

In [38]:
import numpy as np
In [39]:
xs = np.random.randint(0, int(1e6), 10000)
ys = np.random.randint(0, int(1e6), 10000)
In [40]:
%%time

common = []
for x in xs:
    for y in ys:
        if y == x:
            common.append(y)
print(set(common))
{584714, 102410, 200726, 347162, 921120, 25637, 445477, 912432, 751154, 658485, 809027, 747592, 981072, 706129, 12888, 929378, 527460, 506983, 233580, 299632, 453744, 509045, 259703, 922744, 998011, 244861, 319103, 635520, 136839, 232073, 670864, 935057, 549520, 402582, 64663, 798361, 215195, 971419, 219811, 932523, 703153, 734386, 208564, 81589, 133307, 836284, 985290, 96975, 116950, 312540, 200926, 472289, 505571, 447204, 323310, 716545, 5898, 564495, 183065, 109850, 197406, 442655, 940837, 543526, 394536, 959274, 495403, 152364, 696111, 106803, 350517, 980800, 883523, 113988, 280396, 256338, 448346, 645978, 274781, 753504, 351074, 868724, 776577, 427908, 27013, 586119, 673682, 588179, 731542, 987032, 522138, 111518, 937888, 810407, 294828, 258992, 613818, 646081, 436164, 670667, 947665, 110037, 426473, 224747, 281580, 735225}
CPU times: user 16.6 s, sys: 14.4 ms, total: 16.6 s
Wall time: 16.6 s
In [41]:
%%time

common = []
for y in ys:
    if y in xs:
        common.append(y)
print(set(common))
{102410, 584714, 200726, 347162, 921120, 25637, 445477, 912432, 751154, 658485, 809027, 747592, 981072, 706129, 12888, 929378, 527460, 506983, 233580, 299632, 453744, 509045, 259703, 922744, 998011, 244861, 319103, 635520, 136839, 232073, 549520, 670864, 935057, 402582, 64663, 798361, 971419, 215195, 219811, 932523, 703153, 734386, 208564, 81589, 133307, 836284, 985290, 96975, 116950, 312540, 200926, 472289, 505571, 447204, 323310, 716545, 5898, 564495, 183065, 109850, 197406, 442655, 940837, 543526, 394536, 959274, 495403, 152364, 696111, 106803, 350517, 980800, 883523, 113988, 280396, 256338, 448346, 645978, 274781, 753504, 351074, 868724, 776577, 427908, 27013, 586119, 673682, 588179, 731542, 987032, 522138, 111518, 937888, 810407, 294828, 258992, 613818, 646081, 436164, 670667, 947665, 110037, 426473, 224747, 281580, 735225}
CPU times: user 186 ms, sys: 2.5 ms, total: 188 ms
Wall time: 186 ms
In [42]:
%%time

common = set(xs).intersection(ys)
print(common)
{102410, 584714, 200726, 347162, 921120, 25637, 445477, 912432, 751154, 658485, 809027, 747592, 981072, 706129, 12888, 929378, 527460, 506983, 233580, 299632, 453744, 509045, 259703, 922744, 998011, 244861, 319103, 635520, 136839, 232073, 549520, 670864, 935057, 402582, 64663, 798361, 971419, 215195, 219811, 932523, 703153, 734386, 208564, 81589, 133307, 836284, 985290, 96975, 116950, 312540, 200926, 472289, 505571, 447204, 323310, 716545, 5898, 564495, 183065, 109850, 197406, 442655, 940837, 543526, 394536, 959274, 495403, 152364, 696111, 106803, 350517, 980800, 883523, 113988, 280396, 256338, 448346, 645978, 274781, 753504, 351074, 868724, 776577, 427908, 27013, 586119, 673682, 588179, 731542, 987032, 522138, 111518, 937888, 810407, 294828, 258992, 613818, 646081, 436164, 670667, 947665, 110037, 426473, 224747, 281580, 735225}
CPU times: user 2.73 ms, sys: 195 µs, total: 2.93 ms
Wall time: 2.91 ms

Algorithms

Big O complexity

A function \(f(n)\) has Big O complexity \(g(n)\), written \(f(n) = O(g(n))\), if \(\vert f(n) \vert \le M g(n)\) for some constant \(M\) and all sufficiently large \(n\). Common classes for \(g(n)\), in increasing order of growth, are

  • logarithmic \(\log n\)

  • linear \(n\)

  • loglinear \(n \log n\)

  • polynomial \(n^k\)

  • exponential \(e^n\)

  • factorial \(n!\)

Notes

  • Note 1: Parallel processing gives at best a linear speedup in most cases.

  • Note 2: The constant factor can matter, especially for small to moderate values of \(n\).

In [43]:
from functools import reduce
In [44]:
def factorial(n):
    """Return n! using reduce."""
    return reduce(lambda a, b: a * b, range(1, n+1), 1)
In [45]:
for n in (5, 20, 50):
    print('n =', n)
    print('-'*40)
    print('log    ', int(np.log2(n)))
    print('linear ', n)
    print('n log n', int(n*np.log2(n)))
    print('n^2    ', n**2)
    print('n^3    ', n**3)
    print('e^n    ', int(np.exp(n)))
    print('n!     ', factorial(n))
    print()
n = 5
----------------------------------------
log     2
linear  5
n log n 11
n^2     25
n^3     125
e^n     148
n!      120

n = 20
----------------------------------------
log     4
linear  20
n log n 86
n^2     400
n^3     8000
e^n     485165195
n!      2432902008176640000

n = 50
----------------------------------------
log     5
linear  50
n log n 282
n^2     2500
n^3     125000
e^n     5184705528587072045056
n!      30414093201713378043612608166064768844377641568960512000000000000

Example

If you have to search a sequence container repeatedly, it is more efficient to sort it once and then use a bisection (binary) search.

  • The initial sort takes \(n \log n\) time

  • Each subsequent search takes \(\log n\)

  • The total time for \(k\) searches is \(n \log n + k \log n\), versus roughly \(k \times n/2\) for repeated linear scans

In [46]:
testx = np.random.randint(0, 2*n, 1000)  # n is still 50 from the loop above, so test values lie in [0, 100)
In [47]:
%%time

n = 10000
xs  = list(range(n))
hits = 0
for x in testx:
    if x in xs:
        hits += 1
print(hits)
1000
CPU times: user 8.5 ms, sys: 496 µs, total: 8.99 ms
Wall time: 8.66 ms
In [48]:
import bisect
In [49]:
%%time

n = 10000
xs  = list(range(n))
xs.sort()
hits = 0
for x in testx:
    idx = bisect.bisect_left(xs, x)
    if idx < n and xs[idx] == x:  # confirm x is actually present at that position
        hits += 1
print(hits)
1000
CPU times: user 3.68 ms, sys: 688 µs, total: 4.37 ms
Wall time: 3.87 ms

Recursion

You should know how to code recursive functions. You should also be aware of how inefficient naive recursion can be, and how memoization (caching) can speed up some recursive functions, e.g. with functools.lru_cache (a short sketch follows the exercises below). Try to implement the following simple recursive functions.

In [50]:
def rfac(n):
    """Find the nth factorail recursively."""
In [51]:
def rsum(xs):
    """Sum the values in xs recursively."""
In [52]:
def rfib(n):
    """Find the nth Fibonaccci number recursively."""
In [53]:
def rmap(f, xs):
    """Implement a map function recursively.

    f is a function that takes a single argument
    xs is a list."""

What is the Big O complexity of rfac with and without caching?
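
As a hedged sketch of the memoization idea mentioned above, here is a recursive Fibonacci decorated with functools.lru_cache; without the cache the call tree grows exponentially, with it each value is computed only once.

from functools import lru_cache

@lru_cache(maxsize=None)   # memoize results of previous calls
def fib(n):
    """Return the nth Fibonacci number recursively."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print([fib(n) for n in range(10)])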

In [ ]:

Sorting algorithms

Generally, use the sort functions provided by the language (e.g. sort, sorted). However, writing simple sort functions is a useful way to illustrate algorithmic concepts such as in-place editing, recursion and algorithmic complexity, so you should know how to write them.

  • How much memory does the sort use?

  • What is its big O complexity class?

  • Is it iterative or recursive? (Note: every recursive algorithm has an iterative equivalent, but some, e.g. quicksort, are easier to think of recursively.)

Bubble sort

In [54]:
def bubblesort(xs):
    """Bubble sort."""

    n = len(xs)
    for i in range(n):
        print(xs)
        for j in range(i+1, n):
            if xs[i] > xs[j]:
                xs[i], xs[j] = xs[j], xs[i]
In [55]:
xs = [5,1,3,4,2]
bubblesort(xs)
xs
[5, 1, 3, 4, 2]
[1, 5, 3, 4, 2]
[1, 2, 5, 4, 3]
[1, 2, 3, 5, 4]
[1, 2, 3, 4, 5]
Out[55]:
[1, 2, 3, 4, 5]
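
Strictly speaking, the version above swaps any later element that is out of order with position i (an exchange sort); the textbook bubble sort compares and swaps adjacent elements only. A minimal sketch of that variant:

def bubblesort_adjacent(xs):
    """Bubble sort that swaps adjacent out-of-order pairs."""
    n = len(xs)
    for i in range(n):
        for j in range(n - 1 - i):        # the largest remaining value bubbles to the end
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]

xs = [5, 1, 3, 4, 2]
bubblesort_adjacent(xs)
print(xs)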

Selection sort

In [56]:
def selectionsort(xs):
    """Selection sort."""

    n = len(xs)
    for i in range(n):
        best = xs[i]
        idx = i
        print(xs)
        for j in range(i+1, n):
            if xs[j] < best:
                best = xs[j]
                idx = j
        xs[i], xs[idx] = xs[idx], xs[i]
In [57]:
xs = [5,1,3,4,2]
selectionsort(xs)
xs
[5, 1, 3, 4, 2]
[1, 5, 3, 4, 2]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
Out[57]:
[1, 2, 3, 4, 5]

Quicksort

In [58]:
def quicksort(xs):
    """Quicksort."""

    if len(xs) < 2:
        return xs
    else:
        pivot = xs[0]
        left = [x for x in xs[1:] if x <= pivot]
        right = [x for x in xs[1:] if x > pivot]
        print(pivot, left, right)
        return quicksort(left) + [pivot] + quicksort(right)
In [59]:
xs = [5,1,3,4,2]
quicksort(xs)
5 [1, 3, 4, 2] []
1 [] [3, 4, 2]
3 [2] [4]
Out[59]:
[1, 2, 3, 4, 5]
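
The version above allocates new left and right lists at every level of recursion; as a contrast for the memory question posed earlier, here is a sketch of an in-place variant using Lomuto partitioning (the helper name is my own):

def quicksort_inplace(xs, lo=0, hi=None):
    """Sort xs in place with quicksort using Lomuto partitioning."""
    if hi is None:
        hi = len(xs) - 1
    if lo < hi:
        p = partition(xs, lo, hi)
        quicksort_inplace(xs, lo, p - 1)
        quicksort_inplace(xs, p + 1, hi)

def partition(xs, lo, hi):
    """Move items <= the pivot xs[hi] to the left and return the pivot's final index."""
    pivot = xs[hi]
    i = lo
    for j in range(lo, hi):
        if xs[j] <= pivot:
            xs[i], xs[j] = xs[j], xs[i]
            i += 1
    xs[i], xs[hi] = xs[hi], xs[i]
    return i

xs = [5, 1, 3, 4, 2]
quicksort_inplace(xs)
print(xs)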

Memory usage

For a NumPy array, sys.getsizeof reports the size of the whole Python object (the array header plus the data buffer it owns), while nbytes counts only the data buffer.
In [60]:
import sys
In [61]:
xs = np.random.randint(0, 10, (100,100))
In [62]:
sys.getsizeof(xs)
Out[62]:
80112
In [63]:
xs.nbytes
Out[63]:
80000
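
Note that sys.getsizeof is shallow for containers: for a plain Python list it reports only the list object and its array of pointers, not the elements themselves. A small sketch (values are machine-dependent, so no output is shown):

import sys

xs_list = list(range(10000))
print(sys.getsizeof(xs_list))                  # list object plus pointer array only
print(sum(sys.getsizeof(x) for x in xs_list))  # the int objects themselves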

Timing

In the notebook, %time and %%time report a single run of a statement or cell, while %timeit averages over repeated runs; the timeit module provides similar measurements in plain Python scripts.
In [64]:
from time import sleep
In [65]:
%time sleep(0.1)
CPU times: user 8.4 ms, sys: 2.76 ms, total: 11.2 ms
Wall time: 104 ms
In [66]:
%%time

sleep(0.1)
CPU times: user 1.29 ms, sys: 1.25 ms, total: 2.54 ms
Wall time: 101 ms
In [67]:
%timeit sleep(0.1)
102 ms ± 702 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [68]:
%timeit -r 3 -n 10 sleep(0.1)
103 ms ± 673 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
In [69]:
from timeit import timeit
In [70]:
t = timeit('from time import sleep; sleep(0.1)', number=1)
t
Out[70]:
0.1012009109999994
In [71]:
t = timeit(lambda: sleep(0.1), number=1)
t
Out[71]:
0.10307493699999881
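
Outside the notebook, timeit.repeat runs several independent timing rounds; taking the minimum is a common way to reduce noise from other processes. A minimal sketch:

from time import sleep
from timeit import repeat

times = repeat(lambda: sleep(0.1), number=1, repeat=3)  # three independent runs of one call each
print(min(times))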