Algorithms and Data Structures¶
If you will be coding a lot, take an Algorithms and Data Structures course from a CS department or a MOOC; it will make you a better programmer. Here we only cover the basic concepts needed to write efficient Python code.
Data structures¶
Sequence containers¶
Searching a sequence container such as a list takes linear time, since in the worst case a full scan is required.
In [1]:
n = 100
xs = list(range(n))
%timeit -r3 -n10 n in xs
1.77 µs ± 32 ns per loop (mean ± std. dev. of 3 runs, 10 loops each)
In [2]:
n = 10000
xs = list(range(n))
%timeit -r3 -n10 n in xs
176 µs ± 7.8 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
Mapping containers¶
Searching takes (on average) constant time, since Python's sets and dicts are implemented with hash tables.
In [3]:
n = 100
xs = set(range(n))
%timeit -r3 -n10 n in xs
114 ns ± 20.3 ns per loop (mean ± std. dev. of 3 runs, 10 loops each)
In [4]:
n = 10000
xs = set(range(n))
%timeit -r3 -n10 n in xs
111 ns ± 16.4 ns per loop (mean ± std. dev. of 3 runs, 10 loops each)
Heap¶
A heap is a data structure often used to build efficient priority queues; the heapq module maintains a binary min-heap inside an ordinary list.
In [5]:
import heapq
In [6]:
h = []
xs = [5,1,3,2,4]
for x in xs:
    heapq.heappush(h, x)
    print(h)
[5]
[1, 5]
[1, 5, 3]
[1, 2, 3, 5]
[1, 2, 3, 5, 4]
In [7]:
heapq.nlargest(3, h)
Out[7]:
[5, 4, 3]
In [8]:
heapq.nsmallest(3, h)
Out[8]:
[1, 2, 3]
In [9]:
while h:
    x = heapq.heappop(h)
    print(x)
1
2
3
4
5
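Because heappush and heappop keep the smallest item at index 0, a heap gives a simple priority queue. A minimal sketch, using made-up (priority, task) tuples that compare by priority first:

import heapq

pq = []
heapq.heappush(pq, (2, 'write report'))
heapq.heappush(pq, (1, 'fix bug'))
heapq.heappush(pq, (3, 'refactor'))

while pq:
    priority, task = heapq.heappop(pq)   # lowest priority number comes out first
    print(priority, task)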
Collections¶
The collections module includes a few useful data structures.
named tuple¶
A named tuple is essentially as memory-efficient as a plain tuple, but more convenient because fields can be accessed by name.
In [10]:
import collections as c
In [11]:
Point = c.namedtuple('Point', ['x', 'y', 'z'])
In [12]:
p = Point(1,2,3)
In [13]:
p
Out[13]:
Point(x=1, y=2, z=3)
In [14]:
p.x, p.y, p.z
Out[14]:
(1, 2, 3)
In [15]:
Rec = c.namedtuple('Record', ['first', 'last', 'email'])
In [16]:
tom = Rec(email='john.doe@example.com', first='John', last='Doe')
In [17]:
tom[-1]
Out[17]:
'john.doe@example.com'
In [18]:
tom.email
Out[18]:
'john.doe@example.com'
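Named tuples also come with a few helper methods, for example _asdict to convert to a dictionary and _replace to build a modified copy (a short sketch using the record above):

tom._asdict()                        # dictionary view of the fields
tom2 = tom._replace(first='Tom')     # namedtuples are immutable, so _replace returns a new record
tom2.first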
counter¶
A Counter is a convenience class for counting hashable objects.
In [19]:
dna = 'AATTTAACCAAGGGATT'
dna_counter = c.Counter(dna)
dna_counter
Out[19]:
Counter({'A': 7, 'T': 5, 'C': 2, 'G': 3})
In [20]:
words = 'this old man he played one he played knick knack on my thumb'.split()
word_counter = c.Counter(words)
In [21]:
word_counter.most_common(2)
Out[21]:
[('he', 2), ('played', 2)]
deque¶
A double-ended queue, with constant-time appends and pops at both ends.
In [22]:
deq = c.deque([5,1,3,2,4])
In [23]:
deq
Out[23]:
deque([5, 1, 3, 2, 4])
In [24]:
deq.append(99)
deq
Out[24]:
deque([5, 1, 3, 2, 4, 99])
In [25]:
deq.appendleft(99)
deq
Out[25]:
deque([99, 5, 1, 3, 2, 4, 99])
In [26]:
deq.pop(), deq.pop()
Out[26]:
(99, 4)
In [27]:
deq.popleft(), deq.popleft()
Out[27]:
(99, 5)
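Because both ends support constant-time operations, a deque is the natural container for a FIFO queue, whereas popping from the front of a list is linear since the remaining items must be shifted. A rough comparison sketch (the helper functions are just for illustration; exact timings depend on the machine):

def drain_deque(n):
    q = c.deque(range(n))
    while q:
        q.popleft()          # O(1) per pop

def drain_list(n):
    q = list(range(n))
    while q:
        q.pop(0)             # O(n) per pop: the rest of the list is shifted left

%timeit -r3 -n1 drain_deque(100000)
%timeit -r3 -n1 drain_list(100000)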
The deque is also useful as a ring buffer when the maxlen argument is given.
In [28]:
ring = c.deque(maxlen=4)
In [29]:
for i in range(10):
    ring.appendleft(i)
    print(ring)
deque([0], maxlen=4)
deque([1, 0], maxlen=4)
deque([2, 1, 0], maxlen=4)
deque([3, 2, 1, 0], maxlen=4)
deque([4, 3, 2, 1], maxlen=4)
deque([5, 4, 3, 2], maxlen=4)
deque([6, 5, 4, 3], maxlen=4)
deque([7, 6, 5, 4], maxlen=4)
deque([8, 7, 6, 5], maxlen=4)
deque([9, 8, 7, 6], maxlen=4)
defaultdict¶
Provides a default value, produced by a factory function, when a key is not found.
In [30]:
d = c.defaultdict(list)
In [31]:
d['foo'].append(2)
In [32]:
d['bar'].extend([1,2,3])
In [33]:
d
Out[33]:
defaultdict(list, {'foo': [2], 'bar': [1, 2, 3]})
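A common use of defaultdict(list) is grouping items without first checking whether a key exists. For example, grouping words by their first letter (a small made-up sketch):

words = 'apple avocado banana blueberry cherry'.split()
groups = c.defaultdict(list)
for word in words:
    groups[word[0]].append(word)   # a missing key gets a fresh empty list automatically
dict(groups)                       # {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}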
ChainMap¶
A ChainMap lets you combine multiple dictionaries and access keys as though they were a single dictionary. If the same key appears in more than one dictionary, only the first occurrence is accessible.
In [34]:
d1 = dict(a=1, b=2, c=3)
d2 = dict(d=1, e=2, f=3)
d3 = dict(g=4, h=4, i=4)
In [35]:
cm = c.ChainMap(d1, d2, d3)
In [36]:
cm
Out[36]:
ChainMap({'a': 1, 'b': 2, 'c': 3}, {'d': 1, 'e': 2, 'f': 3}, {'g': 4, 'h': 4, 'i': 4})
In [37]:
cm['a'], cm['e'], cm['h']
Out[37]:
(1, 2, 4)
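A common use of ChainMap is layering configuration, where earlier maps override later ones. A small sketch with made-up settings:

defaults = {'color': 'blue', 'user': 'guest'}
overrides = {'user': 'alice'}
config = c.ChainMap(overrides, defaults)   # lookups try overrides first, then defaults
config['user'], config['color']            # ('alice', 'blue')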
Example¶
What numbers are present in both xs and ys?
In [38]:
import numpy as np
In [39]:
xs = np.random.randint(0, int(1e6), 10000)
ys = np.random.randint(0, int(1e6), 10000)
In [40]:
%%time
common = []
for x in xs:              # O(n^2): every x is compared against every y
    for y in ys:
        if y == x:
            common.append(y)
print(set(common))
{584714, 102410, 200726, 347162, 921120, 25637, 445477, 912432, 751154, 658485, 809027, 747592, 981072, 706129, 12888, 929378, 527460, 506983, 233580, 299632, 453744, 509045, 259703, 922744, 998011, 244861, 319103, 635520, 136839, 232073, 670864, 935057, 549520, 402582, 64663, 798361, 215195, 971419, 219811, 932523, 703153, 734386, 208564, 81589, 133307, 836284, 985290, 96975, 116950, 312540, 200926, 472289, 505571, 447204, 323310, 716545, 5898, 564495, 183065, 109850, 197406, 442655, 940837, 543526, 394536, 959274, 495403, 152364, 696111, 106803, 350517, 980800, 883523, 113988, 280396, 256338, 448346, 645978, 274781, 753504, 351074, 868724, 776577, 427908, 27013, 586119, 673682, 588179, 731542, 987032, 522138, 111518, 937888, 810407, 294828, 258992, 613818, 646081, 436164, 670667, 947665, 110037, 426473, 224747, 281580, 735225}
CPU times: user 16.6 s, sys: 14.4 ms, total: 16.6 s
Wall time: 16.6 s
In [41]:
%%time
common = []
for y in ys:
    if y in xs:           # still a linear scan of xs for each y, but executed in C by NumPy
        common.append(y)
print(set(common))
{102410, 584714, 200726, 347162, 921120, 25637, 445477, 912432, 751154, 658485, 809027, 747592, 981072, 706129, 12888, 929378, 527460, 506983, 233580, 299632, 453744, 509045, 259703, 922744, 998011, 244861, 319103, 635520, 136839, 232073, 549520, 670864, 935057, 402582, 64663, 798361, 971419, 215195, 219811, 932523, 703153, 734386, 208564, 81589, 133307, 836284, 985290, 96975, 116950, 312540, 200926, 472289, 505571, 447204, 323310, 716545, 5898, 564495, 183065, 109850, 197406, 442655, 940837, 543526, 394536, 959274, 495403, 152364, 696111, 106803, 350517, 980800, 883523, 113988, 280396, 256338, 448346, 645978, 274781, 753504, 351074, 868724, 776577, 427908, 27013, 586119, 673682, 588179, 731542, 987032, 522138, 111518, 937888, 810407, 294828, 258992, 613818, 646081, 436164, 670667, 947665, 110037, 426473, 224747, 281580, 735225}
CPU times: user 186 ms, sys: 2.5 ms, total: 188 ms
Wall time: 186 ms
In [42]:
%%time
common = set(xs).intersection(ys)
print(common)
{102410, 584714, 200726, 347162, 921120, 25637, 445477, 912432, 751154, 658485, 809027, 747592, 981072, 706129, 12888, 929378, 527460, 506983, 233580, 299632, 453744, 509045, 259703, 922744, 998011, 244861, 319103, 635520, 136839, 232073, 549520, 670864, 935057, 402582, 64663, 798361, 971419, 215195, 219811, 932523, 703153, 734386, 208564, 81589, 133307, 836284, 985290, 96975, 116950, 312540, 200926, 472289, 505571, 447204, 323310, 716545, 5898, 564495, 183065, 109850, 197406, 442655, 940837, 543526, 394536, 959274, 495403, 152364, 696111, 106803, 350517, 980800, 883523, 113988, 280396, 256338, 448346, 645978, 274781, 753504, 351074, 868724, 776577, 427908, 27013, 586119, 673682, 588179, 731542, 987032, 522138, 111518, 937888, 810407, 294828, 258992, 613818, 646081, 436164, 670667, 947665, 110037, 426473, 224747, 281580, 735225}
CPU times: user 2.73 ms, sys: 195 µs, total: 2.93 ms
Wall time: 2.91 ms
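Since xs and ys are NumPy arrays here, np.intersect1d is another fast option; it sorts internally, so it runs in roughly n log n time (a sketch):

common_np = np.intersect1d(xs, ys)   # sorted array of values present in both
set(common_np) == common             # should agree with the set-based result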
Algorithms¶
Big O complexity¶
A function \(f(n)\) has Big O complexity \(g(n)\) if \(\vert f(n) \vert \le M g(n)\) for some constant \(M\) and all sufficiently large \(n\). Common classes for \(g(n)\) in increasing order of complexity are
logarithmic \(\log n\)
linear \(n\)
\(n \log n\)
polynomial \(n^k\)
exponential \(e^n\)
factorial \(n!\)
Notes
Note 1: parallel processing in most cases gives at best a linear speedup.
Note 2: the constant factor can be important, especially for small to moderate values of \(n\).
In [43]:
from functools import reduce
In [44]:
def factorial(n):
    return reduce(lambda a, b: a * b, range(1, n+1))
In [45]:
for n in (5, 20, 50):
    print('n =', n)
    print('-'*40)
    print('log    ', int(np.log2(n)))
    print('linear ', n)
    print('n log n', int(n*np.log2(n)))
    print('n^2    ', n**2)
    print('n^3    ', n**3)
    print('e^n    ', int(np.exp(n)))
    print('n!     ', factorial(n))
    print()
n = 5
----------------------------------------
log 2
linear 5
n log n 11
n^2 25
n^3 125
e^n 148
n! 120
n = 20
----------------------------------------
log 4
linear 20
n log n 86
n^2 400
n^3 8000
e^n 485165195
n! 2432902008176640000
n = 50
----------------------------------------
log 5
linear 50
n log n 282
n^2 2500
n^3 125000
e^n 5184705528587072045056
n! 30414093201713378043612608166064768844377641568960512000000000000
Example¶
If you have to search a sequence container repeatedly, it is more efficient to first sort, then use a bisection algorithm.
The initial sort takes \(n \log n\) time
Each subsequent search takes \(\log n\)
For \(k\) searches, the total time is \(n \log n + k \log n\), versus roughly \(k \times n/2\) for repeated linear scans
In [46]:
testx = np.random.randint(0, 2*n, 1000)  # n is still 50 from the loop above, so every test value is below 100
In [47]:
%%time
n = 10000
xs = list(range(n))
hits = 0
for x in testx:
    if x in xs:
        hits += 1
print(hits)
1000
CPU times: user 8.5 ms, sys: 496 µs, total: 8.99 ms
Wall time: 8.66 ms
In [48]:
import bisect
In [49]:
%%time
n = 10000
xs = list(range(n))
xs.sort()
hits = 0
for x in testx:
    i = bisect.bisect_left(xs, x)     # index where x would be inserted to keep xs sorted
    if i != n and xs[i] == x:         # x is present only if the element at that index equals x
        hits += 1
print(hits)
1000
CPU times: user 3.68 ms, sys: 688 µs, total: 4.37 ms
Wall time: 3.87 ms
Recursion¶
You should know how to code recursive functions. You should also be aware of how inefficient they can be, and how you can use memoization (caching) to speed up some recursive functions (e.g. using functools.lru_cache). Try to implement the following simple recursive functions.
In [50]:
def rfac(n):
    """Find the nth factorial recursively."""
In [51]:
def rsum(xs):
    """Sum the values in xs recursively."""
In [52]:
def rfib(n):
    """Find the nth Fibonacci number recursively."""
In [53]:
def rmap(f, xs):
    """Implement a map function recursively.

    f is a function that takes a single argument.
    xs is a list."""
What is the Big O complexity of rfac with and without caching?
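For reference, a minimal sketch of memoization with functools.lru_cache, applied here to a recursive Fibonacci: without the cache the call tree grows exponentially, with it each value is computed only once.

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """Return the nth Fibonacci number, caching intermediate results."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

fib(100)   # fast, since each fib(k) is computed only once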
Sorting algorithms¶
Generally, use the sort functions provided by the language (e.g. list.sort, sorted). However, writing sort functions is a useful way to illustrate algorithmic concepts such as in-place editing, recursion and algorithmic complexity, and you should know how to write simple sort functions.
How much memory does the sort use?
What is its big O complexity class?
Is it iterative or recursive? (Note: every recursive algorithm has an iterative equivalent, but some, e.g. quicksort, are easier to think of recursively.)
Bubble sort¶
In [54]:
def bubblesort(xs):
    """Bubble sort (simplified variant).

    Swaps xs[i] with any later element that is smaller; the classic
    formulation swaps adjacent pairs instead (see the sketch below).
    """
    n = len(xs)
    for i in range(n):
        print(xs)
        for j in range(i+1, n):
            if xs[i] > xs[j]:
                xs[i], xs[j] = xs[j], xs[i]
In [55]:
xs = [5,1,3,4,2]
bubblesort(xs)
xs
[5, 1, 3, 4, 2]
[1, 5, 3, 4, 2]
[1, 2, 5, 4, 3]
[1, 2, 3, 5, 4]
[1, 2, 3, 4, 5]
Out[55]:
[1, 2, 3, 4, 5]
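For comparison, a sketch of the classic bubble sort, which repeatedly swaps adjacent out-of-order pairs so that the largest remaining value "bubbles" to the end on each pass:

def bubblesort_classic(xs):
    """Classic bubble sort: swap adjacent out-of-order pairs."""
    n = len(xs)
    for i in range(n):
        for j in range(n - 1 - i):          # the last i items are already in place
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]

xs = [5, 1, 3, 4, 2]
bubblesort_classic(xs)
xs   # [1, 2, 3, 4, 5]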
Selection sort¶
In [56]:
def selectionsort(xs):
    """Selection sort."""
    n = len(xs)
    for i in range(n):
        best = xs[i]
        idx = i
        print(xs)
        for j in range(i+1, n):
            if xs[j] < best:
                best = xs[j]
                idx = j
        xs[i], xs[idx] = xs[idx], xs[i]
In [57]:
xs = [5,1,3,4,2]
selectionsort(xs)
xs
[5, 1, 3, 4, 2]
[1, 5, 3, 4, 2]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
Out[57]:
[1, 2, 3, 4, 5]
Quicksort¶
In [58]:
def quicksort(xs):
    """Quicksort."""
    if len(xs) < 2:
        return xs
    else:
        pivot = xs[0]
        left = [x for x in xs[1:] if x <= pivot]
        right = [x for x in xs[1:] if x > pivot]
        print(pivot, left, right)
        return quicksort(left) + [pivot] + quicksort(right)
In [59]:
xs = [5,1,3,4,2]
quicksort(xs)
5 [1, 3, 4, 2] []
1 [] [3, 4, 2]
3 [2] [4]
Out[59]:
[1, 2, 3, 4, 5]
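Because the pivot is always the first element, this version degrades to quadratic time on already-sorted input, where every partition is maximally unbalanced. Choosing a random pivot is one common fix; a hedged sketch of that variation:

import random

def quicksort_random(xs):
    """Quicksort with a random pivot to avoid the worst case on sorted input."""
    if len(xs) < 2:
        return xs
    pivot = random.choice(xs)
    left = [x for x in xs if x < pivot]
    mid = [x for x in xs if x == pivot]     # keep duplicates of the pivot together
    right = [x for x in xs if x > pivot]
    return quicksort_random(left) + mid + quicksort_random(right)

quicksort_random([5, 1, 3, 4, 2])   # [1, 2, 3, 4, 5]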
Memory usage¶
In [60]:
import sys
In [61]:
xs = np.random.randint(0, 10, (100,100))
In [62]:
sys.getsizeof(xs)
Out[62]:
80112
In [63]:
xs.nbytes
Out[63]:
80000
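The difference of about 112 bytes is the array object's header; nbytes counts only the data buffer. Note also that for plain Python containers sys.getsizeof reports only the container itself, not the objects it references, so for a list you may want to add up the elements as well (a rough sketch):

big = list(range(1000))
sys.getsizeof(big)                                          # the list object and its array of pointers only
sys.getsizeof(big) + sum(sys.getsizeof(x) for x in big)     # rough total including the int objects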
Timing¶
In [64]:
from time import sleep
In [65]:
%time sleep(0.1)
CPU times: user 8.4 ms, sys: 2.76 ms, total: 11.2 ms
Wall time: 104 ms
In [66]:
%%time
sleep(0.1)
CPU times: user 1.29 ms, sys: 1.25 ms, total: 2.54 ms
Wall time: 101 ms
In [67]:
%timeit sleep(0.1)
102 ms ± 702 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [68]:
%timeit -r 3 -n 10 sleep(0.1)
103 ms ± 673 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
In [69]:
from timeit import timeit
In [70]:
t = timeit('from time import sleep; sleep(0.1)', number=1)
t
Out[70]:
0.1012009109999994
In [71]:
t = timeit(lambda: sleep(0.1), number=1)
t
Out[71]:
0.10307493699999881
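Outside IPython, time.perf_counter is a simple way to time a block of code by hand:

from time import perf_counter, sleep

start = perf_counter()
sleep(0.1)
elapsed = perf_counter() - start   # elapsed wall-clock time in seconds
elapsed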