Python: Data Collections¶

Python produces a large number of useful data structures in its standard library, and more exotic data structures can usually be installed via pip from the Python Package Index.

Algorithmic Complexity¶

It is important to be familiar with the time complexity of these algorithms.

See Complexity of Python Operations for complexity of common data structures and operations.

Strings¶

A string can behave like an immutable data collection, consisting of characters.

Creation¶

In [1]:

name = 'Rumpelsttiltskin'

Indexing¶

In [2]:

name[:3]

Out[2]:

'Rum'

In [3]:

name[-3:]

Out[3]:

'kin'

In [4]:

name[3:-3]

Out[4]:

'pelsttilts'

Concatenation¶

In [5]:

sentence = name + 'is a fairy tale by the Grimm Brothers'

In [6]:

sentence

Out[6]:

'Rumpelsttiltskinis a fairy tale by the Grimm Brothers'

String methods¶

In [7]:

sentence.split()

Out[7]:

['Rumpelsttiltskinis', 'a', 'fairy', 'tale', 'by', 'the', 'Grimm', 'Brothers']

In [8]:

'_'.join(sentence.split())

Out[8]:

'Rumpelsttiltskinis_a_fairy_tale_by_the_Grimm_Brothers'

In [9]:

sentence.upper()

Out[9]:

'RUMPELSTTILTSKINIS A FAIRY TALE BY THE GRIMM BROTHERS'

In [10]:

sentence.title()

Out[10]:

'Rumpelsttiltskinis A Fairy Tale By The Grimm Brothers'

In [11]:

sentence.count('a')

Out[11]:

In [12]:

sentence.replace('Brothers', 'Sisters')

Out[12]:

'Rumpelsttiltskinis a fairy tale by the Grimm Sisters'

Reversing a string¶

In [13]:

''.join(reversed(sentence))

Out[13]:

'srehtorB mmirG eht yb elat yriaf a sinikstlittslepmuR'

In [14]:

sentence[::-1]

Out[14]:

'srehtorB mmirG eht yb elat yriaf a sinikstlittslepmuR'

Multi-line strings¶

In [15]:

mls = """This is
a string with
many, many lines"""

In [16]:

mls

Out[16]:

'This is\na string with\nmany, many lines'

In [17]:

print(mls)

This is
a string with
many, many lines

Tuples¶

Immutable
Useful as records
Random access

Creating¶

In [18]:

paul = ('Paul', 'Graham', 45, 'paul.graham@duke.edu', 3.8, ('Mathematics', 'Statistics'))

Indexing¶

In [19]:

paul[0]

Out[19]:

'Paul'

In [20]:

paul[-1]

Out[20]:

('Mathematics', 'Statistics')

In [21]:

paul[2:4]

Out[21]:

(45, 'paul.graham@duke.edu')

Immutable¶

In [22]:

try:
    paul[2] = 47
except TypeError as e:
    print(e)

'tuple' object does not support item assignment

Tuple unpacking¶

In [23]:

first, last, *stuff, gpa, (major, minor) = paul

In [24]:

first

Out[24]:

'Paul'

In [25]:

last

Out[25]:

'Graham'

In [26]:

stuff

Out[26]:

[45, 'paul.graham@duke.edu']

In [27]:

gpa

Out[27]:

3.8

In [28]:

major

Out[28]:

'Mathematics'

In [29]:

minor

Out[29]:

'Statistics'

Named tuples¶

Use named tuples for readability. Importantly, there are NO performacne differences compared with plain tuples.

Creating¶

In [30]:

from collections import namedtuple

In [31]:

Subject = namedtuple('subject', 'major minor')
Student = namedtuple('Student', 'first last age email gpa subject')

In [32]:

paul = Student('Paul', 'Graham', 45, 'paul.graham@duke.edu', 3.8,
               Subject('Mathematics', 'Statistics'))

In [33]:

paul

Out[33]:

Student(first='Paul', last='Graham', age=45, email='paul.graham@duke.edu', gpa=3.8, subject=subject(major='Mathematics', minor='Statistics'))

Indexing¶

In [34]:

paul.first

Out[34]:

'Paul'

In [35]:

paul[0]

Out[35]:

'Paul'

In [36]:

paul.last

Out[36]:

'Graham'

In [37]:

paul.subject

Out[37]:

subject(major='Mathematics', minor='Statistics')

In [38]:

paul.subject.major

Out[38]:

'Mathematics'

Lists¶

Similar to tuples
Mutable
Often (but not necessarily) used to maintain collections of the same data type
Workhorse of general Python data processing

Creating¶

In [39]:

words = ['the', 'quick', 'brown', 'fox']

In [40]:

words

Out[40]:

['the', 'quick', 'brown', 'fox']

Indexing and unpacking¶

Same as for tuples.

In [41]:

words[-2:]

Out[41]:

['brown', 'fox']

In [42]:

first, *rest, last = words

In [43]:

first, last

Out[43]:

('the', 'fox')

In [44]:

seq = list(range(10))

In [45]:

seq

Out[45]:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [46]:

seq[1::2]

Out[46]:

[1, 3, 5, 7, 9]

In [47]:

seq[::-2]

Out[47]:

[9, 7, 5, 3, 1]

Modifying lists¶

As lists are mutable, they can be modified.

In [48]:

words[2]

Out[48]:

'brown'

In [49]:

words[2] = 'blue'

In [50]:

words

Out[50]:

['the', 'quick', 'blue', 'fox']

In [51]:

words[:2] = 'a', 'slow'

In [52]:

words

Out[52]:

['a', 'slow', 'blue', 'fox']

Appending¶

In [53]:

words.append('jumps')

In [54]:

words

Out[54]:

['a', 'slow', 'blue', 'fox', 'jumps']

Extending¶

In [55]:

words.extend(['over', 'the', 'lazy', 'dog'])

In [56]:

words

Out[56]:

['a', 'slow', 'blue', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Concatenation¶

In [57]:

[1,2] + [3,4] + [5,6]

Out[57]:

[1, 2, 3, 4, 5, 6]

List methods¶

In [58]:

seq

Out[58]:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [59]:

seq.reverse()

In [60]:

seq

Out[60]:

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

In [61]:

seq.sort()

In [62]:

seq

Out[62]:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [63]:

seq.extend([5,5,5])

In [64]:

seq

Out[64]:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 5, 5]

In [65]:

seq.count(5)

Out[65]:

Conversion¶

In [66]:

paul2 = list(paul)

In [67]:

paul2

Out[67]:

['Paul',
 'Graham',
 45,
 'paul.graham@duke.edu',
 3.8,
 subject(major='Mathematics', minor='Statistics')]

Dictionaries¶

Use keys rather than positions to index

3 common ways of creating dictionaries¶

In [68]:

colors = {
    'red': (255,0,0),
    'green': (0,255,0),
    'blue': (0,0,255)
}

In [69]:

colors

Out[69]:

{'blue': (0, 0, 255), 'green': (0, 255, 0), 'red': (255, 0, 0)}

In [70]:

colors = dict(red=(255,0), green=(0,255,0), blue=(0,0,255))

In [71]:

colors

Out[71]:

{'blue': (0, 0, 255), 'green': (0, 255, 0), 'red': (255, 0)}

In [72]:

names = ['red', 'green', 'blue']
values = [(255,0,0), (0,255,0), (0,0,255)]

colors = dict(zip(names, values))

In [73]:

colors

Out[73]:

{'blue': (0, 0, 255), 'green': (0, 255, 0), 'red': (255, 0, 0)}

Indexing¶

In [74]:

colors['blue']

Out[74]:

(0, 0, 255)

Keys and values¶

In [75]:

colors.keys()

Out[75]:

dict_keys(['red', 'green', 'blue'])

In [76]:

colors.values()

Out[76]:

dict_values([(255, 0, 0), (0, 255, 0), (0, 0, 255)])

Adding to a dictionary¶

In [77]:

colors['black'] = (0,0,0)

In [78]:

colors

Out[78]:

{'black': (0, 0, 0),
 'blue': (0, 0, 255),
 'green': (0, 255, 0),
 'red': (255, 0, 0)}

Merging dictionaries¶

Note that a key with the same name in the second dictionary will clobber the key in the first dictionary without warning.

In [79]:

more_colors = {
    'gray': ( 128,128,128),
    'maroon': (128,0,0),
    'black': (255, 255, 255)
}

In [80]:

colors.update(more_colors)

In [81]:

colors

Out[81]:

{'black': (255, 255, 255),
 'blue': (0, 0, 255),
 'gray': (128, 128, 128),
 'green': (0, 255, 0),
 'maroon': (128, 0, 0),
 'red': (255, 0, 0)}

Counters¶

A counter is a special kind of dictionary for counting things.

In [82]:

from collections import Counter

In [83]:

dna = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'

In [84]:

c = Counter(dna)

In [85]:

Out[85]:

Counter({'A': 20, 'C': 12, 'G': 17, 'T': 21})

In [86]:

c.most_common(2)

Out[86]:

[('T', 21), ('A', 20)]

In [87]:

words = 'the the the quick brown fox jumps over the quick brown dog'.split()

In [88]:

wc = Counter(words)

In [89]:

wc.most_common(2)

Out[89]:

[('the', 4), ('quick', 2)]

In [90]:

wc['brown']

Out[90]:

Sets¶

A set corresponds to the mathematical idea of a set.

In [91]:

s1 = set(words)

In [92]:

s1

Out[92]:

{'brown', 'dog', 'fox', 'jumps', 'over', 'quick', 'the'}

In [93]:

s2 = set('what does the fox say'.split())

In [94]:

s2

Out[94]:

{'does', 'fox', 'say', 'the', 'what'}

In [95]:

s1 | s2

Out[95]:

{'brown', 'does', 'dog', 'fox', 'jumps', 'over', 'quick', 'say', 'the', 'what'}

In [96]:

s1.union(s2)

Out[96]:

{'brown', 'does', 'dog', 'fox', 'jumps', 'over', 'quick', 'say', 'the', 'what'}

In [97]:

s1 & s2

Out[97]:

{'fox', 'the'}

In [98]:

s1.intersection(s2)

Out[98]:

{'fox', 'the'}

In [99]:

s1 - s2

Out[99]:

{'brown', 'dog', 'jumps', 'over', 'quick'}

In [100]:

s1.difference(s2)

Out[100]:

{'brown', 'dog', 'jumps', 'over', 'quick'}

In [101]:

s2 - s1

Out[101]:

{'does', 'say', 'what'}

In [102]:

s2.difference(s1)

Out[102]:

{'does', 'say', 'what'}

Deques¶

Double ended queues. Sometimes you need to frequently add or remove items from both the left and right ends (list only lets you make changes to the right end).

In [103]:

from collections import deque

In [104]:

seq

Out[104]:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 5, 5]

In [105]:

deque(seq)

Out[105]:

deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 5, 5])

Limited capacity deque¶

A queue shows first-in-first-out behavior.

In [106]:

dq = deque(seq, 5)
dq

Out[106]:

deque([8, 9, 5, 5, 5])

Deque methods¶

In [107]:

dq.append(10)
dq

Out[107]:

deque([9, 5, 5, 5, 10])

In [108]:

dq.appendleft(11)
dq

Out[108]:

deque([11, 9, 5, 5, 5])

In [109]:

dq.pop()

Out[109]:

In [110]:

dq

Out[110]:

deque([11, 9, 5, 5])

In [111]:

dq.popleft()

Out[111]:

In [112]:

dq

Out[112]:

deque([9, 5, 5])

In [113]:

dq.rotate()

In [114]:

dq

Out[114]:

deque([5, 9, 5])

Heap or Priority Queue¶

Some algorithms require adding and removing items from a priority queue. A heap or priority queue is a data structure that always generates the lowest priority item even with dynamic insertions of new items.

In [115]:

import heapq

In [116]:

items = [(3, 'A'), (1, 'B'), (2, 'C'),
         (3, 'D'), (1, 'E'), (2, 'F')]

In [117]:

heapq.heapify(items)

In [118]:

items

Out[118]:

[(1, 'B'), (1, 'E'), (2, 'C'), (3, 'D'), (3, 'A'), (2, 'F')]

In [119]:

heapq.heappop(items)

Out[119]:

(1, 'B')

In [120]:

heapq.heappop(items)

Out[120]:

(1, 'E')

In [121]:

heapq.heappop(items)

Out[121]:

(2, 'C')

In [122]:

heapq.heappush(items, (2, 'G'))
heapq.heappush(items, (1, 'H'))
heapq.heappush(items, (3, 'I'))

In [123]:

heapq.heappop(items)

Out[123]:

(1, 'H')

In [124]:

heapq.heappop(items)

Out[124]:

(2, 'F')

In [125]:

heapq.heappop(items)

Out[125]:

(2, 'G')

In [126]:

items = [(3, 'A'), (1, 'B'), (2, 'C'),
         (3, 'D'), (1, 'E'), (2, 'F')]

In [127]:

heapq.nsmallest(3, items)

Out[127]:

[(1, 'B'), (1, 'E'), (2, 'C')]

In [128]:

heapq.nlargest(3, items)

Out[128]:

[(3, 'D'), (3, 'A'), (2, 'F')]

Trees¶

Trees are not available in the Python standard library, so we will use a 3rd party implementation.

In [129]:

! pip3 install --quiet anytree

In [130]:

from anytree import Node, RenderTree, Walker, PreOrderIter, PostOrderIter, LevelOrderIter
from anytree.dotexport import RenderTreeGraph

In [131]:

a = Node('a')
b = Node('b', a)
c = Node('c', a)
d = Node('d', b)
e = Node('e', b)
f = Node('f', c)
g = Node('g', c)
h = Node('h', f)

In [132]:

print(RenderTree(a))

Node('/a')
├── Node('/a/b')
│   ├── Node('/a/b/d')
│   └── Node('/a/b/e')
└── Node('/a/c')
    ├── Node('/a/c/f')
    │   └── Node('/a/c/f/h')
    └── Node('/a/c/g')

In [133]:

g = RenderTreeGraph(a, nodeattrfunc=lambda node: "shape=box")
g.to_picture('figs/tree.png')

Tree

In [134]:

c.ancestors

Out[134]:

(Node('/a'),)

In [135]:

c.descendants

Out[135]:

(Node('/a/c/f'), Node('/a/c/f/h'), Node('/a/c/g'))

In [136]:

c.parent

Out[136]:

Node('/a')

In [137]:

c.children

Out[137]:

(Node('/a/c/f'), Node('/a/c/g'))

In [138]:

w = Walker()

Find a path between two nodes¶

In [139]:

upwards, common, downwards = w.walk(e, h)

In [140]:

upwards

Out[140]:

(Node('/a/b/e'), Node('/a/b'))

In [141]:

common

Out[141]:

Node('/a')

In [142]:

downwards

Out[142]:

(Node('/a/c'), Node('/a/c/f'), Node('/a/c/f/h'))

Iteration¶

In [143]:

[node for node in LevelOrderIter(a)]

Out[143]:

[Node('/a'),
 Node('/a/b'),
 Node('/a/c'),
 Node('/a/b/d'),
 Node('/a/b/e'),
 Node('/a/c/f'),
 Node('/a/c/g'),
 Node('/a/c/f/h')]

In [144]:

[node for node in PreOrderIter(a)]

Out[144]:

[Node('/a'),
 Node('/a/b'),
 Node('/a/b/d'),
 Node('/a/b/e'),
 Node('/a/c'),
 Node('/a/c/f'),
 Node('/a/c/f/h'),
 Node('/a/c/g')]

In [145]:

[node for node in PostOrderIter(a)]

Out[145]:

[Node('/a/b/d'),
 Node('/a/b/e'),
 Node('/a/b'),
 Node('/a/c/f/h'),
 Node('/a/c/f'),
 Node('/a/c/g'),
 Node('/a/c'),
 Node('/a')]