Python: Data Collections¶
Python produces a large number of useful data structures in its standard
library, and more exotic data structures can usually be installed via
pip
from the Python Package
Index.
Algorithmic Complexity¶
It is important to be familiar with the time complexity of these algorithms.
See Complexity of Python Operations for complexity of common data structures and operations.
Strings¶
A string can behave like an immutable data collection, consisting of characters.
Creation¶
In [1]:
name = 'Rumpelsttiltskin'
Indexing¶
In [2]:
name[:3]
Out[2]:
'Rum'
In [3]:
name[-3:]
Out[3]:
'kin'
In [4]:
name[3:-3]
Out[4]:
'pelsttilts'
Concatenation¶
In [5]:
sentence = name + 'is a fairy tale by the Grimm Brothers'
In [6]:
sentence
Out[6]:
'Rumpelsttiltskinis a fairy tale by the Grimm Brothers'
String methods¶
In [7]:
sentence.split()
Out[7]:
['Rumpelsttiltskinis', 'a', 'fairy', 'tale', 'by', 'the', 'Grimm', 'Brothers']
In [8]:
'_'.join(sentence.split())
Out[8]:
'Rumpelsttiltskinis_a_fairy_tale_by_the_Grimm_Brothers'
In [9]:
sentence.upper()
Out[9]:
'RUMPELSTTILTSKINIS A FAIRY TALE BY THE GRIMM BROTHERS'
In [10]:
sentence.title()
Out[10]:
'Rumpelsttiltskinis A Fairy Tale By The Grimm Brothers'
In [11]:
sentence.count('a')
Out[11]:
3
In [12]:
sentence.replace('Brothers', 'Sisters')
Out[12]:
'Rumpelsttiltskinis a fairy tale by the Grimm Sisters'
Reversing a string¶
In [13]:
''.join(reversed(sentence))
Out[13]:
'srehtorB mmirG eht yb elat yriaf a sinikstlittslepmuR'
In [14]:
sentence[::-1]
Out[14]:
'srehtorB mmirG eht yb elat yriaf a sinikstlittslepmuR'
Multi-line strings¶
In [15]:
mls = """This is
a string with
many, many lines"""
In [16]:
mls
Out[16]:
'This is\na string with\nmany, many lines'
In [17]:
print(mls)
This is
a string with
many, many lines
Tuples¶
- Immutable
- Useful as records
- Random access
Creating¶
In [18]:
paul = ('Paul', 'Graham', 45, 'paul.graham@duke.edu', 3.8, ('Mathematics', 'Statistics'))
Indexing¶
In [19]:
paul[0]
Out[19]:
'Paul'
In [20]:
paul[-1]
Out[20]:
('Mathematics', 'Statistics')
In [21]:
paul[2:4]
Out[21]:
(45, 'paul.graham@duke.edu')
Immutable¶
In [22]:
try:
paul[2] = 47
except TypeError as e:
print(e)
'tuple' object does not support item assignment
Tuple unpacking¶
In [23]:
first, last, *stuff, gpa, (major, minor) = paul
In [24]:
first
Out[24]:
'Paul'
In [25]:
last
Out[25]:
'Graham'
In [26]:
stuff
Out[26]:
[45, 'paul.graham@duke.edu']
In [27]:
gpa
Out[27]:
3.8
In [28]:
major
Out[28]:
'Mathematics'
In [29]:
minor
Out[29]:
'Statistics'
Named tuples¶
Use named tuples for readability. Importantly, there are NO performacne differences compared with plain tuples.
Creating¶
In [30]:
from collections import namedtuple
In [31]:
Subject = namedtuple('subject', 'major minor')
Student = namedtuple('Student', 'first last age email gpa subject')
In [32]:
paul = Student('Paul', 'Graham', 45, 'paul.graham@duke.edu', 3.8,
Subject('Mathematics', 'Statistics'))
In [33]:
paul
Out[33]:
Student(first='Paul', last='Graham', age=45, email='paul.graham@duke.edu', gpa=3.8, subject=subject(major='Mathematics', minor='Statistics'))
Indexing¶
In [34]:
paul.first
Out[34]:
'Paul'
In [35]:
paul[0]
Out[35]:
'Paul'
In [36]:
paul.last
Out[36]:
'Graham'
In [37]:
paul.subject
Out[37]:
subject(major='Mathematics', minor='Statistics')
In [38]:
paul.subject.major
Out[38]:
'Mathematics'
Lists¶
- Similar to tuples
- Mutable
- Often (but not necessarily) used to maintain collections of the same data type
- Workhorse of general Python data processing
Creating¶
In [39]:
words = ['the', 'quick', 'brown', 'fox']
In [40]:
words
Out[40]:
['the', 'quick', 'brown', 'fox']
Indexing and unpacking¶
Same as for tuples.
In [41]:
words[-2:]
Out[41]:
['brown', 'fox']
In [42]:
first, *rest, last = words
In [43]:
first, last
Out[43]:
('the', 'fox')
In [44]:
seq = list(range(10))
In [45]:
seq
Out[45]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [46]:
seq[1::2]
Out[46]:
[1, 3, 5, 7, 9]
In [47]:
seq[::-2]
Out[47]:
[9, 7, 5, 3, 1]
Modifying lists¶
As lists are mutable, they can be modified.
In [48]:
words[2]
Out[48]:
'brown'
In [49]:
words[2] = 'blue'
In [50]:
words
Out[50]:
['the', 'quick', 'blue', 'fox']
In [51]:
words[:2] = 'a', 'slow'
In [52]:
words
Out[52]:
['a', 'slow', 'blue', 'fox']
Appending¶
In [53]:
words.append('jumps')
In [54]:
words
Out[54]:
['a', 'slow', 'blue', 'fox', 'jumps']
Extending¶
In [55]:
words.extend(['over', 'the', 'lazy', 'dog'])
In [56]:
words
Out[56]:
['a', 'slow', 'blue', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
List methods¶
In [58]:
seq
Out[58]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [59]:
seq.reverse()
In [60]:
seq
Out[60]:
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
In [61]:
seq.sort()
In [62]:
seq
Out[62]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [63]:
seq.extend([5,5,5])
In [64]:
seq
Out[64]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 5, 5]
In [65]:
seq.count(5)
Out[65]:
4
Conversion¶
In [66]:
paul2 = list(paul)
In [67]:
paul2
Out[67]:
['Paul',
'Graham',
45,
'paul.graham@duke.edu',
3.8,
subject(major='Mathematics', minor='Statistics')]
Dictionaries¶
- Use keys rather than positions to index
3 common ways of creating dictionaries¶
In [68]:
colors = {
'red': (255,0,0),
'green': (0,255,0),
'blue': (0,0,255)
}
In [69]:
colors
Out[69]:
{'blue': (0, 0, 255), 'green': (0, 255, 0), 'red': (255, 0, 0)}
In [70]:
colors = dict(red=(255,0), green=(0,255,0), blue=(0,0,255))
In [71]:
colors
Out[71]:
{'blue': (0, 0, 255), 'green': (0, 255, 0), 'red': (255, 0)}
In [72]:
names = ['red', 'green', 'blue']
values = [(255,0,0), (0,255,0), (0,0,255)]
colors = dict(zip(names, values))
In [73]:
colors
Out[73]:
{'blue': (0, 0, 255), 'green': (0, 255, 0), 'red': (255, 0, 0)}
Keys and values¶
In [75]:
colors.keys()
Out[75]:
dict_keys(['red', 'green', 'blue'])
In [76]:
colors.values()
Out[76]:
dict_values([(255, 0, 0), (0, 255, 0), (0, 0, 255)])
Adding to a dictionary¶
In [77]:
colors['black'] = (0,0,0)
In [78]:
colors
Out[78]:
{'black': (0, 0, 0),
'blue': (0, 0, 255),
'green': (0, 255, 0),
'red': (255, 0, 0)}
Merging dictionaries¶
Note that a key with the same name in the second dictionary will clobber the key in the first dictionary without warning.
In [79]:
more_colors = {
'gray': ( 128,128,128),
'maroon': (128,0,0),
'black': (255, 255, 255)
}
In [80]:
colors.update(more_colors)
In [81]:
colors
Out[81]:
{'black': (255, 255, 255),
'blue': (0, 0, 255),
'gray': (128, 128, 128),
'green': (0, 255, 0),
'maroon': (128, 0, 0),
'red': (255, 0, 0)}
Counters¶
A counter is a special kind of dictionary for counting things.
In [82]:
from collections import Counter
In [83]:
dna = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
In [84]:
c = Counter(dna)
In [85]:
c
Out[85]:
Counter({'A': 20, 'C': 12, 'G': 17, 'T': 21})
In [86]:
c.most_common(2)
Out[86]:
[('T', 21), ('A', 20)]
In [87]:
words = 'the the the quick brown fox jumps over the quick brown dog'.split()
In [88]:
wc = Counter(words)
In [89]:
wc.most_common(2)
Out[89]:
[('the', 4), ('quick', 2)]
In [90]:
wc['brown']
Out[90]:
2
Sets¶
A set corresponds to the mathematical idea of a set.
In [91]:
s1 = set(words)
In [92]:
s1
Out[92]:
{'brown', 'dog', 'fox', 'jumps', 'over', 'quick', 'the'}
In [93]:
s2 = set('what does the fox say'.split())
In [94]:
s2
Out[94]:
{'does', 'fox', 'say', 'the', 'what'}
In [95]:
s1 | s2
Out[95]:
{'brown', 'does', 'dog', 'fox', 'jumps', 'over', 'quick', 'say', 'the', 'what'}
In [96]:
s1.union(s2)
Out[96]:
{'brown', 'does', 'dog', 'fox', 'jumps', 'over', 'quick', 'say', 'the', 'what'}
In [97]:
s1 & s2
Out[97]:
{'fox', 'the'}
In [98]:
s1.intersection(s2)
Out[98]:
{'fox', 'the'}
In [99]:
s1 - s2
Out[99]:
{'brown', 'dog', 'jumps', 'over', 'quick'}
In [100]:
s1.difference(s2)
Out[100]:
{'brown', 'dog', 'jumps', 'over', 'quick'}
In [101]:
s2 - s1
Out[101]:
{'does', 'say', 'what'}
In [102]:
s2.difference(s1)
Out[102]:
{'does', 'say', 'what'}
Deques¶
Double ended queues. Sometimes you need to frequently add or remove items from both the left and right ends (list only lets you make changes to the right end).
In [103]:
from collections import deque
In [104]:
seq
Out[104]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 5, 5]
In [105]:
deque(seq)
Out[105]:
deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 5, 5])
Limited capacity deque¶
A queue shows first-in-first-out behavior.
In [106]:
dq = deque(seq, 5)
dq
Out[106]:
deque([8, 9, 5, 5, 5])
Deque methods¶
In [107]:
dq.append(10)
dq
Out[107]:
deque([9, 5, 5, 5, 10])
In [108]:
dq.appendleft(11)
dq
Out[108]:
deque([11, 9, 5, 5, 5])
In [109]:
dq.pop()
Out[109]:
5
In [110]:
dq
Out[110]:
deque([11, 9, 5, 5])
In [111]:
dq.popleft()
Out[111]:
11
In [112]:
dq
Out[112]:
deque([9, 5, 5])
In [113]:
dq.rotate()
In [114]:
dq
Out[114]:
deque([5, 9, 5])
Heap or Priority Queue¶
Some algorithms require adding and removing items from a priority queue. A heap or priority queue is a data structure that always generates the lowest priority item even with dynamic insertions of new items.
In [115]:
import heapq
In [116]:
items = [(3, 'A'), (1, 'B'), (2, 'C'),
(3, 'D'), (1, 'E'), (2, 'F')]
In [117]:
heapq.heapify(items)
In [118]:
items
Out[118]:
[(1, 'B'), (1, 'E'), (2, 'C'), (3, 'D'), (3, 'A'), (2, 'F')]
In [119]:
heapq.heappop(items)
Out[119]:
(1, 'B')
In [120]:
heapq.heappop(items)
Out[120]:
(1, 'E')
In [121]:
heapq.heappop(items)
Out[121]:
(2, 'C')
In [122]:
heapq.heappush(items, (2, 'G'))
heapq.heappush(items, (1, 'H'))
heapq.heappush(items, (3, 'I'))
In [123]:
heapq.heappop(items)
Out[123]:
(1, 'H')
In [124]:
heapq.heappop(items)
Out[124]:
(2, 'F')
In [125]:
heapq.heappop(items)
Out[125]:
(2, 'G')
In [126]:
items = [(3, 'A'), (1, 'B'), (2, 'C'),
(3, 'D'), (1, 'E'), (2, 'F')]
In [127]:
heapq.nsmallest(3, items)
Out[127]:
[(1, 'B'), (1, 'E'), (2, 'C')]
In [128]:
heapq.nlargest(3, items)
Out[128]:
[(3, 'D'), (3, 'A'), (2, 'F')]
Trees¶
Trees are not available in the Python standard library, so we will use a 3rd party implementation.
In [129]:
! pip3 install --quiet anytree
In [130]:
from anytree import Node, RenderTree, Walker, PreOrderIter, PostOrderIter, LevelOrderIter
from anytree.dotexport import RenderTreeGraph
In [131]:
a = Node('a')
b = Node('b', a)
c = Node('c', a)
d = Node('d', b)
e = Node('e', b)
f = Node('f', c)
g = Node('g', c)
h = Node('h', f)
In [132]:
print(RenderTree(a))
Node('/a')
├── Node('/a/b')
│ ├── Node('/a/b/d')
│ └── Node('/a/b/e')
└── Node('/a/c')
├── Node('/a/c/f')
│ └── Node('/a/c/f/h')
└── Node('/a/c/g')
In [133]:
g = RenderTreeGraph(a, nodeattrfunc=lambda node: "shape=box")
g.to_picture('figs/tree.png')

Tree
In [134]:
c.ancestors
Out[134]:
(Node('/a'),)
In [135]:
c.descendants
Out[135]:
(Node('/a/c/f'), Node('/a/c/f/h'), Node('/a/c/g'))
In [136]:
c.parent
Out[136]:
Node('/a')
In [137]:
c.children
Out[137]:
(Node('/a/c/f'), Node('/a/c/g'))
In [138]:
w = Walker()
Find a path between two nodes¶
In [139]:
upwards, common, downwards = w.walk(e, h)
In [140]:
upwards
Out[140]:
(Node('/a/b/e'), Node('/a/b'))
In [141]:
common
Out[141]:
Node('/a')
In [142]:
downwards
Out[142]:
(Node('/a/c'), Node('/a/c/f'), Node('/a/c/f/h'))
Iteration¶
In [143]:
[node for node in LevelOrderIter(a)]
Out[143]:
[Node('/a'),
Node('/a/b'),
Node('/a/c'),
Node('/a/b/d'),
Node('/a/b/e'),
Node('/a/c/f'),
Node('/a/c/g'),
Node('/a/c/f/h')]
In [144]:
[node for node in PreOrderIter(a)]
Out[144]:
[Node('/a'),
Node('/a/b'),
Node('/a/b/d'),
Node('/a/b/e'),
Node('/a/c'),
Node('/a/c/f'),
Node('/a/c/f/h'),
Node('/a/c/g')]
In [145]:
[node for node in PostOrderIter(a)]
Out[145]:
[Node('/a/b/d'),
Node('/a/b/e'),
Node('/a/b'),
Node('/a/c/f/h'),
Node('/a/c/f'),
Node('/a/c/g'),
Node('/a/c'),
Node('/a')]