Python: Data Collections

Python produces a large number of useful data structures in its standard library, and more exotic data structures can usually be installed via pip from the Python Package Index.

Algorithmic Complexity

It is important to be familiar with the time complexity of these algorithms.

See Complexity of Python Operations for complexity of common data structures and operations.

Strings

A string can behave like an immutable data collection, consisting of characters.

Creation

In [1]:
name = 'Rumpelsttiltskin'

Indexing

In [2]:
name[:3]
Out[2]:
'Rum'
In [3]:
name[-3:]
Out[3]:
'kin'
In [4]:
name[3:-3]
Out[4]:
'pelsttilts'

Concatenation

In [5]:
sentence = name + 'is a fairy tale by the Grimm Brothers'
In [6]:
sentence
Out[6]:
'Rumpelsttiltskinis a fairy tale by the Grimm Brothers'

String methods

In [7]:
sentence.split()
Out[7]:
['Rumpelsttiltskinis', 'a', 'fairy', 'tale', 'by', 'the', 'Grimm', 'Brothers']
In [8]:
'_'.join(sentence.split())
Out[8]:
'Rumpelsttiltskinis_a_fairy_tale_by_the_Grimm_Brothers'
In [9]:
sentence.upper()
Out[9]:
'RUMPELSTTILTSKINIS A FAIRY TALE BY THE GRIMM BROTHERS'
In [10]:
sentence.title()
Out[10]:
'Rumpelsttiltskinis A Fairy Tale By The Grimm Brothers'
In [11]:
sentence.count('a')
Out[11]:
3
In [12]:
sentence.replace('Brothers', 'Sisters')
Out[12]:
'Rumpelsttiltskinis a fairy tale by the Grimm Sisters'

Reversing a string

In [13]:
''.join(reversed(sentence))
Out[13]:
'srehtorB mmirG eht yb elat yriaf a sinikstlittslepmuR'
In [14]:
sentence[::-1]
Out[14]:
'srehtorB mmirG eht yb elat yriaf a sinikstlittslepmuR'

Multi-line strings

In [15]:
mls = """This is
a string with
many, many lines"""
In [16]:
mls
Out[16]:
'This is\na string with\nmany, many lines'
In [17]:
print(mls)
This is
a string with
many, many lines

Tuples

  • Immutable
  • Useful as records
  • Random access

Creating

In [18]:
paul = ('Paul', 'Graham', 45, 'paul.graham@duke.edu', 3.8, ('Mathematics', 'Statistics'))

Indexing

In [19]:
paul[0]
Out[19]:
'Paul'
In [20]:
paul[-1]
Out[20]:
('Mathematics', 'Statistics')
In [21]:
paul[2:4]
Out[21]:
(45, 'paul.graham@duke.edu')

Immutable

In [22]:
try:
    paul[2] = 47
except TypeError as e:
    print(e)
'tuple' object does not support item assignment

Tuple unpacking

In [23]:
first, last, *stuff, gpa, (major, minor) = paul
In [24]:
first
Out[24]:
'Paul'
In [25]:
last
Out[25]:
'Graham'
In [26]:
stuff
Out[26]:
[45, 'paul.graham@duke.edu']
In [27]:
gpa
Out[27]:
3.8
In [28]:
major
Out[28]:
'Mathematics'
In [29]:
minor
Out[29]:
'Statistics'

Named tuples

Use named tuples for readability. Importantly, there are NO performacne differences compared with plain tuples.

Creating

In [30]:
from collections import namedtuple
In [31]:
Subject = namedtuple('subject', 'major minor')
Student = namedtuple('Student', 'first last age email gpa subject')
In [32]:
paul = Student('Paul', 'Graham', 45, 'paul.graham@duke.edu', 3.8,
               Subject('Mathematics', 'Statistics'))
In [33]:
paul
Out[33]:
Student(first='Paul', last='Graham', age=45, email='paul.graham@duke.edu', gpa=3.8, subject=subject(major='Mathematics', minor='Statistics'))

Indexing

In [34]:
paul.first
Out[34]:
'Paul'
In [35]:
paul[0]
Out[35]:
'Paul'
In [36]:
paul.last
Out[36]:
'Graham'
In [37]:
paul.subject
Out[37]:
subject(major='Mathematics', minor='Statistics')
In [38]:
paul.subject.major
Out[38]:
'Mathematics'

Lists

  • Similar to tuples
  • Mutable
  • Often (but not necessarily) used to maintain collections of the same data type
  • Workhorse of general Python data processing

Creating

In [39]:
words = ['the', 'quick', 'brown', 'fox']
In [40]:
words
Out[40]:
['the', 'quick', 'brown', 'fox']

Indexing and unpacking

Same as for tuples.

In [41]:
words[-2:]
Out[41]:
['brown', 'fox']
In [42]:
first, *rest, last = words
In [43]:
first, last
Out[43]:
('the', 'fox')
In [44]:
seq = list(range(10))
In [45]:
seq
Out[45]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [46]:
seq[1::2]
Out[46]:
[1, 3, 5, 7, 9]
In [47]:
seq[::-2]
Out[47]:
[9, 7, 5, 3, 1]

Modifying lists

As lists are mutable, they can be modified.

In [48]:
words[2]
Out[48]:
'brown'
In [49]:
words[2] = 'blue'
In [50]:
words
Out[50]:
['the', 'quick', 'blue', 'fox']
In [51]:
words[:2] = 'a', 'slow'
In [52]:
words
Out[52]:
['a', 'slow', 'blue', 'fox']

Appending

In [53]:
words.append('jumps')
In [54]:
words
Out[54]:
['a', 'slow', 'blue', 'fox', 'jumps']

Extending

In [55]:
words.extend(['over', 'the', 'lazy', 'dog'])
In [56]:
words
Out[56]:
['a', 'slow', 'blue', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Concatenation

In [57]:
[1,2] + [3,4] + [5,6]
Out[57]:
[1, 2, 3, 4, 5, 6]

List methods

In [58]:
seq
Out[58]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [59]:
seq.reverse()
In [60]:
seq
Out[60]:
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
In [61]:
seq.sort()
In [62]:
seq
Out[62]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [63]:
seq.extend([5,5,5])
In [64]:
seq
Out[64]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 5, 5]
In [65]:
seq.count(5)
Out[65]:
4

Conversion

In [66]:
paul2 = list(paul)
In [67]:
paul2
Out[67]:
['Paul',
 'Graham',
 45,
 'paul.graham@duke.edu',
 3.8,
 subject(major='Mathematics', minor='Statistics')]

Dictionaries

  • Use keys rather than positions to index

3 common ways of creating dictionaries

In [68]:
colors = {
    'red': (255,0,0),
    'green': (0,255,0),
    'blue': (0,0,255)
}
In [69]:
colors
Out[69]:
{'blue': (0, 0, 255), 'green': (0, 255, 0), 'red': (255, 0, 0)}
In [70]:
colors = dict(red=(255,0), green=(0,255,0), blue=(0,0,255))
In [71]:
colors
Out[71]:
{'blue': (0, 0, 255), 'green': (0, 255, 0), 'red': (255, 0)}
In [72]:
names = ['red', 'green', 'blue']
values = [(255,0,0), (0,255,0), (0,0,255)]

colors = dict(zip(names, values))
In [73]:
colors
Out[73]:
{'blue': (0, 0, 255), 'green': (0, 255, 0), 'red': (255, 0, 0)}

Indexing

In [74]:
colors['blue']
Out[74]:
(0, 0, 255)

Keys and values

In [75]:
colors.keys()
Out[75]:
dict_keys(['red', 'green', 'blue'])
In [76]:
colors.values()
Out[76]:
dict_values([(255, 0, 0), (0, 255, 0), (0, 0, 255)])

Adding to a dictionary

In [77]:
colors['black'] = (0,0,0)
In [78]:
colors
Out[78]:
{'black': (0, 0, 0),
 'blue': (0, 0, 255),
 'green': (0, 255, 0),
 'red': (255, 0, 0)}

Merging dictionaries

Note that a key with the same name in the second dictionary will clobber the key in the first dictionary without warning.

In [79]:
more_colors = {
    'gray': ( 128,128,128),
    'maroon': (128,0,0),
    'black': (255, 255, 255)
}
In [80]:
colors.update(more_colors)
In [81]:
colors
Out[81]:
{'black': (255, 255, 255),
 'blue': (0, 0, 255),
 'gray': (128, 128, 128),
 'green': (0, 255, 0),
 'maroon': (128, 0, 0),
 'red': (255, 0, 0)}

Counters

A counter is a special kind of dictionary for counting things.

In [82]:
from collections import Counter
In [83]:
dna = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
In [84]:
c = Counter(dna)
In [85]:
c
Out[85]:
Counter({'A': 20, 'C': 12, 'G': 17, 'T': 21})
In [86]:
c.most_common(2)
Out[86]:
[('T', 21), ('A', 20)]
In [87]:
words = 'the the the quick brown fox jumps over the quick brown dog'.split()
In [88]:
wc = Counter(words)
In [89]:
wc.most_common(2)
Out[89]:
[('the', 4), ('quick', 2)]
In [90]:
wc['brown']
Out[90]:
2

Sets

A set corresponds to the mathematical idea of a set.

In [91]:
s1 = set(words)
In [92]:
s1
Out[92]:
{'brown', 'dog', 'fox', 'jumps', 'over', 'quick', 'the'}
In [93]:
s2 = set('what does the fox say'.split())
In [94]:
s2
Out[94]:
{'does', 'fox', 'say', 'the', 'what'}
In [95]:
s1 | s2
Out[95]:
{'brown', 'does', 'dog', 'fox', 'jumps', 'over', 'quick', 'say', 'the', 'what'}
In [96]:
s1.union(s2)
Out[96]:
{'brown', 'does', 'dog', 'fox', 'jumps', 'over', 'quick', 'say', 'the', 'what'}
In [97]:
s1 & s2
Out[97]:
{'fox', 'the'}
In [98]:
s1.intersection(s2)
Out[98]:
{'fox', 'the'}
In [99]:
s1 - s2
Out[99]:
{'brown', 'dog', 'jumps', 'over', 'quick'}
In [100]:
s1.difference(s2)
Out[100]:
{'brown', 'dog', 'jumps', 'over', 'quick'}
In [101]:
s2 - s1
Out[101]:
{'does', 'say', 'what'}
In [102]:
s2.difference(s1)
Out[102]:
{'does', 'say', 'what'}

Deques

Double ended queues. Sometimes you need to frequently add or remove items from both the left and right ends (list only lets you make changes to the right end).

In [103]:
from collections import deque
In [104]:
seq
Out[104]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 5, 5]
In [105]:
deque(seq)
Out[105]:
deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 5, 5])

Limited capacity deque

A queue shows first-in-first-out behavior.

In [106]:
dq = deque(seq, 5)
dq
Out[106]:
deque([8, 9, 5, 5, 5])

Deque methods

In [107]:
dq.append(10)
dq
Out[107]:
deque([9, 5, 5, 5, 10])
In [108]:
dq.appendleft(11)
dq
Out[108]:
deque([11, 9, 5, 5, 5])
In [109]:
dq.pop()
Out[109]:
5
In [110]:
dq
Out[110]:
deque([11, 9, 5, 5])
In [111]:
dq.popleft()
Out[111]:
11
In [112]:
dq
Out[112]:
deque([9, 5, 5])
In [113]:
dq.rotate()
In [114]:
dq
Out[114]:
deque([5, 9, 5])

Heap or Priority Queue

Some algorithms require adding and removing items from a priority queue. A heap or priority queue is a data structure that always generates the lowest priority item even with dynamic insertions of new items.

In [115]:
import heapq
In [116]:
items = [(3, 'A'), (1, 'B'), (2, 'C'),
         (3, 'D'), (1, 'E'), (2, 'F')]
In [117]:
heapq.heapify(items)
In [118]:
items
Out[118]:
[(1, 'B'), (1, 'E'), (2, 'C'), (3, 'D'), (3, 'A'), (2, 'F')]
In [119]:
heapq.heappop(items)
Out[119]:
(1, 'B')
In [120]:
heapq.heappop(items)
Out[120]:
(1, 'E')
In [121]:
heapq.heappop(items)
Out[121]:
(2, 'C')
In [122]:
heapq.heappush(items, (2, 'G'))
heapq.heappush(items, (1, 'H'))
heapq.heappush(items, (3, 'I'))
In [123]:
heapq.heappop(items)
Out[123]:
(1, 'H')
In [124]:
heapq.heappop(items)
Out[124]:
(2, 'F')
In [125]:
heapq.heappop(items)
Out[125]:
(2, 'G')
In [126]:
items = [(3, 'A'), (1, 'B'), (2, 'C'),
         (3, 'D'), (1, 'E'), (2, 'F')]
In [127]:
heapq.nsmallest(3, items)
Out[127]:
[(1, 'B'), (1, 'E'), (2, 'C')]
In [128]:
heapq.nlargest(3, items)
Out[128]:
[(3, 'D'), (3, 'A'), (2, 'F')]

Trees

Trees are not available in the Python standard library, so we will use a 3rd party implementation.

In [129]:
! pip3 install --quiet anytree
In [130]:
from anytree import Node, RenderTree, Walker, PreOrderIter, PostOrderIter, LevelOrderIter
from anytree.dotexport import RenderTreeGraph
In [131]:
a = Node('a')
b = Node('b', a)
c = Node('c', a)
d = Node('d', b)
e = Node('e', b)
f = Node('f', c)
g = Node('g', c)
h = Node('h', f)
In [132]:
print(RenderTree(a))
Node('/a')
├── Node('/a/b')
│   ├── Node('/a/b/d')
│   └── Node('/a/b/e')
└── Node('/a/c')
    ├── Node('/a/c/f')
    │   └── Node('/a/c/f/h')
    └── Node('/a/c/g')
In [133]:
g = RenderTreeGraph(a, nodeattrfunc=lambda node: "shape=box")
g.to_picture('figs/tree.png')
Tree

Tree

In [134]:
c.ancestors
Out[134]:
(Node('/a'),)
In [135]:
c.descendants
Out[135]:
(Node('/a/c/f'), Node('/a/c/f/h'), Node('/a/c/g'))
In [136]:
c.parent
Out[136]:
Node('/a')
In [137]:
c.children
Out[137]:
(Node('/a/c/f'), Node('/a/c/g'))
In [138]:
w = Walker()

Find a path between two nodes

In [139]:
upwards, common, downwards = w.walk(e, h)
In [140]:
upwards
Out[140]:
(Node('/a/b/e'), Node('/a/b'))
In [141]:
common
Out[141]:
Node('/a')
In [142]:
downwards
Out[142]:
(Node('/a/c'), Node('/a/c/f'), Node('/a/c/f/h'))

Iteration

In [143]:
[node for node in LevelOrderIter(a)]
Out[143]:
[Node('/a'),
 Node('/a/b'),
 Node('/a/c'),
 Node('/a/b/d'),
 Node('/a/b/e'),
 Node('/a/c/f'),
 Node('/a/c/g'),
 Node('/a/c/f/h')]
In [144]:
[node for node in PreOrderIter(a)]
Out[144]:
[Node('/a'),
 Node('/a/b'),
 Node('/a/b/d'),
 Node('/a/b/e'),
 Node('/a/c'),
 Node('/a/c/f'),
 Node('/a/c/f/h'),
 Node('/a/c/g')]
In [145]:
[node for node in PostOrderIter(a)]
Out[145]:
[Node('/a/b/d'),
 Node('/a/b/e'),
 Node('/a/b'),
 Node('/a/c/f/h'),
 Node('/a/c/f'),
 Node('/a/c/g'),
 Node('/a/c'),
 Node('/a')]