Bonus Material: Word count¶

The word count problem is the ‘Hello world’ equivalent of distributed programming. Word count is also the basic process by which text is converted into features for text mining and topic modeling. We show a variety of ways to solve the word count problem in Python to familiarize you with different coding approaches.

In [1]:

text = ''''Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe;
     All mimsy were the borogoves,
      And the mome raths outgrabe.

     'Beware the Jabberwock, my son!
      The jaws that bite, the claws that catch!
     Beware the Jubjub bird, and shun
      The frumious Bandersnatch!'

     He took his vorpal sword in hand:
      Long time the manxome foe he sought--
     So rested he by the Tumtum tree,
      And stood awhile in thought.

     And as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
     Came whiffling through the tulgey wood,
      And burbled as it came!

     One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
     He left it dead, and with its head
      He went galumphing back.

     'And hast thou slain the Jabberwock?
      Come to my arms, my beamish boy!
     O frabjous day! Callooh! Callay!'
      He chortled in his joy.

     'Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe;
     All mimsy were the borogoves,
      And the mome raths outgrabe.'''

Convert to list of words¶

In [2]:

import string

table = dict.fromkeys(map(ord, string.punctuation))
words = text.translate(table).strip().lower().split()

In [3]:

words[:10]

Out[3]:

['twas',
 'brillig',
 'and',
 'the',
 'slithy',
 'toves',
 'did',
 'gyre',
 'and',
 'gimble']

Slower version without translate¶

In [4]:

for char in string.punctuation:
    text = text.replace(char, '')
words2 = text.strip().lower().split()

In [5]:

words2[:10]

Out[5]:

['twas',
 'brillig',
 'and',
 'the',
 'slithy',
 'toves',
 'did',
 'gyre',
 'and',
 'gimble']

Using a regular dictionary¶

In [6]:

c1 = {}
for word in words:
    c1[word] = c1.get(word, 0) + 1

In [7]:

sorted(c1.items(), key=lambda x: x[1], reverse=True)[:3]

Out[7]:

[('the', 19), ('and', 14), ('he', 7)]

Using a default dictionary¶

In [8]:

from collections import defaultdict

c2 = defaultdict(int)
for word in words:
    c2[word] += 1

In [9]:

sorted(c2.items(), key=lambda x: x[1], reverse=True)[:3]

Out[9]:

[('the', 19), ('and', 14), ('he', 7)]

Using a Counter¶

In [10]:

from collections import Counter

c3 = Counter(words)

In [11]:

c3.most_common(3)

Out[11]:

[('the', 19), ('and', 14), ('he', 7)]

Using third party function¶

In [12]:

from toolz import frequencies

c4 = frequencies(words)

In [13]:

sorted(c4.items(), key=lambda x: x[1], reverse=True)[:3]

Out[13]:

[('the', 19), ('and', 14), ('he', 7)]

Counting without dictionaries¶

In [14]:

from itertools import groupby

c5 = map(lambda x: (x[0], sum(1 for item in x[1])),
         groupby(sorted(words)))

In [15]:

sorted(c5, key=lambda x: x[1], reverse=True)[:3]

Out[15]:

[('the', 19), ('and', 14), ('he', 7)]

Vectorized version¶

In [16]:

import numpy as np

In [17]:

values, counts = np.unique(words, return_counts=True)

In [18]:

c6 = dict(zip(values, counts))

In [19]:

sorted(c6.items(), key=lambda x: x[1], reverse=True)[:3]

Out[19]:

[('the', 19), ('and', 14), ('he', 7)]