{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python: Text Data\n", "\n", "The process of cleaning data for analysis often requires working with text, for example, to correct typos, convert to standard nomenclature and resolve ambiguous labels. In some statistical fields that deal with (say) processing electronic medical records, information science or recommendations based on user feedback, text must be processed before analysis - for example, by converting to a bag of words.\n", "\n", "We will use a whimsical example to illustrate Python tools for *munging* text data using string methods and regular expressions. Finally, we will see how to format text data for reporting. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get \"Through the Looking Glass\" \n", "---" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import requests\n", "\n", "try:\n", " with open('looking_glass.txt') as f:\n", " text = f.read()\n", "except IOError:\n", " url = 'http://www.gutenberg.org/cache/epub/12/pg12.txt'\n", " res = requests.get(url)\n", " text = res.text\n", " with open('looking_glass.txt', 'w') as f:\n", " f.write(str(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slice to get Jabberwocky\n", "----" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "start = text.find('JABBERWOCKY')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"JABBERWOCKY\\r\\n\\r\\n 'Twas brillig, and the slithy toves\\r\\n Did gyre and gimble in the wabe;\\r\\n All mimsy were the borogoves,\\r\\n And the mome raths outgrabe.\\r\\n\\r\\n 'Beware the Jabberwock, my son!\\r\\n The jaws that bite, the claws that catch!\\r\\n Beware the Jubjub bird, and shun\\r\\n The frumious Bandersnatch!'\\r\\n\\r\\n He took his vorpal sword in hand:\\r\\n Long time the manxome foe he sought--\\r\\n So rested he by the Tumtum tree,\\r\\n And stood awhile in thought.\\r\\n\\r\\n And as in uffish thought he stood,\\r\\n The Jabberwock, with eyes of flame,\\r\\n Came whiffling through the tulgey wood,\\r\\n And burbled as it came!\\r\\n\\r\\n One, two! One, two! And through and through\\r\\n The vorpal blade went snicker-snack!\\r\\n He left it dead, and with its head\\r\\n He went galumphing back.\\r\\n\\r\\n 'And hast thou slain the Jabberwock?\\r\\n Come to my arms, my beamish boy!\\r\\n O frabjous day! Callooh! Callay!'\\r\\n He chortled in his joy.\\r\\n\\r\\n 'Twas brillig, and the slithy toves\\r\\n Did gyre and gimble in the wabe;\\r\\n All mimsy were the borogoves,\\r\\n And the mome raths outgrabe.\\r\\n\\r\\n\\r\\n'It seems very pretty,' she said when she had finished it, 'but it's\\r\\nRATHER hard to understand!' (You see she didn't like to confess, even\\r\\nto herself, that she couldn't make it out at all.) 'Somehow it seems\\r\\nto fill my head with ideas--only I don't exactly know what they are!\\r\\nHowever, SOMEBODY killed SOMETHING: that's clear, at any rate--'\\r\\n\\r\\n'But oh!' thought Alice, suddenly jumping up, 'if I don't make haste I\\r\\nshall have to go back through the Looking-glass, before I've seen what\\r\\nthe rest of the house is like! Let's have a look at the garden first!'\\r\\nShe was out of the room in a moment, and ran down stairs--or, at least,\\r\\nit wasn't exactly running, but a new invention of hers for getting down\\r\\nstairs quickly and easily, as Alice said to herself. She just kept the\\r\\ntips of her fingers on the hand-rail, and floated gently down without\\r\\neven\"" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[start:start+2000]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "end = text.find('It seems very pretty', start)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"JABBERWOCKY\\r\\n\\r\\n 'Twas brillig, and the slithy toves\\r\\n Did gyre and gimble in the wabe;\\r\\n All mimsy were the borogoves,\\r\\n And the mome raths outgrabe.\\r\\n\\r\\n 'Beware the Jabberwock, my son!\\r\\n The jaws that bite, the claws that catch!\\r\\n Beware the Jubjub bird, and shun\\r\\n The frumious Bandersnatch!'\\r\\n\\r\\n He took his vorpal sword in hand:\\r\\n Long time the manxome foe he sought--\\r\\n So rested he by the Tumtum tree,\\r\\n And stood awhile in thought.\\r\\n\\r\\n And as in uffish thought he stood,\\r\\n The Jabberwock, with eyes of flame,\\r\\n Came whiffling through the tulgey wood,\\r\\n And burbled as it came!\\r\\n\\r\\n One, two! One, two! And through and through\\r\\n The vorpal blade went snicker-snack!\\r\\n He left it dead, and with its head\\r\\n He went galumphing back.\\r\\n\\r\\n 'And hast thou slain the Jabberwock?\\r\\n Come to my arms, my beamish boy!\\r\\n O frabjous day! Callooh! Callay!'\\r\\n He chortled in his joy.\\r\\n\\r\\n 'Twas brillig, and the slithy toves\\r\\n Did gyre and gimble in the wabe;\\r\\n All mimsy were the borogoves,\\r\\n And the mome raths outgrabe.\\r\\n\\r\\n\\r\\n'\"" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poem = text[start:end]\n", "poem" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JABBERWOCKY\r\n", "\r\n", " 'Twas brillig, and the slithy toves\r\n", " Did gyre and gimble in the wabe;\r\n", " All mimsy were the borogoves,\r\n", " And the mome raths outgrabe.\r\n", "\r\n", " 'Beware the Jabberwock, my son!\r\n", " The jaws that bite, the claws that catch!\r\n", " Beware the Jubjub bird, and shun\r\n", " The frumious Bandersnatch!'\r\n", "\r\n", " He took his vorpal sword in hand:\r\n", " Long time the manxome foe he sought--\r\n", " So rested he by the Tumtum tree,\r\n", " And stood awhile in thought.\r\n", "\r\n", " And as in uffish thought he stood,\r\n", " The Jabberwock, with eyes of flame,\r\n", " Came whiffling through the tulgey wood,\r\n", " And burbled as it came!\r\n", "\r\n", " One, two! One, two! And through and through\r\n", " The vorpal blade went snicker-snack!\r\n", " He left it dead, and with its head\r\n", " He went galumphing back.\r\n", "\r\n", " 'And hast thou slain the Jabberwock?\r\n", " Come to my arms, my beamish boy!\r\n", " O frabjous day! Callooh! Callay!'\r\n", " He chortled in his joy.\r\n", "\r\n", " 'Twas brillig, and the slithy toves\r\n", " Did gyre and gimble in the wabe;\r\n", " All mimsy were the borogoves,\r\n", " And the mome raths outgrabe.\r\n", "\r\n", "\r\n", "'\n" ] } ], "source": [ "print(poem)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Jabberwocky\r\n", "\r\n", " 'Twas Brillig, And The Slithy Toves\r\n", " Did Gyre And Gimble In The Wabe;\r\n", " All Mimsy Were The Borogoves,\r\n", " And The Mome Raths Outgrabe.\r\n", "\r\n", " 'Beware The Jabberwock, My Son!\r\n", " The Jaws That Bite, The Claws That Catch!\r\n", " Beware The Jubjub Bird, And Shun\r\n", " The Frumious Bandersnatch!'\r\n", "\r\n", " He Took His Vorpal Sword In Hand:\r\n", " Long Time The Manxome Foe He Sought--\r\n", " So Rested He By The Tumtum Tree,\r\n", " And Stood Awhile In Thought.\r\n", "\r\n", " And As In Uffish Thought He Stood,\r\n", " The Jabberwock, With Eyes Of Flame,\r\n", " Came Whiffling Through The Tulgey Wood,\r\n", " And Burbled As It Came!\r\n", "\r\n", " One, Two! One, Two! And Through And Through\r\n", " The Vorpal Blade Went Snicker-Snack!\r\n", " He Left It Dead, And With Its Head\r\n", " He Went Galumphing Back.\r\n", "\r\n", " 'And Hast Thou Slain The Jabberwock?\r\n", " Come To My Arms, My Beamish Boy!\r\n", " O Frabjous Day! Callooh! Callay!'\r\n", " He Chortled In His Joy.\r\n", "\r\n", " 'Twas Brillig, And The Slithy Toves\r\n", " Did Gyre And Gimble In The Wabe;\r\n", " All Mimsy Were The Borogoves,\r\n", " And The Mome Raths Outgrabe.\r\n", "\r\n", "\r\n", "'\n" ] } ], "source": [ "print(poem.title())" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poem.count('the')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JABBERWOCKY\r\n", "\r\n", " 'Twas brillig, and XXX slithy toves\r\n", " Did gyre and gimble in XXX wabe;\r\n", " All mimsy were XXX borogoves,\r\n", " And XXX mome raths outgrabe.\r\n", "\r\n", " 'Beware XXX Jabberwock, my son!\r\n", " The jaws that bite, XXX claws that catch!\r\n", " Beware XXX Jubjub bird, and shun\r\n", " The frumious Bandersnatch!'\r\n", "\r\n", " He took his vorpal sword in hand:\r\n", " Long time XXX manxome foe he sought--\r\n", " So rested he by XXX Tumtum tree,\r\n", " And stood awhile in thought.\r\n", "\r\n", " And as in uffish thought he stood,\r\n", " The Jabberwock, with eyes of flame,\r\n", " Came whiffling through XXX tulgey wood,\r\n", " And burbled as it came!\r\n", "\r\n", " One, two! One, two! And through and through\r\n", " The vorpal blade went snicker-snack!\r\n", " He left it dead, and with its head\r\n", " He went galumphing back.\r\n", "\r\n", " 'And hast thou slain XXX Jabberwock?\r\n", " Come to my arms, my beamish boy!\r\n", " O frabjous day! Callooh! Callay!'\r\n", " He chortled in his joy.\r\n", "\r\n", " 'Twas brillig, and XXX slithy toves\r\n", " Did gyre and gimble in XXX wabe;\r\n", " All mimsy were XXX borogoves,\r\n", " And XXX mome raths outgrabe.\r\n", "\r\n", "\r\n", "'\n" ] } ], "source": [ "print(poem.replace('the', 'XXX'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find palindromic words in poem if any\n", "----" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "poem = poem.lower()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import string\n", "string.punctuation" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'jabberwocky\\r\\n\\r\\n twas brillig and the slithy toves\\r\\n did gyre and gimble in the wabe\\r\\n all mimsy were the borogoves\\r\\n and the mome raths outgrabe\\r\\n\\r\\n beware the jabberwock my son\\r\\n the jaws that bite the claws that catch\\r\\n beware the jubjub bird and shun\\r\\n the frumious bandersnatch\\r\\n\\r\\n he took his vorpal sword in hand\\r\\n long time the manxome foe he sought\\r\\n so rested he by the tumtum tree\\r\\n and stood awhile in thought\\r\\n\\r\\n and as in uffish thought he stood\\r\\n the jabberwock with eyes of flame\\r\\n came whiffling through the tulgey wood\\r\\n and burbled as it came\\r\\n\\r\\n one two one two and through and through\\r\\n the vorpal blade went snickersnack\\r\\n he left it dead and with its head\\r\\n he went galumphing back\\r\\n\\r\\n and hast thou slain the jabberwock\\r\\n come to my arms my beamish boy\\r\\n o frabjous day callooh callay\\r\\n he chortled in his joy\\r\\n\\r\\n twas brillig and the slithy toves\\r\\n did gyre and gimble in the wabe\\r\\n all mimsy were the borogoves\\r\\n and the mome raths outgrabe\\r\\n\\r\\n\\r\\n'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poem = poem.translate(dict.fromkeys(map(ord, string.punctuation)))\n", "poem" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['jabberwocky',\n", " 'twas',\n", " 'brillig',\n", " 'and',\n", " 'the',\n", " 'slithy',\n", " 'toves',\n", " 'did',\n", " 'gyre',\n", " 'and']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words = poem.split()\n", "words[:10]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def is_palindrome(word):\n", " return word == word[::-1]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'did', 'o'}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{word for word in words if is_palindrome(word)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top 10 most frequent words\n", "----" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import collections" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "poem_counter = collections.Counter(words)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the', 19),\n", " ('and', 14),\n", " ('he', 7),\n", " ('in', 6),\n", " ('jabberwock', 3),\n", " ('my', 3),\n", " ('through', 3),\n", " ('twas', 2),\n", " ('brillig', 2),\n", " ('slithy', 2)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poem_counter.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Words that appear exactly twice.\n", "----" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('twas', 2),\n", " ('brillig', 2),\n", " ('slithy', 2),\n", " ('toves', 2),\n", " ('did', 2),\n", " ('gyre', 2),\n", " ('gimble', 2),\n", " ('wabe', 2),\n", " ('all', 2),\n", " ('mimsy', 2),\n", " ('were', 2),\n", " ('borogoves', 2),\n", " ('mome', 2),\n", " ('raths', 2),\n", " ('outgrabe', 2),\n", " ('beware', 2),\n", " ('that', 2),\n", " ('his', 2),\n", " ('vorpal', 2),\n", " ('stood', 2),\n", " ('thought', 2),\n", " ('as', 2),\n", " ('with', 2),\n", " ('came', 2),\n", " ('it', 2),\n", " ('one', 2),\n", " ('two', 2),\n", " ('went', 2)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[(k, v) for (k, v) in poem_counter.items() if v==2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Trigrams\n", "----\n", "\n", "All possible sequences of 3 words in the poem." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('jabberwocky', 'twas', 'brillig'),\n", " ('twas', 'brillig', 'and'),\n", " ('brillig', 'and', 'the'),\n", " ('and', 'the', 'slithy'),\n", " ('the', 'slithy', 'toves'),\n", " ('slithy', 'toves', 'did'),\n", " ('toves', 'did', 'gyre'),\n", " ('did', 'gyre', 'and'),\n", " ('gyre', 'and', 'gimble'),\n", " ('and', 'gimble', 'in')]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(zip(words[:-2], words[1:-1], words[2:]))[:10]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import itertools as it" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def window(x, n):\n", " \"\"\"Sliding widnow of size n from iterable x.\"\"\"\n", " s = (it.islice(x, i, None) for i in range(n))\n", " return zip(*s)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('jabberwocky', 'twas', 'brillig'),\n", " ('twas', 'brillig', 'and'),\n", " ('brillig', 'and', 'the'),\n", " ('and', 'the', 'slithy'),\n", " ('the', 'slithy', 'toves'),\n", " ('slithy', 'toves', 'did'),\n", " ('toves', 'did', 'gyre'),\n", " ('did', 'gyre', 'and'),\n", " ('gyre', 'and', 'gimble'),\n", " ('and', 'gimble', 'in')]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(window(words, 3))[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find words in poem that are over-represented\n", "----" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "book = text" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "book = book.lower().translate(dict.fromkeys(map(ord, string.punctuation)))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "book_counter = collections.Counter(book.split())" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "n = sum(book_counter.values())\n", "book_freqs = {k: v/n for k, v in book_counter.items()}" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "n = sum(poem_counter.values())\n", "stats = [(k, v, book_freqs.get(k,0)*n) for k, v in poem_counter.items()]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pandas import DataFrame" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = DataFrame(stats, columns = ['word', 'observed', 'expected'])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['score'] = (df.observed-df.expected)**2/df.expected" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | word | \n", "observed | \n", "expected | \n", "score | \n", "
|---|---|---|---|---|
| 20 | \n", "jabberwock | \n", "3 | \n", "0.01557 | \n", "572.045510 | \n", "
| 19 | \n", "beware | \n", "2 | \n", "0.01038 | \n", "381.363673 | \n", "
| 36 | \n", "vorpal | \n", "2 | \n", "0.01038 | \n", "381.363673 | \n", "
| 1 | \n", "twas | \n", "2 | \n", "0.01557 | \n", "252.917766 | \n", "
| 15 | \n", "borogoves | \n", "2 | \n", "0.01557 | \n", "252.917766 | \n", "
| 90 | \n", "joy | \n", "1 | \n", "0.00519 | \n", "190.681837 | \n", "
| 31 | \n", "frumious | \n", "1 | \n", "0.00519 | \n", "190.681837 | \n", "
| 74 | \n", "galumphing | \n", "1 | \n", "0.00519 | \n", "190.681837 | \n", "
| 26 | \n", "claws | \n", "1 | \n", "0.00519 | \n", "190.681837 | \n", "
| 69 | \n", "snickersnack | \n", "1 | \n", "0.00519 | \n", "190.681837 | \n", "