{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python: Text Data\n", "\n", "The process of cleaning data for analysis often requires working with text, for example, to correct typos, convert to standard nomenclature and resolve ambiguous labels. In some statistical fields that deal with (say) processing electronic medical records, information science or recommendations based on user feedback, text must be processed before analysis - for example, by converting to a bag of words.\n", "\n", "We will use a whimsical example to illustrate Python tools for *munging* text data using string methods and regular expressions. Finally, we will see how to format text data for reporting. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get \"Through the Looking Glass\" \n", "---" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import requests\n", "\n", "try:\n", " with open('looking_glass.txt') as f:\n", " text = f.read()\n", "except IOError:\n", " url = 'http://www.gutenberg.org/cache/epub/12/pg12.txt'\n", " res = requests.get(url)\n", " text = res.text\n", " with open('looking_glass.txt', 'w') as f:\n", " f.write(str(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slice to get Jabberwocky\n", "----" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "start = text.find('JABBERWOCKY')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"JABBERWOCKY\\r\\n\\r\\n 'Twas brillig, and the slithy toves\\r\\n Did gyre and gimble in the wabe;\\r\\n All mimsy were the borogoves,\\r\\n And the mome raths outgrabe.\\r\\n\\r\\n 'Beware the Jabberwock, my son!\\r\\n The jaws that bite, the claws that catch!\\r\\n Beware the Jubjub bird, and shun\\r\\n The frumious Bandersnatch!'\\r\\n\\r\\n He took his vorpal sword in hand:\\r\\n Long time the manxome 
foe he sought--\\r\\n So rested he by the Tumtum tree,\\r\\n And stood awhile in thought.\\r\\n\\r\\n And as in uffish thought he stood,\\r\\n The Jabberwock, with eyes of flame,\\r\\n Came whiffling through the tulgey wood,\\r\\n And burbled as it came!\\r\\n\\r\\n One, two! One, two! And through and through\\r\\n The vorpal blade went snicker-snack!\\r\\n He left it dead, and with its head\\r\\n He went galumphing back.\\r\\n\\r\\n 'And hast thou slain the Jabberwock?\\r\\n Come to my arms, my beamish boy!\\r\\n O frabjous day! Callooh! Callay!'\\r\\n He chortled in his joy.\\r\\n\\r\\n 'Twas brillig, and the slithy toves\\r\\n Did gyre and gimble in the wabe;\\r\\n All mimsy were the borogoves,\\r\\n And the mome raths outgrabe.\\r\\n\\r\\n\\r\\n'It seems very pretty,' she said when she had finished it, 'but it's\\r\\nRATHER hard to understand!' (You see she didn't like to confess, even\\r\\nto herself, that she couldn't make it out at all.) 'Somehow it seems\\r\\nto fill my head with ideas--only I don't exactly know what they are!\\r\\nHowever, SOMEBODY killed SOMETHING: that's clear, at any rate--'\\r\\n\\r\\n'But oh!' thought Alice, suddenly jumping up, 'if I don't make haste I\\r\\nshall have to go back through the Looking-glass, before I've seen what\\r\\nthe rest of the house is like! Let's have a look at the garden first!'\\r\\nShe was out of the room in a moment, and ran down stairs--or, at least,\\r\\nit wasn't exactly running, but a new invention of hers for getting down\\r\\nstairs quickly and easily, as Alice said to herself. 
She just kept the\\r\\ntips of her fingers on the hand-rail, and floated gently down without\\r\\neven\"" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text[start:start+2000]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "end = text.find('It seems very pretty', start)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"JABBERWOCKY\\r\\n\\r\\n 'Twas brillig, and the slithy toves\\r\\n Did gyre and gimble in the wabe;\\r\\n All mimsy were the borogoves,\\r\\n And the mome raths outgrabe.\\r\\n\\r\\n 'Beware the Jabberwock, my son!\\r\\n The jaws that bite, the claws that catch!\\r\\n Beware the Jubjub bird, and shun\\r\\n The frumious Bandersnatch!'\\r\\n\\r\\n He took his vorpal sword in hand:\\r\\n Long time the manxome foe he sought--\\r\\n So rested he by the Tumtum tree,\\r\\n And stood awhile in thought.\\r\\n\\r\\n And as in uffish thought he stood,\\r\\n The Jabberwock, with eyes of flame,\\r\\n Came whiffling through the tulgey wood,\\r\\n And burbled as it came!\\r\\n\\r\\n One, two! One, two! And through and through\\r\\n The vorpal blade went snicker-snack!\\r\\n He left it dead, and with its head\\r\\n He went galumphing back.\\r\\n\\r\\n 'And hast thou slain the Jabberwock?\\r\\n Come to my arms, my beamish boy!\\r\\n O frabjous day! Callooh! 
Callay!'\\r\\n He chortled in his joy.\\r\\n\\r\\n 'Twas brillig, and the slithy toves\\r\\n Did gyre and gimble in the wabe;\\r\\n All mimsy were the borogoves,\\r\\n And the mome raths outgrabe.\\r\\n\\r\\n\\r\\n'\"" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poem = text[start:end]\n", "poem" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JABBERWOCKY\r\n", "\r\n", " 'Twas brillig, and the slithy toves\r\n", " Did gyre and gimble in the wabe;\r\n", " All mimsy were the borogoves,\r\n", " And the mome raths outgrabe.\r\n", "\r\n", " 'Beware the Jabberwock, my son!\r\n", " The jaws that bite, the claws that catch!\r\n", " Beware the Jubjub bird, and shun\r\n", " The frumious Bandersnatch!'\r\n", "\r\n", " He took his vorpal sword in hand:\r\n", " Long time the manxome foe he sought--\r\n", " So rested he by the Tumtum tree,\r\n", " And stood awhile in thought.\r\n", "\r\n", " And as in uffish thought he stood,\r\n", " The Jabberwock, with eyes of flame,\r\n", " Came whiffling through the tulgey wood,\r\n", " And burbled as it came!\r\n", "\r\n", " One, two! One, two! And through and through\r\n", " The vorpal blade went snicker-snack!\r\n", " He left it dead, and with its head\r\n", " He went galumphing back.\r\n", "\r\n", " 'And hast thou slain the Jabberwock?\r\n", " Come to my arms, my beamish boy!\r\n", " O frabjous day! Callooh! 
Callay!'\r\n", " He chortled in his joy.\r\n", "\r\n", " 'Twas brillig, and the slithy toves\r\n", " Did gyre and gimble in the wabe;\r\n", " All mimsy were the borogoves,\r\n", " And the mome raths outgrabe.\r\n", "\r\n", "\r\n", "'\n" ] } ], "source": [ "print(poem)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Jabberwocky\r\n", "\r\n", " 'Twas Brillig, And The Slithy Toves\r\n", " Did Gyre And Gimble In The Wabe;\r\n", " All Mimsy Were The Borogoves,\r\n", " And The Mome Raths Outgrabe.\r\n", "\r\n", " 'Beware The Jabberwock, My Son!\r\n", " The Jaws That Bite, The Claws That Catch!\r\n", " Beware The Jubjub Bird, And Shun\r\n", " The Frumious Bandersnatch!'\r\n", "\r\n", " He Took His Vorpal Sword In Hand:\r\n", " Long Time The Manxome Foe He Sought--\r\n", " So Rested He By The Tumtum Tree,\r\n", " And Stood Awhile In Thought.\r\n", "\r\n", " And As In Uffish Thought He Stood,\r\n", " The Jabberwock, With Eyes Of Flame,\r\n", " Came Whiffling Through The Tulgey Wood,\r\n", " And Burbled As It Came!\r\n", "\r\n", " One, Two! One, Two! And Through And Through\r\n", " The Vorpal Blade Went Snicker-Snack!\r\n", " He Left It Dead, And With Its Head\r\n", " He Went Galumphing Back.\r\n", "\r\n", " 'And Hast Thou Slain The Jabberwock?\r\n", " Come To My Arms, My Beamish Boy!\r\n", " O Frabjous Day! Callooh! 
Callay!'\r\n", " He Chortled In His Joy.\r\n", "\r\n", " 'Twas Brillig, And The Slithy Toves\r\n", " Did Gyre And Gimble In The Wabe;\r\n", " All Mimsy Were The Borogoves,\r\n", " And The Mome Raths Outgrabe.\r\n", "\r\n", "\r\n", "'\n" ] } ], "source": [ "print(poem.title())" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poem.count('the')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JABBERWOCKY\r\n", "\r\n", " 'Twas brillig, and XXX slithy toves\r\n", " Did gyre and gimble in XXX wabe;\r\n", " All mimsy were XXX borogoves,\r\n", " And XXX mome raths outgrabe.\r\n", "\r\n", " 'Beware XXX Jabberwock, my son!\r\n", " The jaws that bite, XXX claws that catch!\r\n", " Beware XXX Jubjub bird, and shun\r\n", " The frumious Bandersnatch!'\r\n", "\r\n", " He took his vorpal sword in hand:\r\n", " Long time XXX manxome foe he sought--\r\n", " So rested he by XXX Tumtum tree,\r\n", " And stood awhile in thought.\r\n", "\r\n", " And as in uffish thought he stood,\r\n", " The Jabberwock, with eyes of flame,\r\n", " Came whiffling through XXX tulgey wood,\r\n", " And burbled as it came!\r\n", "\r\n", " One, two! One, two! And through and through\r\n", " The vorpal blade went snicker-snack!\r\n", " He left it dead, and with its head\r\n", " He went galumphing back.\r\n", "\r\n", " 'And hast thou slain XXX Jabberwock?\r\n", " Come to my arms, my beamish boy!\r\n", " O frabjous day! Callooh! 
Callay!'\r\n", " He chortled in his joy.\r\n", "\r\n", " 'Twas brillig, and XXX slithy toves\r\n", " Did gyre and gimble in XXX wabe;\r\n", " All mimsy were XXX borogoves,\r\n", " And XXX mome raths outgrabe.\r\n", "\r\n", "\r\n", "'\n" ] } ], "source": [ "print(poem.replace('the', 'XXX'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find palindromic words in poem if any\n", "----" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "poem = poem.lower()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import string\n", "string.punctuation" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'jabberwocky\\r\\n\\r\\n twas brillig and the slithy toves\\r\\n did gyre and gimble in the wabe\\r\\n all mimsy were the borogoves\\r\\n and the mome raths outgrabe\\r\\n\\r\\n beware the jabberwock my son\\r\\n the jaws that bite the claws that catch\\r\\n beware the jubjub bird and shun\\r\\n the frumious bandersnatch\\r\\n\\r\\n he took his vorpal sword in hand\\r\\n long time the manxome foe he sought\\r\\n so rested he by the tumtum tree\\r\\n and stood awhile in thought\\r\\n\\r\\n and as in uffish thought he stood\\r\\n the jabberwock with eyes of flame\\r\\n came whiffling through the tulgey wood\\r\\n and burbled as it came\\r\\n\\r\\n one two one two and through and through\\r\\n the vorpal blade went snickersnack\\r\\n he left it dead and with its head\\r\\n he went galumphing back\\r\\n\\r\\n and hast thou slain the jabberwock\\r\\n come to my arms my beamish boy\\r\\n o frabjous day callooh callay\\r\\n he chortled in his joy\\r\\n\\r\\n twas brillig and the slithy toves\\r\\n did gyre and gimble in the wabe\\r\\n all mimsy were the 
borogoves\\r\\n and the mome raths outgrabe\\r\\n\\r\\n\\r\\n'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poem = poem.translate(dict.fromkeys(map(ord, string.punctuation)))\n", "poem" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['jabberwocky',\n", " 'twas',\n", " 'brillig',\n", " 'and',\n", " 'the',\n", " 'slithy',\n", " 'toves',\n", " 'did',\n", " 'gyre',\n", " 'and']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words = poem.split()\n", "words[:10]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def is_palindrome(word):\n", " return word == word[::-1]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'did', 'o'}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{word for word in words if is_palindrome(word)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top 10 most frequent words\n", "----" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import collections" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "poem_counter = collections.Counter(words)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the', 19),\n", " ('and', 14),\n", " ('he', 7),\n", " ('in', 6),\n", " ('jabberwock', 3),\n", " ('my', 3),\n", " ('through', 3),\n", " ('twas', 2),\n", " ('brillig', 2),\n", " ('slithy', 2)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poem_counter.most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Words that appear exactly twice\n", "----" ] }, { 
"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('twas', 2),\n", " ('brillig', 2),\n", " ('slithy', 2),\n", " ('toves', 2),\n", " ('did', 2),\n", " ('gyre', 2),\n", " ('gimble', 2),\n", " ('wabe', 2),\n", " ('all', 2),\n", " ('mimsy', 2),\n", " ('were', 2),\n", " ('borogoves', 2),\n", " ('mome', 2),\n", " ('raths', 2),\n", " ('outgrabe', 2),\n", " ('beware', 2),\n", " ('that', 2),\n", " ('his', 2),\n", " ('vorpal', 2),\n", " ('stood', 2),\n", " ('thought', 2),\n", " ('as', 2),\n", " ('with', 2),\n", " ('came', 2),\n", " ('it', 2),\n", " ('one', 2),\n", " ('two', 2),\n", " ('went', 2)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[(k, v) for (k, v) in poem_counter.items() if v==2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Trigrams\n", "----\n", "\n", "All sequences of 3 consecutive words in the poem." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('jabberwocky', 'twas', 'brillig'),\n", " ('twas', 'brillig', 'and'),\n", " ('brillig', 'and', 'the'),\n", " ('and', 'the', 'slithy'),\n", " ('the', 'slithy', 'toves'),\n", " ('slithy', 'toves', 'did'),\n", " ('toves', 'did', 'gyre'),\n", " ('did', 'gyre', 'and'),\n", " ('gyre', 'and', 'gimble'),\n", " ('and', 'gimble', 'in')]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(zip(words[:-2], words[1:-1], words[2:]))[:10]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import itertools as it" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def window(x, n):\n", " \"\"\"Sliding window of size n over a sequence x (x must be re-iterable, e.g. a list).\"\"\"\n", " s = (it.islice(x, i, None) for i in range(n))\n", " return zip(*s)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, 
"outputs": [ { "data": { "text/plain": [ "[('jabberwocky', 'twas', 'brillig'),\n", " ('twas', 'brillig', 'and'),\n", " ('brillig', 'and', 'the'),\n", " ('and', 'the', 'slithy'),\n", " ('the', 'slithy', 'toves'),\n", " ('slithy', 'toves', 'did'),\n", " ('toves', 'did', 'gyre'),\n", " ('did', 'gyre', 'and'),\n", " ('gyre', 'and', 'gimble'),\n", " ('and', 'gimble', 'in')]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(window(words, 3))[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find words in poem that are over-represented\n", "----" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "book = text" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "book = book.lower().translate(dict.fromkeys(map(ord, string.punctuation)))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "book_counter = collections.Counter(book.split())" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "n = sum(book_counter.values())\n", "book_freqs = {k: v/n for k, v in book_counter.items()}" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "n = sum(poem_counter.values())\n", "stats = [(k, v, book_freqs.get(k,0)*n) for k, v in poem_counter.items()]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pandas import DataFrame" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = DataFrame(stats, columns = ['word', 'observed', 'expected'])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['score'] = 
(df.observed-df.expected)**2/df.expected" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
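<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>word</th>\n", " <th>observed</th>\n", " <th>expected</th>\n", " <th>score</th>\n", " </tr>\n", " </thead>\n",
" <tbody>\n",
" <tr>\n", " <th>20</th>\n", " <td>jabberwock</td>\n", " <td>3</td>\n", " <td>0.01557</td>\n", " <td>572.045510</td>\n", " </tr>\n",
" <tr>\n", " <th>19</th>\n", " <td>beware</td>\n", " <td>2</td>\n", " <td>0.01038</td>\n", " <td>381.363673</td>\n", " </tr>\n",
" <tr>\n", " <th>36</th>\n", " <td>vorpal</td>\n", " <td>2</td>\n", " <td>0.01038</td>\n", " <td>381.363673</td>\n", " </tr>\n",
" <tr>\n", " <th>1</th>\n", " <td>twas</td>\n", " <td>2</td>\n", " <td>0.01557</td>\n", " <td>252.917766</td>\n", " </tr>\n",
" <tr>\n", " <th>15</th>\n", " <td>borogoves</td>\n", " <td>2</td>\n", " <td>0.01557</td>\n", " <td>252.917766</td>\n", " </tr>\n",
" <tr>\n", " <th>90</th>\n", " <td>joy</td>\n", " <td>1</td>\n", " <td>0.00519</td>\n", " <td>190.681837</td>\n", " </tr>\n",
" <tr>\n", " <th>31</th>\n", " <td>frumious</td>\n", " <td>1</td>\n", " <td>0.00519</td>\n", " <td>190.681837</td>\n", " </tr>\n",
" <tr>\n", " <th>74</th>\n", " <td>galumphing</td>\n", " <td>1</td>\n", " <td>0.00519</td>\n", " <td>190.681837</td>\n", " </tr>\n",
" <tr>\n", " <th>26</th>\n", " <td>claws</td>\n", " <td>1</td>\n", " <td>0.00519</td>\n", " <td>190.681837</td>\n", " </tr>\n",
" <tr>\n", " <th>69</th>\n", " <td>snickersnack</td>\n", " <td>1</td>\n", " <td>0.00519</td>\n", " <td>190.681837</td>\n", " </tr>\n",
" </tbody>\n", "</table>\n", "</div>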
" ], "text/plain": [ " word observed expected score\n", "20 jabberwock 3 0.01557 572.045510\n", "19 beware 2 0.01038 381.363673\n", "36 vorpal 2 0.01038 381.363673\n", "1 twas 2 0.01557 252.917766\n", "15 borogoves 2 0.01557 252.917766\n", "90 joy 1 0.00519 190.681837\n", "31 frumious 1 0.00519 190.681837\n", "74 galumphing 1 0.00519 190.681837\n", "26 claws 1 0.00519 190.681837\n", "69 snickersnack 1 0.00519 190.681837" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.sort_values(['score'], ascending=False)\n", "df.head(n=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encode and decode poem using a Caesar cipher\n", "----" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "jabberwocky\r\n", "\r\n", " twas brillig and the slithy toves\r\n", " did gyre and gimble in the wabe\r\n", " all mimsy were the borogoves\r\n", " and the mome raths outgrabe\r\n", "\r\n", " beware the jabberwock my son\r\n", " the jaws that bite the claws that catch\r\n", " beware the jubjub bird and shun\r\n", " the frumious bandersnatch\r\n", "\r\n", " he took his vorpal sword in hand\r\n", " long time the manxome foe he sought\r\n", " so rested he by the tumtum tree\r\n", " and stood awhile in thought\r\n", "\r\n", " and as in uffish thought he stood\r\n", " the jabberwock with eyes of flame\r\n", " came whiffling through the tulgey wood\r\n", " and burbled as it came\r\n", "\r\n", " one two one two and through and through\r\n", " the vorpal blade went snickersnack\r\n", " he left it dead and with its head\r\n", " he went galumphing back\r\n", "\r\n", " and hast thou slain the jabberwock\r\n", " come to my arms my beamish boy\r\n", " o frabjous day callooh callay\r\n", " he chortled in his joy\r\n", "\r\n", " twas brillig and the slithy toves\r\n", " did gyre and gimble in the wabe\r\n", " all mimsy were the borogoves\r\n", " and the 
mome raths outgrabe\r\n", "\r\n", "\r\n", "\n" ] } ], "source": [ "print(poem)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encoding" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def encode(text, k):\n", " table = dict(zip(map(ord, string.ascii_lowercase), \n", " string.ascii_lowercase[k:] + string.ascii_lowercase[:k]))\n", " return text.translate(table)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lcddgtyqema\r\n", "\r\n", " vycu dtknnki cpf vjg unkvja vqxgu\r\n", " fkf iatg cpf ikodng kp vjg ycdg\r\n", " cnn okoua ygtg vjg dqtqiqxgu\r\n", " cpf vjg oqog tcvju qwvitcdg\r\n", "\r\n", " dgyctg vjg lcddgtyqem oa uqp\r\n", " vjg lcyu vjcv dkvg vjg encyu vjcv ecvej\r\n", " dgyctg vjg lwdlwd dktf cpf ujwp\r\n", " vjg htwokqwu dcpfgtupcvej\r\n", "\r\n", " jg vqqm jku xqtrcn uyqtf kp jcpf\r\n", " nqpi vkog vjg ocpzqog hqg jg uqwijv\r\n", " uq tguvgf jg da vjg vwovwo vtgg\r\n", " cpf uvqqf cyjkng kp vjqwijv\r\n", "\r\n", " cpf cu kp whhkuj vjqwijv jg uvqqf\r\n", " vjg lcddgtyqem ykvj gagu qh hncog\r\n", " ecog yjkhhnkpi vjtqwij vjg vwniga yqqf\r\n", " cpf dwtdngf cu kv ecog\r\n", "\r\n", " qpg vyq qpg vyq cpf vjtqwij cpf vjtqwij\r\n", " vjg xqtrcn dncfg ygpv upkemgtupcem\r\n", " jg nghv kv fgcf cpf ykvj kvu jgcf\r\n", " jg ygpv icnworjkpi dcem\r\n", "\r\n", " cpf jcuv vjqw unckp vjg lcddgtyqem\r\n", " eqog vq oa ctou oa dgcokuj dqa\r\n", " q htcdlqwu fca ecnnqqj ecnnca\r\n", " jg ejqtvngf kp jku lqa\r\n", "\r\n", " vycu dtknnki cpf vjg unkvja vqxgu\r\n", " fkf iatg cpf ikodng kp vjg ycdg\r\n", " cnn okoua ygtg vjg dqtqiqxgu\r\n", " cpf vjg oqog tcvju qwvitcdg\r\n", "\r\n", "\r\n", "\n" ] } ], "source": [ "cipher = encode(poem, 2)\n", "print(cipher)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decoding" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "jabberwocky\r\n", "\r\n", " twas brillig and the slithy toves\r\n", " did gyre and gimble in the wabe\r\n", " all mimsy were the borogoves\r\n", " and the mome raths outgrabe\r\n", "\r\n", " beware the jabberwock my son\r\n", " the jaws that bite the claws that catch\r\n", " beware the jubjub bird and shun\r\n", " the frumious bandersnatch\r\n", "\r\n", " he took his vorpal sword in hand\r\n", " long time the manxome foe he sought\r\n", " so rested he by the tumtum tree\r\n", " and stood awhile in thought\r\n", "\r\n", " and as in uffish thought he stood\r\n", " the jabberwock with eyes of flame\r\n", " came whiffling through the tulgey wood\r\n", " and burbled as it came\r\n", "\r\n", " one two one two and through and through\r\n", " the vorpal blade went snickersnack\r\n", " he left it dead and with its head\r\n", " he went galumphing back\r\n", "\r\n", " and hast thou slain the jabberwock\r\n", " come to my arms my beamish boy\r\n", " o frabjous day callooh callay\r\n", " he chortled in his joy\r\n", "\r\n", " twas brillig and the slithy toves\r\n", " did gyre and gimble in the wabe\r\n", " all mimsy were the borogoves\r\n", " and the mome raths outgrabe\r\n", "\r\n", "\r\n", "\n" ] } ], "source": [ "recovered = encode(cipher, -2)\n", "print(recovered)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using Regular Expressions\n", "----\n", "\n", "- [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html)\n", "- [Test your Python regular expressions](http://pythex.org/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find all words with a sequence of two or more identical letters e.g. 
\"look\"" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "regex = re.compile(r'(\\w*(\\w)\\2+\\w*)', re.IGNORECASE | re.MULTILINE)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b jabberwocky\n", "l brillig\n", "l all\n", "b jabberwock\n", "o took\n", "e tree\n", "o stood\n", "f uffish\n", "o stood\n", "b jabberwock\n", "f whiffling\n", "o wood\n", "b jabberwock\n", "o callooh\n", "l callay\n", "l brillig\n", "l all\n" ] } ], "source": [ "for match in regex.finditer(poem):\n", " print(match.group(2), match.group(1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert the identical sequences to uppercase" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "jaBBerwocky\r\n", "\r\n", " twas briLLig and the slithy toves\r\n", " did gyre and gimble in the wabe\r\n", " aLL mimsy were the borogoves\r\n", " and the mome raths outgrabe\r\n", "\r\n", " beware the jaBBerwock my son\r\n", " the jaws that bite the claws that catch\r\n", " beware the jubjub bird and shun\r\n", " the frumious bandersnatch\r\n", "\r\n", " he tOOk his vorpal sword in hand\r\n", " long time the manxome foe he sought\r\n", " so rested he by the tumtum trEE\r\n", " and stOOd awhile in thought\r\n", "\r\n", " and as in uFFish thought he stOOd\r\n", " the jaBBerwock with eyes of flame\r\n", " came whiFFling through the tulgey wOOd\r\n", " and burbled as it came\r\n", "\r\n", " one two one two and through and through\r\n", " the vorpal blade went snickersnack\r\n", " he left it dead and with its head\r\n", " he went galumphing back\r\n", "\r\n", " and hast thou slain the jaBBerwock\r\n", " come to my arms my beamish 
boy\r\n", " o frabjous day callOOh caLLay\r\n", " he chortled in his joy\r\n", "\r\n", " twas briLLig and the slithy toves\r\n", " did gyre and gimble in the wabe\r\n", " aLL mimsy were the borogoves\r\n", " and the mome raths outgrabe\r\n", "\r\n", "\r\n", "\n" ] } ], "source": [ "def f(match):\n", " word, letter = match.groups()\n", " return word.replace(letter, letter.upper())\n", "\n", "print(regex.sub(f, poem))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Natural language processing\n", "----\n", "\n", "If you intend to perform statistical analysis on natural language, a dedicated library such as NLTK is usually a better choice for pre-processing the text than string methods and regular expressions. For example, a simple challenge is to first parse the paragraph below into sentences, then parse each sentence into words.\n", "\n", "The paragraph is taken from a randomly chosen PubMed abstract." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": true }, "outputs": [], "source": [ "para = \"\"\"When compared with the control group no significant associations were found for the NS-PEecl group after adjustment of confounding variables. For the S-PEecl group, antiβ2GP1 IgG (OR 16.91, 95% CI 3.71-77.06) was associated, as well as age, obesity, smoking and multiparity. Antiβ2GP1-domain I IgG were associated with aCL, antiβ2GP1 and aPS/PT IgG in the three groups. aPS/PT IgG were associated with aCL IgG, and aPS/PT IgM were associated with aCL and antiβ2GP1 IgM in the three groups CONCLUSION: S-PEecl is a distinct entity from NS-PEecl and is mainly associated with the presence of antiβ2GP1 IgG. Antiβ2GP1 domain I correlate with other aPL IgG tests, and aPS/PT may be promising in patients in which LA tests cannot be interpreted.\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Naive sentence splitting: treat every ., ! or ? as a sentence boundary" 
] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sep = re.compile(r'[\\?\\!\\.]')" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ss = sep.split(para)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 : When compared with the control group no significant associations were found for the NS-PEecl group after adjustment of confounding variables\n", "\n", "2 : For the S-PEecl group, antiβ2GP1 IgG (OR 16\n", "\n", "3 : 91, 95% CI 3\n", "\n", "4 : 71-77\n", "\n", "5 : 06) was associated, as well as age, obesity, smoking and multiparity\n", "\n", "6 : Antiβ2GP1-domain I IgG were associated with aCL, antiβ2GP1 and aPS/PT IgG in the three groups\n", "\n", "7 : aPS/PT IgG were associated with aCL IgG, and aPS/PT IgM were associated with aCL and antiβ2GP1 IgM in the three groups CONCLUSION: S-PEecl is a distinct entity from NS-PEecl and is mainly associated with the presence of antiβ2GP1 IgG\n", "\n", "8 : Antiβ2GP1 domain I correlate with other aPL IgG tests, and aPS/PT may be promising in patients in which LA tests cannot be interpreted\n", "\n", "9 : \n", "\n" ] } ], "source": [ "for i, s in enumerate(ss, 1):\n", " print(i, ':', s, end='\\n\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using NLTK" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import nltk" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ss_nltk = nltk.sent_tokenize(para)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 : When compared with the control group no significant associations were found for the NS-PEecl group after adjustment of 
confounding variables.\n", "\n", "2 : For the S-PEecl group, antiβ2GP1 IgG (OR 16.91, 95% CI 3.71-77.06) was associated, as well as age, obesity, smoking and multiparity.\n", "\n", "3 : Antiβ2GP1-domain I IgG were associated with aCL, antiβ2GP1 and aPS/PT IgG in the three groups.\n", "\n", "4 : aPS/PT IgG were associated with aCL IgG, and aPS/PT IgM were associated with aCL and antiβ2GP1 IgM in the three groups CONCLUSION: S-PEecl is a distinct entity from NS-PEecl and is mainly associated with the presence of antiβ2GP1 IgG.\n", "\n", "5 : Antiβ2GP1 domain I correlate with other aPL IgG tests, and aPS/PT may be promising in patients in which LA tests cannot be interpreted.\n", "\n" ] } ], "source": [ "for i, s in enumerate(ss_nltk, 1):\n", " print(i, ':', s, end='\\n\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Naive parsing of the second sentence into words" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'For the S-PEecl group, antiβ2GP1 IgG (OR 16.91, 95% CI 3.71-77.06) was associated, as well as age, obesity, smoking and multiparity.'" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = ss_nltk[1]\n", "s" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['For',\n", " 'the',\n", " 'SPEecl',\n", " 'group',\n", " 'antiβ2GP1',\n", " 'IgG',\n", " 'OR',\n", " '1691',\n", " '95',\n", " 'CI',\n", " '3717706',\n", " 'was',\n", " 'associated',\n", " 'as',\n", " 'well',\n", " 'as',\n", " 'age',\n", " 'obesity',\n", " 'smoking',\n", " 'and',\n", " 'multiparity']" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove punctuation, then split on whitespace\n", "table = dict.fromkeys(map(ord, string.punctuation))\n", "s.translate(table).split()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using NLTK" ] }, { "cell_type": 
"code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['For',\n", " 'the',\n", " 'S-PEecl',\n", " 'group',\n", " ',',\n", " 'antiβ2GP1',\n", " 'IgG',\n", " '(',\n", " 'OR',\n", " '16.91',\n", " ',',\n", " '95',\n", " '%',\n", " 'CI',\n", " '3.71-77.06',\n", " ')',\n", " 'was',\n", " 'associated',\n", " ',',\n", " 'as',\n", " 'well',\n", " 'as',\n", " 'age',\n", " ',',\n", " 'obesity',\n", " ',',\n", " 'smoking',\n", " 'and',\n", " 'multiparity',\n", " '.']" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = nltk.word_tokenize(s)\n", "text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NLTK is a rich library for natural language processing\n", "\n", "See http://www.nltk.org for details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Tag tokens with part-of-speech labels" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('For', 'IN'),\n", " ('the', 'DT'),\n", " ('S-PEecl', 'NNP'),\n", " ('group', 'NN'),\n", " (',', ','),\n", " ('antiβ2GP1', 'RB'),\n", " ('IgG', 'NNP'),\n", " ('(', '('),\n", " ('OR', 'NNP'),\n", " ('16.91', 'CD'),\n", " (',', ','),\n", " ('95', 'CD'),\n", " ('%', 'NN'),\n", " ('CI', 'NNP'),\n", " ('3.71-77.06', 'CD'),\n", " (')', ')'),\n", " ('was', 'VBD'),\n", " ('associated', 'VBN'),\n", " (',', ','),\n", " ('as', 'RB'),\n", " ('well', 'RB'),\n", " ('as', 'IN'),\n", " ('age', 'NN'),\n", " (',', ','),\n", " ('obesity', 'NN'),\n", " (',', ','),\n", " ('smoking', 'NN'),\n", " ('and', 'CC'),\n", " ('multiparity', 'NN'),\n", " ('.', '.')]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tagged_text = nltk.pos_tag(text)\n", "tagged_text" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'For the S-PEecl group, antiβ2GP1 IgG (OR 16.91, 95% CI 3.71-77.06) was associated, 
as well as age, obesity, smoking and multiparity.'" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A simplistic way to pick up nouns" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['S-PEecl',\n", " 'group',\n", " 'IgG',\n", " 'OR',\n", " '%',\n", " 'CI',\n", " 'age',\n", " 'obesity',\n", " 'smoking',\n", " 'multiparity']" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[w for w, t in tagged_text if t.startswith('N')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "String formatting\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selection" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import math" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": true }, "outputs": [], "source": [ "stuff = ('bun', 'shoe', ['bee', 'door'], 2, math.pi, 0.05)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'One: bun, Two shoe'" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'One: {}, Two {}'.format(*stuff)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'One: bun, Two shoe'" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'One: {0}, Two {1}'.format(*stuff)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'One: shoe, Two shoe'" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'One: {1}, Two {1}'.format(*stuff)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "'One: bun, Two door'" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'One: {0}, Two {2[1]}'.format(*stuff)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Formatting" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'One: bun , Two ___________shoe'" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'One: {0:^10s}, Two {1:_>15s}'.format(*stuff)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'One: 2, Two 3.141592653589793'" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'One: {3}, Two {4}'.format(*stuff)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'One: +2, Two 3.1416'" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'One: {3:+10d}, Two {4:.4f}'.format(*stuff)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'One: 0002, Two 3.142'" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'One: {3:04d}, Two {4:.4g}'.format(*stuff)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'One: 2.0000e+00, Two 3.1416e+00'" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'One: {3:.4e}, Two {4:.4e}'.format(*stuff)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'One: 5.00%, Two 0.050000'" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'One: {5:.2%}, Two {5:f}'.format(*stuff)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Old style formatting" ] }, { 
"cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"bun, shoe, ['bee', 'door'], 2, 3.1416, 0.05\"" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'%s, %s, %a, %d, %.4f, %.2f' % stuff" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Formatting numpy arrays" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1, 2, 3, 4],\n", " [ 5, 6, 7, 8],\n", " [ 9, 10, 11, 12]])" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = np.arange(1, 13).reshape(3,4)\n", "x" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": true }, "outputs": [], "source": [ "np.set_printoptions(formatter={'int': lambda x: '%8.2f' % x})" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1.00, 2.00, 3.00, 4.00],\n", " [ 5.00, 6.00, 7.00, 8.00],\n", " [ 9.00, 10.00, 11.00, 12.00]])" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": true }, "outputs": [], "source": [ "np.set_printoptions()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1, 2, 3, 4],\n", " [ 5, 6, 7, 8],\n", " [ 9, 10, 11, 12]])" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": 
"python", "pygments_lexer": "ipython3", "version": "3.6.1" }, "latex_envs": { "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0 } }, "nbformat": 4, "nbformat_minor": 1 }