{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Analysis\n", "\n", "In this lecture, we cover exploratory data analysis and supervised learning for free text. In the next lecture, we will look at unsupervised learning and topic models.\n", "\n", "Along the way, we will use the packages\n", "\n", "- [`sklearn`](http://scikit-learn.org/stable/)\n", "- [`wordcloud`](https://github.com/amueller/word_cloud)\n", "- [`nltk`](https://www.nltk.org)\n", "- [`gensim`](https://radimrehurek.com/gensim/)\n", "- [`spaCy`](https://spacy.io)\n", "\n", "Other packages useful for text analysis include\n", "\n", "- [`fasttext`](https://fasttext.cc/)\n", "\n", "and many, many others." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory data analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Corpus\n", "\n", "A corpus is a collection of text documents. A corpus can be built from many sources, including documents, scraped web pages, Twitter streams and speech translation. The first step in any text analysis application is nearly always to create an application-specific corpus. This matters because language patterns often differ sharply between domains (e.g. contrast medical records with legal documents or Twitter streams). 
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set_context('notebook', font_scale=1.5)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "from nltk.stem import SnowballStemmer, WordNetLemmatizer\n", "from nltk.collocations import QuadgramCollocationFinder, TrigramCollocationFinder\n", "from nltk.metrics.association import QuadgramAssocMeasures, TrigramAssocMeasures\n", "import string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Toy corpus\n", "\n", "We see how a small corpus with two documents is broken down into smaller pieces \n", "\n", "document $\\to$ paragraph $\\to$ sentences $\\to$ tokens\n", "\n", "Although this explicit decomposition may not be necessary in all applications, it is still useful to be aware of these units:\n", "\n", "- A paragraph contains an *idea*\n", "- A sentence is a unit of syntax\n", "- A token (word or punctuation) is the smallest meaningful unit" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "docs = [\n", " '''Spicy jalapeno bacon ipsum dolor amet aute prosciutto velit corned beef consectetur. Aute kielbasa adipisicing, nostrud drumstick ipsum tail pig capicola burgdoggen corned beef. Dolor proident salami deserunt. Venison capicola pork belly bacon aliquip swine incididunt sint quis cupidatat pork chop et turducken nulla beef. Ground round kielbasa tri-tip consectetur, t-bone pariatur deserunt id ut adipisicing.\n", "\n", "Strip steak meatball chuck aute, pork loin turkey pork commodo et officia. Rump enim spare ribs, prosciutto chuck deserunt tail. Aute pork lorem sausage. Nostrud dolore kevin proident pork chop do in. 
Exercitation shoulder dolore kevin ut, sausage ullamco frankfurter ham hock. Ground round fatback ribeye turkey tri-tip capicola.''',\n", " '''Burgdoggen id ham hock ut kielbasa. Eu pork chop anim picanha sed porchetta dolor consequat drumstick shankle proident pork andouille. Et cupim burgdoggen, officia lorem shank ut sed drumstick shankle salami ad ball tip dolore pig. Shankle turkey officia, reprehenderit bacon ipsum ullamco enim tail tongue. Brisket short ribs biltong jerky flank, venison filet mignon tenderloin culpa bacon meatball short loin commodo. Leberkas jowl prosciutto, et kielbasa pancetta chicken. Nisi minim sausage porchetta jowl.\n", "\n", "Beef ribs pariatur pork chop dolore ex, consequat turducken frankfurter esse filet mignon lorem bacon. Elit dolore porchetta meatball ea, pork loin pork anim non sirloin. Aliquip tenderloin reprehenderit pariatur, leberkas alcatra short loin. Fugiat elit meatloaf, nulla cow in sausage. Doner consequat shankle salami est, boudin deserunt. Drumstick ham lorem reprehenderit.\n", "\n", "Beef adipisicing nisi rump filet mignon cillum leberkas boudin tail picanha pork loin. Culpa picanha ground round in laborum spare ribs. Burgdoggen leberkas landjaeger adipisicing strip steak velit doner eu ground round meatloaf consectetur deserunt anim ball tip cow. Porchetta ad minim eiusmod labore eu nisi boudin laboris officia jowl deserunt strip steak. Shank aliquip beef ribs tri-tip ipsum flank. Turducken elit meatloaf aliqua corned beef sirloin irure. Tongue cupim ullamco in sint prosciutto.'''\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Documents" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Spicy jalapeno bacon ipsum dolor amet aute prosciutto velit corned beef consectetur. Aute kielbasa adipisicing, nostrud drumstick ipsum tail pig capicola burgdoggen corned beef. Dolor proident salami deserunt. 
Venison capicola pork belly bacon aliquip swine incididunt sint quis cupidatat pork chop et turducken nulla beef. Ground round kielbasa tri-tip consectetur, t-bone pariatur deserunt id ut adipisicing.\\n\\nStrip steak meatball chuck aute, pork loin turkey pork commodo et officia. Rump enim spare ribs, prosciutto chuck deserunt tail. Aute pork lorem sausage. Nostrud dolore kevin proident pork chop do in. Exercitation shoulder dolore kevin ut, sausage ullamco frankfurter ham hock. Ground round fatback ribeye turkey tri-tip capicola.',\n", " 'Burgdoggen id ham hock ut kielbasa. Eu pork chop anim picanha sed porchetta dolor consequat drumstick shankle proident pork andouille. Et cupim burgdoggen, officia lorem shank ut sed drumstick shankle salami ad ball tip dolore pig. Shankle turkey officia, reprehenderit bacon ipsum ullamco enim tail tongue. Brisket short ribs biltong jerky flank, venison filet mignon tenderloin culpa bacon meatball short loin commodo. Leberkas jowl prosciutto, et kielbasa pancetta chicken. Nisi minim sausage porchetta jowl.\\n\\nBeef ribs pariatur pork chop dolore ex, consequat turducken frankfurter esse filet mignon lorem bacon. Elit dolore porchetta meatball ea, pork loin pork anim non sirloin. Aliquip tenderloin reprehenderit pariatur, leberkas alcatra short loin. Fugiat elit meatloaf, nulla cow in sausage. Doner consequat shankle salami est, boudin deserunt. Drumstick ham lorem reprehenderit.\\n\\nBeef adipisicing nisi rump filet mignon cillum leberkas boudin tail picanha pork loin. Culpa picanha ground round in laborum spare ribs. Burgdoggen leberkas landjaeger adipisicing strip steak velit doner eu ground round meatloaf consectetur deserunt anim ball tip cow. Porchetta ad minim eiusmod labore eu nisi boudin laboris officia jowl deserunt strip steak. Shank aliquip beef ribs tri-tip ipsum flank. Turducken elit meatloaf aliqua corned beef sirloin irure. 
Tongue cupim ullamco in sint prosciutto.']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from itertools import chain" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def flatten(listOfLists):\n", "    return list(chain.from_iterable(listOfLists))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Paragraphs" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "paras = flatten([doc.split('\\n\\n') for doc in docs])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Spicy jalapeno bacon ipsum dolor amet aute prosciutto velit corned beef consectetur. Aute kielbasa adipisicing, nostrud drumstick ipsum tail pig capicola burgdoggen corned beef. Dolor proident salami deserunt. Venison capicola pork belly bacon aliquip swine incididunt sint quis cupidatat pork chop et turducken nulla beef. Ground round kielbasa tri-tip consectetur, t-bone pariatur deserunt id ut adipisicing.',\n", " 'Strip steak meatball chuck aute, pork loin turkey pork commodo et officia. Rump enim spare ribs, prosciutto chuck deserunt tail. Aute pork lorem sausage. Nostrud dolore kevin proident pork chop do in. Exercitation shoulder dolore kevin ut, sausage ullamco frankfurter ham hock. Ground round fatback ribeye turkey tri-tip capicola.',\n", " 'Burgdoggen id ham hock ut kielbasa. Eu pork chop anim picanha sed porchetta dolor consequat drumstick shankle proident pork andouille. Et cupim burgdoggen, officia lorem shank ut sed drumstick shankle salami ad ball tip dolore pig. Shankle turkey officia, reprehenderit bacon ipsum ullamco enim tail tongue. Brisket short ribs biltong jerky flank, venison filet mignon tenderloin culpa bacon meatball short loin commodo. 
Leberkas jowl prosciutto, et kielbasa pancetta chicken. Nisi minim sausage porchetta jowl.']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "paras[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Sentences" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "sentences = flatten([nltk.tokenize.sent_tokenize(para) for para in paras])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Spicy jalapeno bacon ipsum dolor amet aute prosciutto velit corned beef consectetur.',\n", " 'Aute kielbasa adipisicing, nostrud drumstick ipsum tail pig capicola burgdoggen corned beef.',\n", " 'Dolor proident salami deserunt.',\n", " 'Venison capicola pork belly bacon aliquip swine incididunt sint quis cupidatat pork chop et turducken nulla beef.',\n", " 'Ground round kielbasa tri-tip consectetur, t-bone pariatur deserunt id ut adipisicing.',\n", " 'Strip steak meatball chuck aute, pork loin turkey pork commodo et officia.',\n", " 'Rump enim spare ribs, prosciutto chuck deserunt tail.',\n", " 'Aute pork lorem sausage.',\n", " 'Nostrud dolore kevin proident pork chop do in.',\n", " 'Exercitation shoulder dolore kevin ut, sausage ullamco frankfurter ham hock.']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentences[:10]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "tokens = flatten([nltk.tokenize.word_tokenize(sentence) for sentence in sentences])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Spicy',\n", " 'jalapeno',\n", " 'bacon',\n", " 'ipsum',\n", " 'dolor',\n", " 'amet',\n", " 'aute',\n", " 'prosciutto',\n", " 'velit',\n", " 'corned']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens[:10]" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "### Exploratory analysis of the `newsgroup` corpus" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, we will use an existing corpus - the 20 newsgroups dataset that comprises around 18000 newsgroups posts on 20 topics. The 20 topics are\n", "\n", "```\n", "['alt.atheism',\n", " 'comp.graphics',\n", " 'comp.os.ms-windows.misc',\n", " 'comp.sys.ibm.pc.hardware',\n", " 'comp.sys.mac.hardware',\n", " 'comp.windows.x',\n", " 'misc.forsale',\n", " 'rec.autos',\n", " 'rec.motorcycles',\n", " 'rec.sport.baseball',\n", " 'rec.sport.hockey',\n", " 'sci.crypt',\n", " 'sci.electronics',\n", " 'sci.med',\n", " 'sci.space',\n", " 'soc.religion.christian',\n", " 'talk.politics.guns',\n", " 'talk.politics.mideast',\n", " 'talk.politics.misc',\n", " 'talk.religion.misc']\n", "```" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "newsgroups_train = fetch_20newsgroups(\n", " subset='train',\n", " categories=('rec.sport.baseball', \n", " 'rec.sport.hockey',\n", " 'sci.med',\n", " 'sci.space'),\n", " \n", " remove=('headers', 'footers', 'quotes'))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newsgroups_train.keys()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'the 20 newsgroups by date dataset'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newsgroups_train.description" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "(2384,)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newsgroups_train.filenames.shape" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2384,)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newsgroups_train.target.shape" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['rec.sport.baseball', 'rec.sport.hockey', 'sci.med', 'sci.space']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newsgroups_train.target_names" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\nA freeze dried Tootsie Roll (tm). The actual taste sensation was like nothing\\nyou will ever willingly experience. The amazing thing was that we ate a second\\none, and a third and ....\\n\\nI doubt that they actually flew on missions, as I\\'m certain they did \"bad\\nthings\" to the gastrointestinal tract. 
Compared to Space Food Sticks, Tang was\\na gastronomic contribution to mankind.\\n--\\nDillon Pyron | The opinions expressed are those of the\\nTI/DSEG Lewisville VAX Support | sender unless otherwise stated.\\n(214)462-3556 (when I\\'m here) |\\n(214)492-4656 (when I\\'m home) |God gave us weather so we wouldn\\'t complain\\npyron@skndiv.dseg.ti.com |about other things.\\nPADI DM-54909 |'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newsgroups_train.data[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting word counts" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import (\n", " HashingVectorizer,\n", " TfidfVectorizer, \n", " CountVectorizer, \n", ")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "idx = np.nonzero(\n", " newsgroups_train.target == \n", " newsgroups_train.target_names.index('rec.sport.baseball')\n", ")[0]\n", "baseball_sample = [newsgroups_train.data[i] for i in idx]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "X = vectorizer.fit_transform(baseball_sample)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "vocab = vectorizer.get_feature_names()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "rownames = [':'.join(filename.split('/')[-2:]) \n", " for filename in newsgroups_train.filenames[idx]]\n", "df = pd.SparseDataFrame(X, columns=vocab, index=rownames)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "freqs = df.sum(axis=0).astype('int')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "the 3508\n", "to 1481\n", "and 1312\n", "of 1142\n", "in 1114\n", "that 882\n", "is 842\n", "he 738\n", "for 580\n", "it 543\n", "dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "freqs.nlargest(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Distribution of word counts" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.6/site-packages/scipy/stats/stats.py:1706: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.\n", " return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZYAAAEJCAYAAAC3yAEAAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAE7pJREFUeJzt3X2wXHV9x/H3l0BuQgOEisUbbpVpkQIjM4kCGR5iU0AdqVVUQAcFscaHWitBB5WCUkdkCD4lOlgVmKYRHxBaQcS0M4hMk1BAwKBg8YERQ0xigyVgICE8/PrHOSsnJ7v3nt38wu6G92vmzN77O9/97uE3y/3k7HnYSCkhSVIuu/R7AyRJOxeDRZKUlcEiScrKYJEkZWWwSJKyMlgkSVkZLJKkrAwWSVJWBoskKSuDRZKUlcEiScpq135vwLMlIkaAw4G1wFN93hxJGhaTgFHghymlx5s84TkTLBShsqzfGyFJQ2oOsLxJ4XMpWNYCLFu2jLGxsX5viyQNhdWrVzNnzhwo/4Y28VwKlqcAxsbG2H///fu8KZI0dBofQvDgvSQpK4NFkpSVwSJJyspgkSRlZbBIkrIyWCRJWRkskqSsnkvXsWyXr9+6qu34qbNf+CxviSQNNvdYJElZGSySpKwMFklSVgaLJCkrg0WSlJXBIknKymCRJGVlsEiSsjJYJElZGSySpKwMFklSVgaLJCkrg0WSlJXBIknKymCRJGVlsEiSsjJYJElZGSySpKwMFklSVgaLJCkrg0WSlJXBIknKymCRJGVlsEiSsjJYJElZGSySpKwMFklSVgaLJCkrg0WSlJXBIknKymCRJGXVOFgi4vCIuD4iHoqIjRFxV0ScUat5bUTcGRGbI2JVRJwfEbu26TU9Ir4SEesj4tGIuDEiZnZ43UY9JUmDodEf6Ih4NXAtcBPwUeAJ4EDgT2s11wA3Av8AHAp8DNin/L1Vtwtwfbn+08DvgPcCN0XEy1JK93XbU5I0OCYMlojYC1gM/HNK6cxxSj8N/Ah4VUrpq
fK5jwDnRMTnU0q/KOtOAo4CXp9Suqas+xbwc+B84PQeekqSBkSTj8JOBaZT7CkQEXtERFQLIuIQ4BDgy60AKH2xfI03VsZOAtZQ7AEBkFJaD3wLODEiduuhpyRpQDQJluOBe4ETIuIB4BHg/yLiooiYVNbMKh9vrz4xpbQGWF1Z36q9I6WUaq9zG7AHcEAPPbdSHsPZv7oAYxP9h0qStl+TYDmA4ljK4nJ5I/Bt4MPAZ8qa0fJxbZvnrwVmVH4fHaeOSm03PevmA7+qLcvGqZckZdLk4P00YG/gIymlBeXYv0fENOC9EXEBMLUcf7zN8zcDu1d+nzpOXWt99bFJz7qFFCFYNYbhIkk7XJNg2VQ+fqM2/jXgZOCISs1Im+dPqaxv9etUV329bnpuJaW0AdhQHasdFpIk7SBNPgprfRT129p46/e9KzWjbGuU4mB9tV+nOiq13fSUJA2IJsFyR/m4X228dTB8PbCy/PmwakFEzCjrVlaGVwIvq59ZBswGNgK/rNQ17SlJGhBNguWq8vEdrYEyFOYBjwK3pJTuoThz7F2VM8UA/g54Gvi3ytjVFAfeX1fptw/Fx2rXppSeAOiypyRpQEx4jCWldEdELKG4KPFPgDuBvwZeBXwopfRIWXo28B3gPyPiSuAlwPsorkP5eaXl1cAtwJKI+DTwIMWV97sA/1R7+aY9JUkDoum9wt4JfJIiTBZRnIL8npTSp1oFKaXvAm8Angd8ofz5AuD91UblxY4nUFwQ+X7gUxQfp/1VSumXtdpGPSVJg6PRvcJSSlso7hH20QnqrqG4t9dE/R6i+ChtXoPaRj0lSYPB2+ZLkrIyWCRJWRkskqSsDBZJUlYGiyQpK4NFkpSVwSJJyspgkSRlZbBIkrIyWCRJWRkskqSsDBZJUlYGiyQpK4NFkpSVwSJJyspgkSRlZbBIkrIyWCRJWRkskqSsDBZJUlYGiyQpK4NFkpSVwSJJyspgkSRlZbBIkrIyWCRJWRkskqSsDBZJUlYGiyQpK4NFkpSVwSJJyspgkSRlZbBIkrIyWCRJWRkskqSsDBZJUlYGiyQpK4NFkpSVwSJJyspgkSRlZbBIkrIyWCRJWfUULBHxoYhIEbGyzbqjImJ5RDwWEesiYlFE7N6mbiQiFkTEmojYFBG3RMRxHV6vUU9JUv91HSwR8QLgPODRNutmAt8HpgAfAC4D3g1c2abVYuAs4ArgTOBpYGlEHLkdPSVJfbZrD8+5CLidIpSm19ZdCPwOmJtS2ggQEfcDl0bEsSmlG8uxI4A3A2ellBaWY0uAu4EFwMu77SlJGgxd7bGUgfBWij2H+ro9gVcAS1oBUFoCbAROqYydBDxBsfcBQEppM3A5cExEjPbQU5I0ABoHS0QE8AXgX1NK2xxbAQ6l2AO6vTqYUtoCrARmVYZnAffWwgLgNiCAmT30lCQNgG4+CjsdOAQ4scP60fJxbZt1a4Eja7W/6VAHMKOHnn8QEdPZ9mO6sXa1kqS8GgVLROxBcWzlopRSuz/yAFPLx8fbrNtcWd+q7VRX7dVNz6r5wPkd1kmSdqCmeyznAVuAz45Ts6l8HGmzbkplfau2U121Vzc9qxZSnHVWNQYs61AvScpkwmApD6TPBz4K7FscagGKP+yTI2J/4GGe+bhqlG2NAmsqv68dp45KbTc9/yCltAHYUPvvaFcqScqsycH7fYHJFKcB/6qyzAYOLn/+MMWpwk8Ch1WfHBGTKQ7GVw/4rwQOiohptdeaXT7eVT5201OSNACaBMuvgNe3We4B7i9/XpJSehi4ATitFhinAdOAqypjVwO7AfNaAxExArwdWJFSWgPQZU9J0gCY8KOw8o/7NfXxiJgPPJlSqq47F7gZuCkiLqM4rvFBYGlK6YZKz1sj4irg4vKjtvuAtwEvAs6ovVSjnpKkwZD1JpQppTuB4ynO4voc8E7gUuDkNuWnA4vKx89T7MGckFJasR09JUl91sstXQBIKc3tML4cOLrB8
zcDZ5fLRLWNekqS+s/b5kuSsjJYJElZGSySpKwMFklSVgaLJCkrg0WSlJXBIknKymCRJGVlsEiSsjJYJElZGSySpKwMFklSVgaLJCkrg0WSlJXBIknKymCRJGVlsEiSsjJYJElZGSySpKwMFklSVgaLJCkrg0WSlJXBIknKymCRJGVlsEiSsjJYJElZGSySpKwMFklSVgaLJCkrg0WSlJXBIknKymCRJGVlsEiSsjJYJElZGSySpKwMFklSVgaLJCkrg0WSlJXBIknKymCRJGVlsEiSsjJYJElZGSySpKwmDJaIODwiLomIn0bEoxGxKiK+GREHtKk9KiKWR8RjEbEuIhZFxO5t6kYiYkFErImITRFxS0Qc1+H1G/WUJA2GJnssHwbeANwAnAl8BZgL/CgiDm4VRcRM4PvAFOADwGXAu4Er2/RcDJwFXFH2fBpYGhFHVou67ClJGgC7Nqj5LHBqSmlLayAirgR+QhE6Z5TDFwK/A+amlDaWdfcDl0bEsSmlG8uxI4A3A2ellBaWY0uAu4EFwMsrr92opyRpcEy4x5JSurkaKuXYL4B7gIMBImJP4BXAklYAlJYAG4FTKmMnAU9Q7H20+m0GLgeOiYjRHnpKkgZETwfvIyKAfYEHy6FDKfZ+bq/WlYG0EphVGZ4F3FsLC4DbgABm9tBTkjQgmnwU1s5bgP2Ac8vfR8vHtW1q1wLVYyejwG861AHM6KHnViJiOjC9NjzWqV6SlE/XwRIRBwGXAMuBr5bDU8vHx9s8ZXNlfau2U121Vzc96+YD54+zXpK0g3QVLBHxAuB64CHg5JTS0+WqTeXjSJunTamsb9V2qqv26qZn3UKKM8+qxoBl4zxHkpRB42CJiL2ApcBewNEppXWV1a2Pq0a3eWIxtqZW26mOSm03PbeSUtoAbKhtf6dySVJGjQ7eR8QU4DrgQOA1KaWf1UruBp4EDqs9bzLFwfiVleGVwEERMa3WY3b5eFcPPSVJA6LJlfeTKC5IPJLi469b6jUppYcpLqA8rRYYpwHTgKsqY1cDuwHzKq8xArwdWJFSWtNDT0nSgGjyUdhngNdS7LH8cUS8tbJuY0rpmvLnc4GbgZsi4jKKYxofBJamlG5oPSGldGtEXAVcXF6zch/wNuBFPHOxJd30lCQNjibB0rqu5G/KperXwDUAKaU7I+J4iqvnPwc8AlwKnNOm5+nAJ8rHvYEfAyeklFZUi7rsKUkaABMGS0ppbtNmKaXlwNEN6jYDZ5dLlp6SpMHgbfMlSVkZLJKkrAwWSVJWBoskKSuDRZKUlcEiScrKYJEkZWWwSJKyMlgkSVkZLJKkrAwWSVJWBoskKSuDRZKUlcEiScrKYJEkZWWwSJKyMlgkSVkZLJKkrAwWSVJWBoskKSuDRZKUlcEiScrKYJEkZWWwSJKyMlgkSVkZLJKkrAwWSVJWBoskKSuDRZKUlcEiScrKYJEkZWWwSJKyMlgkSVkZLJKkrAwWSVJWBoskKSuDRZKUlcEiScrKYJEkZWWwSJKyMlgkSVkZLJKkrAwWSVJWQxEsETESEQsiYk1EbIqIWyLiuH5vlyRpW0MRLMBi4CzgCuBM4GlgaUQc2c+NkiRta9d+b8BEIuII4M3AWSmlheXYEuBuYAHw8j5uniSpZuCDBTgJeAK4rDWQUtocEZcDn4yI0ZTS2n5t3NdvXdV2/NTZL3yWt0SSBsMwBMss4N6U0sba+G1AADOBrYIlIqYD02v1LwJYvXp1Txuxfs1vuqpf9O0H2o6/btZ+Pb2+JPVD5W/mpKbPGYZgGQXa/VVvhcmMNuvmA+e3azZnzpxMm9Wb+X19dUnq2ShwX5PCYQiWqcDjbcY3V9bXLaQ44F81Gfgz4BfAU11uwxiwDJgD9LbLo06c2x3Ded1xnmtzO4kiVH7Y9AnDECybgJE241Mq67eSUtoAbGjznJ/3sgER0fpxdUrp/l56qD3ndsdwXnec5+jcNtpTaRmG043XUqRlXWtsz
bO4LZKkCQxDsKwEDoqIabXx2eXjXc/y9kiSxjEMwXI1sBswrzUQESPA24EVKSX3WCRpgAz8MZaU0q0RcRVwcUS0zkp4G8Xpw2c8S5uxAfg47Y/baPs4tzuG87rjOLcTiJRSv7dhQhExBfgE8FZgb+DHwD+mlG7o64ZJkrYxFMEiSRoew3CMRZI0RAwWSVJWBss4/B6Y9iJiNCIuiogfRMTvIyJFxNwOta+NiDsjYnNErIqI8yNim5NGImJ6RHwlItZHxKMRcWNEzNyensMoIg6PiEsi4qflPKyKiG9GxAFtao+KiOUR8VhErIuIRRGxe5u6xu/jpj2HTUQcFhHfjohfl3OwLiL+IyKOalPrvG6vlJJLhwX4BrAFuBh4F3Bz+fuR/d62Ps/LXCBR3B5nRfnz3DZ1r6b47pwbgHcCn6e4nc4XanW7lH0eAT4G/D1wD8VZN3/eS89hXShOr19b/nfNA84D1gG/Bw6u1M2kuOvE7cB7gAsobnN0Xa/v4256DtsCvAm4rpzPdwAfBO4EngRe4bxmnu9+b8CgLsAR5R/M+ZWxKcAvgf/q9/b1eW72AJ5X/nziOMFyD3AHMKkydkEZBC+ujJ1S9jixMvZ84CFgSS89h3UBjgIm18ZeXP4hWlwZ+x7FfaqmVcbmlfN4bGWs8fu4ac+dZQF2pwjt7zqvmee23xswqEv5r5At1TdDOX4Oxb+YR/u9jYOwdAoW4JBy/F218Rnl+EcqY9+iuIN11Gq/TLEXs1u3PXe2pQzTW8uf96T4jqILazWTKfZsvlQZa/Q+7qbnzrQAPwGWO695F4+xdNbke2DU2azy8fbqYCrulLC6sr5Ve0cq/4+ruI1i7+iASl3TnjuNKO56uC/wYDl0KMXFzfV52EJxC6T63DZ5H3fTc2hFxB4RsU9E/EVEXAi8BPh+udp5zcRg6WyU2heIlcb7Hhg9o3WT0E5zOKNW22Suu+m5M3kLsB/Fnh04t9vjX4D1wL0Ux1m+BFxYrnNeMzFYOuvle2D0jNb8dJrDqbXaJnPdTc+dQkQcBFwCLAe+Wg47t737OPBK4G8pThgZobgXITiv2ewUp2juIF1/D4y20pqfTnO4qVbbZK676Tn0IuIFwPUUJzGcnFJ6ulzl3PYopfQTiuMqRMQVFB9RLQZOwnnNxj2WzvwemO3T2vXvNIdrarVN5rqbnkMtIvYClgJ7Aa9KKa2rrHZuM0gpPQFcC7whIqbivGZjsHTm98Bsn5Xl42HVwYiYQfHVritrtS+LylfzlWYDGylO4ey259Aqb7p6HXAg8JqU0s9qJXdTXH9Rn4fJFAeN63Pb5H3cTc+dyVSKg+174Lzm0+/T0gZ1oXiD1M9TH6G4KHB5v7dvUBbGv47lfyg+aqhec/IJimtODqyMvYltr2PZh+IjoCt66TmsC8X3i19LcYrqCePULQUeYOtrI95RzuPxlbHG7+OmPYdxAZ7fZmxP4H5glfOaeb77vQGDvFCchbMFWEBxZe2K8vej+71t/V4ormA+D/ha+T/I5eXv76vUvIatr5JfVAbAF2u9JgH/zTNX3r+X4l96DwMH1Gob9RzWBVhYzud3KL4morpUg/elFAd/q1dzbwK+1+v7uJuew7YAN1JcqHgexcWJHwdWle+lU5zXzPPd7w0Y5IXi4NqnKD4n3UxxjvpO+S+MHuYmdVjur9WdCPyonL8Hyv+hd23Tb2/gMoprNR4FfgC8tMNrN+o5jAtwUxdze0z5x2wT8FuK28D8UZuejd/HTXsO20JxFthNwP9S7A2up/i48S97nQPntfPi97FIkrLy4L0kKSuDRZKUlcEiScrKYJEkZWWwSJKyMlgkSVkZLJKkrAwWSVJWBoskKSuDRZKU1f8DkrCkLUD64zYAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.distplot(freqs, kde=False)\n", "pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Zipf's law\n", "\n", "Zipf's law states that the frequency $f$ of a word falls off as a power of its rank $r$, that is $f \\propto r^{-s}$ for some exponent $s > 0$. It follows that the number of words that occur with frequency $f$ is a random variable with a power law distribution\n", "\n", "$$\n", "p(f) = \\alpha f^{-1-1/s}\n", "$$\n", "\n", "A power law relationship appears as a straight line on a log-log plot." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "xs = freqs.sort_values(ascending=False).reset_index(drop=True, )\n", "plt.loglog(xs.index + 1, xs)\n", "plt.xlabel('Log(Rank)')\n", "plt.ylabel('Log(Frequency)')\n", "plt.title(\"Zipf's law\")\n", "pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stop words, lemmatization and stemming\n", "\n", "We can try to reduce the number of tokens using the simple strategies of stop words, stemming and lemmatization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Stop words\n", "\n", "The most common words are not very informative, and we may wish to remove them. There are other ways to handle this (e.g. with TF-IDF vectorizers) but we will simply use stop words for this section." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer(stop_words='english')" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "idx = np.nonzero(\n", " newsgroups_train.target == \n", " newsgroups_train.target_names.index('rec.sport.baseball')\n", ")[0]\n", "baseball_sample = [newsgroups_train.data[i] for i in idx]" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "X = vectorizer.fit_transform(baseball_sample)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "vocab = vectorizer.get_feature_names()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "rownames = [':'.join(filename.split('/')[-2:]) \n", " for filename in newsgroups_train.filenames[idx]]\n", "df = pd.SparseDataFrame(X, columns=vocab, index=rownames)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "freqs = df.sum(axis=0).astype('int')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also drop numbers." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "freqs = freqs[~freqs.index.str.isnumeric()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the most common words are more informative." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "year 310\n", "game 204\n", "good 200\n", "team 195\n", "think 189\n", "don 186\n", "just 161\n", "like 153\n", "games 149\n", "better 140\n", "baseball 137\n", "hit 137\n", "runs 137\n", "players 135\n", "time 131\n", "dtype: int64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "freqs.nlargest(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Stemming\n", "\n", "Stemming attempts to reduce words to a common root by applying heuristic affix-stripping rules (mostly suffixes), so the result need not be a dictionary word." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "def tokenize(text):\n", "    stem = SnowballStemmer('english')\n", "    text = text.lower()\n", "    \n", "    for token in nltk.word_tokenize(text):\n", "        if token in string.punctuation:\n", "            continue\n", "        yield stem.stem(token)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "text = '''circle circles circular circularity \n", "circumference circumscribe circumstantial\n", "infer inference inferences inferential'''" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['circl',\n", " 'circl',\n", " 'circular',\n", " 'circular',\n", " 'circumfer',\n", " 'circumscrib',\n", " 'circumstanti',\n", " 'infer',\n", " 'infer',\n", " 'infer',\n", " 'inferenti']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(tokenize(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Lemmatization\n", "\n", "Lemmatization also attempts to identify the common roots 
of words, but uses dictionary lookup to do so. Lemmatization often gives better results than stemming, but is slower." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "def tokenize(text):\n", "    lem = WordNetLemmatizer()\n", "    text = text.lower()\n", "    \n", "    for token in nltk.word_tokenize(text):\n", "        if token in string.punctuation:\n", "            continue\n", "        yield lem.lemmatize(token)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['circle',\n", " 'circle',\n", " 'circular',\n", " 'circularity',\n", " 'circumference',\n", " 'circumscribe',\n", " 'circumstantial',\n", " 'infer',\n", " 'inference',\n", " 'inference',\n", " 'inferential']" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(tokenize(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Word cloud" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "from wordcloud import WordCloud" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "wordcloud = WordCloud().generate(' '.join(freqs.nlargest(200).index))\n", "pass" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.imshow(wordcloud, interpolation='bilinear')\n", "plt.axis(\"off\")\n", "pass" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "from imageio import imread" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "rabbit = imread('data/rabbit.png').astype('ubyte')" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "wc = WordCloud(mask=rabbit[:,:,0], \n", " mode='RGBA',\n", " background_color=None)\n", "wc.generate(' '.join(freqs.nlargest(200).index))\n", "pass" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.imshow(wc, interpolation='bilinear')\n", "plt.axis(\"off\")\n", "pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Supervised Learning\n", "\n", "A general framework for supervised learning on text is\n", "\n", "construct corpus $\\to$ vectorization of features $\\to$ classification $\\to$ evaluation (often by cross-validation)\n", "\n", "For example, we may classify documents into topics, or by sentiment, or as spam/not spam." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorization of features\n", "\n", "There are 3 common methods to vectorize features when the text is treated as a bag of words - word count, one hot encoding and TF-IDF." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "small_sample = \"\"\"Do you like green eggs and ham?\n", "I do not like them, Sam-I-am.\n", "I do not like green eggs and ham!\n", "Would you like them here or there?\n", "I would not like them here or there.\n", "I would not like them anywhere.\n", "I do so like green eggs and ham!\n", "Thank you! Thank you,\n", "Sam-I-am!\"\"\".splitlines()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Do you like green eggs and ham?',\n", " 'I do not like them, Sam-I-am.',\n", " 'I do not like green eggs and ham!',\n", " 'Would you like them here or there?',\n", " 'I would not like them here or there.',\n", " 'I would not like them anywhere.',\n", " 'I do so like green eggs and ham!',\n", " 'Thank you! 
Thank you,',\n", " 'Sam-I-am!']" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "small_sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Word counts" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "count_vectorizer = CountVectorizer()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "X = count_vectorizer.fit_transform(small_sample)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " am and anywhere do eggs green ham here like not\n", "0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0\n", "1 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0\n", "2 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0\n", "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0\n", "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0\n", "5 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0\n", "6 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0\n", "7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", "8 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab = count_vectorizer.get_feature_names()\n", "df = pd.SparseDataFrame(X, columns=vocab)\n", "df.fillna(0).iloc[:, :10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Hashing\n", "\n", "If the number of words is too large, we can hash words into a fixed number of buckets to keep the computations tractable. However, we lose the ability to map back to the original tokens." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "hash_vectorizer = HashingVectorizer(n_features=5)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "X = hash_vectorizer.fit_transform(small_sample)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-0.60302269, 0.30151134, 0.30151134, 0.30151134, 0.60302269],\n", " [ 0.5 , 0.5 , 0. , 0.5 , -0.5 ],\n", " [-0.60302269, 0.30151134, 0.30151134, 0.30151134, 0.60302269],\n", " [-0.57735027, 0. , 0.57735027, 0. , -0.57735027],\n", " [-0.57735027, 0. , 0.57735027, 0. , -0.57735027],\n", " [-0.90453403, 0. , 0. , 0.30151134, -0.30151134],\n", " [-0.60302269, 0.30151134, 0.30151134, 0.30151134, 0.60302269],\n", " [-0.70710678, 0. , 0. , 0.70710678, 0. ],\n", " [ 1. , 0. , 0. , 0. , 0. 
]])" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.toarray()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### One hot encoding\n", "\n", "One hot encoding simply sets words with non-zero counts to 1." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "one_hot_vectorizer = CountVectorizer(binary=True)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "X = one_hot_vectorizer.fit_transform(small_sample)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " am and anywhere do eggs green ham here like not\n", "0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0\n", "1 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0\n", "2 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0\n", "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0\n", "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0\n", "5 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0\n", "6 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0\n", "7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", "8 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab = one_hot_vectorizer.get_feature_names()\n", "df = pd.SparseDataFrame(X, columns=vocab)\n", "df.fillna(0).iloc[:, :10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### TF-IDF\n", "\n", "You have previously implemented this in your homework." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "tf_idf_vectorizer = TfidfVectorizer()" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\", category=FutureWarning)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "X = tf_idf_vectorizer.fit_transform(small_sample)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " am and anywhere do eggs green ham \\\n", "0 0.000000 0.401996 0.00000 0.355186 0.401996 0.401996 0.401996 \n", "1 0.495165 0.000000 0.00000 0.380398 0.000000 0.000000 0.000000 \n", "2 0.000000 0.409316 0.00000 0.361653 0.409316 0.409316 0.409316 \n", "3 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 \n", "4 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 \n", "5 0.000000 0.000000 0.62005 0.000000 0.000000 0.000000 0.000000 \n", "6 0.000000 0.376827 0.00000 0.332947 0.376827 0.376827 0.376827 \n", "7 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 \n", "8 0.707107 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 \n", "\n", " here like not \n", "0 0.000000 0.256589 0.000000 \n", "1 0.000000 0.274803 0.380398 \n", "2 0.000000 0.261261 0.361653 \n", "3 0.429929 0.238598 0.000000 \n", "4 0.436672 0.242341 0.335463 \n", "5 0.000000 0.290641 0.402322 \n", "6 0.000000 0.240523 0.000000 \n", "7 0.000000 0.000000 0.000000 \n", "8 0.000000 0.000000 0.000000 " ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab = tf_idf_vectorizer.get_feature_names()\n", "df = pd.SparseDataFrame(X, columns=vocab)\n", "df.fillna(0).iloc[:, :10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Maintaining context\n", "\n", "For some supervised learning tasks, such as sentiment analysis (is this review positive or negative?), the context of words is very important. For example, the following two reviews use very similar words but have very different meanings.\n", "\n", "- `Only an idiot like Reviewer two could love that movie`\n", "- `Could not love that movie more. Reviewer one is an idiot`\n", "\n", "In this case, we need to take the context of individual words into account. Common ways to do so include N-grams (and collocations, the N-grams that co-occur more often than chance), part-of-speech (POS) tagging and grammars, and the `word2vec` family of algorithms." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### N-grams" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "count_vectorizer = CountVectorizer(ngram_range=(1,3))" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "X = count_vectorizer.fit_transform(small_sample)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " am and and ham anywhere do do not do not like do so do so like \\\n", "0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 \n", "1 1.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 \n", "2 0.0 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "5 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 \n", "6 0.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 \n", "7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "8 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " do you \n", "0 1.0 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 0.0 \n", "5 0.0 \n", "6 0.0 \n", "7 0.0 \n", "8 0.0 " ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab = count_vectorizer.get_feature_names()\n", "df = pd.SparseDataFrame(X, columns=vocab)\n", "df.fillna(0).iloc[:, :10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Significant collocation\n", "\n", "Most n-grams are not meaningful phrases. We can use statistical tests of word co-occurrence and keep only the significant collocations. In essence, we test against the null hypothesis that the words in the n-gram occur together by chance, with each word drawn independently according to its empirical frequency." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "abstract = '''Macrophages represent one of the most numerous and diverse \n", "leukocyte types in the body. Furthermore, they are important regulators \n", "and promoters of many cardiovascular disease programs. Their functions \n", "range from sensing pathogens to digesting cell debris, modulating inflammation, \n", "and producing key cytokines and other regulatory factors throughout the body. \n", "Macrophage research has undergone a renaissance in recent years, which \n", "has propelled a newfound interest in their heterogeneity as well as a \n", "new understanding of ontological differences in their development. 
\n", "In addition, recent technological advances such as single-cell \n", "mass-cytometry by time-of-flight have enabled phenotype and functional \n", "analyses of individual immune myeloid cells, including macrophages, \n", "at unprecedented resolution. In this Part 1 of a 4-part review series \n", "covering the macrophage in cardiovascular disease, we focus on the \n", "basic principles of macrophage development, heterogeneity, phenotype, \n", "tissue-specific differentiation, and functionality as a basis to understand \n", "their role in cardiovascular disease.'''" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "ngrams = TrigramCollocationFinder.from_words(nltk.tokenize.word_tokenize(abstract))" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "scores = ngrams.score_ngrams(TrigramAssocMeasures.likelihood_ratio)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(('in', 'cardiovascular', 'disease'), 60.22140084295821),\n", " (('cardiovascular', 'disease', 'programs'), 57.490270384342544),\n", " (('many', 'cardiovascular', 'disease'), 57.490270384342544),\n", " (('cardiovascular', 'disease', '.'), 49.568274269761346),\n", " (('cardiovascular', 'disease', ','), 47.586079738744886)]" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores[:5]" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(('development', ',', 'heterogeneity'), 18.377430413805826),\n", " (('heterogeneity', ',', 'phenotype'), 18.377430413805826),\n", " (('the', 'macrophage', 'in'), 17.35538066534174),\n", " ((',', 'heterogeneity', ','), 12.326088385780718),\n", " ((',', 'phenotype', ','), 12.326088385780718)]" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores[-5:]" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "## Part-of-speech tagging\n", "\n", "Regex for grammar from this [blog](http://bdewilde.github.io/blog/2014/09/23/intro-to-automatic-keyphrase-extraction/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Parts of speech in NLTK" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$: dollar\n", " $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$\n", "'': closing quotation mark\n", " ' ''\n", "(: opening parenthesis\n", " ( [ {\n", "): closing parenthesis\n", " ) ] }\n", ",: comma\n", " ,\n", "--: dash\n", " --\n", ".: sentence terminator\n", " . ! ?\n", ":: colon or ellipsis\n", " : ; ...\n", "CC: conjunction, coordinating\n", " & 'n and both but either et for less minus neither nor or plus so\n", " therefore times v. versus vs. whether yet\n", "CD: numeral, cardinal\n", " mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-\n", " seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025\n", " fifteen 271,124 dozen quintillion DM2,000 ...\n", "DT: determiner\n", " all an another any both del each either every half la many much nary\n", " neither no some such that the them these this those\n", "EX: existential there\n", " there\n", "FW: foreign word\n", " gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous\n", " lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte\n", " terram fiche oui corporis ...\n", "IN: preposition or conjunction, subordinating\n", " astride among uppon whether out inside pro despite on by throughout\n", " below within for towards near behind atop around if like until below\n", " next into if beside ...\n", "JJ: adjective or numeral, ordinal\n", " third ill-mannered pre-war regrettable oiled calamitous first separable\n", " ectoplasmic battery-powered participatory fourth still-to-be-named\n", " multilingual multi-disciplinary ...\n", "JJR: 
adjective, comparative\n", " bleaker braver breezier briefer brighter brisker broader bumper busier\n", " calmer cheaper choosier cleaner clearer closer colder commoner costlier\n", " cozier creamier crunchier cuter ...\n", "JJS: adjective, superlative\n", " calmest cheapest choicest classiest cleanest clearest closest commonest\n", " corniest costliest crassest creepiest crudest cutest darkest deadliest\n", " dearest deepest densest dinkiest ...\n", "LS: list item marker\n", " A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005\n", " SP-44007 Second Third Three Two * a b c d first five four one six three\n", " two\n", "MD: modal auxiliary\n", " can cannot could couldn't dare may might must need ought shall should\n", " shouldn't will would\n", "NN: noun, common, singular or mass\n", " common-carrier cabbage knuckle-duster Casino afghan shed thermostat\n", " investment slide humour falloff slick wind hyena override subhumanity\n", " machinist ...\n", "NNP: noun, proper, singular\n", " Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos\n", " Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA\n", " Shannon A.K.C. 
Meltex Liverpool ...\n", "NNPS: noun, proper, plural\n", " Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists\n", " Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques\n", " Apache Apaches Apocrypha ...\n", "NNS: noun, common, plural\n", " undergraduates scotches bric-a-brac products bodyguards facets coasts\n", " divestitures storehouses designs clubs fragrances averages\n", " subjectivists apprehensions muses factory-jobs ...\n", "PDT: pre-determiner\n", " all both half many quite such sure this\n", "POS: genitive marker\n", " ' 's\n", "PRP: pronoun, personal\n", " hers herself him himself hisself it itself me myself one oneself ours\n", " ourselves ownself self she thee theirs them themselves they thou thy us\n", "PRP$: pronoun, possessive\n", " her his mine my our ours their thy your\n", "RB: adverb\n", " occasionally unabatingly maddeningly adventurously professedly\n", " stirringly prominently technologically magisterially predominately\n", " swiftly fiscally pitilessly ...\n", "RBR: adverb, comparative\n", " further gloomier grander graver greater grimmer harder harsher\n", " healthier heavier higher however larger later leaner lengthier less-\n", " perfectly lesser lonelier longer louder lower more ...\n", "RBS: adverb, superlative\n", " best biggest bluntest earliest farthest first furthest hardest\n", " heartiest highest largest least less most nearest second tightest worst\n", "RP: particle\n", " aboard about across along apart around aside at away back before behind\n", " by crop down ever fast for forth from go high i.e. in into just later\n", " low more off on open out over per pie raising start teeth that through\n", " under unto up up-pp upon whole with you\n", "SYM: symbol\n", " % & ' '' ''. ) ). * + ,. 
< = > @ A[fj] U.S U.S.S.R * ** ***\n", "TO: \"to\" as preposition or infinitive marker\n", " to\n", "UH: interjection\n", " Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen\n", " huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly\n", " man baby diddle hush sonuvabitch ...\n", "VB: verb, base form\n", " ask assemble assess assign assume atone attention avoid bake balkanize\n", " bank begin behold believe bend benefit bevel beware bless boil bomb\n", " boost brace break bring broil brush build ...\n", "VBD: verb, past tense\n", " dipped pleaded swiped regummed soaked tidied convened halted registered\n", " cushioned exacted snubbed strode aimed adopted belied figgered\n", " speculated wore appreciated contemplated ...\n", "VBG: verb, present participle or gerund\n", " telegraphing stirring focusing angering judging stalling lactating\n", " hankerin' alleging veering capping approaching traveling besieging\n", " encrypting interrupting erasing wincing ...\n", "VBN: verb, past participle\n", " multihulled dilapidated aerosolized chaired languished panelized used\n", " experimented flourished imitated reunifed factored condensed sheared\n", " unsettled primed dubbed desired ...\n", "VBP: verb, present tense, not 3rd person singular\n", " predominate wrap resort sue twist spill cure lengthen brush terminate\n", " appear tend stray glisten obtain comprise detest tease attract\n", " emphasize mold postpone sever return wag ...\n", "VBZ: verb, present tense, 3rd person singular\n", " bases reconstructs marks mixes displeases seals carps weaves snatches\n", " slumps stretches authorizes smolders pictures emerges stockpiles\n", " seduces fizzes uses bolsters slaps speaks pleads ...\n", "WDT: WH-determiner\n", " that what whatever which whichever\n", "WP: WH-pronoun\n", " that what whatever whatsoever which who whom whosoever\n", "WP$: WH-pronoun, possessive\n", " whose\n", "WRB: Wh-adverb\n", " how however whence whenever where whereby 
whereever wherein whereof why\n", "``: opening quotation mark\n", " ` ``\n" ] } ], "source": [ "nltk.help.upenn_tagset()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a [paragraph](https://en.wikipedia.org/wiki/Alfred_Nobel) from Wikipedia." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "nobel = \"Born in Stockholm, Alfred Nobel was the third son of Immanuel Nobel (1801–1872), an inventor and engineer, and Carolina Andriette (Ahlsell) Nobel (1805–1889).The couple married in 1827 and had eight children. The family was impoverished, and only Alfred and his three brothers survived past childhood. Through his father, Alfred Nobel was a descendant of the Swedish scientist Olaus Rudbeck (1630–1702),and in his turn the boy was interested in engineering, particularly explosives, learning the basic principles from his father at a young age. Alfred Nobel's interest in technology was inherited from his father, an alumnus of Royal Institute of Technology in Stockholm.\"" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Born in Stockholm, Alfred Nobel was the third son of Immanuel Nobel (1801–1872), an inventor and engineer, and Carolina Andriette (Ahlsell) Nobel (1805–1889).The couple married in 1827 and had eight children. The family was impoverished, and only Alfred and his three brothers survived past childhood. Through his father, Alfred Nobel was a descendant of the Swedish scientist Olaus Rudbeck (1630–1702),and in his turn the boy was interested in engineering, particularly explosives, learning the basic principles from his father at a young age. 
Alfred Nobel's interest in technology was inherited from his father, an alumnus of Royal Institute of Technology in Stockholm.\"" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nobel" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "text = nltk.word_tokenize(nobel)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "pos = nltk.pos_tag(text)" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Born', 'VBN'),\n", " ('in', 'IN'),\n", " ('Stockholm', 'NNP'),\n", " (',', ','),\n", " ('Alfred', 'NNP'),\n", " ('Nobel', 'NNP'),\n", " ('was', 'VBD'),\n", " ('the', 'DT'),\n", " ('third', 'JJ'),\n", " ('son', 'NN'),\n", " ('of', 'IN'),\n", " ('Immanuel', 'NNP'),\n", " ('Nobel', 'NNP'),\n", " ('(', '('),\n", " ('1801–1872', 'CD'),\n", " (')', ')'),\n", " (',', ','),\n", " ('an', 'DT'),\n", " ('inventor', 'NN'),\n", " ('and', 'CC'),\n", " ('engineer', 'NN'),\n", " (',', ','),\n", " ('and', 'CC'),\n", " ('Carolina', 'NNP'),\n", " ('Andriette', 'NNP'),\n", " ('(', '('),\n", " ('Ahlsell', 'NNP'),\n", " (')', ')'),\n", " ('Nobel', 'NNP'),\n", " ('(', '('),\n", " ('1805–1889', 'CD'),\n", " (')', ')')]" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pos[:32]" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "grammar = 'KP: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "chunker = nltk.RegexpParser(grammar)" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "tree = chunker.parse(pos[:32])" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "Tree('S', [('Born', 'VBN'), ('in', 'IN'), Tree('KP', [('Stockholm', 'NNP')]), (',', ','), Tree('KP', [('Alfred', 'NNP'), ('Nobel', 'NNP')]), ('was', 'VBD'), ('the', 'DT'), Tree('KP', [('third', 'JJ'), ('son', 'NN'), ('of', 'IN'), ('Immanuel', 'NNP'), ('Nobel', 'NNP')]), ('(', '('), ('1801–1872', 'CD'), (')', ')'), (',', ','), ('an', 'DT'), Tree('KP', [('inventor', 'NN')]), ('and', 'CC'), Tree('KP', [('engineer', 'NN')]), (',', ','), ('and', 'CC'), Tree('KP', [('Carolina', 'NNP'), ('Andriette', 'NNP')]), ('(', '('), Tree('KP', [('Ahlsell', 'NNP')]), (')', ')'), Tree('KP', [('Nobel', 'NNP')]), ('(', '('), ('1805–1889', 'CD'), (')', ')')])" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree.collapse_unary" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "import itertools" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Stockholm',\n", " 'Alfred',\n", " 'Nobel',\n", " 'third',\n", " 'son of Immanuel Nobel',\n", " 'inventor',\n", " 'engineer',\n", " 'Carolina',\n", " 'Andriette',\n", " 'Ahlsell',\n", " 'Nobel']" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kps = [ ]\n", "for key, group in itertools.groupby(nltk.tree2conlltags(tree), lambda x: x[-1]):\n", " if key != 'O':\n", 
" phrase = []\n", " for word, pos, cls in group:\n", " phrase.append(word)\n", " kps.append(' '.join(phrase))\n", "kps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding named entities\n", "\n", "We use a pre-trained model from `spacy`. See [here](https://spacy.io/usage/training#ner) if you want to train on your own corpus or extend the pre-trained model.\n", "\n", "The default model is not perfect, but may be good enough for your needs." ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "import spacy\n", "from spacy import displacy\n", "import en_core_web_sm\n", "nlp = en_core_web_sm.load()" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "doc = nlp(nobel)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(Born, 'O', ''), (in, 'O', ''), (Stockholm, 'B', 'GPE'), (,, 'O', ''), (Alfred, 'B', 'PERSON'), (Nobel, 'I', 'PERSON'), (was, 'O', ''), (the, 'O', ''), (third, 'B', 'ORDINAL'), (son, 'O', ''), (of, 'O', ''), (Immanuel, 'B', 'PERSON'), (Nobel, 'I', 'PERSON'), ((, 'O', ''), (1801–1872, 'B', 'CARDINAL'), (), 'O', ''), (,, 'O', ''), (an, 'O', ''), (inventor, 'O', ''), (and, 'O', ''), (engineer, 'O', ''), (,, 'O', ''), (and, 'O', ''), (Carolina, 'B', 'PERSON'), (Andriette, 'I', 'PERSON'), ((, 'O', ''), (Ahlsell, 'O', ''), (), 'O', ''), (Nobel, 'B', 'WORK_OF_ART'), ((, 'O', ''), (1805–1889).The, 'O', ''), (couple, 'O', ''), (married, 'O', ''), (in, 'O', ''), (1827, 'B', 'DATE'), (and, 'O', ''), (had, 'O', ''), (eight, 'B', 'CARDINAL'), (children, 'O', ''), (., 'O', ''), (The, 'O', ''), (family, 'O', ''), (was, 'O', ''), (impoverished, 'O', ''), (,, 'O', ''), (and, 'O', ''), (only, 'O', ''), (Alfred, 'B', 'PERSON'), (and, 'O', ''), (his, 'O', ''), (three, 'B', 'CARDINAL'), (brothers, 'O', ''), (survived, 'O', ''), (past, 'O', ''), (childhood, 'O', ''), (., 'O', ''), 
(Through, 'O', ''), (his, 'O', ''), (father, 'O', ''), (,, 'O', ''), (Alfred, 'B', 'PERSON'), (Nobel, 'I', 'PERSON'), (was, 'O', ''), (a, 'O', ''), (descendant, 'O', ''), (of, 'O', ''), (the, 'O', ''), (Swedish, 'B', 'NORP'), (scientist, 'O', ''), (Olaus, 'B', 'PERSON'), (Rudbeck, 'I', 'PERSON'), ((, 'O', ''), (1630–1702),and, 'B', 'LOC'), (in, 'O', ''), (his, 'O', ''), (turn, 'O', ''), (the, 'O', ''), (boy, 'O', ''), (was, 'O', ''), (interested, 'O', ''), (in, 'O', ''), (engineering, 'O', ''), (,, 'O', ''), (particularly, 'O', ''), (explosives, 'O', ''), (,, 'O', ''), (learning, 'O', ''), (the, 'O', ''), (basic, 'O', ''), (principles, 'O', ''), (from, 'O', ''), (his, 'O', ''), (father, 'O', ''), (at, 'O', ''), (a, 'O', ''), (young, 'O', ''), (age, 'O', ''), (., 'O', ''), (Alfred, 'B', 'PERSON'), (Nobel, 'I', 'PERSON'), ('s, 'I', 'PERSON'), (interest, 'O', ''), (in, 'O', ''), (technology, 'O', ''), (was, 'O', ''), (inherited, 'O', ''), (from, 'O', ''), (his, 'O', ''), (father, 'O', ''), (,, 'O', ''), (an, 'O', ''), (alumnus, 'O', ''), (of, 'O', ''), (Royal, 'B', 'ORG'), (Institute, 'I', 'ORG'), (of, 'I', 'ORG'), (Technology, 'I', 'ORG'), (in, 'O', ''), (Stockholm, 'B', 'GPE'), (., 'O', '')]\n" ] } ], "source": [ "print([(X, X.ent_iob_, X.ent_type_) for X in doc])" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Born in \n", "\n", " Stockholm\n", " GPE\n", "\n", ", \n", "\n", " Alfred Nobel\n", " PERSON\n", "\n", " was the \n", "\n", " third\n", " ORDINAL\n", "\n", " son of \n", "\n", " Immanuel Nobel\n", " PERSON\n", "\n", " (\n", "\n", " 1801–1872\n", " CARDINAL\n", "\n", "), an inventor and engineer, and \n", "\n", " Carolina Andriette\n", " PERSON\n", "\n", " (Ahlsell) \n", "\n", " Nobel\n", " WORK_OF_ART\n", "\n", " (1805–1889).The couple married in \n", "\n", " 1827\n", " DATE\n", "\n", " and had \n", "\n", " eight\n", " CARDINAL\n", "\n", " children. The family was impoverished, and only \n", "\n", " Alfred\n", " PERSON\n", "\n", " and his \n", "\n", " three\n", " CARDINAL\n", "\n", " brothers survived past childhood. Through his father, \n", "\n", " Alfred Nobel\n", " PERSON\n", "\n", " was a descendant of the \n", "\n", " Swedish\n", " NORP\n", "\n", " scientist \n", "\n", " Olaus Rudbeck\n", " PERSON\n", "\n", " (\n", "\n", " 1630–1702),and\n", " LOC\n", "\n", " in his turn the boy was interested in engineering, particularly explosives, learning the basic principles from his father at a young age. \n", "\n", " Alfred Nobel's\n", " PERSON\n", "\n", " interest in technology was inherited from his father, an alumnus of \n", "\n", " Royal Institute of Technology\n", " ORG\n", "\n", " in \n", "\n", " Stockholm\n", " GPE\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(doc, jupyter=True, style='ent')" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Alfred Nobel\n", "Immanuel Nobel\n", "Carolina Andriette\n", "Alfred\n", "Alfred Nobel\n", "Olaus Rudbeck\n", "Alfred Nobel's\n" ] } ], "source": [ "for entity in doc.ents:\n", "    if entity.label_ == 'PERSON':\n", "        print(entity)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }