Text Analysis

In this lecture we cover exploratory data analysis and supervised learning for free text. In the next lecture, we will look at unsupervised learning and topic models.

Along the way, we will use the packages nltk, scikit-learn, wordcloud, and spacy. There are many, many other packages useful for text analysis.

Exploratory data analysis

Corpus

A corpus is a collection of text documents. There are many ways to create a corpus, and they may come from documents, scraped web pages, Twitter streams, speech translation and so on. The first step in any text analysis application is nearly always to create an application-specific corpus. This is important, because the language patterns in different domains are often very different (e.g. contrast medical records with legal documents with Twitter streams).

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook', font_scale=1.5)
In [2]:
import numpy as np
import pandas as pd
In [3]:
import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.collocations import QuadgramCollocationFinder, TrigramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures, TrigramAssocMeasures
import string

Toy corpus

We show how a small corpus with two documents is broken down into progressively smaller pieces:

document \(\to\) paragraphs \(\to\) sentences \(\to\) tokens

Although this explicit decomposition may not be necessary in all applications, it is still useful to be aware of these units:

  • A paragraph contains an idea
  • A sentence is a unit of syntax
  • A token (word or punctuation) is the smallest meaningful unit
In [4]:
docs = [
    '''Spicy jalapeno bacon ipsum dolor amet aute prosciutto velit corned beef consectetur. Aute kielbasa adipisicing, nostrud drumstick ipsum tail pig capicola burgdoggen corned beef. Dolor proident salami deserunt. Venison capicola pork belly bacon aliquip swine incididunt sint quis cupidatat pork chop et turducken nulla beef. Ground round kielbasa tri-tip consectetur, t-bone pariatur deserunt id ut adipisicing.

Strip steak meatball chuck aute, pork loin turkey pork commodo et officia. Rump enim spare ribs, prosciutto chuck deserunt tail. Aute pork lorem sausage. Nostrud dolore kevin proident pork chop do in. Exercitation shoulder dolore kevin ut, sausage ullamco frankfurter ham hock. Ground round fatback ribeye turkey tri-tip capicola.''',
    '''Burgdoggen id ham hock ut kielbasa. Eu pork chop anim picanha sed porchetta dolor consequat drumstick shankle proident pork andouille. Et cupim burgdoggen, officia lorem shank ut sed drumstick shankle salami ad ball tip dolore pig. Shankle turkey officia, reprehenderit bacon ipsum ullamco enim tail tongue. Brisket short ribs biltong jerky flank, venison filet mignon tenderloin culpa bacon meatball short loin commodo. Leberkas jowl prosciutto, et kielbasa pancetta chicken. Nisi minim sausage porchetta jowl.

Beef ribs pariatur pork chop dolore ex, consequat turducken frankfurter esse filet mignon lorem bacon. Elit dolore porchetta meatball ea, pork loin pork anim non sirloin. Aliquip tenderloin reprehenderit pariatur, leberkas alcatra short loin. Fugiat elit meatloaf, nulla cow in sausage. Doner consequat shankle salami est, boudin deserunt. Drumstick ham lorem reprehenderit.

Beef adipisicing nisi rump filet mignon cillum leberkas boudin tail picanha pork loin. Culpa picanha ground round in laborum spare ribs. Burgdoggen leberkas landjaeger adipisicing strip steak velit doner eu ground round meatloaf consectetur deserunt anim ball tip cow. Porchetta ad minim eiusmod labore eu nisi boudin laboris officia jowl deserunt strip steak. Shank aliquip beef ribs tri-tip ipsum flank. Turducken elit meatloaf aliqua corned beef sirloin irure. Tongue cupim ullamco in sint prosciutto.'''
]
Documents
In [5]:
docs
Out[5]:
['Spicy jalapeno bacon ipsum dolor amet aute prosciutto velit corned beef consectetur. Aute kielbasa adipisicing, nostrud drumstick ipsum tail pig capicola burgdoggen corned beef. Dolor proident salami deserunt. Venison capicola pork belly bacon aliquip swine incididunt sint quis cupidatat pork chop et turducken nulla beef. Ground round kielbasa tri-tip consectetur, t-bone pariatur deserunt id ut adipisicing.\n\nStrip steak meatball chuck aute, pork loin turkey pork commodo et officia. Rump enim spare ribs, prosciutto chuck deserunt tail. Aute pork lorem sausage. Nostrud dolore kevin proident pork chop do in. Exercitation shoulder dolore kevin ut, sausage ullamco frankfurter ham hock. Ground round fatback ribeye turkey tri-tip capicola.',
 'Burgdoggen id ham hock ut kielbasa. Eu pork chop anim picanha sed porchetta dolor consequat drumstick shankle proident pork andouille. Et cupim burgdoggen, officia lorem shank ut sed drumstick shankle salami ad ball tip dolore pig. Shankle turkey officia, reprehenderit bacon ipsum ullamco enim tail tongue. Brisket short ribs biltong jerky flank, venison filet mignon tenderloin culpa bacon meatball short loin commodo. Leberkas jowl prosciutto, et kielbasa pancetta chicken. Nisi minim sausage porchetta jowl.\n\nBeef ribs pariatur pork chop dolore ex, consequat turducken frankfurter esse filet mignon lorem bacon. Elit dolore porchetta meatball ea, pork loin pork anim non sirloin. Aliquip tenderloin reprehenderit pariatur, leberkas alcatra short loin. Fugiat elit meatloaf, nulla cow in sausage. Doner consequat shankle salami est, boudin deserunt. Drumstick ham lorem reprehenderit.\n\nBeef adipisicing nisi rump filet mignon cillum leberkas boudin tail picanha pork loin. Culpa picanha ground round in laborum spare ribs. Burgdoggen leberkas landjaeger adipisicing strip steak velit doner eu ground round meatloaf consectetur deserunt anim ball tip cow. Porchetta ad minim eiusmod labore eu nisi boudin laboris officia jowl deserunt strip steak. Shank aliquip beef ribs tri-tip ipsum flank. Turducken elit meatloaf aliqua corned beef sirloin irure. Tongue cupim ullamco in sint prosciutto.']
In [6]:
from itertools import chain
In [7]:
def flatten(listOfLists):
    """Flatten one level of nesting."""
    return list(chain.from_iterable(listOfLists))

Paragraphs

In [8]:
paras = flatten([doc.split('\n\n') for doc in docs])
In [9]:
paras[:3]
Out[9]:
['Spicy jalapeno bacon ipsum dolor amet aute prosciutto velit corned beef consectetur. Aute kielbasa adipisicing, nostrud drumstick ipsum tail pig capicola burgdoggen corned beef. Dolor proident salami deserunt. Venison capicola pork belly bacon aliquip swine incididunt sint quis cupidatat pork chop et turducken nulla beef. Ground round kielbasa tri-tip consectetur, t-bone pariatur deserunt id ut adipisicing.',
 'Strip steak meatball chuck aute, pork loin turkey pork commodo et officia. Rump enim spare ribs, prosciutto chuck deserunt tail. Aute pork lorem sausage. Nostrud dolore kevin proident pork chop do in. Exercitation shoulder dolore kevin ut, sausage ullamco frankfurter ham hock. Ground round fatback ribeye turkey tri-tip capicola.',
 'Burgdoggen id ham hock ut kielbasa. Eu pork chop anim picanha sed porchetta dolor consequat drumstick shankle proident pork andouille. Et cupim burgdoggen, officia lorem shank ut sed drumstick shankle salami ad ball tip dolore pig. Shankle turkey officia, reprehenderit bacon ipsum ullamco enim tail tongue. Brisket short ribs biltong jerky flank, venison filet mignon tenderloin culpa bacon meatball short loin commodo. Leberkas jowl prosciutto, et kielbasa pancetta chicken. Nisi minim sausage porchetta jowl.']
Sentences
In [10]:
sentences = flatten([nltk.tokenize.sent_tokenize(para) for para in paras])
In [11]:
sentences[:10]
Out[11]:
['Spicy jalapeno bacon ipsum dolor amet aute prosciutto velit corned beef consectetur.',
 'Aute kielbasa adipisicing, nostrud drumstick ipsum tail pig capicola burgdoggen corned beef.',
 'Dolor proident salami deserunt.',
 'Venison capicola pork belly bacon aliquip swine incididunt sint quis cupidatat pork chop et turducken nulla beef.',
 'Ground round kielbasa tri-tip consectetur, t-bone pariatur deserunt id ut adipisicing.',
 'Strip steak meatball chuck aute, pork loin turkey pork commodo et officia.',
 'Rump enim spare ribs, prosciutto chuck deserunt tail.',
 'Aute pork lorem sausage.',
 'Nostrud dolore kevin proident pork chop do in.',
 'Exercitation shoulder dolore kevin ut, sausage ullamco frankfurter ham hock.']
Tokens
In [12]:
tokens = flatten([nltk.tokenize.word_tokenize(sentence) for sentence in sentences])
In [13]:
tokens[:10]
Out[13]:
['Spicy',
 'jalapeno',
 'bacon',
 'ipsum',
 'dolor',
 'amet',
 'aute',
 'prosciutto',
 'velit',
 'corned']

Exploratory analysis of the newsgroup corpus

In [14]:
from sklearn.datasets import fetch_20newsgroups

For convenience, we will use an existing corpus, the 20 newsgroups dataset, which comprises around 18,000 newsgroup posts on 20 topics. The 20 topics are

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
In [15]:
newsgroups_train = fetch_20newsgroups(
    subset='train',
    categories=('rec.sport.baseball',
                'rec.sport.hockey',
                'sci.med',
                'sci.space'),

    remove=('headers', 'footers', 'quotes'))
In [16]:
newsgroups_train.keys()
Out[16]:
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])
In [17]:
newsgroups_train.description
Out[17]:
'the 20 newsgroups by date dataset'
In [18]:
newsgroups_train.filenames.shape
Out[18]:
(2384,)
In [19]:
newsgroups_train.target.shape
Out[19]:
(2384,)
In [20]:
newsgroups_train.target_names
Out[20]:
['rec.sport.baseball', 'rec.sport.hockey', 'sci.med', 'sci.space']
In [21]:
newsgroups_train.data[0]
Out[21]:
'\nA freeze dried Tootsie Roll (tm).  The actual taste sensation was like nothing\nyou will ever willingly experience.  The amazing thing was that we ate a second\none, and a third and ....\n\nI doubt that they actually flew on missions, as I\'m certain they did "bad\nthings" to the gastrointestinal tract.  Compared to Space Food Sticks, Tang was\na gastronomic contribution to mankind.\n--\nDillon Pyron                      | The opinions expressed are those of the\nTI/DSEG Lewisville VAX Support    | sender unless otherwise stated.\n(214)462-3556 (when I\'m here)     |\n(214)492-4656 (when I\'m home)     |God gave us weather so we wouldn\'t complain\npyron@skndiv.dseg.ti.com          |about other things.\nPADI DM-54909                     |'

Getting word counts

In [22]:
from sklearn.feature_extraction.text import (
    HashingVectorizer,
    TfidfVectorizer,
    CountVectorizer,
)
In [23]:
vectorizer = CountVectorizer()
In [24]:
idx = np.nonzero(
    newsgroups_train.target ==
    newsgroups_train.target_names.index('rec.sport.baseball')
)[0]
baseball_sample = [newsgroups_train.data[i] for i in idx]
In [25]:
X = vectorizer.fit_transform(baseball_sample)
In [26]:
vocab = vectorizer.get_feature_names()
In [27]:
rownames = [':'.join(filename.split('/')[-2:])
            for filename in newsgroups_train.filenames[idx]]
df = pd.SparseDataFrame(X, columns=vocab, index=rownames)
In [28]:
freqs = df.sum(axis=0).astype('int')
In [29]:
freqs.nlargest(10)
Out[29]:
the     3508
to      1481
and     1312
of      1142
in      1114
that     882
is       842
he       738
for      580
it       543
dtype: int64

Distribution of word counts

In [30]:
sns.distplot(freqs, kde=False)
pass
[Figure: histogram of word frequencies in the baseball sample (_images/S17A_Text_Analysis_1_40_1.png)]

Zipf’s law

Zipf's law says that the frequency of a word is roughly inversely proportional to its rank in the frequency table: \(f(r) \propto r^{-s}\) with \(s \approx 1\). Equivalently, the number of distinct words that occur with frequency \(f\) follows a power law distribution

\[p(f) = \alpha f^{-(1 + 1/s)}\]

Power law relationships look linear on a log-log plot, so we plot log frequency against log rank.

In [31]:
xs = freqs.sort_values(ascending=False).reset_index(drop=True)
plt.loglog(xs.index + 1, xs)
plt.xlabel('Log(Rank)')
plt.ylabel('Log(Frequency)')
plt.title("Zipf's law")
pass
[Figure: log-log plot of word frequency against rank, illustrating Zipf's law (_images/S17A_Text_Analysis_1_42_0.png)]
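To get a rough estimate of the exponent \(s\), we can fit a straight line to the log-log plot by least squares. This is a minimal sketch (not part of the original notebook; the estimate is crude, since the tails of the plot deviate from a pure power law):

log_rank = np.log(xs.index + 1)
log_freq = np.log(xs.values)
slope, intercept = np.polyfit(log_rank, log_freq, 1)
s_hat = -slope  # estimated Zipf exponent
s_hat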

Stop words, lemmatization and stemming

We can reduce the size of the vocabulary using simple strategies: removing stop words, stemming, and lemmatization.

Stop words

The most common words are not very informative, and we may wish to remove them. There are other ways to handle this (e.g. down-weighting common words with a TF-IDF vectorizer), but we will simply remove stop words in this section.

In [32]:
vectorizer = CountVectorizer(stop_words='english')
In [33]:
idx = np.nonzero(
    newsgroups_train.target ==
    newsgroups_train.target_names.index('rec.sport.baseball')
)[0]
baseball_sample = [newsgroups_train.data[i] for i in idx]
In [34]:
X = vectorizer.fit_transform(baseball_sample)
In [35]:
vocab = vectorizer.get_feature_names()
In [36]:
rownames = [':'.join(filename.split('/')[-2:])
            for filename in newsgroups_train.filenames[idx]]
df = pd.SparseDataFrame(X, columns=vocab, index=rownames)
In [37]:
freqs = df.sum(axis=0).astype('int')

We will also drop numbers.

In [38]:
freqs = freqs[~freqs.index.str.isnumeric()]

Now the most common words are more informative.

In [39]:
freqs.nlargest(15)
Out[39]:
year        310
game        204
good        200
team        195
think       189
don         186
just        161
like        153
games       149
better      140
baseball    137
hit         137
runs        137
players     135
time        131
dtype: int64

Stemming

Stemming is the attempt to identify the common roots of words using prefix and suffix rules.

In [40]:
def tokenize(text):
    stem = SnowballStemmer('english')
    text = text.lower()

    for token in nltk.word_tokenize(text):
        if token in string.punctuation:
            continue
        yield stem.stem(token)
In [41]:
text = '''circle circles circular circularity
circumference circumscribe circumstantial
infer inference inferences inferential'''
In [42]:
list(tokenize(text))
Out[42]:
['circl',
 'circl',
 'circular',
 'circular',
 'circumfer',
 'circumscrib',
 'circumstanti',
 'infer',
 'infer',
 'infer',
 'inferenti']

Lemmatization

Lemmatization also attempts to identify the common roots of words, but uses dictionary lookup to do so. Lemmatization often gives better results than stemming, but is slower.

In [43]:
def tokenize(text):
    lem = WordNetLemmatizer()
    text = text.lower()

    for token in nltk.word_tokenize(text):
        if token in string.punctuation:
            continue
        yield lem.lemmatize(token)
In [44]:
list(tokenize(text))
Out[44]:
['circle',
 'circle',
 'circular',
 'circularity',
 'circumference',
 'circumscribe',
 'circumstantial',
 'infer',
 'inference',
 'inference',
 'inferential']
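A caveat: WordNetLemmatizer treats each token as a noun unless a part-of-speech tag is supplied, so verb forms are often left unchanged. A small illustration (not part of the original notebook):

lem = WordNetLemmatizer()
lem.lemmatize('running')           # 'running' (treated as a noun)
lem.lemmatize('running', pos='v')  # 'run'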

Word cloud

In [45]:
from wordcloud import WordCloud
In [46]:
wordcloud = WordCloud().generate(' '.join(freqs.nlargest(200).index))
pass
In [47]:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
pass
[Figure: word cloud of the 200 most frequent words (_images/S17A_Text_Analysis_1_65_0.png)]
In [48]:
from imageio import imread
In [49]:
rabbit = imread('data/rabbit.png').astype('ubyte')
In [50]:
wc = WordCloud(mask=rabbit[:,:,0],
               mode='RGBA',
               background_color=None)
wc.generate(' '.join(freqs.nlargest(200).index))
pass
In [51]:
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
pass
[Figure: word cloud masked by the rabbit image (_images/S17A_Text_Analysis_1_69_0.png)]

Supervised Learning

A general framework for supervised learning on text is

construct corpus \(\to\) vectorization of features \(\to\) classification \(\to\) evaluation (often by cross-validation)

For example, we may classify documents into topics, or by sentiment, or as spam/not spam.
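These pieces can be chained together with a scikit-learn Pipeline. Here is a minimal sketch using the newsgroup subset loaded above (the choice of vectorizer and classifier is illustrative, not prescriptive):

from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

clf = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
scores = cross_val_score(clf, newsgroups_train.data, newsgroups_train.target, cv=5)
scores.mean()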

Vectorization of features

There are several common ways to vectorize features when the text is treated as a bag of words: word counts, feature hashing, one-hot encoding, and TF-IDF. Each is illustrated below.

In [52]:
small_sample = """Do you like green eggs and ham?
I do not like them, Sam-I-am.
I do not like green eggs and ham!
Would you like them here or there?
I would not like them here or there.
I would not like them anywhere.
I do so like green eggs and ham!
Thank you! Thank you,
Sam-I-am!""".splitlines()
In [53]:
small_sample
Out[53]:
['Do you like green eggs and ham?',
 'I do not like them, Sam-I-am.',
 'I do not like green eggs and ham!',
 'Would you like them here or there?',
 'I would not like them here or there.',
 'I would not like them anywhere.',
 'I do so like green eggs and ham!',
 'Thank you! Thank you,',
 'Sam-I-am!']

Word counts

In [54]:
count_vectorizer = CountVectorizer()
In [55]:
X = count_vectorizer.fit_transform(small_sample)
In [56]:
vocab = count_vectorizer.get_feature_names()
df = pd.SparseDataFrame(X, columns=vocab)
df.fillna(0).iloc[:, :10]
Out[56]:
am and anywhere do eggs green ham here like not
0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0
1 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0
2 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
5 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
6 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Hashing

If the number of distinct words is too large, we can hash words into a fixed number of buckets to keep the computations tractable. However, we lose the ability to map a bucket back to the original tokens. Note that by default HashingVectorizer also applies an alternating sign to hashed features and normalizes each row, so the entries below are not raw counts and can be negative.

In [57]:
hash_vectorizer = HashingVectorizer(n_features=5)
In [58]:
X = hash_vectorizer.fit_transform(small_sample)
In [59]:
X.toarray()
Out[59]:
array([[-0.60302269,  0.30151134,  0.30151134,  0.30151134,  0.60302269],
       [ 0.5       ,  0.5       ,  0.        ,  0.5       , -0.5       ],
       [-0.60302269,  0.30151134,  0.30151134,  0.30151134,  0.60302269],
       [-0.57735027,  0.        ,  0.57735027,  0.        , -0.57735027],
       [-0.57735027,  0.        ,  0.57735027,  0.        , -0.57735027],
       [-0.90453403,  0.        ,  0.        ,  0.30151134, -0.30151134],
       [-0.60302269,  0.30151134,  0.30151134,  0.30151134,  0.60302269],
       [-0.70710678,  0.        ,  0.        ,  0.70710678,  0.        ],
       [ 1.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

One hot encoding

One hot encoding simply sets words with non-zero counts to 1.

In [60]:
one_hot_vectorizer = CountVectorizer(binary=True)
In [61]:
X = one_hot_vectorizer.fit_transform(small_sample)
In [62]:
vocab = one_hot_vectorizer.get_feature_names()
df = pd.SparseDataFrame(X, columns=vocab)
df.fillna(0).iloc[:, :10]
Out[62]:
am and anywhere do eggs green ham here like not
0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0
1 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0
2 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
5 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
6 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

TF-IDF

You have previously implemented this in your homework.
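As a reminder, for a corpus of \(n\) documents, the TF-IDF weight of term \(t\) in document \(d\) multiplies the term count by the inverse document frequency. With scikit-learn's default smoothing this is

\[\text{tf-idf}(t, d) = \text{tf}(t, d) \times \left(\log \frac{1 + n}{1 + \text{df}(t)} + 1\right)\]

where \(\text{df}(t)\) is the number of documents containing \(t\), and each document vector is then scaled to unit Euclidean length.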

In [63]:
tf_idf_vectorizer = TfidfVectorizer()
In [64]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
In [65]:
X = tf_idf_vectorizer.fit_transform(small_sample)
In [66]:
vocab = tf_idf_vectorizer.get_feature_names()
df = pd.SparseDataFrame(X, columns=vocab)
df.fillna(0).iloc[:, :10]
Out[66]:
am and anywhere do eggs green ham here like not
0 0.000000 0.401996 0.00000 0.355186 0.401996 0.401996 0.401996 0.000000 0.256589 0.000000
1 0.495165 0.000000 0.00000 0.380398 0.000000 0.000000 0.000000 0.000000 0.274803 0.380398
2 0.000000 0.409316 0.00000 0.361653 0.409316 0.409316 0.409316 0.000000 0.261261 0.361653
3 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.429929 0.238598 0.000000
4 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.436672 0.242341 0.335463
5 0.000000 0.000000 0.62005 0.000000 0.000000 0.000000 0.000000 0.000000 0.290641 0.402322
6 0.000000 0.376827 0.00000 0.332947 0.376827 0.376827 0.376827 0.000000 0.240523 0.000000
7 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
8 0.707107 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

Maintaining context

For some supervised learning tasks, such as sentiment analysis (is this review positive or negative?), the context of words is very important. For example, the following two reviews use very similar words but have very different meanings.

  • Only an idiot like Reviewer two could love that movie
  • Could not love that movie more. Reviewer one is an idiot

In this case, we need to take the context of individual words into account. Common ways to do this include the use of N-grams (also known as collocations), part-of-speech (POS) tagging and grammars, and the word2vec family of algorithms.

N-grams

In [67]:
count_vectorizer = CountVectorizer(ngram_range=(1,3))
In [68]:
X = count_vectorizer.fit_transform(small_sample)
In [69]:
vocab = count_vectorizer.get_feature_names()
df = pd.SparseDataFrame(X, columns=vocab)
df.fillna(0).iloc[:, :10]
Out[69]:
'am'  'and'  'and ham'  'anywhere'  'do'  'do not'  'do not like'  'do so'  'do so like'  'do you'
0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
1 1.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0
2 0.0 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Significant collocation

Most n-grams are not meaningful phrases. We can use statistical tests for the likelihood of co-occurrence of words and keep only the significant collocations. Essentially, we test against the null hypothesis that the words in the n-gram co-occur by chance, with each word's probability derived independently from its empirical frequency.

In [70]:
abstract = '''Macrophages represent one of the most numerous and diverse
leukocyte types in the body. Furthermore, they are important regulators
and promoters of many cardiovascular disease programs. Their functions
range from sensing pathogens to digesting cell debris, modulating inflammation,
and producing key cytokines and other regulatory factors throughout the body.
Macrophage research has undergone a renaissance in recent years, which
has propelled a newfound interest in their heterogeneity as well as a
new understanding of ontological differences in their development.
In addition, recent technological advances such as single-cell
mass-cytometry by time-of-flight have enabled phenotype and functional
analyses of individual immune myeloid cells, including macrophages,
at unprecedented resolution. In this Part 1 of a 4-part review series
covering the macrophage in cardiovascular disease, we focus on the
basic principles of macrophage development, heterogeneity, phenotype,
tissue-specific differentiation, and functionality as a basis to understand
their role in cardiovascular disease.'''
In [71]:
ngrams = TrigramCollocationFinder.from_words(nltk.tokenize.word_tokenize(abstract))
In [72]:
scores = ngrams.score_ngrams(TrigramAssocMeasures.likelihood_ratio)
In [73]:
scores[:5]
Out[73]:
[(('in', 'cardiovascular', 'disease'), 60.22140084295821),
 (('cardiovascular', 'disease', 'programs'), 57.490270384342544),
 (('many', 'cardiovascular', 'disease'), 57.490270384342544),
 (('cardiovascular', 'disease', '.'), 49.568274269761346),
 (('cardiovascular', 'disease', ','), 47.586079738744886)]
In [74]:
scores[-5:]
Out[74]:
[(('development', ',', 'heterogeneity'), 18.377430413805826),
 (('heterogeneity', ',', 'phenotype'), 18.377430413805826),
 (('the', 'macrophage', 'in'), 17.35538066534174),
 ((',', 'heterogeneity', ','), 12.326088385780718),
 ((',', 'phenotype', ','), 12.326088385780718)]
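Punctuation-heavy trigrams such as the last two above are rarely interesting. One possible refinement (a sketch, not part of the original analysis) is to drop n-grams containing punctuation before ranking, and keep only the top-scoring collocations:

ngrams.apply_word_filter(lambda w: w in string.punctuation)
ngrams.nbest(TrigramAssocMeasures.likelihood_ratio, 5)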

Part-of-speech tagging

The regular expression grammar used below for keyphrase (KP) chunking is adapted from a blog post.

In [75]:
nltk.help.upenn_tagset()
$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
JJR: adjective, comparative
    bleaker braver breezier briefer brighter brisker broader bumper busier
    calmer cheaper choosier cleaner clearer closer colder commoner costlier
    cozier creamier crunchier cuter ...
JJS: adjective, superlative
    calmest cheapest choicest classiest cleanest clearest closest commonest
    corniest costliest crassest creepiest crudest cutest darkest deadliest
    dearest deepest densest dinkiest ...
LS: list item marker
    A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
    SP-44007 Second Third Three Two * a b c d first five four one six three
    two
MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
PDT: pre-determiner
    all both half many quite such sure this
POS: genitive marker
    ' 's
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
PRP$: pronoun, possessive
    her his mine my our ours their thy your
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
RBR: adverb, comparative
    further gloomier grander graver greater grimmer harder harsher
    healthier heavier higher however larger later leaner lengthier less-
    perfectly lesser lonelier longer louder lower more ...
RBS: adverb, superlative
    best biggest bluntest earliest farthest first furthest hardest
    heartiest highest largest least less most nearest second tightest worst
RP: particle
    aboard about across along apart around aside at away back before behind
    by crop down ever fast for forth from go high i.e. in into just later
    low more off on open out over per pie raising start teeth that through
    under unto up up-pp upon whole with you
SYM: symbol
    % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
TO: "to" as preposition or infinitive marker
    to
UH: interjection
    Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
    huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
    man baby diddle hush sonuvabitch ...
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...
WDT: WH-determiner
    that what whatever which whichever
WP: WH-pronoun
    that what whatever whatsoever which who whom whosoever
WP$: WH-pronoun, possessive
    whose
WRB: Wh-adverb
    how however whence whenever where whereby whereever wherein whereof why
``: opening quotation mark
    ` ``

Using a paragraph from Wikipedia.

In [76]:
nobel = "Born in Stockholm, Alfred Nobel was the third son of Immanuel Nobel (1801–1872), an inventor and engineer, and Carolina Andriette (Ahlsell) Nobel (1805–1889).The couple married in 1827 and had eight children. The family was impoverished, and only Alfred and his three brothers survived past childhood. Through his father, Alfred Nobel was a descendant of the Swedish scientist Olaus Rudbeck (1630–1702),and in his turn the boy was interested in engineering, particularly explosives, learning the basic principles from his father at a young age. Alfred Nobel's interest in technology was inherited from his father, an alumnus of Royal Institute of Technology in Stockholm."
In [77]:
nobel
Out[77]:
"Born in Stockholm, Alfred Nobel was the third son of Immanuel Nobel (1801–1872), an inventor and engineer, and Carolina Andriette (Ahlsell) Nobel (1805–1889).The couple married in 1827 and had eight children. The family was impoverished, and only Alfred and his three brothers survived past childhood. Through his father, Alfred Nobel was a descendant of the Swedish scientist Olaus Rudbeck (1630–1702),and in his turn the boy was interested in engineering, particularly explosives, learning the basic principles from his father at a young age. Alfred Nobel's interest in technology was inherited from his father, an alumnus of Royal Institute of Technology in Stockholm."
In [78]:
text = nltk.word_tokenize(nobel)
In [79]:
pos = nltk.pos_tag(text)
In [80]:
pos[:32]
Out[80]:
[('Born', 'VBN'),
 ('in', 'IN'),
 ('Stockholm', 'NNP'),
 (',', ','),
 ('Alfred', 'NNP'),
 ('Nobel', 'NNP'),
 ('was', 'VBD'),
 ('the', 'DT'),
 ('third', 'JJ'),
 ('son', 'NN'),
 ('of', 'IN'),
 ('Immanuel', 'NNP'),
 ('Nobel', 'NNP'),
 ('(', '('),
 ('1801–1872', 'CD'),
 (')', ')'),
 (',', ','),
 ('an', 'DT'),
 ('inventor', 'NN'),
 ('and', 'CC'),
 ('engineer', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('Carolina', 'NNP'),
 ('Andriette', 'NNP'),
 ('(', '('),
 ('Ahlsell', 'NNP'),
 (')', ')'),
 ('Nobel', 'NNP'),
 ('(', '('),
 ('1805–1889', 'CD'),
 (')', ')')]
In [81]:
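# KP (key phrase): an optional group of adjectives + nouns ending in a preposition,
# followed by zero or more adjectives and one or more nouns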
grammar = 'KP: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
In [82]:
chunker = nltk.RegexpParser(grammar)
In [83]:
tree = chunker.parse(pos[:32])
In [84]:
tree
Out[84]:
[Figure: chunk parse tree with KP (keyphrase) subtrees (_images/S17A_Text_Analysis_1_114_0.png)]
In [85]:
tree.collapse_unary
Out[85]:
<bound method Tree.collapse_unary of Tree('S', [('Born', 'VBN'), ('in', 'IN'), Tree('KP', [('Stockholm', 'NNP')]), (',', ','), Tree('KP', [('Alfred', 'NNP'), ('Nobel', 'NNP')]), ('was', 'VBD'), ('the', 'DT'), Tree('KP', [('third', 'JJ'), ('son', 'NN'), ('of', 'IN'), ('Immanuel', 'NNP'), ('Nobel', 'NNP')]), ('(', '('), ('1801–1872', 'CD'), (')', ')'), (',', ','), ('an', 'DT'), Tree('KP', [('inventor', 'NN')]), ('and', 'CC'), Tree('KP', [('engineer', 'NN')]), (',', ','), ('and', 'CC'), Tree('KP', [('Carolina', 'NNP'), ('Andriette', 'NNP')]), ('(', '('), Tree('KP', [('Ahlsell', 'NNP')]), (')', ')'), Tree('KP', [('Nobel', 'NNP')]), ('(', '('), ('1805–1889', 'CD'), (')', ')')])>
In [86]:
import itertools
In [87]:
kps = [ ]
for key, group in itertools.groupby(nltk.tree2conlltags(tree), lambda x: x[-1]):
    if key != 'O':
        phrase = []
        for word, pos, cls in group:
            phrase.append(word)
        kps.append(' '.join(phrase))
kps
Out[87]:
['Stockholm',
 'Alfred',
 'Nobel',
 'third',
 'son of Immanuel Nobel',
 'inventor',
 'engineer',
 'Carolina',
 'Andriette',
 'Ahlsell',
 'Nobel']

Finding named entities

We use a pre-trained model from spacy. See the spaCy documentation if you want to train a model on your own corpus or extend the pre-trained model.

The default model is not perfect, but may be good enough for your needs.

In [88]:
import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()
In [89]:
doc = nlp(nobel)
In [90]:
print([(X, X.ent_iob_, X.ent_type_) for X in doc])
[(Born, 'O', ''), (in, 'O', ''), (Stockholm, 'B', 'GPE'), (,, 'O', ''), (Alfred, 'B', 'PERSON'), (Nobel, 'I', 'PERSON'), (was, 'O', ''), (the, 'O', ''), (third, 'B', 'ORDINAL'), (son, 'O', ''), (of, 'O', ''), (Immanuel, 'B', 'PERSON'), (Nobel, 'I', 'PERSON'), ((, 'O', ''), (1801–1872, 'B', 'CARDINAL'), (), 'O', ''), (,, 'O', ''), (an, 'O', ''), (inventor, 'O', ''), (and, 'O', ''), (engineer, 'O', ''), (,, 'O', ''), (and, 'O', ''), (Carolina, 'B', 'PERSON'), (Andriette, 'I', 'PERSON'), ((, 'O', ''), (Ahlsell, 'O', ''), (), 'O', ''), (Nobel, 'B', 'WORK_OF_ART'), ((, 'O', ''), (1805–1889).The, 'O', ''), (couple, 'O', ''), (married, 'O', ''), (in, 'O', ''), (1827, 'B', 'DATE'), (and, 'O', ''), (had, 'O', ''), (eight, 'B', 'CARDINAL'), (children, 'O', ''), (., 'O', ''), (The, 'O', ''), (family, 'O', ''), (was, 'O', ''), (impoverished, 'O', ''), (,, 'O', ''), (and, 'O', ''), (only, 'O', ''), (Alfred, 'B', 'PERSON'), (and, 'O', ''), (his, 'O', ''), (three, 'B', 'CARDINAL'), (brothers, 'O', ''), (survived, 'O', ''), (past, 'O', ''), (childhood, 'O', ''), (., 'O', ''), (Through, 'O', ''), (his, 'O', ''), (father, 'O', ''), (,, 'O', ''), (Alfred, 'B', 'PERSON'), (Nobel, 'I', 'PERSON'), (was, 'O', ''), (a, 'O', ''), (descendant, 'O', ''), (of, 'O', ''), (the, 'O', ''), (Swedish, 'B', 'NORP'), (scientist, 'O', ''), (Olaus, 'B', 'PERSON'), (Rudbeck, 'I', 'PERSON'), ((, 'O', ''), (1630–1702),and, 'B', 'LOC'), (in, 'O', ''), (his, 'O', ''), (turn, 'O', ''), (the, 'O', ''), (boy, 'O', ''), (was, 'O', ''), (interested, 'O', ''), (in, 'O', ''), (engineering, 'O', ''), (,, 'O', ''), (particularly, 'O', ''), (explosives, 'O', ''), (,, 'O', ''), (learning, 'O', ''), (the, 'O', ''), (basic, 'O', ''), (principles, 'O', ''), (from, 'O', ''), (his, 'O', ''), (father, 'O', ''), (at, 'O', ''), (a, 'O', ''), (young, 'O', ''), (age, 'O', ''), (., 'O', ''), (Alfred, 'B', 'PERSON'), (Nobel, 'I', 'PERSON'), ('s, 'I', 'PERSON'), (interest, 'O', ''), (in, 'O', ''), (technology, 'O', ''), (was, 'O', ''), (inherited, 'O', ''), (from, 'O', ''), (his, 'O', ''), (father, 'O', ''), (,, 'O', ''), (an, 'O', ''), (alumnus, 'O', ''), (of, 'O', ''), (Royal, 'B', 'ORG'), (Institute, 'I', 'ORG'), (of, 'I', 'ORG'), (Technology, 'I', 'ORG'), (in, 'O', ''), (Stockholm, 'B', 'GPE'), (., 'O', '')]
In [91]:
displacy.render(doc, jupyter=True, style='ent')
Born in Stockholm GPE , Alfred Nobel PERSON was the third ORDINAL son of Immanuel Nobel PERSON ( 1801–1872 CARDINAL ), an inventor and engineer, and Carolina Andriette PERSON (Ahlsell) Nobel WORK_OF_ART (1805–1889).The couple married in 1827 DATE and had eight CARDINAL children. The family was impoverished, and only Alfred PERSON and his three CARDINAL brothers survived past childhood. Through his father, Alfred Nobel PERSON was a descendant of the Swedish NORP scientist Olaus Rudbeck PERSON ( 1630–1702),and LOC in his turn the boy was interested in engineering, particularly explosives, learning the basic principles from his father at a young age. Alfred Nobel's PERSON interest in technology was inherited from his father, an alumnus of Royal Institute of Technology ORG in Stockholm GPE .
In [92]:
for entity in doc.ents:
    if entity.label_ == 'PERSON':
        print(entity)
Alfred Nobel
Immanuel Nobel
Carolina Andriette
Alfred
Alfred Nobel
Olaus Rudbeck
Alfred Nobel's
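If an entity label is unfamiliar, spacy.explain returns a short gloss for it (a quick check; the exact wording may vary by spaCy version):

spacy.explain('GPE')    # countries, cities, states
spacy.explain('NORP')   # nationalities or religious or political groups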