Data¶
Libraries for Data Processing¶
In [1]:
%matplotlib inline
String processing and utility libraries
In [2]:
import os
import glob
import string
import re
import imageio
Libraries for functional programming
In [3]:
import operator as op
import itertools as it
from functools import reduce, partial
import toolz as tz
import toolz.curried as c
Libraries for numerical programming
In [4]:
import numpy as np
import pandas as pd
from scipy import (stats, sparse, linalg,
spatial, integrate, optimize, io)
Libraries for plotting and visualization
In [5]:
import matplotlib.pyplot as plt
import seaborn as sns
Working with unstructured text data¶
Categorical and one-hot encoding¶
See examples from S05_Text notebook.
Term frequency - inverse document frequency (tf-idf)¶
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes; 83% of text-based recommender systems in the domain of digital libraries use tf-idf.
Source: Wikipedia
Documents¶
In [6]:
paths = glob.glob(os.path.join('data', 'Gutenberg', '*.txt') )
names = [os.path.splitext(os.path.split(path)[-1])[0] for path in paths]
names
Out[6]:
['blake-poems',
'carroll-alice',
'shakespeare-caesar',
'whitman-leaves',
'milton-paradise',
'bible-kjv',
'austen-persuasion',
'melville-moby_dick',
'edgeworth-parents',
'chesterton-thursday',
'burgess-busterbrown',
'chesterton-ball',
'austen-emma',
'chesterton-brown',
'shakespeare-hamlet',
'austen-sense',
'shakespeare-macbeth',
'bryant-stories']
In [7]:
N = len(names)
Simple processing to find words in each document¶
Standard Python idiom.
In [8]:
doc_terms = []
for path in paths:
with open(path, encoding='latin-1') as f:
text = f.read()
text = text.lower()
text = text.translate(str.maketrans('', '', string.punctuation))
words = text.split()
doc_terms.append(words)
Using a functional idiom
In [9]:
doc_terms = tz.pipe(
paths,
c.map(partial(open, encoding='latin-1')),
c.map(lambda x: x.read()),
c.map(lambda x: x.lower()),
c.map(lambda x: x.translate(str.maketrans('', '', string.punctuation))),
c.map(lambda x: x.split()),
list
)
Term frequency¶
This is just a word count for each document.
In [10]:
tf = {name: tz.frequencies(doc)
for name, doc in zip(names, doc_terms)}
Document frequency¶
This is how many documents a term appears in.
In [11]:
df = tz.frequencies(tz.concat(d.keys() for d in tf.values()))
Inverse document frequency¶
This weights each term by the inverse frequency of its appearance across different documents, and reduces the important of common words like “the”.
In [12]:
idf = {term: np.log((1 + N)/(1 + count)) for term, count in df.items()}
Term frequency - inverse document frequency¶
This is just the product of tf and idf, and a measure of the importance of each term in a document.
In [13]:
terms = list(tz.unique(tz.concat(doc_terms)))
In [14]:
tf_idf = {}
for name, doc in tf.items():
d = {}
for term in terms:
d[term] = doc.get(term, 0) * idf[term]
tf_idf[name] = d
Convert to a structured data type¶
In [15]:
tf_idf = pd.DataFrame(tf_idf)
In [16]:
tf_idf.iloc[:5, :5]
Out[16]:
austen-emma | austen-persuasion | austen-sense | bible-kjv | blake-poems | |
---|---|---|---|---|---|
00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
00021053 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
00081429 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
00482129 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Distinctive words in each document¶
These are words that appear frequently in that document but not in others.
In [17]:
pd.DataFrame({doc: series.sort_values(ascending=False).index[:5]
for doc, series in tf_idf.items()}).T
Out[17]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
austen-emma | emma | harriet | weston | knightley | elton |
austen-persuasion | elliot | anne | wentworth | musgrove | russell |
austen-sense | elinor | marianne | dashwood | jennings | mrs |
bible-kjv | unto | israel | saith | thee | thou |
blake-poems | thel | weep | thee | lyca | vales |
bryant-stories | jackal | margery | brahmin | nightingale | epaminondas |
burgess-busterbrown | buster | browns | joe | blacky | billy |
carroll-alice | alice | gryphon | dormouse | duchess | hatter |
chesterton-ball | turnbull | macian | evan | turnbulls | have |
chesterton-brown | flambeau | boulnois | muscari | brown | fanshaw |
chesterton-thursday | syme | gregory | professor | marquis | secretary |
edgeworth-parents | susan | cecilia | piedro | archer | mr |
melville-moby_dick | whale | ahab | sperm | whales | stubb |
milton-paradise | thee | thou | heaven | thy | eve |
shakespeare-caesar | bru | brutus | cassi | haue | cassius |
shakespeare-hamlet | ham | haue | hor | qu | laer |
shakespeare-macbeth | macb | haue | macbeth | macd | rosse |
whitman-leaves | o | thee | poems | states | pioneers |
Why is “macb” the top hit for Macbeth?¶
In [18]:
tf['shakespeare-macbeth']['macb']
Out[18]:
137
In [19]:
text = open('data/gutenberg/shakespeare-macbeth.txt', encoding='latin-1').read()
re.findall('macb[^e].*\n', text, re.IGNORECASE)[:10]
Out[19]:
['Macb. So foule and faire a day I haue not seene\n',
'Macb. Stay you imperfect Speakers, tell me more:\n',
"Macb. Into the Ayre: and what seem'd corporall,\n",
'Macb. Your Children shall be Kings\n',
'Macb. And Thane of Cawdor too: went it not so?\n',
'Macb. The Thane of Cawdor liues:\n',
'Macb. Glamys, and Thane of Cawdor:\n',
'Macb. Two Truths are told,\n',
'Macb. If Chance will haue me King,\n',
'Macb. Come what come may,\n']
Image data¶
In [20]:
paths = glob.glob(os.path.join('data', 'POKEMON', '*.png') )[:3]
imgs = np.array([imageio.imread(path) for path in paths])
In [21]:
imgs.shape
Out[21]:
(3, 215, 215, 4)
In [22]:
fig, axes = plt.subplots(1,3,figsize=(10,3))
for ax, img in zip(axes, imgs):
ax.imshow(img)
ax.set_xticks([])
ax.set_yticks([])
Using pandas
for data munging¶
In [23]:
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
In [24]:
tips = pd.read_csv(url)
Inspecting a data frame¶
In [25]:
tips.shape
Out[25]:
(244, 7)
In [26]:
tips.dtypes
Out[26]:
total_bill float64
tip float64
sex object
smoker object
day object
time object
size int64
dtype: object
In [27]:
tips.columns
Out[27]:
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')
In [28]:
tips.index
Out[28]:
RangeIndex(start=0, stop=244, step=1)
In [29]:
tips.describe()
Out[29]:
total_bill | tip | size | |
---|---|---|---|
count | 244.000000 | 244.000000 | 244.000000 |
mean | 19.785943 | 2.998279 | 2.569672 |
std | 8.902412 | 1.383638 | 0.951100 |
min | 3.070000 | 1.000000 | 1.000000 |
25% | 13.347500 | 2.000000 | 2.000000 |
50% | 17.795000 | 2.900000 | 2.000000 |
75% | 24.127500 | 3.562500 | 3.000000 |
max | 50.810000 | 10.000000 | 6.000000 |
In [30]:
tips.head(3)
Out[30]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
In [31]:
tips.tail(3)
Out[31]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
In [32]:
tips.sample(3)
Out[32]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
132 | 11.17 | 1.50 | Female | No | Thur | Lunch | 2 |
144 | 16.43 | 2.30 | Female | No | Thur | Lunch | 2 |
55 | 19.49 | 3.51 | Male | No | Sun | Dinner | 2 |
Series¶
A column is a pandas Series object. It behaves like an indexed vector.
In [33]:
tips['sex'].head()
Out[33]:
0 Female
1 Male
2 Male
3 Male
4 Female
Name: sex, dtype: object
Alternative way to get Series when column name meets requirements for Python variable (i.e. no spaces or punctuation)
In [34]:
tips.sex.head()
Out[34]:
0 Female
1 Male
2 Male
3 Male
4 Female
Name: sex, dtype: object
Series types¶
String operations¶
In [35]:
tips.sex.str.lower().str[0].head()
Out[35]:
0 f
1 m
2 m
3 m
4 f
Name: sex, dtype: object
Categorical data types¶
In [36]:
tips.day.unique()
Out[36]:
array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)
In [37]:
tips['day'] = tips.day.astype('category')
In [38]:
tips.day.head(3)
Out[38]:
0 Sun
1 Sun
2 Sun
Name: day, dtype: category
Categories (4, object): [Fri, Sat, Sun, Thur]
In [39]:
tips.day.cat.reorder_categories(['Thur', 'Fri', 'Sat', 'Sun'], ordered=True, inplace=True)
In [40]:
tips.day.head(3)
Out[40]:
0 Sun
1 Sun
2 Sun
Name: day, dtype: category
Categories (4, object): [Thur < Fri < Sat < Sun]
Datetime operations¶
In [41]:
import pandas_datareader as pdr
data_source = 'google'
start_date = '2010-01-01'
end_date = '2016-12-31'
data = pdr.get_data_morningstar('MSFT', start_date, end_date)
In [42]:
data.head(3)
Out[42]:
Close | High | Low | Open | Volume | ||
---|---|---|---|---|---|---|
Symbol | Date | |||||
MSFT | 2010-01-01 | 30.48 | 30.99 | 30.48 | 30.98 | 0 |
2010-01-04 | 30.95 | 31.10 | 30.59 | 30.65 | 38414185 | |
2010-01-05 | 30.96 | 31.10 | 30.64 | 30.87 | 49758862 |
In [43]:
dates = data.index.get_level_values(1)
In [44]:
dates[:3]
Out[44]:
DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05'], dtype='datetime64[ns]', name='Date', freq=None)
In [45]:
list(it.islice(zip(dates.year,
dates.month,
dates.day), 6))
Out[45]:
[(2010, 1, 1),
(2010, 1, 4),
(2010, 1, 5),
(2010, 1, 6),
(2010, 1, 7),
(2010, 1, 8)]
In [46]:
msft = data.loc['MSFT']
In [47]:
msft.plot(y='High')
pass
In [48]:
msft.loc['2016-12-01':'2016-12-31', ('Low', 'High')].plot(
linestyle='dashed', marker='+')
pass
Indexing¶
In [49]:
tips[0:2]
Out[49]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
In [50]:
tips.loc[0:2]
Out[50]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
In [51]:
tips.iloc[0:2]
Out[51]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
In [52]:
tips.iloc[0:3, [2,3,4]]
Out[52]:
sex | smoker | day | |
---|---|---|---|
0 | Female | No | Sun |
1 | Male | No | Sun |
2 | Male | No | Sun |
In [53]:
tips.loc[0:2, ['tip', 'sex', 'size']]
Out[53]:
tip | sex | size | |
---|---|---|---|
0 | 1.01 | Female | 2 |
1 | 1.66 | Male | 3 |
2 | 3.50 | Male | 3 |
In [54]:
tips.loc[0:2, 'tip':'day']
Out[54]:
tip | sex | smoker | day | |
---|---|---|---|---|
0 | 1.01 | Female | No | Sun |
1 | 1.66 | Male | No | Sun |
2 | 3.50 | Male | No | Sun |
Boolean indexing¶
In [55]:
tips[tips.sex == 'Male'].head(3)
Out[55]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
In [56]:
tips[(tips.sex == 'Male') & (tips.time != 'Dinner')].head(3)
Out[56]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
77 | 27.20 | 4.00 | Male | No | Thur | Lunch | 4 |
78 | 22.76 | 3.00 | Male | No | Thur | Lunch | 2 |
79 | 17.29 | 2.71 | Male | No | Thur | Lunch | 2 |
Special selection¶
In [57]:
df = pd.DataFrame({
'a': [1,2,None,4,1],
'b': [3,3,None,None,3],
'c': [1,2,None,4,1]})
In [58]:
df
Out[58]:
a | b | c | |
---|---|---|---|
0 | 1.0 | 3.0 | 1.0 |
1 | 2.0 | 3.0 | 2.0 |
2 | NaN | NaN | NaN |
3 | 4.0 | NaN | 4.0 |
4 | 1.0 | 3.0 | 1.0 |
In [59]:
df.dropna()
Out[59]:
a | b | c | |
---|---|---|---|
0 | 1.0 | 3.0 | 1.0 |
1 | 2.0 | 3.0 | 2.0 |
4 | 1.0 | 3.0 | 1.0 |
In [60]:
df.dropna(how='all')
Out[60]:
a | b | c | |
---|---|---|---|
0 | 1.0 | 3.0 | 1.0 |
1 | 2.0 | 3.0 | 2.0 |
3 | 4.0 | NaN | 4.0 |
4 | 1.0 | 3.0 | 1.0 |
In [61]:
df.drop_duplicates()
Out[61]:
a | b | c | |
---|---|---|---|
0 | 1.0 | 3.0 | 1.0 |
1 | 2.0 | 3.0 | 2.0 |
2 | NaN | NaN | NaN |
3 | 4.0 | NaN | 4.0 |
In [62]:
df.drop_duplicates(keep='last')
Out[62]:
a | b | c | |
---|---|---|---|
1 | 2.0 | 3.0 | 2.0 |
2 | NaN | NaN | NaN |
3 | 4.0 | NaN | 4.0 |
4 | 1.0 | 3.0 | 1.0 |
Selecting by label¶
In [63]:
tips.filter(regex='^s.*').head(3)
Out[63]:
sex | smoker | size | |
---|---|---|---|
0 | Female | No | 2 |
1 | Male | No | 3 |
2 | Male | No | 3 |
Sorting¶
In [64]:
tips.sort_values(['size', 'tip'], ascending=[True, False]).head(3)
Out[64]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
222 | 8.58 | 1.92 | Male | Yes | Fri | Lunch | 1 |
82 | 10.07 | 1.83 | Female | No | Thur | Lunch | 1 |
67 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
Note that ordered categorical values are sorted appropriately.
In [65]:
tips.sort_values(['day']).head(3)
Out[65]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
121 | 13.42 | 1.68 | Female | No | Thur | Lunch | 2 |
133 | 12.26 | 2.00 | Female | No | Thur | Lunch | 2 |
132 | 11.17 | 1.50 | Female | No | Thur | Lunch | 2 |
Rearrange columns¶
In [66]:
tips = tips[tips.columns.sort_values()]
tips.head(3)
Out[66]:
day | sex | size | smoker | time | tip | total_bill | |
---|---|---|---|---|---|---|---|
0 | Sun | Female | 2 | No | Dinner | 1.01 | 16.99 |
1 | Sun | Male | 3 | No | Dinner | 1.66 | 10.34 |
2 | Sun | Male | 3 | No | Dinner | 3.50 | 21.01 |
In [67]:
tips = tips.filter('total_bill tip sex smoker day time size'.split())
tips.head(3)
Out[67]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
Transforms¶
In [68]:
tips['day_type'] = 'Weekday'
tips.loc[tips.day.isin(['Sat', 'Sun']), 'day_type'] = 'Weekend'
tips.head(3)
Out[68]:
total_bill | tip | sex | smoker | day | time | size | day_type | |
---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Weekend |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Weekend |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Weekend |
In [69]:
tips['cost'] = tips['total_bill'] + tips['tip']
tips.head(3)
Out[69]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Weekend | 18.00 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Weekend | 12.00 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Weekend | 24.51 |
In [70]:
tips.assign(log1p_cost = np.log1p(tips.cost)).head(3)
Out[70]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | log1p_cost | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Weekend | 18.00 | 2.944439 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Weekend | 12.00 | 2.564949 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Weekend | 24.51 | 3.239071 |
In [71]:
tips.replace({
'sex': dict(Femals='F', Male='M'),
'day': dict(Thur=4, Fri=5, Sat=6, Sun=7)}).head(3)
Out[71]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | 7 | Dinner | 2 | Weekend | 18.00 |
1 | 10.34 | 1.66 | M | No | 7 | Dinner | 3 | Weekend | 12.00 |
2 | 21.01 | 3.50 | M | No | 7 | Dinner | 3 | Weekend | 24.51 |
In [72]:
tips.rename({'sex': 'gender'}, axis=1).head(3)
Out[72]:
total_bill | tip | gender | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Weekend | 18.00 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Weekend | 12.00 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Weekend | 24.51 |
Transforming missing values¶
In [73]:
df
Out[73]:
a | b | c | |
---|---|---|---|
0 | 1.0 | 3.0 | 1.0 |
1 | 2.0 | 3.0 | 2.0 |
2 | NaN | NaN | NaN |
3 | 4.0 | NaN | 4.0 |
4 | 1.0 | 3.0 | 1.0 |
In [74]:
df.fillna(0)
Out[74]:
a | b | c | |
---|---|---|---|
0 | 1.0 | 3.0 | 1.0 |
1 | 2.0 | 3.0 | 2.0 |
2 | 0.0 | 0.0 | 0.0 |
3 | 4.0 | 0.0 | 4.0 |
4 | 1.0 | 3.0 | 1.0 |
In [75]:
df.fillna(df.mean())
Out[75]:
a | b | c | |
---|---|---|---|
0 | 1.0 | 3.0 | 1.0 |
1 | 2.0 | 3.0 | 2.0 |
2 | 2.0 | 3.0 | 2.0 |
3 | 4.0 | 3.0 | 4.0 |
4 | 1.0 | 3.0 | 1.0 |
In [76]:
df.fillna(method='ffill')
Out[76]:
a | b | c | |
---|---|---|---|
0 | 1.0 | 3.0 | 1.0 |
1 | 2.0 | 3.0 | 2.0 |
2 | 2.0 | 3.0 | 2.0 |
3 | 4.0 | 3.0 | 4.0 |
4 | 1.0 | 3.0 | 1.0 |
In [77]:
df.fillna(method='bfill')
Out[77]:
a | b | c | |
---|---|---|---|
0 | 1.0 | 3.0 | 1.0 |
1 | 2.0 | 3.0 | 2.0 |
2 | 4.0 | 3.0 | 4.0 |
3 | 4.0 | 3.0 | 4.0 |
4 | 1.0 | 3.0 | 1.0 |
Summaries¶
In [78]:
tips.mean()
Out[78]:
total_bill 19.785943
tip 2.998279
size 2.569672
cost 22.784221
dtype: float64
In [79]:
tips.std()
Out[79]:
total_bill 8.902412
tip 1.383638
size 0.951100
cost 9.890116
dtype: float64
In [80]:
tips.count()
Out[80]:
total_bill 244
tip 244
sex 244
smoker 244
day 244
time 244
size 244
day_type 244
cost 244
dtype: int64
Grouping¶
In [81]:
tips.groupby(['sex', 'day', 'time']).mean()
Out[81]:
total_bill | tip | size | cost | |||
---|---|---|---|---|---|---|
sex | day | time | ||||
Female | Thur | Dinner | 18.780000 | 3.000000 | 2.000000 | 21.780000 |
Lunch | 16.648710 | 2.561935 | 2.483871 | 19.210645 | ||
Fri | Dinner | 14.310000 | 2.810000 | 2.000000 | 17.120000 | |
Lunch | 13.940000 | 2.745000 | 2.250000 | 16.685000 | ||
Sat | Dinner | 19.680357 | 2.801786 | 2.250000 | 22.482143 | |
Lunch | NaN | NaN | NaN | NaN | ||
Sun | Dinner | 19.872222 | 3.367222 | 2.944444 | 23.239444 | |
Lunch | NaN | NaN | NaN | NaN | ||
Male | Thur | Dinner | NaN | NaN | NaN | NaN |
Lunch | 18.714667 | 2.980333 | 2.433333 | 21.695000 | ||
Fri | Dinner | 23.487143 | 3.032857 | 2.285714 | 26.520000 | |
Lunch | 11.386667 | 1.900000 | 1.666667 | 13.286667 | ||
Sat | Dinner | 20.802542 | 3.083898 | 2.644068 | 23.886441 | |
Lunch | NaN | NaN | NaN | NaN | ||
Sun | Dinner | 21.887241 | 3.220345 | 2.810345 | 25.107586 | |
Lunch | NaN | NaN | NaN | NaN |
In [82]:
tips.groupby(['sex']).agg(['mean', 'std', 'min', 'max'])
Out[82]:
total_bill | tip | size | cost | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | std | min | max | mean | std | min | max | mean | std | min | max | mean | std | min | max | |
sex | ||||||||||||||||
Female | 18.056897 | 8.009209 | 3.07 | 44.30 | 2.833448 | 1.159495 | 1.0 | 6.5 | 2.459770 | 0.937644 | 1 | 6 | 20.890345 | 8.841798 | 4.07 | 48.11 |
Male | 20.744076 | 9.246469 | 7.25 | 50.81 | 3.089618 | 1.489102 | 1.0 | 10.0 | 2.630573 | 0.955997 | 1 | 6 | 23.833694 | 10.303309 | 9.00 | 60.81 |
Working with hierarchical indexes¶
In [83]:
df = tips.groupby(['sex', 'day', 'time']).mean()
df.head()
Out[83]:
total_bill | tip | size | cost | |||
---|---|---|---|---|---|---|
sex | day | time | ||||
Female | Thur | Dinner | 18.780000 | 3.000000 | 2.000000 | 21.780000 |
Lunch | 16.648710 | 2.561935 | 2.483871 | 19.210645 | ||
Fri | Dinner | 14.310000 | 2.810000 | 2.000000 | 17.120000 | |
Lunch | 13.940000 | 2.745000 | 2.250000 | 16.685000 | ||
Sat | Dinner | 19.680357 | 2.801786 | 2.250000 | 22.482143 |
In [84]:
df.index
Out[84]:
MultiIndex(levels=[['Female', 'Male'], ['Thur', 'Fri', 'Sat', 'Sun'], ['Dinner', 'Lunch']],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=['sex', 'day', 'time'])
In [85]:
df.columns
Out[85]:
Index(['total_bill', 'tip', 'size', 'cost'], dtype='object')
In [86]:
df.loc[('Female')]
Out[86]:
total_bill | tip | size | cost | ||
---|---|---|---|---|---|
day | time | ||||
Thur | Dinner | 18.780000 | 3.000000 | 2.000000 | 21.780000 |
Lunch | 16.648710 | 2.561935 | 2.483871 | 19.210645 | |
Fri | Dinner | 14.310000 | 2.810000 | 2.000000 | 17.120000 |
Lunch | 13.940000 | 2.745000 | 2.250000 | 16.685000 | |
Sat | Dinner | 19.680357 | 2.801786 | 2.250000 | 22.482143 |
Lunch | NaN | NaN | NaN | NaN | |
Sun | Dinner | 19.872222 | 3.367222 | 2.944444 | 23.239444 |
Lunch | NaN | NaN | NaN | NaN |
In [87]:
df.loc[('Female', 'Sat')]
Out[87]:
total_bill | tip | size | cost | |
---|---|---|---|---|
time | ||||
Dinner | 19.680357 | 2.801786 | 2.25 | 22.482143 |
Lunch | NaN | NaN | NaN | NaN |
In [88]:
df.loc[('Female', slice(None), 'Dinner')]
Out[88]:
total_bill | tip | size | cost | |||
---|---|---|---|---|---|---|
sex | day | time | ||||
Female | Thur | Dinner | 18.780000 | 3.000000 | 2.000000 | 21.780000 |
Fri | Dinner | 14.310000 | 2.810000 | 2.000000 | 17.120000 | |
Sat | Dinner | 19.680357 | 2.801786 | 2.250000 | 22.482143 | |
Sun | Dinner | 19.872222 | 3.367222 | 2.944444 | 23.239444 |
Stacking and unstacking¶
In [89]:
tips.head(3)
Out[89]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Weekend | 18.00 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Weekend | 12.00 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Weekend | 24.51 |
In [90]:
df.unstack()['tip']
Out[90]:
time | Dinner | Lunch | |
---|---|---|---|
sex | day | ||
Female | Thur | 3.000000 | 2.561935 |
Fri | 2.810000 | 2.745000 | |
Sat | 2.801786 | NaN | |
Sun | 3.367222 | NaN | |
Male | Thur | NaN | 2.980333 |
Fri | 3.032857 | 1.900000 | |
Sat | 3.083898 | NaN | |
Sun | 3.220345 | NaN |
In [91]:
df.unstack(level=0)['tip']
Out[91]:
sex | Female | Male | |
---|---|---|---|
day | time | ||
Thur | Dinner | 3.000000 | NaN |
Lunch | 2.561935 | 2.980333 | |
Fri | Dinner | 2.810000 | 3.032857 |
Lunch | 2.745000 | 1.900000 | |
Sat | Dinner | 2.801786 | 3.083898 |
Lunch | NaN | NaN | |
Sun | Dinner | 3.367222 | 3.220345 |
Lunch | NaN | NaN |
In [92]:
df.unstack(level=[1,2])['tip']
Out[92]:
day | Thur | Fri | Sat | Sun | ||||
---|---|---|---|---|---|---|---|---|
time | Dinner | Lunch | Dinner | Lunch | Dinner | Lunch | Dinner | Lunch |
sex | ||||||||
Female | 3.0 | 2.561935 | 2.810000 | 2.745 | 2.801786 | NaN | 3.367222 | NaN |
Male | NaN | 2.980333 | 3.032857 | 1.900 | 3.083898 | NaN | 3.220345 | NaN |
In [93]:
df.unstack(level=[1,2])['tip'].stack(level=0)
Out[93]:
time | Dinner | Lunch | |
---|---|---|---|
sex | day | ||
Female | Thur | 3.000000 | 2.561935 |
Fri | 2.810000 | 2.745000 | |
Sat | 2.801786 | NaN | |
Sun | 3.367222 | NaN | |
Male | Thur | NaN | 2.980333 |
Fri | 3.032857 | 1.900000 | |
Sat | 3.083898 | NaN | |
Sun | 3.220345 | NaN |
Swapping levels¶
In [94]:
df.swaplevel(1,2)
Out[94]:
total_bill | tip | size | cost | |||
---|---|---|---|---|---|---|
sex | time | day | ||||
Female | Dinner | Thur | 18.780000 | 3.000000 | 2.000000 | 21.780000 |
Lunch | Thur | 16.648710 | 2.561935 | 2.483871 | 19.210645 | |
Dinner | Fri | 14.310000 | 2.810000 | 2.000000 | 17.120000 | |
Lunch | Fri | 13.940000 | 2.745000 | 2.250000 | 16.685000 | |
Dinner | Sat | 19.680357 | 2.801786 | 2.250000 | 22.482143 | |
Lunch | Sat | NaN | NaN | NaN | NaN | |
Dinner | Sun | 19.872222 | 3.367222 | 2.944444 | 23.239444 | |
Lunch | Sun | NaN | NaN | NaN | NaN | |
Male | Dinner | Thur | NaN | NaN | NaN | NaN |
Lunch | Thur | 18.714667 | 2.980333 | 2.433333 | 21.695000 | |
Dinner | Fri | 23.487143 | 3.032857 | 2.285714 | 26.520000 | |
Lunch | Fri | 11.386667 | 1.900000 | 1.666667 | 13.286667 | |
Dinner | Sat | 20.802542 | 3.083898 | 2.644068 | 23.886441 | |
Lunch | Sat | NaN | NaN | NaN | NaN | |
Dinner | Sun | 21.887241 | 3.220345 | 2.810345 | 25.107586 | |
Lunch | Sun | NaN | NaN | NaN | NaN |
Resetting index¶
If you’d rather not deal with hierarchical indexes, use reset_index
.
In [95]:
df.reset_index()
Out[95]:
sex | day | time | total_bill | tip | size | cost | |
---|---|---|---|---|---|---|---|
0 | Female | Thur | Dinner | 18.780000 | 3.000000 | 2.000000 | 21.780000 |
1 | Female | Thur | Lunch | 16.648710 | 2.561935 | 2.483871 | 19.210645 |
2 | Female | Fri | Dinner | 14.310000 | 2.810000 | 2.000000 | 17.120000 |
3 | Female | Fri | Lunch | 13.940000 | 2.745000 | 2.250000 | 16.685000 |
4 | Female | Sat | Dinner | 19.680357 | 2.801786 | 2.250000 | 22.482143 |
5 | Female | Sat | Lunch | NaN | NaN | NaN | NaN |
6 | Female | Sun | Dinner | 19.872222 | 3.367222 | 2.944444 | 23.239444 |
7 | Female | Sun | Lunch | NaN | NaN | NaN | NaN |
8 | Male | Thur | Dinner | NaN | NaN | NaN | NaN |
9 | Male | Thur | Lunch | 18.714667 | 2.980333 | 2.433333 | 21.695000 |
10 | Male | Fri | Dinner | 23.487143 | 3.032857 | 2.285714 | 26.520000 |
11 | Male | Fri | Lunch | 11.386667 | 1.900000 | 1.666667 | 13.286667 |
12 | Male | Sat | Dinner | 20.802542 | 3.083898 | 2.644068 | 23.886441 |
13 | Male | Sat | Lunch | NaN | NaN | NaN | NaN |
14 | Male | Sun | Dinner | 21.887241 | 3.220345 | 2.810345 | 25.107586 |
15 | Male | Sun | Lunch | NaN | NaN | NaN | NaN |
Reshaping¶
In [96]:
df1 = df.reset_index()
In [97]:
df1.head(3)
Out[97]:
sex | day | time | total_bill | tip | size | cost | |
---|---|---|---|---|---|---|---|
0 | Female | Thur | Dinner | 18.78000 | 3.000000 | 2.000000 | 21.780000 |
1 | Female | Thur | Lunch | 16.64871 | 2.561935 | 2.483871 | 19.210645 |
2 | Female | Fri | Dinner | 14.31000 | 2.810000 | 2.000000 | 17.120000 |
Transpose¶
In [98]:
df1.T
Out[98]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sex | Female | Female | Female | Female | Female | Female | Female | Female | Male | Male | Male | Male | Male | Male | Male | Male |
day | Thur | Thur | Fri | Fri | Sat | Sat | Sun | Sun | Thur | Thur | Fri | Fri | Sat | Sat | Sun | Sun |
time | Dinner | Lunch | Dinner | Lunch | Dinner | Lunch | Dinner | Lunch | Dinner | Lunch | Dinner | Lunch | Dinner | Lunch | Dinner | Lunch |
total_bill | 18.78 | 16.6487 | 14.31 | 13.94 | 19.6804 | NaN | 19.8722 | NaN | NaN | 18.7147 | 23.4871 | 11.3867 | 20.8025 | NaN | 21.8872 | NaN |
tip | 3 | 2.56194 | 2.81 | 2.745 | 2.80179 | NaN | 3.36722 | NaN | NaN | 2.98033 | 3.03286 | 1.9 | 3.0839 | NaN | 3.22034 | NaN |
size | 2 | 2.48387 | 2 | 2.25 | 2.25 | NaN | 2.94444 | NaN | NaN | 2.43333 | 2.28571 | 1.66667 | 2.64407 | NaN | 2.81034 | NaN |
cost | 21.78 | 19.2106 | 17.12 | 16.685 | 22.4821 | NaN | 23.2394 | NaN | NaN | 21.695 | 26.52 | 13.2867 | 23.8864 | NaN | 25.1076 | NaN |
Melt (gather)¶
In [99]:
pd.melt(df1, id_vars=['sex', 'day', 'time', 'size']).head(10)
Out[99]:
sex | day | time | size | variable | value | |
---|---|---|---|---|---|---|
0 | Female | Thur | Dinner | 2.000000 | total_bill | 18.780000 |
1 | Female | Thur | Lunch | 2.483871 | total_bill | 16.648710 |
2 | Female | Fri | Dinner | 2.000000 | total_bill | 14.310000 |
3 | Female | Fri | Lunch | 2.250000 | total_bill | 13.940000 |
4 | Female | Sat | Dinner | 2.250000 | total_bill | 19.680357 |
5 | Female | Sat | Lunch | NaN | total_bill | NaN |
6 | Female | Sun | Dinner | 2.944444 | total_bill | 19.872222 |
7 | Female | Sun | Lunch | NaN | total_bill | NaN |
8 | Male | Thur | Dinner | NaN | total_bill | NaN |
9 | Male | Thur | Lunch | 2.433333 | total_bill | 18.714667 |
Pivot table¶
A pivot table is like a group_by operation on both the index (rows) and columns.
In [100]:
pd.pivot_table(df1, values=['total_bill', 'tip'],
index=['sex', 'day'],
columns='time',
aggfunc='mean')
Out[100]:
tip | total_bill | ||||
---|---|---|---|---|---|
time | Dinner | Lunch | Dinner | Lunch | |
sex | day | ||||
Female | Thur | 3.000000 | 2.561935 | 18.780000 | 16.648710 |
Fri | 2.810000 | 2.745000 | 14.310000 | 13.940000 | |
Sat | 2.801786 | NaN | 19.680357 | NaN | |
Sun | 3.367222 | NaN | 19.872222 | NaN | |
Male | Thur | NaN | 2.980333 | NaN | 18.714667 |
Fri | 3.032857 | 1.900000 | 23.487143 | 11.386667 | |
Sat | 3.083898 | NaN | 20.802542 | NaN | |
Sun | 3.220345 | NaN | 21.887241 | NaN |
Joining data frames¶
In [101]:
tips1 = tips[0:5]
tips1
Out[101]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Weekend | 18.00 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Weekend | 12.00 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Weekend | 24.51 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | Weekend | 26.99 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | Weekend | 28.20 |
In [102]:
tips2 = tips[3:8]
tips2
Out[102]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | Weekend | 26.99 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | Weekend | 28.20 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 | Weekend | 30.00 |
6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 | Weekend | 10.77 |
7 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 | Weekend | 30.00 |
Simple concatenation¶
In [103]:
pd.concat([tips1, tips2])
Out[103]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Weekend | 18.00 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Weekend | 12.00 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Weekend | 24.51 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | Weekend | 26.99 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | Weekend | 28.20 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | Weekend | 26.99 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | Weekend | 28.20 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 | Weekend | 30.00 |
6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 | Weekend | 10.77 |
7 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 | Weekend | 30.00 |
Joining columns¶
Merge uses all common columns to combine. It is very flexible - see help(pd.merge).
In [104]:
pd.merge(tips1, tips2)
Out[104]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | Weekend | 26.99 |
1 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | Weekend | 28.20 |
In [105]:
pd.merge(tips1, tips2, how='left')
Out[105]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Weekend | 18.00 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Weekend | 12.00 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Weekend | 24.51 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | Weekend | 26.99 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | Weekend | 28.20 |
In [106]:
pd.merge(tips1, tips2, how='right')
Out[106]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | Weekend | 26.99 |
1 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | Weekend | 28.20 |
2 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 | Weekend | 30.00 |
3 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 | Weekend | 10.77 |
4 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 | Weekend | 30.00 |
In [107]:
pd.merge(tips1, tips2, how='outer')
Out[107]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Weekend | 18.00 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Weekend | 12.00 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Weekend | 24.51 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | Weekend | 26.99 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | Weekend | 28.20 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 | Weekend | 30.00 |
6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 | Weekend | 10.77 |
7 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 | Weekend | 30.00 |
Visualizing data¶
In [108]:
tips.head()
Out[108]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Weekend | 18.00 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Weekend | 12.00 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Weekend | 24.51 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | Weekend | 26.99 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | Weekend | 28.20 |
In [109]:
sns.set_context('notebook', font_scale=1.3)
g = sns.factorplot(x='sex', y='tip',
col='time', row='smoker',
data=tips, kind='swarm')
pass
In [110]:
sns.set_context('notebook', font_scale=1.3)
g = sns.factorplot(x='day', y='tip', hue='sex',
col='time', row='smoker',
data=tips, kind='violin',
margin_titles=True)
pass
In [111]:
sns.lmplot(x='total_bill', y='tip', hue='sex',
col='time', row='smoker',
data=tips, palette='dark')
pass
Pipelines¶
As nearly all pandas
functions and methods consume and return a
DataFrame
, they can be chained togtehr to construct a pipeline.
In [112]:
tips.head()
Out[112]:
total_bill | tip | sex | smoker | day | time | size | day_type | cost | |
---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Weekend | 18.00 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Weekend | 12.00 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Weekend | 24.51 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | Weekend | 26.99 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | Weekend | 28.20 |
In [113]:
pd.set_option('precision', 2)
In [114]:
(
tips.
dropna().
filter(regex=r".*[t|s].*").
loc[lambda df: df.sex=='Female'].
groupby(['smoker', 'time']).
agg({'total_bill': 'mean',
'tip': 'mean'}).
assign(percent = lambda df: 100*df.tip/df.total_bill)
)
Out[114]:
total_bill | tip | percent | ||
---|---|---|---|---|
smoker | time | |||
No | Dinner | 20.00 | 3.04 | 15.22 |
Lunch | 15.90 | 2.46 | 15.47 | |
Yes | Dinner | 18.22 | 2.95 | 16.19 |
Lunch | 17.43 | 2.89 | 16.59 |