Data¶

Libraries for Data Processing¶

In [1]:

%matplotlib inline

String processing and utility libraries

In [2]:

import os
import glob
import string
import re
import imageio

Libraries for functional programming

In [3]:

import operator as op
import itertools as it
from functools import reduce, partial
import toolz as tz
import toolz.curried as c

Libraries for numerical programming

In [4]:

import numpy as np
import pandas as pd
from scipy import (stats, sparse, linalg,
                   spatial, integrate, optimize, io)

Libraries for plotting and visualization

In [5]:

import matplotlib.pyplot as plt
import seaborn as sns

Working with unstructured text data¶

Categorical and one-hot encoding¶

See examples from S05_Text notebook.

Term frequency - inverse document frequency (tf-idf)¶

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes; 83% of text-based recommender systems in the domain of digital libraries use tf-idf.

Source: Wikipedia

Documents¶

In [6]:

paths = glob.glob(os.path.join('data', 'Gutenberg', '*.txt') )
names = [os.path.splitext(os.path.split(path)[-1])[0] for path in paths]
names

Out[6]:

['blake-poems',
 'carroll-alice',
 'shakespeare-caesar',
 'whitman-leaves',
 'milton-paradise',
 'bible-kjv',
 'austen-persuasion',
 'melville-moby_dick',
 'edgeworth-parents',
 'chesterton-thursday',
 'burgess-busterbrown',
 'chesterton-ball',
 'austen-emma',
 'chesterton-brown',
 'shakespeare-hamlet',
 'austen-sense',
 'shakespeare-macbeth',
 'bryant-stories']

In [7]:

N = len(names)

Simple processing to find words in each document¶

Standard Python idiom.

In [8]:

doc_terms = []
for path in paths:
    with open(path, encoding='latin-1') as f:
        text = f.read()
        text = text.lower()
        text = text.translate(str.maketrans('', '', string.punctuation))
        words = text.split()
        doc_terms.append(words)

Using a functional idiom

In [9]:

doc_terms = tz.pipe(
    paths,
    c.map(partial(open, encoding='latin-1')),
    c.map(lambda x: x.read()),
    c.map(lambda x: x.lower()),
    c.map(lambda x: x.translate(str.maketrans('', '', string.punctuation))),
    c.map(lambda x: x.split()),
    list
)

Term frequency¶

This is just a word count for each document.

In [10]:

tf = {name: tz.frequencies(doc)
      for name, doc in zip(names, doc_terms)}

Document frequency¶

This is how many documents a term appears in.

In [11]:

df = tz.frequencies(tz.concat(d.keys() for d in tf.values()))

Inverse document frequency¶

This weights each term by the inverse frequency of its appearance across different documents, and reduces the important of common words like “the”.

In [12]:

idf = {term: np.log((1 + N)/(1 + count)) for term, count in df.items()}

Term frequency - inverse document frequency¶

This is just the product of tf and idf, and a measure of the importance of each term in a document.

In [13]:

terms = list(tz.unique(tz.concat(doc_terms)))

In [14]:

tf_idf = {}
for name, doc in tf.items():
    d = {}
    for term in terms:
        d[term] = doc.get(term, 0) * idf[term]
    tf_idf[name] = d

Convert to a structured data type¶

In [15]:

tf_idf = pd.DataFrame(tf_idf)

In [16]:

tf_idf.iloc[:5, :5]

Out[16]:

	austen-emma	austen-persuasion	austen-sense	bible-kjv	blake-poems
00	0.0	0.0	0.0	0.0	0.0
00021053	0.0	0.0	0.0	0.0	0.0
00081429	0.0	0.0	0.0	0.0	0.0
00482129	0.0	0.0	0.0	0.0	0.0
01	0.0	0.0	0.0	0.0	0.0

Distinctive words in each document¶

These are words that appear frequently in that document but not in others.

In [17]:

pd.DataFrame({doc: series.sort_values(ascending=False).index[:5]
              for doc, series in tf_idf.items()}).T

Out[17]:

	0	1	2	3	4
austen-emma	emma	harriet	weston	knightley	elton
austen-persuasion	elliot	anne	wentworth	musgrove	russell
austen-sense	elinor	marianne	dashwood	jennings	mrs
bible-kjv	unto	israel	saith	thee	thou
blake-poems	thel	weep	thee	lyca	vales
bryant-stories	jackal	margery	brahmin	nightingale	epaminondas
burgess-busterbrown	buster	browns	joe	blacky	billy
carroll-alice	alice	gryphon	dormouse	duchess	hatter
chesterton-ball	turnbull	macian	evan	turnbulls	have
chesterton-brown	flambeau	boulnois	muscari	brown	fanshaw
chesterton-thursday	syme	gregory	professor	marquis	secretary
edgeworth-parents	susan	cecilia	piedro	archer	mr
melville-moby_dick	whale	ahab	sperm	whales	stubb
milton-paradise	thee	thou	heaven	thy	eve
shakespeare-caesar	bru	brutus	cassi	haue	cassius
shakespeare-hamlet	ham	haue	hor	qu	laer
shakespeare-macbeth	macb	haue	macbeth	macd	rosse
whitman-leaves	o	thee	poems	states	pioneers

Why is “macb” the top hit for Macbeth?¶

In [18]:

tf['shakespeare-macbeth']['macb']

Out[18]:

In [19]:

text = open('data/gutenberg/shakespeare-macbeth.txt', encoding='latin-1').read()
re.findall('macb[^e].*\n', text, re.IGNORECASE)[:10]

Out[19]:

['Macb. So foule and faire a day I haue not seene\n',
 'Macb. Stay you imperfect Speakers, tell me more:\n',
 "Macb. Into the Ayre: and what seem'd corporall,\n",
 'Macb. Your Children shall be Kings\n',
 'Macb. And Thane of Cawdor too: went it not so?\n',
 'Macb. The Thane of Cawdor liues:\n',
 'Macb. Glamys, and Thane of Cawdor:\n',
 'Macb. Two Truths are told,\n',
 'Macb. If Chance will haue me King,\n',
 'Macb. Come what come may,\n']

Image data¶

In [20]:

paths = glob.glob(os.path.join('data', 'POKEMON', '*.png') )[:3]

imgs = np.array([imageio.imread(path) for path in paths])

In [21]:

imgs.shape

Out[21]:

(3, 215, 215, 4)

In [22]:

fig, axes = plt.subplots(1,3,figsize=(10,3))
for ax, img in zip(axes, imgs):
    ax.imshow(img)
    ax.set_xticks([])
    ax.set_yticks([])

../_images/notebooks_S06_Data_Annotated_42_0.png

Using `pandas` for data munging¶

In [23]:

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"

In [24]:

tips = pd.read_csv(url)

Inspecting a data frame¶

In [25]:

tips.shape

Out[25]:

(244, 7)

In [26]:

tips.dtypes

Out[26]:

total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size            int64
dtype: object

In [27]:

tips.columns

Out[27]:

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

In [28]:

tips.index

Out[28]:

RangeIndex(start=0, stop=244, step=1)

In [29]:

tips.describe()

Out[29]:

	total_bill	tip	size
count	244.000000	244.000000	244.000000
mean	19.785943	2.998279	2.569672
std	8.902412	1.383638	0.951100
min	3.070000	1.000000	1.000000
25%	13.347500	2.000000	2.000000
50%	17.795000	2.900000	2.000000
75%	24.127500	3.562500	3.000000
max	50.810000	10.000000	6.000000

In [30]:

tips.head(3)

Out[30]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3

In [31]:

tips.tail(3)

Out[31]:

	total_bill	tip	sex	smoker	day	time	size
241	22.67	2.00	Male	Yes	Sat	Dinner	2
242	17.82	1.75	Male	No	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

In [32]:

tips.sample(3)

Out[32]:

	total_bill	tip	sex	smoker	day	time	size
132	11.17	1.50	Female	No	Thur	Lunch	2
144	16.43	2.30	Female	No	Thur	Lunch	2
55	19.49	3.51	Male	No	Sun	Dinner	2

Series¶

A column is a pandas Series object. It behaves like an indexed vector.

In [33]:

tips['sex'].head()

Out[33]:

  Female
    Male
    Male
    Male
  Female
Name: sex, dtype: object

Alternative way to get Series when column name meets requirements for Python variable (i.e. no spaces or punctuation)

In [34]:

tips.sex.head()

Out[34]:

  Female
    Male
    Male
    Male
  Female
Name: sex, dtype: object

Series types¶

String operations¶

In [35]:

tips.sex.str.lower().str[0].head()

Out[35]:

  f
  m
  m
  m
  f
Name: sex, dtype: object

Categorical data types¶

In [36]:

tips.day.unique()

Out[36]:

array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)

In [37]:

tips['day'] = tips.day.astype('category')

In [38]:

tips.day.head(3)

Out[38]:

0    Sun
1    Sun
2    Sun
Name: day, dtype: category
Categories (4, object): [Fri, Sat, Sun, Thur]

In [39]:

tips.day.cat.reorder_categories(['Thur', 'Fri', 'Sat', 'Sun'], ordered=True, inplace=True)

In [40]:

tips.day.head(3)

Out[40]:

0    Sun
1    Sun
2    Sun
Name: day, dtype: category
Categories (4, object): [Thur < Fri < Sat < Sun]

Datetime operations¶

In [41]:

import pandas_datareader as pdr

data_source = 'google'
start_date = '2010-01-01'
end_date = '2016-12-31'

data = pdr.get_data_morningstar('MSFT', start_date, end_date)

In [42]:

data.head(3)

Out[42]:

		Close	High	Low	Open	Volume
Symbol	Date
MSFT	2010-01-01	30.48	30.99	30.48	30.98	0
	2010-01-04	30.95	31.10	30.59	30.65	38414185
	2010-01-05	30.96	31.10	30.64	30.87	49758862

In [43]:

dates = data.index.get_level_values(1)

In [44]:

dates[:3]

Out[44]:

DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05'], dtype='datetime64[ns]', name='Date', freq=None)

In [45]:

list(it.islice(zip(dates.year,
                    dates.month,
                    dates.day), 6))

Out[45]:

[(2010, 1, 1),
 (2010, 1, 4),
 (2010, 1, 5),
 (2010, 1, 6),
 (2010, 1, 7),
 (2010, 1, 8)]

In [46]:

msft = data.loc['MSFT']

In [47]:

msft.plot(y='High')
pass

../_images/notebooks_S06_Data_Annotated_75_0.png

In [48]:

msft.loc['2016-12-01':'2016-12-31', ('Low', 'High')].plot(
    linestyle='dashed', marker='+')
pass

../_images/notebooks_S06_Data_Annotated_76_0.png

Indexing¶

In [49]:

tips[0:2]

Out[49]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3

In [50]:

tips.loc[0:2]

Out[50]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3

In [51]:

tips.iloc[0:2]

Out[51]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3

In [52]:

tips.iloc[0:3, [2,3,4]]

Out[52]:

	sex	smoker	day
0	Female	No	Sun
1	Male	No	Sun
2	Male	No	Sun

In [53]:

tips.loc[0:2, ['tip', 'sex', 'size']]

Out[53]:

	tip	sex	size
0	1.01	Female	2
1	1.66	Male	3
2	3.50	Male	3

In [54]:

tips.loc[0:2, 'tip':'day']

Out[54]:

	tip	sex	smoker	day
0	1.01	Female	No	Sun
1	1.66	Male	No	Sun
2	3.50	Male	No	Sun

Boolean indexing¶

In [55]:

tips[tips.sex == 'Male'].head(3)

Out[55]:

	total_bill	tip	sex	smoker	day	time	size
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2

In [56]:

tips[(tips.sex == 'Male') & (tips.time != 'Dinner')].head(3)

Out[56]:

	total_bill	tip	sex	smoker	day	time	size
77	27.20	4.00	Male	No	Thur	Lunch	4
78	22.76	3.00	Male	No	Thur	Lunch	2
79	17.29	2.71	Male	No	Thur	Lunch	2

Special selection¶

In [57]:

df = pd.DataFrame({
    'a': [1,2,None,4,1],
    'b': [3,3,None,None,3],
    'c': [1,2,None,4,1]})

In [58]:

df

Out[58]:

	a	b	c
0	1.0	3.0	1.0
1	2.0	3.0	2.0
2	NaN	NaN	NaN
3	4.0	NaN	4.0
4	1.0	3.0	1.0

In [59]:

df.dropna()

Out[59]:

	a	b	c
0	1.0	3.0	1.0
1	2.0	3.0	2.0
4	1.0	3.0	1.0

In [60]:

df.dropna(how='all')

Out[60]:

	a	b	c
0	1.0	3.0	1.0
1	2.0	3.0	2.0
3	4.0	NaN	4.0
4	1.0	3.0	1.0

In [61]:

df.drop_duplicates()

Out[61]:

	a	b	c
0	1.0	3.0	1.0
1	2.0	3.0	2.0
2	NaN	NaN	NaN
3	4.0	NaN	4.0

In [62]:

df.drop_duplicates(keep='last')

Out[62]:

	a	b	c
1	2.0	3.0	2.0
2	NaN	NaN	NaN
3	4.0	NaN	4.0
4	1.0	3.0	1.0

Selecting by label¶

In [63]:

tips.filter(regex='^s.*').head(3)

Out[63]:

	sex	smoker	size
0	Female	No	2
1	Male	No	3
2	Male	No	3

Sorting¶

In [64]:

tips.sort_values(['size', 'tip'], ascending=[True, False]).head(3)

Out[64]:

	total_bill	tip	sex	smoker	day	time	size
222	8.58	1.92	Male	Yes	Fri	Lunch	1
82	10.07	1.83	Female	No	Thur	Lunch	1
67	3.07	1.00	Female	Yes	Sat	Dinner	1

Note that ordered categorical values are sorted appropriately.

In [65]:

tips.sort_values(['day']).head(3)

Out[65]:

	total_bill	tip	sex	smoker	day	time	size
121	13.42	1.68	Female	No	Thur	Lunch	2
133	12.26	2.00	Female	No	Thur	Lunch	2
132	11.17	1.50	Female	No	Thur	Lunch	2

Rearrange columns¶

In [66]:

tips = tips[tips.columns.sort_values()]
tips.head(3)

Out[66]:

	day	sex	size	smoker	time	tip	total_bill
0	Sun	Female	2	No	Dinner	1.01	16.99
1	Sun	Male	3	No	Dinner	1.66	10.34
2	Sun	Male	3	No	Dinner	3.50	21.01

In [67]:

tips = tips.filter('total_bill      tip     sex     smoker  day     time    size'.split())
tips.head(3)

Out[67]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3

Transforms¶

In [68]:

tips['day_type'] = 'Weekday'
tips.loc[tips.day.isin(['Sat', 'Sun']), 'day_type'] = 'Weekend'
tips.head(3)

Out[68]:

	total_bill	tip	sex	smoker	day	time	size	day_type
0	16.99	1.01	Female	No	Sun	Dinner	2	Weekend
1	10.34	1.66	Male	No	Sun	Dinner	3	Weekend
2	21.01	3.50	Male	No	Sun	Dinner	3	Weekend

In [69]:

tips['cost'] = tips['total_bill'] + tips['tip']
tips.head(3)

Out[69]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
0	16.99	1.01	Female	No	Sun	Dinner	2	Weekend	18.00
1	10.34	1.66	Male	No	Sun	Dinner	3	Weekend	12.00
2	21.01	3.50	Male	No	Sun	Dinner	3	Weekend	24.51

In [70]:

tips.assign(log1p_cost = np.log1p(tips.cost)).head(3)

Out[70]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost	log1p_cost
0	16.99	1.01	Female	No	Sun	Dinner	2	Weekend	18.00	2.944439
1	10.34	1.66	Male	No	Sun	Dinner	3	Weekend	12.00	2.564949
2	21.01	3.50	Male	No	Sun	Dinner	3	Weekend	24.51	3.239071

In [71]:

tips.replace({
    'sex': dict(Femals='F', Male='M'),
    'day': dict(Thur=4, Fri=5, Sat=6, Sun=7)}).head(3)

Out[71]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
0	16.99	1.01	Female	No	7	Dinner	2	Weekend	18.00
1	10.34	1.66	M	No	7	Dinner	3	Weekend	12.00
2	21.01	3.50	M	No	7	Dinner	3	Weekend	24.51

In [72]:

tips.rename({'sex': 'gender'}, axis=1).head(3)

Out[72]:

	total_bill	tip	gender	smoker	day	time	size	day_type	cost
0	16.99	1.01	Female	No	Sun	Dinner	2	Weekend	18.00
1	10.34	1.66	Male	No	Sun	Dinner	3	Weekend	12.00
2	21.01	3.50	Male	No	Sun	Dinner	3	Weekend	24.51

Transforming missing values¶

In [73]:

df

Out[73]:

	a	b	c
0	1.0	3.0	1.0
1	2.0	3.0	2.0
2	NaN	NaN	NaN
3	4.0	NaN	4.0
4	1.0	3.0	1.0

In [74]:

df.fillna(0)

Out[74]:

	a	b	c
0	1.0	3.0	1.0
1	2.0	3.0	2.0
2	0.0	0.0	0.0
3	4.0	0.0	4.0
4	1.0	3.0	1.0

In [75]:

df.fillna(df.mean())

Out[75]:

	a	b	c
0	1.0	3.0	1.0
1	2.0	3.0	2.0
2	2.0	3.0	2.0
3	4.0	3.0	4.0
4	1.0	3.0	1.0

In [76]:

df.fillna(method='ffill')

Out[76]:

	a	b	c
0	1.0	3.0	1.0
1	2.0	3.0	2.0
2	2.0	3.0	2.0
3	4.0	3.0	4.0
4	1.0	3.0	1.0

In [77]:

df.fillna(method='bfill')

Out[77]:

	a	b	c
0	1.0	3.0	1.0
1	2.0	3.0	2.0
2	4.0	3.0	4.0
3	4.0	3.0	4.0
4	1.0	3.0	1.0

Summaries¶

In [78]:

tips.mean()

Out[78]:

total_bill    19.785943
tip            2.998279
size           2.569672
cost          22.784221
dtype: float64

In [79]:

tips.std()

Out[79]:

total_bill    8.902412
tip           1.383638
size          0.951100
cost          9.890116
dtype: float64

In [80]:

tips.count()

Out[80]:

total_bill    244
tip           244
sex           244
smoker        244
day           244
time          244
size          244
day_type      244
cost          244
dtype: int64

Grouping¶

In [81]:

tips.groupby(['sex', 'day', 'time']).mean()

Out[81]:

			total_bill	tip	size	cost
sex	day	time
Female	Thur	Dinner	18.780000	3.000000	2.000000	21.780000
	Thur	Lunch	16.648710	2.561935	2.483871	19.210645
	Fri	Dinner	14.310000	2.810000	2.000000	17.120000
	Fri	Lunch	13.940000	2.745000	2.250000	16.685000
	Sat	Dinner	19.680357	2.801786	2.250000	22.482143
	Sat	Lunch	NaN	NaN	NaN	NaN
	Sun	Dinner	19.872222	3.367222	2.944444	23.239444
	Sun	Lunch	NaN	NaN	NaN	NaN
Male	Thur	Dinner	NaN	NaN	NaN	NaN
	Thur	Lunch	18.714667	2.980333	2.433333	21.695000
	Fri	Dinner	23.487143	3.032857	2.285714	26.520000
	Fri	Lunch	11.386667	1.900000	1.666667	13.286667
	Sat	Dinner	20.802542	3.083898	2.644068	23.886441
	Sat	Lunch	NaN	NaN	NaN	NaN
	Sun	Dinner	21.887241	3.220345	2.810345	25.107586
	Sun	Lunch	NaN	NaN	NaN	NaN

In [82]:

tips.groupby(['sex']).agg(['mean', 'std', 'min', 'max'])

Out[82]:

	total_bill				tip				size				cost
	mean	std	min	max	mean	std	min	max	mean	std	min	max	mean	std	min	max
sex
Female	18.056897	8.009209	3.07	44.30	2.833448	1.159495	1.0	6.5	2.459770	0.937644	1	6	20.890345	8.841798	4.07	48.11
Male	20.744076	9.246469	7.25	50.81	3.089618	1.489102	1.0	10.0	2.630573	0.955997	1	6	23.833694	10.303309	9.00	60.81

Working with hierarchical indexes¶

In [83]:

df = tips.groupby(['sex', 'day', 'time']).mean()
df.head()

Out[83]:

			total_bill	tip	size	cost
sex	day	time
Female	Thur	Dinner	18.780000	3.000000	2.000000	21.780000
	Thur	Lunch	16.648710	2.561935	2.483871	19.210645
	Fri	Dinner	14.310000	2.810000	2.000000	17.120000
	Fri	Lunch	13.940000	2.745000	2.250000	16.685000
	Sat	Dinner	19.680357	2.801786	2.250000	22.482143

In [84]:

df.index

Out[84]:

MultiIndex(levels=[['Female', 'Male'], ['Thur', 'Fri', 'Sat', 'Sun'], ['Dinner', 'Lunch']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
           names=['sex', 'day', 'time'])

In [85]:

df.columns

Out[85]:

Index(['total_bill', 'tip', 'size', 'cost'], dtype='object')

In [86]:

df.loc[('Female')]

Out[86]:

		total_bill	tip	size	cost
day	time
Thur	Dinner	18.780000	3.000000	2.000000	21.780000
Thur	Lunch	16.648710	2.561935	2.483871	19.210645
Fri	Dinner	14.310000	2.810000	2.000000	17.120000
Fri	Lunch	13.940000	2.745000	2.250000	16.685000
Sat	Dinner	19.680357	2.801786	2.250000	22.482143
Sat	Lunch	NaN	NaN	NaN	NaN
Sun	Dinner	19.872222	3.367222	2.944444	23.239444
Sun	Lunch	NaN	NaN	NaN	NaN

In [87]:

df.loc[('Female', 'Sat')]

Out[87]:

	total_bill	tip	size	cost
time
Dinner	19.680357	2.801786	2.25	22.482143
Lunch	NaN	NaN	NaN	NaN

In [88]:

df.loc[('Female', slice(None), 'Dinner')]

Out[88]:

			total_bill	tip	size	cost
sex	day	time
Female	Thur	Dinner	18.780000	3.000000	2.000000	21.780000
	Fri	Dinner	14.310000	2.810000	2.000000	17.120000
	Sat	Dinner	19.680357	2.801786	2.250000	22.482143
	Sun	Dinner	19.872222	3.367222	2.944444	23.239444

Stacking and unstacking¶

In [89]:

tips.head(3)

Out[89]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
0	16.99	1.01	Female	No	Sun	Dinner	2	Weekend	18.00
1	10.34	1.66	Male	No	Sun	Dinner	3	Weekend	12.00
2	21.01	3.50	Male	No	Sun	Dinner	3	Weekend	24.51

In [90]:

df.unstack()['tip']

Out[90]:

	time	Dinner	Lunch
sex	day
Female	Thur	3.000000	2.561935
	Fri	2.810000	2.745000
	Sat	2.801786	NaN
	Sun	3.367222	NaN
Male	Thur	NaN	2.980333
	Fri	3.032857	1.900000
	Sat	3.083898	NaN
	Sun	3.220345	NaN

In [91]:

df.unstack(level=0)['tip']

Out[91]:

	sex	Female	Male
day	time
Thur	Dinner	3.000000	NaN
Thur	Lunch	2.561935	2.980333
Fri	Dinner	2.810000	3.032857
Fri	Lunch	2.745000	1.900000
Sat	Dinner	2.801786	3.083898
Sat	Lunch	NaN	NaN
Sun	Dinner	3.367222	3.220345
Sun	Lunch	NaN	NaN

In [92]:

df.unstack(level=[1,2])['tip']

Out[92]:

day	Thur		Fri		Sat		Sun
time	Dinner	Lunch	Dinner	Lunch	Dinner	Lunch	Dinner	Lunch
sex
Female	3.0	2.561935	2.810000	2.745	2.801786	NaN	3.367222	NaN
Male	NaN	2.980333	3.032857	1.900	3.083898	NaN	3.220345	NaN

In [93]:

df.unstack(level=[1,2])['tip'].stack(level=0)

Out[93]:

	time	Dinner	Lunch
sex	day
Female	Thur	3.000000	2.561935
	Fri	2.810000	2.745000
	Sat	2.801786	NaN
	Sun	3.367222	NaN
Male	Thur	NaN	2.980333
	Fri	3.032857	1.900000
	Sat	3.083898	NaN
	Sun	3.220345	NaN

Swapping levels¶

In [94]:

df.swaplevel(1,2)

Out[94]:

			total_bill	tip	size	cost
sex	time	day
Female	Dinner	Thur	18.780000	3.000000	2.000000	21.780000
	Lunch	Thur	16.648710	2.561935	2.483871	19.210645
	Dinner	Fri	14.310000	2.810000	2.000000	17.120000
	Lunch	Fri	13.940000	2.745000	2.250000	16.685000
	Dinner	Sat	19.680357	2.801786	2.250000	22.482143
	Lunch	Sat	NaN	NaN	NaN	NaN
	Dinner	Sun	19.872222	3.367222	2.944444	23.239444
	Lunch	Sun	NaN	NaN	NaN	NaN
Male	Dinner	Thur	NaN	NaN	NaN	NaN
	Lunch	Thur	18.714667	2.980333	2.433333	21.695000
	Dinner	Fri	23.487143	3.032857	2.285714	26.520000
	Lunch	Fri	11.386667	1.900000	1.666667	13.286667
	Dinner	Sat	20.802542	3.083898	2.644068	23.886441
	Lunch	Sat	NaN	NaN	NaN	NaN
	Dinner	Sun	21.887241	3.220345	2.810345	25.107586
	Lunch	Sun	NaN	NaN	NaN	NaN

Resetting index¶

If you’d rather not deal with hierarchical indexes, use reset_index.

In [95]:

df.reset_index()

Out[95]:

	sex	day	time	total_bill	tip	size	cost
0	Female	Thur	Dinner	18.780000	3.000000	2.000000	21.780000
1	Female	Thur	Lunch	16.648710	2.561935	2.483871	19.210645
2	Female	Fri	Dinner	14.310000	2.810000	2.000000	17.120000
3	Female	Fri	Lunch	13.940000	2.745000	2.250000	16.685000
4	Female	Sat	Dinner	19.680357	2.801786	2.250000	22.482143
5	Female	Sat	Lunch	NaN	NaN	NaN	NaN
6	Female	Sun	Dinner	19.872222	3.367222	2.944444	23.239444
7	Female	Sun	Lunch	NaN	NaN	NaN	NaN
8	Male	Thur	Dinner	NaN	NaN	NaN	NaN
9	Male	Thur	Lunch	18.714667	2.980333	2.433333	21.695000
10	Male	Fri	Dinner	23.487143	3.032857	2.285714	26.520000
11	Male	Fri	Lunch	11.386667	1.900000	1.666667	13.286667
12	Male	Sat	Dinner	20.802542	3.083898	2.644068	23.886441
13	Male	Sat	Lunch	NaN	NaN	NaN	NaN
14	Male	Sun	Dinner	21.887241	3.220345	2.810345	25.107586
15	Male	Sun	Lunch	NaN	NaN	NaN	NaN

Reshaping¶

In [96]:

df1 = df.reset_index()

In [97]:

df1.head(3)

Out[97]:

	sex	day	time	total_bill	tip	size	cost
0	Female	Thur	Dinner	18.78000	3.000000	2.000000	21.780000
1	Female	Thur	Lunch	16.64871	2.561935	2.483871	19.210645
2	Female	Fri	Dinner	14.31000	2.810000	2.000000	17.120000

Transpose¶

In [98]:

df1.T

Out[98]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
sex	Female	Female	Female	Female	Female	Female	Female	Female	Male	Male	Male	Male	Male	Male	Male	Male
day	Thur	Thur	Fri	Fri	Sat	Sat	Sun	Sun	Thur	Thur	Fri	Fri	Sat	Sat	Sun	Sun
time	Dinner	Lunch	Dinner	Lunch	Dinner	Lunch	Dinner	Lunch	Dinner	Lunch	Dinner	Lunch	Dinner	Lunch	Dinner	Lunch
total_bill	18.78	16.6487	14.31	13.94	19.6804	NaN	19.8722	NaN	NaN	18.7147	23.4871	11.3867	20.8025	NaN	21.8872	NaN
tip	3	2.56194	2.81	2.745	2.80179	NaN	3.36722	NaN	NaN	2.98033	3.03286	1.9	3.0839	NaN	3.22034	NaN
size	2	2.48387	2	2.25	2.25	NaN	2.94444	NaN	NaN	2.43333	2.28571	1.66667	2.64407	NaN	2.81034	NaN
cost	21.78	19.2106	17.12	16.685	22.4821	NaN	23.2394	NaN	NaN	21.695	26.52	13.2867	23.8864	NaN	25.1076	NaN

Melt (gather)¶

In [99]:

pd.melt(df1, id_vars=['sex', 'day', 'time', 'size']).head(10)

Out[99]:

	sex	day	time	size	variable	value
0	Female	Thur	Dinner	2.000000	total_bill	18.780000
1	Female	Thur	Lunch	2.483871	total_bill	16.648710
2	Female	Fri	Dinner	2.000000	total_bill	14.310000
3	Female	Fri	Lunch	2.250000	total_bill	13.940000
4	Female	Sat	Dinner	2.250000	total_bill	19.680357
5	Female	Sat	Lunch	NaN	total_bill	NaN
6	Female	Sun	Dinner	2.944444	total_bill	19.872222
7	Female	Sun	Lunch	NaN	total_bill	NaN
8	Male	Thur	Dinner	NaN	total_bill	NaN
9	Male	Thur	Lunch	2.433333	total_bill	18.714667

Pivot table¶

A pivot table is like a group_by operation on both the index (rows) and columns.

In [100]:

pd.pivot_table(df1, values=['total_bill', 'tip'],
               index=['sex', 'day'],
               columns='time',
               aggfunc='mean')

Out[100]:

		tip		total_bill
	time	Dinner	Lunch	Dinner	Lunch
sex	day
Female	Thur	3.000000	2.561935	18.780000	16.648710
	Fri	2.810000	2.745000	14.310000	13.940000
	Sat	2.801786	NaN	19.680357	NaN
	Sun	3.367222	NaN	19.872222	NaN
Male	Thur	NaN	2.980333	NaN	18.714667
	Fri	3.032857	1.900000	23.487143	11.386667
	Sat	3.083898	NaN	20.802542	NaN
	Sun	3.220345	NaN	21.887241	NaN

Joining data frames¶

In [101]:

tips1 = tips[0:5]
tips1

Out[101]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
0	16.99	1.01	Female	No	Sun	Dinner	2	Weekend	18.00
1	10.34	1.66	Male	No	Sun	Dinner	3	Weekend	12.00
2	21.01	3.50	Male	No	Sun	Dinner	3	Weekend	24.51
3	23.68	3.31	Male	No	Sun	Dinner	2	Weekend	26.99
4	24.59	3.61	Female	No	Sun	Dinner	4	Weekend	28.20

In [102]:

tips2 = tips[3:8]
tips2

Out[102]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
3	23.68	3.31	Male	No	Sun	Dinner	2	Weekend	26.99
4	24.59	3.61	Female	No	Sun	Dinner	4	Weekend	28.20
5	25.29	4.71	Male	No	Sun	Dinner	4	Weekend	30.00
6	8.77	2.00	Male	No	Sun	Dinner	2	Weekend	10.77
7	26.88	3.12	Male	No	Sun	Dinner	4	Weekend	30.00

Simple concatenation¶

In [103]:

pd.concat([tips1, tips2])

Out[103]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
0	16.99	1.01	Female	No	Sun	Dinner	2	Weekend	18.00
1	10.34	1.66	Male	No	Sun	Dinner	3	Weekend	12.00
2	21.01	3.50	Male	No	Sun	Dinner	3	Weekend	24.51
3	23.68	3.31	Male	No	Sun	Dinner	2	Weekend	26.99
4	24.59	3.61	Female	No	Sun	Dinner	4	Weekend	28.20
3	23.68	3.31	Male	No	Sun	Dinner	2	Weekend	26.99
4	24.59	3.61	Female	No	Sun	Dinner	4	Weekend	28.20
5	25.29	4.71	Male	No	Sun	Dinner	4	Weekend	30.00
6	8.77	2.00	Male	No	Sun	Dinner	2	Weekend	10.77
7	26.88	3.12	Male	No	Sun	Dinner	4	Weekend	30.00

Joining columns¶

Merge uses all common columns to combine. It is very flexible - see help(pd.merge).

In [104]:

pd.merge(tips1, tips2)

Out[104]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
0	23.68	3.31	Male	No	Sun	Dinner	2	Weekend	26.99
1	24.59	3.61	Female	No	Sun	Dinner	4	Weekend	28.20

In [105]:

pd.merge(tips1, tips2, how='left')

Out[105]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
0	16.99	1.01	Female	No	Sun	Dinner	2	Weekend	18.00
1	10.34	1.66	Male	No	Sun	Dinner	3	Weekend	12.00
2	21.01	3.50	Male	No	Sun	Dinner	3	Weekend	24.51
3	23.68	3.31	Male	No	Sun	Dinner	2	Weekend	26.99
4	24.59	3.61	Female	No	Sun	Dinner	4	Weekend	28.20

In [106]:

pd.merge(tips1, tips2, how='right')

Out[106]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
0	23.68	3.31	Male	No	Sun	Dinner	2	Weekend	26.99
1	24.59	3.61	Female	No	Sun	Dinner	4	Weekend	28.20
2	25.29	4.71	Male	No	Sun	Dinner	4	Weekend	30.00
3	8.77	2.00	Male	No	Sun	Dinner	2	Weekend	10.77
4	26.88	3.12	Male	No	Sun	Dinner	4	Weekend	30.00

In [107]:

pd.merge(tips1, tips2, how='outer')

Out[107]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
0	16.99	1.01	Female	No	Sun	Dinner	2	Weekend	18.00
1	10.34	1.66	Male	No	Sun	Dinner	3	Weekend	12.00
2	21.01	3.50	Male	No	Sun	Dinner	3	Weekend	24.51
3	23.68	3.31	Male	No	Sun	Dinner	2	Weekend	26.99
4	24.59	3.61	Female	No	Sun	Dinner	4	Weekend	28.20
5	25.29	4.71	Male	No	Sun	Dinner	4	Weekend	30.00
6	8.77	2.00	Male	No	Sun	Dinner	2	Weekend	10.77
7	26.88	3.12	Male	No	Sun	Dinner	4	Weekend	30.00

Visualizing data¶

In [108]:

tips.head()

Out[108]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
0	16.99	1.01	Female	No	Sun	Dinner	2	Weekend	18.00
1	10.34	1.66	Male	No	Sun	Dinner	3	Weekend	12.00
2	21.01	3.50	Male	No	Sun	Dinner	3	Weekend	24.51
3	23.68	3.31	Male	No	Sun	Dinner	2	Weekend	26.99
4	24.59	3.61	Female	No	Sun	Dinner	4	Weekend	28.20

In [109]:

sns.set_context('notebook', font_scale=1.3)
g = sns.factorplot(x='sex', y='tip',
                   col='time', row='smoker',
                   data=tips, kind='swarm')
pass

../_images/notebooks_S06_Data_Annotated_161_0.png

In [110]:

sns.set_context('notebook', font_scale=1.3)
g = sns.factorplot(x='day', y='tip', hue='sex',
                   col='time', row='smoker',
                   data=tips, kind='violin',
                   margin_titles=True)
pass

../_images/notebooks_S06_Data_Annotated_162_0.png

In [111]:

sns.lmplot(x='total_bill', y='tip', hue='sex',
           col='time', row='smoker',
           data=tips, palette='dark')
pass

../_images/notebooks_S06_Data_Annotated_163_0.png

Pipelines¶

As nearly all pandas functions and methods consume and return a DataFrame, they can be chained togtehr to construct a pipeline.

In [112]:

tips.head()

Out[112]:

	total_bill	tip	sex	smoker	day	time	size	day_type	cost
0	16.99	1.01	Female	No	Sun	Dinner	2	Weekend	18.00
1	10.34	1.66	Male	No	Sun	Dinner	3	Weekend	12.00
2	21.01	3.50	Male	No	Sun	Dinner	3	Weekend	24.51
3	23.68	3.31	Male	No	Sun	Dinner	2	Weekend	26.99
4	24.59	3.61	Female	No	Sun	Dinner	4	Weekend	28.20

In [113]:

pd.set_option('precision', 2)

In [114]:

(
    tips.
    dropna().
    filter(regex=r".*[t|s].*").
    loc[lambda df: df.sex=='Female'].
    groupby(['smoker', 'time']).
    agg({'total_bill': 'mean',
         'tip': 'mean'}).
    assign(percent = lambda df: 100*df.tip/df.total_bill)
)

Out[114]:

		total_bill	tip	percent
smoker	time
No	Dinner	20.00	3.04	15.22
No	Lunch	15.90	2.46	15.47
Yes	Dinner	18.22	2.95	16.19
Yes	Lunch	17.43	2.89	16.59