Pandas Part 2

Part 1 assumed that the DataFrame was in a tidy format, with one observation per row and one variable per column. Real-world data is often not so obliging, and we have to clean and wrangle it before we can analyze it efficiently.

Data cleaning operations

  • Moving a column to the index
  • Moving the index to a column
  • Rearranging column order
  • Changing or renaming values in the table
  • Dealing with missing data
  • Dealing with duplicate data

Special operations (strings and categorical variables)

  • Splitting single string column into multiple columns
  • Joining multiple string columns into single column
  • Creating dummy variables from categorical variables for regression or machine learning

Data wrangling (generally involving changes of shape)

  • Moving data from multiple columns to single column (melt)
  • Moving multiple categories in a column to separate columns (pivot)
  • Pivoting with aggregation (pivot_table)
  • Combining multiple data frames
In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Data cleaning

Swapping columns and indexes

In [3]:
df = DataFrame({'pid': ['101','102','103'],
                'age': [23,34,45],
                'wt': [150,160,170]})
In [4]:
df
Out[4]:
age pid wt
0 23 101 150
1 34 102 160
2 45 103 170
In [5]:
df1 = df.set_index('pid')
df1
Out[5]:
age wt
pid
101 23 150
102 34 160
103 45 170
In [6]:
df1.reset_index()
Out[6]:
pid age wt
0 101 23 150
1 102 34 160
2 103 45 170

Transforming values with a map

In [7]:
df['age'] = df['age'].map(lambda x: 'young' if x <= 30 else 'old')
df
Out[7]:
age pid wt
0 young 101 150
1 old 102 160
2 old 103 170
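
For numeric data, pd.cut gives the same binning without a lambda. A minimal sketch, using the original numeric ages (df['age'] now holds strings):

ages = Series([23, 34, 45])  # the ages before the map above replaced them
pd.cut(ages, bins=[0, 30, 120], labels=['young', 'old'])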

Explicit replacement

In [8]:
df.replace({23: 'young', 34: 'old', 45: 'old'})
Out[8]:
age pid wt
0 young 101 150
1 old 102 160
2 old 103 170

Missing data

In [9]:
%%file assay.txt
pid,mon,tue,wed
101,10,11,12
102,11,-99,13
103,-99,?,10
104,?,?,?
Overwriting assay.txt
In [10]:
df = pd.read_csv('assay.txt', na_values=['-99', '?'], index_col=0)
In [11]:
df
Out[11]:
mon tue wed
pid
101 10.0 11.0 12.0
102 11.0 NaN 13.0
103 NaN NaN 10.0
104 NaN NaN NaN
In [12]:
df.isnull()
Out[12]:
mon tue wed
pid
101 False False False
102 False True False
103 True True False
104 True True True
In [13]:
df.notnull()
Out[13]:
mon tue wed
pid
101 True True True
102 True False True
103 False False True
104 False False False

Drop any row containing missing data

In [14]:
df.dropna()
Out[14]:
mon tue wed
pid
101 10.0 11.0 12.0

Drop only rows where all data is missing

In [15]:
df.dropna(how='all')
Out[15]:
mon tue wed
pid
101 10.0 11.0 12.0
102 11.0 NaN 13.0
103 NaN NaN 10.0

Drop any column with missing data

In [16]:
df.dropna(how='all').dropna(axis=1)
Out[16]:
wed
pid
101 12.0
102 13.0
103 10.0

Simple imputation

In [17]:
df
Out[17]:
mon tue wed
pid
101 10.0 11.0 12.0
102 11.0 NaN 13.0
103 NaN NaN 10.0
104 NaN NaN NaN

Fill with column mean

In [18]:
df.fillna(df.mean(0))
Out[18]:
mon tue wed
pid
101 10.0 11.0 12.000000
102 11.0 11.0 13.000000
103 10.5 11.0 10.000000
104 10.5 11.0 11.666667

Fill with specific values

In [19]:
df.fillna(0)
Out[19]:
mon tue wed
pid
101 10.0 11.0 12.0
102 11.0 0.0 13.0
103 0.0 0.0 10.0
104 0.0 0.0 0.0
In [20]:
df.fillna({'mon': 10, 'tue': 11, 'wed': 12})
Out[20]:
mon tue wed
pid
101 10.0 11.0 12.0
102 11.0 11.0 13.0
103 10.0 11.0 10.0
104 10.0 11.0 12.0
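
fillna can also propagate the last valid observation down each column. A sketch:

df.fillna(method='ffill')  # each NaN takes the last non-missing value above it in its column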

Duplicate data

In [21]:
%%file assay2.txt
pid,mon,tue,wed
101,10,11,12
102,11,12,13
103,10,10,10
102,11,12,13 # duplicate row
102,10,11,14 # duplicate pid
101,10,11,12 # duplicate row
Overwriting assay2.txt
In [22]:
df = pd.read_csv('assay2.txt', comment='#')
In [23]:
df
Out[23]:
pid mon tue wed
0 101 10 11 12
1 102 11 12 13
2 103 10 10 10
3 102 11 12 13
4 102 10 11 14
5 101 10 11 12
In [24]:
df.duplicated()
Out[24]:
0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool
In [25]:
df.duplicated(subset=['pid'])
Out[25]:
0    False
1    False
2    False
3     True
4     True
5     True
dtype: bool
In [26]:
df.drop_duplicates()
Out[26]:
pid mon tue wed
0 101 10 11 12
1 102 11 12 13
2 103 10 10 10
4 102 10 11 14
In [27]:
df.drop_duplicates(subset=['pid'], keep='first')
Out[27]:
pid mon tue wed
0 101 10 11 12
1 102 11 12 13
2 103 10 10 10
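
keep='last' retains the final record per pid instead, and keep=False discards every pid that repeats. A sketch:

df.drop_duplicates(subset=['pid'], keep='last')  # rows 2, 4 and 5 survive
df.drop_duplicates(subset=['pid'], keep=False)   # only pid 103, which never repeats, survives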

Special operations on strings and categorical variables

Strings

In [28]:
%%file assay3.txt
id,mon,tue,wed
101:duke:2017,10,11,12
102:duke:2017,11,12,13
103:duke:2017,10,10,10
102:unc:2016,11,12,13
102:unc:2017,10,11,14
101:unc:2017,10,11,12
Overwriting assay3.txt
In [29]:
df = pd.read_csv('assay3.txt')
In [30]:
df
Out[30]:
id mon tue wed
0 101:duke:2017 10 11 12
1 102:duke:2017 11 12 13
2 103:duke:2017 10 10 10
3 102:unc:2016 11 12 13
4 102:unc:2017 10 11 14
5 101:unc:2017 10 11 12
In [31]:
df.id.str.split(':')
Out[31]:
0    [101, duke, 2017]
1    [102, duke, 2017]
2    [103, duke, 2017]
3     [102, unc, 2016]
4     [102, unc, 2017]
5     [101, unc, 2017]
Name: id, dtype: object
In [32]:
df.id.str.split(':').str[-1]
Out[32]:
0    2017
1    2017
2    2017
3    2016
4    2017
5    2017
Name: id, dtype: object
In [33]:
df[['pid', 'site', 'year']] = df.id.str.split(':', expand=True)
df
Out[33]:
id mon tue wed pid site year
0 101:duke:2017 10 11 12 101 duke 2017
1 102:duke:2017 11 12 13 102 duke 2017
2 103:duke:2017 10 10 10 103 duke 2017
3 102:unc:2016 11 12 13 102 unc 2016
4 102:unc:2017 10 11 14 102 unc 2017
5 101:unc:2017 10 11 12 101 unc 2017

Rearrange and drop old id column

In [34]:
df.iloc[:, np.r_[4:7, 1:4]]
Out[34]:
pid site year mon tue wed
0 101 duke 2017 10 11 12
1 102 duke 2017 11 12 13
2 103 duke 2017 10 10 10
3 102 unc 2016 11 12 13
4 102 unc 2017 10 11 14
5 101 unc 2017 10 11 12
In [35]:
df.id.str.extract(r'.*:(.*):(.*)', expand=True)
Out[35]:
0 1
0 duke 2017
1 duke 2017
2 duke 2017
3 unc 2016
4 unc 2017
5 unc 2017
In [36]:
df[df.id.str.contains('unc')]
Out[36]:
id mon tue wed pid site year
3 102:unc:2016 11 12 13 102 unc 2016
4 102:unc:2017 10 11 14 102 unc 2017
5 101:unc:2017 10 11 12 101 unc 2017

Joining multiple string columns

In [37]:
df.pid + ':' + df.site + ':' + df.year
Out[37]:
0    101:duke:2017
1    102:duke:2017
2    103:duke:2017
3     102:unc:2016
4     102:unc:2017
5     101:unc:2017
dtype: object

Another method if you are combining LOTS of columns

In [38]:
df.iloc[:, -3:].astype('str').add(':').sum(axis=1).str[:-1]
Out[38]:
0    101:duke:2017
1    102:duke:2017
2    103:duke:2017
3     102:unc:2016
4     102:unc:2017
5     101:unc:2017
dtype: object
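
str.cat performs the same join without the trailing-separator trick. A sketch:

df.pid.str.cat([df.site, df.year], sep=':')  # concatenate three string columns with ':'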

Categorical variables

In [39]:
df = df.iloc[:, np.r_[4:7, 1:4]]
df
Out[39]:
pid site year mon tue wed
0 101 duke 2017 10 11 12
1 102 duke 2017 11 12 13
2 103 duke 2017 10 10 10
3 102 unc 2016 11 12 13
4 102 unc 2017 10 11 14
5 101 unc 2017 10 11 12
In [40]:
pd.options.mode.chained_assignment = None # turns off harmless warning
df['pid'] = df['pid'].astype('category')
df['site'] = df['site'].astype('category')
In [41]:
df.site
Out[41]:
0    duke
1    duke
2    duke
3     unc
4     unc
5     unc
Name: site, dtype: category
Categories (2, object): [duke, unc]
In [42]:
df.site.cat.codes
Out[42]:
0    0
1    0
2    0
3    1
4    1
5    1
dtype: int8
In [43]:
df.site.cat.categories
Out[43]:
Index(['duke', 'unc'], dtype='object')
In [44]:
df.site.cat.reorder_categories(['unc', 'duke'], ordered=True)
Out[44]:
0    duke
1    duke
2    duke
3     unc
4     unc
5     unc
Name: site, dtype: category
Categories (2, object): [unc < duke]

One-hot encoding

For regression models, it is often necessary to convert a single column of categorical values into dummy variables, with one dummy variable for each possible category. In machine learning applications, this is known as one-hot encoding.

In [45]:
n = 10
race = np.random.choice(['white', 'black', 'brown', 'yellow'], n)
vals = np.random.random(n)
df = DataFrame(dict(race=race, vals=vals))
In [46]:
df
Out[46]:
race vals
0 brown 0.260416
1 white 0.079823
2 black 0.912978
3 black 0.711759
4 white 0.059453
5 brown 0.024857
6 white 0.960330
7 brown 0.870268
8 yellow 0.008312
9 brown 0.345326
In [47]:
df['race'] = df['race'].astype('category')
In [48]:
df.race
Out[48]:
0     brown
1     white
2     black
3     black
4     white
5     brown
6     white
7     brown
8    yellow
9     brown
Name: race, dtype: category
Categories (4, object): [black, brown, white, yellow]
In [49]:
df1 = pd.get_dummies(df.race)
df1
Out[49]:
black brown white yellow
0 0 1 0 0
1 0 0 1 0
2 1 0 0 0
3 1 0 0 0
4 0 0 1 0
5 0 1 0 0
6 0 0 1 0
7 0 1 0 0
8 0 0 0 1
9 0 1 0 0
In [50]:
df.join(df1)
Out[50]:
race vals black brown white yellow
0 brown 0.260416 0 1 0 0
1 white 0.079823 0 0 1 0
2 black 0.912978 1 0 0 0
3 black 0.711759 1 0 0 0
4 white 0.059453 0 0 1 0
5 brown 0.024857 0 1 0 0
6 white 0.960330 0 0 1 0
7 brown 0.870268 0 1 0 0
8 yellow 0.008312 0 0 0 1
9 brown 0.345326 0 1 0 0
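
For regression, one category is usually dropped as the reference level to avoid perfect collinearity; get_dummies supports this directly. A sketch:

pd.get_dummies(df.race, drop_first=True)  # drops 'black', the first category, as the reference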

Data wrangling

Hierarchical indexing

In [51]:
birth_weight = np.random.normal(7, 1, 8)
df = DataFrame(birth_weight,
               index=[['anne']*3 + ['bella']*3 + ['carrie']*2,
                      [1,2,3,1,2,3,1,2]],
               columns=['weight'])
df.index.names = ['women', 'pregnancy']
In [52]:
df
Out[52]:
weight
women pregnancy
anne 1 8.661477
2 6.524455
3 7.052574
bella 1 7.183759
2 7.095648
3 9.075680
carrie 1 6.659004
2 8.807852
In [53]:
df.index
Out[53]:
MultiIndex(levels=[['anne', 'bella', 'carrie'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1]],
           names=['women', 'pregnancy'])

Swapping levels

In [54]:
df.swaplevel().sort_index()
Out[54]:
weight
pregnancy women
1 anne 8.661477
bella 7.183759
carrie 6.659004
2 anne 6.524455
bella 7.095648
carrie 8.807852
3 anne 7.052574
bella 9.075680

Partial indexing

In [55]:
df.loc['bella']
Out[55]:
weight
pregnancy
1 7.183759
2 7.095648
3 9.075680
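
To select on the inner level instead, xs takes a level argument. A sketch:

df.xs(1, level='pregnancy')  # first pregnancy for every woman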

Unstack

This rotates from the rows to the columns.

In [56]:
df.unstack(level=0)
Out[56]:
weight
women anne bella carrie
pregnancy
1 8.661477 7.183759 6.659004
2 6.524455 7.095648 8.807852
3 7.052574 9.075680 NaN
In [57]:
df.unstack(level=1)
Out[57]:
weight
pregnancy 1 2 3
women
anne 8.661477 6.524455 7.052574
bella 7.183759 7.095648 9.075680
carrie 6.659004 8.807852 NaN

Stack

This rotates from the columns to the rows.

In [58]:
s1 = df.unstack(level=0)
In [59]:
s1
Out[59]:
weight
women anne bella carrie
pregnancy
1 8.661477 7.183759 6.659004
2 6.524455 7.095648 8.807852
3 7.052574 9.075680 NaN
In [60]:
s1.stack(level=0)
Out[60]:
women anne bella carrie
pregnancy
1 weight 8.661477 7.183759 6.659004
2 weight 6.524455 7.095648 8.807852
3 weight 7.052574 9.075680 NaN
In [61]:
s1.stack(level=1)
Out[61]:
weight
pregnancy women
1 anne 8.661477
bella 7.183759
carrie 6.659004
2 anne 6.524455
bella 7.095648
carrie 8.807852
3 anne 7.052574
bella 9.075680

“Flattening” a hierarchical index

In [62]:
df.reset_index()
Out[62]:
women pregnancy weight
0 anne 1 8.661477
1 anne 2 6.524455
2 anne 3 7.052574
3 bella 1 7.183759
4 bella 2 7.095648
5 bella 3 9.075680
6 carrie 1 6.659004
7 carrie 2 8.807852
In [63]:
df.reset_index(level=0)
Out[63]:
women weight
pregnancy
1 anne 8.661477
2 anne 6.524455
3 anne 7.052574
1 bella 7.183759
2 bella 7.095648
3 bella 9.075680
1 carrie 6.659004
2 carrie 8.807852

Summary by level

In [64]:
df
Out[64]:
weight
women pregnancy
anne 1 8.661477
2 6.524455
3 7.052574
bella 1 7.183759
2 7.095648
3 9.075680
carrie 1 6.659004
2 8.807852
In [65]:
df.mean(level=0)
Out[65]:
weight
women
anne 7.412835
bella 7.785029
carrie 7.733428
In [66]:
df.mean(level=1)
Out[66]:
weight
pregnancy
1 7.501413
2 7.475985
3 8.064127

Reshape and pivot data

In [67]:
%load_ext rpy2.ipython
In [68]:
iris = %R iris
In [69]:
iris.head()
Out[69]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa

From wide to long

This is often necessary for plotting routines.

In [70]:
iris1 = pd.melt(iris, id_vars=['Species'])
iris1.head()
Out[70]:
Species variable value
0 setosa Sepal.Length 5.1
1 setosa Sepal.Length 4.9
2 setosa Sepal.Length 4.7
3 setosa Sepal.Length 4.6
4 setosa Sepal.Length 5.0
In [71]:
import seaborn as sns
sns.set_context('notebook', font_scale=1.5)
In [72]:
g = sns.factorplot(data=iris1, x='Species', y='value',
                   col='variable', col_wrap=2,
                   kind='swarm')
g.set_titles('{col_name}')
pass
[Figure: swarm plots of value against Species, one panel per iris measurement]

Pivot

Pivot splits a column into multiple columns

In [73]:
pid = np.r_[0:5, 0:5]
time = ['A']*5 + ['B']*5
val = np.random.normal(10,1,10)
df = DataFrame(dict(pid=pid, time=time, val=val),
               columns=['pid', 'time', 'val'])
In [74]:
df
Out[74]:
pid time val
0 0 A 9.066841
1 1 A 9.113788
2 2 A 10.965594
3 3 A 9.848895
4 4 A 9.194968
5 0 B 10.819055
6 1 B 9.900527
7 2 B 11.007177
8 3 B 9.586152
9 4 B 9.486819
In [75]:
df.pivot(index='time', columns='pid')
Out[75]:
val
pid 0 1 2 3 4
time
A 9.066841 9.113788 10.965594 9.848895 9.194968
B 10.819055 9.900527 11.007177 9.586152 9.486819
In [76]:
df.pivot(index='pid', columns='time', values='val')
Out[76]:
time A B
pid
0 9.066841 10.819055
1 9.113788 9.900527
2 10.965594 11.007177
3 9.848895 9.586152
4 9.194968 9.486819

Note that pivot is set_index followed by unstack.

In [77]:
df.set_index(['pid', 'time'])
Out[77]:
val
pid time
0 A 9.066841
1 A 9.113788
2 A 10.965594
3 A 9.848895
4 A 9.194968
0 B 10.819055
1 B 9.900527
2 B 11.007177
3 B 9.586152
4 B 9.486819
In [78]:
df.set_index(['pid', 'time']).unstack('time')
Out[78]:
val
time A B
pid
0 9.066841 10.819055
1 9.113788 9.900527
2 10.965594 11.007177
3 9.848895 9.586152
4 9.194968 9.486819

pivot can be reversed by melt.

In [79]:
df1 = df.pivot(index='pid', columns='time', values='val')
df1
Out[79]:
time A B
pid
0 9.066841 10.819055
1 9.113788 9.900527
2 10.965594 11.007177
3 9.848895 9.586152
4 9.194968 9.486819
In [80]:
df1 = df1.reset_index()
df1
Out[80]:
time pid A B
0 0 9.066841 10.819055
1 1 9.113788 9.900527
2 2 10.965594 11.007177
3 3 9.848895 9.586152
4 4 9.194968 9.486819
In [81]:
pd.melt(df1, ['pid'])
Out[81]:
pid time value
0 0 A 9.066841
1 1 A 9.113788
2 2 A 10.965594
3 3 A 9.848895
4 4 A 9.194968
5 0 B 10.819055
6 1 B 9.900527
7 2 B 11.007177
8 3 B 9.586152
9 4 B 9.486819

Pivot tables and cross-tabulation

Pivot tables can be an alternative to using groupby. They are also useful for calculating marginals.
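
For intuition, a pivot_table call on the df constructed below is roughly a groupby followed by unstack. A sketch:

df.pivot_table(index='pid', columns='grp', values='val')
# is roughly equivalent to (mean is the default aggregation)
df.groupby(['pid', 'grp'])['val'].mean().unstack('grp')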

In [82]:
pid = np.r_[0:4, 0:4]
time = ['A']*4 + ['B']*4
grp = np.tile(['Case', 'Control'], 4)
val = np.random.normal(10,1,8)
df = DataFrame(dict(pid=pid, grp=grp, time=time, val=val),
               columns=['pid', 'grp', 'time', 'val'])
In [83]:
df
Out[83]:
pid grp time val
0 0 Case A 9.081981
1 1 Control A 9.068653
2 2 Case A 9.346933
3 3 Control A 8.895022
4 0 Case B 11.139738
5 1 Control B 10.489634
6 2 Case B 9.913781
7 3 Control B 9.703207
In [84]:
df.pivot_table(index=['pid', 'grp'])
Out[84]:
val
pid grp
0 Case 10.110860
1 Control 9.779143
2 Case 9.630357
3 Control 9.299114
In [85]:
df.pivot_table(index=['pid', 'time'])
Out[85]:
val
pid time
0 A 9.081981
B 11.139738
1 A 9.068653
B 10.489634
2 A 9.346933
B 9.913781
3 A 8.895022
B 9.703207
In [86]:
df.pivot_table(index='pid', columns=['grp'])
Out[86]:
val
grp Case Control
pid
0 10.110860 NaN
1 NaN 9.779143
2 9.630357 NaN
3 NaN 9.299114
In [87]:
df.pivot_table(index='pid')
Out[87]:
val
pid
0 10.110860
1 9.779143
2 9.630357
3 9.299114
In [88]:
iris.head()
Out[88]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
In [89]:
iris.pivot_table(index='Species', aggfunc='mean')
Out[89]:
Petal.Length Petal.Width Sepal.Length Sepal.Width
Species
setosa 1.462 0.246 5.006 3.428
versicolor 4.260 1.326 5.936 2.770
virginica 5.552 2.026 6.588 2.974
In [90]:
iris.pivot_table(index='Species',
                 values=['Petal.Length', 'Petal.Width'],
                 aggfunc='mean')
Out[90]:
Petal.Length Petal.Width
Species
setosa 1.462 0.246
versicolor 4.260 1.326
virginica 5.552 2.026
In [91]:
iris.pivot_table(index='Species',
                 values=['Petal.Length', 'Petal.Width'],
                 aggfunc='mean',
                 margins=True)
Out[91]:
Petal.Length Petal.Width
Species
setosa 1.462 0.246000
versicolor 4.260 1.326000
virginica 5.552 2.026000
All 3.758 1.199333

Simple cross-tabulation with margins

In [92]:
deaths = [10, 12, 15, 18]
sex = ['m', 'f', 'm', 'f']
smoker = ['n', 'n', 'y', 'y']
df = DataFrame(dict(deaths=deaths, sex=sex, smoker=smoker))
In [93]:
df
Out[93]:
deaths sex smoker
0 10 m n
1 12 f n
2 15 m y
3 18 f y
In [94]:
df.pivot_table(index='smoker', columns='sex', aggfunc='sum', margins=True)
Out[94]:
deaths
sex f m All
smoker
n 12 10 22
y 18 15 33
All 30 25 55

Cross-tabulation

In [95]:
smoker = np.random.choice(['y', 'n'], 100)
status = np.random.choice(['alive', 'dead'], 100)
In [96]:
pd.crosstab(smoker, status, margins=True)
Out[96]:
col_0 alive dead All
row_0
n 22 29 51
y 26 23 49
All 48 52 100
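
crosstab can also report proportions rather than counts via its normalize argument. A sketch:

pd.crosstab(smoker, status, normalize='index')  # each row sums to 1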

We can do this with pivot_table, but it takes a bit more work.

In [97]:
df = DataFrame(dict(smoker=smoker, status=status, val=np.ones(100)))
In [98]:
df.head()
Out[98]:
smoker status val
0 y alive 1.0
1 n dead 1.0
2 n alive 1.0
3 y dead 1.0
4 y alive 1.0
In [99]:
df.pivot_table(index='smoker', columns='status', aggfunc='count', margins=True)
Out[99]:
val
status alive dead All
smoker
n 22 29 51
y 26 23 49
All 48 52 100

Merging data

In [100]:
pids1 = [101, 102, 103]
vals1 = [10, 20, 30]
pids2 = [104,105,106]
vals2 = [40, 50, 60]
pids3 = [101, 103, 105]
vals3 = [70, 80, 90]
In [101]:
df1 = DataFrame(dict(pid=pids1, x=vals1))
df2 = DataFrame(dict(pid=pids2, x=vals2))
df3 = DataFrame(dict(pid=pids1, y=vals2))
df4 = DataFrame(dict(pid=pids3, y=vals3))
df5 = DataFrame(dict(pid=pids1, z=vals2))

Appending rows

In [102]:
df1
Out[102]:
pid x
0 101 10
1 102 20
2 103 30
In [103]:
df2
Out[103]:
pid x
0 104 40
1 105 50
2 106 60
In [104]:
pd.concat([df1, df2])
Out[104]:
pid x
0 101 10
1 102 20
2 103 30
0 104 40
1 105 50
2 106 60
In [105]:
pd.concat([df1, df2]).reset_index(drop=True)
Out[105]:
pid x
0 101 10
1 102 20
2 103 30
3 104 40
4 105 50
5 106 60
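
Alternatively, concat can label each piece with a key, giving a hierarchical index that records where each row came from. A sketch (the batch labels are arbitrary):

pd.concat([df1, df2], keys=['batch1', 'batch2'])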

Database-style joins

In [106]:
df1
Out[106]:
pid x
0 101 10
1 102 20
2 103 30
In [107]:
df3
Out[107]:
pid y
0 101 40
1 102 50
2 103 60
In [108]:
df4
Out[108]:
pid y
0 101 70
1 103 80
2 105 90

Merge joins on a column or columns

If the on argument is not specified, merge joins on all columns with the same name.

In [109]:
pd.merge(df1, df3)
Out[109]:
pid x y
0 101 10 40
1 102 20 50
2 103 30 60
In [110]:
pd.merge(df1, df4, on='pid', how='inner')
Out[110]:
pid x y
0 101 10 70
1 103 30 80
In [111]:
pd.merge(df1, df4, on='pid', how='left')
Out[111]:
pid x y
0 101 10 70.0
1 102 20 NaN
2 103 30 80.0
In [112]:
pd.merge(df1, df4, on='pid', how='right')
Out[112]:
pid x y
0 101 10.0 70
1 103 30.0 80
2 105 NaN 90
In [113]:
pd.merge(df1, df4, on='pid', how='outer')
Out[113]:
pid x y
0 101 10.0 70.0
1 102 20.0 NaN
2 103 30.0 80.0
3 105 NaN 90.0
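
When both frames carry non-key columns with the same name (df1 and df2 both have an x column), merge disambiguates them with suffixes. A sketch (the suffix strings are arbitrary):

pd.merge(df1, df2, on='pid', how='outer', suffixes=('_1', '_2'))  # yields columns x_1 and x_2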

Joining on common index

In [114]:
df1.set_index('pid', inplace=True)
df4.set_index('pid', inplace=True)
df5.set_index('pid', inplace=True)
In [115]:
df1
Out[115]:
x
pid
101 10
102 20
103 30
In [116]:
df4
Out[116]:
y
pid
101 70
103 80
105 90
In [117]:
df5
Out[117]:
z
pid
101 40
102 50
103 60

If the data frames share a common index, join can combine multiple data frames at once.

In [118]:
df1.join([df4, df5], how='outer')
Out[118]:
x y z
pid
101 10.0 70.0 40.0
102 20.0 NaN 50.0
103 30.0 80.0 60.0
105 NaN 90.0 NaN