Pandas Part 2
Part 1 assumed that the data.frame is in a tidy format, with one
observation per row and one variable per column. Real-world data is
often not so obliging, and we have to clean and wrangle it before we can
analyze the data efficiently.
Data cleaning operations
- Move column to index
- Move index to column
- Rearranging column order
- Change or rename values in the table
- Dealing with missing data
- Dealing with duplicate data
Special operations (strings and categorical variables)
- Splitting single string column into multiple columns
- Joining multiple string columns into single column
- Creating dummy variables from categorical variables for regression or
machine learning
Data wrangling (Generally involve changes of shapes)
- Moving data from multiple columns to single column (melt)
- Moving multiple categories in a column to separate columns (pivot)
- Pivoting with aggregation (pivot_table)
- Combining multiple data frames
Data cleaning
Swapping columns and indexes
|
age |
pid |
wt |
| 0 |
23 |
101 |
150 |
| 1 |
34 |
102 |
160 |
| 2 |
45 |
103 |
170 |
|
age |
wt |
| pid |
|
|
| 101 |
23 |
150 |
| 102 |
34 |
160 |
| 103 |
45 |
170 |
|
pid |
age |
wt |
| 0 |
101 |
23 |
150 |
| 1 |
102 |
34 |
160 |
| 2 |
103 |
45 |
170 |
Explicit replacement
|
age |
pid |
wt |
| 0 |
young |
101 |
150 |
| 1 |
old |
102 |
160 |
| 2 |
old |
103 |
170 |
Missing data
|
mon |
tue |
wed |
| pid |
|
|
|
| 101 |
10.0 |
11.0 |
12.0 |
| 102 |
11.0 |
NaN |
13.0 |
| 103 |
NaN |
NaN |
10.0 |
| 104 |
NaN |
NaN |
NaN |
|
mon |
tue |
wed |
| pid |
|
|
|
| 101 |
False |
False |
False |
| 102 |
False |
True |
False |
| 103 |
True |
True |
False |
| 104 |
True |
True |
True |
|
mon |
tue |
wed |
| pid |
|
|
|
| 101 |
True |
True |
True |
| 102 |
True |
False |
True |
| 103 |
False |
False |
True |
| 104 |
False |
False |
False |
Drop any row containing missing data
|
mon |
tue |
wed |
| pid |
|
|
|
| 101 |
10.0 |
11.0 |
12.0 |
Drop only rows where all data is missing
|
mon |
tue |
wed |
| pid |
|
|
|
| 101 |
10.0 |
11.0 |
12.0 |
| 102 |
11.0 |
NaN |
13.0 |
| 103 |
NaN |
NaN |
10.0 |
Drop any column with missing data
|
wed |
| pid |
|
| 101 |
12.0 |
| 102 |
13.0 |
| 103 |
10.0 |
Simple imputation
|
mon |
tue |
wed |
| pid |
|
|
|
| 101 |
10.0 |
11.0 |
12.0 |
| 102 |
11.0 |
NaN |
13.0 |
| 103 |
NaN |
NaN |
10.0 |
| 104 |
NaN |
NaN |
NaN |
Fill with column mean
|
mon |
tue |
wed |
| pid |
|
|
|
| 101 |
10.0 |
11.0 |
12.000000 |
| 102 |
11.0 |
11.0 |
13.000000 |
| 103 |
10.5 |
11.0 |
10.000000 |
| 104 |
10.5 |
11.0 |
11.666667 |
Fill wiht specific values
|
mon |
tue |
wed |
| pid |
|
|
|
| 101 |
10.0 |
11.0 |
12.0 |
| 102 |
11.0 |
0.0 |
13.0 |
| 103 |
0.0 |
0.0 |
10.0 |
| 104 |
0.0 |
0.0 |
0.0 |
|
mon |
tue |
wed |
| pid |
|
|
|
| 101 |
10.0 |
11.0 |
12.0 |
| 102 |
11.0 |
11.0 |
13.0 |
| 103 |
10.0 |
11.0 |
10.0 |
| 104 |
10.0 |
11.0 |
12.0 |
Duplicate data
|
pid |
mon |
tue |
wed |
| 0 |
101 |
10 |
11 |
12 |
| 1 |
102 |
11 |
12 |
13 |
| 2 |
103 |
10 |
10 |
10 |
| 3 |
102 |
11 |
12 |
13 |
| 4 |
102 |
10 |
11 |
14 |
| 5 |
101 |
10 |
11 |
12 |
0 False
1 False
2 False
3 True
4 False
5 True
dtype: bool
0 False
1 False
2 False
3 True
4 True
5 True
dtype: bool
|
pid |
mon |
tue |
wed |
| 0 |
101 |
10 |
11 |
12 |
| 1 |
102 |
11 |
12 |
13 |
| 2 |
103 |
10 |
10 |
10 |
| 4 |
102 |
10 |
11 |
14 |
|
pid |
mon |
tue |
wed |
| 0 |
101 |
10 |
11 |
12 |
| 1 |
102 |
11 |
12 |
13 |
| 2 |
103 |
10 |
10 |
10 |
Special operations on strings and factors
Strings
|
id |
mon |
tue |
wed |
| 0 |
101:duke:2017 |
10 |
11 |
12 |
| 1 |
102:duke:2017 |
11 |
12 |
13 |
| 2 |
103:duke:2017 |
10 |
10 |
10 |
| 3 |
102:unc:2016 |
11 |
12 |
13 |
| 4 |
102:unc:2017 |
10 |
11 |
14 |
| 5 |
101:unc:2017 |
10 |
11 |
12 |
0 [101, duke, 2017]
1 [102, duke, 2017]
2 [103, duke, 2017]
3 [102, unc, 2016]
4 [102, unc, 2017]
5 [101, unc, 2017]
Name: id, dtype: object
0 2017
1 2017
2 2017
3 2016
4 2017
5 2017
Name: id, dtype: object
|
id |
mon |
tue |
wed |
pid |
site |
year |
| 0 |
101:duke:2017 |
10 |
11 |
12 |
101 |
duke |
2017 |
| 1 |
102:duke:2017 |
11 |
12 |
13 |
102 |
duke |
2017 |
| 2 |
103:duke:2017 |
10 |
10 |
10 |
103 |
duke |
2017 |
| 3 |
102:unc:2016 |
11 |
12 |
13 |
102 |
unc |
2016 |
| 4 |
102:unc:2017 |
10 |
11 |
14 |
102 |
unc |
2017 |
| 5 |
101:unc:2017 |
10 |
11 |
12 |
101 |
unc |
2017 |
Rearrange and drop old id column
|
pid |
site |
year |
mon |
tue |
wed |
| 0 |
101 |
duke |
2017 |
10 |
11 |
12 |
| 1 |
102 |
duke |
2017 |
11 |
12 |
13 |
| 2 |
103 |
duke |
2017 |
10 |
10 |
10 |
| 3 |
102 |
unc |
2016 |
11 |
12 |
13 |
| 4 |
102 |
unc |
2017 |
10 |
11 |
14 |
| 5 |
101 |
unc |
2017 |
10 |
11 |
12 |
|
0 |
1 |
| 0 |
duke |
2017 |
| 1 |
duke |
2017 |
| 2 |
duke |
2017 |
| 3 |
unc |
2016 |
| 4 |
unc |
2017 |
| 5 |
unc |
2017 |
|
id |
mon |
tue |
wed |
pid |
site |
year |
| 3 |
102:unc:2016 |
11 |
12 |
13 |
102 |
unc |
2016 |
| 4 |
102:unc:2017 |
10 |
11 |
14 |
102 |
unc |
2017 |
| 5 |
101:unc:2017 |
10 |
11 |
12 |
101 |
unc |
2017 |
Joining multiple string columns
0 101:duke:2017
1 102:duke:2017
2 103:duke:2017
3 102:unc:2016
4 102:unc:2017
5 101:unc:2017
dtype: object
Another method if you are combining LOTS of columns
0 101:duke:2017
1 102:duke:2017
2 103:duke:2017
3 102:unc:2016
4 102:unc:2017
5 101:unc:2017
dtype: object
Categorical variables
|
pid |
site |
year |
mon |
tue |
wed |
| 0 |
101 |
duke |
2017 |
10 |
11 |
12 |
| 1 |
102 |
duke |
2017 |
11 |
12 |
13 |
| 2 |
103 |
duke |
2017 |
10 |
10 |
10 |
| 3 |
102 |
unc |
2016 |
11 |
12 |
13 |
| 4 |
102 |
unc |
2017 |
10 |
11 |
14 |
| 5 |
101 |
unc |
2017 |
10 |
11 |
12 |
0 duke
1 duke
2 duke
3 unc
4 unc
5 unc
Name: site, dtype: category
Categories (2, object): [duke, unc]
0 0
1 0
2 0
3 1
4 1
5 1
dtype: int8
Index(['duke', 'unc'], dtype='object')
0 duke
1 duke
2 duke
3 unc
4 unc
5 unc
Name: site, dtype: category
Categories (2, object): [unc < duke]
One-hot encoding
For regression models, it is often necessary to convert a single column
of categorical variables into dummy variable, with one dummy variable
for each possible category. In machine learning applications, this is
known as one-hot encoding.
|
race |
vals |
| 0 |
brown |
0.260416 |
| 1 |
white |
0.079823 |
| 2 |
black |
0.912978 |
| 3 |
black |
0.711759 |
| 4 |
white |
0.059453 |
| 5 |
brown |
0.024857 |
| 6 |
white |
0.960330 |
| 7 |
brown |
0.870268 |
| 8 |
yellow |
0.008312 |
| 9 |
brown |
0.345326 |
0 brown
1 white
2 black
3 black
4 white
5 brown
6 white
7 brown
8 yellow
9 brown
Name: race, dtype: category
Categories (4, object): [black, brown, white, yellow]
|
black |
brown |
white |
yellow |
| 0 |
0 |
1 |
0 |
0 |
| 1 |
0 |
0 |
1 |
0 |
| 2 |
1 |
0 |
0 |
0 |
| 3 |
1 |
0 |
0 |
0 |
| 4 |
0 |
0 |
1 |
0 |
| 5 |
0 |
1 |
0 |
0 |
| 6 |
0 |
0 |
1 |
0 |
| 7 |
0 |
1 |
0 |
0 |
| 8 |
0 |
0 |
0 |
1 |
| 9 |
0 |
1 |
0 |
0 |
|
race |
vals |
black |
brown |
white |
yellow |
| 0 |
brown |
0.260416 |
0 |
1 |
0 |
0 |
| 1 |
white |
0.079823 |
0 |
0 |
1 |
0 |
| 2 |
black |
0.912978 |
1 |
0 |
0 |
0 |
| 3 |
black |
0.711759 |
1 |
0 |
0 |
0 |
| 4 |
white |
0.059453 |
0 |
0 |
1 |
0 |
| 5 |
brown |
0.024857 |
0 |
1 |
0 |
0 |
| 6 |
white |
0.960330 |
0 |
0 |
1 |
0 |
| 7 |
brown |
0.870268 |
0 |
1 |
0 |
0 |
| 8 |
yellow |
0.008312 |
0 |
0 |
0 |
1 |
| 9 |
brown |
0.345326 |
0 |
1 |
0 |
0 |
Data Wrangling
Hierarchical indexing
|
|
weight |
| women |
pregnancy |
|
| anne |
1 |
8.661477 |
| 2 |
6.524455 |
| 3 |
7.052574 |
| bella |
1 |
7.183759 |
| 2 |
7.095648 |
| 3 |
9.075680 |
| carrie |
1 |
6.659004 |
| 2 |
8.807852 |
MultiIndex(levels=[['anne', 'bella', 'carrie'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1]],
names=['women', 'pregnancy'])
Swapping levels
|
|
weight |
| pregnancy |
women |
|
| 1 |
anne |
8.661477 |
| bella |
7.183759 |
| carrie |
6.659004 |
| 2 |
anne |
6.524455 |
| bella |
7.095648 |
| carrie |
8.807852 |
| 3 |
anne |
7.052574 |
| bella |
9.075680 |
Partial indexing
|
weight |
| pregnancy |
|
| 1 |
7.183759 |
| 2 |
7.095648 |
| 3 |
9.075680 |
Unstack
This rotates from the rows to the columns.
|
weight |
| women |
anne |
bella |
carrie |
| pregnancy |
|
|
|
| 1 |
8.661477 |
7.183759 |
6.659004 |
| 2 |
6.524455 |
7.095648 |
8.807852 |
| 3 |
7.052574 |
9.075680 |
NaN |
|
weight |
| pregnancy |
1 |
2 |
3 |
| women |
|
|
|
| anne |
8.661477 |
6.524455 |
7.052574 |
| bella |
7.183759 |
7.095648 |
9.075680 |
| carrie |
6.659004 |
8.807852 |
NaN |
Stack
This rotates from the columns to the rows
|
weight |
| women |
anne |
bella |
carrie |
| pregnancy |
|
|
|
| 1 |
8.661477 |
7.183759 |
6.659004 |
| 2 |
6.524455 |
7.095648 |
8.807852 |
| 3 |
7.052574 |
9.075680 |
NaN |
|
women |
anne |
bella |
carrie |
| pregnancy |
|
|
|
|
| 1 |
weight |
8.661477 |
7.183759 |
6.659004 |
| 2 |
weight |
6.524455 |
7.095648 |
8.807852 |
| 3 |
weight |
7.052574 |
9.075680 |
NaN |
|
|
weight |
| pregnancy |
women |
|
| 1 |
anne |
8.661477 |
| bella |
7.183759 |
| carrie |
6.659004 |
| 2 |
anne |
6.524455 |
| bella |
7.095648 |
| carrie |
8.807852 |
| 3 |
anne |
7.052574 |
| bella |
9.075680 |
“Flattening” a hierarchical index
|
women |
pregnancy |
weight |
| 0 |
anne |
1 |
8.661477 |
| 1 |
anne |
2 |
6.524455 |
| 2 |
anne |
3 |
7.052574 |
| 3 |
bella |
1 |
7.183759 |
| 4 |
bella |
2 |
7.095648 |
| 5 |
bella |
3 |
9.075680 |
| 6 |
carrie |
1 |
6.659004 |
| 7 |
carrie |
2 |
8.807852 |
|
women |
weight |
| pregnancy |
|
|
| 1 |
anne |
8.661477 |
| 2 |
anne |
6.524455 |
| 3 |
anne |
7.052574 |
| 1 |
bella |
7.183759 |
| 2 |
bella |
7.095648 |
| 3 |
bella |
9.075680 |
| 1 |
carrie |
6.659004 |
| 2 |
carrie |
8.807852 |
Summary by level
|
|
weight |
| women |
pregnancy |
|
| anne |
1 |
8.661477 |
| 2 |
6.524455 |
| 3 |
7.052574 |
| bella |
1 |
7.183759 |
| 2 |
7.095648 |
| 3 |
9.075680 |
| carrie |
1 |
6.659004 |
| 2 |
8.807852 |
|
weight |
| women |
|
| anne |
7.412835 |
| bella |
7.785029 |
| carrie |
7.733428 |
|
weight |
| pregnancy |
|
| 1 |
7.501413 |
| 2 |
7.475985 |
| 3 |
8.064127 |
Reshape and pivot data
|
Sepal.Length |
Sepal.Width |
Petal.Length |
Petal.Width |
Species |
| 1 |
5.1 |
3.5 |
1.4 |
0.2 |
setosa |
| 2 |
4.9 |
3.0 |
1.4 |
0.2 |
setosa |
| 3 |
4.7 |
3.2 |
1.3 |
0.2 |
setosa |
| 4 |
4.6 |
3.1 |
1.5 |
0.2 |
setosa |
| 5 |
5.0 |
3.6 |
1.4 |
0.2 |
setosa |
From wide to long
This is often necessary for plotting routines
|
Species |
variable |
value |
| 0 |
setosa |
Sepal.Length |
5.1 |
| 1 |
setosa |
Sepal.Length |
4.9 |
| 2 |
setosa |
Sepal.Length |
4.7 |
| 3 |
setosa |
Sepal.Length |
4.6 |
| 4 |
setosa |
Sepal.Length |
5.0 |
Pivot
Pivot splits a column into multiple columns
|
pid |
time |
val |
| 0 |
0 |
A |
9.066841 |
| 1 |
1 |
A |
9.113788 |
| 2 |
2 |
A |
10.965594 |
| 3 |
3 |
A |
9.848895 |
| 4 |
4 |
A |
9.194968 |
| 5 |
0 |
B |
10.819055 |
| 6 |
1 |
B |
9.900527 |
| 7 |
2 |
B |
11.007177 |
| 8 |
3 |
B |
9.586152 |
| 9 |
4 |
B |
9.486819 |
|
val |
| pid |
0 |
1 |
2 |
3 |
4 |
| time |
|
|
|
|
|
| A |
9.066841 |
9.113788 |
10.965594 |
9.848895 |
9.194968 |
| B |
10.819055 |
9.900527 |
11.007177 |
9.586152 |
9.486819 |
| time |
A |
B |
| pid |
|
|
| 0 |
9.066841 |
10.819055 |
| 1 |
9.113788 |
9.900527 |
| 2 |
10.965594 |
11.007177 |
| 3 |
9.848895 |
9.586152 |
| 4 |
9.194968 |
9.486819 |
Note that pivot is set_index followed by `unstack
|
|
val |
| pid |
time |
|
| 0 |
A |
9.066841 |
| 1 |
A |
9.113788 |
| 2 |
A |
10.965594 |
| 3 |
A |
9.848895 |
| 4 |
A |
9.194968 |
| 0 |
B |
10.819055 |
| 1 |
B |
9.900527 |
| 2 |
B |
11.007177 |
| 3 |
B |
9.586152 |
| 4 |
B |
9.486819 |
|
val |
| time |
A |
B |
| pid |
|
|
| 0 |
9.066841 |
10.819055 |
| 1 |
9.113788 |
9.900527 |
| 2 |
10.965594 |
11.007177 |
| 3 |
9.848895 |
9.586152 |
| 4 |
9.194968 |
9.486819 |
pivot can be reversed by melt.
| time |
A |
B |
| pid |
|
|
| 0 |
9.066841 |
10.819055 |
| 1 |
9.113788 |
9.900527 |
| 2 |
10.965594 |
11.007177 |
| 3 |
9.848895 |
9.586152 |
| 4 |
9.194968 |
9.486819 |
| time |
pid |
A |
B |
| 0 |
0 |
9.066841 |
10.819055 |
| 1 |
1 |
9.113788 |
9.900527 |
| 2 |
2 |
10.965594 |
11.007177 |
| 3 |
3 |
9.848895 |
9.586152 |
| 4 |
4 |
9.194968 |
9.486819 |
|
pid |
time |
value |
| 0 |
0 |
A |
9.066841 |
| 1 |
1 |
A |
9.113788 |
| 2 |
2 |
A |
10.965594 |
| 3 |
3 |
A |
9.848895 |
| 4 |
4 |
A |
9.194968 |
| 5 |
0 |
B |
10.819055 |
| 6 |
1 |
B |
9.900527 |
| 7 |
2 |
B |
11.007177 |
| 8 |
3 |
B |
9.586152 |
| 9 |
4 |
B |
9.486819 |
Pivot tables and cross-tabulation
Pivot tables can be an alternative to using groupby. They are also
useful for calculating marginals.
|
pid |
grp |
time |
val |
| 0 |
0 |
Case |
A |
9.081981 |
| 1 |
1 |
Control |
A |
9.068653 |
| 2 |
2 |
Case |
A |
9.346933 |
| 3 |
3 |
Control |
A |
8.895022 |
| 4 |
0 |
Case |
B |
11.139738 |
| 5 |
1 |
Control |
B |
10.489634 |
| 6 |
2 |
Case |
B |
9.913781 |
| 7 |
3 |
Control |
B |
9.703207 |
|
|
val |
| pid |
grp |
|
| 0 |
Case |
10.110860 |
| 1 |
Control |
9.779143 |
| 2 |
Case |
9.630357 |
| 3 |
Control |
9.299114 |
|
|
val |
| pid |
time |
|
| 0 |
A |
9.081981 |
| B |
11.139738 |
| 1 |
A |
9.068653 |
| B |
10.489634 |
| 2 |
A |
9.346933 |
| B |
9.913781 |
| 3 |
A |
8.895022 |
| B |
9.703207 |
|
val |
| grp |
Case |
Control |
| pid |
|
|
| 0 |
10.110860 |
NaN |
| 1 |
NaN |
9.779143 |
| 2 |
9.630357 |
NaN |
| 3 |
NaN |
9.299114 |
|
val |
| pid |
|
| 0 |
10.110860 |
| 1 |
9.779143 |
| 2 |
9.630357 |
| 3 |
9.299114 |
|
Sepal.Length |
Sepal.Width |
Petal.Length |
Petal.Width |
Species |
| 1 |
5.1 |
3.5 |
1.4 |
0.2 |
setosa |
| 2 |
4.9 |
3.0 |
1.4 |
0.2 |
setosa |
| 3 |
4.7 |
3.2 |
1.3 |
0.2 |
setosa |
| 4 |
4.6 |
3.1 |
1.5 |
0.2 |
setosa |
| 5 |
5.0 |
3.6 |
1.4 |
0.2 |
setosa |
|
Petal.Length |
Petal.Width |
Sepal.Length |
Sepal.Width |
| Species |
|
|
|
|
| setosa |
1.462 |
0.246 |
5.006 |
3.428 |
| versicolor |
4.260 |
1.326 |
5.936 |
2.770 |
| virginica |
5.552 |
2.026 |
6.588 |
2.974 |
|
Petal.Length |
Petal.Width |
| Species |
|
|
| setosa |
1.462 |
0.246 |
| versicolor |
4.260 |
1.326 |
| virginica |
5.552 |
2.026 |
|
Petal.Length |
Petal.Width |
| Species |
|
|
| setosa |
1.462 |
0.246000 |
| versicolor |
4.260 |
1.326000 |
| virginica |
5.552 |
2.026000 |
| All |
3.758 |
1.199333 |
Simple cross tabulation of margins
|
deaths |
sex |
smoker |
| 0 |
10 |
m |
n |
| 1 |
12 |
f |
n |
| 2 |
15 |
m |
y |
| 3 |
18 |
f |
y |
|
deaths |
| sex |
f |
m |
All |
| smoker |
|
|
|
| n |
12 |
10 |
22 |
| y |
18 |
15 |
33 |
| All |
30 |
25 |
55 |
Cross-tabulation
| col_0 |
alive |
dead |
All |
| row_0 |
|
|
|
| n |
22 |
29 |
51 |
| y |
26 |
23 |
49 |
| All |
48 |
52 |
100 |
We cna do this with pivot_table but it takes a bit more work
|
smoker |
status |
val |
| 0 |
y |
alive |
1.0 |
| 1 |
n |
dead |
1.0 |
| 2 |
n |
alive |
1.0 |
| 3 |
y |
dead |
1.0 |
| 4 |
y |
alive |
1.0 |
|
val |
| status |
alive |
dead |
All |
| smoker |
|
|
|
| n |
22 |
29 |
51 |
| y |
26 |
23 |
49 |
| All |
48 |
52 |
100 |
Merging data
Appending rows
|
pid |
x |
| 0 |
101 |
10 |
| 1 |
102 |
20 |
| 2 |
103 |
30 |
|
pid |
x |
| 0 |
104 |
40 |
| 1 |
105 |
50 |
| 2 |
106 |
60 |
|
pid |
x |
| 0 |
101 |
10 |
| 1 |
102 |
20 |
| 2 |
103 |
30 |
| 0 |
104 |
40 |
| 1 |
105 |
50 |
| 2 |
106 |
60 |
|
pid |
x |
| 0 |
101 |
10 |
| 1 |
102 |
20 |
| 2 |
103 |
30 |
| 3 |
104 |
40 |
| 4 |
105 |
50 |
| 5 |
106 |
60 |
Database style joins
|
pid |
x |
| 0 |
101 |
10 |
| 1 |
102 |
20 |
| 2 |
103 |
30 |
|
pid |
y |
| 0 |
101 |
40 |
| 1 |
102 |
50 |
| 2 |
103 |
60 |
|
smoker |
status |
val |
| 0 |
y |
alive |
1.0 |
| 1 |
n |
dead |
1.0 |
| 2 |
n |
alive |
1.0 |
| 3 |
y |
dead |
1.0 |
| 4 |
y |
alive |
1.0 |
| 5 |
y |
alive |
1.0 |
| 6 |
y |
dead |
1.0 |
| 7 |
n |
alive |
1.0 |
| 8 |
y |
alive |
1.0 |
| 9 |
y |
alive |
1.0 |
| 10 |
y |
alive |
1.0 |
| 11 |
y |
dead |
1.0 |
| 12 |
y |
dead |
1.0 |
| 13 |
n |
dead |
1.0 |
| 14 |
n |
dead |
1.0 |
| 15 |
n |
alive |
1.0 |
| 16 |
y |
dead |
1.0 |
| 17 |
y |
dead |
1.0 |
| 18 |
y |
alive |
1.0 |
| 19 |
y |
dead |
1.0 |
| 20 |
n |
dead |
1.0 |
| 21 |
n |
alive |
1.0 |
| 22 |
n |
dead |
1.0 |
| 23 |
n |
dead |
1.0 |
| 24 |
n |
dead |
1.0 |
| 25 |
y |
dead |
1.0 |
| 26 |
n |
dead |
1.0 |
| 27 |
n |
dead |
1.0 |
| 28 |
y |
alive |
1.0 |
| 29 |
y |
alive |
1.0 |
| ... |
... |
... |
... |
| 70 |
n |
dead |
1.0 |
| 71 |
n |
alive |
1.0 |
| 72 |
n |
alive |
1.0 |
| 73 |
n |
dead |
1.0 |
| 74 |
n |
alive |
1.0 |
| 75 |
n |
dead |
1.0 |
| 76 |
y |
alive |
1.0 |
| 77 |
y |
dead |
1.0 |
| 78 |
n |
dead |
1.0 |
| 79 |
n |
dead |
1.0 |
| 80 |
n |
dead |
1.0 |
| 81 |
n |
alive |
1.0 |
| 82 |
y |
dead |
1.0 |
| 83 |
y |
dead |
1.0 |
| 84 |
y |
dead |
1.0 |
| 85 |
y |
alive |
1.0 |
| 86 |
n |
dead |
1.0 |
| 87 |
n |
dead |
1.0 |
| 88 |
n |
alive |
1.0 |
| 89 |
n |
alive |
1.0 |
| 90 |
y |
alive |
1.0 |
| 91 |
y |
alive |
1.0 |
| 92 |
y |
dead |
1.0 |
| 93 |
n |
dead |
1.0 |
| 94 |
y |
dead |
1.0 |
| 95 |
n |
dead |
1.0 |
| 96 |
n |
alive |
1.0 |
| 97 |
n |
alive |
1.0 |
| 98 |
y |
dead |
1.0 |
| 99 |
y |
dead |
1.0 |
100 rows × 3 columns
Merge joins on a column or columns
If on argument not specified, merge on all columns with same name.
|
pid |
x |
y |
| 0 |
101 |
10 |
40 |
| 1 |
102 |
20 |
50 |
| 2 |
103 |
30 |
60 |
|
pid |
x |
y |
| 0 |
101 |
10 |
70 |
| 1 |
103 |
30 |
80 |
|
pid |
x |
y |
| 0 |
101 |
10 |
70.0 |
| 1 |
102 |
20 |
NaN |
| 2 |
103 |
30 |
80.0 |
|
pid |
x |
y |
| 0 |
101 |
10.0 |
70 |
| 1 |
103 |
30.0 |
80 |
| 2 |
105 |
NaN |
90 |
|
pid |
x |
y |
| 0 |
101 |
10.0 |
70.0 |
| 1 |
102 |
20.0 |
NaN |
| 2 |
103 |
30.0 |
80.0 |
| 3 |
105 |
NaN |
90.0 |
Joining on common index
|
x |
| pid |
|
| 101 |
10 |
| 102 |
20 |
| 103 |
30 |
|
y |
| pid |
|
| 101 |
70 |
| 103 |
80 |
| 105 |
90 |
|
z |
| pid |
|
| 101 |
40 |
| 102 |
50 |
| 103 |
60 |
If the data frames share a common index, join can combine mulitple
data frames at once.
|
x |
y |
z |
| pid |
|
|
|
| 101 |
10.0 |
70.0 |
40.0 |
| 102 |
20.0 |
NaN |
50.0 |
| 103 |
30.0 |
80.0 |
60.0 |
| 105 |
NaN |
90.0 |
NaN |