In [1]:
import numpy as np
import pandas as pd
Data Manipulation¶
(1) Combining data sets¶
Combining rows¶
In [2]:
df_versiocolor = pd.read_csv('data/versicolor.csv')
df_virginica = pd.read_csv('data/virginica.csv')
df_sertosa = pd.read_csv('data/setosa.csv')
dfs = [df_versiocolor, df_virginica, df_sertosa]
In [3]:
df_versiocolor.head(3)
Out[3]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
0 | 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
1 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
2 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
In [4]:
df_virginica.head(3)
Out[4]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
0 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
1 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
2 | 7.1 | 3.0 | 5.9 | 2.1 | virginica |
In [5]:
df_sertosa.head(3)
Out[5]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
In [6]:
for df in dfs:
print(df.shape)
(50, 5)
(50, 5)
(50, 5)
Each DataFrame only contains data about one species of iris¶
In [7]:
for df in dfs:
print(df.Species.unique())
['versicolor']
['virginica']
['setosa']
Combine with concat
¶
We can append rows of one data frame to another using concat
. Here
axis = 0
means combining by rows, in contrast to axis = 1
which
means combining by columns.
In [8]:
df = pd.concat(dfs, axis=0)
df.shape
Out[8]:
(150, 5)
Note that the indices are maintained¶
Here we have to use iloc
because ix
and loc
work with the
integer index values, not the positions, and there are no values equal
to 50 or 51.
In [9]:
df.iloc[48:52]
Out[9]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
48 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor |
49 | 5.7 | 2.8 | 4.1 | 1.3 | versicolor |
0 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
1 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
We can ask for a new consecutive index¶
In [10]:
df = df.reset_index(drop=True)
In [11]:
df.iloc[48:52]
Out[11]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
48 | 5.1 | 2.5 | 3.0 | 1.1 | versicolor |
49 | 5.7 | 2.8 | 4.1 | 1.3 | versicolor |
50 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
51 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
Combined DataFrame contains all 3 species¶
In [12]:
df.Species.unique()
Out[12]:
array(['versicolor', 'virginica', 'setosa'], dtype=object)
Combining columns¶
When combining rows, we are usually just adding new observations with the same variables and there is little ambiguity about what we are trying to do.
When combining columns, we are usually trying to merge information based
on some unique identifier, for example, to combine clinical and
laboratory information for specific patients with unique PIDs. Often the
ordering of the PIDs in the two data frames are not exactly lined up so
“horizontal stacking” will not work. Usually we need to merge
the
data on a unique identifier from one or more columns.
Simplest case - rows line up perfectly across data frames¶
In [13]:
df_sepal = pd.read_csv('data/sepal.csv')
df_petal = pd.read_csv('data/petal.csv')
In [14]:
df_sepal.head(3)
Out[14]:
Species | Sepal.Length | Sepal.Width | |
---|---|---|---|
0 | setosa | 5.1 | 3.5 |
1 | setosa | 4.9 | 3.0 |
2 | setosa | 4.7 | 3.2 |
In [15]:
df_petal.head(3)
Out[15]:
Species | Petal.Length | Petal.Width | |
---|---|---|---|
0 | setosa | 1.4 | 0.2 |
1 | setosa | 1.4 | 0.2 |
2 | setosa | 1.3 | 0.2 |
In [16]:
df_sepal.shape, df_petal.shape
Out[16]:
((150, 3), (150, 3))
Using concat
¶
In [17]:
pd.concat([df_sepal, df_petal], axis=1).head()
Out[17]:
Species | Sepal.Length | Sepal.Width | Species | Petal.Length | Petal.Width | |
---|---|---|---|---|---|---|
0 | setosa | 5.1 | 3.5 | setosa | 1.4 | 0.2 |
1 | setosa | 4.9 | 3.0 | setosa | 1.4 | 0.2 |
2 | setosa | 4.7 | 3.2 | setosa | 1.3 | 0.2 |
3 | setosa | 4.6 | 3.1 | setosa | 1.5 | 0.2 |
4 | setosa | 5.0 | 3.6 | setosa | 1.4 | 0.2 |
Joining on a single unique column¶
Combining values for the same subject across different variables.
Make up dummy data sets for illustration¶
In [18]:
pid1 = np.random.choice(100, 6, replace=False)
val1 = np.random.normal(10, 1, 6)
df1 = pd.DataFrame({'pid': pid1, 'val': val1})
df1
Out[18]:
pid | val | |
---|---|---|
0 | 4 | 8.115657 |
1 | 22 | 9.385071 |
2 | 83 | 12.240448 |
3 | 90 | 9.173178 |
4 | 98 | 10.119245 |
5 | 85 | 10.476193 |
In [19]:
pid2 = np.random.permutation(pid1)
val2 = np.random.normal(15, 1, 6)
df2 = pd.DataFrame({'pid': pid2, 'val': val2})
df2
Out[19]:
pid | val | |
---|---|---|
0 | 98 | 15.945742 |
1 | 22 | 14.873965 |
2 | 90 | 16.505458 |
3 | 85 | 13.775417 |
4 | 83 | 14.690199 |
5 | 4 | 14.749047 |
Using merge to join on columns containing unique information¶
Note that the pid
order is not the same, so we cannot simply stack
horizontally.
In [20]:
pd.merge(df1, df2, on='pid', suffixes=['_visit_1', '_visit_2'])
Out[20]:
pid | val_visit_1 | val_visit_2 | |
---|---|---|---|
0 | 4 | 8.115657 | 14.749047 |
1 | 22 | 9.385071 | 14.873965 |
2 | 83 | 12.240448 | 14.690199 |
3 | 90 | 9.173178 | 16.505458 |
4 | 98 | 10.119245 | 15.945742 |
5 | 85 | 10.476193 | 13.775417 |
In [21]:
df1a = df1.set_index('pid')
df2a = df2.set_index('pid')
Using merge on multiple columns¶
Sometimes we need multiple columns to define a unique identifier.
Note: In the data frames being merged, not all the unique identifiers need to be shared in both data frames. For example, when merging laboratory and clinical data, a patient may have laboratory results but no clinical results (perhaps due to a typo) or have clinical data but no laboratory data (because results are still being processed by the lab). We illustrate different ways to resolve the merge in this scenario in the example below.
In [22]:
df1['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df1 = df1[['pid', 'stim', 'val']]
df1
Out[22]:
pid | stim | val | |
---|---|---|---|
0 | 4 | cmv | 8.115657 |
1 | 22 | flu | 9.385071 |
2 | 83 | cmv | 12.240448 |
3 | 90 | flu | 9.173178 |
4 | 98 | flu | 10.119245 |
5 | 85 | cmv | 10.476193 |
In [23]:
df2['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df2 = df2[['pid', 'stim', 'val']]
df2
Out[23]:
pid | stim | val | |
---|---|---|---|
0 | 98 | cmv | 15.945742 |
1 | 22 | flu | 14.873965 |
2 | 90 | flu | 16.505458 |
3 | 85 | cmv | 13.775417 |
4 | 83 | cmv | 14.690199 |
5 | 4 | cmv | 14.749047 |
In [24]:
pd.merge(df1, df2, on = ['pid', 'stim'], suffixes = ['_visit_1', '_visit_2'])
Out[24]:
pid | stim | val_visit_1 | val_visit_2 | |
---|---|---|---|---|
0 | 4 | cmv | 8.115657 | 14.749047 |
1 | 22 | flu | 9.385071 | 14.873965 |
2 | 83 | cmv | 12.240448 | 14.690199 |
3 | 90 | flu | 9.173178 | 16.505458 |
4 | 85 | cmv | 10.476193 | 13.775417 |
In [25]:
pd.merge(df1, df2, on = ['pid', 'stim'], how = 'left', suffixes = ['_visit_1', '_visit_2'])
Out[25]:
pid | stim | val_visit_1 | val_visit_2 | |
---|---|---|---|---|
0 | 4 | cmv | 8.115657 | 14.749047 |
1 | 22 | flu | 9.385071 | 14.873965 |
2 | 83 | cmv | 12.240448 | 14.690199 |
3 | 90 | flu | 9.173178 | 16.505458 |
4 | 98 | flu | 10.119245 | NaN |
5 | 85 | cmv | 10.476193 | 13.775417 |
In [26]:
pd.merge(df1, df2, on = ['pid', 'stim'], how = 'right', suffixes = ['_visit_1', '_visit_2'])
Out[26]:
pid | stim | val_visit_1 | val_visit_2 | |
---|---|---|---|---|
0 | 4.0 | cmv | 8.115657 | 14.749047 |
1 | 22.0 | flu | 9.385071 | 14.873965 |
2 | 83.0 | cmv | 12.240448 | 14.690199 |
3 | 90.0 | flu | 9.173178 | 16.505458 |
4 | 85.0 | cmv | 10.476193 | 13.775417 |
5 | 98.0 | cmv | NaN | 15.945742 |
In [27]:
pd.merge(df1, df2, on = ['pid', 'stim'], how = 'outer', suffixes = ['_visit_1', '_visit_2'])
Out[27]:
pid | stim | val_visit_1 | val_visit_2 | |
---|---|---|---|---|
0 | 4.0 | cmv | 8.115657 | 14.749047 |
1 | 22.0 | flu | 9.385071 | 14.873965 |
2 | 83.0 | cmv | 12.240448 | 14.690199 |
3 | 90.0 | flu | 9.173178 | 16.505458 |
4 | 98.0 | flu | 10.119245 | NaN |
5 | 85.0 | cmv | 10.476193 | 13.775417 |
6 | 98.0 | cmv | NaN | 15.945742 |
Using join
when indexes are the keys¶
In [28]:
df1a
Out[28]:
val | |
---|---|
pid | |
4 | 8.115657 |
22 | 9.385071 |
83 | 12.240448 |
90 | 9.173178 |
98 | 10.119245 |
85 | 10.476193 |
In [29]:
df2a
Out[29]:
val | |
---|---|
pid | |
98 | 15.945742 |
22 | 14.873965 |
90 | 16.505458 |
85 | 13.775417 |
83 | 14.690199 |
4 | 14.749047 |
In [30]:
df1a.join(df2a, lsuffix='_1', rsuffix='_2')
Out[30]:
val_1 | val_2 | |
---|---|---|
pid | ||
4 | 8.115657 | 14.749047 |
22 | 9.385071 | 14.873965 |
83 | 12.240448 | 14.690199 |
90 | 9.173178 | 16.505458 |
98 | 10.119245 | 15.945742 |
85 | 10.476193 | 13.775417 |
(2) Separate multiple values in a single column¶
Remember that for a tidy data frame, each column must contain a single variable.
In [31]:
from collections import OrderedDict
In [32]:
d = OrderedDict()
d['pid-visit-stim'] = ['1-1-cmv', '1-1-hiv', '1-2-cmv', '1-2-hiv', '1-3-cmv', '1-3-hiv', '2-1-cmv', '2-1-hiv', '2-2-cmv', '2-2-hiv']
d['tnf'] = [1.0, 2.0, 1.1, 2.1, 1.2, 2.2, 3, 4, 3.1, 4.1]
d['ifn'] = [11.0, 12.0, 11.1, 12.1, 11.2, 12.2, 13, 14, 13.1, 14.1]
d['il2'] = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.1, 0.1]
df = pd.DataFrame(d)
In [33]:
df.head()
Out[33]:
pid-visit-stim | tnf | ifn | il2 | |
---|---|---|---|---|
0 | 1-1-cmv | 1.0 | 11.0 | 0.0 |
1 | 1-1-hiv | 2.0 | 12.0 | 0.0 |
2 | 1-2-cmv | 1.1 | 11.1 | 0.1 |
3 | 1-2-hiv | 2.1 | 12.1 | 0.1 |
4 | 1-3-cmv | 1.2 | 11.2 | 0.2 |
In [34]:
df1 = pd.DataFrame(df['pid-visit-stim'].str.split('-').tolist(),
columns = ['pid', 'visit', 'stim'])
df1.head()
Out[34]:
pid | visit | stim | |
---|---|---|---|
0 | 1 | 1 | cmv |
1 | 1 | 1 | hiv |
2 | 1 | 2 | cmv |
3 | 1 | 2 | hiv |
4 | 1 | 3 | cmv |
In [35]:
df1 = pd.concat([df1, df], axis=1)
df1.head()
Out[35]:
pid | visit | stim | pid-visit-stim | tnf | ifn | il2 | |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | cmv | 1-1-cmv | 1.0 | 11.0 | 0.0 |
1 | 1 | 1 | hiv | 1-1-hiv | 2.0 | 12.0 | 0.0 |
2 | 1 | 2 | cmv | 1-2-cmv | 1.1 | 11.1 | 0.1 |
3 | 1 | 2 | hiv | 1-2-hiv | 2.1 | 12.1 | 0.1 |
4 | 1 | 3 | cmv | 1-3-cmv | 1.2 | 11.2 | 0.2 |
In [36]:
df1.drop('pid-visit-stim', axis=1, inplace=True)
df1.head()
Out[36]:
pid | visit | stim | tnf | ifn | il2 | |
---|---|---|---|---|---|---|
0 | 1 | 1 | cmv | 1.0 | 11.0 | 0.0 |
1 | 1 | 1 | hiv | 2.0 | 12.0 | 0.0 |
2 | 1 | 2 | cmv | 1.1 | 11.1 | 0.1 |
3 | 1 | 2 | hiv | 2.1 | 12.1 | 0.1 |
4 | 1 | 3 | cmv | 1.2 | 11.2 | 0.2 |
In [37]:
def separate(df, column, sep):
df1 = pd.DataFrame(df[column].str.split(sep).tolist(), columns = column.split(sep))
df1 = pd.concat([df1, df], axis=1)
return df1.drop(column, axis = 1)
In [38]:
separate(df, 'pid-visit-stim', '-').head()
Out[38]:
pid | visit | stim | tnf | ifn | il2 | |
---|---|---|---|---|---|---|
0 | 1 | 1 | cmv | 1.0 | 11.0 | 0.0 |
1 | 1 | 1 | hiv | 2.0 | 12.0 | 0.0 |
2 | 1 | 2 | cmv | 1.1 | 11.1 | 0.1 |
3 | 1 | 2 | hiv | 2.1 | 12.1 | 0.1 |
4 | 1 | 3 | cmv | 1.2 | 11.2 | 0.2 |
(3) Reshaping DataFrames¶
In [39]:
d = OrderedDict()
d['pid'] = ['1', '1', '1', '1', '1', '1', '2', '2', '2', '2']
d['visit'] = ['1', '1', '2', '2', '3', '3', '1', '1', '2', '2']
d['stim'] = ['cmv', 'hiv', 'cmv', 'hiv', 'cmv', 'hiv', 'cmv', 'hiv', 'cmv', 'hiv']
d['tnf'] = [1.0, 2.0, 1.1, 2.1, 1.2, 2.2, 3, 4, 3.1, 4.1]
d['ifn'] = [11.0, 12.0, 11.1, 12.1, 11.2, 12.2, 13, 14, 13.1, 14.1]
d['il2'] = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.1, 0.1]
df = pd.DataFrame(d)
In [40]:
df.head()
Out[40]:
pid | visit | stim | tnf | ifn | il2 | |
---|---|---|---|---|---|---|
0 | 1 | 1 | cmv | 1.0 | 11.0 | 0.0 |
1 | 1 | 1 | hiv | 2.0 | 12.0 | 0.0 |
2 | 1 | 2 | cmv | 1.1 | 11.1 | 0.1 |
3 | 1 | 2 | hiv | 2.1 | 12.1 | 0.1 |
4 | 1 | 3 | cmv | 1.2 | 11.2 | 0.2 |
Wide to long¶
Suppose it is more convenient to consider the type of cytokine as new variable. We need to put the name of the cytokine in one column and its value in another - going from 3 to 2 columns (wide to long).
We use teh melt
method and specify the identifier (columns that stay the same) and value variables.¶
In [41]:
long1 = pd.melt(df, id_vars = ['pid', 'stim', 'visit'],
value_vars = ['tnf', 'ifn', 'il2'])
long1.sample(6)
Out[41]:
pid | stim | visit | variable | value | |
---|---|---|---|---|---|
5 | 1 | hiv | 3 | tnf | 2.2 |
8 | 2 | cmv | 2 | tnf | 3.1 |
10 | 1 | cmv | 1 | ifn | 11.0 |
15 | 1 | hiv | 3 | ifn | 12.2 |
17 | 2 | hiv | 1 | ifn | 14.0 |
23 | 1 | hiv | 2 | il2 | 0.1 |
By default, all non-id variabels are treated as values¶
In [42]:
long2 = pd.melt(df, id_vars =['pid', 'stim', 'visit'])
long2.sample(6)
Out[42]:
pid | stim | visit | variable | value | |
---|---|---|---|---|---|
17 | 2 | hiv | 1 | ifn | 14.0 |
20 | 1 | cmv | 1 | il2 | 0.0 |
2 | 1 | cmv | 2 | tnf | 1.1 |
18 | 2 | cmv | 2 | ifn | 13.1 |
8 | 2 | cmv | 2 | tnf | 3.1 |
4 | 1 | cmv | 3 | tnf | 1.2 |
We can exclude columns we are not intersted in¶
In [43]:
long3 = pd.melt(df, id_vars = ['pid', 'stim', 'visit'],
value_vars = ['tnf', 'il2'])
long3.sample(6)
Out[43]:
pid | stim | visit | variable | value | |
---|---|---|---|---|---|
18 | 2 | cmv | 2 | il2 | 0.1 |
4 | 1 | cmv | 3 | tnf | 1.2 |
10 | 1 | cmv | 1 | il2 | 0.0 |
0 | 1 | cmv | 1 | tnf | 1.0 |
16 | 2 | cmv | 1 | il2 | 0.1 |
6 | 2 | cmv | 1 | tnf | 3.0 |
Long to Wide¶
There is no function in pandas
to undo the wide_to_long
operation, and the details are tricky, so I have written a small
function to do this.
In [44]:
def long_to_wide(df, index):
df = df.set_index(index).unstack().reset_index()
cols = [t[1] if t[1] else t[0] for t in df.columns]
df.columns = cols
return df
In [45]:
wide1 = long_to_wide(long1, ['pid', 'stim', 'visit', 'variable'])
wide1.sample(6)
Out[45]:
pid | stim | visit | ifn | il2 | tnf | |
---|---|---|---|---|---|---|
1 | 1 | cmv | 2 | 11.1 | 0.1 | 1.1 |
4 | 1 | hiv | 2 | 12.1 | 0.1 | 2.1 |
2 | 1 | cmv | 3 | 11.2 | 0.2 | 1.2 |
3 | 1 | hiv | 1 | 12.0 | 0.0 | 2.0 |
9 | 2 | hiv | 2 | 14.1 | 0.1 | 4.1 |
5 | 1 | hiv | 3 | 12.2 | 0.2 | 2.2 |
In [46]:
wide2 = long_to_wide(long2, ['pid', 'stim', 'visit', 'variable'])
wide2.sample(6)
Out[46]:
pid | stim | visit | ifn | il2 | tnf | |
---|---|---|---|---|---|---|
0 | 1 | cmv | 1 | 11.0 | 0.0 | 1.0 |
4 | 1 | hiv | 2 | 12.1 | 0.1 | 2.1 |
5 | 1 | hiv | 3 | 12.2 | 0.2 | 2.2 |
2 | 1 | cmv | 3 | 11.2 | 0.2 | 1.2 |
6 | 2 | cmv | 1 | 13.0 | 0.1 | 3.0 |
8 | 2 | hiv | 1 | 14.0 | 0.3 | 4.0 |
In [47]:
wide3 = long_to_wide(long3, ['pid', 'stim', 'visit', 'variable'])
wide3.sample(6)
Out[47]:
pid | stim | visit | il2 | tnf | |
---|---|---|---|---|---|
5 | 1 | hiv | 3 | 0.2 | 2.2 |
4 | 1 | hiv | 2 | 0.1 | 2.1 |
8 | 2 | hiv | 1 | 0.3 | 4.0 |
6 | 2 | cmv | 1 | 0.1 | 3.0 |
3 | 1 | hiv | 1 | 0.0 | 2.0 |
9 | 2 | hiv | 2 | 0.1 | 4.1 |
(4) Hierarchical Indexes (More advanced material)¶
Sometimes your data is best considered as stacks of 2D DataFrames, for
example, when each patient has laboratory data for multiple visits. The
handling of such “3D” or even higher-dimensional data is handled in
pandas
by the sue of hierarchical indexes.
In [48]:
df.head()
Out[48]:
pid | visit | stim | tnf | ifn | il2 | |
---|---|---|---|---|---|---|
0 | 1 | 1 | cmv | 1.0 | 11.0 | 0.0 |
1 | 1 | 1 | hiv | 2.0 | 12.0 | 0.0 |
2 | 1 | 2 | cmv | 1.1 | 11.1 | 0.1 |
3 | 1 | 2 | hiv | 2.1 | 12.1 | 0.1 |
4 | 1 | 3 | cmv | 1.2 | 11.2 | 0.2 |
In [49]:
df1 = df.set_index(['pid', 'stim', 'visit'])
df1
Out[49]:
tnf | ifn | il2 | |||
---|---|---|---|---|---|
pid | stim | visit | |||
1 | cmv | 1 | 1.0 | 11.0 | 0.0 |
hiv | 1 | 2.0 | 12.0 | 0.0 | |
cmv | 2 | 1.1 | 11.1 | 0.1 | |
hiv | 2 | 2.1 | 12.1 | 0.1 | |
cmv | 3 | 1.2 | 11.2 | 0.2 | |
hiv | 3 | 2.2 | 12.2 | 0.2 | |
2 | cmv | 1 | 3.0 | 13.0 | 0.1 |
hiv | 1 | 4.0 | 14.0 | 0.3 | |
cmv | 2 | 3.1 | 13.1 | 0.1 | |
hiv | 2 | 4.1 | 14.1 | 0.1 |
Indexing for multi-index¶
With the multi-index, each “cell” is now a block of values for the combinations (pid, stim, visit).
Find TNF values¶
In [50]:
df1[['tnf']]
Out[50]:
tnf | |||
---|---|---|---|
pid | stim | visit | |
1 | cmv | 1 | 1.0 |
hiv | 1 | 2.0 | |
cmv | 2 | 1.1 | |
hiv | 2 | 2.1 | |
cmv | 3 | 1.2 | |
hiv | 3 | 2.2 | |
2 | cmv | 1 | 3.0 |
hiv | 1 | 4.0 | |
cmv | 2 | 3.1 | |
hiv | 2 | 4.1 |
Find all values for Subject 2¶
In [51]:
df1.ix['2']
Out[51]:
tnf | ifn | il2 | ||
---|---|---|---|---|
stim | visit | |||
cmv | 1 | 3.0 | 13.0 | 0.1 |
hiv | 1 | 4.0 | 14.0 | 0.3 |
cmv | 2 | 3.1 | 13.1 | 0.1 |
hiv | 2 | 4.1 | 14.1 | 0.1 |
Find TNF values for subject 2¶
In [52]:
df1.ix['2', ['tnf']]
Out[52]:
tnf | ||
---|---|---|
stim | visit | |
cmv | 1 | 3.0 |
hiv | 1 | 4.0 |
cmv | 2 | 3.1 |
hiv | 2 | 4.1 |
Flattening hierarchicla indexes into column variables¶
To actually undo, you would have to save the result to a variable.
In [53]:
df1.reset_index()
Out[53]:
pid | stim | visit | tnf | ifn | il2 | |
---|---|---|---|---|---|---|
0 | 1 | cmv | 1 | 1.0 | 11.0 | 0.0 |
1 | 1 | hiv | 1 | 2.0 | 12.0 | 0.0 |
2 | 1 | cmv | 2 | 1.1 | 11.1 | 0.1 |
3 | 1 | hiv | 2 | 2.1 | 12.1 | 0.1 |
4 | 1 | cmv | 3 | 1.2 | 11.2 | 0.2 |
5 | 1 | hiv | 3 | 2.2 | 12.2 | 0.2 |
6 | 2 | cmv | 1 | 3.0 | 13.0 | 0.1 |
7 | 2 | hiv | 1 | 4.0 | 14.0 | 0.3 |
8 | 2 | cmv | 2 | 3.1 | 13.1 | 0.1 |
9 | 2 | hiv | 2 | 4.1 | 14.1 | 0.1 |
Moving parts of an index into columns with unstack
¶
Move pid from column to row¶
In [54]:
df1.unstack('pid')
Out[54]:
tnf | ifn | il2 | |||||
---|---|---|---|---|---|---|---|
pid | 1 | 2 | 1 | 2 | 1 | 2 | |
stim | visit | ||||||
cmv | 1 | 1.0 | 3.0 | 11.0 | 13.0 | 0.0 | 0.1 |
2 | 1.1 | 3.1 | 11.1 | 13.1 | 0.1 | 0.1 | |
3 | 1.2 | NaN | 11.2 | NaN | 0.2 | NaN | |
hiv | 1 | 2.0 | 4.0 | 12.0 | 14.0 | 0.0 | 0.3 |
2 | 2.1 | 4.1 | 12.1 | 14.1 | 0.1 | 0.1 | |
3 | 2.2 | NaN | 12.2 | NaN | 0.2 | NaN |
In [55]:
df1.unstack(0)
Out[55]:
tnf | ifn | il2 | |||||
---|---|---|---|---|---|---|---|
pid | 1 | 2 | 1 | 2 | 1 | 2 | |
stim | visit | ||||||
cmv | 1 | 1.0 | 3.0 | 11.0 | 13.0 | 0.0 | 0.1 |
2 | 1.1 | 3.1 | 11.1 | 13.1 | 0.1 | 0.1 | |
3 | 1.2 | NaN | 11.2 | NaN | 0.2 | NaN | |
hiv | 1 | 2.0 | 4.0 | 12.0 | 14.0 | 0.0 | 0.3 |
2 | 2.1 | 4.1 | 12.1 | 14.1 | 0.1 | 0.1 | |
3 | 2.2 | NaN | 12.2 | NaN | 0.2 | NaN |
Move stim from column to row¶
In [56]:
df1.unstack('stim')
Out[56]:
tnf | ifn | il2 | |||||
---|---|---|---|---|---|---|---|
stim | cmv | hiv | cmv | hiv | cmv | hiv | |
pid | visit | ||||||
1 | 1 | 1.0 | 2.0 | 11.0 | 12.0 | 0.0 | 0.0 |
2 | 1.1 | 2.1 | 11.1 | 12.1 | 0.1 | 0.1 | |
3 | 1.2 | 2.2 | 11.2 | 12.2 | 0.2 | 0.2 | |
2 | 1 | 3.0 | 4.0 | 13.0 | 14.0 | 0.1 | 0.3 |
2 | 3.1 | 4.1 | 13.1 | 14.1 | 0.1 | 0.1 |
In [57]:
df1.unstack(1)
Out[57]:
tnf | ifn | il2 | |||||
---|---|---|---|---|---|---|---|
stim | cmv | hiv | cmv | hiv | cmv | hiv | |
pid | visit | ||||||
1 | 1 | 1.0 | 2.0 | 11.0 | 12.0 | 0.0 | 0.0 |
2 | 1.1 | 2.1 | 11.1 | 12.1 | 0.1 | 0.1 | |
3 | 1.2 | 2.2 | 11.2 | 12.2 | 0.2 | 0.2 | |
2 | 1 | 3.0 | 4.0 | 13.0 | 14.0 | 0.1 | 0.3 |
2 | 3.1 | 4.1 | 13.1 | 14.1 | 0.1 | 0.1 |
Move pid and stim from column to row¶
In [58]:
df1.unstack(['pid', 'stim'])
Out[58]:
tnf | ifn | il2 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
pid | 1 | 2 | 1 | 2 | 1 | 2 | ||||||
stim | cmv | hiv | cmv | hiv | cmv | hiv | cmv | hiv | cmv | hiv | cmv | hiv |
visit | ||||||||||||
1 | 1.0 | 2.0 | 3.0 | 4.0 | 11.0 | 12.0 | 13.0 | 14.0 | 0.0 | 0.0 | 0.1 | 0.3 |
2 | 1.1 | 2.1 | 3.1 | 4.1 | 11.1 | 12.1 | 13.1 | 14.1 | 0.1 | 0.1 | 0.1 | 0.1 |
3 | 1.2 | 2.2 | NaN | NaN | 11.2 | 12.2 | NaN | NaN | 0.2 | 0.2 | NaN | NaN |
Moving columns to a multi-index¶
In [59]:
df1.head()
Out[59]:
tnf | ifn | il2 | |||
---|---|---|---|---|---|
pid | stim | visit | |||
1 | cmv | 1 | 1.0 | 11.0 | 0.0 |
hiv | 1 | 2.0 | 12.0 | 0.0 | |
cmv | 2 | 1.1 | 11.1 | 0.1 | |
hiv | 2 | 2.1 | 12.1 | 0.1 | |
cmv | 3 | 1.2 | 11.2 | 0.2 |
In [60]:
df2 = df1.unstack(['pid', 'stim'])
df2.head()
Out[60]:
tnf | ifn | il2 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
pid | 1 | 2 | 1 | 2 | 1 | 2 | ||||||
stim | cmv | hiv | cmv | hiv | cmv | hiv | cmv | hiv | cmv | hiv | cmv | hiv |
visit | ||||||||||||
1 | 1.0 | 2.0 | 3.0 | 4.0 | 11.0 | 12.0 | 13.0 | 14.0 | 0.0 | 0.0 | 0.1 | 0.3 |
2 | 1.1 | 2.1 | 3.1 | 4.1 | 11.1 | 12.1 | 13.1 | 14.1 | 0.1 | 0.1 | 0.1 | 0.1 |
3 | 1.2 | 2.2 | NaN | NaN | 11.2 | 12.2 | NaN | NaN | 0.2 | 0.2 | NaN | NaN |
In [61]:
df3 = df2.stack(['pid', 'stim'])
df3.head()
Out[61]:
ifn | il2 | tnf | |||
---|---|---|---|---|---|
visit | pid | stim | |||
1 | 1 | cmv | 11.0 | 0.0 | 1.0 |
hiv | 12.0 | 0.0 | 2.0 | ||
2 | cmv | 13.0 | 0.1 | 3.0 | |
hiv | 14.0 | 0.3 | 4.0 | ||
2 | 1 | cmv | 11.1 | 0.1 | 1.1 |
Exercises¶
You are given three DataFrames d1
, d2
, and d3
. Answer the
questions below.
In [62]:
d1 = pd.DataFrame({'subject': ['a', 'b', 'c', 'd'], 'group': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
d2 = pd.DataFrame({'subject': ['w', 'x', 'y', 'z'], 'group': ['A', 'B', 'C', 'D'], 'value': [2, 1, 4, 3]})
d3 = pd.DataFrame({'subject': ['a', 'x', 'w', 'd', 'b', 'c', 'y', 'z'], 'age': [10, 20, 30, 40, 50, 60, 70, 80]})
In [63]:
d1
Out[63]:
group | subject | value | |
---|---|---|---|
0 | A | a | 1 |
1 | B | b | 2 |
2 | C | c | 3 |
3 | D | d | 4 |
In [64]:
d2
Out[64]:
group | subject | value | |
---|---|---|---|
0 | A | w | 2 |
1 | B | x | 1 |
2 | C | y | 4 |
3 | D | z | 3 |
In [65]:
d3
Out[65]:
age | subject | |
---|---|---|
0 | 10 | a |
1 | 20 | x |
2 | 30 | w |
3 | 40 | d |
4 | 50 | b |
5 | 60 | c |
6 | 70 | y |
7 | 80 | z |
1. Combine d1
and d2
to create a DataFrame called df1
with 8 rows and 3 columns.
In [69]:
df1 = pd.concat([d1, d2])
df1
Out[69]:
group | subject | value | |
---|---|---|---|
0 | A | a | 1 |
1 | B | b | 2 |
2 | C | c | 3 |
3 | D | d | 4 |
0 | A | w | 2 |
1 | B | x | 1 |
2 | C | y | 4 |
3 | D | z | 3 |
2. Combine df1
with d3
to create a DataFrame called df2
with 8 rows and 4 columns.
In [71]:
df2 = pd.merge(df1, d3)
df2
Out[71]:
group | subject | value | age | |
---|---|---|---|---|
0 | A | a | 1 | 10 |
1 | B | b | 2 | 50 |
2 | C | c | 3 | 60 |
3 | D | d | 4 | 40 |
4 | A | w | 2 | 30 |
5 | B | x | 1 | 20 |
6 | C | y | 4 | 70 |
7 | D | z | 3 | 80 |