In [1]:
import numpy as np
import pandas as pd

Data Manipulation

(1) Combining data sets

Combining rows

In [2]:
df_versiocolor = pd.read_csv('data/versicolor.csv')
df_virginica = pd.read_csv('data/virginica.csv')
df_sertosa = pd.read_csv('data/setosa.csv')
dfs = [df_versiocolor, df_virginica, df_sertosa]
In [3]:
df_versiocolor.head(3)
Out[3]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 7.0 3.2 4.7 1.4 versicolor
1 6.4 3.2 4.5 1.5 versicolor
2 6.9 3.1 4.9 1.5 versicolor
In [4]:
df_virginica.head(3)
Out[4]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 6.3 3.3 6.0 2.5 virginica
1 5.8 2.7 5.1 1.9 virginica
2 7.1 3.0 5.9 2.1 virginica
In [5]:
df_sertosa.head(3)
Out[5]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
In [6]:
for df in dfs:
    print(df.shape)
(50, 5)
(50, 5)
(50, 5)

Each DataFrame only contains data about one species of iris

In [7]:
for df in dfs:
    print(df.Species.unique())
['versicolor']
['virginica']
['setosa']

Combine with concat

We can append rows of one data frame to another using concat. Here axis = 0 means combining by rows, in contrast to axis = 1 which means combining by columns.

In [8]:
df = pd.concat(dfs, axis=0)
df.shape
Out[8]:
(150, 5)

Note that the indices are maintained

Here we have to use iloc because ix and loc work with the integer index values, not the positions, and there are no values equal to 50 or 51.

In [9]:
df.iloc[48:52]
Out[9]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
48 5.1 2.5 3.0 1.1 versicolor
49 5.7 2.8 4.1 1.3 versicolor
0 6.3 3.3 6.0 2.5 virginica
1 5.8 2.7 5.1 1.9 virginica

We can ask for a new consecutive index

In [10]:
df = df.reset_index(drop=True)
In [11]:
df.iloc[48:52]
Out[11]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
48 5.1 2.5 3.0 1.1 versicolor
49 5.7 2.8 4.1 1.3 versicolor
50 6.3 3.3 6.0 2.5 virginica
51 5.8 2.7 5.1 1.9 virginica

Combined DataFrame contains all 3 species

In [12]:
df.Species.unique()
Out[12]:
array(['versicolor', 'virginica', 'setosa'], dtype=object)

Combining columns

When combining rows, we are usually just adding new observations with the same variables and there is little ambiguity about what we are trying to do.

When combining columns, we are usually trying to merge information based on some unique identifier, for example, to combine clinical and laboratory information for specific patients with unique PIDs. Often the ordering of the PIDs in the two data frames are not exactly lined up so “horizontal stacking” will not work. Usually we need to merge the data on a unique identifier from one or more columns.

Simplest case - rows line up perfectly across data frames

In [13]:
df_sepal = pd.read_csv('data/sepal.csv')
df_petal = pd.read_csv('data/petal.csv')
In [14]:
df_sepal.head(3)
Out[14]:
Species Sepal.Length Sepal.Width
0 setosa 5.1 3.5
1 setosa 4.9 3.0
2 setosa 4.7 3.2
In [15]:
df_petal.head(3)
Out[15]:
Species Petal.Length Petal.Width
0 setosa 1.4 0.2
1 setosa 1.4 0.2
2 setosa 1.3 0.2
In [16]:
df_sepal.shape, df_petal.shape
Out[16]:
((150, 3), (150, 3))

Using concat

In [17]:
pd.concat([df_sepal, df_petal], axis=1).head()
Out[17]:
Species Sepal.Length Sepal.Width Species Petal.Length Petal.Width
0 setosa 5.1 3.5 setosa 1.4 0.2
1 setosa 4.9 3.0 setosa 1.4 0.2
2 setosa 4.7 3.2 setosa 1.3 0.2
3 setosa 4.6 3.1 setosa 1.5 0.2
4 setosa 5.0 3.6 setosa 1.4 0.2

Joining on a single unique column

Combining values for the same subject across different variables.

Make up dummy data sets for illustration

In [18]:
pid1 = np.random.choice(100, 6, replace=False)
val1 = np.random.normal(10, 1, 6)
df1 = pd.DataFrame({'pid': pid1, 'val': val1})
df1
Out[18]:
pid val
0 4 8.115657
1 22 9.385071
2 83 12.240448
3 90 9.173178
4 98 10.119245
5 85 10.476193
In [19]:
pid2 = np.random.permutation(pid1)
val2 = np.random.normal(15, 1, 6)
df2 = pd.DataFrame({'pid': pid2, 'val': val2})
df2
Out[19]:
pid val
0 98 15.945742
1 22 14.873965
2 90 16.505458
3 85 13.775417
4 83 14.690199
5 4 14.749047

Using merge to join on columns containing unique information

Note that the pid order is not the same, so we cannot simply stack horizontally.

In [20]:
pd.merge(df1, df2, on='pid', suffixes=['_visit_1', '_visit_2'])
Out[20]:
pid val_visit_1 val_visit_2
0 4 8.115657 14.749047
1 22 9.385071 14.873965
2 83 12.240448 14.690199
3 90 9.173178 16.505458
4 98 10.119245 15.945742
5 85 10.476193 13.775417
In [21]:
df1a = df1.set_index('pid')
df2a = df2.set_index('pid')

Using merge on multiple columns

Sometimes we need multiple columns to define a unique identifier.

Note: In the data frames being merged, not all the unique identifiers need to be shared in both data frames. For example, when merging laboratory and clinical data, a patient may have laboratory results but no clinical results (perhaps due to a typo) or have clinical data but no laboratory data (because results are still being processed by the lab). We illustrate different ways to resolve the merge in this scenario in the example below.

In [22]:
df1['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df1 = df1[['pid', 'stim', 'val']]
df1
Out[22]:
pid stim val
0 4 cmv 8.115657
1 22 flu 9.385071
2 83 cmv 12.240448
3 90 flu 9.173178
4 98 flu 10.119245
5 85 cmv 10.476193
In [23]:
df2['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df2 = df2[['pid', 'stim', 'val']]
df2
Out[23]:
pid stim val
0 98 cmv 15.945742
1 22 flu 14.873965
2 90 flu 16.505458
3 85 cmv 13.775417
4 83 cmv 14.690199
5 4 cmv 14.749047
In [24]:
pd.merge(df1, df2, on = ['pid', 'stim'], suffixes = ['_visit_1', '_visit_2'])
Out[24]:
pid stim val_visit_1 val_visit_2
0 4 cmv 8.115657 14.749047
1 22 flu 9.385071 14.873965
2 83 cmv 12.240448 14.690199
3 90 flu 9.173178 16.505458
4 85 cmv 10.476193 13.775417
In [25]:
pd.merge(df1, df2, on = ['pid', 'stim'], how = 'left', suffixes = ['_visit_1', '_visit_2'])
Out[25]:
pid stim val_visit_1 val_visit_2
0 4 cmv 8.115657 14.749047
1 22 flu 9.385071 14.873965
2 83 cmv 12.240448 14.690199
3 90 flu 9.173178 16.505458
4 98 flu 10.119245 NaN
5 85 cmv 10.476193 13.775417
In [26]:
pd.merge(df1, df2, on = ['pid', 'stim'], how = 'right', suffixes = ['_visit_1', '_visit_2'])
Out[26]:
pid stim val_visit_1 val_visit_2
0 4.0 cmv 8.115657 14.749047
1 22.0 flu 9.385071 14.873965
2 83.0 cmv 12.240448 14.690199
3 90.0 flu 9.173178 16.505458
4 85.0 cmv 10.476193 13.775417
5 98.0 cmv NaN 15.945742
In [27]:
pd.merge(df1, df2, on = ['pid', 'stim'], how = 'outer', suffixes = ['_visit_1', '_visit_2'])
Out[27]:
pid stim val_visit_1 val_visit_2
0 4.0 cmv 8.115657 14.749047
1 22.0 flu 9.385071 14.873965
2 83.0 cmv 12.240448 14.690199
3 90.0 flu 9.173178 16.505458
4 98.0 flu 10.119245 NaN
5 85.0 cmv 10.476193 13.775417
6 98.0 cmv NaN 15.945742

Using join when indexes are the keys

In [28]:
df1a
Out[28]:
val
pid
4 8.115657
22 9.385071
83 12.240448
90 9.173178
98 10.119245
85 10.476193
In [29]:
df2a
Out[29]:
val
pid
98 15.945742
22 14.873965
90 16.505458
85 13.775417
83 14.690199
4 14.749047
In [30]:
df1a.join(df2a, lsuffix='_1', rsuffix='_2')
Out[30]:
val_1 val_2
pid
4 8.115657 14.749047
22 9.385071 14.873965
83 12.240448 14.690199
90 9.173178 16.505458
98 10.119245 15.945742
85 10.476193 13.775417

(2) Separate multiple values in a single column

Remember that for a tidy data frame, each column must contain a single variable.

In [31]:
from collections import OrderedDict
In [32]:
d = OrderedDict()
d['pid-visit-stim'] = ['1-1-cmv', '1-1-hiv', '1-2-cmv', '1-2-hiv', '1-3-cmv', '1-3-hiv', '2-1-cmv', '2-1-hiv', '2-2-cmv', '2-2-hiv']
d['tnf'] = [1.0, 2.0, 1.1, 2.1, 1.2, 2.2, 3, 4, 3.1, 4.1]
d['ifn'] = [11.0, 12.0, 11.1, 12.1, 11.2, 12.2, 13, 14, 13.1, 14.1]
d['il2'] = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.1, 0.1]
df = pd.DataFrame(d)
In [33]:
df.head()
Out[33]:
pid-visit-stim tnf ifn il2
0 1-1-cmv 1.0 11.0 0.0
1 1-1-hiv 2.0 12.0 0.0
2 1-2-cmv 1.1 11.1 0.1
3 1-2-hiv 2.1 12.1 0.1
4 1-3-cmv 1.2 11.2 0.2
In [34]:
df1 = pd.DataFrame(df['pid-visit-stim'].str.split('-').tolist(),
                   columns = ['pid', 'visit', 'stim'])
df1.head()
Out[34]:
pid visit stim
0 1 1 cmv
1 1 1 hiv
2 1 2 cmv
3 1 2 hiv
4 1 3 cmv
In [35]:
df1 = pd.concat([df1, df], axis=1)
df1.head()
Out[35]:
pid visit stim pid-visit-stim tnf ifn il2
0 1 1 cmv 1-1-cmv 1.0 11.0 0.0
1 1 1 hiv 1-1-hiv 2.0 12.0 0.0
2 1 2 cmv 1-2-cmv 1.1 11.1 0.1
3 1 2 hiv 1-2-hiv 2.1 12.1 0.1
4 1 3 cmv 1-3-cmv 1.2 11.2 0.2
In [36]:
df1.drop('pid-visit-stim', axis=1, inplace=True)
df1.head()
Out[36]:
pid visit stim tnf ifn il2
0 1 1 cmv 1.0 11.0 0.0
1 1 1 hiv 2.0 12.0 0.0
2 1 2 cmv 1.1 11.1 0.1
3 1 2 hiv 2.1 12.1 0.1
4 1 3 cmv 1.2 11.2 0.2
In [37]:
def separate(df, column, sep):
    df1 = pd.DataFrame(df[column].str.split(sep).tolist(), columns = column.split(sep))
    df1 = pd.concat([df1, df], axis=1)
    return df1.drop(column, axis = 1)
In [38]:
separate(df, 'pid-visit-stim', '-').head()
Out[38]:
pid visit stim tnf ifn il2
0 1 1 cmv 1.0 11.0 0.0
1 1 1 hiv 2.0 12.0 0.0
2 1 2 cmv 1.1 11.1 0.1
3 1 2 hiv 2.1 12.1 0.1
4 1 3 cmv 1.2 11.2 0.2

(3) Reshaping DataFrames

In [39]:
d = OrderedDict()
d['pid'] = ['1', '1', '1', '1', '1', '1', '2', '2', '2', '2']
d['visit'] = ['1', '1', '2', '2', '3', '3', '1', '1', '2', '2']
d['stim'] = ['cmv', 'hiv', 'cmv', 'hiv', 'cmv', 'hiv', 'cmv', 'hiv', 'cmv', 'hiv']
d['tnf'] = [1.0, 2.0, 1.1, 2.1, 1.2, 2.2, 3, 4, 3.1, 4.1]
d['ifn'] = [11.0, 12.0, 11.1, 12.1, 11.2, 12.2, 13, 14, 13.1, 14.1]
d['il2'] = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.1, 0.1]
df = pd.DataFrame(d)
In [40]:
df.head()
Out[40]:
pid visit stim tnf ifn il2
0 1 1 cmv 1.0 11.0 0.0
1 1 1 hiv 2.0 12.0 0.0
2 1 2 cmv 1.1 11.1 0.1
3 1 2 hiv 2.1 12.1 0.1
4 1 3 cmv 1.2 11.2 0.2

Wide to long

Suppose it is more convenient to consider the type of cytokine as new variable. We need to put the name of the cytokine in one column and its value in another - going from 3 to 2 columns (wide to long).

We use teh melt method and specify the identifier (columns that stay the same) and value variables.

In [41]:
long1 = pd.melt(df, id_vars = ['pid', 'stim', 'visit'],
                value_vars = ['tnf', 'ifn', 'il2'])
long1.sample(6)
Out[41]:
pid stim visit variable value
5 1 hiv 3 tnf 2.2
8 2 cmv 2 tnf 3.1
10 1 cmv 1 ifn 11.0
15 1 hiv 3 ifn 12.2
17 2 hiv 1 ifn 14.0
23 1 hiv 2 il2 0.1

By default, all non-id variabels are treated as values

In [42]:
long2 = pd.melt(df, id_vars =['pid', 'stim', 'visit'])
long2.sample(6)
Out[42]:
pid stim visit variable value
17 2 hiv 1 ifn 14.0
20 1 cmv 1 il2 0.0
2 1 cmv 2 tnf 1.1
18 2 cmv 2 ifn 13.1
8 2 cmv 2 tnf 3.1
4 1 cmv 3 tnf 1.2

We can exclude columns we are not intersted in

In [43]:
long3 = pd.melt(df, id_vars = ['pid', 'stim', 'visit'],
                value_vars = ['tnf', 'il2'])
long3.sample(6)
Out[43]:
pid stim visit variable value
18 2 cmv 2 il2 0.1
4 1 cmv 3 tnf 1.2
10 1 cmv 1 il2 0.0
0 1 cmv 1 tnf 1.0
16 2 cmv 1 il2 0.1
6 2 cmv 1 tnf 3.0

Long to Wide

There is no function in pandas to undo the wide_to_long operation, and the details are tricky, so I have written a small function to do this.

In [44]:
def long_to_wide(df, index):
    df = df.set_index(index).unstack().reset_index()
    cols = [t[1] if t[1] else t[0] for t in df.columns]
    df.columns = cols
    return df
In [45]:
wide1 = long_to_wide(long1, ['pid', 'stim', 'visit', 'variable'])
wide1.sample(6)
Out[45]:
pid stim visit ifn il2 tnf
1 1 cmv 2 11.1 0.1 1.1
4 1 hiv 2 12.1 0.1 2.1
2 1 cmv 3 11.2 0.2 1.2
3 1 hiv 1 12.0 0.0 2.0
9 2 hiv 2 14.1 0.1 4.1
5 1 hiv 3 12.2 0.2 2.2
In [46]:
wide2 = long_to_wide(long2, ['pid', 'stim', 'visit', 'variable'])
wide2.sample(6)
Out[46]:
pid stim visit ifn il2 tnf
0 1 cmv 1 11.0 0.0 1.0
4 1 hiv 2 12.1 0.1 2.1
5 1 hiv 3 12.2 0.2 2.2
2 1 cmv 3 11.2 0.2 1.2
6 2 cmv 1 13.0 0.1 3.0
8 2 hiv 1 14.0 0.3 4.0
In [47]:
wide3 = long_to_wide(long3, ['pid', 'stim', 'visit', 'variable'])
wide3.sample(6)
Out[47]:
pid stim visit il2 tnf
5 1 hiv 3 0.2 2.2
4 1 hiv 2 0.1 2.1
8 2 hiv 1 0.3 4.0
6 2 cmv 1 0.1 3.0
3 1 hiv 1 0.0 2.0
9 2 hiv 2 0.1 4.1

(4) Hierarchical Indexes (More advanced material)

Sometimes your data is best considered as stacks of 2D DataFrames, for example, when each patient has laboratory data for multiple visits. The handling of such “3D” or even higher-dimensional data is handled in pandas by the sue of hierarchical indexes.

In [48]:
df.head()
Out[48]:
pid visit stim tnf ifn il2
0 1 1 cmv 1.0 11.0 0.0
1 1 1 hiv 2.0 12.0 0.0
2 1 2 cmv 1.1 11.1 0.1
3 1 2 hiv 2.1 12.1 0.1
4 1 3 cmv 1.2 11.2 0.2
In [49]:
df1 = df.set_index(['pid', 'stim', 'visit'])
df1
Out[49]:
tnf ifn il2
pid stim visit
1 cmv 1 1.0 11.0 0.0
hiv 1 2.0 12.0 0.0
cmv 2 1.1 11.1 0.1
hiv 2 2.1 12.1 0.1
cmv 3 1.2 11.2 0.2
hiv 3 2.2 12.2 0.2
2 cmv 1 3.0 13.0 0.1
hiv 1 4.0 14.0 0.3
cmv 2 3.1 13.1 0.1
hiv 2 4.1 14.1 0.1

Indexing for multi-index

With the multi-index, each “cell” is now a block of values for the combinations (pid, stim, visit).

Find TNF values

In [50]:
df1[['tnf']]
Out[50]:
tnf
pid stim visit
1 cmv 1 1.0
hiv 1 2.0
cmv 2 1.1
hiv 2 2.1
cmv 3 1.2
hiv 3 2.2
2 cmv 1 3.0
hiv 1 4.0
cmv 2 3.1
hiv 2 4.1

Find all values for Subject 2

In [51]:
df1.ix['2']
Out[51]:
tnf ifn il2
stim visit
cmv 1 3.0 13.0 0.1
hiv 1 4.0 14.0 0.3
cmv 2 3.1 13.1 0.1
hiv 2 4.1 14.1 0.1

Find TNF values for subject 2

In [52]:
df1.ix['2', ['tnf']]
Out[52]:
tnf
stim visit
cmv 1 3.0
hiv 1 4.0
cmv 2 3.1
hiv 2 4.1

Flattening hierarchicla indexes into column variables

To actually undo, you would have to save the result to a variable.

In [53]:
df1.reset_index()
Out[53]:
pid stim visit tnf ifn il2
0 1 cmv 1 1.0 11.0 0.0
1 1 hiv 1 2.0 12.0 0.0
2 1 cmv 2 1.1 11.1 0.1
3 1 hiv 2 2.1 12.1 0.1
4 1 cmv 3 1.2 11.2 0.2
5 1 hiv 3 2.2 12.2 0.2
6 2 cmv 1 3.0 13.0 0.1
7 2 hiv 1 4.0 14.0 0.3
8 2 cmv 2 3.1 13.1 0.1
9 2 hiv 2 4.1 14.1 0.1

Moving parts of an index into columns with unstack

Move pid from column to row
In [54]:
df1.unstack('pid')
Out[54]:
tnf ifn il2
pid 1 2 1 2 1 2
stim visit
cmv 1 1.0 3.0 11.0 13.0 0.0 0.1
2 1.1 3.1 11.1 13.1 0.1 0.1
3 1.2 NaN 11.2 NaN 0.2 NaN
hiv 1 2.0 4.0 12.0 14.0 0.0 0.3
2 2.1 4.1 12.1 14.1 0.1 0.1
3 2.2 NaN 12.2 NaN 0.2 NaN
In [55]:
df1.unstack(0)
Out[55]:
tnf ifn il2
pid 1 2 1 2 1 2
stim visit
cmv 1 1.0 3.0 11.0 13.0 0.0 0.1
2 1.1 3.1 11.1 13.1 0.1 0.1
3 1.2 NaN 11.2 NaN 0.2 NaN
hiv 1 2.0 4.0 12.0 14.0 0.0 0.3
2 2.1 4.1 12.1 14.1 0.1 0.1
3 2.2 NaN 12.2 NaN 0.2 NaN
Move stim from column to row
In [56]:
df1.unstack('stim')
Out[56]:
tnf ifn il2
stim cmv hiv cmv hiv cmv hiv
pid visit
1 1 1.0 2.0 11.0 12.0 0.0 0.0
2 1.1 2.1 11.1 12.1 0.1 0.1
3 1.2 2.2 11.2 12.2 0.2 0.2
2 1 3.0 4.0 13.0 14.0 0.1 0.3
2 3.1 4.1 13.1 14.1 0.1 0.1
In [57]:
df1.unstack(1)
Out[57]:
tnf ifn il2
stim cmv hiv cmv hiv cmv hiv
pid visit
1 1 1.0 2.0 11.0 12.0 0.0 0.0
2 1.1 2.1 11.1 12.1 0.1 0.1
3 1.2 2.2 11.2 12.2 0.2 0.2
2 1 3.0 4.0 13.0 14.0 0.1 0.3
2 3.1 4.1 13.1 14.1 0.1 0.1
Move pid and stim from column to row
In [58]:
df1.unstack(['pid', 'stim'])
Out[58]:
tnf ifn il2
pid 1 2 1 2 1 2
stim cmv hiv cmv hiv cmv hiv cmv hiv cmv hiv cmv hiv
visit
1 1.0 2.0 3.0 4.0 11.0 12.0 13.0 14.0 0.0 0.0 0.1 0.3
2 1.1 2.1 3.1 4.1 11.1 12.1 13.1 14.1 0.1 0.1 0.1 0.1
3 1.2 2.2 NaN NaN 11.2 12.2 NaN NaN 0.2 0.2 NaN NaN

Moving columns to a multi-index

In [59]:
df1.head()
Out[59]:
tnf ifn il2
pid stim visit
1 cmv 1 1.0 11.0 0.0
hiv 1 2.0 12.0 0.0
cmv 2 1.1 11.1 0.1
hiv 2 2.1 12.1 0.1
cmv 3 1.2 11.2 0.2
In [60]:
df2 = df1.unstack(['pid', 'stim'])
df2.head()
Out[60]:
tnf ifn il2
pid 1 2 1 2 1 2
stim cmv hiv cmv hiv cmv hiv cmv hiv cmv hiv cmv hiv
visit
1 1.0 2.0 3.0 4.0 11.0 12.0 13.0 14.0 0.0 0.0 0.1 0.3
2 1.1 2.1 3.1 4.1 11.1 12.1 13.1 14.1 0.1 0.1 0.1 0.1
3 1.2 2.2 NaN NaN 11.2 12.2 NaN NaN 0.2 0.2 NaN NaN
In [61]:
df3 = df2.stack(['pid', 'stim'])
df3.head()
Out[61]:
ifn il2 tnf
visit pid stim
1 1 cmv 11.0 0.0 1.0
hiv 12.0 0.0 2.0
2 cmv 13.0 0.1 3.0
hiv 14.0 0.3 4.0
2 1 cmv 11.1 0.1 1.1

Exercises

You are given three DataFrames d1, d2, and d3. Answer the questions below.

In [62]:
d1 = pd.DataFrame({'subject': ['a', 'b', 'c', 'd'], 'group': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
d2 = pd.DataFrame({'subject': ['w', 'x', 'y', 'z'], 'group': ['A', 'B', 'C', 'D'], 'value': [2, 1, 4, 3]})
d3 = pd.DataFrame({'subject': ['a', 'x', 'w', 'd', 'b', 'c', 'y', 'z'], 'age': [10, 20, 30, 40, 50, 60, 70, 80]})
In [63]:
d1
Out[63]:
group subject value
0 A a 1
1 B b 2
2 C c 3
3 D d 4
In [64]:
d2
Out[64]:
group subject value
0 A w 2
1 B x 1
2 C y 4
3 D z 3
In [65]:
d3
Out[65]:
age subject
0 10 a
1 20 x
2 30 w
3 40 d
4 50 b
5 60 c
6 70 y
7 80 z

1. Combine d1 and d2 to create a DataFrame called df1 with 8 rows and 3 columns.

In [69]:
df1 = pd.concat([d1, d2])
df1
Out[69]:
group subject value
0 A a 1
1 B b 2
2 C c 3
3 D d 4
0 A w 2
1 B x 1
2 C y 4
3 D z 3

2. Combine df1 with d3 to create a DataFrame called df2 with 8 rows and 4 columns.

In [71]:
df2 = pd.merge(df1, d3)
df2
Out[71]:
group subject value age
0 A a 1 10
1 B b 2 50
2 C c 3 60
3 D d 4 40
4 A w 2 30
5 B x 1 20
6 C y 4 70
7 D z 3 80