In [1]:

import numpy as np
import pandas as pd

Data Manipulation¶

(1) Combining data sets¶

Combining rows¶

In [2]:

df_versiocolor = pd.read_csv('data/versicolor.csv')
df_virginica = pd.read_csv('data/virginica.csv')
df_sertosa = pd.read_csv('data/setosa.csv')
dfs = [df_versiocolor, df_virginica, df_sertosa]

In [3]:

df_versiocolor.head(3)

Out[3]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	7.0	3.2	4.7	1.4	versicolor
1	6.4	3.2	4.5	1.5	versicolor
2	6.9	3.1	4.9	1.5	versicolor

In [4]:

df_virginica.head(3)

Out[4]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	6.3	3.3	6.0	2.5	virginica
1	5.8	2.7	5.1	1.9	virginica
2	7.1	3.0	5.9	2.1	virginica

In [5]:

df_sertosa.head(3)

Out[5]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa

In [6]:

for df in dfs:
    print(df.shape)

(50, 5)
(50, 5)
(50, 5)

Each DataFrame only contains data about one species of iris¶

In [7]:

for df in dfs:
    print(df.Species.unique())

['versicolor']
['virginica']
['setosa']

Combine with `concat`¶

We can append rows of one data frame to another using concat. Here axis = 0 means combining by rows, in contrast to axis = 1 which means combining by columns.

In [8]:

df = pd.concat(dfs, axis=0)
df.shape

Out[8]:

(150, 5)

Note that the indices are maintained¶

Here we have to use iloc because ix and loc work with the integer index values, not the positions, and there are no values equal to 50 or 51.

In [9]:

df.iloc[48:52]

Out[9]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
48	5.1	2.5	3.0	1.1	versicolor
49	5.7	2.8	4.1	1.3	versicolor
0	6.3	3.3	6.0	2.5	virginica
1	5.8	2.7	5.1	1.9	virginica

We can ask for a new consecutive index¶

In [10]:

df = df.reset_index(drop=True)

In [11]:

df.iloc[48:52]

Out[11]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
48	5.1	2.5	3.0	1.1	versicolor
49	5.7	2.8	4.1	1.3	versicolor
50	6.3	3.3	6.0	2.5	virginica
51	5.8	2.7	5.1	1.9	virginica

Combined DataFrame contains all 3 species¶

In [12]:

df.Species.unique()

Out[12]:

array(['versicolor', 'virginica', 'setosa'], dtype=object)

Combining columns¶

When combining rows, we are usually just adding new observations with the same variables and there is little ambiguity about what we are trying to do.

When combining columns, we are usually trying to merge information based on some unique identifier, for example, to combine clinical and laboratory information for specific patients with unique PIDs. Often the ordering of the PIDs in the two data frames are not exactly lined up so “horizontal stacking” will not work. Usually we need to merge the data on a unique identifier from one or more columns.

Simplest case - rows line up perfectly across data frames¶

In [13]:

df_sepal = pd.read_csv('data/sepal.csv')
df_petal = pd.read_csv('data/petal.csv')

In [14]:

df_sepal.head(3)

Out[14]:

	Species	Sepal.Length	Sepal.Width
0	setosa	5.1	3.5
1	setosa	4.9	3.0
2	setosa	4.7	3.2

In [15]:

df_petal.head(3)

Out[15]:

	Species	Petal.Length	Petal.Width
0	setosa	1.4	0.2
1	setosa	1.4	0.2
2	setosa	1.3	0.2

In [16]:

df_sepal.shape, df_petal.shape

Out[16]:

((150, 3), (150, 3))

Using `concat`¶

In [17]:

pd.concat([df_sepal, df_petal], axis=1).head()

Out[17]:

	Species	Sepal.Length	Sepal.Width	Species	Petal.Length	Petal.Width
0	setosa	5.1	3.5	setosa	1.4	0.2
1	setosa	4.9	3.0	setosa	1.4	0.2
2	setosa	4.7	3.2	setosa	1.3	0.2
3	setosa	4.6	3.1	setosa	1.5	0.2
4	setosa	5.0	3.6	setosa	1.4	0.2

Joining on a single unique column¶

Combining values for the same subject across different variables.

Make up dummy data sets for illustration¶

In [18]:

pid1 = np.random.choice(100, 6, replace=False)
val1 = np.random.normal(10, 1, 6)
df1 = pd.DataFrame({'pid': pid1, 'val': val1})
df1

Out[18]:

	pid	val
0	4	8.115657
1	22	9.385071
2	83	12.240448
3	90	9.173178
4	98	10.119245
5	85	10.476193

In [19]:

pid2 = np.random.permutation(pid1)
val2 = np.random.normal(15, 1, 6)
df2 = pd.DataFrame({'pid': pid2, 'val': val2})
df2

Out[19]:

	pid	val
0	98	15.945742
1	22	14.873965
2	90	16.505458
3	85	13.775417
4	83	14.690199
5	4	14.749047

Using merge to join on columns containing unique information¶

Note that the pid order is not the same, so we cannot simply stack horizontally.

In [20]:

pd.merge(df1, df2, on='pid', suffixes=['_visit_1', '_visit_2'])

Out[20]:

	pid	val_visit_1	val_visit_2
0	4	8.115657	14.749047
1	22	9.385071	14.873965
2	83	12.240448	14.690199
3	90	9.173178	16.505458
4	98	10.119245	15.945742
5	85	10.476193	13.775417

In [21]:

df1a = df1.set_index('pid')
df2a = df2.set_index('pid')

Using merge on multiple columns¶

Sometimes we need multiple columns to define a unique identifier.

Note: In the data frames being merged, not all the unique identifiers need to be shared in both data frames. For example, when merging laboratory and clinical data, a patient may have laboratory results but no clinical results (perhaps due to a typo) or have clinical data but no laboratory data (because results are still being processed by the lab). We illustrate different ways to resolve the merge in this scenario in the example below.

In [22]:

df1['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df1 = df1[['pid', 'stim', 'val']]
df1

Out[22]:

	pid	stim	val
0	4	cmv	8.115657
1	22	flu	9.385071
2	83	cmv	12.240448
3	90	flu	9.173178
4	98	flu	10.119245
5	85	cmv	10.476193

In [23]:

df2['stim'] = np.random.choice(['cmv', 'flu'], 6, replace=True)
df2 = df2[['pid', 'stim', 'val']]
df2

Out[23]:

	pid	stim	val
0	98	cmv	15.945742
1	22	flu	14.873965
2	90	flu	16.505458
3	85	cmv	13.775417
4	83	cmv	14.690199
5	4	cmv	14.749047

In [24]:

pd.merge(df1, df2, on = ['pid', 'stim'], suffixes = ['_visit_1', '_visit_2'])

Out[24]:

	pid	stim	val_visit_1	val_visit_2
0	4	cmv	8.115657	14.749047
1	22	flu	9.385071	14.873965
2	83	cmv	12.240448	14.690199
3	90	flu	9.173178	16.505458
4	85	cmv	10.476193	13.775417

In [25]:

pd.merge(df1, df2, on = ['pid', 'stim'], how = 'left', suffixes = ['_visit_1', '_visit_2'])

Out[25]:

	pid	stim	val_visit_1	val_visit_2
0	4	cmv	8.115657	14.749047
1	22	flu	9.385071	14.873965
2	83	cmv	12.240448	14.690199
3	90	flu	9.173178	16.505458
4	98	flu	10.119245	NaN
5	85	cmv	10.476193	13.775417

In [26]:

pd.merge(df1, df2, on = ['pid', 'stim'], how = 'right', suffixes = ['_visit_1', '_visit_2'])

Out[26]:

	pid	stim	val_visit_1	val_visit_2
0	4.0	cmv	8.115657	14.749047
1	22.0	flu	9.385071	14.873965
2	83.0	cmv	12.240448	14.690199
3	90.0	flu	9.173178	16.505458
4	85.0	cmv	10.476193	13.775417
5	98.0	cmv	NaN	15.945742

In [27]:

pd.merge(df1, df2, on = ['pid', 'stim'], how = 'outer', suffixes = ['_visit_1', '_visit_2'])

Out[27]:

	pid	stim	val_visit_1	val_visit_2
0	4.0	cmv	8.115657	14.749047
1	22.0	flu	9.385071	14.873965
2	83.0	cmv	12.240448	14.690199
3	90.0	flu	9.173178	16.505458
4	98.0	flu	10.119245	NaN
5	85.0	cmv	10.476193	13.775417
6	98.0	cmv	NaN	15.945742

Using `join` when indexes are the keys¶

In [28]:

df1a

Out[28]:

	val
pid
4	8.115657
22	9.385071
83	12.240448
90	9.173178
98	10.119245
85	10.476193

In [29]:

df2a

Out[29]:

	val
pid
98	15.945742
22	14.873965
90	16.505458
85	13.775417
83	14.690199
4	14.749047

In [30]:

df1a.join(df2a, lsuffix='_1', rsuffix='_2')

Out[30]:

	val_1	val_2
pid
4	8.115657	14.749047
22	9.385071	14.873965
83	12.240448	14.690199
90	9.173178	16.505458
98	10.119245	15.945742
85	10.476193	13.775417

(2) Separate multiple values in a single column¶

Remember that for a tidy data frame, each column must contain a single variable.

In [31]:

from collections import OrderedDict

In [32]:

d = OrderedDict()
d['pid-visit-stim'] = ['1-1-cmv', '1-1-hiv', '1-2-cmv', '1-2-hiv', '1-3-cmv', '1-3-hiv', '2-1-cmv', '2-1-hiv', '2-2-cmv', '2-2-hiv']
d['tnf'] = [1.0, 2.0, 1.1, 2.1, 1.2, 2.2, 3, 4, 3.1, 4.1]
d['ifn'] = [11.0, 12.0, 11.1, 12.1, 11.2, 12.2, 13, 14, 13.1, 14.1]
d['il2'] = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.1, 0.1]
df = pd.DataFrame(d)

In [33]:

df.head()

Out[33]:

	pid-visit-stim	tnf	ifn	il2
0	1-1-cmv	1.0	11.0	0.0
1	1-1-hiv	2.0	12.0	0.0
2	1-2-cmv	1.1	11.1	0.1
3	1-2-hiv	2.1	12.1	0.1
4	1-3-cmv	1.2	11.2	0.2

In [34]:

df1 = pd.DataFrame(df['pid-visit-stim'].str.split('-').tolist(),
                   columns = ['pid', 'visit', 'stim'])
df1.head()

Out[34]:

	pid	visit	stim
0	1	1	cmv
1	1	1	hiv
2	1	2	cmv
3	1	2	hiv
4	1	3	cmv

In [35]:

df1 = pd.concat([df1, df], axis=1)
df1.head()

Out[35]:

	pid	visit	stim	pid-visit-stim	tnf	ifn	il2
0	1	1	cmv	1-1-cmv	1.0	11.0	0.0
1	1	1	hiv	1-1-hiv	2.0	12.0	0.0
2	1	2	cmv	1-2-cmv	1.1	11.1	0.1
3	1	2	hiv	1-2-hiv	2.1	12.1	0.1
4	1	3	cmv	1-3-cmv	1.2	11.2	0.2

In [36]:

df1.drop('pid-visit-stim', axis=1, inplace=True)
df1.head()

Out[36]:

	pid	visit	stim	tnf	ifn	il2
0	1	1	cmv	1.0	11.0	0.0
1	1	1	hiv	2.0	12.0	0.0
2	1	2	cmv	1.1	11.1	0.1
3	1	2	hiv	2.1	12.1	0.1
4	1	3	cmv	1.2	11.2	0.2

In [37]:

def separate(df, column, sep):
    df1 = pd.DataFrame(df[column].str.split(sep).tolist(), columns = column.split(sep))
    df1 = pd.concat([df1, df], axis=1)
    return df1.drop(column, axis = 1)

In [38]:

separate(df, 'pid-visit-stim', '-').head()

Out[38]:

	pid	visit	stim	tnf	ifn	il2
0	1	1	cmv	1.0	11.0	0.0
1	1	1	hiv	2.0	12.0	0.0
2	1	2	cmv	1.1	11.1	0.1
3	1	2	hiv	2.1	12.1	0.1
4	1	3	cmv	1.2	11.2	0.2

(3) Reshaping DataFrames¶

In [39]:

d = OrderedDict()
d['pid'] = ['1', '1', '1', '1', '1', '1', '2', '2', '2', '2']
d['visit'] = ['1', '1', '2', '2', '3', '3', '1', '1', '2', '2']
d['stim'] = ['cmv', 'hiv', 'cmv', 'hiv', 'cmv', 'hiv', 'cmv', 'hiv', 'cmv', 'hiv']
d['tnf'] = [1.0, 2.0, 1.1, 2.1, 1.2, 2.2, 3, 4, 3.1, 4.1]
d['ifn'] = [11.0, 12.0, 11.1, 12.1, 11.2, 12.2, 13, 14, 13.1, 14.1]
d['il2'] = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.1, 0.1]
df = pd.DataFrame(d)

In [40]:

df.head()

Out[40]:

	pid	visit	stim	tnf	ifn	il2
0	1	1	cmv	1.0	11.0	0.0
1	1	1	hiv	2.0	12.0	0.0
2	1	2	cmv	1.1	11.1	0.1
3	1	2	hiv	2.1	12.1	0.1
4	1	3	cmv	1.2	11.2	0.2

Wide to long¶

Suppose it is more convenient to consider the type of cytokine as new variable. We need to put the name of the cytokine in one column and its value in another - going from 3 to 2 columns (wide to long).

We use teh `melt` method and specify the identifier (columns that stay the same) and value variables.¶

In [41]:

long1 = pd.melt(df, id_vars = ['pid', 'stim', 'visit'],
                value_vars = ['tnf', 'ifn', 'il2'])
long1.sample(6)

Out[41]:

	pid	stim	visit	variable	value
5	1	hiv	3	tnf	2.2
8	2	cmv	2	tnf	3.1
10	1	cmv	1	ifn	11.0
15	1	hiv	3	ifn	12.2
17	2	hiv	1	ifn	14.0
23	1	hiv	2	il2	0.1

By default, all non-id variabels are treated as values¶

In [42]:

long2 = pd.melt(df, id_vars =['pid', 'stim', 'visit'])
long2.sample(6)

Out[42]:

	pid	stim	visit	variable	value
17	2	hiv	1	ifn	14.0
20	1	cmv	1	il2	0.0
2	1	cmv	2	tnf	1.1
18	2	cmv	2	ifn	13.1
8	2	cmv	2	tnf	3.1
4	1	cmv	3	tnf	1.2

We can exclude columns we are not intersted in¶

In [43]:

long3 = pd.melt(df, id_vars = ['pid', 'stim', 'visit'],
                value_vars = ['tnf', 'il2'])
long3.sample(6)

Out[43]:

	pid	stim	visit	variable	value
18	2	cmv	2	il2	0.1
4	1	cmv	3	tnf	1.2
10	1	cmv	1	il2	0.0
0	1	cmv	1	tnf	1.0
16	2	cmv	1	il2	0.1
6	2	cmv	1	tnf	3.0

Long to Wide¶

There is no function in pandas to undo the wide_to_long operation, and the details are tricky, so I have written a small function to do this.

In [44]:

def long_to_wide(df, index):
    df = df.set_index(index).unstack().reset_index()
    cols = [t[1] if t[1] else t[0] for t in df.columns]
    df.columns = cols
    return df

In [45]:

wide1 = long_to_wide(long1, ['pid', 'stim', 'visit', 'variable'])
wide1.sample(6)

Out[45]:

	pid	stim	visit	ifn	il2	tnf
1	1	cmv	2	11.1	0.1	1.1
4	1	hiv	2	12.1	0.1	2.1
2	1	cmv	3	11.2	0.2	1.2
3	1	hiv	1	12.0	0.0	2.0
9	2	hiv	2	14.1	0.1	4.1
5	1	hiv	3	12.2	0.2	2.2

In [46]:

wide2 = long_to_wide(long2, ['pid', 'stim', 'visit', 'variable'])
wide2.sample(6)

Out[46]:

	pid	stim	visit	ifn	il2	tnf
0	1	cmv	1	11.0	0.0	1.0
4	1	hiv	2	12.1	0.1	2.1
5	1	hiv	3	12.2	0.2	2.2
2	1	cmv	3	11.2	0.2	1.2
6	2	cmv	1	13.0	0.1	3.0
8	2	hiv	1	14.0	0.3	4.0

In [47]:

wide3 = long_to_wide(long3, ['pid', 'stim', 'visit', 'variable'])
wide3.sample(6)

Out[47]:

	pid	stim	visit	il2	tnf
5	1	hiv	3	0.2	2.2
4	1	hiv	2	0.1	2.1
8	2	hiv	1	0.3	4.0
6	2	cmv	1	0.1	3.0
3	1	hiv	1	0.0	2.0
9	2	hiv	2	0.1	4.1

(4) Hierarchical Indexes (More advanced material)¶

Sometimes your data is best considered as stacks of 2D DataFrames, for example, when each patient has laboratory data for multiple visits. The handling of such “3D” or even higher-dimensional data is handled in pandas by the sue of hierarchical indexes.

In [48]:

df.head()

Out[48]:

	pid	visit	stim	tnf	ifn	il2
0	1	1	cmv	1.0	11.0	0.0
1	1	1	hiv	2.0	12.0	0.0
2	1	2	cmv	1.1	11.1	0.1
3	1	2	hiv	2.1	12.1	0.1
4	1	3	cmv	1.2	11.2	0.2

In [49]:

df1 = df.set_index(['pid', 'stim', 'visit'])
df1

Out[49]:

			tnf	ifn	il2
pid	stim	visit
1	cmv	1	1.0	11.0	0.0
	hiv	1	2.0	12.0	0.0
	cmv	2	1.1	11.1	0.1
	hiv	2	2.1	12.1	0.1
	cmv	3	1.2	11.2	0.2
	hiv	3	2.2	12.2	0.2
2	cmv	1	3.0	13.0	0.1
	hiv	1	4.0	14.0	0.3
	cmv	2	3.1	13.1	0.1
	hiv	2	4.1	14.1	0.1

Indexing for multi-index¶

With the multi-index, each “cell” is now a block of values for the combinations (pid, stim, visit).

Find TNF values¶

In [50]:

df1[['tnf']]

Out[50]:

			tnf
pid	stim	visit
1	cmv	1	1.0
	hiv	1	2.0
	cmv	2	1.1
	hiv	2	2.1
	cmv	3	1.2
	hiv	3	2.2
2	cmv	1	3.0
	hiv	1	4.0
	cmv	2	3.1
	hiv	2	4.1

Find all values for Subject 2¶

In [51]:

df1.ix['2']

Out[51]:

		tnf	ifn	il2
stim	visit
cmv	1	3.0	13.0	0.1
hiv	1	4.0	14.0	0.3
cmv	2	3.1	13.1	0.1
hiv	2	4.1	14.1	0.1

Find TNF values for subject 2¶

In [52]:

df1.ix['2', ['tnf']]

Out[52]:

		tnf
stim	visit
cmv	1	3.0
hiv	1	4.0
cmv	2	3.1
hiv	2	4.1

Flattening hierarchicla indexes into column variables¶

To actually undo, you would have to save the result to a variable.

In [53]:

df1.reset_index()

Out[53]:

	pid	stim	visit	tnf	ifn	il2
0	1	cmv	1	1.0	11.0	0.0
1	1	hiv	1	2.0	12.0	0.0
2	1	cmv	2	1.1	11.1	0.1
3	1	hiv	2	2.1	12.1	0.1
4	1	cmv	3	1.2	11.2	0.2
5	1	hiv	3	2.2	12.2	0.2
6	2	cmv	1	3.0	13.0	0.1
7	2	hiv	1	4.0	14.0	0.3
8	2	cmv	2	3.1	13.1	0.1
9	2	hiv	2	4.1	14.1	0.1

Moving parts of an index into columns with `unstack`¶

Move pid from column to row¶

In [54]:

df1.unstack('pid')

Out[54]:

		tnf		ifn		il2
	pid	1	2	1	2	1	2
stim	visit
cmv	1	1.0	3.0	11.0	13.0	0.0	0.1
	2	1.1	3.1	11.1	13.1	0.1	0.1
	3	1.2	NaN	11.2	NaN	0.2	NaN
hiv	1	2.0	4.0	12.0	14.0	0.0	0.3
	2	2.1	4.1	12.1	14.1	0.1	0.1
	3	2.2	NaN	12.2	NaN	0.2	NaN

In [55]:

df1.unstack(0)

Out[55]:

		tnf		ifn		il2
	pid	1	2	1	2	1	2
stim	visit
cmv	1	1.0	3.0	11.0	13.0	0.0	0.1
	2	1.1	3.1	11.1	13.1	0.1	0.1
	3	1.2	NaN	11.2	NaN	0.2	NaN
hiv	1	2.0	4.0	12.0	14.0	0.0	0.3
	2	2.1	4.1	12.1	14.1	0.1	0.1
	3	2.2	NaN	12.2	NaN	0.2	NaN

Move stim from column to row¶

In [56]:

df1.unstack('stim')

Out[56]:

		tnf		ifn		il2
	stim	cmv	hiv	cmv	hiv	cmv	hiv
pid	visit
1	1	1.0	2.0	11.0	12.0	0.0	0.0
	2	1.1	2.1	11.1	12.1	0.1	0.1
	3	1.2	2.2	11.2	12.2	0.2	0.2
2	1	3.0	4.0	13.0	14.0	0.1	0.3
2	2	3.1	4.1	13.1	14.1	0.1	0.1

In [57]:

df1.unstack(1)

Out[57]:

		tnf		ifn		il2
	stim	cmv	hiv	cmv	hiv	cmv	hiv
pid	visit
1	1	1.0	2.0	11.0	12.0	0.0	0.0
	2	1.1	2.1	11.1	12.1	0.1	0.1
	3	1.2	2.2	11.2	12.2	0.2	0.2
2	1	3.0	4.0	13.0	14.0	0.1	0.3
2	2	3.1	4.1	13.1	14.1	0.1	0.1

Move pid and stim from column to row¶

In [58]:

df1.unstack(['pid', 'stim'])

Out[58]:

	tnf				ifn				il2
pid	1		2		1		2		1		2
stim	cmv	hiv	cmv	hiv	cmv	hiv	cmv	hiv	cmv	hiv	cmv	hiv
visit
1	1.0	2.0	3.0	4.0	11.0	12.0	13.0	14.0	0.0	0.0	0.1	0.3
2	1.1	2.1	3.1	4.1	11.1	12.1	13.1	14.1	0.1	0.1	0.1	0.1
3	1.2	2.2	NaN	NaN	11.2	12.2	NaN	NaN	0.2	0.2	NaN	NaN

Moving columns to a multi-index¶

In [59]:

df1.head()

Out[59]:

			tnf	ifn	il2
pid	stim	visit
1	cmv	1	1.0	11.0	0.0
	hiv	1	2.0	12.0	0.0
	cmv	2	1.1	11.1	0.1
	hiv	2	2.1	12.1	0.1
	cmv	3	1.2	11.2	0.2

In [60]:

df2 = df1.unstack(['pid', 'stim'])
df2.head()

Out[60]:

	tnf				ifn				il2
pid	1		2		1		2		1		2
stim	cmv	hiv	cmv	hiv	cmv	hiv	cmv	hiv	cmv	hiv	cmv	hiv
visit
1	1.0	2.0	3.0	4.0	11.0	12.0	13.0	14.0	0.0	0.0	0.1	0.3
2	1.1	2.1	3.1	4.1	11.1	12.1	13.1	14.1	0.1	0.1	0.1	0.1
3	1.2	2.2	NaN	NaN	11.2	12.2	NaN	NaN	0.2	0.2	NaN	NaN

In [61]:

df3 = df2.stack(['pid', 'stim'])
df3.head()

Out[61]:

			ifn	il2	tnf
visit	pid	stim
1	1	cmv	11.0	0.0	1.0
	1	hiv	12.0	0.0	2.0
	2	cmv	13.0	0.1	3.0
	2	hiv	14.0	0.3	4.0
2	1	cmv	11.1	0.1	1.1

Exercises¶

You are given three DataFrames d1, d2, and d3. Answer the questions below.

In [62]:

d1 = pd.DataFrame({'subject': ['a', 'b', 'c', 'd'], 'group': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
d2 = pd.DataFrame({'subject': ['w', 'x', 'y', 'z'], 'group': ['A', 'B', 'C', 'D'], 'value': [2, 1, 4, 3]})
d3 = pd.DataFrame({'subject': ['a', 'x', 'w', 'd', 'b', 'c', 'y', 'z'], 'age': [10, 20, 30, 40, 50, 60, 70, 80]})

In [63]:

d1

Out[63]:

	group	subject	value
0	A	a	1
1	B	b	2
2	C	c	3
3	D	d	4

In [64]:

d2

Out[64]:

	group	subject	value
0	A	w	2
1	B	x	1
2	C	y	4
3	D	z	3

In [65]:

d3

Out[65]:

	age	subject
0	10	a
1	20	x
2	30	w
3	40	d
4	50	b
5	60	c
6	70	y
7	80	z

1. Combine d1 and d2 to create a DataFrame called df1 with 8 rows and 3 columns.

In [69]:

df1 = pd.concat([d1, d2])
df1

Out[69]:

	group	subject	value
0	A	a	1
1	B	b	2
2	C	c	3
3	D	d	4
0	A	w	2
1	B	x	1
2	C	y	4
3	D	z	3

2. Combine df1 with d3 to create a DataFrame called df2 with 8 rows and 4 columns.

In [71]:

df2 = pd.merge(df1, d3)
df2

Out[71]:

	group	subject	value	age
0	A	a	1	10
1	B	b	2	50
2	C	c	3	60
3	D	d	4	40
4	A	w	2	30
5	B	x	1	20
6	C	y	4	70
7	D	z	3	80

Data Manipulation¶

(1) Combining data sets¶

Combining rows¶

Each DataFrame only contains data about one species of iris¶

Combine with `concat`¶

Note that the indices are maintained¶

We can ask for a new consecutive index¶

Combined DataFrame contains all 3 species¶

Combining columns¶

Simplest case - rows line up perfectly across data frames¶

Using `concat`¶

Joining on a single unique column¶

Make up dummy data sets for illustration¶

Using merge to join on columns containing unique information¶

Using merge on multiple columns¶

Using `join` when indexes are the keys¶

(2) Separate multiple values in a single column¶

(3) Reshaping DataFrames¶

Wide to long¶

We use teh `melt` method and specify the identifier (columns that stay the same) and value variables.¶

By default, all non-id variabels are treated as values¶

We can exclude columns we are not intersted in¶

Long to Wide¶

(4) Hierarchical Indexes (More advanced material)¶

Indexing for multi-index¶

Find TNF values¶

Find all values for Subject 2¶

Find TNF values for subject 2¶

Flattening hierarchicla indexes into column variables¶

Moving parts of an index into columns with `unstack`¶

Move pid from column to row¶

Move stim from column to row¶

Move pid and stim from column to row¶

Moving columns to a multi-index¶

Exercises¶

Page contents

Previous page

Next page

This Page

	pid	stim	visit	variable	value
18	2	cmv	2	il2	0.1
4	1	cmv	3	tnf	1.2
10	1	cmv	1	il2	0.0
0	1	cmv	1	tnf	1.0
16	2	cmv	1	il2	0.1
6	2	cmv	1	tnf	3.0

	pid	stim	visit	il2	tnf
5	1	hiv	3	0.2	2.2
4	1	hiv	2	0.1	2.1
8	2	hiv	1	0.3	4.0
6	2	cmv	1	0.1	3.0
3	1	hiv	1	0.0	2.0
9	2	hiv	2	0.1	4.1

	pid	stim	visit	variable	value
18	2	cmv	2	il2	0.1
4	1	cmv	3	tnf	1.2
10	1	cmv	1	il2	0.0
0	1	cmv	1	tnf	1.0
16	2	cmv	1	il2	0.1
6	2	cmv	1	tnf	3.0

	pid	stim	visit	il2	tnf
5	1	hiv	3	0.2	2.2
4	1	hiv	2	0.1	2.1
8	2	hiv	1	0.3	4.0
6	2	cmv	1	0.1	3.0
3	1	hiv	1	0.0	2.0
9	2	hiv	2	0.1	4.1

Data Manipulation¶

(1) Combining data sets¶

Combining rows¶

Each DataFrame only contains data about one species of iris¶

Combine with concat¶

Note that the indices are maintained¶

We can ask for a new consecutive index¶

Combined DataFrame contains all 3 species¶

Combining columns¶

Simplest case - rows line up perfectly across data frames¶

Using concat¶

Joining on a single unique column¶

Make up dummy data sets for illustration¶

Using merge to join on columns containing unique information¶

Using merge on multiple columns¶

Using join when indexes are the keys¶

(2) Separate multiple values in a single column¶

(3) Reshaping DataFrames¶

Wide to long¶

We use teh melt method and specify the identifier (columns that stay the same) and value variables.¶

By default, all non-id variabels are treated as values¶

We can exclude columns we are not intersted in¶

Long to Wide¶

(4) Hierarchical Indexes (More advanced material)¶

Indexing for multi-index¶

Find TNF values¶

Find all values for Subject 2¶

Find TNF values for subject 2¶

Flattening hierarchicla indexes into column variables¶

Moving parts of an index into columns with unstack¶

Move pid from column to row¶

Move stim from column to row¶

Move pid and stim from column to row¶

Moving columns to a multi-index¶

Exercises¶

Combine with `concat`¶

Using `concat`¶

Using `join` when indexes are the keys¶

We use teh `melt` method and specify the identifier (columns that stay the same) and value variables.¶

Moving parts of an index into columns with `unstack`¶

	pid	stim	visit	variable	value
18	2	cmv	2	il2	0.1
4	1	cmv	3	tnf	1.2
10	1	cmv	1	il2	0.0
0	1	cmv	1	tnf	1.0
16	2	cmv	1	il2	0.1
6	2	cmv	1	tnf	3.0

	pid	stim	visit	il2	tnf
5	1	hiv	3	0.2	2.2
4	1	hiv	2	0.1	2.1
8	2	hiv	1	0.3	4.0
6	2	cmv	1	0.1	3.0
3	1	hiv	1	0.0	2.0
9	2	hiv	2	0.1	4.1