Introduction to pandas

In [1]:
import numpy as np
import pandas as pd

Series and Data Frames

Series objects

A Series is like a vector. All elements must have the same type or be null.

In [2]:
s = pd.Series([1,1,2,3] + [None])
s
Out[2]:
0    1.0
1    1.0
2    2.0
3    3.0
4    NaN
dtype: float64

Size

In [3]:
s.size
Out[3]:
5

Unique Counts

In [4]:
s.value_counts()
Out[4]:
1.0    2
3.0    1
2.0    1
dtype: int64
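Note that value_counts excludes nulls by default; pass dropna=False if you want the missing values counted as their own category. A quick sketch:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3] + [None])

# By default, NaN is excluded from the counts (4 values counted)
counts = s.value_counts()
print(counts)

# dropna=False includes the NaN entries (all 5 values counted)
counts_with_na = s.value_counts(dropna=False)
print(counts_with_na)
```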

Special types of series

Strings

In [5]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1
Out[5]:
0      the quick
1    quick brown
2      brown fox
3      fox jumps
4     jumps over
5       over the
6       the lazy
7       lazy dog
dtype: object
In [6]:
s1.str.upper()
Out[6]:
0      THE QUICK
1    QUICK BROWN
2      BROWN FOX
3      FOX JUMPS
4     JUMPS OVER
5       OVER THE
6       THE LAZY
7       LAZY DOG
dtype: object
In [7]:
s1.str.split()
Out[7]:
0      [the, quick]
1    [quick, brown]
2      [brown, fox]
3      [fox, jumps]
4     [jumps, over]
5       [over, the]
6       [the, lazy]
7       [lazy, dog]
dtype: object
In [8]:
s1.str.split().str[1]
Out[8]:
0    quick
1    brown
2      fox
3    jumps
4     over
5      the
6     lazy
7      dog
dtype: object

Categories

In [9]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2
Out[9]:
0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: object
In [10]:
s2 = s2.astype('category')
s2
Out[10]:
0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: category
Categories (4, object): [Asian, Black, Hispanic, White]
In [11]:
s2.cat.categories
Out[11]:
Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')
In [12]:
s2.cat.codes
Out[12]:
0    0
1    0
2    3
3    1
4    3
5    2
dtype: int8
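Categories can also carry an ordering, in which case comparisons and sorting follow the declared category order rather than alphabetical order. A small sketch (the size labels here are made up for illustration):

```python
import pandas as pd

sizes = pd.Series(['small', 'large', 'medium', 'small'])
sizes = sizes.astype(pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True))

# Comparisons use the declared order, not lexicographic order
print(sizes > 'small')

# sort_values also follows the category order
print(sizes.sort_values())
```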

DataFrame objects

A DataFrame is like a matrix. Columns in a DataFrame are Series.

  • Each column in a DataFrame represents a variable

  • Each row in a DataFrame represents an observation

  • Each cell in a DataFrame represents a value

In [13]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df
Out[13]:
num
0 1.0
1 2.0
2 3.0
3 NaN
In [14]:
df.num
Out[14]:
0    1.0
1    2.0
2    3.0
3    NaN
Name: num, dtype: float64

Index

Row and column identifiers are of Index type.

Somewhat confusingly, index is also a synonym for the row identifiers.

In [15]:
df.index
Out[15]:
RangeIndex(start=0, stop=4, step=1)

Setting a column as the row index

In [16]:
df
Out[16]:
num
0 1.0
1 2.0
2 3.0
3 NaN
In [17]:
df1 = df.set_index('num')
df1
Out[17]:
num
1.0
2.0
3.0
NaN

Making an index into a column

In [18]:
df1.reset_index()
Out[18]:
num
0 1.0
1 2.0
2 3.0
3 NaN

Columns

This is just a different Index object.

In [19]:
df.columns
Out[19]:
Index(['num'], dtype='object')

Getting raw values

Sometimes you just want a numpy array, and not a pandas object.

In [20]:
df.values
Out[20]:
array([[ 1.],
       [ 2.],
       [ 3.],
       [nan]])
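In more recent pandas versions, to_numpy() is the recommended way to get the underlying array; it behaves like .values here. A quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(num=[1, 2, 3] + [None]))

# to_numpy() returns a plain numpy array, like .values
arr = df.to_numpy()
print(arr)

# A single column gives a 1-D array
col = df['num'].to_numpy()
print(col.shape)
```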

Creating Data Frames

Manual

In [21]:
from collections import OrderedDict
In [22]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='d')
df = pd.DataFrame(OrderedDict(pid=np.random.randint(100, 999, n),
                              weight=np.random.normal(70, 20, n),
                              height=np.random.normal(170, 15, n),
                              date=dates,
                             ))
df
Out[22]:
pid weight height date
0 258 102.249873 179.829518 2019-01-23 08:33:52.400613
1 887 61.520051 179.539043 2019-01-24 08:33:52.400613
2 689 90.455795 167.137938 2019-01-25 08:33:52.400613
3 184 111.817942 169.137430 2019-01-26 08:33:52.400613
4 228 57.779447 190.507569 2019-01-27 08:33:52.400613

From file

You can read in data from many different file types - plain text, JSON, spreadsheets, databases etc. Functions to read in data look like read_X where X is the data type.
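For example, read_csv works the same way for comma-separated data; a quick sketch using an in-memory buffer in place of a file:

```python
from io import StringIO

import pandas as pd

# A small CSV held in memory instead of on disk
csv_data = StringIO("""pid,weight,height
328,72.65,203.56
756,34.03,189.85
""")

df_csv = pd.read_csv(csv_data)
print(df_csv)
```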

In [23]:
%%file measures.txt
pid weight  height  date
328 72.654347       203.560866      2018-11-11 14:16:18.148411
756 34.027679       189.847316      2018-11-12 14:16:18.148411
185 28.501914       158.646074      2018-11-13 14:16:18.148411
507 17.396343       180.795993      2018-11-14 14:16:18.148411
919 64.724301       173.564725      2018-11-15 14:16:18.148411
Writing measures.txt
In [24]:
df = pd.read_table('measures.txt')
df
Out[24]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Indexing Data Frames

Implicit defaults

If you provide a slice, it is assumed that you are asking for rows.

In [25]:
df[1:3]
Out[25]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411

If you provide a single value or list, it is assumed that you are asking for columns.

In [26]:
df[['pid', 'weight']]
Out[26]:
pid weight
0 328 72.654347
1 756 34.027679
2 185 28.501914
3 507 17.396343
4 919 64.724301

Extracting a column

Dictionary style access

In [27]:
df['pid']
Out[27]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64

Property style access

This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes, or keywords).
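For column names that are not valid identifiers, dictionary-style access is the only option. A sketch with a hypothetical column name containing a space:

```python
import pandas as pd

df_demo = pd.DataFrame({'body weight': [70, 65], 'height': [170, 180]})

# 'body weight' contains a space, so attribute access is a syntax error;
# dictionary-style access still works
print(df_demo['body weight'])

# Attribute access is fine for names that are valid identifiers
print(df_demo.height)
```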

In [28]:
df.pid
Out[28]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64

Indexing by location

This is similar to numpy indexing.

In [29]:
df.iloc[1:3, :]
Out[29]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
In [30]:
df.iloc[1:3, [True, False, True, False]]
Out[30]:
pid height
1 756 189.847316
2 185 158.646074

Indexing by name

In [31]:
df.loc[1:3, 'weight':'height']
Out[31]:
weight height
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993

Warning: When using loc, the row slice refers to row labels, not positions, and both endpoints are included.

In [32]:
df1 = df.copy()
df1.index = df.index + 1
df1
Out[32]:
pid weight height date
1 328 72.654347 203.560866 2018-11-11 14:16:18.148411
2 756 34.027679 189.847316 2018-11-12 14:16:18.148411
3 185 28.501914 158.646074 2018-11-13 14:16:18.148411
4 507 17.396343 180.795993 2018-11-14 14:16:18.148411
5 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [33]:
df1.loc[1:3, 'weight':'height']
Out[33]:
weight height
1 72.654347 203.560866
2 34.027679 189.847316
3 28.501914 158.646074

Structure of a Data Frame

Data types

In [34]:
df.dtypes
Out[34]:
pid         int64
weight    float64
height    float64
date       object
dtype: object

Converting data types

Using astype on one column

In [35]:
df.pid = df.pid.astype('category')

Using astype on multiple columns

In [36]:
df = df.astype(dict(weight=float, height=float))

Using a conversion function

In [37]:
df.date = pd.to_datetime(df.date)

Check

In [38]:
df.dtypes
Out[38]:
pid             category
weight           float64
height           float64
date      datetime64[ns]
dtype: object

Basic properties

In [39]:
df.size
Out[39]:
20
In [40]:
df.shape
Out[40]:
(5, 4)
In [41]:
df.describe()
Out[41]:
weight height
count 5.000000 5.000000
mean 43.460917 181.282995
std 23.960945 16.895933
min 17.396343 158.646074
25% 28.501914 173.564725
50% 34.027679 180.795993
75% 64.724301 189.847316
max 72.654347 203.560866

Inspection

In [42]:
df.head(n=3)
Out[42]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
In [43]:
df.tail(n=3)
Out[43]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [44]:
df.sample(n=3)
Out[44]:
pid weight height date
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
In [45]:
df.sample(frac=0.5)
Out[45]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411

Selecting, Renaming and Removing Columns

Selecting columns

In [46]:
df.filter(items=['pid', 'date'])
Out[46]:
pid date
0 328 2018-11-11 14:16:18.148411
1 756 2018-11-12 14:16:18.148411
2 185 2018-11-13 14:16:18.148411
3 507 2018-11-14 14:16:18.148411
4 919 2018-11-15 14:16:18.148411
In [47]:
df.filter(regex='.*ght')
Out[47]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725

Note that you can also use regular string methods on the columns.

In [48]:
df.loc[:, df.columns.str.contains('d')]
Out[48]:
pid date
0 328 2018-11-11 14:16:18.148411
1 756 2018-11-12 14:16:18.148411
2 185 2018-11-13 14:16:18.148411
3 507 2018-11-14 14:16:18.148411
4 919 2018-11-15 14:16:18.148411

Renaming columns

In [49]:
df.rename(dict(weight='w', height='h'), axis=1)
Out[49]:
pid w h date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [50]:
orig_cols = df.columns
In [51]:
df.columns = list('abcd')
In [52]:
df
Out[52]:
a b c d
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [53]:
df.columns = orig_cols
In [54]:
df
Out[54]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Removing columns

In [55]:
df.drop(['pid', 'date'], axis=1)
Out[55]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725
In [56]:
df.drop(columns=['pid', 'date'])
Out[56]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725
In [57]:
df.drop(columns=df.columns[df.columns.str.contains('d')])
Out[57]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725

Selecting, Renaming and Removing Rows

Selecting rows

In [58]:
df[df.weight.between(60,70)]
Out[58]:
pid weight height date
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [59]:
df[(69 <= df.weight) & (df.weight < 70)]
Out[59]:
pid weight height date
In [60]:
df[df.date.between(pd.to_datetime('2018-11-13'),
                   pd.to_datetime('2018-11-15 23:59:59'))]
Out[60]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Renaming rows

In [61]:
df.rename({i:letter for i,letter in enumerate('abcde')})
Out[61]:
pid weight height date
a 328 72.654347 203.560866 2018-11-11 14:16:18.148411
b 756 34.027679 189.847316 2018-11-12 14:16:18.148411
c 185 28.501914 158.646074 2018-11-13 14:16:18.148411
d 507 17.396343 180.795993 2018-11-14 14:16:18.148411
e 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [62]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumps']
In [63]:
df
Out[63]:
pid weight height date
the 328 72.654347 203.560866 2018-11-11 14:16:18.148411
quick 756 34.027679 189.847316 2018-11-12 14:16:18.148411
brown 185 28.501914 158.646074 2018-11-13 14:16:18.148411
fox 507 17.396343 180.795993 2018-11-14 14:16:18.148411
jumps 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [64]:
df = df.reset_index(drop=True)
In [65]:
df
Out[65]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Dropping rows

In [66]:
df.drop([1,3], axis=0)
Out[66]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Dropping duplicated data

In [67]:
df['something'] = [1,1,None,2,None]
In [68]:
df.loc[df.something.duplicated()]
Out[68]:
pid weight height date something
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN
In [69]:
df.drop_duplicates(subset='something')
Out[69]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0

Dropping missing data

In [70]:
df
Out[70]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN
In [71]:
df.something.fillna(0)
Out[71]:
0    1.0
1    1.0
2    0.0
3    2.0
4    0.0
Name: something, dtype: float64
In [72]:
df.something.ffill()
Out[72]:
0    1.0
1    1.0
2    1.0
3    2.0
4    2.0
Name: something, dtype: float64
In [73]:
df.something.bfill()
Out[73]:
0    1.0
1    1.0
2    2.0
3    2.0
4    NaN
Name: something, dtype: float64
In [74]:
df.something.interpolate()
Out[74]:
0    1.0
1    1.0
2    1.5
3    2.0
4    2.0
Name: something, dtype: float64
In [75]:
df.dropna()
Out[75]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0

Transforming and Creating Columns

In [76]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)
Out[76]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449
In [77]:
df['bmi'] = df['weight'] / (df['height']/100)**2
In [78]:
df
Out[78]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449
In [79]:
df['something'] = [2,2,None,None,3]
In [80]:
df
Out[80]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449

Sorting Data Frames

Sort on indexes

In [81]:
df.sort_index(axis=1)
Out[81]:
bmi date height pid something weight
0 17.533678 2018-11-11 14:16:18.148411 203.560866 328 2.0 72.654347
1 9.441118 2018-11-12 14:16:18.148411 189.847316 756 2.0 34.027679
2 11.324404 2018-11-13 14:16:18.148411 158.646074 185 NaN 28.501914
3 5.322067 2018-11-14 14:16:18.148411 180.795993 507 NaN 17.396343
4 21.485449 2018-11-15 14:16:18.148411 173.564725 919 3.0 64.724301
In [82]:
df.sort_index(axis=0, ascending=False)
Out[82]:
pid weight height date something bmi
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678

Sort on values

In [83]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])
Out[83]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067

Summarizing

Apply an aggregation function

In [84]:
df.select_dtypes(include=np.number)
Out[84]:
weight height something bmi
0 72.654347 203.560866 2.0 17.533678
1 34.027679 189.847316 2.0 9.441118
2 28.501914 158.646074 NaN 11.324404
3 17.396343 180.795993 NaN 5.322067
4 64.724301 173.564725 3.0 21.485449
In [85]:
df.select_dtypes(include=np.number).agg(np.sum)
Out[85]:
weight       217.304584
height       906.414974
something      7.000000
bmi           65.106716
dtype: float64
In [86]:
df.agg(['count', np.sum, np.mean])
Out[86]:
pid weight height date something bmi
count 5.0 5.000000 5.000000 5.0 3.000000 5.000000
sum NaN 217.304584 906.414974 NaN 7.000000 65.106716
mean NaN 43.460917 181.282995 NaN 2.333333 13.021343

Split-Apply-Combine

We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with groupby followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.

In [87]:
df['treatment'] = list('ababa')
In [88]:
df
Out[88]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
In [89]:
grouped = df.groupby('treatment')
In [90]:
grouped.get_group('a')
Out[90]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
In [91]:
grouped.mean()
Out[91]:
weight height something bmi
treatment
a 55.293521 178.590555 2.5 16.781177
b 25.712011 185.321654 2.0 7.381592

Using agg with groupby

In [92]:
grouped.agg('mean')
Out[92]:
weight height something bmi
treatment
a 55.293521 178.590555 2.5 16.781177
b 25.712011 185.321654 2.0 7.381592
In [93]:
grouped.agg(['mean', 'std'])
Out[93]:
weight height something bmi
mean std mean std mean std mean std
treatment
a 55.293521 23.538565 178.590555 22.875289 2.5 0.707107 16.781177 5.122148
b 25.712011 11.760130 185.321654 6.400252 2.0 NaN 7.381592 2.912608
In [94]:
grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})
Out[94]:
weight height bmi
mean std min max <lambda>
treatment
a 55.293521 23.538565 158.646074 203.560866 897.296525
b 25.712011 11.760130 180.795993 189.847316 117.459100

Using transform with groupby

In [95]:
g_mean = grouped[['weight', 'height']].transform(np.mean)
g_mean
Out[95]:
weight height
0 55.293521 178.590555
1 25.712011 185.321654
2 55.293521 178.590555
3 25.712011 185.321654
4 55.293521 178.590555
In [96]:
g_std = grouped[['weight', 'height']].transform(np.std)
g_std
Out[96]:
weight height
0 23.538565 22.875289
1 11.760130 6.400252
2 23.538565 22.875289
3 11.760130 6.400252
4 23.538565 22.875289
In [97]:
(df[['weight', 'height']] - g_mean)/g_std
Out[97]:
weight height
0 0.737548 1.091584
1 0.707107 0.707107
2 -1.138201 -0.871879
3 -0.707107 -0.707107
4 0.400652 -0.219706

Combining Data Frames

In [98]:
df
Out[98]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
In [99]:
df1 =  df.iloc[3:].copy()
In [100]:
df1.drop('something', axis=1, inplace=True)
df1
Out[100]:
pid weight height date bmi treatment
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 21.485449 a

Adding rows

Note that pandas aligns by column indexes automatically.

In [101]:
df.append(df1, sort=False)
Out[101]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449 a
In [102]:
pd.concat([df, df1], sort=False)
Out[102]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449 a

Adding columns

In [103]:
df.pid
Out[103]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: category
Categories (5, int64): [185, 328, 507, 756, 919]
In [104]:
df2 = pd.DataFrame(OrderedDict(pid=[649, 533, 400, 600], age=[23,34,45,56]))
In [105]:
df2.pid
Out[105]:
0    649
1    533
2    400
3    600
Name: pid, dtype: int64
In [106]:
df.pid = df.pid.astype('int')
In [107]:
pd.merge(df, df2, on='pid', how='inner')
Out[107]:
pid weight height date something bmi treatment age
In [108]:
pd.merge(df, df2, on='pid', how='left')
Out[108]:
pid weight height date something bmi treatment age
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a NaN
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b NaN
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b NaN
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a NaN
In [109]:
pd.merge(df, df2, on='pid', how='right')
Out[109]:
pid weight height date something bmi treatment age
0 649 NaN NaN NaT NaN NaN NaN 23
1 533 NaN NaN NaT NaN NaN NaN 34
2 400 NaN NaN NaT NaN NaN NaN 45
3 600 NaN NaN NaT NaN NaN NaN 56
In [110]:
pd.merge(df, df2, on='pid', how='outer')
Out[110]:
pid weight height date something bmi treatment age
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a NaN
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b NaN
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b NaN
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a NaN
5 649 NaN NaN NaT NaN NaN NaN 23.0
6 533 NaN NaN NaT NaN NaN NaN 34.0
7 400 NaN NaN NaT NaN NaN NaN 45.0
8 600 NaN NaN NaT NaN NaN NaN 56.0

Merging on the index

In [111]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))
In [112]:
df1
Out[112]:
x
a 1
b 2
c 3
In [113]:
df2
Out[113]:
y
a 4
b 5
c 6
In [114]:
df3
Out[114]:
z
a 7
b 8
c 9
In [115]:
df1.join([df2, df3])
Out[115]:
x y z
a 1 4 7
b 2 5 8
c 3 6 9

Fixing common DataFrame issues

Multiple variables in a column

In [116]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df
Out[116]:
pid_treat
0 A-1
1 B-2
2 C-1
3 D-2
In [117]:
df.pid_treat.str.split('-')
Out[117]:
0    [A, 1]
1    [B, 2]
2    [C, 1]
3    [D, 2]
Name: pid_treat, dtype: object
In [118]:
df.pid_treat.str.split('-').apply(pd.Series, index=['pid', 'treat'])
Out[118]:
pid treat
0 A 1
1 B 2
2 C 1
3 D 2

Multiple values in a cell

In [119]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))
df
Out[119]:
pid vals
0 a (1, 2, 3)
1 b (4, 5, 6)
2 c (7, 8, 9)
In [120]:
df[['t1', 't2', 't3']]  = df.vals.apply(pd.Series)
df
Out[120]:
pid vals t1 t2 t3
0 a (1, 2, 3) 1 2 3
1 b (4, 5, 6) 4 5 6
2 c (7, 8, 9) 7 8 9
In [121]:
df.drop('vals', axis=1, inplace=True)
In [122]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)
Out[122]:
pid vals
0 a 1
1 b 4
2 c 7
3 a 2
4 b 5
5 c 8
6 a 3
7 b 6
8 c 9

Reshaping Data Frames

Sometimes we need to make rows into columns or vice versa.

Converting multiple columns into a single column

This is often useful if you need to condition on some variable.

In [123]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)
In [124]:
iris.head()
Out[124]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [125]:
iris.shape
Out[125]:
(150, 5)
In [126]:
df_iris = pd.melt(iris, id_vars='species')
In [127]:
df_iris.sample(10)
Out[127]:
species variable value
535 versicolor petal_width 1.6
175 setosa sepal_width 3.0
430 virginica petal_length 6.1
481 setosa petal_width 0.4
3 setosa sepal_length 4.6
73 versicolor sepal_length 6.1
581 virginica petal_width 2.0
362 versicolor petal_length 4.0
294 virginica sepal_width 3.3
66 versicolor sepal_length 5.6
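Going the other way, pivot converts a long format back to wide, with one column per variable. A minimal sketch using made-up values in the same shape as the melted iris data:

```python
import pandas as pd

# A small long-format frame shaped like the melted iris data
df_long = pd.DataFrame(dict(
    species=['setosa', 'setosa', 'virginica', 'virginica'],
    variable=['sepal_length', 'sepal_width', 'sepal_length', 'sepal_width'],
    value=[5.1, 3.5, 6.5, 3.0],
))

# pivot spreads 'variable' back out into columns
wide = df_long.pivot(index='species', columns='variable', values='value')
print(wide)
```

Note that pivot requires each (index, columns) pair to be unique; for repeated pairs, as in the full melted iris data, use pivot_table with an aggregation function instead.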

Chaining commands

You will sometimes see this functional style of method chaining, which avoids the need for temporary intermediate variables.

In [128]:
(
    iris.
    sample(frac=0.2).
    filter(regex='s.*').
    assign(both=iris.sepal_length + iris.sepal_length).
    groupby('species').agg(['mean', 'sum']).
    pipe(lambda x: np.around(x, 1))
)
Out[128]:
sepal_length sepal_width both
mean sum mean sum mean sum
species
setosa 5.2 56.8 3.6 39.1 10.3 113.6
versicolor 6.0 42.1 2.8 19.6 12.0 84.2
virginica 6.7 80.1 2.9 34.6 13.4 160.2

Moving between R and Python in Jupyter

In [129]:
%load_ext rpy2.ipython
In [130]:
import warnings
warnings.simplefilter('ignore', FutureWarning)
In [131]:
iris = %R iris
In [132]:
iris.head()
Out[132]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [133]:
iris_py = iris.copy()
iris_py.Species = iris_py.Species.str.upper()
In [134]:
%%R -i iris_py -o iris_r

iris_r <- iris_py[1:3,]
In [135]:
iris_r
Out[135]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 SETOSA
1 4.9 3.0 1.4 0.2 SETOSA
2 4.7 3.2 1.3 0.2 SETOSA