Introduction to pandas

In [1]:
import numpy as np
import pandas as pd

Series and Data Frames

Series objects

A Series is like a vector. All elements must have the same type or be null.

In [2]:
s = pd.Series([1,1,2,3] + [None])
s
Out[2]:
0    1.0
1    1.0
2    2.0
3    3.0
4    NaN
dtype: float64

Size

In [3]:
s.size
Out[3]:
5

Unique Counts

In [4]:
s.value_counts()
Out[4]:
1.0    2
3.0    1
2.0    1
dtype: int64
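Note that value_counts excludes nulls by default; pass dropna=False if you want the missing values counted as their own category. A quick sketch:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3] + [None])

# By default, NaN is excluded from the counts (4 values counted)
counts = s.value_counts()
print(counts)

# dropna=False includes the NaN entries (all 5 values counted)
counts_with_na = s.value_counts(dropna=False)
print(counts_with_na)
```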

Special types of series

Strings

In [5]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1
Out[5]:
0      the quick
1    quick brown
2      brown fox
3      fox jumps
4     jumps over
5       over the
6       the lazy
7       lazy dog
dtype: object
In [6]:
s1.str.upper()
Out[6]:
0      THE QUICK
1    QUICK BROWN
2      BROWN FOX
3      FOX JUMPS
4     JUMPS OVER
5       OVER THE
6       THE LAZY
7       LAZY DOG
dtype: object
In [7]:
s1.str.split()
Out[7]:
0      [the, quick]
1    [quick, brown]
2      [brown, fox]
3      [fox, jumps]
4     [jumps, over]
5       [over, the]
6       [the, lazy]
7       [lazy, dog]
dtype: object
In [8]:
s1.str.split().str[1]
Out[8]:
0    quick
1    brown
2      fox
3    jumps
4     over
5      the
6     lazy
7      dog
dtype: object

Categories

In [9]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2
Out[9]:
0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: object
In [10]:
s2 = s2.astype('category')
s2
Out[10]:
0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: category
Categories (4, object): [Asian, Black, Hispanic, White]
In [11]:
s2.cat.categories
Out[11]:
Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')
In [12]:
s2.cat.codes
Out[12]:
0    0
1    0
2    3
3    1
4    3
5    2
dtype: int8
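Categories can also carry an ordering, in which case comparisons and sorting follow the declared category order rather than alphabetical order. A small sketch (the size labels here are made up for illustration):

```python
import pandas as pd

sizes = pd.Series(['small', 'large', 'medium', 'small'])
sizes = sizes.astype(pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True))

# Comparisons use the declared order, not lexicographic order
print(sizes > 'small')

# sort_values also follows the category order
print(sizes.sort_values())
```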

DataFrame objects

A DataFrame is like a matrix. Columns in a DataFrame are Series.

  • Each column in a DataFrame represents a variable

  • Each row in a DataFrame represents an observation

  • Each cell in a DataFrame represents a value

In [13]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df
Out[13]:
num
0 1.0
1 2.0
2 3.0
3 NaN
In [14]:
df.num
Out[14]:
0    1.0
1    2.0
2    3.0
3    NaN
Name: num, dtype: float64

Index

Row and column identifiers are of Index type.

Somewhat confusingly, index is also a synonym for the row identifiers.

In [15]:
df.index
Out[15]:
RangeIndex(start=0, stop=4, step=1)

Setting a column as the row index

In [16]:
df
Out[16]:
num
0 1.0
1 2.0
2 3.0
3 NaN
In [17]:
df1 = df.set_index('num')
df1
Out[17]:
num
1.0
2.0
3.0
NaN

Making an index into a column

In [18]:
df1.reset_index()
Out[18]:
num
0 1.0
1 2.0
2 3.0
3 NaN

Columns

This is just a different Index object.

In [19]:
df.columns
Out[19]:
Index(['num'], dtype='object')

Getting raw values

Sometimes you just want a numpy array, and not a pandas object.

In [20]:
df.values
Out[20]:
array([[ 1.],
       [ 2.],
       [ 3.],
       [nan]])
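In more recent pandas versions, to_numpy() is the recommended way to get the underlying array; it behaves like .values here. A quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(num=[1, 2, 3] + [None]))

# to_numpy() returns a plain numpy array, like .values
arr = df.to_numpy()
print(arr)

# A single column gives a 1-D array
col = df['num'].to_numpy()
print(col.shape)
```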

Creating Data Frames

Manual

In [21]:
from collections import OrderedDict
In [22]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='d')
df = pd.DataFrame(OrderedDict(pid=np.random.randint(100, 999, n),
                              weight=np.random.normal(70, 20, n),
                              height=np.random.normal(170, 15, n),
                              date=dates,
                             ))
df
Out[22]:
pid weight height date
0 258 102.249873 179.829518 2019-01-23 08:33:52.400613
1 887 61.520051 179.539043 2019-01-24 08:33:52.400613
2 689 90.455795 167.137938 2019-01-25 08:33:52.400613
3 184 111.817942 169.137430 2019-01-26 08:33:52.400613
4 228 57.779447 190.507569 2019-01-27 08:33:52.400613

From file

You can read in data from many different file types - plain text, JSON, spreadsheets, databases etc. Functions to read in data look like read_X where X is the data type.
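For example, read_csv works the same way for comma-separated data; a quick sketch using an in-memory buffer in place of a file:

```python
from io import StringIO

import pandas as pd

# A small CSV held in memory instead of on disk
csv_data = StringIO("""pid,weight,height
328,72.65,203.56
756,34.03,189.85
""")

df_csv = pd.read_csv(csv_data)
print(df_csv)
```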

In [23]:
%%file measures.txt
pid weight  height  date
328 72.654347       203.560866      2018-11-11 14:16:18.148411
756 34.027679       189.847316      2018-11-12 14:16:18.148411
185 28.501914       158.646074      2018-11-13 14:16:18.148411
507 17.396343       180.795993      2018-11-14 14:16:18.148411
919 64.724301       173.564725      2018-11-15 14:16:18.148411
Writing measures.txt
In [24]:
df = pd.read_table('measures.txt')
df
Out[24]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Indexing Data Frames

Implicit defaults

If you provide a slice, it is assumed that you are asking for rows.

In [25]:
df[1:3]
Out[25]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411

If you provide a single value or list, it is assumed that you are asking for columns.

In [26]:
df[['pid', 'weight']]
Out[26]:
pid weight
0 328 72.654347
1 756 34.027679
2 185 28.501914
3 507 17.396343
4 919 64.724301

Extracting a column

Dictionary style access

In [27]:
df['pid']
Out[27]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64

Property style access

This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes, or keywords).
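For column names that are not valid identifiers, dictionary-style access is the only option. A sketch with a hypothetical column name containing a space:

```python
import pandas as pd

df_demo = pd.DataFrame({'body weight': [70, 65], 'height': [170, 180]})

# 'body weight' contains a space, so attribute access is a syntax error;
# dictionary-style access still works
print(df_demo['body weight'])

# Attribute access is fine for names that are valid identifiers
print(df_demo.height)
```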

In [28]:
df.pid
Out[28]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64

Indexing by location

This is similar to numpy indexing.

In [29]:
df.iloc[1:3, :]
Out[29]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
In [30]:
df.iloc[1:3, [True, False, True, False]]
Out[30]:
pid height
1 756 189.847316
2 185 158.646074

Indexing by name

In [31]:
df.loc[1:3, 'weight':'height']
Out[31]:
weight height
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993

Warning: When using loc, the row slice refers to row labels, not positions, and both endpoints are included.

In [32]:
df1 = df.copy()
df1.index = df.index + 1
df1
Out[32]:
pid weight height date
1 328 72.654347 203.560866 2018-11-11 14:16:18.148411
2 756 34.027679 189.847316 2018-11-12 14:16:18.148411
3 185 28.501914 158.646074 2018-11-13 14:16:18.148411
4 507 17.396343 180.795993 2018-11-14 14:16:18.148411
5 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [33]:
df1.loc[1:3, 'weight':'height']
Out[33]:
weight height
1 72.654347 203.560866
2 34.027679 189.847316
3 28.501914 158.646074

Structure of a Data Frame

Data types

In [34]:
df.dtypes
Out[34]:
pid         int64
weight    float64
height    float64
date       object
dtype: object

Converting data types

Using astype on one column

In [35]:
df.pid = df.pid.astype('category')

Using astype on multiple columns

In [36]:
df = df.astype(dict(weight=float, height=float))

Using a conversion function

In [37]:
df.date = pd.to_datetime(df.date)

Check

In [38]:
df.dtypes
Out[38]:
pid             category
weight           float64
height           float64
date      datetime64[ns]
dtype: object

Basic properties

In [39]:
df.size
Out[39]:
20
In [40]:
df.shape
Out[40]:
(5, 4)
In [41]:
df.describe()
Out[41]:
weight height
count 5.000000 5.000000
mean 43.460917 181.282995
std 23.960945 16.895933
min 17.396343 158.646074
25% 28.501914 173.564725
50% 34.027679 180.795993
75% 64.724301 189.847316
max 72.654347 203.560866

Inspection

In [42]:
df.head(n=3)
Out[42]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
In [43]:
df.tail(n=3)
Out[43]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [44]:
df.sample(n=3)
Out[44]:
pid weight height date
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
In [45]:
df.sample(frac=0.5)
Out[45]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411

Selecting, Renaming and Removing Columns

Selecting columns

In [46]:
df.filter(items=['pid', 'date'])
Out[46]:
pid date
0 328 2018-11-11 14:16:18.148411
1 756 2018-11-12 14:16:18.148411
2 185 2018-11-13 14:16:18.148411
3 507 2018-11-14 14:16:18.148411
4 919 2018-11-15 14:16:18.148411
In [47]:
df.filter(regex='.*ght')
Out[47]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725

Note that you can also use regular string methods on the columns.

In [48]:
df.loc[:, df.columns.str.contains('d')]
Out[48]:
pid date
0 328 2018-11-11 14:16:18.148411
1 756 2018-11-12 14:16:18.148411
2 185 2018-11-13 14:16:18.148411
3 507 2018-11-14 14:16:18.148411
4 919 2018-11-15 14:16:18.148411

Renaming columns

In [49]:
df.rename(dict(weight='w', height='h'), axis=1)
Out[49]:
pid w h date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [50]:
orig_cols = df.columns
In [51]:
df.columns = list('abcd')
In [52]:
df
Out[52]:
a b c d
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [53]:
df.columns = orig_cols
In [54]:
df
Out[54]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Removing columns

In [55]:
df.drop(['pid', 'date'], axis=1)
Out[55]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725
In [56]:
df.drop(columns=['pid', 'date'])
Out[56]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725
In [57]:
df.drop(columns=df.columns[df.columns.str.contains('d')])
Out[57]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725

Selecting, Renaming and Removing Rows

Selecting rows

In [58]:
df[df.weight.between(60,70)]
Out[58]:
pid weight height date
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [59]:
df[(69 <= df.weight) & (df.weight < 70)]
Out[59]:
pid weight height date
In [60]:
df[df.date.between(pd.to_datetime('2018-11-13'),
                   pd.to_datetime('2018-11-15 23:59:59'))]
Out[60]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Renaming rows

In [61]:
df.rename({i:letter for i,letter in enumerate('abcde')})
Out[61]:
pid weight height date
a 328 72.654347 203.560866 2018-11-11 14:16:18.148411
b 756 34.027679 189.847316 2018-11-12 14:16:18.148411
c 185 28.501914 158.646074 2018-11-13 14:16:18.148411
d 507 17.396343 180.795993 2018-11-14 14:16:18.148411
e 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [62]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumps']
In [63]:
df
Out[63]:
pid weight height date
the 328 72.654347 203.560866 2018-11-11 14:16:18.148411
quick 756 34.027679 189.847316 2018-11-12 14:16:18.148411
brown 185 28.501914 158.646074 2018-11-13 14:16:18.148411
fox 507 17.396343 180.795993 2018-11-14 14:16:18.148411
jumps 919 64.724301 173.564725 2018-11-15 14:16:18.148411
In [64]:
df = df.reset_index(drop=True)
In [65]:
df
Out[65]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Dropping rows

In [66]:
df.drop([1,3], axis=0)
Out[66]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Dropping duplicated data

In [67]:
df['something'] = [1,1,None,2,None]
In [68]:
df.loc[df.something.duplicated()]
Out[68]:
pid weight height date something
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN
In [69]:
df.drop_duplicates(subset='something')
Out[69]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0

Dropping missing data

In [70]:
df
Out[70]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN
In [71]:
df.something.fillna(0)
Out[71]:
0    1.0
1    1.0
2    0.0
3    2.0
4    0.0
Name: something, dtype: float64
In [72]:
df.something.ffill()
Out[72]:
0    1.0
1    1.0
2    1.0
3    2.0
4    2.0
Name: something, dtype: float64
In [73]:
df.something.bfill()
Out[73]:
0    1.0
1    1.0
2    2.0
3    2.0
4    NaN
Name: something, dtype: float64
In [74]:
df.something.interpolate()
Out[74]:
0    1.0
1    1.0
2    1.5
3    2.0
4    2.0
Name: something, dtype: float64
In [75]:
df.dropna()
Out[75]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0

Transforming and Creating Columns

In [76]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)
Out[76]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449
In [77]:
df['bmi'] = df['weight'] / (df['height']/100)**2
In [78]:
df
Out[78]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449
In [79]:
df['something'] = [2,2,None,None,3]
In [80]:
df
Out[80]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449

Sorting Data Frames

Sort on indexes

In [81]:
df.sort_index(axis=1)
Out[81]:
bmi date height pid something weight
0 17.533678 2018-11-11 14:16:18.148411 203.560866 328 2.0 72.654347
1 9.441118 2018-11-12 14:16:18.148411 189.847316 756 2.0 34.027679
2 11.324404 2018-11-13 14:16:18.148411 158.646074 185 NaN 28.501914
3 5.322067 2018-11-14 14:16:18.148411 180.795993 507 NaN 17.396343
4 21.485449 2018-11-15 14:16:18.148411 173.564725 919 3.0 64.724301
In [82]:
df.sort_index(axis=0, ascending=False)
Out[82]:
pid weight height date something bmi
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678

Sort on values

In [83]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])
Out[83]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067

Summarizing

Apply an aggregation function

In [84]:
df.select_dtypes(include=np.number)
Out[84]:
weight height something bmi
0 72.654347 203.560866 2.0 17.533678
1 34.027679 189.847316 2.0 9.441118
2 28.501914 158.646074 NaN 11.324404
3 17.396343 180.795993 NaN 5.322067
4 64.724301 173.564725 3.0 21.485449
In [85]:
df.select_dtypes(include=np.number).agg(np.sum)
Out[85]:
weight       217.304584
height       906.414974
something      7.000000
bmi           65.106716
dtype: float64
In [86]:
df.agg(['count', np.sum, np.mean])
Out[86]:
pid weight height date something bmi
count 5.0 5.000000 5.000000 5.0 3.000000 5.000000
sum NaN 217.304584 906.414974 NaN 7.000000 65.106716
mean NaN 43.460917 181.282995 NaN 2.333333 13.021343

Split-Apply-Combine

We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with groupby followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.

In [87]:
df['treatment'] = list('ababa')
In [88]:
df
Out[88]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
In [89]:
grouped = df.groupby('treatment')
In [90]:
grouped.get_group('a')
Out[90]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
In [91]:
grouped.mean()
Out[91]:
weight height something bmi
treatment
a 55.293521 178.590555 2.5 16.781177
b 25.712011 185.321654 2.0 7.381592

Using agg with groupby

In [92]:
grouped.agg('mean')
Out[92]:
weight height something bmi
treatment
a 55.293521 178.590555 2.5 16.781177
b 25.712011 185.321654 2.0 7.381592
In [93]:
grouped.agg(['mean', 'std'])
Out[93]:
weight height something bmi
mean std mean std mean std mean std
treatment
a 55.293521 23.538565 178.590555 22.875289 2.5 0.707107 16.781177 5.122148
b 25.712011 11.760130 185.321654 6.400252 2.0 NaN 7.381592 2.912608
In [94]:
grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})
Out[94]:
weight height bmi
mean std min max <lambda>
treatment
a 55.293521 23.538565 158.646074 203.560866 897.296525
b 25.712011 11.760130 180.795993 189.847316 117.459100

Using transform with groupby

In [95]:
g_mean = grouped[['weight', 'height']].transform(np.mean)
g_mean
Out[95]:
weight height
0 55.293521 178.590555
1 25.712011 185.321654
2 55.293521 178.590555
3 25.712011 185.321654
4 55.293521 178.590555
In [96]:
g_std = grouped[['weight', 'height']].transform(np.std)
g_std
Out[96]:
weight height
0 23.538565 22.875289
1 11.760130 6.400252
2 23.538565 22.875289
3 11.760130 6.400252
4 23.538565 22.875289
In [97]:
(df[['weight', 'height']] - g_mean)/g_std
Out[97]:
weight height
0 0.737548 1.091584
1 0.707107 0.707107
2 -1.138201 -0.871879
3 -0.707107 -0.707107
4 0.400652 -0.219706

Combining Data Frames

In [98]:
df
Out[98]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
In [99]:
df1 =  df.iloc[3:].copy()
In [100]:
df1.drop('something', axis=1, inplace=True)
df1
Out[100]:
pid weight height date bmi treatment
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 21.485449 a

Adding rows

Note that pandas aligns by column indexes automatically.

In [101]:
df.append(df1, sort=False)
Out[101]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449 a
In [102]:
pd.concat([df, df1], sort=False)
Out[102]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449 a

Adding columns

In [103]:
df.pid
Out[103]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: category
Categories (5, int64): [185, 328, 507, 756, 919]
In [104]:
df2 = pd.DataFrame(OrderedDict(pid=[649, 533, 400, 600], age=[23,34,45,56]))
In [105]:
df2.pid
Out[105]:
0    649
1    533
2    400
3    600
Name: pid, dtype: int64
In [106]:
df.pid = df.pid.astype('int')
In [107]:
pd.merge(df, df2, on='pid', how='inner')
Out[107]:
pid weight height date something bmi treatment age
In [108]:
pd.merge(df, df2, on='pid', how='left')
Out[108]:
pid weight height date something bmi treatment age
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a NaN
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b NaN
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b NaN
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a NaN
In [109]:
pd.merge(df, df2, on='pid', how='right')
Out[109]:
pid weight height date something bmi treatment age
0 649 NaN NaN NaT NaN NaN NaN 23
1 533 NaN NaN NaT NaN NaN NaN 34
2 400 NaN NaN NaT NaN NaN NaN 45
3 600 NaN NaN NaT NaN NaN NaN 56
In [110]:
pd.merge(df, df2, on='pid', how='outer')
Out[110]:
pid weight height date something bmi treatment age
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a NaN
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b NaN
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b NaN
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a NaN
5 649 NaN NaN NaT NaN NaN NaN 23.0
6 533 NaN NaN NaT NaN NaN NaN 34.0
7 400 NaN NaN NaT NaN NaN NaN 45.0
8 600 NaN NaN NaT NaN NaN NaN 56.0

Merging on the index

In [111]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))
In [112]:
df1
Out[112]:
x
a 1
b 2
c 3
In [113]:
df2
Out[113]:
y
a 4
b 5
c 6
In [114]:
df3
Out[114]:
z
a 7
b 8
c 9
In [115]:
df1.join([df2, df3])
Out[115]:
x y z
a 1 4 7
b 2 5 8
c 3 6 9

Fixing common DataFrame issues

Multiple variables in a column

In [116]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df
Out[116]:
pid_treat
0 A-1
1 B-2
2 C-1
3 D-2
In [117]:
df.pid_treat.str.split('-')
Out[117]:
0    [A, 1]
1    [B, 2]
2    [C, 1]
3    [D, 2]
Name: pid_treat, dtype: object
In [118]:
df.pid_treat.str.split('-').apply(pd.Series, index=['pid', 'treat'])
Out[118]:
pid treat
0 A 1
1 B 2
2 C 1
3 D 2

Multiple values in a cell

In [119]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))
df
Out[119]:
pid vals
0 a (1, 2, 3)
1 b (4, 5, 6)
2 c (7, 8, 9)
In [120]:
df[['t1', 't2', 't3']]  = df.vals.apply(pd.Series)
df
Out[120]:
pid vals t1 t2 t3
0 a (1, 2, 3) 1 2 3
1 b (4, 5, 6) 4 5 6
2 c (7, 8, 9) 7 8 9
In [121]:
df.drop('vals', axis=1, inplace=True)
In [122]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)
Out[122]:
pid vals
0 a 1
1 b 4
2 c 7
3 a 2
4 b 5
5 c 8
6 a 3
7 b 6
8 c 9

Reshaping Data Frames

Sometimes we need to make rows into columns or vice versa.

Converting multiple columns into a single column

This is often useful if you need to condition on some variable.

In [123]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)
In [124]:
iris.head()
Out[124]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [125]:
iris.shape
Out[125]:
(150, 5)
In [126]:
df_iris = pd.melt(iris, id_vars='species')
In [127]:
df_iris.sample(10)
Out[127]:
species variable value
535 versicolor petal_width 1.6
175 setosa sepal_width 3.0
430 virginica petal_length 6.1
481 setosa petal_width 0.4
3 setosa sepal_length 4.6
73 versicolor sepal_length 6.1
581 virginica petal_width 2.0
362 versicolor petal_length 4.0
294 virginica sepal_width 3.3
66 versicolor sepal_length 5.6
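Going the other way, pivot converts a long format back to wide, with one column per variable. A minimal sketch using made-up values in the same shape as the melted iris data:

```python
import pandas as pd

# A small long-format frame shaped like the melted iris data
df_long = pd.DataFrame(dict(
    species=['setosa', 'setosa', 'virginica', 'virginica'],
    variable=['sepal_length', 'sepal_width', 'sepal_length', 'sepal_width'],
    value=[5.1, 3.5, 6.5, 3.0],
))

# pivot spreads 'variable' back out into columns
wide = df_long.pivot(index='species', columns='variable', values='value')
print(wide)
```

Note that pivot requires each (index, columns) pair to be unique; for repeated pairs, as in the full melted iris data, use pivot_table with an aggregation function instead.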

Chaining commands

You will sometimes see this functional style of method chaining, which avoids the need for temporary intermediate variables.

In [128]:
(
    iris.
    sample(frac=0.2).
    filter(regex='s.*').
    assign(both=iris.sepal_length + iris.sepal_length).
    groupby('species').agg(['mean', 'sum']).
    pipe(lambda x: np.around(x, 1))
)
Out[128]:
sepal_length sepal_width both
mean sum mean sum mean sum
species
setosa 5.2 56.8 3.6 39.1 10.3 113.6
versicolor 6.0 42.1 2.8 19.6 12.0 84.2
virginica 6.7 80.1 2.9 34.6 13.4 160.2

Moving between R and Python in Jupyter

In [129]:
%load_ext rpy2.ipython
In [130]:
import warnings
warnings.simplefilter('ignore', FutureWarning)
In [131]:
iris = %R iris
In [132]:
iris.head()
Out[132]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [133]:
iris_py = iris.copy()
iris_py.Species = iris_py.Species.str.upper()
In [134]:
%%R -i iris_py -o iris_r

iris_r <- iris_py[1:3,]
In [135]:
iris_r
Out[135]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 SETOSA
1 4.9 3.0 1.4 0.2 SETOSA
2 4.7 3.2 1.3 0.2 SETOSA