Data Science: Data processing

Typical packages: pandas, plotnine, plotly, streamlit

References

  • Introduction to pandas

[1]:
import numpy as np
import pandas as pd

Series and Data Frames

Series objects

A Series is like a vector. All elements must have the same type or be null.

[2]:
s = pd.Series([1,1,2,3] + [None])
s
[2]:
0    1.0
1    1.0
2    2.0
3    3.0
4    NaN
dtype: float64

Size

[3]:
s.size
[3]:
5

Unique Counts

[4]:
s.value_counts()
[4]:
1.0    2
3.0    1
2.0    1
dtype: int64
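
By default, value_counts drops missing values; pass dropna=False to include them:

s.value_counts(dropna=False)  # the NaN entry now appears in the counts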

Special types of series

Strings

[5]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1
[5]:
0      the quick
1    quick brown
2      brown fox
3      fox jumps
4     jumps over
5       over the
6       the lazy
7       lazy dog
dtype: object
[6]:
s1.str.upper()
[6]:
0      THE QUICK
1    QUICK BROWN
2      BROWN FOX
3      FOX JUMPS
4     JUMPS OVER
5       OVER THE
6       THE LAZY
7       LAZY DOG
dtype: object
[7]:
s1.str.split()
[7]:
0      [the, quick]
1    [quick, brown]
2      [brown, fox]
3      [fox, jumps]
4     [jumps, over]
5       [over, the]
6       [the, lazy]
7       [lazy, dog]
dtype: object
[8]:
s1.str.split().str[1]
[8]:
0    quick
1    brown
2      fox
3    jumps
4     over
5      the
6     lazy
7      dog
dtype: object
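
The str accessor provides many other vectorized string methods; for example (a small sketch):

s1.str.contains('the')        # boolean mask for substring matches
s1.str.replace('fox', 'cat')  # vectorized substitution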

Categories

[9]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2
[9]:
0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: object
[10]:
s2 = s2.astype('category')
s2
[10]:
0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: category
Categories (4, object): ['Asian', 'Black', 'Hispanic', 'White']
[11]:
s2.cat.categories
[11]:
Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')
[12]:
s2.cat.codes
[12]:
0    0
1    0
2    3
3    1
4    3
5    2
dtype: int8
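
Categories can also be given an explicit order, which enables comparisons and sorting (a sketch with made-up data):

sizes = pd.Series(['small', 'large', 'medium'])
sizes = sizes.astype(pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True))
sizes.sort_values()  # small, medium, large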

Dates and times

Datetimes are often useful as indices to a time series.

[13]:
import pendulum
[14]:
d = pendulum.today()
[15]:
d.to_date_string()
[15]:
'2020-11-11'
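
pendulum is just a convenience here; plain pandas works as well (a sketch, d_alt is an illustrative name):

d_alt = pd.Timestamp.today()   # no extra dependency needed
d_alt.strftime('%Y-%m-%d')
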
[16]:
k = 18
s3 = pd.Series(range(k),
               index=pd.date_range(d.to_date_string(),
                                   periods=k,
                                   freq='M'))
[17]:
s3
[17]:
2020-11-30     0
2020-12-31     1
2021-01-31     2
2021-02-28     3
2021-03-31     4
2021-04-30     5
2021-05-31     6
2021-06-30     7
2021-07-31     8
2021-08-31     9
2021-09-30    10
2021-10-31    11
2021-11-30    12
2021-12-31    13
2022-01-31    14
2022-02-28    15
2022-03-31    16
2022-04-30    17
Freq: M, dtype: int64
[18]:
s3['2021']
[18]:
2021-01-31     2
2021-02-28     3
2021-03-31     4
2021-04-30     5
2021-05-31     6
2021-06-30     7
2021-07-31     8
2021-08-31     9
2021-09-30    10
2021-10-31    11
2021-11-30    12
2021-12-31    13
Freq: M, dtype: int64
[19]:
s3['2021-01':'2021-06']
[19]:
2021-01-31    2
2021-02-28    3
2021-03-31    4
2021-04-30    5
2021-05-31    6
2021-06-30    7
Freq: M, dtype: int64

If datetimes are stored as a Series rather than as an index, you need the dt accessor to reach these methods.

[20]:
s4 = s3.index.to_series()
[21]:
s4.dt.day_name()
[21]:
2020-11-30       Monday
2020-12-31     Thursday
2021-01-31       Sunday
2021-02-28       Sunday
2021-03-31    Wednesday
2021-04-30       Friday
2021-05-31       Monday
2021-06-30    Wednesday
2021-07-31     Saturday
2021-08-31      Tuesday
2021-09-30     Thursday
2021-10-31       Sunday
2021-11-30      Tuesday
2021-12-31       Friday
2022-01-31       Monday
2022-02-28       Monday
2022-03-31     Thursday
2022-04-30     Saturday
Freq: M, dtype: object
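
The dt accessor exposes many other datetime attributes as well, e.g.:

s4.dt.month         # month number (1-12)
s4.dt.is_month_end  # boolean flags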

DataFrame objects

A DataFrame is like a matrix. Columns in a DataFrame are Series.

  • Each column in a DataFrame represents a variable

  • Each row in a DataFrame represents an observation

  • Each cell in a DataFrame represents a value

[22]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df
[22]:
num
0 1.0
1 2.0
2 3.0
3 NaN
[23]:
df.num
[23]:
0    1.0
1    2.0
2    3.0
3    NaN
Name: num, dtype: float64

Index

Row and column identifiers are of Index type.

Somewhat confusingly, index is also a synonym for the row identifiers.

[24]:
df.index
[24]:
RangeIndex(start=0, stop=4, step=1)

Setting a column as the row index

[25]:
df
[25]:
num
0 1.0
1 2.0
2 3.0
3 NaN
[26]:
df1 = df.set_index('num')
df1
[26]:
num
1.0
2.0
3.0
NaN

Making an index into a column

[27]:
df1.reset_index()
[27]:
num
0 1.0
1 2.0
2 3.0
3 NaN

Columns

The columns are just another Index object.

[28]:
df.columns
[28]:
Index(['num'], dtype='object')

Getting raw values

Sometimes you just want a numpy array, and not a pandas object. (Newer pandas versions recommend df.to_numpy() over the values attribute.)

[29]:
df.values
[29]:
array([[ 1.],
       [ 2.],
       [ 3.],
       [nan]])

Creating Data Frames

Manual

[30]:
from collections import OrderedDict
[31]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='d')
df = pd.DataFrame(OrderedDict(pid=np.random.randint(100, 999, n),
                              weight=np.random.normal(70, 20, n),
                              height=np.random.normal(170, 15, n),
                              date=dates,
                             ))
df
[31]:
pid weight height date
0 730 88.630097 195.922737 2020-11-11 19:25:24.062141
1 112 53.958069 169.022381 2020-11-12 19:25:24.062141
2 740 89.709304 180.327636 2020-11-13 19:25:24.062141
3 485 52.879337 142.806223 2020-11-14 19:25:24.062141
4 905 80.488371 168.384792 2020-11-15 19:25:24.062141

From file

You can read in data from many different file types: plain text, JSON, spreadsheets, databases, etc. Functions to read in data look like read_X, where X is the data type.

[32]:
%%file measures.txt
pid weight  height  date
328 72.654347       203.560866      2018-11-11 14:16:18.148411
756 34.027679       189.847316      2018-11-12 14:16:18.148411
185 28.501914       158.646074      2018-11-13 14:16:18.148411
507 17.396343       180.795993      2018-11-14 14:16:18.148411
919 64.724301       173.564725      2018-11-15 14:16:18.148411
Overwriting measures.txt
[33]:
df = pd.read_table('measures.txt')
df
[33]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
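
Other readers follow the same pattern. For instance, read_csv can parse dates at load time (a sketch, assuming the tab-delimited file above):

df = pd.read_csv('measures.txt', sep='\t', parse_dates=['date'])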

Indexing Data Frames

Implicit defaults

If you provide a slice, it is assumed that you are asking for rows.

[34]:
df[1:3]
[34]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411

If you provide a single value or list, it is assumed that you are asking for columns.

[35]:
df[['pid', 'weight']]
[35]:
pid weight
0 328 72.654347
1 756 34.027679
2 185 28.501914
3 507 17.396343
4 919 64.724301

Extracting a column

Dictionary style access

[36]:
df['pid']
[36]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64

Property style access

This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes, or keywords).

[37]:
df.pid
[37]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64

Indexing by location

This is similar to numpy indexing

[38]:
df.iloc[1:3, :]
[38]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
[39]:
df.iloc[1:3, 1:4:2]
[39]:
weight date
1 34.027679 2018-11-12 14:16:18.148411
2 28.501914 2018-11-13 14:16:18.148411

Indexing by name

[40]:
df.loc[1:3, 'weight':'height']
[40]:
weight height
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993

Warning: When using loc, the row slice refers to row labels, not positions, and both endpoints are included.

[41]:
df1 = df.copy()
df1.index = df.index + 1
df1
[41]:
pid weight height date
1 328 72.654347 203.560866 2018-11-11 14:16:18.148411
2 756 34.027679 189.847316 2018-11-12 14:16:18.148411
3 185 28.501914 158.646074 2018-11-13 14:16:18.148411
4 507 17.396343 180.795993 2018-11-14 14:16:18.148411
5 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[42]:
df1.loc[1:3, 'weight':'height']
[42]:
weight height
1 72.654347 203.560866
2 34.027679 189.847316
3 28.501914 158.646074

Structure of a Data Frame

Data types

[43]:
df.dtypes
[43]:
pid         int64
weight    float64
height    float64
date       object
dtype: object

Converting data types

Using astype on one column

[44]:
df.pid = df.pid.astype('category')

Using astype on multiple columns

[45]:
df = df.astype(dict(weight=float, height=float))

Using a conversion function

[46]:
df.date = pd.to_datetime(df.date)

Check

[47]:
df.dtypes
[47]:
pid             category
weight           float64
height           float64
date      datetime64[ns]
dtype: object

Basic properties

[48]:
df.size
[48]:
20
[49]:
df.shape
[49]:
(5, 4)
[50]:
df.describe()
[50]:
weight height
count 5.000000 5.000000
mean 43.460917 181.282995
std 23.960945 16.895933
min 17.396343 158.646074
25% 28.501914 173.564725
50% 34.027679 180.795993
75% 64.724301 189.847316
max 72.654347 203.560866
[51]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   pid     5 non-null      category
 1   weight  5 non-null      float64
 2   height  5 non-null      float64
 3   date    5 non-null      datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(2)
memory usage: 453.0 bytes

Inspection

[52]:
df.head(n=3)
[52]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
[53]:
df.tail(n=3)
[53]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[54]:
df.sample(n=3)
[54]:
pid weight height date
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
[55]:
df.sample(frac=0.5)
[55]:
pid weight height date
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411

Selecting, Renaming and Removing Columns

Selecting columns

[56]:
df.filter(items=['pid', 'date'])
[56]:
pid date
0 328 2018-11-11 14:16:18.148411
1 756 2018-11-12 14:16:18.148411
2 185 2018-11-13 14:16:18.148411
3 507 2018-11-14 14:16:18.148411
4 919 2018-11-15 14:16:18.148411
[57]:
df.filter(regex='.*ght')
[57]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725

Note that you can also use regular string methods on the column index.

[58]:
df.loc[:, df.columns.str.contains('d')]
[58]:
pid date
0 328 2018-11-11 14:16:18.148411
1 756 2018-11-12 14:16:18.148411
2 185 2018-11-13 14:16:18.148411
3 507 2018-11-14 14:16:18.148411
4 919 2018-11-15 14:16:18.148411

Renaming columns

[59]:
df.rename(dict(weight='w', height='h'), axis=1)
[59]:
pid w h date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[60]:
orig_cols = df.columns
[61]:
df.columns = list('abcd')
[62]:
df
[62]:
a b c d
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[63]:
df.columns = orig_cols
[64]:
df
[64]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Removing columns

[65]:
df.drop(['pid', 'date'], axis=1)
[65]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725
[66]:
df.drop(columns=['pid', 'date'])
[66]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725
[67]:
df.drop(columns=df.columns[df.columns.str.contains('d')])
[67]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725

Selecting, Renaming and Removing Rows

Selecting rows

[68]:
df[df.weight.between(60,70)]
[68]:
pid weight height date
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[69]:
df[(69 <= df.weight) & (df.weight < 70)]
[69]:
pid weight height date
[70]:
df[df.date.between(pd.to_datetime('2018-11-13'),
                   pd.to_datetime('2018-11-15 23:59:59'))]
[70]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
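
The same selections can often be written more readably with query:

df.query('69 <= weight < 70')  # equivalent to the boolean mask above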

Renaming rows

[71]:
df.rename({i:letter for i,letter in enumerate('abcde')})
[71]:
pid weight height date
a 328 72.654347 203.560866 2018-11-11 14:16:18.148411
b 756 34.027679 189.847316 2018-11-12 14:16:18.148411
c 185 28.501914 158.646074 2018-11-13 14:16:18.148411
d 507 17.396343 180.795993 2018-11-14 14:16:18.148411
e 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[72]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumps']
[73]:
df
[73]:
pid weight height date
the 328 72.654347 203.560866 2018-11-11 14:16:18.148411
quick 756 34.027679 189.847316 2018-11-12 14:16:18.148411
brown 185 28.501914 158.646074 2018-11-13 14:16:18.148411
fox 507 17.396343 180.795993 2018-11-14 14:16:18.148411
jumps 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[74]:
df = df.reset_index(drop=True)
[75]:
df
[75]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Dropping rows

[76]:
df.drop([1,3], axis=0)
[76]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Dropping duplicated data

[77]:
df['something'] = [1,1,None,2,None]
[78]:
df.loc[df.something.duplicated()]
[78]:
pid weight height date something
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN
[79]:
df.drop_duplicates(subset='something')
[79]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0

Dropping missing data

[80]:
df
[80]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN
[81]:
df.something.fillna(0)
[81]:
0    1.0
1    1.0
2    0.0
3    2.0
4    0.0
Name: something, dtype: float64
[82]:
df.something.ffill()
[82]:
0    1.0
1    1.0
2    1.0
3    2.0
4    2.0
Name: something, dtype: float64
[83]:
df.something.bfill()
[83]:
0    1.0
1    1.0
2    2.0
3    2.0
4    NaN
Name: something, dtype: float64
[84]:
df.something.interpolate()
[84]:
0    1.0
1    1.0
2    1.5
3    2.0
4    2.0
Name: something, dtype: float64
[85]:
df.dropna()
[85]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0
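
dropna also takes subset and thresh arguments for finer control (a sketch):

df.dropna(subset=['something'])  # only consider nulls in 'something'
df.dropna(thresh=4)              # keep rows with at least 4 non-null values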

Transforming and Creating Columns

[86]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)
[86]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449
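
assign also accepts callables, which is handy in method chains where the intermediate frame has no name:

df.assign(bmi=lambda d: d['weight'] / (d['height'] / 100) ** 2)
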
[87]:
df['bmi'] = df['weight'] / (df['height']/100)**2
[88]:
df
[88]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449
[89]:
df['something'] = [2,2,None,None,3]
[90]:
df
[90]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449

Sorting Data Frames

Sort on indexes

[91]:
df.sort_index(axis=1)
[91]:
bmi date height pid something weight
0 17.533678 2018-11-11 14:16:18.148411 203.560866 328 2.0 72.654347
1 9.441118 2018-11-12 14:16:18.148411 189.847316 756 2.0 34.027679
2 11.324404 2018-11-13 14:16:18.148411 158.646074 185 NaN 28.501914
3 5.322067 2018-11-14 14:16:18.148411 180.795993 507 NaN 17.396343
4 21.485449 2018-11-15 14:16:18.148411 173.564725 919 3.0 64.724301
[92]:
df.sort_index(axis=0, ascending=False)
[92]:
pid weight height date something bmi
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678

Sort on values

[93]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])
[93]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067
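
Missing values are placed last by default; na_position controls this:

df.sort_values(by='something', na_position='first')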

Summarizing

Apply an aggregation function

[94]:
df.select_dtypes(include=np.number)
[94]:
weight height something bmi
0 72.654347 203.560866 2.0 17.533678
1 34.027679 189.847316 2.0 9.441118
2 28.501914 158.646074 NaN 11.324404
3 17.396343 180.795993 NaN 5.322067
4 64.724301 173.564725 3.0 21.485449
[95]:
df.select_dtypes(include=np.number).agg(np.sum)
[95]:
weight       217.304584
height       906.414974
something      7.000000
bmi           65.106716
dtype: float64
[96]:
df.agg(['count', np.sum, np.mean])
[96]:
pid weight height date something bmi
count 5.0 5.000000 5.000000 5 3.000000 5.000000
sum NaN 217.304584 906.414974 NaN 7.000000 65.106716
mean NaN 43.460917 181.282995 2018-11-13 14:16:18.148410880 2.333333 13.021343

Split-Apply-Combine

We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with groupby followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.

[97]:
df['treatment'] = list('ababa')
[98]:
df
[98]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
[99]:
grouped = df.groupby('treatment')
[100]:
grouped.get_group('a')
[100]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
[101]:
grouped.mean()
[101]:
weight height something bmi
treatment
a 55.293521 178.590555 2.5 16.781177
b 25.712011 185.321654 2.0 7.381592

Using agg with groupby

[102]:
grouped.agg('mean')
[102]:
weight height something bmi
treatment
a 55.293521 178.590555 2.5 16.781177
b 25.712011 185.321654 2.0 7.381592
[103]:
grouped.agg(['mean', 'std'])
[103]:
weight height something bmi
mean std mean std mean std mean std
treatment
a 55.293521 23.538565 178.590555 22.875289 2.5 0.707107 16.781177 5.122148
b 25.712011 11.760130 185.321654 6.400252 2.0 NaN 7.381592 2.912608
[104]:
grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})
[104]:
weight height bmi
mean std min max <lambda>
treatment
a 55.293521 23.538565 158.646074 203.560866 897.296525
b 25.712011 11.760130 180.795993 189.847316 117.459100
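
Pandas 0.25 and later also supports named aggregation, which gives flat column names instead of a MultiIndex:

grouped.agg(mean_weight=('weight', 'mean'),
            max_height=('height', 'max'))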

Using transform with groupby

[105]:
g_mean = grouped[['weight', 'height']].transform(np.mean)
g_mean
[105]:
weight height
0 55.293521 178.590555
1 25.712011 185.321654
2 55.293521 178.590555
3 25.712011 185.321654
4 55.293521 178.590555
[106]:
g_std = grouped[['weight', 'height']].transform(np.std)
g_std
[106]:
weight height
0 23.538565 22.875289
1 11.760130 6.400252
2 23.538565 22.875289
3 11.760130 6.400252
4 23.538565 22.875289
[107]:
(df[['weight', 'height']] - g_mean)/g_std
[107]:
weight height
0 0.737548 1.091584
1 0.707107 0.707107
2 -1.138201 -0.871879
3 -0.707107 -0.707107
4 0.400652 -0.219706

Combining Data Frames

[108]:
df
[108]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
[109]:
df1 =  df.iloc[3:].copy()
[110]:
df1.drop('something', axis=1, inplace=True)
df1
[110]:
pid weight height date bmi treatment
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 21.485449 a

Adding rows

Note that pandas aligns by column indexes automatically. (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; prefer pd.concat, shown next.)

[111]:
df.append(df1, sort=False)
[111]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449 a
[112]:
pd.concat([df, df1], sort=False)
[112]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449 a

Adding columns

[113]:
df.pid
[113]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: category
Categories (5, int64): [185, 328, 507, 756, 919]
[114]:
df2 = pd.DataFrame(OrderedDict(pid=[649, 533, 400, 600], age=[23,34,45,56]))
[115]:
df2.pid
[115]:
0    649
1    533
2    400
3    600
Name: pid, dtype: int64
[116]:
df.pid = df.pid.astype('int')
[117]:
pd.merge(df, df2, on='pid', how='inner')
[117]:
pid weight height date something bmi treatment age
[118]:
pd.merge(df, df2, on='pid', how='left')
[118]:
pid weight height date something bmi treatment age
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a NaN
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b NaN
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b NaN
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a NaN
[119]:
pd.merge(df, df2, on='pid', how='right')
[119]:
pid weight height date something bmi treatment age
0 649 NaN NaN NaT NaN NaN NaN 23
1 533 NaN NaN NaT NaN NaN NaN 34
2 400 NaN NaN NaT NaN NaN NaN 45
3 600 NaN NaN NaT NaN NaN NaN 56
[120]:
pd.merge(df, df2, on='pid', how='outer')
[120]:
pid weight height date something bmi treatment age
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a NaN
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b NaN
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b NaN
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a NaN
5 649 NaN NaN NaT NaN NaN NaN 23.0
6 533 NaN NaN NaT NaN NaN NaN 34.0
7 400 NaN NaN NaT NaN NaN NaN 45.0
8 600 NaN NaN NaT NaN NaN NaN 56.0

Merging on the index

[121]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))
[122]:
df1
[122]:
x
a 1
b 2
c 3
[123]:
df2
[123]:
y
a 4
b 5
c 6
[124]:
df3
[124]:
z
a 7
b 8
c 9
[125]:
df1.join([df2, df3])
[125]:
x y z
a 1 4 7
b 2 5 8
c 3 6 9

Fixing common DataFrame issues

Multiple variables in a column

[126]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df
[126]:
pid_treat
0 A-1
1 B-2
2 C-1
3 D-2
[127]:
df.pid_treat.str.split('-')
[127]:
0    [A, 1]
1    [B, 2]
2    [C, 1]
3    [D, 2]
Name: pid_treat, dtype: object
[128]:
df.pid_treat.str.split('-').apply(pd.Series, index=['pid', 'treat'])
[128]:
pid treat
0 A 1
1 B 2
2 C 1
3 D 2
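
A more direct route is expand=True, which returns a DataFrame in one step:

df.pid_treat.str.split('-', expand=True).rename(columns={0: 'pid', 1: 'treat'})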

Multiple values in a cell

[129]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))
df
[129]:
pid vals
0 a (1, 2, 3)
1 b (4, 5, 6)
2 c (7, 8, 9)
[130]:
df[['t1', 't2', 't3']]  = df.vals.apply(pd.Series)
df
[130]:
pid vals t1 t2 t3
0 a (1, 2, 3) 1 2 3
1 b (4, 5, 6) 4 5 6
2 c (7, 8, 9) 7 8 9
[131]:
df.drop('vals', axis=1, inplace=True)
[132]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)
[132]:
pid vals
0 a 1
1 b 4
2 c 7
3 a 2
4 b 5
5 c 8
6 a 3
7 b 6
8 c 9
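
Pandas 0.25 and later also provides explode for list-like cells; applied to a fresh copy of the frame above (df_orig is an illustrative name), it gives one row per element:

df_orig = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals=[(1,2,3), (4,5,6), (7,8,9)]))
df_orig.explode('vals')  # one row per tuple element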

Reshaping Data Frames

Sometimes we need to make rows into columns or vice versa.

Converting multiple columns into a single column

This is often useful if you need to condition on some variable.

[133]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)
[134]:
iris.head()
[134]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
[135]:
iris.shape
[135]:
(150, 5)
[136]:
df_iris = pd.melt(iris, id_vars='species')
[137]:
df_iris.sample(10)
[137]:
species variable value
463 setosa petal_width 0.1
210 versicolor sepal_width 2.0
70 versicolor sepal_length 5.9
468 setosa petal_width 0.3
57 versicolor sepal_length 4.9
98 versicolor sepal_length 5.1
111 virginica sepal_length 6.4
385 versicolor petal_length 4.5
558 virginica petal_width 1.8
469 setosa petal_width 0.3

Pivoting

Sometimes we need to convert categorical values in a column into separate columns. This is often done at the same time as performing a summary.

[138]:
df_iris.pivot_table(index='variable', columns='species', values='value', aggfunc='mean')
[138]:
species setosa versicolor virginica
variable
petal_length 1.464 4.260 5.552
petal_width 0.244 1.326 2.026
sepal_length 5.006 5.936 6.588
sepal_width 3.418 2.770 2.974

Functional style - apply, applymap and map

apply can be used to apply a custom function

[139]:
scores = pd.DataFrame(
    np.around(np.clip(np.random.normal(90, 10, (5,3)), 0, 100), 1),
    columns = ['math', 'stat', 'biol'],
    index = ['anne', 'bob', 'charles', 'dirk', 'edgar']
)
[140]:
scores
[140]:
math stat biol
anne 93.4 99.5 80.9
bob 67.5 82.9 94.5
charles 93.5 91.5 96.4
dirk 89.4 93.3 84.9
edgar 88.3 92.4 97.0
[141]:
def convert_grade_1(score):
    return np.where(score > 90, 'A',
                    np.where(score > 80, 'B',
                            np.where(score > 70, 'C', 'F')))
[142]:
scores.apply(convert_grade_1)
[142]:
math stat biol
anne A A B
bob F B A
charles A A A
dirk B A B
edgar B A A

The np.where is a little clumsy - here is an alternative.

[143]:
def convert_grade_2(score):
    if score.name == 'math': # math professors are mean
        return np.choose(
            pd.cut(score, [-1, 80, 90, 95, 100], labels=False),
            ['F', 'C', 'B', 'A']
        )
    else:
        return np.choose(
            pd.cut(score, [-1, 70, 80, 90, 100], labels=False),
            ['F', 'C', 'B', 'A']
        )
[144]:
scores.apply(convert_grade_2)
[144]:
math stat biol
anne B A B
bob F B A
charles B A A
dirk C A B
edgar C A A

apply can be used to avoid explicit looping

[145]:
def likely_profession(row):
    if (row.biol > row.math) and (row.biol > row.stat):
        return 'farmer'
    elif (row.math > row.biol) and (row.math > row.stat):
        return 'high school teacher'
    elif (row.stat > row.math) and (row.stat > row.biol):
        return 'actuary'
    else:
        return 'doctor'
[146]:
scores.apply(likely_profession, axis=1)
[146]:
anne       actuary
bob         farmer
charles     farmer
dirk       actuary
edgar       farmer
dtype: object

If all else fails, you can loop over pandas data frames.

  • Be prepared for pitying looks from more snobbish Python coders

Loops are frowned upon because they are not efficient, but sometimes pragmatism beats elegance.

[147]:
for idx, row in scores.iterrows():
    print(f'\nidx = {idx}\nrow = {row.index}: {row.values}\n',
          end='-'*30)

idx = anne
row = Index(['math', 'stat', 'biol'], dtype='object'): [93.4 99.5 80.9]
------------------------------
idx = bob
row = Index(['math', 'stat', 'biol'], dtype='object'): [67.5 82.9 94.5]
------------------------------
idx = charles
row = Index(['math', 'stat', 'biol'], dtype='object'): [93.5 91.5 96.4]
------------------------------
idx = dirk
row = Index(['math', 'stat', 'biol'], dtype='object'): [89.4 93.3 84.9]
------------------------------
idx = edgar
row = Index(['math', 'stat', 'biol'], dtype='object'): [88.3 92.4 97. ]
------------------------------
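
If you must loop, itertuples is usually faster than iterrows and preserves dtypes:

for row in scores.itertuples():
    print(row.Index, row.math, row.stat, row.biol)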

apply can be used for reductions along margins

[148]:
df = pd.DataFrame(np.random.randint(0, 10, (4,5)), columns=list('abcde'), index=list('wxyz'))
[149]:
df
[149]:
a b c d e
w 3 7 0 3 7
x 9 5 5 2 5
y 9 4 5 8 3
z 0 4 9 1 8
[150]:
df.apply(sum, axis=0)
[150]:
a    21
b    20
c    19
d    14
e    23
dtype: int64
[151]:
df.apply(sum, axis=1)
[151]:
w    20
x    26
y    29
z    22
dtype: int64
[152]:
import string
[153]:
char_map = {i: c for i,c in enumerate(string.ascii_uppercase)}
[154]:
df.applymap(lambda x: char_map[x])
[154]:
a b c d e
w D H A D H
x J F F C F
y J E F I D
z A E J B I
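
In pandas 2.1 and later, applymap is deprecated in favor of the element-wise DataFrame.map, which does the same thing:

df.map(lambda x: char_map[x])  # equivalent in pandas >= 2.1
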
[155]:
df.assign(b_map = df.b.map(char_map))
[155]:
a b c d e b_map
w 3 7 0 3 7 H
x 9 5 5 2 5 F
y 9 4 5 8 3 E
z 0 4 9 1 8 E

Chaining commands

You will sometimes see this functional style of method chaining, which avoids the need for temporary intermediate variables.

[156]:
(
    iris.
    sample(frac=0.2).
    filter(regex='s.*').
    assign(both=iris.sepal_length + iris.petal_length).
    query('both > 2').
    groupby('species').agg(['mean', 'sum']).
    pipe(lambda x: np.around(x, 1))
)
[156]:
sepal_length sepal_width both
mean sum mean sum mean sum
species
setosa 5.1 50.7 3.5 35.1 6.6 65.7
versicolor 5.8 52.2 2.8 24.8 10.1 90.9
virginica 7.0 77.0 3.1 33.8 12.7 140.2

Moving between R and Python in Jupyter

[157]:
%load_ext rpy2.ipython
[158]:
import warnings
warnings.simplefilter('ignore', FutureWarning)
[159]:
iris = %R iris
[160]:
iris.head()
[160]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
[161]:
iris_py = iris.copy()
iris_py.Species = iris_py.Species.str.upper()
[162]:
%%R -i iris_py -o iris_r

iris_r <- iris_py[1:3,]
[163]:
iris_r
[163]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 SETOSA
2 4.9 3.0 1.4 0.2 SETOSA
3 4.7 3.2 1.3 0.2 SETOSA
[164]:
! python3 -m pip install --quiet watermark
[165]:
%load_ext watermark
[166]:
%watermark -v -iv
pendulum 2.1.2
pandas   1.1.1
numpy    1.18.5
CPython 3.8.5
IPython 7.17.0