Data Science: Data processing

Typical packages: pandas, plotnine, plotly, streamlit

References

  • Introduction to pandas

[1]:
import numpy as np
import pandas as pd

Series and Data Frames

Series objects

A Series is like a vector. All elements must have the same type or be null.

[2]:
s = pd.Series([1,1,2,3] + [None])
s
[2]:
0    1.0
1    1.0
2    2.0
3    3.0
4    NaN
dtype: float64

Size

[3]:
s.size
[3]:
5

Unique Counts

[4]:
s.value_counts()
[4]:
1.0    2
3.0    1
2.0    1
dtype: int64
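
By default, value_counts drops missing values; pass dropna=False to include them:

s.value_counts(dropna=False)  # the NaN entry now appears in the counts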

Special types of series

Strings

[5]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1
[5]:
0      the quick
1    quick brown
2      brown fox
3      fox jumps
4     jumps over
5       over the
6       the lazy
7       lazy dog
dtype: object
[6]:
s1.str.upper()
[6]:
0      THE QUICK
1    QUICK BROWN
2      BROWN FOX
3      FOX JUMPS
4     JUMPS OVER
5       OVER THE
6       THE LAZY
7       LAZY DOG
dtype: object
[7]:
s1.str.split()
[7]:
0      [the, quick]
1    [quick, brown]
2      [brown, fox]
3      [fox, jumps]
4     [jumps, over]
5       [over, the]
6       [the, lazy]
7       [lazy, dog]
dtype: object
[8]:
s1.str.split().str[1]
[8]:
0    quick
1    brown
2      fox
3    jumps
4     over
5      the
6     lazy
7      dog
dtype: object
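
The str accessor provides many other vectorized string methods; for example (a small sketch):

s1.str.contains('the')        # boolean mask for substring matches
s1.str.replace('fox', 'cat')  # vectorized substitution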

Categories

[9]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2
[9]:
0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: object
[10]:
s2 = s2.astype('category')
s2
[10]:
0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: category
Categories (4, object): ['Asian', 'Black', 'Hispanic', 'White']
[11]:
s2.cat.categories
[11]:
Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')
[12]:
s2.cat.codes
[12]:
0    0
1    0
2    3
3    1
4    3
5    2
dtype: int8
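
Categories can also be given an explicit order, which enables comparisons and sorting (a sketch with made-up data):

sizes = pd.Series(['small', 'large', 'medium'])
sizes = sizes.astype(pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True))
sizes.sort_values()  # small, medium, large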

Dates and times

Datetimes are often useful as indices to a time series.

[13]:
import pendulum
[14]:
d = pendulum.today()
[15]:
d.to_date_string()
[15]:
'2020-11-11'
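
pendulum is just a convenience here; plain pandas works as well (a sketch, d_alt is an illustrative name):

d_alt = pd.Timestamp.today()   # no extra dependency needed
d_alt.strftime('%Y-%m-%d')
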
[16]:
k = 18
s3 = pd.Series(range(k),
               index=pd.date_range(d.to_date_string(),
                                   periods=k,
                                   freq='M'))
[17]:
s3
[17]:
2020-11-30     0
2020-12-31     1
2021-01-31     2
2021-02-28     3
2021-03-31     4
2021-04-30     5
2021-05-31     6
2021-06-30     7
2021-07-31     8
2021-08-31     9
2021-09-30    10
2021-10-31    11
2021-11-30    12
2021-12-31    13
2022-01-31    14
2022-02-28    15
2022-03-31    16
2022-04-30    17
Freq: M, dtype: int64
[18]:
s3['2021']
[18]:
2021-01-31     2
2021-02-28     3
2021-03-31     4
2021-04-30     5
2021-05-31     6
2021-06-30     7
2021-07-31     8
2021-08-31     9
2021-09-30    10
2021-10-31    11
2021-11-30    12
2021-12-31    13
Freq: M, dtype: int64
[19]:
s3['2021-01':'2021-06']
[19]:
2021-01-31    2
2021-02-28    3
2021-03-31    4
2021-04-30    5
2021-05-31    6
2021-06-30    7
Freq: M, dtype: int64

If datetimes are stored as a Series rather than as an index, you need the dt accessor to reach these methods.

[20]:
s4 = s3.index.to_series()
[21]:
s4.dt.day_name()
[21]:
2020-11-30       Monday
2020-12-31     Thursday
2021-01-31       Sunday
2021-02-28       Sunday
2021-03-31    Wednesday
2021-04-30       Friday
2021-05-31       Monday
2021-06-30    Wednesday
2021-07-31     Saturday
2021-08-31      Tuesday
2021-09-30     Thursday
2021-10-31       Sunday
2021-11-30      Tuesday
2021-12-31       Friday
2022-01-31       Monday
2022-02-28       Monday
2022-03-31     Thursday
2022-04-30     Saturday
Freq: M, dtype: object
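
The dt accessor exposes many other datetime attributes as well, e.g.:

s4.dt.month         # month number (1-12)
s4.dt.is_month_end  # boolean flags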

DataFrame objects

A DataFrame is like a matrix. Columns in a DataFrame are Series.

  • Each column in a DataFrame represents a variable

  • Each row in a DataFrame represents an observation

  • Each cell in a DataFrame represents a value

[22]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df
[22]:
num
0 1.0
1 2.0
2 3.0
3 NaN
[23]:
df.num
[23]:
0    1.0
1    2.0
2    3.0
3    NaN
Name: num, dtype: float64

Index

Row and column identifiers are of Index type.

Somewhat confusingly, index is also a synonym for the row identifiers.

[24]:
df.index
[24]:
RangeIndex(start=0, stop=4, step=1)

Setting a column as the row index

[25]:
df
[25]:
num
0 1.0
1 2.0
2 3.0
3 NaN
[26]:
df1 = df.set_index('num')
df1
[26]:
num
1.0
2.0
3.0
NaN

Making an index into a column

[27]:
df1.reset_index()
[27]:
num
0 1.0
1 2.0
2 3.0
3 NaN

Columns

The columns are just another Index object.

[28]:
df.columns
[28]:
Index(['num'], dtype='object')

Getting raw values

Sometimes you just want a numpy array, and not a pandas object. (Newer pandas versions recommend df.to_numpy() over the values attribute.)

[29]:
df.values
[29]:
array([[ 1.],
       [ 2.],
       [ 3.],
       [nan]])

Creating Data Frames

Manual

[30]:
from collections import OrderedDict
[31]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='d')
df = pd.DataFrame(OrderedDict(pid=np.random.randint(100, 999, n),
                              weight=np.random.normal(70, 20, n),
                              height=np.random.normal(170, 15, n),
                              date=dates,
                             ))
df
[31]:
pid weight height date
0 730 88.630097 195.922737 2020-11-11 19:25:24.062141
1 112 53.958069 169.022381 2020-11-12 19:25:24.062141
2 740 89.709304 180.327636 2020-11-13 19:25:24.062141
3 485 52.879337 142.806223 2020-11-14 19:25:24.062141
4 905 80.488371 168.384792 2020-11-15 19:25:24.062141

From file

You can read in data from many different file types: plain text, JSON, spreadsheets, databases, etc. Functions to read in data look like read_X, where X is the data type.

[32]:
%%file measures.txt
pid weight  height  date
328 72.654347       203.560866      2018-11-11 14:16:18.148411
756 34.027679       189.847316      2018-11-12 14:16:18.148411
185 28.501914       158.646074      2018-11-13 14:16:18.148411
507 17.396343       180.795993      2018-11-14 14:16:18.148411
919 64.724301       173.564725      2018-11-15 14:16:18.148411
Overwriting measures.txt
[33]:
df = pd.read_table('measures.txt')
df
[33]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
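
Other readers follow the same pattern. For instance, read_csv can parse dates at load time (a sketch, assuming the tab-delimited file above):

df = pd.read_csv('measures.txt', sep='\t', parse_dates=['date'])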

Indexing Data Frames

Implicit defaults

If you provide a slice, it is assumed that you are asking for rows.

[34]:
df[1:3]
[34]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411

If you provide a single value or list, it is assumed that you are asking for columns.

[35]:
df[['pid', 'weight']]
[35]:
pid weight
0 328 72.654347
1 756 34.027679
2 185 28.501914
3 507 17.396343
4 919 64.724301

Extracting a column

Dictionary style access

[36]:
df['pid']
[36]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64

Property style access

This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes, or keywords).

[37]:
df.pid
[37]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64

Indexing by location

This is similar to numpy indexing

[38]:
df.iloc[1:3, :]
[38]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
[39]:
df.iloc[1:3, 1:4:2]
[39]:
weight date
1 34.027679 2018-11-12 14:16:18.148411
2 28.501914 2018-11-13 14:16:18.148411

Indexing by name

[40]:
df.loc[1:3, 'weight':'height']
[40]:
weight height
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993

Warning: When using loc, the row slice refers to row labels, not positions, and both endpoints are included.

[41]:
df1 = df.copy()
df1.index = df.index + 1
df1
[41]:
pid weight height date
1 328 72.654347 203.560866 2018-11-11 14:16:18.148411
2 756 34.027679 189.847316 2018-11-12 14:16:18.148411
3 185 28.501914 158.646074 2018-11-13 14:16:18.148411
4 507 17.396343 180.795993 2018-11-14 14:16:18.148411
5 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[42]:
df1.loc[1:3, 'weight':'height']
[42]:
weight height
1 72.654347 203.560866
2 34.027679 189.847316
3 28.501914 158.646074

Structure of a Data Frame

Data types

[43]:
df.dtypes
[43]:
pid         int64
weight    float64
height    float64
date       object
dtype: object

Converting data types

Using astype on one column

[44]:
df.pid = df.pid.astype('category')

Using astype on multiple columns

[45]:
df = df.astype(dict(weight=float, height=float))

Using a conversion function

[46]:
df.date = pd.to_datetime(df.date)

Check

[47]:
df.dtypes
[47]:
pid             category
weight           float64
height           float64
date      datetime64[ns]
dtype: object

Basic properties

[48]:
df.size
[48]:
20
[49]:
df.shape
[49]:
(5, 4)
[50]:
df.describe()
[50]:
weight height
count 5.000000 5.000000
mean 43.460917 181.282995
std 23.960945 16.895933
min 17.396343 158.646074
25% 28.501914 173.564725
50% 34.027679 180.795993
75% 64.724301 189.847316
max 72.654347 203.560866
[51]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   pid     5 non-null      category
 1   weight  5 non-null      float64
 2   height  5 non-null      float64
 3   date    5 non-null      datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(2)
memory usage: 453.0 bytes

Inspection

[52]:
df.head(n=3)
[52]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
[53]:
df.tail(n=3)
[53]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[54]:
df.sample(n=3)
[54]:
pid weight height date
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
[55]:
df.sample(frac=0.5)
[55]:
pid weight height date
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411

Selecting, Renaming and Removing Columns

Selecting columns

[56]:
df.filter(items=['pid', 'date'])
[56]:
pid date
0 328 2018-11-11 14:16:18.148411
1 756 2018-11-12 14:16:18.148411
2 185 2018-11-13 14:16:18.148411
3 507 2018-11-14 14:16:18.148411
4 919 2018-11-15 14:16:18.148411
[57]:
df.filter(regex='.*ght')
[57]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725

Note that you can also use regular string methods on the column index.

[58]:
df.loc[:, df.columns.str.contains('d')]
[58]:
pid date
0 328 2018-11-11 14:16:18.148411
1 756 2018-11-12 14:16:18.148411
2 185 2018-11-13 14:16:18.148411
3 507 2018-11-14 14:16:18.148411
4 919 2018-11-15 14:16:18.148411

Renaming columns

[59]:
df.rename(dict(weight='w', height='h'), axis=1)
[59]:
pid w h date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[60]:
orig_cols = df.columns
[61]:
df.columns = list('abcd')
[62]:
df
[62]:
a b c d
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[63]:
df.columns = orig_cols
[64]:
df
[64]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Removing columns

[65]:
df.drop(['pid', 'date'], axis=1)
[65]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725
[66]:
df.drop(columns=['pid', 'date'])
[66]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725
[67]:
df.drop(columns=df.columns[df.columns.str.contains('d')])
[67]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725

Selecting, Renaming and Removing Rows

Selecting rows

[68]:
df[df.weight.between(60,70)]
[68]:
pid weight height date
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[69]:
df[(69 <= df.weight) & (df.weight < 70)]
[69]:
pid weight height date
[70]:
df[df.date.between(pd.to_datetime('2018-11-13'),
                   pd.to_datetime('2018-11-15 23:59:59'))]
[70]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
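
The same selections can often be written more readably with query:

df.query('69 <= weight < 70')  # equivalent to the boolean mask above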

Renaming rows

[71]:
df.rename({i:letter for i,letter in enumerate('abcde')})
[71]:
pid weight height date
a 328 72.654347 203.560866 2018-11-11 14:16:18.148411
b 756 34.027679 189.847316 2018-11-12 14:16:18.148411
c 185 28.501914 158.646074 2018-11-13 14:16:18.148411
d 507 17.396343 180.795993 2018-11-14 14:16:18.148411
e 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[72]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumps']
[73]:
df
[73]:
pid weight height date
the 328 72.654347 203.560866 2018-11-11 14:16:18.148411
quick 756 34.027679 189.847316 2018-11-12 14:16:18.148411
brown 185 28.501914 158.646074 2018-11-13 14:16:18.148411
fox 507 17.396343 180.795993 2018-11-14 14:16:18.148411
jumps 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[74]:
df = df.reset_index(drop=True)
[75]:
df
[75]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Dropping rows

[76]:
df.drop([1,3], axis=0)
[76]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Dropping duplicated data

[77]:
df['something'] = [1,1,None,2,None]
[78]:
df.loc[df.something.duplicated()]
[78]:
pid weight height date something
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN
[79]:
df.drop_duplicates(subset='something')
[79]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0

Dropping missing data

[80]:
df
[80]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN
[81]:
df.something.fillna(0)
[81]:
0    1.0
1    1.0
2    0.0
3    2.0
4    0.0
Name: something, dtype: float64
[82]:
df.something.ffill()
[82]:
0    1.0
1    1.0
2    1.0
3    2.0
4    2.0
Name: something, dtype: float64
[83]:
df.something.bfill()
[83]:
0    1.0
1    1.0
2    2.0
3    2.0
4    NaN
Name: something, dtype: float64
[84]:
df.something.interpolate()
[84]:
0    1.0
1    1.0
2    1.5
3    2.0
4    2.0
Name: something, dtype: float64
[85]:
df.dropna()
[85]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0
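
dropna also takes subset and thresh arguments for finer control (a sketch):

df.dropna(subset=['something'])  # only consider nulls in 'something'
df.dropna(thresh=4)              # keep rows with at least 4 non-null values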

Transforming and Creating Columns

[86]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)
[86]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449
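
assign also accepts callables, which is handy in method chains where the intermediate frame has no name:

df.assign(bmi=lambda d: d['weight'] / (d['height'] / 100) ** 2)
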
[87]:
df['bmi'] = df['weight'] / (df['height']/100)**2
[88]:
df
[88]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449
[89]:
df['something'] = [2,2,None,None,3]
[90]:
df
[90]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449

Sorting Data Frames

Sort on indexes

[91]:
df.sort_index(axis=1)
[91]:
bmi date height pid something weight
0 17.533678 2018-11-11 14:16:18.148411 203.560866 328 2.0 72.654347
1 9.441118 2018-11-12 14:16:18.148411 189.847316 756 2.0 34.027679
2 11.324404 2018-11-13 14:16:18.148411 158.646074 185 NaN 28.501914
3 5.322067 2018-11-14 14:16:18.148411 180.795993 507 NaN 17.396343
4 21.485449 2018-11-15 14:16:18.148411 173.564725 919 3.0 64.724301
[92]:
df.sort_index(axis=0, ascending=False)
[92]:
pid weight height date something bmi
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678

Sort on values

[93]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])
[93]:
pid weight height date something bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067
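
Missing values are placed last by default; na_position controls this:

df.sort_values(by='something', na_position='first')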

Summarizing

Apply an aggregation function

[94]:
df.select_dtypes(include=np.number)
[94]:
weight height something bmi
0 72.654347 203.560866 2.0 17.533678
1 34.027679 189.847316 2.0 9.441118
2 28.501914 158.646074 NaN 11.324404
3 17.396343 180.795993 NaN 5.322067
4 64.724301 173.564725 3.0 21.485449
[95]:
df.select_dtypes(include=np.number).agg(np.sum)
[95]:
weight       217.304584
height       906.414974
something      7.000000
bmi           65.106716
dtype: float64
[96]:
df.agg(['count', np.sum, np.mean])
[96]:
pid weight height date something bmi
count 5.0 5.000000 5.000000 5 3.000000 5.000000
sum NaN 217.304584 906.414974 NaN 7.000000 65.106716
mean NaN 43.460917 181.282995 2018-11-13 14:16:18.148410880 2.333333 13.021343

Split-Apply-Combine

We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with groupby followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.

[97]:
df['treatment'] = list('ababa')
[98]:
df
[98]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
[99]:
grouped = df.groupby('treatment')
[100]:
grouped.get_group('a')
[100]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
[101]:
grouped.mean()
[101]:
weight height something bmi
treatment
a 55.293521 178.590555 2.5 16.781177
b 25.712011 185.321654 2.0 7.381592

Using agg with groupby

[102]:
grouped.agg('mean')
[102]:
weight height something bmi
treatment
a 55.293521 178.590555 2.5 16.781177
b 25.712011 185.321654 2.0 7.381592
[103]:
grouped.agg(['mean', 'std'])
[103]:
weight height something bmi
mean std mean std mean std mean std
treatment
a 55.293521 23.538565 178.590555 22.875289 2.5 0.707107 16.781177 5.122148
b 25.712011 11.760130 185.321654 6.400252 2.0 NaN 7.381592 2.912608
[104]:
grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})
[104]:
weight height bmi
mean std min max <lambda>
treatment
a 55.293521 23.538565 158.646074 203.560866 897.296525
b 25.712011 11.760130 180.795993 189.847316 117.459100
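
Pandas 0.25 and later also supports named aggregation, which gives flat column names instead of a MultiIndex:

grouped.agg(mean_weight=('weight', 'mean'),
            max_height=('height', 'max'))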

Using transform with groupby

[105]:
g_mean = grouped[['weight', 'height']].transform(np.mean)
g_mean
[105]:
weight height
0 55.293521 178.590555
1 25.712011 185.321654
2 55.293521 178.590555
3 25.712011 185.321654
4 55.293521 178.590555
[106]:
g_std = grouped[['weight', 'height']].transform(np.std)
g_std
[106]:
weight height
0 23.538565 22.875289
1 11.760130 6.400252
2 23.538565 22.875289
3 11.760130 6.400252
4 23.538565 22.875289
[107]:
(df[['weight', 'height']] - g_mean)/g_std
[107]:
weight height
0 0.737548 1.091584
1 0.707107 0.707107
2 -1.138201 -0.871879
3 -0.707107 -0.707107
4 0.400652 -0.219706

Combining Data Frames

[108]:
df
[108]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
[109]:
df1 =  df.iloc[3:].copy()
[110]:
df1.drop('something', axis=1, inplace=True)
df1
[110]:
pid weight height date bmi treatment
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 21.485449 a

Adding rows

Note that pandas aligns by column indexes automatically. (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; prefer pd.concat, shown next.)

[111]:
df.append(df1, sort=False)
[111]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449 a
[112]:
pd.concat([df, df1], sort=False)
[112]:
pid weight height date something bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449 a

Adding columns

[113]:
df.pid
[113]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: category
Categories (5, int64): [185, 328, 507, 756, 919]
[114]:
df2 = pd.DataFrame(OrderedDict(pid=[649, 533, 400, 600], age=[23,34,45,56]))
[115]:
df2.pid
[115]:
0    649
1    533
2    400
3    600
Name: pid, dtype: int64
[116]:
df.pid = df.pid.astype('int')
[117]:
pd.merge(df, df2, on='pid', how='inner')
[117]:
pid weight height date something bmi treatment age
[118]:
pd.merge(df, df2, on='pid', how='left')
[118]:
pid weight height date something bmi treatment age
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a NaN
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b NaN
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b NaN
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a NaN
[119]:
pd.merge(df, df2, on='pid', how='right')
[119]:
pid weight height date something bmi treatment age
0 649 NaN NaN NaT NaN NaN NaN 23
1 533 NaN NaN NaT NaN NaN NaN 34
2 400 NaN NaN NaT NaN NaN NaN 45
3 600 NaN NaN NaT NaN NaN NaN 56
[120]:
pd.merge(df, df2, on='pid', how='outer')
[120]:
pid weight height date something bmi treatment age
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678 a NaN
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118 b NaN
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404 a NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067 b NaN
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449 a NaN
5 649 NaN NaN NaT NaN NaN NaN 23.0
6 533 NaN NaN NaT NaN NaN NaN 34.0
7 400 NaN NaN NaT NaN NaN NaN 45.0
8 600 NaN NaN NaT NaN NaN NaN 56.0

Merging on the index

[121]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))
[122]:
df1
[122]:
x
a 1
b 2
c 3
[123]:
df2
[123]:
y
a 4
b 5
c 6
[124]:
df3
[124]:
z
a 7
b 8
c 9
[125]:
df1.join([df2, df3])
[125]:
x y z
a 1 4 7
b 2 5 8
c 3 6 9

Fixing common DataFrame issues

Multiple variables in a column

[126]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df
[126]:
pid_treat
0 A-1
1 B-2
2 C-1
3 D-2
[127]:
df.pid_treat.str.split('-')
[127]:
0    [A, 1]
1    [B, 2]
2    [C, 1]
3    [D, 2]
Name: pid_treat, dtype: object
[128]:
df.pid_treat.str.split('-').apply(pd.Series, index=['pid', 'treat'])
[128]:
pid treat
0 A 1
1 B 2
2 C 1
3 D 2
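
A more direct route is expand=True, which returns a DataFrame in one step:

df.pid_treat.str.split('-', expand=True).rename(columns={0: 'pid', 1: 'treat'})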

Multiple values in a cell

[129]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))
df
[129]:
pid vals
0 a (1, 2, 3)
1 b (4, 5, 6)
2 c (7, 8, 9)
[130]:
df[['t1', 't2', 't3']]  = df.vals.apply(pd.Series)
df
[130]:
pid vals t1 t2 t3
0 a (1, 2, 3) 1 2 3
1 b (4, 5, 6) 4 5 6
2 c (7, 8, 9) 7 8 9
[131]:
df.drop('vals', axis=1, inplace=True)
[132]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)
[132]:
pid vals
0 a 1
1 b 4
2 c 7
3 a 2
4 b 5
5 c 8
6 a 3
7 b 6
8 c 9
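
Pandas 0.25 and later also provides explode for list-like cells; applied to a fresh copy of the frame above (df_orig is an illustrative name), it gives one row per element:

df_orig = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals=[(1,2,3), (4,5,6), (7,8,9)]))
df_orig.explode('vals')  # one row per tuple element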

Reshaping Data Frames

Sometimes we need to make rows into columns or vice versa.

Converting multiple columns into a single column

This is often useful if you need to condition on some variable.

[133]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)
[134]:
iris.head()
[134]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
[135]:
iris.shape
[135]:
(150, 5)
[136]:
df_iris = pd.melt(iris, id_vars='species')
[137]:
df_iris.sample(10)
[137]:
species variable value
463 setosa petal_width 0.1
210 versicolor sepal_width 2.0
70 versicolor sepal_length 5.9
468 setosa petal_width 0.3
57 versicolor sepal_length 4.9
98 versicolor sepal_length 5.1
111 virginica sepal_length 6.4
385 versicolor petal_length 4.5
558 virginica petal_width 1.8
469 setosa petal_width 0.3

Pivoting

Sometimes we need to convert categorical values in a column into separate columns. This is often done at the same time as performing a summary.

[138]:
df_iris.pivot_table(index='variable', columns='species', values='value', aggfunc='mean')
[138]:
species setosa versicolor virginica
variable
petal_length 1.464 4.260 5.552
petal_width 0.244 1.326 2.026
sepal_length 5.006 5.936 6.588
sepal_width 3.418 2.770 2.974

Functional style - apply, applymap and map

apply can be used to apply a custom function

[139]:
scores = pd.DataFrame(
    np.around(np.clip(np.random.normal(90, 10, (5,3)), 0, 100), 1),
    columns = ['math', 'stat', 'biol'],
    index = ['anne', 'bob', 'charles', 'dirk', 'edgar']
)
[140]:
scores
[140]:
math stat biol
anne 93.4 99.5 80.9
bob 67.5 82.9 94.5
charles 93.5 91.5 96.4
dirk 89.4 93.3 84.9
edgar 88.3 92.4 97.0
[141]:
def convert_grade_1(score):
    return np.where(score > 90, 'A',
                    np.where(score > 80, 'B',
                            np.where(score > 70, 'C', 'F')))
[142]:
scores.apply(convert_grade_1)
[142]:
math stat biol
anne A A B
bob F B A
charles A A A
dirk B A B
edgar B A A

The np.where is a little clumsy - here is an alternative.

[143]:
def convert_grade_2(score):
    if score.name == 'math': # math professors are mean
        return np.choose(
            pd.cut(score, [-1, 80, 90, 95, 100], labels=False),
            ['F', 'C', 'B', 'A']
        )
    else:
        return np.choose(
            pd.cut(score, [-1, 70, 80, 90, 100], labels=False),
            ['F', 'C', 'B', 'A']
        )
[144]:
scores.apply(convert_grade_2)
[144]:
math stat biol
anne B A B
bob F B A
charles B A A
dirk C A B
edgar C A A

apply can be used to avoid explicit looping

[145]:
def likely_profession(row):
    if (row.biol > row.math) and (row.biol > row.stat):
        return 'farmer'
    elif (row.math > row.biol) and (row.math > row.stat):
        return 'high school teacher'
    elif (row.stat > row.math) and (row.stat > row.biol):
        return 'actuary'
    else:
        return 'doctor'
[146]:
scores.apply(likely_profession, axis=1)
[146]:
anne       actuary
bob         farmer
charles     farmer
dirk       actuary
edgar       farmer
dtype: object

If all else fails, you can loop over pandas data frames.

  • Be prepared for pitying looks from more snobbish Python coders

Loops are frowned upon because they are not efficient, but sometimes pragmatism beats elegance.

[147]:
for idx, row in scores.iterrows():
    print(f'\nidx = {idx}\nrow = {row.index}: {row.values}\n',
          end='-'*30)

idx = anne
row = Index(['math', 'stat', 'biol'], dtype='object'): [93.4 99.5 80.9]
------------------------------
idx = bob
row = Index(['math', 'stat', 'biol'], dtype='object'): [67.5 82.9 94.5]
------------------------------
idx = charles
row = Index(['math', 'stat', 'biol'], dtype='object'): [93.5 91.5 96.4]
------------------------------
idx = dirk
row = Index(['math', 'stat', 'biol'], dtype='object'): [89.4 93.3 84.9]
------------------------------
idx = edgar
row = Index(['math', 'stat', 'biol'], dtype='object'): [88.3 92.4 97. ]
------------------------------
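
If you must loop, itertuples is usually faster than iterrows and preserves dtypes:

for row in scores.itertuples():
    print(row.Index, row.math, row.stat, row.biol)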

apply can be used for reductions along margins

[148]:
df = pd.DataFrame(np.random.randint(0, 10, (4,5)), columns=list('abcde'), index=list('wxyz'))
[149]:
df
[149]:
a b c d e
w 3 7 0 3 7
x 9 5 5 2 5
y 9 4 5 8 3
z 0 4 9 1 8
[150]:
df.apply(sum, axis=0)
[150]:
a    21
b    20
c    19
d    14
e    23
dtype: int64
[151]:
df.apply(sum, axis=1)
[151]:
w    20
x    26
y    29
z    22
dtype: int64
[152]:
import string
[153]:
char_map = {i: c for i,c in enumerate(string.ascii_uppercase)}
[154]:
df.applymap(lambda x: char_map[x])
[154]:
a b c d e
w D H A D H
x J F F C F
y J E F I D
z A E J B I
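
In pandas 2.1 and later, applymap is deprecated in favor of the element-wise DataFrame.map, which does the same thing:

df.map(lambda x: char_map[x])  # equivalent in pandas >= 2.1
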
[155]:
df.assign(b_map = df.b.map(char_map))
[155]:
a b c d e b_map
w 3 7 0 3 7 H
x 9 5 5 2 5 F
y 9 4 5 8 3 E
z 0 4 9 1 8 E

Chaining commands

You will sometimes see this functional style of method chaining, which avoids the need for temporary intermediate variables.

[156]:
(
    iris.
    sample(frac=0.2).
    filter(regex='s.*').
    assign(both=iris.sepal_length + iris.petal_length).
    query('both > 2').
    groupby('species').agg(['mean', 'sum']).
    pipe(lambda x: np.around(x, 1))
)
[156]:
sepal_length sepal_width both
mean sum mean sum mean sum
species
setosa 5.1 50.7 3.5 35.1 6.6 65.7
versicolor 5.8 52.2 2.8 24.8 10.1 90.9
virginica 7.0 77.0 3.1 33.8 12.7 140.2

Moving between R and Python in Jupyter

[157]:
%load_ext rpy2.ipython
[158]:
import warnings
warnings.simplefilter('ignore', FutureWarning)
[159]:
iris = %R iris
[160]:
iris.head()
[160]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
[161]:
iris_py = iris.copy()
iris_py.Species = iris_py.Species.str.upper()
[162]:
%%R -i iris_py -o iris_r

iris_r <- iris_py[1:3,]
[163]:
iris_r
[163]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 SETOSA
2 4.9 3.0 1.4 0.2 SETOSA
3 4.7 3.2 1.3 0.2 SETOSA
[164]:
! python3 -m pip install --quiet watermark
[165]:
%load_ext watermark
[166]:
%watermark -v -iv
pendulum 2.1.2
pandas   1.1.1
numpy    1.18.5
CPython 3.8.5
IPython 7.17.0