Data Science: Data processing¶
Typical packages: pandas, plotnine, plotly, streamlit
References
Introduction to pandas¶
[1]:
import numpy as np
import pandas as pd
Series and Data Frames¶
Series objects¶
A Series is like a vector. All elements must have the same type, or be null.
[2]:
s = pd.Series([1,1,2,3] + [None])
s
[2]:
0 1.0
1 1.0
2 2.0
3 3.0
4 NaN
dtype: float64
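Note that the None forced a cast to float64. If you want to keep an integer dtype in the presence of missing values, pandas also provides the nullable Int64 extension dtype — a minimal sketch:

```python
import pandas as pd

# Default behavior: None forces a cast to float64.
s_float = pd.Series([1, 1, 2, 3, None])

# The nullable "Int64" extension dtype keeps integers and stores pd.NA.
s_int = pd.Series([1, 1, 2, 3, None], dtype="Int64")
print(s_float.dtype, s_int.dtype)
```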
Special types of series¶
Strings¶
[5]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1
[5]:
0 the quick
1 quick brown
2 brown fox
3 fox jumps
4 jumps over
5 over the
6 the lazy
7 lazy dog
dtype: object
[6]:
s1.str.upper()
[6]:
0 THE QUICK
1 QUICK BROWN
2 BROWN FOX
3 FOX JUMPS
4 JUMPS OVER
5 OVER THE
6 THE LAZY
7 LAZY DOG
dtype: object
[7]:
s1.str.split()
[7]:
0 [the, quick]
1 [quick, brown]
2 [brown, fox]
3 [fox, jumps]
4 [jumps, over]
5 [over, the]
6 [the, lazy]
7 [lazy, dog]
dtype: object
[8]:
s1.str.split().str[1]
[8]:
0 quick
1 brown
2 fox
3 jumps
4 over
5 the
6 lazy
7 dog
dtype: object
Categories¶
[9]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2
[9]:
0 Asian
1 Asian
2 White
3 Black
4 White
5 Hispanic
dtype: object
[10]:
s2 = s2.astype('category')
s2
[10]:
0 Asian
1 Asian
2 White
3 Black
4 White
5 Hispanic
dtype: category
Categories (4, object): ['Asian', 'Black', 'Hispanic', 'White']
[11]:
s2.cat.categories
[11]:
Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')
[12]:
s2.cat.codes
[12]:
0 0
1 0
2 3
3 1
4 3
5 2
dtype: int8
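Categories can also carry an ordering, which enables comparisons and meaningful sorting. A small sketch with a hypothetical size variable:

```python
import pandas as pd

# An ordered categorical supports comparisons between levels.
sizes = pd.Series(['small', 'large', 'medium', 'small']).astype(
    pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True))
print(sizes > 'small')
```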
Dates and times¶
Datetimes are often useful as indices to a time series.
[13]:
import pendulum
[14]:
d = pendulum.today()
[15]:
d.to_date_string()
[15]:
'2020-11-11'
[16]:
k = 18
s3 = pd.Series(range(k),
index=pd.date_range(d.to_date_string(),
periods=k,
freq='M'))
[17]:
s3
[17]:
2020-11-30 0
2020-12-31 1
2021-01-31 2
2021-02-28 3
2021-03-31 4
2021-04-30 5
2021-05-31 6
2021-06-30 7
2021-07-31 8
2021-08-31 9
2021-09-30 10
2021-10-31 11
2021-11-30 12
2021-12-31 13
2022-01-31 14
2022-02-28 15
2022-03-31 16
2022-04-30 17
Freq: M, dtype: int64
[18]:
s3['2021']
[18]:
2021-01-31 2
2021-02-28 3
2021-03-31 4
2021-04-30 5
2021-05-31 6
2021-06-30 7
2021-07-31 8
2021-08-31 9
2021-09-30 10
2021-10-31 11
2021-11-30 12
2021-12-31 13
Freq: M, dtype: int64
[19]:
s3['2021-01':'2021-06']
[19]:
2021-01-31 2
2021-02-28 3
2021-03-31 4
2021-04-30 5
2021-05-31 6
2021-06-30 7
Freq: M, dtype: int64
When datetimes are the values of a series (rather than the index), you need the dt accessor method.
[20]:
s4 = s3.index.to_series()
[21]:
s4.dt.day_name()
[21]:
2020-11-30 Monday
2020-12-31 Thursday
2021-01-31 Sunday
2021-02-28 Sunday
2021-03-31 Wednesday
2021-04-30 Friday
2021-05-31 Monday
2021-06-30 Wednesday
2021-07-31 Saturday
2021-08-31 Tuesday
2021-09-30 Thursday
2021-10-31 Sunday
2021-11-30 Tuesday
2021-12-31 Friday
2022-01-31 Monday
2022-02-28 Monday
2022-03-31 Thursday
2022-04-30 Saturday
Freq: M, dtype: object
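The pendulum import above is a convenience, not a requirement; pandas has its own datetime tools. A minimal sketch using the same dates:

```python
import pandas as pd

# pd.to_datetime builds a DatetimeIndex directly.
idx = pd.to_datetime(['2020-11-30', '2020-12-31', '2021-01-31'])
s = pd.Series(range(3), index=idx)

# Partial-string indexing and day_name work the same way as above.
print(s.index.day_name().tolist())
```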
DataFrame objects¶
A DataFrame is like a matrix. Columns in a DataFrame are Series.
Each column in a DataFrame represents a variable
Each row in a DataFrame represents an observation
Each cell in a DataFrame represents a value
[22]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df
[22]:
num | |
---|---|
0 | 1.0 |
1 | 2.0 |
2 | 3.0 |
3 | NaN |
[23]:
df.num
[23]:
0 1.0
1 2.0
2 3.0
3 NaN
Name: num, dtype: float64
Index¶
Row and column identifiers are of Index type.
Somewhat confusingly, index is also a synonym for the row identifiers.
[24]:
df.index
[24]:
RangeIndex(start=0, stop=4, step=1)
Setting a column as the row index¶
[25]:
df
[25]:
num | |
---|---|
0 | 1.0 |
1 | 2.0 |
2 | 3.0 |
3 | NaN |
[26]:
df1 = df.set_index('num')
df1
[26]:
num |
---|
1.0 |
2.0 |
3.0 |
NaN |
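set_index is more useful when other columns remain, and reset_index reverses it. A sketch with hypothetical pid and weight columns:

```python
import pandas as pd

df = pd.DataFrame(dict(pid=[101, 102, 103], weight=[70.0, 65.5, 80.2]))
df_indexed = df.set_index('pid')    # pid becomes the row index
df_back = df_indexed.reset_index()  # reset_index restores it as a column
print(df_indexed.loc[102, 'weight'])
```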
Columns¶
This is just a different Index object.
[28]:
df.columns
[28]:
Index(['num'], dtype='object')
Getting raw values¶
Sometimes you just want a numpy array, and not a pandas object.
[29]:
df.values
[29]:
array([[ 1.],
[ 2.],
[ 3.],
[nan]])
Creating Data Frames¶
Manual¶
[30]:
from collections import OrderedDict
[31]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='d')
df = pd.DataFrame(OrderedDict(pid=np.random.randint(100, 999, n),
weight=np.random.normal(70, 20, n),
height=np.random.normal(170, 15, n),
date=dates,
))
df
[31]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 730 | 88.630097 | 195.922737 | 2020-11-11 19:25:24.062141 |
1 | 112 | 53.958069 | 169.022381 | 2020-11-12 19:25:24.062141 |
2 | 740 | 89.709304 | 180.327636 | 2020-11-13 19:25:24.062141 |
3 | 485 | 52.879337 | 142.806223 | 2020-11-14 19:25:24.062141 |
4 | 905 | 80.488371 | 168.384792 | 2020-11-15 19:25:24.062141 |
From file¶
You can read in data from many different file types - plain text, JSON, spreadsheets, databases etc. Functions to read in data look like read_X where X is the data type.
[32]:
%%file measures.txt
pid weight height date
328 72.654347 203.560866 2018-11-11 14:16:18.148411
756 34.027679 189.847316 2018-11-12 14:16:18.148411
185 28.501914 158.646074 2018-11-13 14:16:18.148411
507 17.396343 180.795993 2018-11-14 14:16:18.148411
919 64.724301 173.564725 2018-11-15 14:16:18.148411
Overwriting measures.txt
[33]:
df = pd.read_table('measures.txt')
df
[33]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
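Note that the date column is read in as plain strings. The read_X functions take arguments such as parse_dates and dtype to convert columns at read time — a sketch using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

data = io.StringIO("pid\tweight\tdate\n"
                   "328\t72.7\t2018-11-11\n"
                   "756\t34.0\t2018-11-12\n")

# parse_dates converts the column while reading; dtype fixes pid's type.
df = pd.read_table(data, parse_dates=['date'], dtype={'pid': 'int64'})
print(df.dtypes)
```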
Indexing Data Frames¶
Implicit defaults¶
If you provide a slice, it is assumed that you are asking for rows.
[34]:
df[1:3]
[34]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
If you provide a single value or list, it is assumed that you are asking for columns.
[35]:
df[['pid', 'weight']]
[35]:
pid | weight | |
---|---|---|
0 | 328 | 72.654347 |
1 | 756 | 34.027679 |
2 | 185 | 28.501914 |
3 | 507 | 17.396343 |
4 | 919 | 64.724301 |
Extracting a column¶
Dictionary style access¶
[36]:
df['pid']
[36]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: int64
Property style access¶
This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes, or keywords)
[37]:
df.pid
[37]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: int64
Indexing by location¶
This is similar to numpy indexing.
[38]:
df.iloc[1:3, :]
[38]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
[39]:
df.iloc[1:3, 1:4:2]
[39]:
weight | date | |
---|---|---|
1 | 34.027679 | 2018-11-12 14:16:18.148411 |
2 | 28.501914 | 2018-11-13 14:16:18.148411 |
Indexing by name¶
[40]:
df.loc[1:3, 'weight':'height']
[40]:
weight | height | |
---|---|---|
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
Warning: When using loc, the row slice indicates row names, not positions.
[41]:
df1 = df.copy()
df1.index = df.index + 1
df1
[41]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
2 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
3 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
4 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
5 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[42]:
df1.loc[1:3, 'weight':'height']
[42]:
weight | height | |
---|---|---|
1 | 72.654347 | 203.560866 |
2 | 34.027679 | 189.847316 |
3 | 28.501914 | 158.646074 |
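The contrast is easiest to see side by side on a shifted index — a minimal sketch with a hypothetical column x:

```python
import pandas as pd

df = pd.DataFrame(dict(x=[10, 20, 30]), index=[1, 2, 3])

by_label = df.loc[1:2, 'x']    # labels 1..2, inclusive of the stop
by_position = df.iloc[1:2, 0]  # positions 1..2, exclusive of the stop
print(by_label.tolist(), by_position.tolist())
```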
Structure of a Data Frame¶
Converting data types¶
Using astype on one column¶
[44]:
df.pid = df.pid.astype('category')
Using astype on multiple columns¶
[45]:
df = df.astype(dict(weight=float, height=float))
Using a conversion function¶
[46]:
df.date = pd.to_datetime(df.date)
Check¶
[47]:
df.dtypes
[47]:
pid category
weight float64
height float64
date datetime64[ns]
dtype: object
Basic properties¶
[48]:
df.size
[48]:
20
[49]:
df.shape
[49]:
(5, 4)
[50]:
df.describe()
[50]:
weight | height | |
---|---|---|
count | 5.000000 | 5.000000 |
mean | 43.460917 | 181.282995 |
std | 23.960945 | 16.895933 |
min | 17.396343 | 158.646074 |
25% | 28.501914 | 173.564725 |
50% | 34.027679 | 180.795993 |
75% | 64.724301 | 189.847316 |
max | 72.654347 | 203.560866 |
[51]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pid 5 non-null category
1 weight 5 non-null float64
2 height 5 non-null float64
3 date 5 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(2)
memory usage: 453.0 bytes
Inspection¶
[52]:
df.head(n=3)
[52]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
[53]:
df.tail(n=3)
[53]:
pid | weight | height | date | |
---|---|---|---|---|
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[54]:
df.sample(n=3)
[54]:
pid | weight | height | date | |
---|---|---|---|---|
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
[55]:
df.sample(frac=0.5)
[55]:
pid | weight | height | date | |
---|---|---|---|---|
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
Selecting, Renaming and Removing Columns¶
Selecting columns¶
[56]:
df.filter(items=['pid', 'date'])
[56]:
pid | date | |
---|---|---|
0 | 328 | 2018-11-11 14:16:18.148411 |
1 | 756 | 2018-11-12 14:16:18.148411 |
2 | 185 | 2018-11-13 14:16:18.148411 |
3 | 507 | 2018-11-14 14:16:18.148411 |
4 | 919 | 2018-11-15 14:16:18.148411 |
[57]:
df.filter(regex='.*ght')
[57]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
Note that you can also use regular string methods on the columns¶
[58]:
df.loc[:, df.columns.str.contains('d')]
[58]:
pid | date | |
---|---|---|
0 | 328 | 2018-11-11 14:16:18.148411 |
1 | 756 | 2018-11-12 14:16:18.148411 |
2 | 185 | 2018-11-13 14:16:18.148411 |
3 | 507 | 2018-11-14 14:16:18.148411 |
4 | 919 | 2018-11-15 14:16:18.148411 |
Renaming columns¶
[59]:
df.rename(dict(weight='w', height='h'), axis=1)
[59]:
pid | w | h | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[60]:
orig_cols = df.columns
[61]:
df.columns = list('abcd')
[62]:
df
[62]:
a | b | c | d | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[63]:
df.columns = orig_cols
[64]:
df
[64]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Removing columns¶
[65]:
df.drop(['pid', 'date'], axis=1)
[65]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
[66]:
df.drop(columns=['pid', 'date'])
[66]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
[67]:
df.drop(columns=df.columns[df.columns.str.contains('d')])
[67]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
Selecting, Renaming and Removing Rows¶
Selecting rows¶
[68]:
df[df.weight.between(60,70)]
[68]:
pid | weight | height | date | |
---|---|---|---|---|
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[69]:
df[(69 <= df.weight) & (df.weight < 70)]
[69]:
pid | weight | height | date |
---|
[70]:
df[df.date.between(pd.to_datetime('2018-11-13'),
pd.to_datetime('2018-11-15 23:59:59'))]
[70]:
pid | weight | height | date | |
---|---|---|---|---|
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
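Boolean masks like the ones above can also be written as expression strings with query, which some find more readable — a sketch on a small frame:

```python
import pandas as pd

df = pd.DataFrame(dict(weight=[72.7, 34.0, 64.7],
                       height=[203.6, 189.8, 173.6]))

# query evaluates a boolean expression over the column names.
subset = df.query('60 <= weight < 70')
print(subset)
```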
Renaming rows¶
[71]:
df.rename({i:letter for i,letter in enumerate('abcde')})
[71]:
pid | weight | height | date | |
---|---|---|---|---|
a | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
b | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
c | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
d | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
e | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[72]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumphs']
[73]:
df
[73]:
pid | weight | height | date | |
---|---|---|---|---|
the | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
quick | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
brown | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
fox | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
jumphs | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[74]:
df = df.reset_index(drop=True)
[75]:
df
[75]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Dropping rows¶
[76]:
df.drop([1,3], axis=0)
[76]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Dropping duplicated data¶
[77]:
df['something'] = [1,1,None,2,None]
[78]:
df.loc[df.something.duplicated()]
[78]:
pid | weight | height | date | something | |
---|---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN |
[79]:
df.drop_duplicates(subset='something')
[79]:
pid | weight | height | date | something | |
---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 |
Dropping missing data¶
[80]:
df
[80]:
pid | weight | height | date | something | |
---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN |
[81]:
df.something.fillna(0)
[81]:
0 1.0
1 1.0
2 0.0
3 2.0
4 0.0
Name: something, dtype: float64
[82]:
df.something.ffill()
[82]:
0 1.0
1 1.0
2 1.0
3 2.0
4 2.0
Name: something, dtype: float64
[83]:
df.something.bfill()
[83]:
0 1.0
1 1.0
2 2.0
3 2.0
4 NaN
Name: something, dtype: float64
[84]:
df.something.interpolate()
[84]:
0 1.0
1 1.0
2 1.5
3 2.0
4 2.0
Name: something, dtype: float64
[85]:
df.dropna()
[85]:
pid | weight | height | date | something | |
---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 |
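fillna also accepts a per-column dict, and dropna can restrict its check to a subset of columns — a minimal sketch on a small frame with hypothetical columns a and b:

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1.0, None, 3.0], b=[None, 5.0, 6.0]))

# One fill value per column via a dict.
filled = df.fillna({'a': 0.0, 'b': -1.0})

# dropna(subset=...) only considers the listed columns.
kept = df.dropna(subset=['a'])
print(filled)
```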
Transforming and Creating Columns¶
[86]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)
[86]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | 21.485449 |
[87]:
df['bmi'] = df['weight'] / (df['height']/100)**2
[88]:
df
[88]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | 21.485449 |
[89]:
df['something'] = [2,2,None,None,3]
[90]:
df
[90]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 |
Sorting Data Frames¶
Sort on indexes¶
[91]:
df.sort_index(axis=1)
[91]:
bmi | date | height | pid | something | weight | |
---|---|---|---|---|---|---|
0 | 17.533678 | 2018-11-11 14:16:18.148411 | 203.560866 | 328 | 2.0 | 72.654347 |
1 | 9.441118 | 2018-11-12 14:16:18.148411 | 189.847316 | 756 | 2.0 | 34.027679 |
2 | 11.324404 | 2018-11-13 14:16:18.148411 | 158.646074 | 185 | NaN | 28.501914 |
3 | 5.322067 | 2018-11-14 14:16:18.148411 | 180.795993 | 507 | NaN | 17.396343 |
4 | 21.485449 | 2018-11-15 14:16:18.148411 | 173.564725 | 919 | 3.0 | 64.724301 |
[92]:
df.sort_index(axis=0, ascending=False)
[92]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 |
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 |
Sort on values¶
[93]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])
[93]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 |
Summarizing¶
Apply an aggregation function¶
[94]:
df.select_dtypes(include=np.number)
[94]:
weight | height | something | bmi | |
---|---|---|---|---|
0 | 72.654347 | 203.560866 | 2.0 | 17.533678 |
1 | 34.027679 | 189.847316 | 2.0 | 9.441118 |
2 | 28.501914 | 158.646074 | NaN | 11.324404 |
3 | 17.396343 | 180.795993 | NaN | 5.322067 |
4 | 64.724301 | 173.564725 | 3.0 | 21.485449 |
[95]:
df.select_dtypes(include=np.number).agg(np.sum)
[95]:
weight 217.304584
height 906.414974
something 7.000000
bmi 65.106716
dtype: float64
[96]:
df.agg(['count', np.sum, np.mean])
[96]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
count | 5.0 | 5.000000 | 5.000000 | 5 | 3.000000 | 5.000000 |
sum | NaN | 217.304584 | 906.414974 | NaN | 7.000000 | 65.106716 |
mean | NaN | 43.460917 | 181.282995 | 2018-11-13 14:16:18.148410880 | 2.333333 | 13.021343 |
Split-Apply-Combine¶
We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with groupby
followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.
[97]:
df['treatment'] = list('ababa')
[98]:
df
[98]:
pid | weight | height | date | something | bmi | treatment | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a |
[99]:
grouped = df.groupby('treatment')
[100]:
grouped.get_group('a')
[100]:
pid | weight | height | date | something | bmi | treatment | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a |
[101]:
grouped.mean()
[101]:
weight | height | something | bmi | |
---|---|---|---|---|
treatment | ||||
a | 55.293521 | 178.590555 | 2.5 | 16.781177 |
b | 25.712011 | 185.321654 | 2.0 | 7.381592 |
Using agg with groupby¶
[102]:
grouped.agg('mean')
[102]:
weight | height | something | bmi | |
---|---|---|---|---|
treatment | ||||
a | 55.293521 | 178.590555 | 2.5 | 16.781177 |
b | 25.712011 | 185.321654 | 2.0 | 7.381592 |
[103]:
grouped.agg(['mean', 'std'])
[103]:
weight | height | something | bmi | |||||
---|---|---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | mean | std | |
treatment | ||||||||
a | 55.293521 | 23.538565 | 178.590555 | 22.875289 | 2.5 | 0.707107 | 16.781177 | 5.122148 |
b | 25.712011 | 11.760130 | 185.321654 | 6.400252 | 2.0 | NaN | 7.381592 | 2.912608 |
[104]:
grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})
[104]:
weight | height | bmi | |||
---|---|---|---|---|---|
mean | std | min | max | <lambda> | |
treatment | |||||
a | 55.293521 | 23.538565 | 158.646074 | 203.560866 | 897.296525 |
b | 25.712011 | 11.760130 | 180.795993 | 189.847316 | 117.459100 |
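Newer pandas versions also support named aggregation, which yields flat (non-hierarchical) column names — a sketch with a hypothetical treatment/weight frame:

```python
import pandas as pd

df = pd.DataFrame(dict(treatment=list('abab'),
                       weight=[70.0, 34.0, 28.5, 17.4]))

# Named aggregation: output_column=(input_column, function)
summary = df.groupby('treatment').agg(
    mean_wt=('weight', 'mean'),
    max_wt=('weight', 'max'),
)
print(summary)
```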
Using transform with groupby¶
[105]:
g_mean = grouped[['weight', 'height']].transform(np.mean)
g_mean
[105]:
weight | height | |
---|---|---|
0 | 55.293521 | 178.590555 |
1 | 25.712011 | 185.321654 |
2 | 55.293521 | 178.590555 |
3 | 25.712011 | 185.321654 |
4 | 55.293521 | 178.590555 |
[106]:
g_std = grouped[['weight', 'height']].transform(np.std)
g_std
[106]:
weight | height | |
---|---|---|
0 | 23.538565 | 22.875289 |
1 | 11.760130 | 6.400252 |
2 | 23.538565 | 22.875289 |
3 | 11.760130 | 6.400252 |
4 | 23.538565 | 22.875289 |
[107]:
(df[['weight', 'height']] - g_mean)/g_std
[107]:
weight | height | |
---|---|---|
0 | 0.737548 | 1.091584 |
1 | 0.707107 | 0.707107 |
2 | -1.138201 | -0.871879 |
3 | -0.707107 | -0.707107 |
4 | 0.400652 | -0.219706 |
Combining Data Frames¶
[108]:
df
[108]:
pid | weight | height | date | something | bmi | treatment | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a |
[109]:
df1 = df.iloc[3:].copy()
[110]:
df1.drop('something', axis=1, inplace=True)
df1
[110]:
pid | weight | height | date | bmi | treatment | |
---|---|---|---|---|---|---|
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 21.485449 | a |
Adding rows¶
Note that pandas aligns by column indexes automatically. (DataFrame.append is deprecated in recent versions of pandas; prefer pd.concat.)
[111]:
df.append(df1, sort=False)
[111]:
pid | weight | height | date | something | bmi | treatment | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | 21.485449 | a |
[112]:
pd.concat([df, df1], sort=False)
[112]:
pid | weight | height | date | something | bmi | treatment | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | 21.485449 | a |
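The duplicated row labels above (3 and 4 appear twice) can be avoided with ignore_index, or tracked with keys — a minimal sketch:

```python
import pandas as pd

d1 = pd.DataFrame(dict(x=[1, 2]))
d2 = pd.DataFrame(dict(x=[3, 4]))

flat = pd.concat([d1, d2], ignore_index=True)          # fresh 0..n-1 index
keyed = pd.concat([d1, d2], keys=['first', 'second'])  # hierarchical index
print(flat.index.tolist())
```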
Adding columns¶
[113]:
df.pid
[113]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: category
Categories (5, int64): [185, 328, 507, 756, 919]
[114]:
df2 = pd.DataFrame(OrderedDict(pid=[649, 533, 400, 600], age=[23,34,45,56]))
[115]:
df2.pid
[115]:
0 649
1 533
2 400
3 600
Name: pid, dtype: int64
[116]:
df.pid = df.pid.astype('int')
[117]:
pd.merge(df, df2, on='pid', how='inner')
[117]:
pid | weight | height | date | something | bmi | treatment | age |
---|
[118]:
pd.merge(df, df2, on='pid', how='left')
[118]:
pid | weight | height | date | something | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a | NaN |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b | NaN |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b | NaN |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a | NaN |
[119]:
pd.merge(df, df2, on='pid', how='right')
[119]:
pid | weight | height | date | something | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|
0 | 649 | NaN | NaN | NaT | NaN | NaN | NaN | 23 |
1 | 533 | NaN | NaN | NaT | NaN | NaN | NaN | 34 |
2 | 400 | NaN | NaN | NaT | NaN | NaN | NaN | 45 |
3 | 600 | NaN | NaN | NaT | NaN | NaN | NaN | 56 |
[120]:
pd.merge(df, df2, on='pid', how='outer')
[120]:
pid | weight | height | date | something | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a | NaN |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b | NaN |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b | NaN |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a | NaN |
5 | 649 | NaN | NaN | NaT | NaN | NaN | NaN | 23.0 |
6 | 533 | NaN | NaN | NaT | NaN | NaN | NaN | 34.0 |
7 | 400 | NaN | NaN | NaT | NaN | NaN | NaN | 45.0 |
8 | 600 | NaN | NaN | NaT | NaN | NaN | NaN | 56.0 |
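When debugging a merge, the indicator flag records which side each row came from — a sketch on two tiny frames:

```python
import pandas as pd

left = pd.DataFrame(dict(pid=[1, 2], x=['a', 'b']))
right = pd.DataFrame(dict(pid=[2, 3], y=['c', 'd']))

# indicator=True adds a _merge column: left_only, right_only, or both.
merged = pd.merge(left, right, on='pid', how='outer', indicator=True)
print(merged['_merge'].tolist())
```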
Merging on the index¶
[121]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))
[122]:
df1
[122]:
x | |
---|---|
a | 1 |
b | 2 |
c | 3 |
[123]:
df2
[123]:
y | |
---|---|
a | 4 |
b | 5 |
c | 6 |
[124]:
df3
[124]:
z | |
---|---|
a | 7 |
b | 8 |
c | 9 |
[125]:
df1.join([df2, df3])
[125]:
x | y | z | |
---|---|---|---|
a | 1 | 4 | 7 |
b | 2 | 5 | 8 |
c | 3 | 6 | 9 |
Fixing common DataFrame issues¶
Multiple variables in a column¶
[126]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df
[126]:
pid_treat | |
---|---|
0 | A-1 |
1 | B-2 |
2 | C-1 |
3 | D-2 |
[127]:
df.pid_treat.str.split('-')
[127]:
0 [A, 1]
1 [B, 2]
2 [C, 1]
3 [D, 2]
Name: pid_treat, dtype: object
[128]:
df.pid_treat.str.split('-').apply(pd.Series, index=['pid', 'treat'])
[128]:
pid | treat | |
---|---|---|
0 | A | 1 |
1 | B | 2 |
2 | C | 1 |
3 | D | 2 |
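The split-then-apply pattern above can be shortened: str.split with expand=True returns a DataFrame directly. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame(dict(pid_treat=['A-1', 'B-2']))

# expand=True returns a DataFrame, avoiding the apply(pd.Series) step.
parts = df.pid_treat.str.split('-', expand=True)
parts.columns = ['pid', 'treat']
print(parts)
```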
Multiple values in a cell¶
[129]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))
df
[129]:
pid | vals | |
---|---|---|
0 | a | (1, 2, 3) |
1 | b | (4, 5, 6) |
2 | c | (7, 8, 9) |
[130]:
df[['t1', 't2', 't3']] = df.vals.apply(pd.Series)
df
[130]:
pid | vals | t1 | t2 | t3 | |
---|---|---|---|---|---|
0 | a | (1, 2, 3) | 1 | 2 | 3 |
1 | b | (4, 5, 6) | 4 | 5 | 6 |
2 | c | (7, 8, 9) | 7 | 8 | 9 |
[131]:
df.drop('vals', axis=1, inplace=True)
[132]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)
[132]:
pid | vals | |
---|---|---|
0 | a | 1 |
1 | b | 4 |
2 | c | 7 |
3 | a | 2 |
4 | b | 5 |
5 | c | 8 |
6 | a | 3 |
7 | b | 6 |
8 | c | 9 |
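In pandas 0.25 and later, the unpack-then-melt dance can be done in one step with explode — a sketch on the same kind of frame:

```python
import pandas as pd

df = pd.DataFrame(dict(pid=['a', 'b'], vals=[(1, 2), (3, 4)]))

# explode turns each element of a list-like cell into its own row,
# repeating the other columns.
long = df.explode('vals')
print(long)
```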
Reshaping Data Frames¶
Sometimes we need to make rows into columns or vice versa.
Converting multiple columns into a single column¶
This is often useful if you need to condition on some variable.
[133]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)
[134]:
iris.head()
[134]:
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
[135]:
iris.shape
[135]:
(150, 5)
[136]:
df_iris = pd.melt(iris, id_vars='species')
[137]:
df_iris.sample(10)
[137]:
species | variable | value | |
---|---|---|---|
463 | setosa | petal_width | 0.1 |
210 | versicolor | sepal_width | 2.0 |
70 | versicolor | sepal_length | 5.9 |
468 | setosa | petal_width | 0.3 |
57 | versicolor | sepal_length | 4.9 |
98 | versicolor | sepal_length | 5.1 |
111 | virginica | sepal_length | 6.4 |
385 | versicolor | petal_length | 4.5 |
558 | virginica | petal_width | 1.8 |
469 | setosa | petal_width | 0.3 |
Pivoting¶
Sometimes we need to convert categorical values in a column into separate columns. This is often done at the same time as performing a summary.
[138]:
df_iris.pivot_table(index='variable', columns='species', values='value', aggfunc='mean')
[138]:
species | setosa | versicolor | virginica |
---|---|---|---|
variable | |||
petal_length | 1.464 | 4.260 | 5.552 |
petal_width | 0.244 | 1.326 | 2.026 |
sepal_length | 5.006 | 5.936 | 6.588 |
sepal_width | 3.418 | 2.770 | 2.974 |
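pivot_table is essentially groupby plus unstack with an aggregation built in, and seeing the longer spelling can make the reshaping clearer. A sketch on a small hand-made long frame (the values here are made up for illustration, not taken from iris):

```python
import pandas as pd

df_long = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'variable': ['petal_length'] * 4,
    'value': [1.4, 1.6, 5.5, 5.7],
})

# group on both keys, take means, then move species from the index into columns
wide = df_long.groupby(['variable', 'species'])['value'].mean().unstack('species')
```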
Functional style - apply, applymap and map¶
apply can be used to apply a custom function.
[139]:
scores = pd.DataFrame(
np.around(np.clip(np.random.normal(90, 10, (5,3)), 0, 100), 1),
columns = ['math', 'stat', 'biol'],
index = ['anne', 'bob', 'charles', 'dirk', 'edgar']
)
[140]:
scores
[140]:
math | stat | biol | |
---|---|---|---|
anne | 93.4 | 99.5 | 80.9 |
bob | 67.5 | 82.9 | 94.5 |
charles | 93.5 | 91.5 | 96.4 |
dirk | 89.4 | 93.3 | 84.9 |
edgar | 88.3 | 92.4 | 97.0 |
[141]:
def convert_grade_1(score):
return np.where(score > 90, 'A',
np.where(score > 80, 'B',
np.where(score > 70, 'C', 'F')))
[142]:
scores.apply(convert_grade_1)
[142]:
math | stat | biol | |
---|---|---|---|
anne | A | A | B |
bob | F | B | A |
charles | A | A | A |
dirk | B | A | B |
edgar | B | A | A |
The np.where is a little clumsy - here is an alternative.
[143]:
def convert_grade_2(score):
if score.name == 'math': # math professors are mean
return np.choose(
pd.cut(score, [-1, 80, 90, 95, 100], labels=False),
['F', 'C', 'B', 'A']
)
else:
return np.choose(
pd.cut(score, [-1, 70, 80, 90, 100], labels=False),
['F', 'C', 'B', 'A']
)
[144]:
scores.apply(convert_grade_2)
[144]:
math | stat | biol | |
---|---|---|---|
anne | B | A | B |
bob | F | B | A |
charles | B | A | A |
dirk | C | A | B |
edgar | C | A | A |
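pd.cut can also return the grade labels directly via its labels argument, which removes the need for np.choose entirely. A sketch of the non-math branch, assuming the same bin edges as above:

```python
import pandas as pd

score = pd.Series([93.4, 67.5, 93.5, 89.4, 88.3])

# pass the grade labels straight to cut instead of mapping bin codes afterwards
grades = pd.cut(score, [-1, 70, 80, 90, 100], labels=['F', 'C', 'B', 'A'])
```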
apply can be used to avoid explicit looping.
[145]:
def likely_profession(row):
if (row.biol > row.math) and (row.biol > row.stat):
return 'farmer'
elif (row.math > row.biol) and (row.math > row.stat):
return 'high school teacher'
elif (row.stat > row.math) and (row.stat > row.biol):
return 'actuary'
else:
return 'doctor'
[146]:
scores.apply(likely_profession, axis=1)
[146]:
anne actuary
bob farmer
charles farmer
dirk actuary
edgar farmer
dtype: object
If all else fails, you can loop over pandas data frames. Be prepared for pitying looks from more snobbish Python coders. Loops are frowned upon because they are inefficient, but sometimes pragmatism beats elegance.
[147]:
for idx, row in scores.iterrows():
print(f'\nidx = {idx}\nrow = {row.index}: {row.values}\n',
end='-'*30)
idx = anne
row = Index(['math', 'stat', 'biol'], dtype='object'): [93.4 99.5 80.9]
------------------------------
idx = bob
row = Index(['math', 'stat', 'biol'], dtype='object'): [67.5 82.9 94.5]
------------------------------
idx = charles
row = Index(['math', 'stat', 'biol'], dtype='object'): [93.5 91.5 96.4]
------------------------------
idx = dirk
row = Index(['math', 'stat', 'biol'], dtype='object'): [89.4 93.3 84.9]
------------------------------
idx = edgar
row = Index(['math', 'stat', 'biol'], dtype='object'): [88.3 92.4 97. ]
------------------------------
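If you do loop, itertuples is generally faster than iterrows and yields namedtuples with attribute access (the row label appears as the Index field). A sketch over a small frame shaped like scores, with made-up values:

```python
import pandas as pd

scores = pd.DataFrame(
    {'math': [93.4, 67.5], 'stat': [99.5, 82.9], 'biol': [80.9, 94.5]},
    index=['anne', 'bob'],
)

# one namedtuple per row; row.Index is the row label, columns become attributes
best = {row.Index: max(row.math, row.stat, row.biol) for row in scores.itertuples()}
```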
apply can be used for reductions along margins.
[148]:
df = pd.DataFrame(np.random.randint(0, 10, (4,5)), columns=list('abcde'), index=list('wxyz'))
[149]:
df
[149]:
a | b | c | d | e | |
---|---|---|---|---|---|
w | 3 | 7 | 0 | 3 | 7 |
x | 9 | 5 | 5 | 2 | 5 |
y | 9 | 4 | 5 | 8 | 3 |
z | 0 | 4 | 9 | 1 | 8 |
[150]:
df.apply(sum, axis=0)
[150]:
a 21
b 20
c 19
d 14
e 23
dtype: int64
[151]:
df.apply(sum, axis=1)
[151]:
w 20
x 26
y 29
z 22
dtype: int64
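For standard reductions like this, the built-in DataFrame methods are vectorized and avoid Python-level apply. A sketch with a small fixed frame (not the random one above) so the results are reproducible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5),
                  columns=list('abcde'), index=list('wxyz'))

# axis=0 reduces down each column; axis=1 reduces across each row
col_sums = df.sum(axis=0)
row_sums = df.sum(axis=1)
```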
[152]:
import string
[153]:
char_map = {i: c for i,c in enumerate(string.ascii_uppercase)}
[154]:
df.applymap(lambda x: char_map[x])
[154]:
a | b | c | d | e | |
---|---|---|---|---|---|
w | D | H | A | D | H |
x | J | F | F | C | F |
y | J | E | F | I | D |
z | A | E | J | B | I |
[155]:
df.assign(b_map = df.b.map(char_map))
[155]:
a | b | c | d | e | b_map | |
---|---|---|---|---|---|---|
w | 3 | 7 | 0 | 3 | 7 | H |
x | 9 | 5 | 5 | 2 | 5 | F |
y | 9 | 4 | 5 | 8 | 3 | E |
z | 0 | 4 | 9 | 1 | 8 | E |
Chaining commands¶
You will sometimes see this functional style of method chaining, which avoids the need for temporary intermediate variables.
[156]:
(
iris.
sample(frac=0.2).
filter(regex='s.*').
assign(both=iris.sepal_length + iris.petal_length).
query('both > 2').
groupby('species').agg(['mean', 'sum']).
pipe(lambda x: np.around(x, 1))
)
[156]:
sepal_length | sepal_width | both | ||||
---|---|---|---|---|---|---|
mean | sum | mean | sum | mean | sum | |
species | ||||||
setosa | 5.1 | 50.7 | 3.5 | 35.1 | 6.6 | 65.7 |
versicolor | 5.8 | 52.2 | 2.8 | 24.8 | 10.1 | 90.9 |
virginica | 7.0 | 77.0 | 3.1 | 33.8 | 12.7 | 140.2 |
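One caveat with the chain above: assign(both=iris.sepal_length + iris.petal_length) refers back to the original iris rather than the sampled intermediate frame, and only works because pandas aligns the two on their index. Passing a callable to assign makes the chain self-contained. A sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'sepal_length': [5.1, 4.9], 'petal_length': [1.4, 1.5]})

# the lambda receives the frame as it exists at this point in the chain,
# so the expression no longer depends on an outer variable
out = df.assign(both=lambda d: d.sepal_length + d.petal_length)
```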
Moving between R and Python in Jupyter¶
[157]:
%load_ext rpy2.ipython
[158]:
import warnings
warnings.simplefilter('ignore', FutureWarning)
[159]:
iris = %R iris
[160]:
iris.head()
[160]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
[161]:
iris_py = iris.copy()
iris_py.Species = iris_py.Species.str.upper()
[162]:
%%R -i iris_py -o iris_r
iris_r <- iris_py[1:3,]
[163]:
iris_r
[163]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | SETOSA |
2 | 4.9 | 3.0 | 1.4 | 0.2 | SETOSA |
3 | 4.7 | 3.2 | 1.3 | 0.2 | SETOSA |
[164]:
! python3 -m pip install --quiet watermark
[165]:
%load_ext watermark
[166]:
%watermark -v -iv
pendulum 2.1.2
pandas 1.1.1
numpy 1.18.5
CPython 3.8.5
IPython 7.17.0