Introduction to pandas
In [1]:
import numpy as np
import pandas as pd
Series and Data Frames¶
Series objects¶
A Series
is like a vector: all elements must have the same type, with
missing entries stored as nulls.
In [2]:
s = pd.Series([1,1,2,3] + [None])
s
Out[2]:
0 1.0
1 1.0
2 2.0
3 3.0
4 NaN
dtype: float64
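The upcast to float64 happens because NumPy integers cannot represent missing values. As a sketch of an alternative, pandas also provides the nullable "Int64" extension dtype (note the capital I), which keeps integer values and stores the missing entry as pd.NA:

```python
import pandas as pd

# Default behaviour: None forces an upcast to float64
s_float = pd.Series([1, 1, 2, 3, None])

# The nullable "Int64" dtype keeps integers alongside a missing entry
s_int = pd.Series([1, 1, 2, 3, None], dtype="Int64")

print(s_float.dtype)  # float64
print(s_int.dtype)    # Int64
```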
Special types of series¶
Strings¶
In [5]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1
Out[5]:
0 the quick
1 quick brown
2 brown fox
3 fox jumps
4 jumps over
5 over the
6 the lazy
7 lazy dog
dtype: object
In [6]:
s1.str.upper()
Out[6]:
0 THE QUICK
1 QUICK BROWN
2 BROWN FOX
3 FOX JUMPS
4 JUMPS OVER
5 OVER THE
6 THE LAZY
7 LAZY DOG
dtype: object
In [7]:
s1.str.split()
Out[7]:
0 [the, quick]
1 [quick, brown]
2 [brown, fox]
3 [fox, jumps]
4 [jumps, over]
5 [over, the]
6 [the, lazy]
7 [lazy, dog]
dtype: object
In [8]:
s1.str.split().str[1]
Out[8]:
0 quick
1 brown
2 fox
3 jumps
4 over
5 the
6 lazy
7 dog
dtype: object
Categories¶
In [9]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2
Out[9]:
0 Asian
1 Asian
2 White
3 Black
4 White
5 Hispanic
dtype: object
In [10]:
s2 = s2.astype('category')
s2
Out[10]:
0 Asian
1 Asian
2 White
3 Black
4 White
5 Hispanic
dtype: category
Categories (4, object): [Asian, Black, Hispanic, White]
In [11]:
s2.cat.categories
Out[11]:
Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')
In [12]:
s2.cat.codes
Out[12]:
0 0
1 0
2 3
3 1
4 3
5 2
dtype: int8
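Categories can also carry an explicit order, which changes how sorting and comparisons behave. A minimal sketch with a hypothetical size variable:

```python
import pandas as pd

sizes = pd.Series(['small', 'large', 'medium', 'small'])

# Declare an ordered categorical: small < medium < large
sizes = sizes.astype(pd.CategoricalDtype(['small', 'medium', 'large'],
                                         ordered=True))

# Sorting respects the declared order, not alphabetical order
print(sizes.sort_values().tolist())  # ['small', 'small', 'medium', 'large']

# Ordered categoricals support comparisons against a category
print((sizes > 'small').tolist())    # [False, True, True, False]
```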
DataFrame objects¶
A DataFrame
is like a matrix. Columns in a DataFrame
are
Series
.
Each column in a DataFrame represents a variable
Each row in a DataFrame represents an observation
Each cell in a DataFrame represents a value
In [13]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df
Out[13]:
num | |
---|---|
0 | 1.0 |
1 | 2.0 |
2 | 3.0 |
3 | NaN |
In [14]:
df.num
Out[14]:
0 1.0
1 2.0
2 3.0
3 NaN
Name: num, dtype: float64
Index¶
Row and column identifiers are of Index
type.
Somewhat confusingly, "index" is also a synonym for the row identifiers.
In [15]:
df.index
Out[15]:
RangeIndex(start=0, stop=4, step=1)
Setting a column as the row index¶
In [16]:
df
Out[16]:
num | |
---|---|
0 | 1.0 |
1 | 2.0 |
2 | 3.0 |
3 | NaN |
In [17]:
df1 = df.set_index('num')
df1
Out[17]:
num |
---|
1.0 |
2.0 |
3.0 |
NaN |
Columns¶
This is just a different Index object.
In [19]:
df.columns
Out[19]:
Index(['num'], dtype='object')
Getting raw values¶
Sometimes you just want a numpy
array, and not a pandas
object.
In [20]:
df.values
Out[20]:
array([[ 1.],
[ 2.],
[ 3.],
[nan]])
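Recent pandas versions recommend the `to_numpy` method over the `values` attribute, since it makes the conversion (and an optional target dtype) explicit. A short sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(num=[1, 2, 3] + [None]))

# Equivalent to df.values, but the conversion is explicit
arr = df.to_numpy()

# A dtype can be requested; the NaN keeps the result floating point anyway
arr_f = df.to_numpy(dtype=float)
print(arr_f.shape)  # (4, 1)
```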
Creating Data Frames¶
Manual¶
In [21]:
from collections import OrderedDict
In [22]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='d')
df = pd.DataFrame(OrderedDict(pid=np.random.randint(100, 999, n),
weight=np.random.normal(70, 20, n),
height=np.random.normal(170, 15, n),
date=dates,
))
df
Out[22]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 258 | 102.249873 | 179.829518 | 2019-01-23 08:33:52.400613 |
1 | 887 | 61.520051 | 179.539043 | 2019-01-24 08:33:52.400613 |
2 | 689 | 90.455795 | 167.137938 | 2019-01-25 08:33:52.400613 |
3 | 184 | 111.817942 | 169.137430 | 2019-01-26 08:33:52.400613 |
4 | 228 | 57.779447 | 190.507569 | 2019-01-27 08:33:52.400613 |
From file¶
You can read in data from many different file types: plain text, JSON,
spreadsheets, databases, etc. Functions to read in data are named
read_X
where X is the file type.
In [23]:
%%file measures.txt
pid weight height date
328 72.654347 203.560866 2018-11-11 14:16:18.148411
756 34.027679 189.847316 2018-11-12 14:16:18.148411
185 28.501914 158.646074 2018-11-13 14:16:18.148411
507 17.396343 180.795993 2018-11-14 14:16:18.148411
919 64.724301 173.564725 2018-11-15 14:16:18.148411
Writing measures.txt
In [24]:
df = pd.read_table('measures.txt')
df
Out[24]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Indexing Data Frames¶
Implicit defaults¶
If you provide a slice, it is assumed that you are asking for rows.
In [25]:
df[1:3]
Out[25]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
If you provide a single value or a list, it is assumed that you are asking for columns.
In [26]:
df[['pid', 'weight']]
Out[26]:
pid | weight | |
---|---|---|
0 | 328 | 72.654347 |
1 | 756 | 34.027679 |
2 | 185 | 28.501914 |
3 | 507 | 17.396343 |
4 | 919 | 64.724301 |
Extracting a column¶
Dictionary style access¶
In [27]:
df['pid']
Out[27]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: int64
Property style access¶
This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes, or keywords)
In [28]:
df.pid
Out[28]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: int64
Indexing by location¶
This is similar to numpy
indexing.
In [29]:
df.iloc[1:3, :]
Out[29]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
In [30]:
df.iloc[1:3, [True, False, True]]
Out[30]:
pid | height | |
---|---|---|
1 | 756 | 189.847316 |
2 | 185 | 158.646074 |
Indexing by name¶
In [31]:
df.loc[1:3, 'weight':'height']
Out[31]:
weight | height | |
---|---|---|
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
Warning: When using loc
, the row slice indicates row names, not
positions.
In [32]:
df1 = df.copy()
df1.index = df.index + 1
df1
Out[32]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
2 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
3 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
4 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
5 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
In [33]:
df1.loc[1:3, 'weight':'height']
Out[33]:
weight | height | |
---|---|---|
1 | 72.654347 | 203.560866 |
2 | 34.027679 | 189.847316 |
3 | 28.501914 | 158.646074 |
Structure of a Data Frame¶
Data types¶
In [34]:
df.dtypes
Out[34]:
pid int64
weight float64
height float64
date object
dtype: object
Converting data types¶
Using astype
on one column¶
In [35]:
df.pid = df.pid.astype('category')
Using astype
on multiple columns¶
In [36]:
df = df.astype(dict(weight=float, height=float))
Using a conversion function¶
In [37]:
df.date = pd.to_datetime(df.date)
Check¶
In [38]:
df.dtypes
Out[38]:
pid category
weight float64
height float64
date datetime64[ns]
dtype: object
Basic properties¶
In [39]:
df.size
Out[39]:
20
In [40]:
df.shape
Out[40]:
(5, 4)
In [41]:
df.describe()
Out[41]:
weight | height | |
---|---|---|
count | 5.000000 | 5.000000 |
mean | 43.460917 | 181.282995 |
std | 23.960945 | 16.895933 |
min | 17.396343 | 158.646074 |
25% | 28.501914 | 173.564725 |
50% | 34.027679 | 180.795993 |
75% | 64.724301 | 189.847316 |
max | 72.654347 | 203.560866 |
Inspection¶
In [42]:
df.head(n=3)
Out[42]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
In [43]:
df.tail(n=3)
Out[43]:
pid | weight | height | date | |
---|---|---|---|---|
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
In [44]:
df.sample(n=3)
Out[44]:
pid | weight | height | date | |
---|---|---|---|---|
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
In [45]:
df.sample(frac=0.5)
Out[45]:
pid | weight | height | date | |
---|---|---|---|---|
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
Selecting, Renaming and Removing Columns¶
Selecting columns¶
In [46]:
df.filter(items=['pid', 'date'])
Out[46]:
pid | date | |
---|---|---|
0 | 328 | 2018-11-11 14:16:18.148411 |
1 | 756 | 2018-11-12 14:16:18.148411 |
2 | 185 | 2018-11-13 14:16:18.148411 |
3 | 507 | 2018-11-14 14:16:18.148411 |
4 | 919 | 2018-11-15 14:16:18.148411 |
In [47]:
df.filter(regex='.*ght')
Out[47]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
Note that you can also use regular string methods on the columns¶
In [48]:
df.loc[:, df.columns.str.contains('d')]
Out[48]:
pid | date | |
---|---|---|
0 | 328 | 2018-11-11 14:16:18.148411 |
1 | 756 | 2018-11-12 14:16:18.148411 |
2 | 185 | 2018-11-13 14:16:18.148411 |
3 | 507 | 2018-11-14 14:16:18.148411 |
4 | 919 | 2018-11-15 14:16:18.148411 |
Renaming columns¶
In [49]:
df.rename(dict(weight='w', height='h'), axis=1)
Out[49]:
pid | w | h | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
In [50]:
orig_cols = df.columns
In [51]:
df.columns = list('abcd')
In [52]:
df
Out[52]:
a | b | c | d | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
In [53]:
df.columns = orig_cols
In [54]:
df
Out[54]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Removing columns¶
In [55]:
df.drop(['pid', 'date'], axis=1)
Out[55]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
In [56]:
df.drop(columns=['pid', 'date'])
Out[56]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
In [57]:
df.drop(columns=df.columns[df.columns.str.contains('d')])
Out[57]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
Selecting, Renaming and Removing Rows¶
Selecting rows¶
In [58]:
df[df.weight.between(60,70)]
Out[58]:
pid | weight | height | date | |
---|---|---|---|---|
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
In [59]:
df[(69 <= df.weight) & (df.weight < 70)]
Out[59]:
pid | weight | height | date |
---|
In [60]:
df[df.date.between(pd.to_datetime('2018-11-13'),
pd.to_datetime('2018-11-15 23:59:59'))]
Out[60]:
pid | weight | height | date | |
---|---|---|---|---|
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
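Boolean masks like the ones above can also be written as strings with `DataFrame.query`, which is often easier to read for compound conditions. A sketch on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame(dict(pid=[328, 756, 185],
                       weight=[72.7, 34.0, 28.5]))

# Equivalent to df[(30 <= df.weight) & (df.weight < 70)]
subset = df.query('30 <= weight < 70')
print(subset.pid.tolist())  # [756]
```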
Renaming rows¶
In [61]:
df.rename({i:letter for i,letter in enumerate('abcde')})
Out[61]:
pid | weight | height | date | |
---|---|---|---|---|
a | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
b | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
c | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
d | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
e | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
In [62]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumphs']
In [63]:
df
Out[63]:
pid | weight | height | date | |
---|---|---|---|---|
the | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
quick | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
brown | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
fox | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
jumphs | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
In [64]:
df = df.reset_index(drop=True)
In [65]:
df
Out[65]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Dropping rows¶
In [66]:
df.drop([1,3], axis=0)
Out[66]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Dropping duplicate data¶
In [67]:
df['something'] = [1,1,None,2,None]
In [68]:
df.loc[df.something.duplicated()]
Out[68]:
pid | weight | height | date | something | |
---|---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN |
In [69]:
df.drop_duplicates(subset='something')
Out[69]:
pid | weight | height | date | something | |
---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 |
Handling missing data¶
In [70]:
df
Out[70]:
pid | weight | height | date | something | |
---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN |
In [71]:
df.something.fillna(0)
Out[71]:
0 1.0
1 1.0
2 0.0
3 2.0
4 0.0
Name: something, dtype: float64
In [72]:
df.something.ffill()
Out[72]:
0 1.0
1 1.0
2 1.0
3 2.0
4 2.0
Name: something, dtype: float64
In [73]:
df.something.bfill()
Out[73]:
0 1.0
1 1.0
2 2.0
3 2.0
4 NaN
Name: something, dtype: float64
In [74]:
df.something.interpolate()
Out[74]:
0 1.0
1 1.0
2 1.5
3 2.0
4 2.0
Name: something, dtype: float64
In [75]:
df.dropna()
Out[75]:
pid | weight | height | date | something | |
---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 |
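`dropna` takes a few useful options: `subset` restricts the check to particular columns, and `thresh` keeps rows with at least that many non-null values. A sketch on a small frame with two patchy columns:

```python
import pandas as pd

df = pd.DataFrame(dict(pid=[1, 2, 3],
                       something=[1.0, None, 2.0],
                       other=[None, None, 5.0]))

# Drop rows only when 'something' is missing
only_something = df.dropna(subset=['something'])

# Keep rows that have at least 2 non-null values
at_least_two = df.dropna(thresh=2)

print(len(only_something), len(at_least_two))  # 2 2
```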
Transforming and Creating Columns¶
In [76]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)
Out[76]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | 21.485449 |
In [77]:
df['bmi'] = df['weight'] / (df['height']/100)**2
In [78]:
df
Out[78]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | 21.485449 |
In [79]:
df['something'] = [2,2,None,None,3]
In [80]:
df
Out[80]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 |
Sorting Data Frames¶
Sort on indexes¶
In [81]:
df.sort_index(axis=1)
Out[81]:
bmi | date | height | pid | something | weight | |
---|---|---|---|---|---|---|
0 | 17.533678 | 2018-11-11 14:16:18.148411 | 203.560866 | 328 | 2.0 | 72.654347 |
1 | 9.441118 | 2018-11-12 14:16:18.148411 | 189.847316 | 756 | 2.0 | 34.027679 |
2 | 11.324404 | 2018-11-13 14:16:18.148411 | 158.646074 | 185 | NaN | 28.501914 |
3 | 5.322067 | 2018-11-14 14:16:18.148411 | 180.795993 | 507 | NaN | 17.396343 |
4 | 21.485449 | 2018-11-15 14:16:18.148411 | 173.564725 | 919 | 3.0 | 64.724301 |
In [82]:
df.sort_index(axis=0, ascending=False)
Out[82]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 |
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 |
Sort on values¶
In [83]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])
Out[83]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 |
Summarizing¶
Apply an aggregation function¶
In [84]:
df.select_dtypes(include=np.number)
Out[84]:
weight | height | something | bmi | |
---|---|---|---|---|
0 | 72.654347 | 203.560866 | 2.0 | 17.533678 |
1 | 34.027679 | 189.847316 | 2.0 | 9.441118 |
2 | 28.501914 | 158.646074 | NaN | 11.324404 |
3 | 17.396343 | 180.795993 | NaN | 5.322067 |
4 | 64.724301 | 173.564725 | 3.0 | 21.485449 |
In [85]:
df.select_dtypes(include=np.number).agg(np.sum)
Out[85]:
weight 217.304584
height 906.414974
something 7.000000
bmi 65.106716
dtype: float64
In [86]:
df.agg(['count', np.sum, np.mean])
Out[86]:
pid | weight | height | date | something | bmi | |
---|---|---|---|---|---|---|
count | 5.0 | 5.000000 | 5.000000 | 5.0 | 3.000000 | 5.000000 |
sum | NaN | 217.304584 | 906.414974 | NaN | 7.000000 | 65.106716 |
mean | NaN | 43.460917 | 181.282995 | NaN | 2.333333 | 13.021343 |
Split-Apply-Combine¶
We often want to perform subgroup analysis (conditioning by some
discrete or categorical variable). This is done with groupby
followed by an aggregate function. Conceptually, we split the data frame
into separate groups, apply the aggregate function to each group
separately, then combine the aggregated results back into a single data
frame.
In [87]:
df['treatment'] = list('ababa')
In [88]:
df
Out[88]:
pid | weight | height | date | something | bmi | treatment | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a |
In [89]:
grouped = df.groupby('treatment')
In [90]:
grouped.get_group('a')
Out[90]:
pid | weight | height | date | something | bmi | treatment | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a |
In [91]:
grouped.mean()
Out[91]:
weight | height | something | bmi | |
---|---|---|---|---|
treatment | ||||
a | 55.293521 | 178.590555 | 2.5 | 16.781177 |
b | 25.712011 | 185.321654 | 2.0 | 7.381592 |
Using agg
with groupby
¶
In [92]:
grouped.agg('mean')
Out[92]:
weight | height | something | bmi | |
---|---|---|---|---|
treatment | ||||
a | 55.293521 | 178.590555 | 2.5 | 16.781177 |
b | 25.712011 | 185.321654 | 2.0 | 7.381592 |
In [93]:
grouped.agg(['mean', 'std'])
Out[93]:
weight | height | something | bmi | |||||
---|---|---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | mean | std | |
treatment | ||||||||
a | 55.293521 | 23.538565 | 178.590555 | 22.875289 | 2.5 | 0.707107 | 16.781177 | 5.122148 |
b | 25.712011 | 11.760130 | 185.321654 | 6.400252 | 2.0 | NaN | 7.381592 | 2.912608 |
In [94]:
grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})
Out[94]:
weight | height | bmi | |||
---|---|---|---|---|---|
mean | std | min | max | <lambda> | |
treatment | |||||
a | 55.293521 | 23.538565 | 158.646074 | 203.560866 | 897.296525 |
b | 25.712011 | 11.760130 | 180.795993 | 189.847316 | 117.459100 |
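Since pandas 0.25 there is also "named aggregation", which gives the output flat, explicit column names instead of the MultiIndex above. A sketch on a stand-in frame with the same treatment groups:

```python
import pandas as pd

df = pd.DataFrame(dict(treatment=list('ababa'),
                       weight=[72.7, 34.0, 28.5, 17.4, 64.7]))

# Each keyword names an output column as (input_column, aggregation)
out = df.groupby('treatment').agg(
    mean_weight=('weight', 'mean'),
    max_weight=('weight', 'max'),
)
print(list(out.columns))  # ['mean_weight', 'max_weight']
```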
Using transform
with groupby
¶
In [95]:
g_mean = grouped[['weight', 'height']].transform(np.mean)
g_mean
Out[95]:
weight | height | |
---|---|---|
0 | 55.293521 | 178.590555 |
1 | 25.712011 | 185.321654 |
2 | 55.293521 | 178.590555 |
3 | 25.712011 | 185.321654 |
4 | 55.293521 | 178.590555 |
In [96]:
g_std = grouped[['weight', 'height']].transform(np.std)
g_std
Out[96]:
weight | height | |
---|---|---|
0 | 23.538565 | 22.875289 |
1 | 11.760130 | 6.400252 |
2 | 23.538565 | 22.875289 |
3 | 11.760130 | 6.400252 |
4 | 23.538565 | 22.875289 |
In [97]:
(df[['weight', 'height']] - g_mean)/g_std
Out[97]:
weight | height | |
---|---|---|
0 | 0.737548 | 1.091584 |
1 | 0.707107 | 0.707107 |
2 | -1.138201 | -0.871879 |
3 | -0.707107 | -0.707107 |
4 | 0.400652 | -0.219706 |
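The standardization above can also be done in a single `transform` call by passing a function that is applied per group. A sketch with simple numbers (each group here has two symmetric points, so the z-scores come out as ±1/√2):

```python
import pandas as pd

df = pd.DataFrame(dict(treatment=list('abab'),
                       weight=[10.0, 20.0, 30.0, 40.0]))

# Standardize within each treatment group in one step
z = (df.groupby('treatment')['weight']
       .transform(lambda x: (x - x.mean()) / x.std()))
print(z.round(2).tolist())  # [-0.71, -0.71, 0.71, 0.71]
```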
Combining Data Frames¶
In [98]:
df
Out[98]:
pid | weight | height | date | something | bmi | treatment | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a |
In [99]:
df1 = df.iloc[3:].copy()
In [100]:
df1.drop('something', axis=1, inplace=True)
df1
Out[100]:
pid | weight | height | date | bmi | treatment | |
---|---|---|---|---|---|---|
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 21.485449 | a |
Adding rows¶
Note that pandas
aligns by column indexes automatically. (DataFrame.append
was deprecated in pandas 1.4 and removed in 2.0; pd.concat
is the recommended replacement.)
In [101]:
df.append(df1, sort=False)
Out[101]:
pid | weight | height | date | something | bmi | treatment | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | 21.485449 | a |
In [102]:
pd.concat([df, df1], sort=False)
Out[102]:
pid | weight | height | date | something | bmi | treatment | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | 21.485449 | a |
Adding columns¶
In [103]:
df.pid
Out[103]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: category
Categories (5, int64): [185, 328, 507, 756, 919]
In [104]:
df2 = pd.DataFrame(OrderedDict(pid=[649, 533, 400, 600], age=[23,34,45,56]))
In [105]:
df2.pid
Out[105]:
0 649
1 533
2 400
3 600
Name: pid, dtype: int64
In [106]:
df.pid = df.pid.astype('int')
In [107]:
pd.merge(df, df2, on='pid', how='inner')
Out[107]:
pid | weight | height | date | something | bmi | treatment | age |
---|
In [108]:
pd.merge(df, df2, on='pid', how='left')
Out[108]:
pid | weight | height | date | something | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a | NaN |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b | NaN |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b | NaN |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a | NaN |
In [109]:
pd.merge(df, df2, on='pid', how='right')
Out[109]:
pid | weight | height | date | something | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|
0 | 649 | NaN | NaN | NaT | NaN | NaN | NaN | 23 |
1 | 533 | NaN | NaN | NaT | NaN | NaN | NaN | 34 |
2 | 400 | NaN | NaN | NaT | NaN | NaN | NaN | 45 |
3 | 600 | NaN | NaN | NaT | NaN | NaN | NaN | 56 |
In [110]:
pd.merge(df, df2, on='pid', how='outer')
Out[110]:
pid | weight | height | date | something | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | 17.533678 | a | NaN |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | 9.441118 | b | NaN |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | 11.324404 | a | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | 5.322067 | b | NaN |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | 21.485449 | a | NaN |
5 | 649 | NaN | NaN | NaT | NaN | NaN | NaN | 23.0 |
6 | 533 | NaN | NaN | NaT | NaN | NaN | NaN | 34.0 |
7 | 400 | NaN | NaN | NaT | NaN | NaN | NaN | 45.0 |
8 | 600 | NaN | NaN | NaT | NaN | NaN | NaN | 56.0 |
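When debugging a merge, the `indicator` argument adds a `_merge` column recording which side each row came from. A sketch with small hypothetical frames:

```python
import pandas as pd

left = pd.DataFrame(dict(pid=[1, 2, 3], x=['a', 'b', 'c']))
right = pd.DataFrame(dict(pid=[2, 3, 4], y=[20, 30, 40]))

# indicator=True adds a categorical '_merge' column:
# 'left_only', 'right_only', or 'both'
m = pd.merge(left, right, on='pid', how='outer', indicator=True)
print(m._merge.tolist())
# ['left_only', 'both', 'both', 'right_only']
```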
Merging on the index¶
In [111]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))
In [112]:
df1
Out[112]:
x | |
---|---|
a | 1 |
b | 2 |
c | 3 |
In [113]:
df2
Out[113]:
y | |
---|---|
a | 4 |
b | 5 |
c | 6 |
In [114]:
df3
Out[114]:
z | |
---|---|
a | 7 |
b | 8 |
c | 9 |
In [115]:
df1.join([df2, df3])
Out[115]:
x | y | z | |
---|---|---|---|
a | 1 | 4 | 7 |
b | 2 | 5 | 8 |
c | 3 | 6 | 9 |
Fixing common DataFrame issues¶
Multiple variables in a column¶
In [116]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df
Out[116]:
pid_treat | |
---|---|
0 | A-1 |
1 | B-2 |
2 | C-1 |
3 | D-2 |
In [117]:
df.pid_treat.str.split('-')
Out[117]:
0 [A, 1]
1 [B, 2]
2 [C, 1]
3 [D, 2]
Name: pid_treat, dtype: object
In [118]:
df.pid_treat.str.split('-').apply(pd.Series, index=['pid', 'treat'])
Out[118]:
pid | treat | |
---|---|---|
0 | A | 1 |
1 | B | 2 |
2 | C | 1 |
3 | D | 2 |
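The same split can be done directly with `expand=True`, which returns a DataFrame instead of a Series of lists:

```python
import pandas as pd

df = pd.DataFrame(dict(pid_treat=['A-1', 'B-2', 'C-1', 'D-2']))

# expand=True spreads the split parts across columns
parts = df.pid_treat.str.split('-', expand=True)
parts.columns = ['pid', 'treat']
print(parts.treat.tolist())  # ['1', '2', '1', '2']
```

Note that the split parts are still strings; convert with `astype` if a numeric treatment code is needed.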
Multiple values in a cell¶
In [119]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))
df
Out[119]:
pid | vals | |
---|---|---|
0 | a | (1, 2, 3) |
1 | b | (4, 5, 6) |
2 | c | (7, 8, 9) |
In [120]:
df[['t1', 't2', 't3']] = df.vals.apply(pd.Series)
df
Out[120]:
pid | vals | t1 | t2 | t3 | |
---|---|---|---|---|---|
0 | a | (1, 2, 3) | 1 | 2 | 3 |
1 | b | (4, 5, 6) | 4 | 5 | 6 |
2 | c | (7, 8, 9) | 7 | 8 | 9 |
In [121]:
df.drop('vals', axis=1, inplace=True)
In [122]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)
Out[122]:
pid | vals | |
---|---|---|
0 | a | 1 |
1 | b | 4 |
2 | c | 7 |
3 | a | 2 |
4 | b | 5 |
5 | c | 8 |
6 | a | 3 |
7 | b | 6 |
8 | c | 9 |
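Since pandas 0.25, `DataFrame.explode` turns each element of a list-like cell into its own row in a single call, an alternative to the apply/melt route above:

```python
import pandas as pd

df = pd.DataFrame(dict(pid=['a', 'b', 'c'],
                       vals=[(1, 2, 3), (4, 5, 6), (7, 8, 9)]))

# One row per element of each tuple; the pid value is repeated
tidy = df.explode('vals')
print(tidy.shape)          # (9, 2)
print(tidy.vals.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```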
Reshaping Data Frames¶
Sometimes we need to make rows into columns or vice versa.
Converting multiple columns into a single column¶
This is often useful if you need to condition on some variable.
In [123]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)
In [124]:
iris.head()
Out[124]:
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
In [125]:
iris.shape
Out[125]:
(150, 5)
In [126]:
df_iris = pd.melt(iris, id_vars='species')
In [127]:
df_iris.sample(10)
Out[127]:
species | variable | value | |
---|---|---|---|
535 | versicolor | petal_width | 1.6 |
175 | setosa | sepal_width | 3.0 |
430 | virginica | petal_length | 6.1 |
481 | setosa | petal_width | 0.4 |
3 | setosa | sepal_length | 4.6 |
73 | versicolor | sepal_length | 6.1 |
581 | virginica | petal_width | 2.0 |
362 | versicolor | petal_length | 4.0 |
294 | virginica | sepal_width | 3.3 |
66 | versicolor | sepal_length | 5.6 |
Chaining commands¶
Sometimes you see this functional style of method chaining, which avoids the need for temporary intermediate variables.
In [128]:
(
iris.
sample(frac=0.2).
filter(regex='s.*').
assign(both=iris.sepal_length + iris.sepal_length).
groupby('species').agg(['mean', 'sum']).
pipe(lambda x: np.around(x, 1))
)
Out[128]:
sepal_length | sepal_width | both | ||||
---|---|---|---|---|---|---|
mean | sum | mean | sum | mean | sum | |
species | ||||||
setosa | 5.2 | 56.8 | 3.6 | 39.1 | 10.3 | 113.6 |
versicolor | 6.0 | 42.1 | 2.8 | 19.6 | 12.0 | 84.2 |
virginica | 6.7 | 80.1 | 2.9 | 34.6 | 13.4 | 160.2 |
Moving between R and Python in Jupyter¶
In [129]:
%load_ext rpy2.ipython
In [130]:
import warnings
warnings.simplefilter('ignore', FutureWarning)
In [131]:
iris = %R iris
In [132]:
iris.head()
Out[132]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
In [133]:
iris_py = iris.copy()
iris_py.Species = iris_py.Species.str.upper()
In [134]:
%%R -i iris_py -o iris_r
iris_r <- iris_py[1:3,]
In [135]:
iris_r
Out[135]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | SETOSA |
1 | 4.9 | 3.0 | 1.4 | 0.2 | SETOSA |
2 | 4.7 | 3.2 | 1.3 | 0.2 | SETOSA |