Introduction to pandas¶
[3]:
import numpy as np
import pandas as pd
Series and Data Frames¶
Series objects¶
A Series is like a vector. All elements must share a single type; missing values are stored as NaN (note the upcast to float64 below).
[4]:
s = pd.Series([1,1,2,3] + [None])
s
[4]:
0 1.0
1 1.0
2 2.0
3 3.0
4 NaN
dtype: float64
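If you want integer data to stay integer in the presence of missing values, pandas offers the nullable Int64 extension dtype. A minimal sketch (not part of the original notebook):

```python
import pandas as pd

# Plain int input with a None is upcast to float64 (None becomes NaN).
# The nullable "Int64" extension dtype keeps the values as integers
# and represents the missing entry as pd.NA instead.
s = pd.Series([1, 1, 2, 3, None], dtype="Int64")
print(s.dtype)         # Int64
print(s.isna().sum())  # 1
```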
Special types of series¶
Strings¶
[7]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1
[7]:
0 the quick
1 quick brown
2 brown fox
3 fox jumps
4 jumps over
5 over the
6 the lazy
7 lazy dog
dtype: object
[8]:
s1.str.upper()
[8]:
0 THE QUICK
1 QUICK BROWN
2 BROWN FOX
3 FOX JUMPS
4 JUMPS OVER
5 OVER THE
6 THE LAZY
7 LAZY DOG
dtype: object
[9]:
s1.str.split()
[9]:
0 [the, quick]
1 [quick, brown]
2 [brown, fox]
3 [fox, jumps]
4 [jumps, over]
5 [over, the]
6 [the, lazy]
7 [lazy, dog]
dtype: object
[10]:
s1.str.split().str[1]
[10]:
0 quick
1 brown
2 fox
3 jumps
4 over
5 the
6 lazy
7 dog
dtype: object
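The str accessor also supports vectorized predicates such as str.contains, which combine naturally with boolean indexing. A small sketch reusing the bigram series above:

```python
import pandas as pd

words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(p) for p in zip(words[:-1], words[1:])])

# str.contains returns a boolean Series usable as a filter mask.
mask = s1.str.contains('the')
print(s1[mask].tolist())  # ['the quick', 'over the', 'the lazy']
```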
Categories¶
[11]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2
[11]:
0 Asian
1 Asian
2 White
3 Black
4 White
5 Hispanic
dtype: object
[12]:
s2 = s2.astype('category')
s2
[12]:
0 Asian
1 Asian
2 White
3 Black
4 White
5 Hispanic
dtype: category
Categories (4, object): [Asian, Black, Hispanic, White]
[13]:
s2.cat.categories
[13]:
Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')
[14]:
s2.cat.codes
[14]:
0 0
1 0
2 3
3 1
4 3
5 2
dtype: int8
Ordered categories¶
[22]:
s3 = pd.Series(['Mon', 'Tue', 'Wed', 'Thu', 'Fri']).astype('category')
s3
[22]:
0 Mon
1 Tue
2 Wed
3 Thu
4 Fri
dtype: category
Categories (5, object): [Fri, Mon, Thu, Tue, Wed]
[23]:
s3.cat.ordered
[23]:
False
[24]:
s3.sort_values()
[24]:
4 Fri
0 Mon
3 Thu
1 Tue
2 Wed
dtype: category
Categories (5, object): [Fri, Mon, Thu, Tue, Wed]
[33]:
s3 = s3.cat.reorder_categories(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], ordered=True)
[34]:
s3.cat.ordered
[34]:
True
[35]:
s3.sort_values()
[35]:
0 Mon
1 Tue
2 Wed
3 Thu
4 Fri
dtype: category
Categories (5, object): [Mon < Tue < Wed < Thu < Fri]
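Once a categorical is ordered, comparisons and min/max respect the declared category order rather than alphabetical order. A quick sketch:

```python
import pandas as pd

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
s3 = pd.Series(days).astype('category')
s3 = s3.cat.reorder_categories(days, ordered=True)

# Comparisons use the declared category order, not string order.
print((s3 > 'Tue').tolist())  # [False, False, True, True, True]
print(s3.min(), s3.max())     # Mon Fri
```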
DataFrame objects¶
A DataFrame is like a matrix. Columns in a DataFrame are Series.
Each column in a DataFrame represents a variable
Each row in a DataFrame represents an observation
Each cell in a DataFrame represents a value
[37]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df
[37]:
num | |
---|---|
0 | 1.0 |
1 | 2.0 |
2 | 3.0 |
3 | NaN |
[38]:
df.num
[38]:
0 1.0
1 2.0
2 3.0
3 NaN
Name: num, dtype: float64
Index¶
Row and column identifiers are of Index type. Somewhat confusingly, index is also a synonym for the row identifiers.
[39]:
df.index
[39]:
RangeIndex(start=0, stop=4, step=1)
Setting a column as the row index¶
[40]:
df
[40]:
num | |
---|---|
0 | 1.0 |
1 | 2.0 |
2 | 3.0 |
3 | NaN |
[41]:
df1 = df.set_index('num')
df1
[41]:
num |
---|
1.0 |
2.0 |
3.0 |
NaN |
Sometimes you don’t need to retain the index information¶
[61]:
df = pd.DataFrame(dict(letters = list('ABCDEFG')))
[62]:
df
[62]:
letters | |
---|---|
0 | A |
1 | B |
2 | C |
3 | D |
4 | E |
5 | F |
6 | G |
[63]:
df = df[df.letters.isin(list('AEIOU'))]
[64]:
df
[64]:
letters | |
---|---|
0 | A |
4 | E |
[65]:
df.reset_index(drop=True)
[65]:
letters | |
---|---|
0 | A |
1 | E |
Columns¶
This is just a different index object
[66]:
df.columns
[66]:
Index(['letters'], dtype='object')
Getting raw values¶
Sometimes you just want a raw numpy array rather than a pandas object.
[67]:
df.values
[67]:
array([['A'],
['E']], dtype=object)
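In recent pandas versions, to_numpy() is the recommended way to get the raw array; it has clearer copy and dtype semantics than the values attribute. A sketch:

```python
import pandas as pd

df = pd.DataFrame(dict(letters=['A', 'E']))

# to_numpy() is preferred over .values in pandas >= 0.24.
arr = df.to_numpy()
print(arr.shape)  # (2, 1)
```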
Creating Data Frames¶
Manual¶
[69]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='d')
df = pd.DataFrame(dict(pid=np.random.randint(100, 999, n),
weight=np.random.normal(70, 20, n),
height=np.random.normal(170, 15, n),
date=dates,
))
df
[69]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 842 | 44.598558 | 151.770120 | 2020-01-22 14:18:16.402553 |
1 | 995 | 85.515468 | 172.067821 | 2020-01-23 14:18:16.402553 |
2 | 716 | 93.733898 | 181.213071 | 2020-01-24 14:18:16.402553 |
3 | 887 | 91.109837 | 191.187563 | 2020-01-25 14:18:16.402553 |
4 | 971 | 79.608828 | 176.912780 | 2020-01-26 14:18:16.402553 |
From numpy array¶
[74]:
pd.DataFrame(np.eye(3,2), columns=['A', 'B'], index=['x', 'y', 'z'])
[74]:
A | B | |
---|---|---|
x | 1.0 | 0.0 |
y | 0.0 | 1.0 |
z | 0.0 | 0.0 |
From URL¶
[75]:
url = "https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"
df = pd.read_csv(url)
df.head()
[75]:
sepal.length | sepal.width | petal.length | petal.width | variety | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
From file¶
You can read in data from many different file types - plain text, JSON, spreadsheets, databases etc. Functions to read in data look like read_X
where X is the data type.
[78]:
%%file measures.txt
pid weight height date
328 72.654347 203.560866 2018-11-11 14:16:18.148411
756 34.027679 189.847316 2018-11-12 14:16:18.148411
185 28.501914 158.646074 2018-11-13 14:16:18.148411
507 17.396343 180.795993 2018-11-14 14:16:18.148411
919 64.724301 173.564725 2018-11-15 14:16:18.148411
Overwriting measures.txt
[79]:
df = pd.read_table('measures.txt')
df
[79]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Indexing Data Frames¶
Implicit defaults¶
If you provide a slice, it is assumed that you are asking for rows.
[80]:
df[1:3]
[80]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
If you provide a single value or list, it is assumed that you are asking for columns.
[81]:
df[['pid', 'weight']]
[81]:
pid | weight | |
---|---|---|
0 | 328 | 72.654347 |
1 | 756 | 34.027679 |
2 | 185 | 28.501914 |
3 | 507 | 17.396343 |
4 | 919 | 64.724301 |
Extracting a column¶
Dictionary style access¶
[95]:
df['pid']
[95]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: int64
Property style access¶
This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes, or keywords).
[96]:
df.pid
[96]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: int64
Indexing by location¶
This is similar to numpy indexing.
[29]:
df.iloc[1:3, :]
[29]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
[30]:
df.iloc[1:3, [True, False, True]]
[30]:
pid | height | |
---|---|---|
1 | 756 | 189.847316 |
2 | 185 | 158.646074 |
Indexing by name¶
[31]:
df.loc[1:3, 'weight':'height']
[31]:
weight | height | |
---|---|---|
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
Warning: When using loc, the row slice refers to row labels, not positions, and both endpoints are included.
[32]:
df1 = df.copy()
df1.index = df.index + 1
df1
[32]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
2 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
3 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
4 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
5 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[33]:
df1.loc[1:3, 'weight':'height']
[33]:
weight | height | |
---|---|---|
1 | 72.654347 | 203.560866 |
2 | 34.027679 | 189.847316 |
3 | 28.501914 | 158.646074 |
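The other endpoint difference is worth remembering: iloc slices exclude the stop position (like Python lists), while loc slices include both endpoint labels. A sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame(dict(x=range(10, 15)), index=list('abcde'))

# iloc: positional, stop excluded. loc: label-based, both ends included.
print(df.iloc[1:3].index.tolist())     # ['b', 'c']
print(df.loc['b':'d'].index.tolist())  # ['b', 'c', 'd']
```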
Structure of a Data Frame¶
Data types¶
[107]:
df.dtypes
[107]:
pid int64
weight float64
height float64
date object
dtype: object
Converting data types¶
Using astype
on one column¶
[108]:
df.pid = df.pid.astype('category')
Using astype
on multiple columns¶
[110]:
df = df.astype(dict(weight=float,
height=float))
Using a conversion function¶
[111]:
df.date = pd.to_datetime(df.date)
Check¶
[112]:
df.dtypes
[112]:
pid category
weight float64
height float64
date datetime64[ns]
dtype: object
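For messy columns where astype would raise, pd.to_numeric with errors='coerce' converts what it can and fills the rest with NaN. A sketch with made-up data:

```python
import pandas as pd

s = pd.Series(['1', '2', 'oops', '4'])

# errors='coerce' replaces unparseable entries with NaN instead of raising.
nums = pd.to_numeric(s, errors='coerce')
print(nums.isna().sum())  # 1
```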
Basic properties¶
[113]:
df.size
[113]:
20
[114]:
df.shape
[114]:
(5, 4)
[115]:
df.describe()
[115]:
weight | height | |
---|---|---|
count | 5.000000 | 5.000000 |
mean | 43.460917 | 181.282995 |
std | 23.960945 | 16.895933 |
min | 17.396343 | 158.646074 |
25% | 28.501914 | 173.564725 |
50% | 34.027679 | 180.795993 |
75% | 64.724301 | 189.847316 |
max | 72.654347 | 203.560866 |
[122]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
pid 5 non-null category
weight 5 non-null float64
height 5 non-null float64
date 5 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(2)
memory usage: 453.0 bytes
Inspection¶
[123]:
df.head(n=3)
[123]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
[124]:
df.tail(n=3)
[124]:
pid | weight | height | date | |
---|---|---|---|---|
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[125]:
df.sample(n=3)
[125]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
[126]:
df.sample(frac=0.5)
[126]:
pid | weight | height | date | |
---|---|---|---|---|
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Selecting, Renaming and Removing Columns¶
Selecting columns¶
[128]:
df.filter(items=['pid', 'date'])
[128]:
pid | date | |
---|---|---|
0 | 328 | 2018-11-11 14:16:18.148411 |
1 | 756 | 2018-11-12 14:16:18.148411 |
2 | 185 | 2018-11-13 14:16:18.148411 |
3 | 507 | 2018-11-14 14:16:18.148411 |
4 | 919 | 2018-11-15 14:16:18.148411 |
[129]:
df.filter(regex='.*ght')
[129]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
The like argument keeps labels for which 'like in label' is true - that is, a substring "contains" match.
[151]:
df.filter(like='ei')
[151]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
Filter has an optional axis argument if you want to select by row index¶
[134]:
df.filter([0,1,3,4], axis=0)
[134]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Note that you can also use regular string methods on the columns¶
[135]:
df.loc[:, df.columns.str.contains('d')]
[135]:
pid | date | |
---|---|---|
0 | 328 | 2018-11-11 14:16:18.148411 |
1 | 756 | 2018-11-12 14:16:18.148411 |
2 | 185 | 2018-11-13 14:16:18.148411 |
3 | 507 | 2018-11-14 14:16:18.148411 |
4 | 919 | 2018-11-15 14:16:18.148411 |
Renaming columns¶
[136]:
df.rename(dict(weight='w', height='h'), axis=1)
[136]:
pid | w | h | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[137]:
orig_cols = df.columns
[138]:
df.columns = list('abcd')
[139]:
df
[139]:
a | b | c | d | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[140]:
df.columns = orig_cols
[141]:
df
[141]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Removing columns¶
[142]:
df.drop(['pid', 'date'], axis=1)
[142]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
[143]:
df.drop(columns=['pid', 'date'])
[143]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
[144]:
df.drop(columns=df.columns[df.columns.str.contains('d')])
[144]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
Selecting, Renaming and Removing Rows¶
Selecting rows¶
[153]:
df[df.weight.between(60,70)]
[153]:
pid | weight | height | date | |
---|---|---|---|---|
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[154]:
df[(69 <= df.weight) & (df.weight < 70)]
[154]:
pid | weight | height | date |
---|
[155]:
df[df.date.between(pd.to_datetime('2018-11-13'),
pd.to_datetime('2018-11-15 23:59:59'))]
[155]:
pid | weight | height | date | |
---|---|---|---|---|
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[158]:
df.query('weight <= 70 and height > 90')
[158]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
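query strings can also reference Python variables by prefixing them with @, which keeps thresholds out of the string literal. A sketch (weights made up to echo the data above):

```python
import pandas as pd

df = pd.DataFrame(dict(weight=[72.7, 34.0, 28.5, 17.4, 64.7]))

# @cutoff pulls the local variable into the query expression.
cutoff = 30
print(df.query('weight > @cutoff').shape[0])  # 3
```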
Renaming rows¶
[159]:
df.rename({i:letter for i,letter in enumerate('abcde')})
[159]:
pid | weight | height | date | |
---|---|---|---|---|
a | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
b | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
c | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
d | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
e | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[160]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumps']
[161]:
df
[161]:
pid | weight | height | date | |
---|---|---|---|---|
the | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
quick | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
brown | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
fox | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
jumps | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[162]:
df = df.reset_index(drop=True)
[163]:
df
[163]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Dropping rows¶
[164]:
df.drop([1,3], axis=0)
[164]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Dropping duplicated data¶
[179]:
df['something'] = [1,1,None,2,None]
df['nothing'] = [None, None, None, None, None]
[180]:
df.loc[df.something.duplicated()]
[180]:
pid | weight | height | date | something | nothing | |
---|---|---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | None |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None |
[181]:
df.drop_duplicates(subset='something')
[181]:
pid | weight | height | date | something | nothing | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | None |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | None |
Dropping missing data¶
[182]:
df
[182]:
pid | weight | height | date | something | nothing | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | None |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | None |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | None |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None |
[183]:
df.dropna()
[183]:
pid | weight | height | date | something | nothing |
---|
[185]:
df.dropna(axis=1)
[185]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[186]:
df.dropna(axis=1, how='all')
[186]:
pid | weight | height | date | something | |
---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN |
Brute force replacement of missing values¶
[193]:
df.something.fillna(0)
[193]:
0 1.0
1 1.0
2 0.0
3 2.0
4 0.0
Name: something, dtype: float64
[197]:
df.something.fillna(df.something.mean())
[197]:
0 1.000000
1 1.000000
2 1.333333
3 2.000000
4 1.333333
Name: something, dtype: float64
[194]:
df.something.ffill()
[194]:
0 1.0
1 1.0
2 1.0
3 2.0
4 2.0
Name: something, dtype: float64
[195]:
df.something.bfill()
[195]:
0 1.0
1 1.0
2 2.0
3 2.0
4 NaN
Name: something, dtype: float64
[196]:
df.something.interpolate()
[196]:
0 1.0
1 1.0
2 1.5
3 2.0
4 2.0
Name: something, dtype: float64
Transforming and Creating Columns¶
[198]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)
[198]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | None | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | None | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | None | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None | 21.485449 |
[199]:
df['bmi'] = df['weight'] / (df['height']/100)**2
[200]:
df
[200]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | None | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | None | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | None | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None | 21.485449 |
[201]:
df['something'] = [2,2,None,None,3]
[202]:
df
[202]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 |
Sorting Data Frames¶
Sort on indexes¶
[203]:
df.sort_index(axis=1)
[203]:
bmi | date | height | nothing | pid | something | weight | |
---|---|---|---|---|---|---|---|
0 | 17.533678 | 2018-11-11 14:16:18.148411 | 203.560866 | None | 328 | 2.0 | 72.654347 |
1 | 9.441118 | 2018-11-12 14:16:18.148411 | 189.847316 | None | 756 | 2.0 | 34.027679 |
2 | 11.324404 | 2018-11-13 14:16:18.148411 | 158.646074 | None | 185 | NaN | 28.501914 |
3 | 5.322067 | 2018-11-14 14:16:18.148411 | 180.795993 | None | 507 | NaN | 17.396343 |
4 | 21.485449 | 2018-11-15 14:16:18.148411 | 173.564725 | None | 919 | 3.0 | 64.724301 |
[204]:
df.sort_index(axis=0, ascending=False)
[204]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 |
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 |
Sort on values¶
[205]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])
[205]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 |
Summarizing¶
Apply an aggregation function¶
[206]:
df.select_dtypes(include=np.number)
[206]:
weight | height | something | bmi | |
---|---|---|---|---|
0 | 72.654347 | 203.560866 | 2.0 | 17.533678 |
1 | 34.027679 | 189.847316 | 2.0 | 9.441118 |
2 | 28.501914 | 158.646074 | NaN | 11.324404 |
3 | 17.396343 | 180.795993 | NaN | 5.322067 |
4 | 64.724301 | 173.564725 | 3.0 | 21.485449 |
[207]:
df.select_dtypes(include=np.number).agg(np.sum)
[207]:
weight 217.304584
height 906.414974
something 7.000000
bmi 65.106716
dtype: float64
[208]:
df.agg(['count', np.sum, np.mean])
[208]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
count | 5.0 | 5.000000 | 5.000000 | 5 | 3.000000 | 0.0 | 5.000000 |
sum | NaN | 217.304584 | 906.414974 | NaN | 7.000000 | 0.0 | 65.106716 |
mean | NaN | 43.460917 | 181.282995 | 2018-11-13 14:16:18.148410880 | 2.333333 | NaN | 13.021343 |
Split-Apply-Combine¶
We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with groupby
followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.
[209]:
df['treatment'] = list('ababa')
[210]:
df
[210]:
pid | weight | height | date | something | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a |
[211]:
grouped = df.groupby('treatment')
[212]:
grouped.get_group('a')
[212]:
pid | weight | height | date | something | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a |
[213]:
grouped.mean()
[213]:
weight | height | something | bmi | |
---|---|---|---|---|
treatment | ||||
a | 55.293521 | 178.590555 | 2.5 | 16.781177 |
b | 25.712011 | 185.321654 | 2.0 | 7.381592 |
Using agg with groupby¶
[214]:
grouped.agg('mean')
[214]:
weight | height | something | bmi | |
---|---|---|---|---|
treatment | ||||
a | 55.293521 | 178.590555 | 2.5 | 16.781177 |
b | 25.712011 | 185.321654 | 2.0 | 7.381592 |
[215]:
grouped.agg(['mean', 'std'])
[215]:
weight | height | something | bmi | |||||
---|---|---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | mean | std | |
treatment | ||||||||
a | 55.293521 | 23.538565 | 178.590555 | 22.875289 | 2.5 | 0.707107 | 16.781177 | 5.122148 |
b | 25.712011 | 11.760130 | 185.321654 | 6.400252 | 2.0 | NaN | 7.381592 | 2.912608 |
[216]:
grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})
[216]:
weight | height | bmi | |||
---|---|---|---|---|---|
mean | std | min | max | <lambda> | |
treatment | |||||
a | 55.293521 | 23.538565 | 158.646074 | 203.560866 | 897.296525 |
b | 25.712011 | 11.760130 | 180.795993 | 189.847316 | 117.459100 |
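Since pandas 0.25 there is also "named aggregation", which produces flat, self-documenting column names instead of the MultiIndex above. A sketch with a simplified frame:

```python
import pandas as pd

df = pd.DataFrame(dict(treatment=list('ababa'),
                       weight=[72.7, 34.0, 28.5, 17.4, 64.7]))

# Each keyword names an output column as (input_column, aggregation).
out = df.groupby('treatment').agg(mean_wt=('weight', 'mean'),
                                  max_wt=('weight', 'max'))
print(out.columns.tolist())  # ['mean_wt', 'max_wt']
```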
Using transform with groupby¶
[217]:
g_mean = grouped[['weight', 'height']].transform(np.mean)
g_mean
[217]:
weight | height | |
---|---|---|
0 | 55.293521 | 178.590555 |
1 | 25.712011 | 185.321654 |
2 | 55.293521 | 178.590555 |
3 | 25.712011 | 185.321654 |
4 | 55.293521 | 178.590555 |
[218]:
g_std = grouped[['weight', 'height']].transform(np.std)
g_std
[218]:
weight | height | |
---|---|---|
0 | 23.538565 | 22.875289 |
1 | 11.760130 | 6.400252 |
2 | 23.538565 | 22.875289 |
3 | 11.760130 | 6.400252 |
4 | 23.538565 | 22.875289 |
[219]:
(df[['weight', 'height']] - g_mean)/g_std
[219]:
weight | height | |
---|---|---|
0 | 0.737548 | 1.091584 |
1 | 0.707107 | 0.707107 |
2 | -1.138201 | -0.871879 |
3 | -0.707107 | -0.707107 |
4 | 0.400652 | -0.219706 |
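The three steps above can be collapsed into a single transform with a lambda, since transform broadcasts the group-level result back to the original rows. A sketch with toy numbers:

```python
import pandas as pd

df = pd.DataFrame(dict(g=list('ababa'), x=[1., 2., 3., 4., 5.]))

# Within-group z-scores in one pass; transform keeps the original shape.
z = df.groupby('g')['x'].transform(lambda s: (s - s.mean()) / s.std())
print(z.round(2).tolist())  # [-1.0, -0.71, 0.0, 0.71, 1.0]
```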
Combining Data Frames¶
[220]:
df
[220]:
pid | weight | height | date | something | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a |
[221]:
df1 = df.iloc[3:].copy()
[222]:
df1.drop('something', axis=1, inplace=True)
df1
[222]:
pid | weight | height | date | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | None | 21.485449 | a |
Adding rows¶
Note that pandas aligns by column labels automatically. (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; prefer the equivalent pd.concat shown in the next cell.)
[223]:
df.append(df1, sort=False)
[223]:
pid | weight | height | date | something | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None | 21.485449 | a |
[224]:
pd.concat([df, df1], sort=False)
[224]:
pid | weight | height | date | something | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None | 21.485449 | a |
Adding columns¶
[225]:
df.pid
[225]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: category
Categories (5, int64): [185, 328, 507, 756, 919]
[228]:
df2 = pd.DataFrame(dict(pid=[649, 533, 400, 600], age=[23,34,45,56]))
[229]:
df2.pid
[229]:
0 649
1 533
2 400
3 600
Name: pid, dtype: int64
[230]:
df.pid = df.pid.astype('int')
[231]:
pd.merge(df, df2, on='pid', how='inner')
[231]:
pid | weight | height | date | something | nothing | bmi | treatment | age |
---|
[232]:
pd.merge(df, df2, on='pid', how='left')
[232]:
pid | weight | height | date | something | nothing | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a | NaN |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b | NaN |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b | NaN |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a | NaN |
[233]:
pd.merge(df, df2, on='pid', how='right')
[233]:
pid | weight | height | date | something | nothing | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|---|
0 | 649 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 23 |
1 | 533 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 34 |
2 | 400 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 45 |
3 | 600 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 56 |
[234]:
pd.merge(df, df2, on='pid', how='outer')
[234]:
pid | weight | height | date | something | nothing | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a | NaN |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b | NaN |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b | NaN |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a | NaN |
5 | 649 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 23.0 |
6 | 533 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 34.0 |
7 | 400 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 45.0 |
8 | 600 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 56.0 |
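When auditing an outer merge, indicator=True adds a _merge column recording whether each row came from the left frame, the right frame, or both. A sketch with made-up pids:

```python
import pandas as pd

left = pd.DataFrame(dict(pid=[328, 756], weight=[72.7, 34.0]))
right = pd.DataFrame(dict(pid=[756, 649], age=[34, 23]))

# _merge takes the values 'left_only', 'right_only', or 'both'.
out = pd.merge(left, right, on='pid', how='outer', indicator=True)
print(out[['pid', '_merge']])
```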
Merging on the index¶
[235]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))
[236]:
df1
[236]:
x | |
---|---|
a | 1 |
b | 2 |
c | 3 |
[237]:
df2
[237]:
y | |
---|---|
a | 4 |
b | 5 |
c | 6 |
[238]:
df3
[238]:
z | |
---|---|
a | 7 |
b | 8 |
c | 9 |
[239]:
df1.join([df2, df3])
[239]:
x | y | z | |
---|---|---|---|
a | 1 | 4 | 7 |
b | 2 | 5 | 8 |
c | 3 | 6 | 9 |
Fixing common DataFrame issues¶
Multiple variables in a column¶
[240]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df
[240]:
pid_treat | |
---|---|
0 | A-1 |
1 | B-2 |
2 | C-1 |
3 | D-2 |
[241]:
df.pid_treat.str.split('-')
[241]:
0 [A, 1]
1 [B, 2]
2 [C, 1]
3 [D, 2]
Name: pid_treat, dtype: object
[242]:
df.pid_treat.str.split('-').apply(pd.Series, index=['pid', 'treat'])
[242]:
pid | treat | |
---|---|---|
0 | A | 1 |
1 | B | 2 |
2 | C | 1 |
3 | D | 2 |
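The same split can be done without the apply(pd.Series) round trip: str.split(expand=True) returns a DataFrame directly and is noticeably faster on large data. A sketch:

```python
import pandas as pd

df = pd.DataFrame(dict(pid_treat=['A-1', 'B-2', 'C-1', 'D-2']))

# expand=True returns one column per split field.
parts = df.pid_treat.str.split('-', expand=True)
parts.columns = ['pid', 'treat']
print(parts.shape)  # (4, 2)
```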
Multiple values in a cell¶
[243]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))
df
[243]:
pid | vals | |
---|---|---|
0 | a | (1, 2, 3) |
1 | b | (4, 5, 6) |
2 | c | (7, 8, 9) |
[244]:
df[['t1', 't2', 't3']] = df.vals.apply(pd.Series)
df
[244]:
pid | vals | t1 | t2 | t3 | |
---|---|---|---|---|---|
0 | a | (1, 2, 3) | 1 | 2 | 3 |
1 | b | (4, 5, 6) | 4 | 5 | 6 |
2 | c | (7, 8, 9) | 7 | 8 | 9 |
[245]:
df.drop('vals', axis=1, inplace=True)
[246]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)
[246]:
pid | vals | |
---|---|---|
0 | a | 1 |
1 | b | 4 |
2 | c | 7 |
3 | a | 2 |
4 | b | 5 |
5 | c | 8 |
6 | a | 3 |
7 | b | 6 |
8 | c | 9 |
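Since pandas 0.25, explode offers a more direct route from list-valued cells to long form, one row per element, without the intermediate wide columns. A sketch:

```python
import pandas as pd

df = pd.DataFrame(dict(pid=['a', 'b', 'c'],
                       vals=[(1, 2, 3), (4, 5, 6), (7, 8, 9)]))

# explode repeats the other columns once per element of the list-like cell.
long = df.explode('vals')
print(long.shape)  # (9, 2)
```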
Reshaping Data Frames¶
Sometimes we need to make rows into columns or vice versa.
Converting multiple columns into a single column¶
This is often useful if you need to condition on some variable.
[247]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)
[248]:
iris.head()
[248]:
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
[249]:
iris.shape
[249]:
(150, 5)
[250]:
df_iris = pd.melt(iris, id_vars='species')
[251]:
df_iris.sample(10)
[251]:
species | variable | value | |
---|---|---|---|
102 | virginica | sepal_length | 7.1 |
176 | setosa | sepal_width | 3.4 |
233 | versicolor | sepal_width | 2.7 |
158 | setosa | sepal_width | 2.9 |
116 | virginica | sepal_length | 6.5 |
97 | versicolor | sepal_length | 6.2 |
565 | virginica | petal_width | 2.3 |
525 | versicolor | petal_width | 1.4 |
45 | setosa | sepal_length | 4.8 |
594 | virginica | petal_width | 2.5 |
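The inverse operation, long back to wide, is pivot_table, which also aggregates any duplicate (index, column) pairs (mean by default). A sketch with a few made-up rows in the melted layout:

```python
import pandas as pd

df_long = pd.DataFrame(dict(species=['setosa', 'setosa', 'virginica'],
                            variable=['sepal_length', 'petal_length', 'sepal_length'],
                            value=[5.1, 1.4, 6.5]))

# pivot_table spreads 'variable' back out into columns.
wide = df_long.pivot_table(index='species', columns='variable', values='value')
print(wide.shape)  # (2, 2)
```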
Chaining commands¶
Sometimes you see this functional style of method chaining that avoids the need for temporary intermediate variables.
[252]:
(
iris.
sample(frac=0.2).
filter(regex='s.*').
assign(both=iris.sepal_length + iris.sepal_length).
groupby('species').agg(['mean', 'sum']).
pipe(lambda x: np.around(x, 1))
)
[252]:
sepal_length | sepal_width | both | ||||
---|---|---|---|---|---|---|
mean | sum | mean | sum | mean | sum | |
species | ||||||
setosa | 5.1 | 50.6 | 3.5 | 35.0 | 10.1 | 101.2 |
versicolor | 5.8 | 69.5 | 2.7 | 32.8 | 11.6 | 139.0 |
virginica | 6.6 | 53.1 | 3.0 | 24.0 | 13.3 | 106.2 |