Introduction to pandas

[3]:
import numpy as np
import pandas as pd

Series and Data Frames

Series objects

A Series is like a vector. All elements must have the same type; missing entries are represented as nulls (note below that the integers are upcast to float64 to accommodate the NaN).

[4]:
s = pd.Series([1,1,2,3] + [None])
s
[4]:
0    1.0
1    1.0
2    2.0
3    3.0
4    NaN
dtype: float64
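As an aside, if you want to keep an integer dtype in the presence of missing values, recent pandas versions also provide nullable extension dtypes; a minimal sketch using the 'Int64' extension dtype:

```python
import pandas as pd

# With the default dtype, introducing None forces a cast to float64 (1 -> 1.0).
# The nullable 'Int64' extension dtype keeps the integers and stores the
# missing entry as pd.NA instead.
s = pd.Series([1, 1, 2, 3, None], dtype='Int64')
print(s.dtype)          # Int64
print(s.isna().sum())   # 1
```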

Size

[5]:
s.size
[5]:
5

Unique Counts

[6]:
s.value_counts()
[6]:
1.0    2
3.0    1
2.0    1
dtype: int64

Special types of series

Strings

[7]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1
[7]:
0      the quick
1    quick brown
2      brown fox
3      fox jumps
4     jumps over
5       over the
6       the lazy
7       lazy dog
dtype: object
[8]:
s1.str.upper()
[8]:
0      THE QUICK
1    QUICK BROWN
2      BROWN FOX
3      FOX JUMPS
4     JUMPS OVER
5       OVER THE
6       THE LAZY
7       LAZY DOG
dtype: object
[9]:
s1.str.split()
[9]:
0      [the, quick]
1    [quick, brown]
2      [brown, fox]
3      [fox, jumps]
4     [jumps, over]
5       [over, the]
6       [the, lazy]
7       [lazy, dog]
dtype: object
[10]:
s1.str.split().str[1]
[10]:
0    quick
1    brown
2      fox
3    jumps
4     over
5      the
6     lazy
7      dog
dtype: object

Categories

[11]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2
[11]:
0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: object
[12]:
s2 = s2.astype('category')
s2
[12]:
0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: category
Categories (4, object): [Asian, Black, Hispanic, White]
[13]:
s2.cat.categories
[13]:
Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')
[14]:
s2.cat.codes
[14]:
0    0
1    0
2    3
3    1
4    3
5    2
dtype: int8

Ordered categories

[22]:
s3 = pd.Series(['Mon', 'Tue', 'Wed', 'Thu', 'Fri']).astype('category')
s3
[22]:
0    Mon
1    Tue
2    Wed
3    Thu
4    Fri
dtype: category
Categories (5, object): [Fri, Mon, Thu, Tue, Wed]
[23]:
s3.cat.ordered
[23]:
False
[24]:
s3.sort_values()
[24]:
4    Fri
0    Mon
3    Thu
1    Tue
2    Wed
dtype: category
Categories (5, object): [Fri, Mon, Thu, Tue, Wed]
[33]:
s3 = s3.cat.reorder_categories(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], ordered=True)
[34]:
s3.cat.ordered
[34]:
True
[35]:
s3.sort_values()
[35]:
0    Mon
1    Tue
2    Wed
3    Thu
4    Fri
dtype: category
Categories (5, object): [Mon < Tue < Wed < Thu < Fri]
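Once the categories are ordered, comparison operators respect the category order rather than alphabetical order, so you can filter with them; a small sketch:

```python
import pandas as pd

s3 = pd.Series(['Mon', 'Tue', 'Wed', 'Thu', 'Fri']).astype('category')
s3 = s3.cat.reorder_categories(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], ordered=True)

# With ordered categories, 'Wed' > 'Tue' holds by category order,
# so this keeps the days after Tuesday.
late_week = s3[s3 > 'Tue']
print(late_week.tolist())  # ['Wed', 'Thu', 'Fri']
```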

DataFrame objects

A DataFrame is like a matrix. Columns in a DataFrame are Series.

  • Each column in a DataFrame represents a variable

  • Each row in a DataFrame represents an observation

  • Each cell in a DataFrame represents a value

[37]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df
[37]:
num
0 1.0
1 2.0
2 3.0
3 NaN
[38]:
df.num
[38]:
0    1.0
1    2.0
2    3.0
3    NaN
Name: num, dtype: float64

Index

Row and column identifiers are of Index type.

Somewhat confusingly, index is also a synonym for the row identifiers.

[39]:
df.index
[39]:
RangeIndex(start=0, stop=4, step=1)

Setting a column as the row index

[40]:
df
[40]:
num
0 1.0
1 2.0
2 3.0
3 NaN
[41]:
df1 = df.set_index('num')
df1
[41]:
num
1.0
2.0
3.0
NaN

Making an index into a column

[43]:
df1.reset_index()
[43]:
num
0 1.0
1 2.0
2 3.0
3 NaN

Sometimes you don’t need to retain the index information.

[61]:
df = pd.DataFrame(dict(letters = list('ABCDEFG')))
[62]:
df
[62]:
letters
0 A
1 B
2 C
3 D
4 E
5 F
6 G
[63]:
df = df[df.letters.isin(list('AEIOU'))]
[64]:
df
[64]:
letters
0 A
4 E
[65]:
df.reset_index(drop=True)
[65]:
letters
0 A
1 E

Columns

The columns attribute is just another Index object.

[66]:
df.columns
[66]:
Index(['letters'], dtype='object')

Getting raw values

Sometimes you just want a numpy array, and not a pandas object.

[67]:
df.values
[67]:
array([['A'],
       ['E']], dtype=object)
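Recent pandas versions recommend the to_numpy() method over the values attribute for this; a minimal sketch on the same two-row frame:

```python
import pandas as pd

df = pd.DataFrame(dict(letters=['A', 'E']))

# to_numpy() is the recommended way to get the raw array in recent pandas.
arr = df.to_numpy()
print(arr.shape)  # (2, 1)
print(arr.dtype)  # object
```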

Creating Data Frames

Manual

[69]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='d')
df = pd.DataFrame(dict(pid=np.random.randint(100, 999, n),
                       weight=np.random.normal(70, 20, n),
                       height=np.random.normal(170, 15, n),
                       date=dates,
                             ))
df
[69]:
pid weight height date
0 842 44.598558 151.770120 2020-01-22 14:18:16.402553
1 995 85.515468 172.067821 2020-01-23 14:18:16.402553
2 716 93.733898 181.213071 2020-01-24 14:18:16.402553
3 887 91.109837 191.187563 2020-01-25 14:18:16.402553
4 971 79.608828 176.912780 2020-01-26 14:18:16.402553

From numpy array

[74]:
pd.DataFrame(np.eye(3,2), columns=['A', 'B'], index=['x', 'y', 'z'])
[74]:
A B
x 1.0 0.0
y 0.0 1.0
z 0.0 0.0

From URL

[75]:
url = "https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"

df = pd.read_csv(url)
df.head()
[75]:
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa

From file

You can read in data from many different file types: plain text, JSON, spreadsheets, databases, etc. Functions to read in data are named read_X, where X indicates the format (e.g. read_csv, read_json, read_excel, read_sql).

[78]:
%%file measures.txt
pid weight  height  date
328 72.654347       203.560866      2018-11-11 14:16:18.148411
756 34.027679       189.847316      2018-11-12 14:16:18.148411
185 28.501914       158.646074      2018-11-13 14:16:18.148411
507 17.396343       180.795993      2018-11-14 14:16:18.148411
919 64.724301       173.564725      2018-11-15 14:16:18.148411
Overwriting measures.txt
[79]:
df = pd.read_table('measures.txt')
df
[79]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Indexing Data Frames

Implicit defaults

If you provide a slice, it is assumed that you are asking for rows.

[80]:
df[1:3]
[80]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411

If you provide a single value or list, it is assumed that you are asking for columns.

[81]:
df[['pid', 'weight']]
[81]:
pid weight
0 328 72.654347
1 756 34.027679
2 185 28.501914
3 507 17.396343
4 919 64.724301

Extracting a column

Dictionary style access

[95]:
df['pid']
[95]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64

Property style access

This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes, or keywords).

[96]:
df.pid
[96]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64
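To see why the restriction matters, here is a sketch with a made-up column name containing a space, which can only be reached with dictionary-style access:

```python
import pandas as pd

# 'patient id' is a hypothetical column name; df.patient id would be a
# SyntaxError, so bracket access is the only option for names like this.
df = pd.DataFrame({'patient id': [328, 756], 'weight': [72.7, 34.0]})

print(df['patient id'].tolist())  # [328, 756]
print(df.weight.tolist())         # attribute access works for 'weight'
```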

Indexing by location

This is similar to numpy indexing.

[29]:
df.iloc[1:3, :]
[29]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
[30]:
df.iloc[1:3, [True, False, True]]
[30]:
pid height
1 756 189.847316
2 185 158.646074

Indexing by name

[31]:
df.loc[1:3, 'weight':'height']
[31]:
weight height
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993

Warning: When using loc, the row slice refers to row labels, not positions, and both endpoints are included.

[32]:
df1 = df.copy()
df1.index = df.index + 1
df1
[32]:
pid weight height date
1 328 72.654347 203.560866 2018-11-11 14:16:18.148411
2 756 34.027679 189.847316 2018-11-12 14:16:18.148411
3 185 28.501914 158.646074 2018-11-13 14:16:18.148411
4 507 17.396343 180.795993 2018-11-14 14:16:18.148411
5 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[33]:
df1.loc[1:3, 'weight':'height']
[33]:
weight height
1 72.654347 203.560866
2 34.027679 189.847316
3 28.501914 158.646074

Structure of a Data Frame

Data types

[107]:
df.dtypes
[107]:
pid         int64
weight    float64
height    float64
date       object
dtype: object

Converting data types

Using astype on one column

[108]:
df.pid = df.pid.astype('category')

Using astype on multiple columns

[110]:
df = df.astype(dict(weight=float,
                    height=float))

Using a conversion function

[111]:
df.date = pd.to_datetime(df.date)

Check

[112]:
df.dtypes
[112]:
pid             category
weight           float64
height           float64
date      datetime64[ns]
dtype: object

Basic properties

[113]:
df.size
[113]:
20
[114]:
df.shape
[114]:
(5, 4)
[115]:
df.describe()
[115]:
weight height
count 5.000000 5.000000
mean 43.460917 181.282995
std 23.960945 16.895933
min 17.396343 158.646074
25% 28.501914 173.564725
50% 34.027679 180.795993
75% 64.724301 189.847316
max 72.654347 203.560866
[122]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
pid       5 non-null category
weight    5 non-null float64
height    5 non-null float64
date      5 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(2)
memory usage: 453.0 bytes

Inspection

[123]:
df.head(n=3)
[123]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
[124]:
df.tail(n=3)
[124]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[125]:
df.sample(n=3)
[125]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
[126]:
df.sample(frac=0.5)
[126]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Selecting, Renaming and Removing Columns

Selecting columns

[128]:
df.filter(items=['pid', 'date'])
[128]:
pid date
0 328 2018-11-11 14:16:18.148411
1 756 2018-11-12 14:16:18.148411
2 185 2018-11-13 14:16:18.148411
3 507 2018-11-14 14:16:18.148411
4 919 2018-11-15 14:16:18.148411
[129]:
df.filter(regex='.*ght')
[129]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725

The like argument keeps columns whose labels contain the given substring; here, any column name containing 'ei'.

[151]:
df.filter(like='ei')
[151]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725

The filter method has an optional axis argument if you want to select by row index instead.

[134]:
df.filter([0,1,3,4], axis=0)
[134]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Note that you can also use regular string methods on the columns.

[135]:
df.loc[:, df.columns.str.contains('d')]
[135]:
pid date
0 328 2018-11-11 14:16:18.148411
1 756 2018-11-12 14:16:18.148411
2 185 2018-11-13 14:16:18.148411
3 507 2018-11-14 14:16:18.148411
4 919 2018-11-15 14:16:18.148411

Renaming columns

[136]:
df.rename(dict(weight='w', height='h'), axis=1)
[136]:
pid w h date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[137]:
orig_cols = df.columns
[138]:
df.columns = list('abcd')
[139]:
df
[139]:
a b c d
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[140]:
df.columns = orig_cols
[141]:
df
[141]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Removing columns

[142]:
df.drop(['pid', 'date'], axis=1)
[142]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725
[143]:
df.drop(columns=['pid', 'date'])
[143]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725
[144]:
df.drop(columns=df.columns[df.columns.str.contains('d')])
[144]:
weight height
0 72.654347 203.560866
1 34.027679 189.847316
2 28.501914 158.646074
3 17.396343 180.795993
4 64.724301 173.564725

Selecting, Renaming and Removing Rows

Selecting rows

[153]:
df[df.weight.between(60,70)]
[153]:
pid weight height date
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[154]:
df[(69 <= df.weight) & (df.weight < 70)]
[154]:
pid weight height date
[155]:
df[df.date.between(pd.to_datetime('2018-11-13'),
                   pd.to_datetime('2018-11-15 23:59:59'))]
[155]:
pid weight height date
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[158]:
df.query('weight <= 70 and height > 90')
[158]:
pid weight height date
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
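Query expressions can also refer to Python variables in the enclosing scope by prefixing them with @; a sketch on made-up weights:

```python
import pandas as pd

df = pd.DataFrame(dict(weight=[72.7, 34.0, 28.5, 17.4, 64.7]))
cutoff = 50

# @cutoff pulls the local Python variable into the query expression.
light = df.query('weight < @cutoff')
print(len(light))  # 3
```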

Renaming rows

[159]:
df.rename({i:letter for i,letter in enumerate('abcde')})
[159]:
pid weight height date
a 328 72.654347 203.560866 2018-11-11 14:16:18.148411
b 756 34.027679 189.847316 2018-11-12 14:16:18.148411
c 185 28.501914 158.646074 2018-11-13 14:16:18.148411
d 507 17.396343 180.795993 2018-11-14 14:16:18.148411
e 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[160]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumphs']
[161]:
df
[161]:
pid weight height date
the 328 72.654347 203.560866 2018-11-11 14:16:18.148411
quick 756 34.027679 189.847316 2018-11-12 14:16:18.148411
brown 185 28.501914 158.646074 2018-11-13 14:16:18.148411
fox 507 17.396343 180.795993 2018-11-14 14:16:18.148411
jumphs 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[162]:
df = df.reset_index(drop=True)
[163]:
df
[163]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Dropping rows

[164]:
df.drop([1,3], axis=0)
[164]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411

Dropping duplicated data

[179]:
df['something'] = [1,1,None,2,None]
df['nothing'] = [None, None, None, None, None]
[180]:
df.loc[df.something.duplicated()]
[180]:
pid weight height date something nothing
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 None
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN None
[181]:
df.drop_duplicates(subset='something')
[181]:
pid weight height date something nothing
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 None
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 None

Dropping missing data

[182]:
df
[182]:
pid weight height date something nothing
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 None
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 None
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 None
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN None
[183]:
df.dropna()
[183]:
pid weight height date something nothing
[185]:
df.dropna(axis=1)
[185]:
pid weight height date
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411
[186]:
df.dropna(axis=1, how='all')
[186]:
pid weight height date something
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN

Brute force replacement of missing values

[193]:
df.something.fillna(0)
[193]:
0    1.0
1    1.0
2    0.0
3    2.0
4    0.0
Name: something, dtype: float64
[197]:
df.something.fillna(df.something.mean())
[197]:
0    1.000000
1    1.000000
2    1.333333
3    2.000000
4    1.333333
Name: something, dtype: float64
[194]:
df.something.ffill()
[194]:
0    1.0
1    1.0
2    1.0
3    2.0
4    2.0
Name: something, dtype: float64
[195]:
df.something.bfill()
[195]:
0    1.0
1    1.0
2    2.0
3    2.0
4    NaN
Name: something, dtype: float64
[196]:
df.something.interpolate()
[196]:
0    1.0
1    1.0
2    1.5
3    2.0
4    2.0
Name: something, dtype: float64

Transforming and Creating Columns

[198]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)
[198]:
pid weight height date something nothing bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 None 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 None 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 None 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN None 21.485449
[199]:
df['bmi'] = df['weight'] / (df['height']/100)**2
[200]:
df
[200]:
pid weight height date something nothing bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 None 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 None 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 None 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN None 21.485449
[201]:
df['something'] = [2,2,None,None,3]
[202]:
df
[202]:
pid weight height date something nothing bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 None 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 None 9.441118
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN None 5.322067
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 None 21.485449

Sorting Data Frames

Sort on indexes

[203]:
df.sort_index(axis=1)
[203]:
bmi date height nothing pid something weight
0 17.533678 2018-11-11 14:16:18.148411 203.560866 None 328 2.0 72.654347
1 9.441118 2018-11-12 14:16:18.148411 189.847316 None 756 2.0 34.027679
2 11.324404 2018-11-13 14:16:18.148411 158.646074 None 185 NaN 28.501914
3 5.322067 2018-11-14 14:16:18.148411 180.795993 None 507 NaN 17.396343
4 21.485449 2018-11-15 14:16:18.148411 173.564725 None 919 3.0 64.724301
[204]:
df.sort_index(axis=0, ascending=False)
[204]:
pid weight height date something nothing bmi
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 None 21.485449
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN None 5.322067
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 None 9.441118
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 None 17.533678

Sort on values

[205]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])
[205]:
pid weight height date something nothing bmi
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 None 17.533678
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 None 9.441118
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 None 21.485449
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN None 5.322067

Summarizing

Apply an aggregation function

[206]:
df.select_dtypes(include=np.number)
[206]:
weight height something bmi
0 72.654347 203.560866 2.0 17.533678
1 34.027679 189.847316 2.0 9.441118
2 28.501914 158.646074 NaN 11.324404
3 17.396343 180.795993 NaN 5.322067
4 64.724301 173.564725 3.0 21.485449
[207]:
df.select_dtypes(include=np.number).agg(np.sum)
[207]:
weight       217.304584
height       906.414974
something      7.000000
bmi           65.106716
dtype: float64
[208]:
df.agg(['count', np.sum, np.mean])
[208]:
pid weight height date something nothing bmi
count 5.0 5.000000 5.000000 5 3.000000 0.0 5.000000
sum NaN 217.304584 906.414974 NaN 7.000000 0.0 65.106716
mean NaN 43.460917 181.282995 2018-11-13 14:16:18.148410880 2.333333 NaN 13.021343

Split-Apply-Combine

We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with groupby followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.

[209]:
df['treatment'] = list('ababa')
[210]:
df
[210]:
pid weight height date something nothing bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 None 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 None 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN None 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 None 21.485449 a
[211]:
grouped = df.groupby('treatment')
[212]:
grouped.get_group('a')
[212]:
pid weight height date something nothing bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 None 17.533678 a
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404 a
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 None 21.485449 a
[213]:
grouped.mean()
[213]:
weight height something bmi
treatment
a 55.293521 178.590555 2.5 16.781177
b 25.712011 185.321654 2.0 7.381592

Using agg with groupby

[214]:
grouped.agg('mean')
[214]:
weight height something bmi
treatment
a 55.293521 178.590555 2.5 16.781177
b 25.712011 185.321654 2.0 7.381592
[215]:
grouped.agg(['mean', 'std'])
[215]:
weight height something bmi
mean std mean std mean std mean std
treatment
a 55.293521 23.538565 178.590555 22.875289 2.5 0.707107 16.781177 5.122148
b 25.712011 11.760130 185.321654 6.400252 2.0 NaN 7.381592 2.912608
[216]:
grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})
[216]:
weight height bmi
mean std min max <lambda>
treatment
a 55.293521 23.538565 158.646074 203.560866 897.296525
b 25.712011 11.760130 180.795993 189.847316 117.459100
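Recent pandas versions also support named aggregation, which gives flat, readable output column names instead of a column MultiIndex; a sketch assuming the same treatment grouping (with made-up weights):

```python
import pandas as pd

df = pd.DataFrame(dict(treatment=list('ababa'),
                       weight=[72.7, 34.0, 28.5, 17.4, 64.7]))

# Each keyword becomes an output column, named as (input column, aggregation).
summary = df.groupby('treatment').agg(
    mean_weight=('weight', 'mean'),
    max_weight=('weight', 'max'),
)
print(summary.columns.tolist())  # ['mean_weight', 'max_weight']
```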

Using transform with groupby

[217]:
g_mean = grouped[['weight', 'height']].transform(np.mean)
g_mean
g_mean
[217]:
weight height
0 55.293521 178.590555
1 25.712011 185.321654
2 55.293521 178.590555
3 25.712011 185.321654
4 55.293521 178.590555
[218]:
g_std = grouped[['weight', 'height']].transform(np.std)
g_std
g_std
[218]:
weight height
0 23.538565 22.875289
1 11.760130 6.400252
2 23.538565 22.875289
3 11.760130 6.400252
4 23.538565 22.875289
[219]:
(df[['weight', 'height']] - g_mean)/g_std
[219]:
weight height
0 0.737548 1.091584
1 0.707107 0.707107
2 -1.138201 -0.871879
3 -0.707107 -0.707107
4 0.400652 -0.219706
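The same group-wise standardization can be done in a single step by passing a function to transform; a sketch on toy data (two observations per group, so the sample standard deviation is defined):

```python
import pandas as pd

df = pd.DataFrame(dict(treatment=list('abab'),
                       weight=[10.0, 5.0, 20.0, 15.0]))

# transform broadcasts each group's result back to the original row order,
# so every value is standardized against its own group's mean and std.
z = df.groupby('treatment')['weight'].transform(lambda x: (x - x.mean()) / x.std())
print(z.round(3).tolist())  # [-0.707, -0.707, 0.707, 0.707]
```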

Combining Data Frames

[220]:
df
[220]:
pid weight height date something nothing bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 None 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 None 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN None 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 None 21.485449 a
[221]:
df1 =  df.iloc[3:].copy()
[222]:
df1.drop('something', axis=1, inplace=True)
df1
[222]:
pid weight height date nothing bmi treatment
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 None 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 None 21.485449 a

Adding rows

Note that pandas aligns by column indexes automatically.

[223]:
df.append(df1, sort=False)
[223]:
pid weight height date something nothing bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 None 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 None 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN None 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 None 21.485449 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN None 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN None 21.485449 a
[224]:
pd.concat([df, df1], sort=False)
[224]:
pid weight height date something nothing bmi treatment
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 None 17.533678 a
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 None 9.441118 b
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN None 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 None 21.485449 a
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN None 5.322067 b
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN None 21.485449 a
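Note that DataFrame.append was deprecated and later removed (in pandas 2.0), so pd.concat is the form to prefer going forward. Passing ignore_index=True additionally discards the duplicated row labels and rebuilds a fresh RangeIndex; a sketch:

```python
import pandas as pd

df_a = pd.DataFrame(dict(x=[1, 2]))
df_b = pd.DataFrame(dict(x=[3, 4]))

# ignore_index=True renumbers the rows 0..n-1 instead of keeping 0,1,0,1.
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2, 3]
```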

Adding columns

[225]:
df.pid
[225]:
0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: category
Categories (5, int64): [185, 328, 507, 756, 919]
[228]:
df2 = pd.DataFrame(dict(pid=[649, 533, 400, 600], age=[23,34,45,56]))
[229]:
df2.pid
[229]:
0    649
1    533
2    400
3    600
Name: pid, dtype: int64
[230]:
df.pid = df.pid.astype('int')
[231]:
pd.merge(df, df2, on='pid', how='inner')
[231]:
pid weight height date something nothing bmi treatment age
[232]:
pd.merge(df, df2, on='pid', how='left')
[232]:
pid weight height date something nothing bmi treatment age
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 None 17.533678 a NaN
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 None 9.441118 b NaN
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404 a NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN None 5.322067 b NaN
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 None 21.485449 a NaN
[233]:
pd.merge(df, df2, on='pid', how='right')
[233]:
pid weight height date something nothing bmi treatment age
0 649 NaN NaN NaT NaN NaN NaN NaN 23
1 533 NaN NaN NaT NaN NaN NaN NaN 34
2 400 NaN NaN NaT NaN NaN NaN NaN 45
3 600 NaN NaN NaT NaN NaN NaN NaN 56
[234]:
pd.merge(df, df2, on='pid', how='outer')
[234]:
pid weight height date something nothing bmi treatment age
0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 None 17.533678 a NaN
1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 None 9.441118 b NaN
2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN None 11.324404 a NaN
3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN None 5.322067 b NaN
4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 None 21.485449 a NaN
5 649 NaN NaN NaT NaN NaN NaN NaN 23.0
6 533 NaN NaN NaT NaN NaN NaN NaN 34.0
7 400 NaN NaN NaT NaN NaN NaN NaN 45.0
8 600 NaN NaN NaT NaN NaN NaN NaN 56.0
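When debugging merges, the indicator argument adds a _merge column recording which side each row came from; a sketch on two tiny made-up frames:

```python
import pandas as pd

left = pd.DataFrame(dict(pid=[1, 2], w=[10, 20]))
right = pd.DataFrame(dict(pid=[2, 3], h=[100, 200]))

# indicator=True labels each row 'left_only', 'right_only', or 'both'.
merged = pd.merge(left, right, on='pid', how='outer', indicator=True)
print(merged['_merge'].astype(str).tolist())
```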

Merging on the index

[235]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))
[236]:
df1
[236]:
x
a 1
b 2
c 3
[237]:
df2
[237]:
y
a 4
b 5
c 6
[238]:
df3
[238]:
z
a 7
b 8
c 9
[239]:
df1.join([df2, df3])
[239]:
x y z
a 1 4 7
b 2 5 8
c 3 6 9

Fixing common DataFrame issues

Multiple variables in a column

[240]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df
[240]:
pid_treat
0 A-1
1 B-2
2 C-1
3 D-2
[241]:
df.pid_treat.str.split('-')
[241]:
0    [A, 1]
1    [B, 2]
2    [C, 1]
3    [D, 2]
Name: pid_treat, dtype: object
[242]:
df.pid_treat.str.split('-').apply(pd.Series, index=['pid', 'treat'])
[242]:
pid treat
0 A 1
1 B 2
2 C 1
3 D 2
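The same split can be done more directly with expand=True, which returns a DataFrame (one column per piece) instead of a Series of lists; a sketch:

```python
import pandas as pd

df = pd.DataFrame(dict(pid_treat=['A-1', 'B-2', 'C-1', 'D-2']))

# expand=True turns each split piece into its own column.
parts = df.pid_treat.str.split('-', expand=True)
parts.columns = ['pid', 'treat']
print(parts.pid.tolist())    # ['A', 'B', 'C', 'D']
print(parts.treat.tolist())  # ['1', '2', '1', '2']
```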

Multiple values in a cell

[243]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))
df
[243]:
pid vals
0 a (1, 2, 3)
1 b (4, 5, 6)
2 c (7, 8, 9)
[244]:
df[['t1', 't2', 't3']]  = df.vals.apply(pd.Series)
df
[244]:
pid vals t1 t2 t3
0 a (1, 2, 3) 1 2 3
1 b (4, 5, 6) 4 5 6
2 c (7, 8, 9) 7 8 9
[245]:
df.drop('vals', axis=1, inplace=True)
[246]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)
[246]:
pid vals
0 a 1
1 b 4
2 c 7
3 a 2
4 b 5
5 c 8
6 a 3
7 b 6
8 c 9
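In recent pandas versions, explode offers a more direct route from a cell of tuples to one value per row, without the intermediate apply/melt steps; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame(dict(pid=['a', 'b', 'c'],
                       vals=[(1, 2, 3), (4, 5, 6), (7, 8, 9)]))

# explode repeats the other columns once per element of each tuple.
long = df.explode('vals').reset_index(drop=True)
print(long.shape)          # (9, 2)
print(long.vals.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```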

Reshaping Data Frames

Sometimes we need to make rows into columns or vice versa.

Converting multiple columns into a single column

This is often useful if you need to condition on some variable.

[247]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)
[248]:
iris.head()
[248]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
[249]:
iris.shape
[249]:
(150, 5)
[250]:
df_iris = pd.melt(iris, id_vars='species')
[251]:
df_iris.sample(10)
[251]:
species variable value
102 virginica sepal_length 7.1
176 setosa sepal_width 3.4
233 versicolor sepal_width 2.7
158 setosa sepal_width 2.9
116 virginica sepal_length 6.5
97 versicolor sepal_length 6.2
565 virginica petal_width 2.3
525 versicolor petal_width 1.4
45 setosa sepal_length 4.8
594 virginica petal_width 2.5

Chaining commands

Sometimes you see this functional style of method chaining, which avoids the need for temporary intermediate variables.

[252]:
(
    iris.
    sample(frac=0.2).
    filter(regex='s.*').
    assign(both=iris.sepal_length + iris.sepal_length).
    groupby('species').agg(['mean', 'sum']).
    pipe(lambda x: np.around(x, 1))
)
[252]:
sepal_length sepal_width both
mean sum mean sum mean sum
species
setosa 5.1 50.6 3.5 35.0 10.1 101.2
versicolor 5.8 69.5 2.7 32.8 11.6 139.0
virginica 6.6 53.1 3.0 24.0 13.3 106.2