Introduction to pandas¶
[3]:
import numpy as np
import pandas as pd
Series and Data Frames¶
Series objects¶
A Series is like a vector. All elements must share a single type; missing values are stored as NaN (note the upcast to float64 below).
[4]:
s = pd.Series([1,1,2,3] + [None])
s
[4]:
0 1.0
1 1.0
2 2.0
3 3.0
4 NaN
dtype: float64
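If you want integer data to stay integer in the presence of missing values, pandas offers the nullable Int64 extension dtype. A minimal sketch (not part of the original notebook):

```python
import pandas as pd

# Plain int input with a None is upcast to float64 (None becomes NaN).
# The nullable "Int64" extension dtype keeps the values as integers
# and represents the missing entry as pd.NA instead.
s = pd.Series([1, 1, 2, 3, None], dtype="Int64")
print(s.dtype)         # Int64
print(s.isna().sum())  # 1
```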
Special types of series¶
Strings¶
[7]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1
[7]:
0 the quick
1 quick brown
2 brown fox
3 fox jumps
4 jumps over
5 over the
6 the lazy
7 lazy dog
dtype: object
[8]:
s1.str.upper()
[8]:
0 THE QUICK
1 QUICK BROWN
2 BROWN FOX
3 FOX JUMPS
4 JUMPS OVER
5 OVER THE
6 THE LAZY
7 LAZY DOG
dtype: object
[9]:
s1.str.split()
[9]:
0 [the, quick]
1 [quick, brown]
2 [brown, fox]
3 [fox, jumps]
4 [jumps, over]
5 [over, the]
6 [the, lazy]
7 [lazy, dog]
dtype: object
[10]:
s1.str.split().str[1]
[10]:
0 quick
1 brown
2 fox
3 jumps
4 over
5 the
6 lazy
7 dog
dtype: object
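The str accessor also supports vectorized predicates such as str.contains, which combine naturally with boolean indexing. A small sketch reusing the bigram series above:

```python
import pandas as pd

words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(p) for p in zip(words[:-1], words[1:])])

# str.contains returns a boolean Series usable as a filter mask.
mask = s1.str.contains('the')
print(s1[mask].tolist())  # ['the quick', 'over the', 'the lazy']
```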
Categories¶
[11]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2
[11]:
0 Asian
1 Asian
2 White
3 Black
4 White
5 Hispanic
dtype: object
[12]:
s2 = s2.astype('category')
s2
[12]:
0 Asian
1 Asian
2 White
3 Black
4 White
5 Hispanic
dtype: category
Categories (4, object): [Asian, Black, Hispanic, White]
[13]:
s2.cat.categories
[13]:
Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')
[14]:
s2.cat.codes
[14]:
0 0
1 0
2 3
3 1
4 3
5 2
dtype: int8
Ordered categories¶
[22]:
s3 = pd.Series(['Mon', 'Tue', 'Wed', 'Thu', 'Fri']).astype('category')
s3
[22]:
0 Mon
1 Tue
2 Wed
3 Thu
4 Fri
dtype: category
Categories (5, object): [Fri, Mon, Thu, Tue, Wed]
[23]:
s3.cat.ordered
[23]:
False
[24]:
s3.sort_values()
[24]:
4 Fri
0 Mon
3 Thu
1 Tue
2 Wed
dtype: category
Categories (5, object): [Fri, Mon, Thu, Tue, Wed]
[33]:
s3 = s3.cat.reorder_categories(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], ordered=True)
[34]:
s3.cat.ordered
[34]:
True
[35]:
s3.sort_values()
[35]:
0 Mon
1 Tue
2 Wed
3 Thu
4 Fri
dtype: category
Categories (5, object): [Mon < Tue < Wed < Thu < Fri]
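Once a categorical is ordered, comparisons and min/max respect the declared category order rather than alphabetical order. A quick sketch:

```python
import pandas as pd

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
s3 = pd.Series(days).astype('category')
s3 = s3.cat.reorder_categories(days, ordered=True)

# Comparisons use the declared category order, not string order.
print((s3 > 'Tue').tolist())  # [False, False, True, True, True]
print(s3.min(), s3.max())     # Mon Fri
```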
DataFrame objects¶
A DataFrame is like a matrix. Columns in a DataFrame are Series.
Each column in a DataFrame represents a variable
Each row in a DataFrame represents an observation
Each cell in a DataFrame represents a value
[37]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df
[37]:
num | |
---|---|
0 | 1.0 |
1 | 2.0 |
2 | 3.0 |
3 | NaN |
[38]:
df.num
[38]:
0 1.0
1 2.0
2 3.0
3 NaN
Name: num, dtype: float64
Index¶
Row and column identifiers are of Index type. Somewhat confusingly, index is also a synonym for the row identifiers.
[39]:
df.index
[39]:
RangeIndex(start=0, stop=4, step=1)
Setting a column as the row index¶
[40]:
df
[40]:
num | |
---|---|
0 | 1.0 |
1 | 2.0 |
2 | 3.0 |
3 | NaN |
[41]:
df1 = df.set_index('num')
df1
[41]:
num |
---|
1.0 |
2.0 |
3.0 |
NaN |
Sometimes you don’t need to retain the index information¶
[61]:
df = pd.DataFrame(dict(letters = list('ABCDEFG')))
[62]:
df
[62]:
letters | |
---|---|
0 | A |
1 | B |
2 | C |
3 | D |
4 | E |
5 | F |
6 | G |
[63]:
df = df[df.letters.isin(list('AEIOU'))]
[64]:
df
[64]:
letters | |
---|---|
0 | A |
4 | E |
[65]:
df.reset_index(drop=True)
[65]:
letters | |
---|---|
0 | A |
1 | E |
Columns¶
This is just a different index object
[66]:
df.columns
[66]:
Index(['letters'], dtype='object')
Getting raw values¶
Sometimes you just want a raw numpy array rather than a pandas object.
[67]:
df.values
[67]:
array([['A'],
['E']], dtype=object)
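In recent pandas versions, to_numpy() is the recommended way to get the raw array; it has clearer copy and dtype semantics than the values attribute. A sketch:

```python
import pandas as pd

df = pd.DataFrame(dict(letters=['A', 'E']))

# to_numpy() is preferred over .values in pandas >= 0.24.
arr = df.to_numpy()
print(arr.shape)  # (2, 1)
```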
Creating Data Frames¶
Manual¶
[69]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='d')
df = pd.DataFrame(dict(pid=np.random.randint(100, 999, n),
weight=np.random.normal(70, 20, n),
height=np.random.normal(170, 15, n),
date=dates,
))
df
[69]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 842 | 44.598558 | 151.770120 | 2020-01-22 14:18:16.402553 |
1 | 995 | 85.515468 | 172.067821 | 2020-01-23 14:18:16.402553 |
2 | 716 | 93.733898 | 181.213071 | 2020-01-24 14:18:16.402553 |
3 | 887 | 91.109837 | 191.187563 | 2020-01-25 14:18:16.402553 |
4 | 971 | 79.608828 | 176.912780 | 2020-01-26 14:18:16.402553 |
From numpy array¶
[74]:
pd.DataFrame(np.eye(3,2), columns=['A', 'B'], index=['x', 'y', 'z'])
[74]:
A | B | |
---|---|---|
x | 1.0 | 0.0 |
y | 0.0 | 1.0 |
z | 0.0 | 0.0 |
From URL¶
[75]:
url = "https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"
df = pd.read_csv(url)
df.head()
[75]:
sepal.length | sepal.width | petal.length | petal.width | variety | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
From file¶
You can read in data from many different file types - plain text, JSON, spreadsheets, databases etc. Functions to read in data look like read_X
where X is the data type.
[78]:
%%file measures.txt
pid weight height date
328 72.654347 203.560866 2018-11-11 14:16:18.148411
756 34.027679 189.847316 2018-11-12 14:16:18.148411
185 28.501914 158.646074 2018-11-13 14:16:18.148411
507 17.396343 180.795993 2018-11-14 14:16:18.148411
919 64.724301 173.564725 2018-11-15 14:16:18.148411
Overwriting measures.txt
[79]:
df = pd.read_table('measures.txt')
df
[79]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Indexing Data Frames¶
Implicit defaults¶
If you provide a slice, it is assumed that you are asking for rows.
[80]:
df[1:3]
[80]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
If you provide a single value or list, it is assumed that you are asking for columns.
[81]:
df[['pid', 'weight']]
[81]:
pid | weight | |
---|---|---|
0 | 328 | 72.654347 |
1 | 756 | 34.027679 |
2 | 185 | 28.501914 |
3 | 507 | 17.396343 |
4 | 919 | 64.724301 |
Extracting a column¶
Dictionary style access¶
[95]:
df['pid']
[95]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: int64
Property style access¶
This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes, or keywords).
[96]:
df.pid
[96]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: int64
Indexing by location¶
This is similar to numpy indexing.
[29]:
df.iloc[1:3, :]
[29]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
[30]:
df.iloc[1:3, [True, False, True]]
[30]:
pid | height | |
---|---|---|
1 | 756 | 189.847316 |
2 | 185 | 158.646074 |
Indexing by name¶
[31]:
df.loc[1:3, 'weight':'height']
[31]:
weight | height | |
---|---|---|
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
Warning: When using loc, the row slice refers to row labels, not positions, and both endpoints are included.
[32]:
df1 = df.copy()
df1.index = df.index + 1
df1
[32]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
2 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
3 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
4 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
5 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[33]:
df1.loc[1:3, 'weight':'height']
[33]:
weight | height | |
---|---|---|
1 | 72.654347 | 203.560866 |
2 | 34.027679 | 189.847316 |
3 | 28.501914 | 158.646074 |
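The other endpoint difference is worth remembering: iloc slices exclude the stop position (like Python lists), while loc slices include both endpoint labels. A sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame(dict(x=range(10, 15)), index=list('abcde'))

# iloc: positional, stop excluded. loc: label-based, both ends included.
print(df.iloc[1:3].index.tolist())     # ['b', 'c']
print(df.loc['b':'d'].index.tolist())  # ['b', 'c', 'd']
```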
Structure of a Data Frame¶
Data types¶
[107]:
df.dtypes
[107]:
pid int64
weight float64
height float64
date object
dtype: object
Converting data types¶
Using astype
on one column¶
[108]:
df.pid = df.pid.astype('category')
Using astype
on multiple columns¶
[110]:
df = df.astype(dict(weight=float,
height=float))
Using a conversion function¶
[111]:
df.date = pd.to_datetime(df.date)
Check¶
[112]:
df.dtypes
[112]:
pid category
weight float64
height float64
date datetime64[ns]
dtype: object
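For messy columns where astype would raise, pd.to_numeric with errors='coerce' converts what it can and fills the rest with NaN. A sketch with made-up data:

```python
import pandas as pd

s = pd.Series(['1', '2', 'oops', '4'])

# errors='coerce' replaces unparseable entries with NaN instead of raising.
nums = pd.to_numeric(s, errors='coerce')
print(nums.isna().sum())  # 1
```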
Basic properties¶
[113]:
df.size
[113]:
20
[114]:
df.shape
[114]:
(5, 4)
[115]:
df.describe()
[115]:
weight | height | |
---|---|---|
count | 5.000000 | 5.000000 |
mean | 43.460917 | 181.282995 |
std | 23.960945 | 16.895933 |
min | 17.396343 | 158.646074 |
25% | 28.501914 | 173.564725 |
50% | 34.027679 | 180.795993 |
75% | 64.724301 | 189.847316 |
max | 72.654347 | 203.560866 |
[122]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
pid 5 non-null category
weight 5 non-null float64
height 5 non-null float64
date 5 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(2)
memory usage: 453.0 bytes
Inspection¶
[123]:
df.head(n=3)
[123]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
[124]:
df.tail(n=3)
[124]:
pid | weight | height | date | |
---|---|---|---|---|
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[125]:
df.sample(n=3)
[125]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
[126]:
df.sample(frac=0.5)
[126]:
pid | weight | height | date | |
---|---|---|---|---|
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Selecting, Renaming and Removing Columns¶
Selecting columns¶
[128]:
df.filter(items=['pid', 'date'])
[128]:
pid | date | |
---|---|---|
0 | 328 | 2018-11-11 14:16:18.148411 |
1 | 756 | 2018-11-12 14:16:18.148411 |
2 | 185 | 2018-11-13 14:16:18.148411 |
3 | 507 | 2018-11-14 14:16:18.148411 |
4 | 919 | 2018-11-15 14:16:18.148411 |
[129]:
df.filter(regex='.*ght')
[129]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
The like argument keeps labels for which 'like in label' is true - that is, a substring "contains" match.
[151]:
df.filter(like='ei')
[151]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
Filter has an optional axis argument if you want to select by row index¶
[134]:
df.filter([0,1,3,4], axis=0)
[134]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Note that you can also use regular string methods on the columns¶
[135]:
df.loc[:, df.columns.str.contains('d')]
[135]:
pid | date | |
---|---|---|
0 | 328 | 2018-11-11 14:16:18.148411 |
1 | 756 | 2018-11-12 14:16:18.148411 |
2 | 185 | 2018-11-13 14:16:18.148411 |
3 | 507 | 2018-11-14 14:16:18.148411 |
4 | 919 | 2018-11-15 14:16:18.148411 |
Renaming columns¶
[136]:
df.rename(dict(weight='w', height='h'), axis=1)
[136]:
pid | w | h | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[137]:
orig_cols = df.columns
[138]:
df.columns = list('abcd')
[139]:
df
[139]:
a | b | c | d | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[140]:
df.columns = orig_cols
[141]:
df
[141]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Removing columns¶
[142]:
df.drop(['pid', 'date'], axis=1)
[142]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
[143]:
df.drop(columns=['pid', 'date'])
[143]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
[144]:
df.drop(columns=df.columns[df.columns.str.contains('d')])
[144]:
weight | height | |
---|---|---|
0 | 72.654347 | 203.560866 |
1 | 34.027679 | 189.847316 |
2 | 28.501914 | 158.646074 |
3 | 17.396343 | 180.795993 |
4 | 64.724301 | 173.564725 |
Selecting, Renaming and Removing Rows¶
Selecting rows¶
[153]:
df[df.weight.between(60,70)]
[153]:
pid | weight | height | date | |
---|---|---|---|---|
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[154]:
df[(69 <= df.weight) & (df.weight < 70)]
[154]:
pid | weight | height | date |
---|
[155]:
df[df.date.between(pd.to_datetime('2018-11-13'),
pd.to_datetime('2018-11-15 23:59:59'))]
[155]:
pid | weight | height | date | |
---|---|---|---|---|
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[158]:
df.query('weight <= 70 and height > 90')
[158]:
pid | weight | height | date | |
---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
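query strings can also reference Python variables by prefixing them with @, which keeps thresholds out of the string literal. A sketch (weights made up to echo the data above):

```python
import pandas as pd

df = pd.DataFrame(dict(weight=[72.7, 34.0, 28.5, 17.4, 64.7]))

# @cutoff pulls the local variable into the query expression.
cutoff = 30
print(df.query('weight > @cutoff').shape[0])  # 3
```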
Renaming rows¶
[159]:
df.rename({i:letter for i,letter in enumerate('abcde')})
[159]:
pid | weight | height | date | |
---|---|---|---|---|
a | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
b | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
c | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
d | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
e | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[160]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumps']
[161]:
df
[161]:
pid | weight | height | date | |
---|---|---|---|---|
the | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
quick | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
brown | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
fox | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
jumps | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[162]:
df = df.reset_index(drop=True)
[163]:
df
[163]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Dropping rows¶
[164]:
df.drop([1,3], axis=0)
[164]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
Dropping duplicated data¶
[179]:
df['something'] = [1,1,None,2,None]
df['nothing'] = [None, None, None, None, None]
[180]:
df.loc[df.something.duplicated()]
[180]:
pid | weight | height | date | something | nothing | |
---|---|---|---|---|---|---|
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | None |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None |
[181]:
df.drop_duplicates(subset='something')
[181]:
pid | weight | height | date | something | nothing | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | None |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | None |
Dropping missing data¶
[182]:
df
[182]:
pid | weight | height | date | something | nothing | |
---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | None |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | None |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | None |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None |
[183]:
df.dropna()
[183]:
pid | weight | height | date | something | nothing |
---|
[185]:
df.dropna(axis=1)
[185]:
pid | weight | height | date | |
---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 |
[186]:
df.dropna(axis=1, how='all')
[186]:
pid | weight | height | date | something | |
---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN |
Brute force replacement of missing values¶
[193]:
df.something.fillna(0)
[193]:
0 1.0
1 1.0
2 0.0
3 2.0
4 0.0
Name: something, dtype: float64
[197]:
df.something.fillna(df.something.mean())
[197]:
0 1.000000
1 1.000000
2 1.333333
3 2.000000
4 1.333333
Name: something, dtype: float64
[194]:
df.something.ffill()
[194]:
0 1.0
1 1.0
2 1.0
3 2.0
4 2.0
Name: something, dtype: float64
[195]:
df.something.bfill()
[195]:
0 1.0
1 1.0
2 2.0
3 2.0
4 NaN
Name: something, dtype: float64
[196]:
df.something.interpolate()
[196]:
0 1.0
1 1.0
2 1.5
3 2.0
4 2.0
Name: something, dtype: float64
Transforming and Creating Columns¶
[198]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)
[198]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | None | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | None | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | None | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None | 21.485449 |
[199]:
df['bmi'] = df['weight'] / (df['height']/100)**2
[200]:
df
[200]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 1.0 | None | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 1.0 | None | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | 2.0 | None | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None | 21.485449 |
[201]:
df['something'] = [2,2,None,None,3]
[202]:
df
[202]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 |
Sorting Data Frames¶
Sort on indexes¶
[203]:
df.sort_index(axis=1)
[203]:
bmi | date | height | nothing | pid | something | weight | |
---|---|---|---|---|---|---|---|
0 | 17.533678 | 2018-11-11 14:16:18.148411 | 203.560866 | None | 328 | 2.0 | 72.654347 |
1 | 9.441118 | 2018-11-12 14:16:18.148411 | 189.847316 | None | 756 | 2.0 | 34.027679 |
2 | 11.324404 | 2018-11-13 14:16:18.148411 | 158.646074 | None | 185 | NaN | 28.501914 |
3 | 5.322067 | 2018-11-14 14:16:18.148411 | 180.795993 | None | 507 | NaN | 17.396343 |
4 | 21.485449 | 2018-11-15 14:16:18.148411 | 173.564725 | None | 919 | 3.0 | 64.724301 |
[204]:
df.sort_index(axis=0, ascending=False)
[204]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 |
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 |
Sort on values¶
[205]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])
[205]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 |
Summarizing¶
Apply an aggregation function¶
[206]:
df.select_dtypes(include=np.number)
[206]:
weight | height | something | bmi | |
---|---|---|---|---|
0 | 72.654347 | 203.560866 | 2.0 | 17.533678 |
1 | 34.027679 | 189.847316 | 2.0 | 9.441118 |
2 | 28.501914 | 158.646074 | NaN | 11.324404 |
3 | 17.396343 | 180.795993 | NaN | 5.322067 |
4 | 64.724301 | 173.564725 | 3.0 | 21.485449 |
[207]:
df.select_dtypes(include=np.number).agg(np.sum)
[207]:
weight 217.304584
height 906.414974
something 7.000000
bmi 65.106716
dtype: float64
[208]:
df.agg(['count', np.sum, np.mean])
[208]:
pid | weight | height | date | something | nothing | bmi | |
---|---|---|---|---|---|---|---|
count | 5.0 | 5.000000 | 5.000000 | 5 | 3.000000 | 0.0 | 5.000000 |
sum | NaN | 217.304584 | 906.414974 | NaN | 7.000000 | 0.0 | 65.106716 |
mean | NaN | 43.460917 | 181.282995 | 2018-11-13 14:16:18.148410880 | 2.333333 | NaN | 13.021343 |
Split-Apply-Combine¶
We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with groupby
followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.
[209]:
df['treatment'] = list('ababa')
[210]:
df
[210]:
pid | weight | height | date | something | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a |
[211]:
grouped = df.groupby('treatment')
[212]:
grouped.get_group('a')
[212]:
pid | weight | height | date | something | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a |
[213]:
grouped.mean()
[213]:
weight | height | something | bmi | |
---|---|---|---|---|
treatment | ||||
a | 55.293521 | 178.590555 | 2.5 | 16.781177 |
b | 25.712011 | 185.321654 | 2.0 | 7.381592 |
Using agg with groupby¶
[214]:
grouped.agg('mean')
[214]:
weight | height | something | bmi | |
---|---|---|---|---|
treatment | ||||
a | 55.293521 | 178.590555 | 2.5 | 16.781177 |
b | 25.712011 | 185.321654 | 2.0 | 7.381592 |
[215]:
grouped.agg(['mean', 'std'])
[215]:
weight | height | something | bmi | |||||
---|---|---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | mean | std | |
treatment | ||||||||
a | 55.293521 | 23.538565 | 178.590555 | 22.875289 | 2.5 | 0.707107 | 16.781177 | 5.122148 |
b | 25.712011 | 11.760130 | 185.321654 | 6.400252 | 2.0 | NaN | 7.381592 | 2.912608 |
[216]:
grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})
[216]:
weight | height | bmi | |||
---|---|---|---|---|---|
mean | std | min | max | <lambda> | |
treatment | |||||
a | 55.293521 | 23.538565 | 158.646074 | 203.560866 | 897.296525 |
b | 25.712011 | 11.760130 | 180.795993 | 189.847316 | 117.459100 |
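Since pandas 0.25 there is also "named aggregation", which produces flat, self-documenting column names instead of the MultiIndex above. A sketch with a simplified frame:

```python
import pandas as pd

df = pd.DataFrame(dict(treatment=list('ababa'),
                       weight=[72.7, 34.0, 28.5, 17.4, 64.7]))

# Each keyword names an output column as (input_column, aggregation).
out = df.groupby('treatment').agg(mean_wt=('weight', 'mean'),
                                  max_wt=('weight', 'max'))
print(out.columns.tolist())  # ['mean_wt', 'max_wt']
```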
Using transform with groupby¶
[217]:
g_mean = grouped[['weight', 'height']].transform(np.mean)
g_mean
[217]:
weight | height | |
---|---|---|
0 | 55.293521 | 178.590555 |
1 | 25.712011 | 185.321654 |
2 | 55.293521 | 178.590555 |
3 | 25.712011 | 185.321654 |
4 | 55.293521 | 178.590555 |
[218]:
g_std = grouped[['weight', 'height']].transform(np.std)
g_std
[218]:
weight | height | |
---|---|---|
0 | 23.538565 | 22.875289 |
1 | 11.760130 | 6.400252 |
2 | 23.538565 | 22.875289 |
3 | 11.760130 | 6.400252 |
4 | 23.538565 | 22.875289 |
[219]:
(df[['weight', 'height']] - g_mean)/g_std
[219]:
weight | height | |
---|---|---|
0 | 0.737548 | 1.091584 |
1 | 0.707107 | 0.707107 |
2 | -1.138201 | -0.871879 |
3 | -0.707107 | -0.707107 |
4 | 0.400652 | -0.219706 |
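The three steps above can be collapsed into a single transform with a lambda, since transform broadcasts the group-level result back to the original rows. A sketch with toy numbers:

```python
import pandas as pd

df = pd.DataFrame(dict(g=list('ababa'), x=[1., 2., 3., 4., 5.]))

# Within-group z-scores in one pass; transform keeps the original shape.
z = df.groupby('g')['x'].transform(lambda s: (s - s.mean()) / s.std())
print(z.round(2).tolist())  # [-1.0, -0.71, 0.0, 0.71, 1.0]
```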
Combining Data Frames¶
[220]:
df
[220]:
pid | weight | height | date | something | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a |
[221]:
df1 = df.iloc[3:].copy()
[222]:
df1.drop('something', axis=1, inplace=True)
df1
[222]:
pid | weight | height | date | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | None | 21.485449 | a |
Adding rows¶
Note that pandas aligns by column labels automatically. (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; prefer the equivalent pd.concat shown in the next cell.)
[223]:
df.append(df1, sort=False)
[223]:
pid | weight | height | date | something | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None | 21.485449 | a |
[224]:
pd.concat([df, df1], sort=False)
[224]:
pid | weight | height | date | something | nothing | bmi | treatment | |
---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | NaN | None | 21.485449 | a |
Adding columns¶
[225]:
df.pid
[225]:
0 328
1 756
2 185
3 507
4 919
Name: pid, dtype: category
Categories (5, int64): [185, 328, 507, 756, 919]
[228]:
df2 = pd.DataFrame(dict(pid=[649, 533, 400, 600], age=[23,34,45,56]))
[229]:
df2.pid
[229]:
0 649
1 533
2 400
3 600
Name: pid, dtype: int64
[230]:
df.pid = df.pid.astype('int')
[231]:
pd.merge(df, df2, on='pid', how='inner')
[231]:
pid | weight | height | date | something | nothing | bmi | treatment | age |
---|
[232]:
pd.merge(df, df2, on='pid', how='left')
[232]:
pid | weight | height | date | something | nothing | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a | NaN |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b | NaN |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b | NaN |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a | NaN |
[233]:
pd.merge(df, df2, on='pid', how='right')
[233]:
pid | weight | height | date | something | nothing | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|---|
0 | 649 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 23 |
1 | 533 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 34 |
2 | 400 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 45 |
3 | 600 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 56 |
[234]:
pd.merge(df, df2, on='pid', how='outer')
[234]:
pid | weight | height | date | something | nothing | bmi | treatment | age | |
---|---|---|---|---|---|---|---|---|---|
0 | 328 | 72.654347 | 203.560866 | 2018-11-11 14:16:18.148411 | 2.0 | None | 17.533678 | a | NaN |
1 | 756 | 34.027679 | 189.847316 | 2018-11-12 14:16:18.148411 | 2.0 | None | 9.441118 | b | NaN |
2 | 185 | 28.501914 | 158.646074 | 2018-11-13 14:16:18.148411 | NaN | None | 11.324404 | a | NaN |
3 | 507 | 17.396343 | 180.795993 | 2018-11-14 14:16:18.148411 | NaN | None | 5.322067 | b | NaN |
4 | 919 | 64.724301 | 173.564725 | 2018-11-15 14:16:18.148411 | 3.0 | None | 21.485449 | a | NaN |
5 | 649 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 23.0 |
6 | 533 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 34.0 |
7 | 400 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 45.0 |
8 | 600 | NaN | NaN | NaT | NaN | NaN | NaN | NaN | 56.0 |
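When auditing an outer merge, indicator=True adds a _merge column recording whether each row came from the left frame, the right frame, or both. A sketch with made-up pids:

```python
import pandas as pd

left = pd.DataFrame(dict(pid=[328, 756], weight=[72.7, 34.0]))
right = pd.DataFrame(dict(pid=[756, 649], age=[34, 23]))

# _merge takes the values 'left_only', 'right_only', or 'both'.
out = pd.merge(left, right, on='pid', how='outer', indicator=True)
print(out[['pid', '_merge']])
```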
Merging on the index¶
[235]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))
[236]:
df1
[236]:
x | |
---|---|
a | 1 |
b | 2 |
c | 3 |
[237]:
df2
[237]:
y | |
---|---|
a | 4 |
b | 5 |
c | 6 |
[238]:
df3
[238]:
z | |
---|---|
a | 7 |
b | 8 |
c | 9 |
[239]:
df1.join([df2, df3])
[239]:
x | y | z | |
---|---|---|---|
a | 1 | 4 | 7 |
b | 2 | 5 | 8 |
c | 3 | 6 | 9 |
Fixing common DataFrame issues¶
Multiple variables in a column¶
[240]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df
[240]:
pid_treat | |
---|---|
0 | A-1 |
1 | B-2 |
2 | C-1 |
3 | D-2 |
[241]:
df.pid_treat.str.split('-')
[241]:
0 [A, 1]
1 [B, 2]
2 [C, 1]
3 [D, 2]
Name: pid_treat, dtype: object
[242]:
df.pid_treat.str.split('-').apply(pd.Series, index=['pid', 'treat'])
[242]:
pid | treat | |
---|---|---|
0 | A | 1 |
1 | B | 2 |
2 | C | 1 |
3 | D | 2 |
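The same split can be done without the apply(pd.Series) round trip: str.split(expand=True) returns a DataFrame directly and is noticeably faster on large data. A sketch:

```python
import pandas as pd

df = pd.DataFrame(dict(pid_treat=['A-1', 'B-2', 'C-1', 'D-2']))

# expand=True returns one column per split field.
parts = df.pid_treat.str.split('-', expand=True)
parts.columns = ['pid', 'treat']
print(parts.shape)  # (4, 2)
```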
Multiple values in a cell¶
[243]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))
df
[243]:
pid | vals | |
---|---|---|
0 | a | (1, 2, 3) |
1 | b | (4, 5, 6) |
2 | c | (7, 8, 9) |
[244]:
df[['t1', 't2', 't3']] = df.vals.apply(pd.Series)
df
[244]:
pid | vals | t1 | t2 | t3 | |
---|---|---|---|---|---|
0 | a | (1, 2, 3) | 1 | 2 | 3 |
1 | b | (4, 5, 6) | 4 | 5 | 6 |
2 | c | (7, 8, 9) | 7 | 8 | 9 |
[245]:
df.drop('vals', axis=1, inplace=True)
[246]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)
[246]:
pid | vals | |
---|---|---|
0 | a | 1 |
1 | b | 4 |
2 | c | 7 |
3 | a | 2 |
4 | b | 5 |
5 | c | 8 |
6 | a | 3 |
7 | b | 6 |
8 | c | 9 |
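Since pandas 0.25, explode offers a more direct route from list-valued cells to long form, one row per element, without the intermediate wide columns. A sketch:

```python
import pandas as pd

df = pd.DataFrame(dict(pid=['a', 'b', 'c'],
                       vals=[(1, 2, 3), (4, 5, 6), (7, 8, 9)]))

# explode repeats the other columns once per element of the list-like cell.
long = df.explode('vals')
print(long.shape)  # (9, 2)
```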
Reshaping Data Frames¶
Sometimes we need to make rows into columns or vice versa.
Converting multiple columns into a single column¶
This is often useful if you need to condition on some variable.
[247]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)
[248]:
iris.head()
[248]:
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
[249]:
iris.shape
[249]:
(150, 5)
[250]:
df_iris = pd.melt(iris, id_vars='species')
[251]:
df_iris.sample(10)
[251]:
species | variable | value | |
---|---|---|---|
102 | virginica | sepal_length | 7.1 |
176 | setosa | sepal_width | 3.4 |
233 | versicolor | sepal_width | 2.7 |
158 | setosa | sepal_width | 2.9 |
116 | virginica | sepal_length | 6.5 |
97 | versicolor | sepal_length | 6.2 |
565 | virginica | petal_width | 2.3 |
525 | versicolor | petal_width | 1.4 |
45 | setosa | sepal_length | 4.8 |
594 | virginica | petal_width | 2.5 |
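The inverse operation, long back to wide, is pivot_table, which also aggregates any duplicate (index, column) pairs (mean by default). A sketch with a few made-up rows in the melted layout:

```python
import pandas as pd

df_long = pd.DataFrame(dict(species=['setosa', 'setosa', 'virginica'],
                            variable=['sepal_length', 'petal_length', 'sepal_length'],
                            value=[5.1, 1.4, 6.5]))

# pivot_table spreads 'variable' back out into columns.
wide = df_long.pivot_table(index='species', columns='variable', values='value')
print(wide.shape)  # (2, 2)
```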
Chaining commands¶
Sometimes you see this functional style of method chaining that avoids the need for temporary intermediate variables.
[252]:
(
iris.
sample(frac=0.2).
filter(regex='s.*').
assign(both=iris.sepal_length + iris.sepal_length).
groupby('species').agg(['mean', 'sum']).
pipe(lambda x: np.around(x, 1))
)
[252]:
sepal_length | sepal_width | both | ||||
---|---|---|---|---|---|---|
mean | sum | mean | sum | mean | sum | |
species | ||||||
setosa | 5.1 | 50.6 | 3.5 | 35.0 | 10.1 | 101.2 |
versicolor | 5.8 | 69.5 | 2.7 | 32.8 | 11.6 | 139.0 |
virginica | 6.6 | 53.1 | 3.0 | 24.0 | 13.3 | 106.2 |