Pandas Part 1

In Part 1, we’ll deal with the basic pandas data structures (Series and DataFrame), loading data from a table, spreadsheet or database, and the basics of data manipulation. In Part 2, we’ll dive into more data manipulation techniques.

For those already familiar with R data munging, see how to do the same thing in pandas.

In [1]:
%matplotlib inline
In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Data structures

Series

Series are similar to numpy arrays.

In [3]:
s1 = Series([1,1,2,4,5,8])
In [4]:
s1
Out[4]:
0    1
1    1
2    2
3    4
4    5
5    8
dtype: int64
In [5]:
s1.values
Out[5]:
array([1, 1, 2, 4, 5, 8])
In [6]:
s1.tolist()
Out[6]:
[1, 1, 2, 4, 5, 8]
In [7]:
s1.index
Out[7]:
RangeIndex(start=0, stop=6, step=1)
In [8]:
s1[s1 > 3]
Out[8]:
3    4
4    5
5    8
dtype: int64
In [9]:
s1[s1 > 3] ** 2
Out[9]:
3    16
4    25
5    64
dtype: int64

Conversions

In [10]:
s1.astype('float')
Out[10]:
0    1.0
1    1.0
2    2.0
3    4.0
4    5.0
5    8.0
dtype: float64
In [11]:
s1.astype('category')
Out[11]:
0    1
1    1
2    2
3    4
4    5
5    8
dtype: category
Categories (5, int64): [1, 2, 4, 5, 8]

Series can also behave like Python dictionaries.

In [12]:
s1.index = ['ann', 'bob', 'charles', 'david', 'esther', 'fred']
In [13]:
s1
Out[13]:
ann        1
bob        1
charles    2
david      4
esther     5
fred       8
dtype: int64
In [14]:
s1['esther']
Out[14]:
5
In [15]:
s2 = Series(dict(ann=3, charles=5, david=7, fred=9, gloria=11))

You can name the Series and its index.

In [16]:
s2.name = 'foo'
s2.index.name = 'name'
In [17]:
s2
Out[17]:
name
ann         3
charles     5
david       7
fred        9
gloria     11
Name: foo, dtype: int64

Automatic alignment on index occurs

In [18]:
s1 + s2
Out[18]:
ann         4.0
bob         NaN
charles     7.0
david      11.0
esther      NaN
fred       17.0
gloria      NaN
dtype: float64

Be careful when using custom integer indexes.

In [19]:
s3 = Series([1,2,3,4])
In [20]:
s4 = Series([1,2,3,4], index=range(1, 5))
In [21]:
s3 + s4
Out[21]:
0    NaN
1    3.0
2    5.0
3    7.0
4    NaN
dtype: float64

DataFrame

Conceptually, a DataFrame is a dict of Series sharing the same index values.

In [22]:
n = len(s1)
name = ['ann arbor', 'bob blanks', 'charles chin', 'david dumas', 'esther einstein', 'fred foster']
major = ['math', 'math', 'bio', 'stats', 'bio', 'stats']
age = np.random.randint(18, 30, n)
ht = np.random.normal(170, 15, n)
wt = np.random.normal(65, 25, n)
df1 = DataFrame(dict(name=name, age=age, major=major, height=ht, weight=wt),
                columns=['name', 'age', 'height', 'weight', 'major'])
In [23]:
df1
Out[23]:
name age height weight major
0 ann arbor 26 157.643795 46.121029 math
1 bob blanks 22 160.558592 15.930474 math
2 charles chin 27 156.663720 83.714374 bio
3 david dumas 25 183.490666 62.801966 stats
4 esther einstein 27 179.075648 71.689364 bio
5 fred foster 20 162.301896 84.537286 stats

Sampling from DataFrame

In [24]:
df1.head(3)
Out[24]:
name age height weight major
0 ann arbor 26 157.643795 46.121029 math
1 bob blanks 22 160.558592 15.930474 math
2 charles chin 27 156.663720 83.714374 bio
In [25]:
df1.tail(3)
Out[25]:
name age height weight major
3 david dumas 25 183.490666 62.801966 stats
4 esther einstein 27 179.075648 71.689364 bio
5 fred foster 20 162.301896 84.537286 stats
In [26]:
df1.sample(3)
Out[26]:
name age height weight major
5 fred foster 20 162.301896 84.537286 stats
2 charles chin 27 156.663720 83.714374 bio
4 esther einstein 27 179.075648 71.689364 bio

Slicing a DataFrame

You can slice by position or by index value.

In [27]:
df2 = DataFrame(np.random.randint(0, 10, (5,4)), index=range(1,6), columns=list('abcd'))
In [28]:
df2
Out[28]:
a b c d
1 9 3 7 2
2 0 9 4 7
3 7 8 3 3
4 4 2 6 4
5 0 6 9 3
In [29]:
df2[2:4]
Out[29]:
a b c d
3 7 8 3 3
4 4 2 6 4

Use loc if you want to retieve by value rather than position.

In [30]:
df2.loc[2:4]
Out[30]:
a b c d
2 0 9 4 7
3 7 8 3 3
4 4 2 6 4

Use iloc to use numpy style indexing.

In [31]:
df2.iloc[2:4]
Out[31]:
a b c d
3 7 8 3 3
4 4 2 6 4
In [32]:
df2.iloc[2:4, 1:3]
Out[32]:
b c
3 8 3
4 2 6
In [33]:
df2.iloc[:, 1:3]
Out[33]:
b c
1 3 7
2 9 4
3 8 3
4 2 6
5 6 9

Getting a Series from a DataFrame

In [34]:
df1['age']
Out[34]:
0    26
1    22
2    27
3    25
4    27
5    20
Name: age, dtype: int64
In [35]:
df1.age
Out[35]:
0    26
1    22
2    27
3    25
4    27
5    20
Name: age, dtype: int64

Gettingnumpy arrays from a DataFrame

In [36]:
df1.values
Out[36]:
array([['ann arbor', 26, 157.64379546715057, 46.12102942626177, 'math'],
       ['bob blanks', 22, 160.55859183627487, 15.93047432813055, 'math'],
       ['charles chin', 27, 156.66371998112533, 83.71437370950592, 'bio'],
       ['david dumas', 25, 183.49066607121097, 62.80196617116858, 'stats'],
       ['esther einstein', 27, 179.07564818672492, 71.68936436315059, 'bio'],
       ['fred foster', 20, 162.30189564187515, 84.53728614200384, 'stats']], dtype=object)

Index

Indexes are immutable collections used to store metadata for pandas Series and DataFrames. They behave like multisets or bags.

In [37]:
df1.index
Out[37]:
RangeIndex(start=0, stop=6, step=1)
In [38]:
df1.columns
Out[38]:
Index(['name', 'age', 'height', 'weight', 'major'], dtype='object')

I/O

Getting data from R

In [39]:
%load_ext rpy2.ipython
In [40]:
iris = %R iris
In [41]:
iris.head()
Out[41]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa

If you have no access to the R magic, you can do this.

In [42]:
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()

iris2 = r['iris']
iris2.head()
Out[42]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa

Reading CSV files

In [43]:
import os

base = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master'
path = os.path.join(base, 'tips.csv')
In [44]:
tips = pd.read_csv(path)
In [45]:
tips.head(4)
Out[45]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
In [46]:
titanic = pd.read_csv(os.path.join(base, 'titanic.csv'))
In [47]:
titanic.head(4)
Out[47]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False

Working with Excel files

In [48]:
tips.to_excel('tips.xlsx')
In [49]:
ls *.xlsx
tips.xlsx
In [50]:
tips2 = pd.read_excel('tips.xlsx')
In [51]:
tips2.head(4)
Out[51]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2

Working with relational databases

In [52]:
! curl -L -O https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   181  100   181    0     0    877      0 --:--:-- --:--:-- --:--:--   878
100 1042k  100 1042k    0     0  1013k      0  0:00:01  0:00:01 --:--:-- 1903k
In [53]:
ls *sqlite
Chinook_Sqlite.sqlite

Standard access

In [54]:
from sqlalchemy import create_engine
In [55]:
engine = 'sqlite:///Chinook_Sqlite.sqlite'
In [56]:
db = create_engine(engine)
In [57]:
db.table_names()
Out[57]:
['Album',
 'Artist',
 'Customer',
 'Employee',
 'Genre',
 'Invoice',
 'InvoiceLine',
 'MediaType',
 'Playlist',
 'PlaylistTrack',
 'Track']
In [58]:
artist = pd.read_sql('select * from Artist where Name like "A%"', db)
In [59]:
artist.head(4)
Out[59]:
ArtistId Name
0 1 AC/DC
1 2 Accept
2 3 Aerosmith
3 4 Alanis Morissette

Ussing SQLmagic

In [60]:
%load_ext sql
In [61]:
%sql sqlite:///Chinook_Sqlite.sqlite
Out[61]:
'Connected: None@Chinook_Sqlite.sqlite'
In [62]:
%sql SELECT name FROM sqlite_master WHERE type='table'
Done.
Out[62]:
name
Album
Artist
Customer
Employee
Genre
Invoice
InvoiceLine
MediaType
Playlist
PlaylistTrack
Track
In [63]:
result = %sql select * from Album;
Done.
In [64]:
album = result.DataFrame()
In [65]:
album.head(4)
Out[65]:
AlbumId Title ArtistId
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2
3 4 Let There Be Rock 1

Basic data manipulation

In [66]:
iris.head()
Out[66]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa

Extracting columns

A single value or list of value selects columns.

In [67]:
iris['Species'].head(3)
Out[67]:
1    setosa
2    setosa
3    setosa
Name: Species, dtype: object
In [68]:
iris[['Sepal.Length', 'Petal.Length']].head(3)
Out[68]:
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3

You can use loc and iloc.

In [69]:
iris.loc[:, :'Sepal.Width'].head(3)
Out[69]:
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
In [70]:
iris.iloc[:, [0,2,4]].head(3)
Out[70]:
Sepal.Length Petal.Length Species
1 5.1 1.4 setosa
2 4.9 1.4 setosa
3 4.7 1.3 setosa

Extracting rows

A single slice returns rows.

In [71]:
iris[2:4]
Out[71]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa

You can use loc or iloc

In [72]:
iris.loc[3:4]
Out[72]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
In [73]:
iris.iloc[2:4]
Out[73]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa

Or use Boolean indexing

In [74]:
iris[iris.Species == 'virginica'].head(3)
Out[74]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica

Using select and filter

select uses a predicate to select rows or columns.

In [75]:
iris.select(lambda x: x.startswith('S'), axis=1).head(3)
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement
  """Entry point for launching an IPython kernel.
Out[75]:
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa

filter subsets based on labels in an index.

In [76]:
iris.filter(items=['Species', 'Sepal.Width']).head(3)
Out[76]:
Species Sepal.Width
1 setosa 3.5
2 setosa 3.0
3 setosa 3.2
In [77]:
iris.filter(items=[5,10,15], axis=0).head(3)
Out[77]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5 5.0 3.6 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
In [78]:
iris.filter(like='Width').head(3)
Out[78]:
Sepal.Width Petal.Width
1 3.5 0.2
2 3.0 0.2
3 3.2 0.2
In [79]:
iris.filter(regex='.+\..+').head(3)
Out[79]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2

Sort

Sorting on index.

In [80]:
iris.sort_index(ascending=False).head(3)
Out[80]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
150 5.9 3.0 5.1 1.8 virginica
149 6.2 3.4 5.4 2.3 virginica
148 6.5 3.0 5.2 2.0 virginica
In [81]:
iris.sort_index(axis=1).head(3)
Out[81]:
Petal.Length Petal.Width Sepal.Length Sepal.Width Species
1 1.4 0.2 5.1 3.5 setosa
2 1.4 0.2 4.9 3.0 setosa
3 1.3 0.2 4.7 3.2 setosa
In [82]:
iris.sort_index(axis=1).head(3)
Out[82]:
Petal.Length Petal.Width Sepal.Length Sepal.Width Species
1 1.4 0.2 5.1 3.5 setosa
2 1.4 0.2 4.9 3.0 setosa
3 1.3 0.2 4.7 3.2 setosa

Sorting on value.

In [83]:
iris.sort_values(by=['Sepal.Length', 'Sepal.Width'], ascending=[True, False]).head()
Out[83]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
14 4.3 3.0 1.1 0.1 setosa
43 4.4 3.2 1.3 0.2 setosa
39 4.4 3.0 1.3 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
42 4.5 2.3 1.3 0.3 setosa

Getting ranks.

In [84]:
iris.rank(axis=0, numeric_only=True, method='first').head(3)
Out[84]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 33.0 126.0 12.0 6.0
2 17.0 58.0 13.0 7.0
3 10.0 95.0 5.0 8.0

Chaining

DataFrame methods can be chained together (like piping).

In [85]:
(
    iris.select(lambda x: x % 2 == 0).
    filter(regex='.+\..+').
    sort_values('Sepal.Length').
    sample(3)
)
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement

Out[85]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
118 7.7 3.8 6.7 2.2
58 4.9 2.4 3.3 1.0
70 5.6 2.5 3.9 1.1

You can add your own function to the chain using pipe.

In [86]:
def f(x, n):
    return x**n
In [87]:
(
    iris.select(lambda x: x % 2 == 0).
    filter(regex='.+\..+').
    sort_values('Sepal.Length').
    pipe(f, n=2).
    sample(3)
)
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement

Out[87]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
92 37.21 9.00 21.16 1.96
124 39.69 7.29 24.01 3.24
78 44.89 9.00 25.00 2.89

Transformations

In [88]:
iris.iloc[:, :-1].head(3)
Out[88]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
In [89]:
(iris.iloc[:, :-1] ** 2).head(3)
Out[89]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 26.01 12.25 1.96 0.04
2 24.01 9.00 1.96 0.04
3 22.09 10.24 1.69 0.04
In [90]:
iris2 = iris.copy()

Creating new columns

In [91]:
iris2['Length'] = (iris2.filter(like='Length')).sum(axis=1)
iris2.head(3)
Out[91]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Length
1 5.1 3.5 1.4 0.2 setosa 6.5
2 4.9 3.0 1.4 0.2 setosa 6.3
3 4.7 3.2 1.3 0.2 setosa 6.0
In [92]:
iris2.assign(Width=iris2.filter(like='Width').sum(axis=1)).head(3)
Out[92]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Length Width
1 5.1 3.5 1.4 0.2 setosa 6.5 3.7
2 4.9 3.0 1.4 0.2 setosa 6.3 3.2
3 4.7 3.2 1.3 0.2 setosa 6.0 3.4

Use applymap to perform an element-wise transformation

In [93]:
f = lambda x: len(str(x))
iris.applymap(f).head(3)
Out[93]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 3 3 3 3 6
2 3 3 3 3 6
3 3 3 3 3 6

For Series, there is a map function

In [94]:
iris['Species'].apply(f).head(3)
Out[94]:
1    6
2    6
3    6
Name: Species, dtype: int64

Accumulations

In [95]:
iris.cumsum().head(3)
Out[95]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 10 6.5 2.8 0.4 setosasetosa
3 14.7 9.7 4.1 0.6 setosasetosasetosa
In [96]:
iris.cummax().head(3)
Out[96]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 5.1 3.5 1.4 0.2 setosa
3 5.1 3.5 1.4 0.2 setosa

Summaries

In [97]:
iris.describe()
Out[97]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Many statistical summaries are built-in

In [98]:
iris.median()
Out[98]:
Sepal.Length    5.80
Sepal.Width     3.00
Petal.Length    4.35
Petal.Width     1.30
dtype: float64
In [99]:
iris.max()
Out[99]:
Sepal.Length          7.9
Sepal.Width           4.4
Petal.Length          6.9
Petal.Width           2.5
Species         virginica
dtype: object
In [100]:
iris.quantile(0.25)
Out[100]:
Sepal.Length    5.1
Sepal.Width     2.8
Petal.Length    1.6
Petal.Width     0.3
Name: 0.25, dtype: float64

You can do this over rows as well, even if it does not make much sense in this example.

In [101]:
iris.sum(axis=1).head(3)
Out[101]:
1    10.2
2     9.5
3     9.4
dtype: float64

Use apply for a custom summary

In [102]:
def cv(x):
    """Coefficient of variation."""

    return x.mean() / x.std()
In [103]:
iris.iloc[:, : -1].apply(cv)
Out[103]:
Sepal.Length    7.056602
Sepal.Width     7.014384
Petal.Length    2.128819
Petal.Width     1.573438
dtype: float64

Correlation and covariance

These work on pairs of values.

In [104]:
iris.corr()
Out[104]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.000000 -0.117570 0.871754 0.817941
Sepal.Width -0.117570 1.000000 -0.428440 -0.366126
Petal.Length 0.871754 -0.428440 1.000000 0.962865
Petal.Width 0.817941 -0.366126 0.962865 1.000000
In [105]:
iris.cov()
Out[105]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.685694 -0.042434 1.274315 0.516271
Sepal.Width -0.042434 0.189979 -0.329656 -0.121639
Petal.Length 1.274315 -0.329656 3.116278 1.295609
Petal.Width 0.516271 -0.121639 1.295609 0.581006
In [106]:
iris.iloc[:,:-1].corrwith(iris['Sepal.Width']).sort_values()
Out[106]:
Petal.Length   -0.428440
Petal.Width    -0.366126
Sepal.Length   -0.117570
Sepal.Width     1.000000
dtype: float64

Finding unique values and count for a Series

In [107]:
iris.Species.unique()
Out[107]:
array(['setosa', 'versicolor', 'virginica'], dtype=object)
In [108]:
iris.Species.value_counts()
Out[108]:
setosa        50
versicolor    50
virginica     50
Name: Species, dtype: int64

Split-apply-combine

In [109]:
g = iris.groupby('Species')
In [110]:
g.get_group('virginica').head(3)
Out[110]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
In [111]:
g.mean()
Out[111]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
Species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
In [112]:
g[['Sepal.Length']].mean()
Out[112]:
Sepal.Length
Species
setosa 5.006
versicolor 5.936
virginica 6.588
In [113]:
g.apply(cv)
Out[113]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
Species
setosa 14.201834 9.043319 8.418556 2.334285
versicolor 11.500062 8.827326 9.065547 6.705345
virginica 10.360452 9.221802 10.059890 7.376660
In [114]:
g.agg(['min', 'max', cv])
Out[114]:
Sepal.Length Sepal.Width Petal.Length Petal.Width
min max cv min max cv min max cv min max cv
Species
setosa 4.3 5.8 14.201834 2.3 4.4 9.043319 1.0 1.9 8.418556 0.1 0.6 2.334285
versicolor 4.9 7.0 11.500062 2.0 3.4 8.827326 3.0 5.1 9.065547 1.0 1.8 6.705345
virginica 4.9 7.9 10.360452 2.2 3.8 9.221802 4.5 6.9 10.059890 1.4 2.5 7.376660

Flexibility of agg

In [115]:
g.agg({
    'Sepal.Length': [('smallest', 'min'), ('largest', 'max')],
    'Sepal.Width': ['mean', 'std', cv],
})
Out[115]:
Sepal.Length Sepal.Width
smallest largest mean std cv
Species
setosa 4.3 5.8 3.428 0.379064 9.043319
versicolor 4.9 7.0 2.770 0.313798 8.827326
virginica 4.9 7.9 2.974 0.322497 9.221802

If you need to do something other than an aggregate function, use apply.

In [116]:
def top_k(df, col, k=3):
    return df.sort_values(col)[:k]
In [117]:
iris.groupby('Species').apply(top_k, col='Sepal.Length', k=2)
Out[117]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Species
setosa 14 4.3 3.0 1.1 0.1 setosa
9 4.4 2.9 1.4 0.2 setosa
versicolor 58 4.9 2.4 3.3 1.0 versicolor
61 5.0 2.0 3.5 1.0 versicolor
virginica 107 4.9 2.5 4.5 1.7 virginica
122 5.6 2.8 4.9 2.0 virginica

Alternaive without group keys.

In [118]:
iris.groupby('Species', group_keys=False).apply(top_k, col='Sepal.Length', k=2)
Out[118]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
14 4.3 3.0 1.1 0.1 setosa
9 4.4 2.9 1.4 0.2 setosa
58 4.9 2.4 3.3 1.0 versicolor
61 5.0 2.0 3.5 1.0 versicolor
107 4.9 2.5 4.5 1.7 virginica
122 5.6 2.8 4.9 2.0 virginica

Grouping on multiple columns

In [119]:
tips.head()
Out[119]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
In [120]:
tips[tips.day=='Sun'].groupby(['sex', 'smoker', 'time']).mean()
Out[120]:
total_bill tip size
sex smoker time
Female No Dinner 20.824286 3.329286 3.071429
Yes Dinner 16.540000 3.500000 2.500000
Male No Dinner 20.403256 3.115349 2.883721
Yes Dinner 26.141333 3.521333 2.600000

Alternative without hierarchical index.

In [121]:
tips[tips.day=='Sun'].groupby(['sex', 'smoker', 'time'], as_index=False).mean()
Out[121]:
sex smoker time total_bill tip size
0 Female No Dinner 20.824286 3.329286 3.071429
1 Female Yes Dinner 16.540000 3.500000 2.500000
2 Male No Dinner 20.403256 3.115349 2.883721
3 Male Yes Dinner 26.141333 3.521333 2.600000

Grouping can be on objects other than an existing column

Using a dictionary to provide mapping.

In [122]:
mapping = {
    'Sepal.Length' : 'Sepal',
    'Sepal.Width': 'Sepal',
    'Petal.Length': 'Petal',
    'Petal.Width': 'Petal',
          }
In [123]:
iris.iloc[:, :-1].groupby(mapping, axis=1).mean().head(4)
Out[123]:
Petal Sepal
1 0.80 4.30
2 0.80 3.95
3 0.75 3.95
4 0.85 3.85

Using a function to provide mapping.

In [124]:
iris.iloc[:, :-1].groupby(lambda x: x.split('.')[0], axis=1).mean().head(4)
Out[124]:
Petal Sepal
1 0.80 4.30
2 0.80 3.95
3 0.75 3.95
4 0.85 3.85