Pandas Part 1¶

In Part 1, we’ll deal with the basic pandas data structures (Series and DataFrame), loading data from a table, spreadsheet or database, and the basics of data manipulation. In Part 2, we’ll dive into more data manipulation techniques.

For those already familiar with R data munging, see how to do the same thing in pandas.

In [1]:

%matplotlib inline

In [2]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Data structures¶

Series¶

Series are similar to numpy arrays.

In [3]:

s1 = Series([1,1,2,4,5,8])

In [4]:

s1

Out[4]:

  1
  1
  2
  4
  5
  8
dtype: int64

In [5]:

s1.values

Out[5]:

array([1, 1, 2, 4, 5, 8])

In [6]:

s1.tolist()

Out[6]:

[1, 1, 2, 4, 5, 8]

In [7]:

s1.index

Out[7]:

RangeIndex(start=0, stop=6, step=1)

In [8]:

s1[s1 > 3]

Out[8]:

  4
  5
  8
dtype: int64

In [9]:

s1[s1 > 3] ** 2

Out[9]:

  16
  25
  64
dtype: int64

Conversions¶

In [10]:

s1.astype('float')

Out[10]:

  1.0
  1.0
  2.0
  4.0
  5.0
  8.0
dtype: float64

In [11]:

s1.astype('category')

Out[11]:

  1
  1
  2
  4
  5
  8
dtype: category
Categories (5, int64): [1, 2, 4, 5, 8]

Series can also behave like Python dictionaries.

In [12]:

s1.index = ['ann', 'bob', 'charles', 'david', 'esther', 'fred']

In [13]:

s1

Out[13]:

ann        1
bob        1
charles    2
david      4
esther     5
fred       8
dtype: int64

In [14]:

s1['esther']

Out[14]:

In [15]:

s2 = Series(dict(ann=3, charles=5, david=7, fred=9, gloria=11))

You can name the Series and its index.

In [16]:

s2.name = 'foo'
s2.index.name = 'name'

In [17]:

s2

Out[17]:

name
ann         3
charles     5
david       7
fred        9
gloria     11
Name: foo, dtype: int64

Automatic alignment on index occurs

In [18]:

s1 + s2

Out[18]:

ann         4.0
bob         NaN
charles     7.0
david      11.0
esther      NaN
fred       17.0
gloria      NaN
dtype: float64

Be careful when using custom integer indexes.

In [19]:

s3 = Series([1,2,3,4])

In [20]:

s4 = Series([1,2,3,4], index=range(1, 5))

In [21]:

s3 + s4

Out[21]:

  NaN
  3.0
  5.0
  7.0
  NaN
dtype: float64

DataFrame¶

Conceptually, a DataFrame is a dict of Series sharing the same index values.

In [22]:

n = len(s1)
name = ['ann arbor', 'bob blanks', 'charles chin', 'david dumas', 'esther einstein', 'fred foster']
major = ['math', 'math', 'bio', 'stats', 'bio', 'stats']
age = np.random.randint(18, 30, n)
ht = np.random.normal(170, 15, n)
wt = np.random.normal(65, 25, n)
df1 = DataFrame(dict(name=name, age=age, major=major, height=ht, weight=wt),
                columns=['name', 'age', 'height', 'weight', 'major'])

In [23]:

df1

Out[23]:

	name	age	height	weight	major
0	ann arbor	26	157.643795	46.121029	math
1	bob blanks	22	160.558592	15.930474	math
2	charles chin	27	156.663720	83.714374	bio
3	david dumas	25	183.490666	62.801966	stats
4	esther einstein	27	179.075648	71.689364	bio
5	fred foster	20	162.301896	84.537286	stats

Sampling from DataFrame¶

In [24]:

df1.head(3)

Out[24]:

	name	age	height	weight	major
0	ann arbor	26	157.643795	46.121029	math
1	bob blanks	22	160.558592	15.930474	math
2	charles chin	27	156.663720	83.714374	bio

In [25]:

df1.tail(3)

Out[25]:

	name	age	height	weight	major
3	david dumas	25	183.490666	62.801966	stats
4	esther einstein	27	179.075648	71.689364	bio
5	fred foster	20	162.301896	84.537286	stats

In [26]:

df1.sample(3)

Out[26]:

	name	age	height	weight	major
5	fred foster	20	162.301896	84.537286	stats
2	charles chin	27	156.663720	83.714374	bio
4	esther einstein	27	179.075648	71.689364	bio

Slicing a DataFrame¶

You can slice by position or by index value.

In [27]:

df2 = DataFrame(np.random.randint(0, 10, (5,4)), index=range(1,6), columns=list('abcd'))

In [28]:

df2

Out[28]:

	a	b	c	d
1	9	3	7	2
2	0	9	4	7
3	7	8	3	3
4	4	2	6	4
5	0	6	9	3

In [29]:

df2[2:4]

Out[29]:

	a	b	c	d
3	7	8	3	3
4	4	2	6	4

Use loc if you want to retieve by value rather than position.

In [30]:

df2.loc[2:4]

Out[30]:

	a	b	c	d
2	0	9	4	7
3	7	8	3	3
4	4	2	6	4

Use iloc to use numpy style indexing.

In [31]:

df2.iloc[2:4]

Out[31]:

	a	b	c	d
3	7	8	3	3
4	4	2	6	4

In [32]:

df2.iloc[2:4, 1:3]

Out[32]:

	b	c
3	8	3
4	2	6

In [33]:

df2.iloc[:, 1:3]

Out[33]:

	b	c
1	3	7
2	9	4
3	8	3
4	2	6
5	6	9

Getting a Series from a DataFrame¶

In [34]:

df1['age']

Out[34]:

  26
  22
  27
  25
  27
  20
Name: age, dtype: int64

In [35]:

df1.age

Out[35]:

  26
  22
  27
  25
  27
  20
Name: age, dtype: int64

Getting`numpy` arrays from a DataFrame¶

In [36]:

df1.values

Out[36]:

array([['ann arbor', 26, 157.64379546715057, 46.12102942626177, 'math'],
       ['bob blanks', 22, 160.55859183627487, 15.93047432813055, 'math'],
       ['charles chin', 27, 156.66371998112533, 83.71437370950592, 'bio'],
       ['david dumas', 25, 183.49066607121097, 62.80196617116858, 'stats'],
       ['esther einstein', 27, 179.07564818672492, 71.68936436315059, 'bio'],
       ['fred foster', 20, 162.30189564187515, 84.53728614200384, 'stats']], dtype=object)

Index¶

Indexes are immutable collections used to store metadata for pandas Series and DataFrames. They behave like multisets or bags.

In [37]:

df1.index

Out[37]:

RangeIndex(start=0, stop=6, step=1)

In [38]:

df1.columns

Out[38]:

Index(['name', 'age', 'height', 'weight', 'major'], dtype='object')

I/O¶

Getting data from R¶

In [39]:

%load_ext rpy2.ipython

In [40]:

iris = %R iris

In [41]:

iris.head()

Out[41]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa

If you have no access to the R magic, you can do this.

In [42]:

from rpy2.robjects import r, pandas2ri
pandas2ri.activate()

iris2 = r['iris']
iris2.head()

Out[42]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa

Reading CSV files¶

In [43]:

import os

base = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master'
path = os.path.join(base, 'tips.csv')

In [44]:

tips = pd.read_csv(path)

In [45]:

tips.head(4)

Out[45]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2

In [46]:

titanic = pd.read_csv(os.path.join(base, 'titanic.csv'))

In [47]:

titanic.head(4)

Out[47]:

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False

Working with Excel files¶

In [48]:

tips.to_excel('tips.xlsx')

In [49]:

ls *.xlsx

tips.xlsx

In [50]:

tips2 = pd.read_excel('tips.xlsx')

In [51]:

tips2.head(4)

Out[51]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2

Working with relational databases¶

In [52]:

! curl -L -O https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   181  100   181    0     0    877      0 --:--:-- --:--:-- --:--:--   878
100 1042k  100 1042k    0     0  1013k      0  0:00:01  0:00:01 --:--:-- 1903k

In [53]:

ls *sqlite

Chinook_Sqlite.sqlite

Standard access¶

In [54]:

from sqlalchemy import create_engine

In [55]:

engine = 'sqlite:///Chinook_Sqlite.sqlite'

In [56]:

db = create_engine(engine)

In [57]:

db.table_names()

Out[57]:

['Album',
 'Artist',
 'Customer',
 'Employee',
 'Genre',
 'Invoice',
 'InvoiceLine',
 'MediaType',
 'Playlist',
 'PlaylistTrack',
 'Track']

In [58]:

artist = pd.read_sql('select * from Artist where Name like "A%"', db)

In [59]:

artist.head(4)

Out[59]:

	ArtistId	Name
0	1	AC/DC
1	2	Accept
2	3	Aerosmith
3	4	Alanis Morissette

Ussing SQLmagic¶

In [60]:

%load_ext sql

In [61]:

%sql sqlite:///Chinook_Sqlite.sqlite

Out[61]:

'Connected: None@Chinook_Sqlite.sqlite'

In [62]:

%sql SELECT name FROM sqlite_master WHERE type='table'

Done.

Out[62]:

name
Album
Artist
Customer
Employee
Genre
Invoice
InvoiceLine
MediaType
Playlist
PlaylistTrack
Track

In [63]:

result = %sql select * from Album;

Done.

In [64]:

album = result.DataFrame()

In [65]:

album.head(4)

Out[65]:

	AlbumId	Title	ArtistId
0	1	For Those About To Rock We Salute You	1
1	2	Balls to the Wall	2
2	3	Restless and Wild	2
3	4	Let There Be Rock	1

Basic data manipulation¶

In [66]:

iris.head()

Out[66]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa

Extracting columns¶

A single value or list of value selects columns.

In [67]:

iris['Species'].head(3)

Out[67]:

  setosa
  setosa
  setosa
Name: Species, dtype: object

In [68]:

iris[['Sepal.Length', 'Petal.Length']].head(3)

Out[68]:

	Sepal.Length	Petal.Length
1	5.1	1.4
2	4.9	1.4
3	4.7	1.3

You can use loc and iloc.

In [69]:

iris.loc[:, :'Sepal.Width'].head(3)

Out[69]:

	Sepal.Length	Sepal.Width
1	5.1	3.5
2	4.9	3.0
3	4.7	3.2

In [70]:

iris.iloc[:, [0,2,4]].head(3)

Out[70]:

	Sepal.Length	Petal.Length	Species
1	5.1	1.4	setosa
2	4.9	1.4	setosa
3	4.7	1.3	setosa

Extracting rows¶

A single slice returns rows.

In [71]:

iris[2:4]

Out[71]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa

You can use loc or iloc

In [72]:

iris.loc[3:4]

Out[72]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa

In [73]:

iris.iloc[2:4]

Out[73]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa

Or use Boolean indexing

In [74]:

iris[iris.Species == 'virginica'].head(3)

Out[74]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
101	6.3	3.3	6.0	2.5	virginica
102	5.8	2.7	5.1	1.9	virginica
103	7.1	3.0	5.9	2.1	virginica

Using `select` and `filter`¶

select uses a predicate to select rows or columns.

In [75]:

iris.select(lambda x: x.startswith('S'), axis=1).head(3)

/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement
  """Entry point for launching an IPython kernel.

Out[75]:

	Sepal.Length	Sepal.Width	Species
1	5.1	3.5	setosa
2	4.9	3.0	setosa
3	4.7	3.2	setosa

filter subsets based on labels in an index.

In [76]:

iris.filter(items=['Species', 'Sepal.Width']).head(3)

Out[76]:

	Species	Sepal.Width
1	setosa	3.5
2	setosa	3.0
3	setosa	3.2

In [77]:

iris.filter(items=[5,10,15], axis=0).head(3)

Out[77]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5	5.0	3.6	1.4	0.2	setosa
10	4.9	3.1	1.5	0.1	setosa
15	5.8	4.0	1.2	0.2	setosa

In [78]:

iris.filter(like='Width').head(3)

Out[78]:

	Sepal.Width	Petal.Width
1	3.5	0.2
2	3.0	0.2
3	3.2	0.2

In [79]:

iris.filter(regex='.+\..+').head(3)

Out[79]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
1	5.1	3.5	1.4	0.2
2	4.9	3.0	1.4	0.2
3	4.7	3.2	1.3	0.2

Sort¶

Sorting on index.

In [80]:

iris.sort_index(ascending=False).head(3)

Out[80]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
150	5.9	3.0	5.1	1.8	virginica
149	6.2	3.4	5.4	2.3	virginica
148	6.5	3.0	5.2	2.0	virginica

In [81]:

iris.sort_index(axis=1).head(3)

Out[81]:

	Petal.Length	Petal.Width	Sepal.Length	Sepal.Width	Species
1	1.4	0.2	5.1	3.5	setosa
2	1.4	0.2	4.9	3.0	setosa
3	1.3	0.2	4.7	3.2	setosa

In [82]:

iris.sort_index(axis=1).head(3)

Out[82]:

	Petal.Length	Petal.Width	Sepal.Length	Sepal.Width	Species
1	1.4	0.2	5.1	3.5	setosa
2	1.4	0.2	4.9	3.0	setosa
3	1.3	0.2	4.7	3.2	setosa

Sorting on value.

In [83]:

iris.sort_values(by=['Sepal.Length', 'Sepal.Width'], ascending=[True, False]).head()

Out[83]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
14	4.3	3.0	1.1	0.1	setosa
43	4.4	3.2	1.3	0.2	setosa
39	4.4	3.0	1.3	0.2	setosa
9	4.4	2.9	1.4	0.2	setosa
42	4.5	2.3	1.3	0.3	setosa

Getting ranks.

In [84]:

iris.rank(axis=0, numeric_only=True, method='first').head(3)

Out[84]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
1	33.0	126.0	12.0	6.0
2	17.0	58.0	13.0	7.0
3	10.0	95.0	5.0	8.0

Chaining¶

DataFrame methods can be chained together (like piping).

In [85]:

(
    iris.select(lambda x: x % 2 == 0).
    filter(regex='.+\..+').
    sort_values('Sepal.Length').
    sample(3)
)

/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement

Out[85]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
118	7.7	3.8	6.7	2.2
58	4.9	2.4	3.3	1.0
70	5.6	2.5	3.9	1.1

You can add your own function to the chain using pipe.

In [86]:

def f(x, n):
    return x**n

In [87]:

(
    iris.select(lambda x: x % 2 == 0).
    filter(regex='.+\..+').
    sort_values('Sepal.Length').
    pipe(f, n=2).
    sample(3)
)

/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement

Out[87]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
92	37.21	9.00	21.16	1.96
124	39.69	7.29	24.01	3.24
78	44.89	9.00	25.00	2.89

Transformations¶

In [88]:

iris.iloc[:, :-1].head(3)

Out[88]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
1	5.1	3.5	1.4	0.2
2	4.9	3.0	1.4	0.2
3	4.7	3.2	1.3	0.2

In [89]:

(iris.iloc[:, :-1] ** 2).head(3)

Out[89]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
1	26.01	12.25	1.96	0.04
2	24.01	9.00	1.96	0.04
3	22.09	10.24	1.69	0.04

In [90]:

iris2 = iris.copy()

Creating new columns¶

In [91]:

iris2['Length'] = (iris2.filter(like='Length')).sum(axis=1)
iris2.head(3)

Out[91]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species	Length
1	5.1	3.5	1.4	0.2	setosa	6.5
2	4.9	3.0	1.4	0.2	setosa	6.3
3	4.7	3.2	1.3	0.2	setosa	6.0

In [92]:

iris2.assign(Width=iris2.filter(like='Width').sum(axis=1)).head(3)

Out[92]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species	Length	Width
1	5.1	3.5	1.4	0.2	setosa	6.5	3.7
2	4.9	3.0	1.4	0.2	setosa	6.3	3.2
3	4.7	3.2	1.3	0.2	setosa	6.0	3.4

Use `applymap` to perform an element-wise transformation¶

In [93]:

f = lambda x: len(str(x))
iris.applymap(f).head(3)

Out[93]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	3	3	3	3	6
2	3	3	3	3	6
3	3	3	3	3	6

For Series, there is a `map` function¶

In [94]:

iris['Species'].apply(f).head(3)

Out[94]:

  6
  6
  6
Name: Species, dtype: int64

Accumulations¶

In [95]:

iris.cumsum().head(3)

Out[95]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.1	3.5	1.4	0.2	setosa
2	10	6.5	2.8	0.4	setosasetosa
3	14.7	9.7	4.1	0.6	setosasetosasetosa

In [96]:

iris.cummax().head(3)

Out[96]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.1	3.5	1.4	0.2	setosa
2	5.1	3.5	1.4	0.2	setosa
3	5.1	3.5	1.4	0.2	setosa

Summaries¶

In [97]:

iris.describe()

Out[97]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.057333	3.758000	1.199333
std	0.828066	0.435866	1.765298	0.762238
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

Many statistical summaries are built-in¶

In [98]:

iris.median()

Out[98]:

Sepal.Length    5.80
Sepal.Width     3.00
Petal.Length    4.35
Petal.Width     1.30
dtype: float64

In [99]:

iris.max()

Out[99]:

Sepal.Length          7.9
Sepal.Width           4.4
Petal.Length          6.9
Petal.Width           2.5
Species         virginica
dtype: object

In [100]:

iris.quantile(0.25)

Out[100]:

Sepal.Length    5.1
Sepal.Width     2.8
Petal.Length    1.6
Petal.Width     0.3
Name: 0.25, dtype: float64

You can do this over rows as well, even if it does not make much sense in this example.

In [101]:

iris.sum(axis=1).head(3)

Out[101]:

  10.2
   9.5
   9.4
dtype: float64

Use `apply` for a custom summary¶

In [102]:

def cv(x):
    """Coefficient of variation."""

    return x.mean() / x.std()

In [103]:

iris.iloc[:, : -1].apply(cv)

Out[103]:

Sepal.Length    7.056602
Sepal.Width     7.014384
Petal.Length    2.128819
Petal.Width     1.573438
dtype: float64

Correlation and covariance¶

These work on pairs of values.

In [104]:

iris.corr()

Out[104]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
Sepal.Length	1.000000	-0.117570	0.871754	0.817941
Sepal.Width	-0.117570	1.000000	-0.428440	-0.366126
Petal.Length	0.871754	-0.428440	1.000000	0.962865
Petal.Width	0.817941	-0.366126	0.962865	1.000000

In [105]:

iris.cov()

Out[105]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
Sepal.Length	0.685694	-0.042434	1.274315	0.516271
Sepal.Width	-0.042434	0.189979	-0.329656	-0.121639
Petal.Length	1.274315	-0.329656	3.116278	1.295609
Petal.Width	0.516271	-0.121639	1.295609	0.581006

In [106]:

iris.iloc[:,:-1].corrwith(iris['Sepal.Width']).sort_values()

Out[106]:

Petal.Length   -0.428440
Petal.Width    -0.366126
Sepal.Length   -0.117570
Sepal.Width     1.000000
dtype: float64

Finding unique values and count for a Series¶

In [107]:

iris.Species.unique()

Out[107]:

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [108]:

iris.Species.value_counts()

Out[108]:

setosa        50
versicolor    50
virginica     50
Name: Species, dtype: int64

Split-apply-combine¶

In [109]:

g = iris.groupby('Species')

In [110]:

g.get_group('virginica').head(3)

Out[110]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
101	6.3	3.3	6.0	2.5	virginica
102	5.8	2.7	5.1	1.9	virginica
103	7.1	3.0	5.9	2.1	virginica

In [111]:

g.mean()

Out[111]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
Species
setosa	5.006	3.428	1.462	0.246
versicolor	5.936	2.770	4.260	1.326
virginica	6.588	2.974	5.552	2.026

In [112]:

g[['Sepal.Length']].mean()

Out[112]:

	Sepal.Length
Species
setosa	5.006
versicolor	5.936
virginica	6.588

In [113]:

g.apply(cv)

Out[113]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
Species
setosa	14.201834	9.043319	8.418556	2.334285
versicolor	11.500062	8.827326	9.065547	6.705345
virginica	10.360452	9.221802	10.059890	7.376660

In [114]:

g.agg(['min', 'max', cv])

Out[114]:

	Sepal.Length			Sepal.Width			Petal.Length			Petal.Width
	min	max	cv	min	max	cv	min	max	cv	min	max	cv
Species
setosa	4.3	5.8	14.201834	2.3	4.4	9.043319	1.0	1.9	8.418556	0.1	0.6	2.334285
versicolor	4.9	7.0	11.500062	2.0	3.4	8.827326	3.0	5.1	9.065547	1.0	1.8	6.705345
virginica	4.9	7.9	10.360452	2.2	3.8	9.221802	4.5	6.9	10.059890	1.4	2.5	7.376660

Flexibility of agg

In [115]:

g.agg({
    'Sepal.Length': [('smallest', 'min'), ('largest', 'max')],
    'Sepal.Width': ['mean', 'std', cv],
})

Out[115]:

	Sepal.Length		Sepal.Width
	smallest	largest	mean	std	cv
Species
setosa	4.3	5.8	3.428	0.379064	9.043319
versicolor	4.9	7.0	2.770	0.313798	8.827326
virginica	4.9	7.9	2.974	0.322497	9.221802

If you need to do something other than an aggregate function, use apply.

In [116]:

def top_k(df, col, k=3):
    return df.sort_values(col)[:k]

In [117]:

iris.groupby('Species').apply(top_k, col='Sepal.Length', k=2)

Out[117]:

		Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
Species
setosa	14	4.3	3.0	1.1	0.1	setosa
setosa	9	4.4	2.9	1.4	0.2	setosa
versicolor	58	4.9	2.4	3.3	1.0	versicolor
versicolor	61	5.0	2.0	3.5	1.0	versicolor
virginica	107	4.9	2.5	4.5	1.7	virginica
virginica	122	5.6	2.8	4.9	2.0	virginica

Alternaive without group keys.

In [118]:

iris.groupby('Species', group_keys=False).apply(top_k, col='Sepal.Length', k=2)

Out[118]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
14	4.3	3.0	1.1	0.1	setosa
9	4.4	2.9	1.4	0.2	setosa
58	4.9	2.4	3.3	1.0	versicolor
61	5.0	2.0	3.5	1.0	versicolor
107	4.9	2.5	4.5	1.7	virginica
122	5.6	2.8	4.9	2.0	virginica

Grouping on multiple columns¶

In [119]:

tips.head()

Out[119]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

In [120]:

tips[tips.day=='Sun'].groupby(['sex', 'smoker', 'time']).mean()

Out[120]:

			total_bill	tip	size
sex	smoker	time
Female	No	Dinner	20.824286	3.329286	3.071429
Female	Yes	Dinner	16.540000	3.500000	2.500000
Male	No	Dinner	20.403256	3.115349	2.883721
Male	Yes	Dinner	26.141333	3.521333	2.600000

Alternative without hierarchical index.

In [121]:

tips[tips.day=='Sun'].groupby(['sex', 'smoker', 'time'], as_index=False).mean()

Out[121]:

	sex	smoker	time	total_bill	tip	size
0	Female	No	Dinner	20.824286	3.329286	3.071429
1	Female	Yes	Dinner	16.540000	3.500000	2.500000
2	Male	No	Dinner	20.403256	3.115349	2.883721
3	Male	Yes	Dinner	26.141333	3.521333	2.600000

Grouping can be on objects other than an existing column¶

Using a dictionary to provide mapping.

In [122]:

mapping = {
    'Sepal.Length' : 'Sepal',
    'Sepal.Width': 'Sepal',
    'Petal.Length': 'Petal',
    'Petal.Width': 'Petal',
          }

In [123]:

iris.iloc[:, :-1].groupby(mapping, axis=1).mean().head(4)

Out[123]:

	Petal	Sepal
1	0.80	4.30
2	0.80	3.95
3	0.75	3.95
4	0.85	3.85

Using a function to provide mapping.

In [124]:

iris.iloc[:, :-1].groupby(lambda x: x.split('.')[0], axis=1).mean().head(4)

Out[124]:

	Petal	Sepal
1	0.80	4.30
2	0.80	3.95
3	0.75	3.95
4	0.85	3.85

Pandas Part 1¶

Data structures¶

Series¶

Conversions¶

DataFrame¶

Sampling from DataFrame¶

Slicing a DataFrame¶

Getting a Series from a DataFrame¶

Getting`numpy` arrays from a DataFrame¶

Index¶

I/O¶

Getting data from R¶

Reading CSV files¶

Working with Excel files¶

Working with relational databases¶

Standard access¶

Ussing SQLmagic¶

Basic data manipulation¶

Extracting columns¶

Extracting rows¶

Using `select` and `filter`¶

Sort¶

Chaining¶

Transformations¶

Creating new columns¶

Use `applymap` to perform an element-wise transformation¶

For Series, there is a `map` function¶

Accumulations¶

Summaries¶

Many statistical summaries are built-in¶

Use `apply` for a custom summary¶

Correlation and covariance¶

Finding unique values and count for a Series¶

Split-apply-combine¶

Grouping on multiple columns¶

Grouping can be on objects other than an existing column¶

Page contents

Previous page

Next page

This Page

Pandas Part 1¶

Data structures¶

Series¶

Conversions¶

DataFrame¶

Sampling from DataFrame¶

Slicing a DataFrame¶

Getting a Series from a DataFrame¶

Gettingnumpy arrays from a DataFrame¶

Index¶

I/O¶

Getting data from R¶

Reading CSV files¶

Working with Excel files¶

Working with relational databases¶

Standard access¶

Ussing SQLmagic¶

Basic data manipulation¶

Extracting columns¶

Extracting rows¶

Using select and filter¶

Sort¶

Chaining¶

Transformations¶

Creating new columns¶

Use applymap to perform an element-wise transformation¶

For Series, there is a map function¶

Accumulations¶

Summaries¶

Many statistical summaries are built-in¶

Use apply for a custom summary¶

Correlation and covariance¶

Finding unique values and count for a Series¶

Split-apply-combine¶

Grouping on multiple columns¶

Grouping can be on objects other than an existing column¶

Getting`numpy` arrays from a DataFrame¶

Using `select` and `filter`¶

Use `applymap` to perform an element-wise transformation¶

For Series, there is a `map` function¶

Use `apply` for a custom summary¶