Pandas Part 1¶
In Part 1, we’ll deal with the basic pandas
data structures (Series
and DataFrame), loading data from a table, spreadsheet or database, and
the basics of data manipulation. In Part 2, we’ll dive into more data
manipulation techniques.
For those already familiar with R data munging, see how to do the same
thing
in pandas
.
In [1]:
%matplotlib inline
In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
Data structures¶
Series¶
Series are similar to numpy
arrays.
In [3]:
s1 = Series([1,1,2,4,5,8])
In [4]:
s1
Out[4]:
0 1
1 1
2 2
3 4
4 5
5 8
dtype: int64
In [5]:
s1.values
Out[5]:
array([1, 1, 2, 4, 5, 8])
In [6]:
s1.tolist()
Out[6]:
[1, 1, 2, 4, 5, 8]
In [7]:
s1.index
Out[7]:
RangeIndex(start=0, stop=6, step=1)
In [8]:
s1[s1 > 3]
Out[8]:
3 4
4 5
5 8
dtype: int64
In [9]:
s1[s1 > 3] ** 2
Out[9]:
3 16
4 25
5 64
dtype: int64
Conversions¶
In [10]:
s1.astype('float')
Out[10]:
0 1.0
1 1.0
2 2.0
3 4.0
4 5.0
5 8.0
dtype: float64
In [11]:
s1.astype('category')
Out[11]:
0 1
1 1
2 2
3 4
4 5
5 8
dtype: category
Categories (5, int64): [1, 2, 4, 5, 8]
Series can also behave like Python dictionaries.
In [12]:
s1.index = ['ann', 'bob', 'charles', 'david', 'esther', 'fred']
In [13]:
s1
Out[13]:
ann 1
bob 1
charles 2
david 4
esther 5
fred 8
dtype: int64
In [14]:
s1['esther']
Out[14]:
5
In [15]:
s2 = Series(dict(ann=3, charles=5, david=7, fred=9, gloria=11))
You can name the Series and its index.
In [16]:
s2.name = 'foo'
s2.index.name = 'name'
In [17]:
s2
Out[17]:
name
ann 3
charles 5
david 7
fred 9
gloria 11
Name: foo, dtype: int64
Automatic alignment on index occurs
In [18]:
s1 + s2
Out[18]:
ann 4.0
bob NaN
charles 7.0
david 11.0
esther NaN
fred 17.0
gloria NaN
dtype: float64
Be careful when using custom integer indexes.
In [19]:
s3 = Series([1,2,3,4])
In [20]:
s4 = Series([1,2,3,4], index=range(1, 5))
In [21]:
s3 + s4
Out[21]:
0 NaN
1 3.0
2 5.0
3 7.0
4 NaN
dtype: float64
DataFrame¶
Conceptually, a DataFrame is a dict of Series sharing the same index values.
In [22]:
n = len(s1)
name = ['ann arbor', 'bob blanks', 'charles chin', 'david dumas', 'esther einstein', 'fred foster']
major = ['math', 'math', 'bio', 'stats', 'bio', 'stats']
age = np.random.randint(18, 30, n)
ht = np.random.normal(170, 15, n)
wt = np.random.normal(65, 25, n)
df1 = DataFrame(dict(name=name, age=age, major=major, height=ht, weight=wt),
columns=['name', 'age', 'height', 'weight', 'major'])
In [23]:
df1
Out[23]:
name | age | height | weight | major | |
---|---|---|---|---|---|
0 | ann arbor | 26 | 157.643795 | 46.121029 | math |
1 | bob blanks | 22 | 160.558592 | 15.930474 | math |
2 | charles chin | 27 | 156.663720 | 83.714374 | bio |
3 | david dumas | 25 | 183.490666 | 62.801966 | stats |
4 | esther einstein | 27 | 179.075648 | 71.689364 | bio |
5 | fred foster | 20 | 162.301896 | 84.537286 | stats |
Sampling from DataFrame¶
In [24]:
df1.head(3)
Out[24]:
name | age | height | weight | major | |
---|---|---|---|---|---|
0 | ann arbor | 26 | 157.643795 | 46.121029 | math |
1 | bob blanks | 22 | 160.558592 | 15.930474 | math |
2 | charles chin | 27 | 156.663720 | 83.714374 | bio |
In [25]:
df1.tail(3)
Out[25]:
name | age | height | weight | major | |
---|---|---|---|---|---|
3 | david dumas | 25 | 183.490666 | 62.801966 | stats |
4 | esther einstein | 27 | 179.075648 | 71.689364 | bio |
5 | fred foster | 20 | 162.301896 | 84.537286 | stats |
In [26]:
df1.sample(3)
Out[26]:
name | age | height | weight | major | |
---|---|---|---|---|---|
5 | fred foster | 20 | 162.301896 | 84.537286 | stats |
2 | charles chin | 27 | 156.663720 | 83.714374 | bio |
4 | esther einstein | 27 | 179.075648 | 71.689364 | bio |
Slicing a DataFrame¶
You can slice by position or by index value.
In [27]:
df2 = DataFrame(np.random.randint(0, 10, (5,4)), index=range(1,6), columns=list('abcd'))
In [28]:
df2
Out[28]:
a | b | c | d | |
---|---|---|---|---|
1 | 9 | 3 | 7 | 2 |
2 | 0 | 9 | 4 | 7 |
3 | 7 | 8 | 3 | 3 |
4 | 4 | 2 | 6 | 4 |
5 | 0 | 6 | 9 | 3 |
In [29]:
df2[2:4]
Out[29]:
a | b | c | d | |
---|---|---|---|---|
3 | 7 | 8 | 3 | 3 |
4 | 4 | 2 | 6 | 4 |
Use loc
if you want to retieve by value rather than position.
In [30]:
df2.loc[2:4]
Out[30]:
a | b | c | d | |
---|---|---|---|---|
2 | 0 | 9 | 4 | 7 |
3 | 7 | 8 | 3 | 3 |
4 | 4 | 2 | 6 | 4 |
Use iloc
to use numpy
style indexing.
In [31]:
df2.iloc[2:4]
Out[31]:
a | b | c | d | |
---|---|---|---|---|
3 | 7 | 8 | 3 | 3 |
4 | 4 | 2 | 6 | 4 |
In [32]:
df2.iloc[2:4, 1:3]
Out[32]:
b | c | |
---|---|---|
3 | 8 | 3 |
4 | 2 | 6 |
In [33]:
df2.iloc[:, 1:3]
Out[33]:
b | c | |
---|---|---|
1 | 3 | 7 |
2 | 9 | 4 |
3 | 8 | 3 |
4 | 2 | 6 |
5 | 6 | 9 |
Getting a Series from a DataFrame¶
In [34]:
df1['age']
Out[34]:
0 26
1 22
2 27
3 25
4 27
5 20
Name: age, dtype: int64
In [35]:
df1.age
Out[35]:
0 26
1 22
2 27
3 25
4 27
5 20
Name: age, dtype: int64
Gettingnumpy
arrays from a DataFrame¶
In [36]:
df1.values
Out[36]:
array([['ann arbor', 26, 157.64379546715057, 46.12102942626177, 'math'],
['bob blanks', 22, 160.55859183627487, 15.93047432813055, 'math'],
['charles chin', 27, 156.66371998112533, 83.71437370950592, 'bio'],
['david dumas', 25, 183.49066607121097, 62.80196617116858, 'stats'],
['esther einstein', 27, 179.07564818672492, 71.68936436315059, 'bio'],
['fred foster', 20, 162.30189564187515, 84.53728614200384, 'stats']], dtype=object)
Index¶
Indexes are immutable collections used to store metadata for pandas
Series and DataFrames. They behave like multisets or bags.
In [37]:
df1.index
Out[37]:
RangeIndex(start=0, stop=6, step=1)
In [38]:
df1.columns
Out[38]:
Index(['name', 'age', 'height', 'weight', 'major'], dtype='object')
I/O¶
Getting data from R¶
In [39]:
%load_ext rpy2.ipython
In [40]:
iris = %R iris
In [41]:
iris.head()
Out[41]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
If you have no access to the R magic, you can do this.
In [42]:
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()
iris2 = r['iris']
iris2.head()
Out[42]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Reading CSV files¶
In [43]:
import os
base = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master'
path = os.path.join(base, 'tips.csv')
In [44]:
tips = pd.read_csv(path)
In [45]:
tips.head(4)
Out[45]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
In [46]:
titanic = pd.read_csv(os.path.join(base, 'titanic.csv'))
In [47]:
titanic.head(4)
Out[47]:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
Working with Excel files¶
In [48]:
tips.to_excel('tips.xlsx')
In [49]:
ls *.xlsx
tips.xlsx
In [50]:
tips2 = pd.read_excel('tips.xlsx')
In [51]:
tips2.head(4)
Out[51]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
Working with relational databases¶
In [52]:
! curl -L -O https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 181 100 181 0 0 877 0 --:--:-- --:--:-- --:--:-- 878
100 1042k 100 1042k 0 0 1013k 0 0:00:01 0:00:01 --:--:-- 1903k
In [53]:
ls *sqlite
Chinook_Sqlite.sqlite
Standard access¶
In [54]:
from sqlalchemy import create_engine
In [55]:
engine = 'sqlite:///Chinook_Sqlite.sqlite'
In [56]:
db = create_engine(engine)
In [57]:
db.table_names()
Out[57]:
['Album',
'Artist',
'Customer',
'Employee',
'Genre',
'Invoice',
'InvoiceLine',
'MediaType',
'Playlist',
'PlaylistTrack',
'Track']
In [58]:
artist = pd.read_sql('select * from Artist where Name like "A%"', db)
In [59]:
artist.head(4)
Out[59]:
ArtistId | Name | |
---|---|---|
0 | 1 | AC/DC |
1 | 2 | Accept |
2 | 3 | Aerosmith |
3 | 4 | Alanis Morissette |
Ussing SQLmagic¶
In [60]:
%load_ext sql
In [61]:
%sql sqlite:///Chinook_Sqlite.sqlite
Out[61]:
'Connected: None@Chinook_Sqlite.sqlite'
In [62]:
%sql SELECT name FROM sqlite_master WHERE type='table'
Done.
Out[62]:
name |
---|
Album |
Artist |
Customer |
Employee |
Genre |
Invoice |
InvoiceLine |
MediaType |
Playlist |
PlaylistTrack |
Track |
In [63]:
result = %sql select * from Album;
Done.
In [64]:
album = result.DataFrame()
In [65]:
album.head(4)
Out[65]:
AlbumId | Title | ArtistId | |
---|---|---|---|
0 | 1 | For Those About To Rock We Salute You | 1 |
1 | 2 | Balls to the Wall | 2 |
2 | 3 | Restless and Wild | 2 |
3 | 4 | Let There Be Rock | 1 |
Basic data manipulation¶
In [66]:
iris.head()
Out[66]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Extracting columns¶
A single value or list of value selects columns.
In [67]:
iris['Species'].head(3)
Out[67]:
1 setosa
2 setosa
3 setosa
Name: Species, dtype: object
In [68]:
iris[['Sepal.Length', 'Petal.Length']].head(3)
Out[68]:
Sepal.Length | Petal.Length | |
---|---|---|
1 | 5.1 | 1.4 |
2 | 4.9 | 1.4 |
3 | 4.7 | 1.3 |
You can use loc
and iloc
.
In [69]:
iris.loc[:, :'Sepal.Width'].head(3)
Out[69]:
Sepal.Length | Sepal.Width | |
---|---|---|
1 | 5.1 | 3.5 |
2 | 4.9 | 3.0 |
3 | 4.7 | 3.2 |
In [70]:
iris.iloc[:, [0,2,4]].head(3)
Out[70]:
Sepal.Length | Petal.Length | Species | |
---|---|---|---|
1 | 5.1 | 1.4 | setosa |
2 | 4.9 | 1.4 | setosa |
3 | 4.7 | 1.3 | setosa |
Extracting rows¶
A single slice returns rows.
In [71]:
iris[2:4]
Out[71]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
You can use loc
or iloc
In [72]:
iris.loc[3:4]
Out[72]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
In [73]:
iris.iloc[2:4]
Out[73]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
Or use Boolean indexing
In [74]:
iris[iris.Species == 'virginica'].head(3)
Out[74]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
101 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
102 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
103 | 7.1 | 3.0 | 5.9 | 2.1 | virginica |
Using select
and filter
¶
select
uses a predicate to select rows or columns.
In [75]:
iris.select(lambda x: x.startswith('S'), axis=1).head(3)
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement
"""Entry point for launching an IPython kernel.
Out[75]:
Sepal.Length | Sepal.Width | Species | |
---|---|---|---|
1 | 5.1 | 3.5 | setosa |
2 | 4.9 | 3.0 | setosa |
3 | 4.7 | 3.2 | setosa |
filter
subsets based on labels in an index.
In [76]:
iris.filter(items=['Species', 'Sepal.Width']).head(3)
Out[76]:
Species | Sepal.Width | |
---|---|---|
1 | setosa | 3.5 |
2 | setosa | 3.0 |
3 | setosa | 3.2 |
In [77]:
iris.filter(items=[5,10,15], axis=0).head(3)
Out[77]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
15 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
In [78]:
iris.filter(like='Width').head(3)
Out[78]:
Sepal.Width | Petal.Width | |
---|---|---|
1 | 3.5 | 0.2 |
2 | 3.0 | 0.2 |
3 | 3.2 | 0.2 |
In [79]:
iris.filter(regex='.+\..+').head(3)
Out[79]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 |
2 | 4.9 | 3.0 | 1.4 | 0.2 |
3 | 4.7 | 3.2 | 1.3 | 0.2 |
Sort¶
Sorting on index.
In [80]:
iris.sort_index(ascending=False).head(3)
Out[80]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
In [81]:
iris.sort_index(axis=1).head(3)
Out[81]:
Petal.Length | Petal.Width | Sepal.Length | Sepal.Width | Species | |
---|---|---|---|---|---|
1 | 1.4 | 0.2 | 5.1 | 3.5 | setosa |
2 | 1.4 | 0.2 | 4.9 | 3.0 | setosa |
3 | 1.3 | 0.2 | 4.7 | 3.2 | setosa |
In [82]:
iris.sort_index(axis=1).head(3)
Out[82]:
Petal.Length | Petal.Width | Sepal.Length | Sepal.Width | Species | |
---|---|---|---|---|---|
1 | 1.4 | 0.2 | 5.1 | 3.5 | setosa |
2 | 1.4 | 0.2 | 4.9 | 3.0 | setosa |
3 | 1.3 | 0.2 | 4.7 | 3.2 | setosa |
Sorting on value.
In [83]:
iris.sort_values(by=['Sepal.Length', 'Sepal.Width'], ascending=[True, False]).head()
Out[83]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
43 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
39 | 4.4 | 3.0 | 1.3 | 0.2 | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
42 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
Getting ranks.
In [84]:
iris.rank(axis=0, numeric_only=True, method='first').head(3)
Out[84]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
1 | 33.0 | 126.0 | 12.0 | 6.0 |
2 | 17.0 | 58.0 | 13.0 | 7.0 |
3 | 10.0 | 95.0 | 5.0 | 8.0 |
Chaining¶
DataFrame methods can be chained together (like piping).
In [85]:
(
iris.select(lambda x: x % 2 == 0).
filter(regex='.+\..+').
sort_values('Sepal.Length').
sample(3)
)
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement
Out[85]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
118 | 7.7 | 3.8 | 6.7 | 2.2 |
58 | 4.9 | 2.4 | 3.3 | 1.0 |
70 | 5.6 | 2.5 | 3.9 | 1.1 |
You can add your own function to the chain using pipe
.
In [86]:
def f(x, n):
return x**n
In [87]:
(
iris.select(lambda x: x % 2 == 0).
filter(regex='.+\..+').
sort_values('Sepal.Length').
pipe(f, n=2).
sample(3)
)
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement
Out[87]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
92 | 37.21 | 9.00 | 21.16 | 1.96 |
124 | 39.69 | 7.29 | 24.01 | 3.24 |
78 | 44.89 | 9.00 | 25.00 | 2.89 |
Transformations¶
In [88]:
iris.iloc[:, :-1].head(3)
Out[88]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 |
2 | 4.9 | 3.0 | 1.4 | 0.2 |
3 | 4.7 | 3.2 | 1.3 | 0.2 |
In [89]:
(iris.iloc[:, :-1] ** 2).head(3)
Out[89]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
1 | 26.01 | 12.25 | 1.96 | 0.04 |
2 | 24.01 | 9.00 | 1.96 | 0.04 |
3 | 22.09 | 10.24 | 1.69 | 0.04 |
In [90]:
iris2 = iris.copy()
Creating new columns¶
In [91]:
iris2['Length'] = (iris2.filter(like='Length')).sum(axis=1)
iris2.head(3)
Out[91]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Length | |
---|---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 6.5 |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 6.3 |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 6.0 |
In [92]:
iris2.assign(Width=iris2.filter(like='Width').sum(axis=1)).head(3)
Out[92]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Length | Width | |
---|---|---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 6.5 | 3.7 |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 6.3 | 3.2 |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 6.0 | 3.4 |
Use applymap
to perform an element-wise transformation¶
In [93]:
f = lambda x: len(str(x))
iris.applymap(f).head(3)
Out[93]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
1 | 3 | 3 | 3 | 3 | 6 |
2 | 3 | 3 | 3 | 3 | 6 |
3 | 3 | 3 | 3 | 3 | 6 |
For Series, there is a map
function¶
In [94]:
iris['Species'].apply(f).head(3)
Out[94]:
1 6
2 6
3 6
Name: Species, dtype: int64
Accumulations¶
In [95]:
iris.cumsum().head(3)
Out[95]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 10 | 6.5 | 2.8 | 0.4 | setosasetosa |
3 | 14.7 | 9.7 | 4.1 | 0.6 | setosasetosasetosa |
In [96]:
iris.cummax().head(3)
Out[96]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
3 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
Summaries¶
In [97]:
iris.describe()
Out[97]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
Many statistical summaries are built-in¶
In [98]:
iris.median()
Out[98]:
Sepal.Length 5.80
Sepal.Width 3.00
Petal.Length 4.35
Petal.Width 1.30
dtype: float64
In [99]:
iris.max()
Out[99]:
Sepal.Length 7.9
Sepal.Width 4.4
Petal.Length 6.9
Petal.Width 2.5
Species virginica
dtype: object
In [100]:
iris.quantile(0.25)
Out[100]:
Sepal.Length 5.1
Sepal.Width 2.8
Petal.Length 1.6
Petal.Width 0.3
Name: 0.25, dtype: float64
You can do this over rows as well, even if it does not make much sense in this example.
In [101]:
iris.sum(axis=1).head(3)
Out[101]:
1 10.2
2 9.5
3 9.4
dtype: float64
Use apply
for a custom summary¶
In [102]:
def cv(x):
"""Coefficient of variation."""
return x.mean() / x.std()
In [103]:
iris.iloc[:, : -1].apply(cv)
Out[103]:
Sepal.Length 7.056602
Sepal.Width 7.014384
Petal.Length 2.128819
Petal.Width 1.573438
dtype: float64
Correlation and covariance¶
These work on pairs of values.
In [104]:
iris.corr()
Out[104]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
Sepal.Length | 1.000000 | -0.117570 | 0.871754 | 0.817941 |
Sepal.Width | -0.117570 | 1.000000 | -0.428440 | -0.366126 |
Petal.Length | 0.871754 | -0.428440 | 1.000000 | 0.962865 |
Petal.Width | 0.817941 | -0.366126 | 0.962865 | 1.000000 |
In [105]:
iris.cov()
Out[105]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
Sepal.Length | 0.685694 | -0.042434 | 1.274315 | 0.516271 |
Sepal.Width | -0.042434 | 0.189979 | -0.329656 | -0.121639 |
Petal.Length | 1.274315 | -0.329656 | 3.116278 | 1.295609 |
Petal.Width | 0.516271 | -0.121639 | 1.295609 | 0.581006 |
In [106]:
iris.iloc[:,:-1].corrwith(iris['Sepal.Width']).sort_values()
Out[106]:
Petal.Length -0.428440
Petal.Width -0.366126
Sepal.Length -0.117570
Sepal.Width 1.000000
dtype: float64
Finding unique values and count for a Series¶
In [107]:
iris.Species.unique()
Out[107]:
array(['setosa', 'versicolor', 'virginica'], dtype=object)
In [108]:
iris.Species.value_counts()
Out[108]:
setosa 50
versicolor 50
virginica 50
Name: Species, dtype: int64
Split-apply-combine¶
In [109]:
g = iris.groupby('Species')
In [110]:
g.get_group('virginica').head(3)
Out[110]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
101 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
102 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
103 | 7.1 | 3.0 | 5.9 | 2.1 | virginica |
In [111]:
g.mean()
Out[111]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
Species | ||||
setosa | 5.006 | 3.428 | 1.462 | 0.246 |
versicolor | 5.936 | 2.770 | 4.260 | 1.326 |
virginica | 6.588 | 2.974 | 5.552 | 2.026 |
In [112]:
g[['Sepal.Length']].mean()
Out[112]:
Sepal.Length | |
---|---|
Species | |
setosa | 5.006 |
versicolor | 5.936 |
virginica | 6.588 |
In [113]:
g.apply(cv)
Out[113]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
Species | ||||
setosa | 14.201834 | 9.043319 | 8.418556 | 2.334285 |
versicolor | 11.500062 | 8.827326 | 9.065547 | 6.705345 |
virginica | 10.360452 | 9.221802 | 10.059890 | 7.376660 |
In [114]:
g.agg(['min', 'max', cv])
Out[114]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
min | max | cv | min | max | cv | min | max | cv | min | max | cv | |
Species | ||||||||||||
setosa | 4.3 | 5.8 | 14.201834 | 2.3 | 4.4 | 9.043319 | 1.0 | 1.9 | 8.418556 | 0.1 | 0.6 | 2.334285 |
versicolor | 4.9 | 7.0 | 11.500062 | 2.0 | 3.4 | 8.827326 | 3.0 | 5.1 | 9.065547 | 1.0 | 1.8 | 6.705345 |
virginica | 4.9 | 7.9 | 10.360452 | 2.2 | 3.8 | 9.221802 | 4.5 | 6.9 | 10.059890 | 1.4 | 2.5 | 7.376660 |
Flexibility of agg
In [115]:
g.agg({
'Sepal.Length': [('smallest', 'min'), ('largest', 'max')],
'Sepal.Width': ['mean', 'std', cv],
})
Out[115]:
Sepal.Length | Sepal.Width | ||||
---|---|---|---|---|---|
smallest | largest | mean | std | cv | |
Species | |||||
setosa | 4.3 | 5.8 | 3.428 | 0.379064 | 9.043319 |
versicolor | 4.9 | 7.0 | 2.770 | 0.313798 | 8.827326 |
virginica | 4.9 | 7.9 | 2.974 | 0.322497 | 9.221802 |
If you need to do something other than an aggregate function, use apply.
In [116]:
def top_k(df, col, k=3):
return df.sort_values(col)[:k]
In [117]:
iris.groupby('Species').apply(top_k, col='Sepal.Length', k=2)
Out[117]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | ||
---|---|---|---|---|---|---|
Species | ||||||
setosa | 14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa | |
versicolor | 58 | 4.9 | 2.4 | 3.3 | 1.0 | versicolor |
61 | 5.0 | 2.0 | 3.5 | 1.0 | versicolor | |
virginica | 107 | 4.9 | 2.5 | 4.5 | 1.7 | virginica |
122 | 5.6 | 2.8 | 4.9 | 2.0 | virginica |
Alternaive without group keys.
In [118]:
iris.groupby('Species', group_keys=False).apply(top_k, col='Sepal.Length', k=2)
Out[118]:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
58 | 4.9 | 2.4 | 3.3 | 1.0 | versicolor |
61 | 5.0 | 2.0 | 3.5 | 1.0 | versicolor |
107 | 4.9 | 2.5 | 4.5 | 1.7 | virginica |
122 | 5.6 | 2.8 | 4.9 | 2.0 | virginica |
Grouping on multiple columns¶
In [119]:
tips.head()
Out[119]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
In [120]:
tips[tips.day=='Sun'].groupby(['sex', 'smoker', 'time']).mean()
Out[120]:
total_bill | tip | size | |||
---|---|---|---|---|---|
sex | smoker | time | |||
Female | No | Dinner | 20.824286 | 3.329286 | 3.071429 |
Yes | Dinner | 16.540000 | 3.500000 | 2.500000 | |
Male | No | Dinner | 20.403256 | 3.115349 | 2.883721 |
Yes | Dinner | 26.141333 | 3.521333 | 2.600000 |
Alternative without hierarchical index.
In [121]:
tips[tips.day=='Sun'].groupby(['sex', 'smoker', 'time'], as_index=False).mean()
Out[121]:
sex | smoker | time | total_bill | tip | size | |
---|---|---|---|---|---|---|
0 | Female | No | Dinner | 20.824286 | 3.329286 | 3.071429 |
1 | Female | Yes | Dinner | 16.540000 | 3.500000 | 2.500000 |
2 | Male | No | Dinner | 20.403256 | 3.115349 | 2.883721 |
3 | Male | Yes | Dinner | 26.141333 | 3.521333 | 2.600000 |
Grouping can be on objects other than an existing column¶
Using a dictionary to provide mapping.
In [122]:
mapping = {
'Sepal.Length' : 'Sepal',
'Sepal.Width': 'Sepal',
'Petal.Length': 'Petal',
'Petal.Width': 'Petal',
}
In [123]:
iris.iloc[:, :-1].groupby(mapping, axis=1).mean().head(4)
Out[123]:
Petal | Sepal | |
---|---|---|
1 | 0.80 | 4.30 |
2 | 0.80 | 3.95 |
3 | 0.75 | 3.95 |
4 | 0.85 | 3.85 |
Using a function to provide mapping.
In [124]:
iris.iloc[:, :-1].groupby(lambda x: x.split('.')[0], axis=1).mean().head(4)
Out[124]:
Petal | Sepal | |
---|---|---|
1 | 0.80 | 4.30 |
2 | 0.80 | 3.95 |
3 | 0.75 | 3.95 |
4 | 0.85 | 3.85 |