Data¶
Resources¶
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
Working with Series¶
x = Series(range(5,10))
x
0 5
1 6
2 7
3 8
4 9
dtype: int64
We cna treat Series objects much like numpy vectors¶
x.sum(), x.mean(), x.std()
(35, 7.0, 1.5811388300841898)
x**2
0 25
1 36
2 49
3 64
4 81
dtype: int64
x[x >= 8]
3 8
4 9
dtype: int64
Series can also contain more information than numpy vectors¶
Series index¶
But you can also assign labeled indexes.
x.index = list('abcde')
x
a 5
b 6
c 7
d 8
e 9
dtype: int64
Even when you have a labeled index, positional arguments still work¶
x[1:4]
b 6
c 7
d 8
dtype: int64
x.a, x.c, x.e
(5, 7, 9)
Working with missing data¶
Missing data is indicated with NaN (not a number).
y = Series([10, np.nan, np.nan, 13, 14])
y
0 10.0
1 NaN
2 NaN
3 13.0
4 14.0
dtype: float64
Concatenating two series¶
z = pd.concat([x, y])
z
a 5.0
b 6.0
c 7.0
d 8.0
e 9.0
0 10.0
1 NaN
2 NaN
3 13.0
4 14.0
dtype: float64
Reset index to default¶
z = z.reset_index(drop=True)
z
0 5.0
1 6.0
2 7.0
3 8.0
4 9.0
5 10.0
6 NaN
7 NaN
8 13.0
9 14.0
dtype: float64
pandas
aggregate functions ignore missing data¶
z.sum(), z.mean(), z.std()
(72.0, 9.0, 3.2071349029490928)
Selecting non-missing values¶
z[z.notnull()]
0 5.0
1 6.0
2 7.0
3 8.0
4 9.0
5 10.0
8 13.0
9 14.0
dtype: float64
Replacement of missing values¶
z.fillna(0)
0 5.0
1 6.0
2 7.0
3 8.0
4 9.0
5 10.0
6 0.0
7 0.0
8 13.0
9 14.0
dtype: float64
z.fillna(method='ffill')
0 5.0
1 6.0
2 7.0
3 8.0
4 9.0
5 10.0
6 10.0
7 10.0
8 13.0
9 14.0
dtype: float64
z.fillna(method='bfill')
0 5.0
1 6.0
2 7.0
3 8.0
4 9.0
5 10.0
6 13.0
7 13.0
8 13.0
9 14.0
dtype: float64
z.fillna(z.mean())
0 5.0
1 6.0
2 7.0
3 8.0
4 9.0
5 10.0
6 9.0
7 9.0
8 13.0
9 14.0
dtype: float64
Working with dates / times¶
We will see more date/time handling in the DataFrame section.
z.index = pd.date_range('01-Jan-2016', periods=len(z))
z
2016-01-01 5.0
2016-01-02 6.0
2016-01-03 7.0
2016-01-04 8.0
2016-01-05 9.0
2016-01-06 10.0
2016-01-07 NaN
2016-01-08 NaN
2016-01-09 13.0
2016-01-10 14.0
Freq: D, dtype: float64
Intelligent aggregation over datetime ranges¶
z.resample('W').sum()
2016-01-03 18.0
2016-01-10 54.0
Freq: W-SUN, dtype: float64
Formatting datetime objects (see http://strftime.org)¶
z.index.strftime('%b %d, %Y')
array(['Jan 01, 2016', 'Jan 02, 2016', 'Jan 03, 2016', 'Jan 04, 2016',
'Jan 05, 2016', 'Jan 06, 2016', 'Jan 07, 2016', 'Jan 08, 2016',
'Jan 09, 2016', 'Jan 10, 2016'],
dtype='<U12')
DataFrame¶
Similar to R.
Titanic data¶
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
titanic = pd.read_csv(url)
titanic.shape
(891, 15)
titanic.columns
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
'alive', 'alone'],
dtype='object')
# For display purposes, we will drop some columns
titanic = titanic[['survived', 'sex', 'age', 'fare',
'embarked', 'class', 'who', 'deck', 'embark_town',]]
titanic.dtypes
survived int64
sex object
age float64
fare float64
embarked object
class object
who object
deck object
embark_town object
dtype: object
Summarizing a data frame¶
titanic.ix[0]
survived 0
sex male
age 22
fare 7.25
embarked S
class Third
who man
deck NaN
embark_town Southampton
Name: 0, dtype: object
titanic.describe()
survived | age | fare | |
---|---|---|---|
count | 891.000000 | 714.000000 | 891.000000 |
mean | 0.383838 | 29.699118 | 32.204208 |
std | 0.486592 | 14.526497 | 49.693429 |
min | 0.000000 | 0.420000 | 0.000000 |
25% | 0.000000 | 20.125000 | 7.910400 |
50% | 0.000000 | 28.000000 | 14.454200 |
75% | 1.000000 | 38.000000 | 31.000000 |
max | 1.000000 | 80.000000 | 512.329200 |
titanic.head()
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | male | 22.0 | 7.2500 | S | Third | man | NaN | Southampton |
1 | 1 | female | 38.0 | 71.2833 | C | First | woman | C | Cherbourg |
2 | 1 | female | 26.0 | 7.9250 | S | Third | woman | NaN | Southampton |
3 | 1 | female | 35.0 | 53.1000 | S | First | woman | C | Southampton |
4 | 0 | male | 35.0 | 8.0500 | S | Third | man | NaN | Southampton |
titanic.tail()
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
886 | 0 | male | 27.0 | 13.00 | S | Second | man | NaN | Southampton |
887 | 1 | female | 19.0 | 30.00 | S | First | woman | B | Southampton |
888 | 0 | female | NaN | 23.45 | S | Third | woman | NaN | Southampton |
889 | 1 | male | 26.0 | 30.00 | C | First | man | C | Cherbourg |
890 | 0 | male | 32.0 | 7.75 | Q | Third | man | NaN | Queenstown |
titanic.columns
Index(['survived', 'sex', 'age', 'fare', 'embarked', 'class', 'who', 'deck',
'embark_town'],
dtype='object')
titanic.index
RangeIndex(start=0, stop=891, step=1)
Indexing¶
titanic[['sex', 'age', 'class']].head()
sex | age | class | |
---|---|---|---|
0 | male | 22.0 | Third |
1 | female | 38.0 | First |
2 | female | 26.0 | Third |
3 | female | 35.0 | First |
4 | male | 35.0 | Third |
titanic[10:15]
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
10 | 1 | female | 4.0 | 16.7000 | S | Third | child | G | Southampton |
11 | 1 | female | 58.0 | 26.5500 | S | First | woman | C | Southampton |
12 | 0 | male | 20.0 | 8.0500 | S | Third | man | NaN | Southampton |
13 | 0 | male | 39.0 | 31.2750 | S | Third | man | NaN | Southampton |
14 | 0 | female | 14.0 | 7.8542 | S | Third | child | NaN | Southampton |
Using the ix
helper for indexing¶
titanic.ix[10:15, 'age':'fare']
age | fare | |
---|---|---|
10 | 4.0 | 16.7000 |
11 | 58.0 | 26.5500 |
12 | 20.0 | 8.0500 |
13 | 39.0 | 31.2750 |
14 | 14.0 | 7.8542 |
15 | 55.0 | 16.0000 |
titanic.ix[10:15, [1,3,5]]
sex | fare | class | |
---|---|---|---|
10 | female | 16.7000 | Third |
11 | female | 26.5500 | First |
12 | male | 8.0500 | Third |
13 | male | 31.2750 | Third |
14 | female | 7.8542 | Third |
15 | female | 16.0000 | Second |
titanic[titanic.age < 2]
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
78 | 1 | male | 0.83 | 29.0000 | S | Second | child | NaN | Southampton |
164 | 0 | male | 1.00 | 39.6875 | S | Third | child | NaN | Southampton |
172 | 1 | female | 1.00 | 11.1333 | S | Third | child | NaN | Southampton |
183 | 1 | male | 1.00 | 39.0000 | S | Second | child | F | Southampton |
305 | 1 | male | 0.92 | 151.5500 | S | First | child | C | Southampton |
381 | 1 | female | 1.00 | 15.7417 | C | Third | child | NaN | Cherbourg |
386 | 0 | male | 1.00 | 46.9000 | S | Third | child | NaN | Southampton |
469 | 1 | female | 0.75 | 19.2583 | C | Third | child | NaN | Cherbourg |
644 | 1 | female | 0.75 | 19.2583 | C | Third | child | NaN | Cherbourg |
755 | 1 | male | 0.67 | 14.5000 | S | Second | child | NaN | Southampton |
788 | 1 | male | 1.00 | 20.5750 | S | Third | child | NaN | Southampton |
803 | 1 | male | 0.42 | 8.5167 | C | Third | child | NaN | Cherbourg |
827 | 1 | male | 1.00 | 37.0042 | C | Second | child | NaN | Cherbourg |
831 | 1 | male | 0.83 | 18.7500 | S | Second | child | NaN | Southampton |
Sorting and ordering data¶
titanic.sort_index().head()
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | male | 22.0 | 7.2500 | S | Third | man | NaN | Southampton |
1 | 1 | female | 38.0 | 71.2833 | C | First | woman | C | Cherbourg |
2 | 1 | female | 26.0 | 7.9250 | S | Third | woman | NaN | Southampton |
3 | 1 | female | 35.0 | 53.1000 | S | First | woman | C | Southampton |
4 | 0 | male | 35.0 | 8.0500 | S | Third | man | NaN | Southampton |
titanic.sort_values('age', ascending=True).head()
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
803 | 1 | male | 0.42 | 8.5167 | C | Third | child | NaN | Cherbourg |
755 | 1 | male | 0.67 | 14.5000 | S | Second | child | NaN | Southampton |
644 | 1 | female | 0.75 | 19.2583 | C | Third | child | NaN | Cherbourg |
469 | 1 | female | 0.75 | 19.2583 | C | Third | child | NaN | Cherbourg |
78 | 1 | male | 0.83 | 29.0000 | S | Second | child | NaN | Southampton |
titanic.sort_values(['survived', 'age'], ascending=[True, True]).head()
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
164 | 0 | male | 1.0 | 39.6875 | S | Third | child | NaN | Southampton |
386 | 0 | male | 1.0 | 46.9000 | S | Third | child | NaN | Southampton |
7 | 0 | male | 2.0 | 21.0750 | S | Third | child | NaN | Southampton |
16 | 0 | male | 2.0 | 29.1250 | Q | Third | child | NaN | Queenstown |
119 | 0 | female | 2.0 | 31.2750 | S | Third | child | NaN | Southampton |
Grouping data¶
sex_class = titanic.groupby(['sex', 'class'])
sex_class.count()
survived | age | fare | embarked | who | deck | embark_town | ||
---|---|---|---|---|---|---|---|---|
sex | class | |||||||
female | First | 94 | 85 | 94 | 92 | 94 | 81 | 92 |
Second | 76 | 74 | 76 | 76 | 76 | 10 | 76 | |
Third | 144 | 102 | 144 | 144 | 144 | 6 | 144 | |
male | First | 122 | 101 | 122 | 122 | 122 | 94 | 122 |
Second | 108 | 99 | 108 | 108 | 108 | 6 | 108 | |
Third | 347 | 253 | 347 | 347 | 347 | 6 | 347 |
Why Kate Winslett survived and Leonardo DiCaprio didn’t¶
df = sex_class.mean()
df['survived']
sex class
female First 0.968085
Second 0.921053
Third 0.500000
male First 0.368852
Second 0.157407
Third 0.135447
Name: survived, dtype: float64
Of the females who were in first class, count the number from each embarking town¶
sex_class.get_group(('female', 'First')).groupby('embark_town').count()
survived | sex | age | fare | embarked | class | who | deck | |
---|---|---|---|---|---|---|---|---|
embark_town | ||||||||
Cherbourg | 43 | 43 | 38 | 43 | 43 | 43 | 43 | 35 |
Queenstown | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Southampton | 48 | 48 | 44 | 48 | 48 | 48 | 48 | 43 |
Cross-tabulation¶
pd.crosstab(titanic.survived, titanic['class'])
class | First | Second | Third |
---|---|---|---|
survived | |||
0 | 80 | 97 | 372 |
1 | 136 | 87 | 119 |
We can also get multiple summaries at the same time¶
def my_func(x):
return np.max(x)
mapped_funcs = {'embarked': 'count', 'age': ('mean', 'median', my_func), 'survived': sum}
sex_class.get_group(('female', 'First')).groupby('embark_town').agg(mapped_funcs)
survived | embarked | age | |||
---|---|---|---|---|---|
sum | count | mean | median | my_func | |
embark_town | |||||
Cherbourg | 42 | 43 | 36.052632 | 37.0 | 60.0 |
Queenstown | 1 | 1 | 33.000000 | 33.0 | 33.0 |
Southampton | 46 | 48 | 32.704545 | 33.0 | 63.0 |
titanic.columns
Index(['survived', 'sex', 'age', 'fare', 'embarked', 'class', 'who', 'deck',
'embark_town'],
dtype='object')
Visualizing tables¶
See more examples in the Graphics notebook.
import seaborn as sns
sns.set_context(font_scale=4)
sns.factorplot(x='sex', y='age', hue='survived', col='class', kind='box', data=titanic)
pass
Making plots with pandas
¶
from pandas_datareader import data as web
import datetime
apple = web.DataReader('AAPL', 'google',
start = datetime.datetime(2015, 1, 1),
end = datetime.datetime(2015, 12, 31))
apple.head()
Open | High | Low | Close | Volume | |
---|---|---|---|---|---|
Date | |||||
2015-01-02 | 111.39 | 111.44 | 107.35 | 109.33 | 53204626 |
2015-01-05 | 108.29 | 108.65 | 105.41 | 106.25 | 64285491 |
2015-01-06 | 106.54 | 107.43 | 104.63 | 106.26 | 65797116 |
2015-01-07 | 107.20 | 108.20 | 106.70 | 107.75 | 40105934 |
2015-01-08 | 109.23 | 112.15 | 108.70 | 111.89 | 59364547 |
apple.plot.line(y='Close', marker='o', markersize=3, linewidth=0.5)
pass
# Zoom in on large drop in August
aug = apple['2015-08-01':'2015-08-30']
aug.plot.line(y=['High', 'Low', 'Open', 'Close'], marker='o', markersize=10, linewidth=1)
pass
Data conversions¶
One of the nicest features of pandas
is the ease of converting
tabular data across different storage formats. We will illustrate by
converting the titanic
dataframe into multiple formats.
titanic.head(2)
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | male | 22.0 | 7.2500 | S | Third | man | NaN | Southampton |
1 | 1 | female | 38.0 | 71.2833 | C | First | woman | C | Cherbourg |
titanic.to_csv('titanic.csv', index=False)
t1 = pd.read_csv('titanic.csv')
t1.head(2)
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | male | 22.0 | 7.2500 | S | Third | man | NaN | Southampton |
1 | 1 | female | 38.0 | 71.2833 | C | First | woman | C | Cherbourg |
!pip install openpyxl
t1.to_excel('titanic.xlsx')
Requirement already satisfied: openpyxl in /Users/cliburn/anaconda2/lib/python2.7/site-packages
Requirement already satisfied: jdcal in /Users/cliburn/anaconda2/lib/python2.7/site-packages (from openpyxl)
Requirement already satisfied: et_xmlfile in /Users/cliburn/anaconda2/lib/python2.7/site-packages (from openpyxl)
t2 = pd.read_excel('titanic.xlsx')
t2.head(2)
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | male | 22.0 | 7.2500 | S | Third | man | NaN | Southampton |
1 | 1 | female | 38.0 | 71.2833 | C | First | woman | C | Cherbourg |
import sqlite3
con = sqlite3.connect('titanic.db')
t2.to_sql('titanic', con, index=False, if_exists='replace')
t3 = pd.read_sql('select * from titanic', con)
t3.head(2)
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | male | 22.0 | 7.2500 | S | Third | man | None | Southampton |
1 | 1 | female | 38.0 | 71.2833 | C | First | woman | C | Cherbourg |
t3.to_json('titanic.json')
t4 = pd.read_json('titanic.json')
t4.head(2)
age | class | deck | embark_town | embarked | fare | sex | survived | who | |
---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | Third | None | Southampton | S | 7.2500 | male | 0 | man |
1 | 38.0 | First | C | Cherbourg | C | 71.2833 | female | 1 | woman |
t4 = t4[t3.columns]
t4.head(2)
survived | sex | age | fare | embarked | class | who | deck | embark_town | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | male | 22.0 | 7.2500 | S | Third | man | None | Southampton |
1 | 1 | female | 38.0 | 71.2833 | C | First | woman | C | Cherbourg |