Data

Resources

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

Working with Series

x = Series(range(5,10))
x
0    5
1    6
2    7
3    8
4    9
dtype: int64

We cna treat Series objects much like numpy vectors

x.sum(), x.mean(), x.std()
(35, 7.0, 1.5811388300841898)
x**2
0    25
1    36
2    49
3    64
4    81
dtype: int64
x[x >= 8]
3    8
4    9
dtype: int64

Series can also contain more information than numpy vectors

You can always use standard positional indexing

x[1:4]
1    6
2    7
3    8
dtype: int64

Series index

But you can also assign labeled indexes.

x.index = list('abcde')
x
a    5
b    6
c    7
d    8
e    9
dtype: int64

Note that with labels, the end index is included

x['a':'c']
a    5
b    6
c    7
dtype: int64

Even when you have a labeled index, positional arguments still work

x[1:4]
b    6
c    7
d    8
dtype: int64
x.a, x.c, x.e
(5, 7, 9)

Working with missing data

Missing data is indicated with NaN (not a number).

y = Series([10, np.nan, np.nan, 13, 14])
y
0    10.0
1     NaN
2     NaN
3    13.0
4    14.0
dtype: float64

Concatenating two series

z = pd.concat([x, y])
z
a     5.0
b     6.0
c     7.0
d     8.0
e     9.0
0    10.0
1     NaN
2     NaN
3    13.0
4    14.0
dtype: float64

Reset index to default

z = z.reset_index(drop=True)
z
0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
6     NaN
7     NaN
8    13.0
9    14.0
dtype: float64

pandas aggregate functions ignore missing data

z.sum(), z.mean(), z.std()
(72.0, 9.0, 3.2071349029490928)

Selecting missing values

z[z.isnull()]
6   NaN
7   NaN
dtype: float64

Selecting non-missing values

z[z.notnull()]
0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
8    13.0
9    14.0
dtype: float64

Replacement of missing values

z.fillna(0)
0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
6     0.0
7     0.0
8    13.0
9    14.0
dtype: float64
z.fillna(method='ffill')
0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
6    10.0
7    10.0
8    13.0
9    14.0
dtype: float64
z.fillna(method='bfill')
0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
6    13.0
7    13.0
8    13.0
9    14.0
dtype: float64
z.fillna(z.mean())
0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
6     9.0
7     9.0
8    13.0
9    14.0
dtype: float64

Working with dates / times

We will see more date/time handling in the DataFrame section.

z.index = pd.date_range('01-Jan-2016', periods=len(z))
z
2016-01-01     5.0
2016-01-02     6.0
2016-01-03     7.0
2016-01-04     8.0
2016-01-05     9.0
2016-01-06    10.0
2016-01-07     NaN
2016-01-08     NaN
2016-01-09    13.0
2016-01-10    14.0
Freq: D, dtype: float64

Intelligent aggregation over datetime ranges

z.resample('W').sum()
2016-01-03    18.0
2016-01-10    54.0
Freq: W-SUN, dtype: float64

Formatting datetime objects (see http://strftime.org)

z.index.strftime('%b %d, %Y')
array(['Jan 01, 2016', 'Jan 02, 2016', 'Jan 03, 2016', 'Jan 04, 2016',
       'Jan 05, 2016', 'Jan 06, 2016', 'Jan 07, 2016', 'Jan 08, 2016',
       'Jan 09, 2016', 'Jan 10, 2016'],
      dtype='<U12')

DataFrame

Similar to R.

Titanic data

url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
titanic = pd.read_csv(url)
titanic.shape
(891, 15)
titanic.columns
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')
# For display purposes, we will drop some columns
titanic = titanic[['survived', 'sex', 'age', 'fare',
                   'embarked', 'class', 'who', 'deck', 'embark_town',]]
titanic.dtypes
survived         int64
sex             object
age            float64
fare           float64
embarked        object
class           object
who             object
deck            object
embark_town     object
dtype: object

Summarizing a data frame

titanic.ix[0]
survived                 0
sex                   male
age                     22
fare                  7.25
embarked                 S
class                Third
who                    man
deck                   NaN
embark_town    Southampton
Name: 0, dtype: object
titanic.describe()
survived age fare
count 891.000000 714.000000 891.000000
mean 0.383838 29.699118 32.204208
std 0.486592 14.526497 49.693429
min 0.000000 0.420000 0.000000
25% 0.000000 20.125000 7.910400
50% 0.000000 28.000000 14.454200
75% 1.000000 38.000000 31.000000
max 1.000000 80.000000 512.329200
titanic.head()
survived sex age fare embarked class who deck embark_town
0 0 male 22.0 7.2500 S Third man NaN Southampton
1 1 female 38.0 71.2833 C First woman C Cherbourg
2 1 female 26.0 7.9250 S Third woman NaN Southampton
3 1 female 35.0 53.1000 S First woman C Southampton
4 0 male 35.0 8.0500 S Third man NaN Southampton
titanic.tail()
survived sex age fare embarked class who deck embark_town
886 0 male 27.0 13.00 S Second man NaN Southampton
887 1 female 19.0 30.00 S First woman B Southampton
888 0 female NaN 23.45 S Third woman NaN Southampton
889 1 male 26.0 30.00 C First man C Cherbourg
890 0 male 32.0 7.75 Q Third man NaN Queenstown
titanic.columns
Index(['survived', 'sex', 'age', 'fare', 'embarked', 'class', 'who', 'deck',
       'embark_town'],
      dtype='object')
titanic.index
RangeIndex(start=0, stop=891, step=1)

Indexing

titanic[['sex', 'age', 'class']].head()
sex age class
0 male 22.0 Third
1 female 38.0 First
2 female 26.0 Third
3 female 35.0 First
4 male 35.0 Third
titanic[10:15]
survived sex age fare embarked class who deck embark_town
10 1 female 4.0 16.7000 S Third child G Southampton
11 1 female 58.0 26.5500 S First woman C Southampton
12 0 male 20.0 8.0500 S Third man NaN Southampton
13 0 male 39.0 31.2750 S Third man NaN Southampton
14 0 female 14.0 7.8542 S Third child NaN Southampton

Using the ix helper for indexing

titanic.ix[10:15, 'age':'fare']
age fare
10 4.0 16.7000
11 58.0 26.5500
12 20.0 8.0500
13 39.0 31.2750
14 14.0 7.8542
15 55.0 16.0000
titanic.ix[10:15, [1,3,5]]
sex fare class
10 female 16.7000 Third
11 female 26.5500 First
12 male 8.0500 Third
13 male 31.2750 Third
14 female 7.8542 Third
15 female 16.0000 Second
titanic[titanic.age < 2]
survived sex age fare embarked class who deck embark_town
78 1 male 0.83 29.0000 S Second child NaN Southampton
164 0 male 1.00 39.6875 S Third child NaN Southampton
172 1 female 1.00 11.1333 S Third child NaN Southampton
183 1 male 1.00 39.0000 S Second child F Southampton
305 1 male 0.92 151.5500 S First child C Southampton
381 1 female 1.00 15.7417 C Third child NaN Cherbourg
386 0 male 1.00 46.9000 S Third child NaN Southampton
469 1 female 0.75 19.2583 C Third child NaN Cherbourg
644 1 female 0.75 19.2583 C Third child NaN Cherbourg
755 1 male 0.67 14.5000 S Second child NaN Southampton
788 1 male 1.00 20.5750 S Third child NaN Southampton
803 1 male 0.42 8.5167 C Third child NaN Cherbourg
827 1 male 1.00 37.0042 C Second child NaN Cherbourg
831 1 male 0.83 18.7500 S Second child NaN Southampton

Sorting and ordering data

titanic.sort_index().head()
survived sex age fare embarked class who deck embark_town
0 0 male 22.0 7.2500 S Third man NaN Southampton
1 1 female 38.0 71.2833 C First woman C Cherbourg
2 1 female 26.0 7.9250 S Third woman NaN Southampton
3 1 female 35.0 53.1000 S First woman C Southampton
4 0 male 35.0 8.0500 S Third man NaN Southampton
titanic.sort_values('age', ascending=True).head()
survived sex age fare embarked class who deck embark_town
803 1 male 0.42 8.5167 C Third child NaN Cherbourg
755 1 male 0.67 14.5000 S Second child NaN Southampton
644 1 female 0.75 19.2583 C Third child NaN Cherbourg
469 1 female 0.75 19.2583 C Third child NaN Cherbourg
78 1 male 0.83 29.0000 S Second child NaN Southampton
titanic.sort_values(['survived', 'age'], ascending=[True, True]).head()
survived sex age fare embarked class who deck embark_town
164 0 male 1.0 39.6875 S Third child NaN Southampton
386 0 male 1.0 46.9000 S Third child NaN Southampton
7 0 male 2.0 21.0750 S Third child NaN Southampton
16 0 male 2.0 29.1250 Q Third child NaN Queenstown
119 0 female 2.0 31.2750 S Third child NaN Southampton

Grouping data

sex_class = titanic.groupby(['sex', 'class'])
sex_class.count()
survived age fare embarked who deck embark_town
sex class
female First 94 85 94 92 94 81 92
Second 76 74 76 76 76 10 76
Third 144 102 144 144 144 6 144
male First 122 101 122 122 122 94 122
Second 108 99 108 108 108 6 108
Third 347 253 347 347 347 6 347

Why Kate Winslett survived and Leonardo DiCaprio didn’t

df = sex_class.mean()
df['survived']
sex     class
female  First     0.968085
        Second    0.921053
        Third     0.500000
male    First     0.368852
        Second    0.157407
        Third     0.135447
Name: survived, dtype: float64

Of the females who were in first class, count the number from each embarking town

sex_class.get_group(('female', 'First')).groupby('embark_town').count()
survived sex age fare embarked class who deck
embark_town
Cherbourg 43 43 38 43 43 43 43 35
Queenstown 1 1 1 1 1 1 1 1
Southampton 48 48 44 48 48 48 48 43

Cross-tabulation

pd.crosstab(titanic.survived, titanic['class'])
class First Second Third
survived
0 80 97 372
1 136 87 119

We can also get multiple summaries at the same time

def my_func(x):
    return np.max(x)
mapped_funcs = {'embarked': 'count', 'age': ('mean', 'median', my_func), 'survived': sum}
sex_class.get_group(('female', 'First')).groupby('embark_town').agg(mapped_funcs)
survived embarked age
sum count mean median my_func
embark_town
Cherbourg 42 43 36.052632 37.0 60.0
Queenstown 1 1 33.000000 33.0 33.0
Southampton 46 48 32.704545 33.0 63.0
titanic.columns
Index(['survived', 'sex', 'age', 'fare', 'embarked', 'class', 'who', 'deck',
       'embark_town'],
      dtype='object')

Visualizing tables

See more examples in the Graphics notebook.

import seaborn as sns
sns.set_context(font_scale=4)
sns.factorplot(x='sex', y='age', hue='survived', col='class', kind='box', data=titanic)
pass
_images/07_Data_84_0.png

Making plots with pandas

from pandas_datareader import data as web
import datetime
apple = web.DataReader('AAPL', 'google',
                        start = datetime.datetime(2015, 1, 1),
                        end = datetime.datetime(2015, 12, 31))
apple.head()
Open High Low Close Volume
Date
2015-01-02 111.39 111.44 107.35 109.33 53204626
2015-01-05 108.29 108.65 105.41 106.25 64285491
2015-01-06 106.54 107.43 104.63 106.26 65797116
2015-01-07 107.20 108.20 106.70 107.75 40105934
2015-01-08 109.23 112.15 108.70 111.89 59364547
apple.plot.line(y='Close', marker='o', markersize=3, linewidth=0.5)
pass
_images/07_Data_89_0.png
# Zoom in on large drop in August
aug = apple['2015-08-01':'2015-08-30']
aug.plot.line(y=['High', 'Low', 'Open', 'Close'], marker='o', markersize=10, linewidth=1)
pass
_images/07_Data_90_0.png

Data conversions

One of the nicest features of pandas is the ease of converting tabular data across different storage formats. We will illustrate by converting the titanic dataframe into multiple formats.

titanic.head(2)
survived sex age fare embarked class who deck embark_town
0 0 male 22.0 7.2500 S Third man NaN Southampton
1 1 female 38.0 71.2833 C First woman C Cherbourg
titanic.to_csv('titanic.csv', index=False)
t1 = pd.read_csv('titanic.csv')
t1.head(2)
survived sex age fare embarked class who deck embark_town
0 0 male 22.0 7.2500 S Third man NaN Southampton
1 1 female 38.0 71.2833 C First woman C Cherbourg
!pip install openpyxl
t1.to_excel('titanic.xlsx')
Requirement already satisfied: openpyxl in /Users/cliburn/anaconda2/lib/python2.7/site-packages
Requirement already satisfied: jdcal in /Users/cliburn/anaconda2/lib/python2.7/site-packages (from openpyxl)
Requirement already satisfied: et_xmlfile in /Users/cliburn/anaconda2/lib/python2.7/site-packages (from openpyxl)
t2 = pd.read_excel('titanic.xlsx')
t2.head(2)
survived sex age fare embarked class who deck embark_town
0 0 male 22.0 7.2500 S Third man NaN Southampton
1 1 female 38.0 71.2833 C First woman C Cherbourg
import sqlite3

con = sqlite3.connect('titanic.db')
t2.to_sql('titanic', con, index=False, if_exists='replace')
t3 = pd.read_sql('select * from titanic', con)
t3.head(2)
survived sex age fare embarked class who deck embark_town
0 0 male 22.0 7.2500 S Third man None Southampton
1 1 female 38.0 71.2833 C First woman C Cherbourg
t3.to_json('titanic.json')
t4 = pd.read_json('titanic.json')
t4.head(2)
age class deck embark_town embarked fare sex survived who
0 22.0 Third None Southampton S 7.2500 male 0 man
1 38.0 First C Cherbourg C 71.2833 female 1 woman
t4 = t4[t3.columns]
t4.head(2)
survived sex age fare embarked class who deck embark_town
0 0 male 22.0 7.2500 S Third man None Southampton
1 1 female 38.0 71.2833 C First woman C Cherbourg