Data¶

Resources¶

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use('ggplot')

Working with Series¶

x = Series(range(5,10))

  5
  6
  7
  8
  9
dtype: int64

We cna treat Series objects much like numpy vectors¶

x.sum(), x.mean(), x.std()

(35, 7.0, 1.5811388300841898)

x**2

  25
  36
  49
  64
  81
dtype: int64

x[x >= 8]

3    8
4    9
dtype: int64

Series can also contain more information than numpy vectors¶

You can always use standard positional indexing¶

x[1:4]

  6
  7
  8
dtype: int64

Series index¶

But you can also assign labeled indexes.

x.index = list('abcde')
x

a    5
b    6
c    7
d    8
e    9
dtype: int64

Note that with labels, the end index is included¶

x['a':'c']

a    5
b    6
c    7
dtype: int64

Even when you have a labeled index, positional arguments still work¶

x[1:4]

b    6
c    7
d    8
dtype: int64

x.a, x.c, x.e

(5, 7, 9)

Working with missing data¶

Missing data is indicated with NaN (not a number).

y = Series([10, np.nan, np.nan, 13, 14])
y

  10.0
   NaN
   NaN
  13.0
  14.0
dtype: float64

Concatenating two series¶

z = pd.concat([x, y])
z

a     5.0
b     6.0
c     7.0
d     8.0
e     9.0
0    10.0
1     NaN
2     NaN
3    13.0
4    14.0
dtype: float64

Reset index to default¶

z = z.reset_index(drop=True)
z

   5.0
   6.0
   7.0
   8.0
   9.0
  10.0
   NaN
   NaN
  13.0
  14.0
dtype: float64

`pandas` aggregate functions ignore missing data¶

z.sum(), z.mean(), z.std()

(72.0, 9.0, 3.2071349029490928)

Selecting missing values¶

z[z.isnull()]

6   NaN
7   NaN
dtype: float64

Selecting non-missing values¶

z[z.notnull()]

   5.0
   6.0
   7.0
   8.0
   9.0
  10.0
  13.0
  14.0
dtype: float64

Replacement of missing values¶

z.fillna(0)

   5.0
   6.0
   7.0
   8.0
   9.0
  10.0
   0.0
   0.0
  13.0
  14.0
dtype: float64

z.fillna(method='ffill')

   5.0
   6.0
   7.0
   8.0
   9.0
  10.0
  10.0
  10.0
  13.0
  14.0
dtype: float64

z.fillna(method='bfill')

   5.0
   6.0
   7.0
   8.0
   9.0
  10.0
  13.0
  13.0
  13.0
  14.0
dtype: float64

z.fillna(z.mean())

   5.0
   6.0
   7.0
   8.0
   9.0
  10.0
   9.0
   9.0
  13.0
  14.0
dtype: float64

Working with dates / times¶

We will see more date/time handling in the DataFrame section.

z.index = pd.date_range('01-Jan-2016', periods=len(z))

2016-01-01     5.0
2016-01-02     6.0
2016-01-03     7.0
2016-01-04     8.0
2016-01-05     9.0
2016-01-06    10.0
2016-01-07     NaN
2016-01-08     NaN
2016-01-09    13.0
2016-01-10    14.0
Freq: D, dtype: float64

Intelligent aggregation over datetime ranges¶

z.resample('W').sum()

2016-01-03    18.0
2016-01-10    54.0
Freq: W-SUN, dtype: float64

Formatting datetime objects (see http://strftime.org)¶

z.index.strftime('%b %d, %Y')

array(['Jan 01, 2016', 'Jan 02, 2016', 'Jan 03, 2016', 'Jan 04, 2016',
       'Jan 05, 2016', 'Jan 06, 2016', 'Jan 07, 2016', 'Jan 08, 2016',
       'Jan 09, 2016', 'Jan 10, 2016'],
      dtype='<U12')

DataFrame¶

Similar to R.

Titanic data¶

url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
titanic = pd.read_csv(url)

titanic.shape

(891, 15)

titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

# For display purposes, we will drop some columns
titanic = titanic[['survived', 'sex', 'age', 'fare',
                   'embarked', 'class', 'who', 'deck', 'embark_town',]]

titanic.dtypes

survived         int64
sex             object
age            float64
fare           float64
embarked        object
class           object
who             object
deck            object
embark_town     object
dtype: object

Summarizing a data frame¶

titanic.ix[0]

survived                 0
sex                   male
age                     22
fare                  7.25
embarked                 S
class                Third
who                    man
deck                   NaN
embark_town    Southampton
Name: 0, dtype: object

titanic.describe()

	survived	age	fare
count	891.000000	714.000000	891.000000
mean	0.383838	29.699118	32.204208
std	0.486592	14.526497	49.693429
min	0.000000	0.420000	0.000000
25%	0.000000	20.125000	7.910400
50%	0.000000	28.000000	14.454200
75%	1.000000	38.000000	31.000000
max	1.000000	80.000000	512.329200

titanic.head()

	survived	sex	age	fare	embarked	class	who	deck	embark_town
0	0	male	22.0	7.2500	S	Third	man	NaN	Southampton
1	1	female	38.0	71.2833	C	First	woman	C	Cherbourg
2	1	female	26.0	7.9250	S	Third	woman	NaN	Southampton
3	1	female	35.0	53.1000	S	First	woman	C	Southampton
4	0	male	35.0	8.0500	S	Third	man	NaN	Southampton

titanic.tail()

	survived	sex	age	fare	embarked	class	who	deck	embark_town
886	0	male	27.0	13.00	S	Second	man	NaN	Southampton
887	1	female	19.0	30.00	S	First	woman	B	Southampton
888	0	female	NaN	23.45	S	Third	woman	NaN	Southampton
889	1	male	26.0	30.00	C	First	man	C	Cherbourg
890	0	male	32.0	7.75	Q	Third	man	NaN	Queenstown

titanic.columns

Index(['survived', 'sex', 'age', 'fare', 'embarked', 'class', 'who', 'deck',
       'embark_town'],
      dtype='object')

titanic.index

RangeIndex(start=0, stop=891, step=1)

Indexing¶

titanic[['sex', 'age', 'class']].head()

	sex	age	class
0	male	22.0	Third
1	female	38.0	First
2	female	26.0	Third
3	female	35.0	First
4	male	35.0	Third

titanic[10:15]

	survived	sex	age	fare	embarked	class	who	deck	embark_town
10	1	female	4.0	16.7000	S	Third	child	G	Southampton
11	1	female	58.0	26.5500	S	First	woman	C	Southampton
12	0	male	20.0	8.0500	S	Third	man	NaN	Southampton
13	0	male	39.0	31.2750	S	Third	man	NaN	Southampton
14	0	female	14.0	7.8542	S	Third	child	NaN	Southampton

Using the `ix` helper for indexing¶

titanic.ix[10:15, 'age':'fare']

	age	fare
10	4.0	16.7000
11	58.0	26.5500
12	20.0	8.0500
13	39.0	31.2750
14	14.0	7.8542
15	55.0	16.0000

titanic.ix[10:15, [1,3,5]]

	sex	fare	class
10	female	16.7000	Third
11	female	26.5500	First
12	male	8.0500	Third
13	male	31.2750	Third
14	female	7.8542	Third
15	female	16.0000	Second

titanic[titanic.age < 2]

	survived	sex	age	fare	embarked	class	who	deck	embark_town
78	1	male	0.83	29.0000	S	Second	child	NaN	Southampton
164	0	male	1.00	39.6875	S	Third	child	NaN	Southampton
172	1	female	1.00	11.1333	S	Third	child	NaN	Southampton
183	1	male	1.00	39.0000	S	Second	child	F	Southampton
305	1	male	0.92	151.5500	S	First	child	C	Southampton
381	1	female	1.00	15.7417	C	Third	child	NaN	Cherbourg
386	0	male	1.00	46.9000	S	Third	child	NaN	Southampton
469	1	female	0.75	19.2583	C	Third	child	NaN	Cherbourg
644	1	female	0.75	19.2583	C	Third	child	NaN	Cherbourg
755	1	male	0.67	14.5000	S	Second	child	NaN	Southampton
788	1	male	1.00	20.5750	S	Third	child	NaN	Southampton
803	1	male	0.42	8.5167	C	Third	child	NaN	Cherbourg
827	1	male	1.00	37.0042	C	Second	child	NaN	Cherbourg
831	1	male	0.83	18.7500	S	Second	child	NaN	Southampton

Sorting and ordering data¶

titanic.sort_index().head()

	survived	sex	age	fare	embarked	class	who	deck	embark_town
0	0	male	22.0	7.2500	S	Third	man	NaN	Southampton
1	1	female	38.0	71.2833	C	First	woman	C	Cherbourg
2	1	female	26.0	7.9250	S	Third	woman	NaN	Southampton
3	1	female	35.0	53.1000	S	First	woman	C	Southampton
4	0	male	35.0	8.0500	S	Third	man	NaN	Southampton

titanic.sort_values('age', ascending=True).head()

	survived	sex	age	fare	embarked	class	who	deck	embark_town
803	1	male	0.42	8.5167	C	Third	child	NaN	Cherbourg
755	1	male	0.67	14.5000	S	Second	child	NaN	Southampton
644	1	female	0.75	19.2583	C	Third	child	NaN	Cherbourg
469	1	female	0.75	19.2583	C	Third	child	NaN	Cherbourg
78	1	male	0.83	29.0000	S	Second	child	NaN	Southampton

titanic.sort_values(['survived', 'age'], ascending=[True, True]).head()

	sex	age	fare	embarked	class	who	deck	embark_town
164	male	1.0	39.6875	S	Third	child	NaN	Southampton
386	male	1.0	46.9000	S	Third	child	NaN	Southampton
7	male	2.0	21.0750	S	Third	child	NaN	Southampton
16	male	2.0	29.1250	Q	Third	child	NaN	Queenstown
119	female	2.0	31.2750	S	Third	child	NaN	Southampton

Grouping data¶

sex_class = titanic.groupby(['sex', 'class'])

sex_class.count()

		survived	age	fare	embarked	who	deck	embark_town
sex	class
female	First	94	85	94	92	94	81	92
	Second	76	74	76	76	76	10	76
	Third	144	102	144	144	144	6	144
male	First	122	101	122	122	122	94	122
	Second	108	99	108	108	108	6	108
	Third	347	253	347	347	347	6	347

Why Kate Winslett survived and Leonardo DiCaprio didn’t¶

df = sex_class.mean()
df['survived']

sex     class
female  First     0.968085
        Second    0.921053
        Third     0.500000
male    First     0.368852
        Second    0.157407
        Third     0.135447
Name: survived, dtype: float64

Of the females who were in first class, count the number from each embarking town¶

sex_class.get_group(('female', 'First')).groupby('embark_town').count()

	survived	sex	age	fare	embarked	class	who	deck
embark_town
Cherbourg	43	43	38	43	43	43	43	35
Queenstown	1	1	1	1	1	1	1	1
Southampton	48	48	44	48	48	48	48	43

Cross-tabulation¶

pd.crosstab(titanic.survived, titanic['class'])

class	First	Second	Third
survived
0	80	97	372
1	136	87	119

We can also get multiple summaries at the same time¶

def my_func(x):
    return np.max(x)

mapped_funcs = {'embarked': 'count', 'age': ('mean', 'median', my_func), 'survived': sum}
sex_class.get_group(('female', 'First')).groupby('embark_town').agg(mapped_funcs)

	survived	embarked	age
	sum	count	mean	median	my_func
embark_town
Cherbourg	42	43	36.052632	37.0	60.0
Queenstown	1	1	33.000000	33.0	33.0
Southampton	46	48	32.704545	33.0	63.0

titanic.columns

Index(['survived', 'sex', 'age', 'fare', 'embarked', 'class', 'who', 'deck',
       'embark_town'],
      dtype='object')

Visualizing tables¶

See more examples in the Graphics notebook.

import seaborn as sns
sns.set_context(font_scale=4)
sns.factorplot(x='sex', y='age', hue='survived', col='class', kind='box', data=titanic)
pass

Making plots with `pandas`¶

from pandas_datareader import data as web
import datetime

apple = web.DataReader('AAPL', 'google',
                        start = datetime.datetime(2015, 1, 1),
                        end = datetime.datetime(2015, 12, 31))

apple.head()

	Open	High	Low	Close	Volume
Date
2015-01-02	111.39	111.44	107.35	109.33	53204626
2015-01-05	108.29	108.65	105.41	106.25	64285491
2015-01-06	106.54	107.43	104.63	106.26	65797116
2015-01-07	107.20	108.20	106.70	107.75	40105934
2015-01-08	109.23	112.15	108.70	111.89	59364547

apple.plot.line(y='Close', marker='o', markersize=3, linewidth=0.5)
pass

# Zoom in on large drop in August
aug = apple['2015-08-01':'2015-08-30']
aug.plot.line(y=['High', 'Low', 'Open', 'Close'], marker='o', markersize=10, linewidth=1)
pass

Data conversions¶

One of the nicest features of pandas is the ease of converting tabular data across different storage formats. We will illustrate by converting the titanic dataframe into multiple formats.

titanic.head(2)

	survived	sex	age	fare	embarked	class	who	deck	embark_town
0	0	male	22.0	7.2500	S	Third	man	NaN	Southampton
1	1	female	38.0	71.2833	C	First	woman	C	Cherbourg

titanic.to_csv('titanic.csv', index=False)

t1 = pd.read_csv('titanic.csv')
t1.head(2)

	survived	sex	age	fare	embarked	class	who	deck	embark_town
0	0	male	22.0	7.2500	S	Third	man	NaN	Southampton
1	1	female	38.0	71.2833	C	First	woman	C	Cherbourg

!pip install openpyxl
t1.to_excel('titanic.xlsx')

Requirement already satisfied: openpyxl in /Users/cliburn/anaconda2/lib/python2.7/site-packages
Requirement already satisfied: jdcal in /Users/cliburn/anaconda2/lib/python2.7/site-packages (from openpyxl)
Requirement already satisfied: et_xmlfile in /Users/cliburn/anaconda2/lib/python2.7/site-packages (from openpyxl)

t2 = pd.read_excel('titanic.xlsx')
t2.head(2)

	survived	sex	age	fare	embarked	class	who	deck	embark_town
0	0	male	22.0	7.2500	S	Third	man	NaN	Southampton
1	1	female	38.0	71.2833	C	First	woman	C	Cherbourg

import sqlite3

con = sqlite3.connect('titanic.db')
t2.to_sql('titanic', con, index=False, if_exists='replace')

t3 = pd.read_sql('select * from titanic', con)
t3.head(2)

	survived	sex	age	fare	embarked	class	who	deck	embark_town
0	0	male	22.0	7.2500	S	Third	man	None	Southampton
1	1	female	38.0	71.2833	C	First	woman	C	Cherbourg

t3.to_json('titanic.json')

t4 = pd.read_json('titanic.json')
t4.head(2)

	age	class	deck	embark_town	embarked	fare	sex	survived	who
0	22.0	Third	None	Southampton	S	7.2500	male	0	man
1	38.0	First	C	Cherbourg	C	71.2833	female	1	woman

t4 = t4[t3.columns]
t4.head(2)

	survived	sex	age	fare	embarked	class	who	deck	embark_town
0	0	male	22.0	7.2500	S	Third	man	None	Southampton
1	1	female	38.0	71.2833	C	First	woman	C	Cherbourg

Data¶

Resources¶

Working with Series¶

We cna treat Series objects much like numpy vectors¶

Series can also contain more information than numpy vectors¶

You can always use standard positional indexing¶

Series index¶

Note that with labels, the end index is included¶

Even when you have a labeled index, positional arguments still work¶

Working with missing data¶

Concatenating two series¶

Reset index to default¶

`pandas` aggregate functions ignore missing data¶

Selecting missing values¶

Selecting non-missing values¶

Replacement of missing values¶

Working with dates / times¶

Intelligent aggregation over datetime ranges¶

Formatting datetime objects (see http://strftime.org)¶

DataFrame¶

Titanic data¶

Summarizing a data frame¶

Indexing¶

Using the `ix` helper for indexing¶

Sorting and ordering data¶

Grouping data¶

Why Kate Winslett survived and Leonardo DiCaprio didn’t¶

Of the females who were in first class, count the number from each embarking town¶

Cross-tabulation¶

We can also get multiple summaries at the same time¶

Visualizing tables¶

Making plots with `pandas`¶

Data conversions¶

Table Of Contents

Previous page

Next page

This Page

Data¶

Resources¶

Working with Series¶

We cna treat Series objects much like numpy vectors¶

Series can also contain more information than numpy vectors¶

You can always use standard positional indexing¶

Series index¶

Note that with labels, the end index is included¶

Even when you have a labeled index, positional arguments still work¶

Working with missing data¶

Concatenating two series¶

Reset index to default¶

pandas aggregate functions ignore missing data¶

Selecting missing values¶

Selecting non-missing values¶

Replacement of missing values¶

Working with dates / times¶

Intelligent aggregation over datetime ranges¶

Formatting datetime objects (see http://strftime.org)¶

DataFrame¶

Titanic data¶

Summarizing a data frame¶

Indexing¶

Using the ix helper for indexing¶

Sorting and ordering data¶

Grouping data¶

Why Kate Winslett survived and Leonardo DiCaprio didn’t¶

Of the females who were in first class, count the number from each embarking town¶

Cross-tabulation¶

We can also get multiple summaries at the same time¶

Visualizing tables¶

Making plots with pandas¶

Data conversions¶

`pandas` aggregate functions ignore missing data¶

Using the `ix` helper for indexing¶

Making plots with `pandas`¶