Data Archival and Management (Part 4)¶

In [1]:

import numpy as np
import h5py
import arrow

Using `pickle`¶

`pickle` is the standard library's serialization module and probably the default choice for most Python developers. Its main disadvantage is that the format is Python-specific, so pickled data cannot easily be loaded from other languages. However, it is convenient if your project uses only Python.

In [1]:

import pickle

Source: PokeAPI v2

In [2]:

bulbasaur = {
    "id": 1,
    "name": "bulbasaur",
    "base_experience": 64,
    "height": 7,
    "is_default": True,
    "order": 1,
    "weight": 69,
    "abilities": [
        {
            "is_hidden": True,
            "slot": 3,
            "ability": {
                "name": "chlorophyll",
                "url": "http://pokeapi.co/api/v2/ability/34/"
            }
        }
    ]
}

Pickle protocols¶

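Objects can be pickled using one of several protocols. When this notebook was written there were five (0 to 4) and `pickle.HIGHEST_PROTOCOL` was 4; Python 3.8 added protocol 5. In general, use `pickle.HIGHEST_PROTOCOL`, as it is the most flexible and supports very large objects, unless you need to share data with Python 2, in which case use protocol 2.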
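As a rough illustration of how the protocols differ, the sketch below pickles a small dictionary with every protocol the running interpreter supports and compares the payload sizes. The record is a cut-down stand-in for the one above, and the names `record`, `sizes`, and `legacy` are assumptions made for this example.

```python
import pickle

# Illustrative record; any picklable object would do.
record = {"id": 1, "name": "bulbasaur", "weight": 69}

# The default and highest protocol numbers depend on the Python version.
print("default:", pickle.DEFAULT_PROTOCOL, "highest:", pickle.HIGHEST_PROTOCOL)

# Pickle the same object with every supported protocol and record the size.
sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    payload = pickle.dumps(record, protocol=protocol)
    assert pickle.loads(payload) == record  # every protocol round-trips
    sizes[protocol] = len(payload)
print(sizes)

# Protocol 2 is the newest protocol that Python 2 can read.
legacy = pickle.dumps(record, protocol=2)
```

Streams written with protocol 2 or later start with the byte `\x80` followed by the protocol number, which is a quick way to tell which protocol produced a given payload.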

Serialize¶

In [3]:

with open('data/bulbasaur.pickle', 'wb') as f:
    pickle.dump(bulbasaur, f, pickle.HIGHEST_PROTOCOL)
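The cell above assumes that the `data/` directory already exists; `open` raises `FileNotFoundError` otherwise. A minimal sketch of creating the directory first with `pathlib` follows. It writes under a temporary directory so that it does not touch your project, and the names `base`, `out_dir`, `path`, and `record` are assumptions for this example.

```python
import pickle
import tempfile
from pathlib import Path

# Illustrative record standing in for the dictionary defined above.
record = {"id": 1, "name": "bulbasaur"}

# Use a throwaway base directory; in a real project this would be Path(".").
base = Path(tempfile.mkdtemp())
out_dir = base / "data"

# Create the directory (and any missing parents); no error if it exists.
out_dir.mkdir(parents=True, exist_ok=True)

path = out_dir / "bulbasaur.pickle"
with path.open("wb") as f:
    pickle.dump(record, f, pickle.HIGHEST_PROTOCOL)

print(path.exists(), path.stat().st_size)
```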

De-serialize¶

In [4]:

with open('data/bulbasaur.pickle', 'rb') as f:
    pokemon = pickle.load(f)

In [5]:

pokemon

Out[5]:

{'abilities': [{'ability': {'name': 'chlorophyll',
    'url': 'http://pokeapi.co/api/v2/ability/34/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 64,
 'height': 7,
 'id': 1,
 'is_default': True,
 'name': 'bulbasaur',
 'order': 1,
 'weight': 69}

Serialize to byte string¶

This returns the pickled object as a byte string (useful for sending to another machine) instead of saving it to a file.

In [6]:

s = pickle.dumps(bulbasaur, pickle.HIGHEST_PROTOCOL)

In [7]:

s

Out[7]:

b'\x80\x04\x95\xd5\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x02id\x94K\x01\x8c\x04name\x94\x8c\tbulbasaur\x94\x8c\x0fbase_experience\x94K@\x8c\x06height\x94K\x07\x8c\nis_default\x94\x88\x8c\x05order\x94K\x01\x8c\x06weight\x94KE\x8c\tabilities\x94]\x94}\x94(\x8c\tis_hidden\x94\x88\x8c\x04slot\x94K\x03\x8c\x07ability\x94}\x94(h\x02\x8c\x0bchlorophyll\x94\x8c\x03url\x94\x8c$http://pokeapi.co/api/v2/ability/34/\x94uuau.'

De-serialize from byte string¶

In [8]:

pokemon2 = pickle.loads(s)

In [9]:

pokemon2

Out[9]:

{'abilities': [{'ability': {'name': 'chlorophyll',
    'url': 'http://pokeapi.co/api/v2/ability/34/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 64,
 'height': 7,
 'id': 1,
 'is_default': True,
 'name': 'bulbasaur',
 'order': 1,
 'weight': 69}
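A quick way to check a round trip is to compare the restored object with the original. The sketch below uses a shortened, illustrative record rather than the full dictionary above; `record`, `payload`, and `restored` are names chosen for this example.

```python
import pickle

# Shortened stand-in for the nested dictionary used above.
record = {"name": "bulbasaur", "abilities": [{"slot": 3, "is_hidden": True}]}

payload = pickle.dumps(record, pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

print(type(payload).__name__)  # bytes
print(restored == record)      # True: same contents
print(restored is record)      # False: a new object was built
# Nested containers are rebuilt too, so the result behaves like a deep copy.
print(restored["abilities"] is record["abilities"])  # False
```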

Using Feather¶

Feather is a highly optimized binary serialization format for columnar tabular data that is useful for loading and saving large data frames. It can also be used to share large data frames between Python, R, and Julia.

Installation in Python

pip3 install feather-format

Installation in R

install.packages("feather")

In [10]:

from pandas_datareader import data
import arrow
import feather

Download data from Google Finance¶

In [11]:

start = arrow.get('2010-01-01')
end = arrow.get('2016-12-31')
tickers = ['AAPL', 'MSFT', 'SPY']
data_source = 'google'
panel = data.DataReader(tickers, data_source, start.datetime, end.datetime)

In [12]:

panel.keys()

Out[12]:

Index(['Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')

Format closing prices¶

In [13]:

close = panel.loc['Close']
close = close.reset_index()
close.head()

Out[13]:

	Date	AAPL	MSFT	SPY
0	2016-11-14	105.71	58.12	216.59
1	2016-11-15	107.11	58.87	218.28
2	2016-11-16	109.99	59.65	217.87
3	2016-11-17	109.95	60.64	218.99
4	2016-11-18	110.06	60.35	218.50
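Feather expects a data frame with a default integer index, which is presumably why the index is reset above: `reset_index` moves the dates out of the index and into an ordinary `Date` column. A minimal sketch of that step, using only the first two rows of the table shown above (the names `prices` and `flat` are assumptions for this example):

```python
import pandas as pd

# First two rows of the closing prices shown above, indexed by date.
prices = pd.DataFrame(
    {"AAPL": [105.71, 107.11], "MSFT": [58.12, 58.87], "SPY": [216.59, 218.28]},
    index=pd.DatetimeIndex(["2016-11-14", "2016-11-15"], name="Date"),
)

# Move the named index into a regular column; the new index is 0, 1, ...
flat = prices.reset_index()

print(flat.columns.tolist())
print(flat.index.tolist())
```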

Serialize¶

In [14]:

feather.write_dataframe(close, 'data/close.feather')

De-serialize¶

In [15]:

close2 = feather.read_dataframe('data/close.feather')
close2.head()

Out[15]:

	Date	AAPL	MSFT	SPY
0	2016-11-14	105.71	58.12	216.59
1	2016-11-15	107.11	58.87	218.28
2	2016-11-16	109.99	59.65	217.87
3	2016-11-17	109.95	60.64	218.99
4	2016-11-18	110.06	60.35	218.50

Sharing data frames between R and Python¶

The primary use of feather is to share large amounts of data between Python and R efficiently. Of course, R also has a feather package.

In [16]:

%load_ext rpy2.ipython

In [17]:

%%R

library(feather)
close <- read_feather('data/close.feather')
head(close)

# A tibble: 6 x 4
        Date   AAPL  MSFT    SPY
      <dttm>  <dbl> <dbl>  <dbl>
1 2016-11-14 105.71 58.12 216.59
2 2016-11-15 107.11 58.87 218.28
3 2016-11-16 109.99 59.65 217.87
4 2016-11-17 109.95 60.64 218.99
5 2016-11-18 110.06 60.35 218.50
6 2016-11-21 111.73 60.86 220.15

In [18]:

%%R

write_feather(close, 'data/closeR.feather')

In [19]:

close3 = feather.read_dataframe('data/closeR.feather')
close3.head()

Out[19]:

	Date	AAPL	MSFT	SPY
0	2016-11-14	105.71	58.12	216.59
1	2016-11-15	107.11	58.87	218.28
2	2016-11-16	109.99	59.65	217.87
3	2016-11-17	109.95	60.64	218.99
4	2016-11-18	110.06	60.35	218.50
