Data Archival and Management (Part 4)

In [1]:
import numpy as np
import h5py
import arrow

Using pickle

pickle is probably the default serialization method for most Python developers. Its main disadvantage is that the format is Python-specific, so pickled data cannot easily be loaded from other languages. It is nevertheless convenient if your project is Python-only.

In [1]:
import pickle
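
As a small illustration of why pickle is convenient in a Python-only project, the sketch below round-trips a few built-in types that have no direct JSON equivalent. The `record` dictionary and its `caught_on` date are made up for this example and do not come from PokeAPI.

```python
import datetime
import pickle

# Hypothetical record mixing types that JSON cannot represent directly.
record = {
    "caught_on": datetime.date(2016, 11, 14),  # made-up date, for illustration only
    "types": {"grass", "poison"},              # a set
    "position": (3, 4),                        # a tuple (JSON would turn it into a list)
}

# Serialize to bytes and load straight back.
restored = pickle.loads(pickle.dumps(record))

print(restored == record)          # True
print(type(restored["types"]))     # <class 'set'>
print(type(restored["position"]))  # <class 'tuple'>
```

The types survive unchanged, which is exactly what is lost when the same data has to pass through a language-neutral format.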

Source: PokeAPI v2

In [2]:
bulbasaur = {
    "id": 1,
    "name": "bulbasaur",
    "base_experience": 64,
    "height": 7,
    "is_default": True,
    "order": 1,
    "weight": 69,
    "abilities": [
        {
            "is_hidden": True,
            "slot": 3,
            "ability": {
                "name": "chlorophyll",
                "url": "http://pokeapi.co/api/v2/ability/34/"
            }
        }
    ]
}

Pickle protocols

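Objects can be pickled using one of several protocols. Python 3.4 to 3.7 offer five of them (0 to 4), and Python 3.8 added protocol 5, so the value of HIGHEST_PROTOCOL depends on the interpreter (it is 4 in the session shown here). In general, use HIGHEST_PROTOCOL, since the newer protocols are the most flexible and support very large objects. The exceptions are data that must be read by an older interpreter: a pickle written with protocol 5 cannot be loaded before Python 3.8, and if you need to share with Python 2, use protocol 2.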
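
To see which protocols the running interpreter offers, and how to pin one explicitly, here is a minimal sketch. The dictionary `sample` is a stand-in for any picklable object. The printed values of `HIGHEST_PROTOCOL` and `DEFAULT_PROTOCOL` depend on the Python version, so they are not shown.

```python
import pickle

sample = {"id": 1, "name": "bulbasaur"}  # stand-in object for illustration

# Both constants vary with the Python version in use.
print(pickle.HIGHEST_PROTOCOL)
print(pickle.DEFAULT_PROTOCOL)

# Pin protocol 2 when the data must be readable from Python 2.
py2_compatible = pickle.dumps(sample, protocol=2)
newest = pickle.dumps(sample, protocol=pickle.HIGHEST_PROTOCOL)

# From protocol 2 onwards the stream starts with the PROTO opcode (0x80)
# followed by one byte holding the protocol number.
print(py2_compatible[:2])                    # b'\x80\x02'
print(newest[1] == pickle.HIGHEST_PROTOCOL)  # True

# Either stream loads back to the same object.
assert pickle.loads(py2_compatible) == pickle.loads(newest) == sample
```

Checking the second byte of an existing pickle is a quick way to find out which protocol wrote it.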

Serialize

In [3]:
with open('data/bulbasaur.pickle', 'wb') as f:
    pickle.dump(bulbasaur, f, pickle.HIGHEST_PROTOCOL)

De-serialize

In [4]:
with open('data/bulbasaur.pickle', 'rb') as f:
    pokemon = pickle.load(f)
In [5]:
pokemon
Out[5]:
{'abilities': [{'ability': {'name': 'chlorophyll',
    'url': 'http://pokeapi.co/api/v2/ability/34/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 64,
 'height': 7,
 'id': 1,
 'is_default': True,
 'name': 'bulbasaur',
 'order': 1,
 'weight': 69}

Serialize to byte string

This returns the pickled data as a bytes object instead of writing it to a file, which is useful for sending it to another machine.

In [6]:
s = pickle.dumps(bulbasaur, pickle.HIGHEST_PROTOCOL)
In [7]:
s
Out[7]:
b'\x80\x04\x95\xd5\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x02id\x94K\x01\x8c\x04name\x94\x8c\tbulbasaur\x94\x8c\x0fbase_experience\x94K@\x8c\x06height\x94K\x07\x8c\nis_default\x94\x88\x8c\x05order\x94K\x01\x8c\x06weight\x94KE\x8c\tabilities\x94]\x94}\x94(\x8c\tis_hidden\x94\x88\x8c\x04slot\x94K\x03\x8c\x07ability\x94}\x94(h\x02\x8c\x0bchlorophyll\x94\x8c\x03url\x94\x8c$http://pokeapi.co/api/v2/ability/34/\x94uuau.'

De-serialize from byte string

In [8]:
pokemon2 = pickle.loads(s)
In [9]:
pokemon2
Out[9]:
{'abilities': [{'ability': {'name': 'chlorophyll',
    'url': 'http://pokeapi.co/api/v2/ability/34/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 64,
 'height': 7,
 'id': 1,
 'is_default': True,
 'name': 'bulbasaur',
 'order': 1,
 'weight': 69}
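
Because a byte string like this is often received from another machine, it is worth remembering that `pickle.loads` can execute arbitrary code while loading, so it should only be given data you trust. One possible safeguard, adapted from the "Restricting Globals" pattern in the Python documentation, is to subclass `pickle.Unpickler` and refuse every global lookup. The names `PlainDataUnpickler` and `plain_loads` are invented here for illustration.

```python
import collections
import io
import pickle


class PlainDataUnpickler(pickle.Unpickler):
    """Hypothetical unpickler that only accepts plain built-in data."""

    def find_class(self, module, name):
        # Called whenever the stream references a class or function.
        # Dicts, lists, strings, numbers and booleans never reach this point.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def plain_loads(data):
    """Like pickle.loads, but rejects anything that needs an import."""
    return PlainDataUnpickler(io.BytesIO(data)).load()


payload = pickle.dumps({"id": 1, "name": "bulbasaur", "abilities": [{"slot": 3}]})
print(plain_loads(payload))  # {'id': 1, 'name': 'bulbasaur', 'abilities': [{'slot': 3}]}

# An OrderedDict is pickled as a reference to its class, so it is rejected.
try:
    plain_loads(pickle.dumps(collections.OrderedDict(id=1)))
except pickle.UnpicklingError as err:
    print("rejected:", err)
```

This is a sketch rather than a complete defence: it suits payloads like the dictionary above, which consist only of built-in containers and scalars.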

Using Feather

Feather is a highly optimized binary serialization format for columnar tabular data, useful for saving and loading large data frames. It can also be used to share large data frames between Python, R, and Julia.

Installation in Python

pip3 install feather-format

Installation in R

install.packages("feather")
In [10]:
from pandas_datareader import data
import arrow
import feather

Download data from Google Finance

In [11]:
start = arrow.get('2010-01-01')
end = arrow.get('2016-12-31')
tickers = ['AAPL', 'MSFT', 'SPY']
data_source = 'google'
panel = data.DataReader(tickers, data_source, start.datetime, end.datetime)
In [12]:
panel.keys()
Out[12]:
Index(['Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')

Format closing prices

In [13]:
close = panel.loc['Close']
close = close.reset_index()
close.head()
Out[13]:
        Date    AAPL   MSFT     SPY
0 2016-11-14  105.71  58.12  216.59
1 2016-11-15  107.11  58.87  218.28
2 2016-11-16  109.99  59.65  217.87
3 2016-11-17  109.95  60.64  218.99
4 2016-11-18  110.06  60.35  218.50

Serialize

In [14]:
feather.write_dataframe(close, 'data/close.feather')

De-serialize

In [15]:
close2 = feather.read_dataframe('data/close.feather')
close2.head()
Out[15]:
        Date    AAPL   MSFT     SPY
0 2016-11-14  105.71  58.12  216.59
1 2016-11-15  107.11  58.87  218.28
2 2016-11-16  109.99  59.65  217.87
3 2016-11-17  109.95  60.64  218.99
4 2016-11-18  110.06  60.35  218.50

Sharing data frames between R and Python

The primary use of Feather is to share large amounts of data efficiently between Python and R. R has its own feather package, so the file written above can be read there directly.

In [16]:
%load_ext rpy2.ipython
In [17]:
%%R

library(feather)
close <- read_feather('data/close.feather')
head(close)
# A tibble: 6 x 4
        Date   AAPL  MSFT    SPY
      <dttm>  <dbl> <dbl>  <dbl>
1 2016-11-14 105.71 58.12 216.59
2 2016-11-15 107.11 58.87 218.28
3 2016-11-16 109.99 59.65 217.87
4 2016-11-17 109.95 60.64 218.99
5 2016-11-18 110.06 60.35 218.50
6 2016-11-21 111.73 60.86 220.15

In [18]:
%%R

write_feather(close, 'data/closeR.feather')
In [19]:
close3 = feather.read_dataframe('data/closeR.feather')
close3.head()
Out[19]:
        Date    AAPL   MSFT     SPY
0 2016-11-14  105.71  58.12  216.59
1 2016-11-15  107.11  58.87  218.28
2 2016-11-16  109.99  59.65  217.87
3 2016-11-17  109.95  60.64  218.99
4 2016-11-18  110.06  60.35  218.50