Data Archival and Management (Part 4)¶
In [1]:
import numpy as np
import h5py
import arrow
Using pickle¶
This is probably the default serialization method used by most Python developers. Its main disadvantage is that the format is Python-specific, so pickled data cannot easily be loaded in other languages. However, it is convenient if your project is Python-only.
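To make the trade-off concrete, here is a minimal sketch that is not part of the original notebook. The `record` dictionary is a made-up example. It suggests why pickle is convenient inside a Python-only project: it round-trips Python-specific types such as sets, tuples, and `datetime` objects, which the language-neutral `json` module rejects unless a custom encoder is supplied.

```python
import json
import pickle
from datetime import datetime

# Hypothetical record holding types that have no direct JSON equivalent.
record = {
    "caught_at": datetime(2016, 11, 14, 9, 30),
    "types": {"grass", "poison"},  # a set
    "position": (3, 7),            # a tuple
}

# pickle restores the exact Python types.
restored = pickle.loads(pickle.dumps(record))

# json refuses the same object unless a custom encoder is provided.
try:
    json.dumps(record)
    json_ok = True
except TypeError:
    json_ok = False

print(type(restored["types"]).__name__, type(restored["position"]).__name__, json_ok)
```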
In [1]:
import pickle
Source: From PokeAPI v2
In [2]:
bulbasaur = {
    "id": 1,
    "name": "bulbasaur",
    "base_experience": 64,
    "height": 7,
    "is_default": True,
    "order": 1,
    "weight": 69,
    "abilities": [
        {
            "is_hidden": True,
            "slot": 3,
            "ability": {
                "name": "chlorophyll",
                "url": "http://pokeapi.co/api/v2/ability/34/"
            }
        }
    ]
}
Pickle protocols¶
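Objects can be pickled using one of several protocols: 0 to 4 in Python 3.4 through 3.7, plus protocol 5 from Python 3.8 onward. In general, use pickle.HIGHEST_PROTOCOL (protocol 4 in the Python version used for this notebook), as it is the most flexible and supports very large objects, unless you need to share with Python 2, in which case use protocol 2.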
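The following sketch, which is not from the original notebook, shows one way to inspect the protocols offered by the running interpreter and to produce a Python 2-compatible pickle. The `payload` dictionary is a small stand-in chosen for illustration.

```python
import pickle

# Small stand-in object; any picklable Python object would do.
payload = {"id": 1, "name": "bulbasaur", "weight": 69}

# Protocol numbers available in the running interpreter.
print("default:", pickle.DEFAULT_PROTOCOL, "highest:", pickle.HIGHEST_PROTOCOL)

# Serialize the same object with every supported protocol and record the size.
sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(payload, protocol=protocol)
    assert pickle.loads(blob) == payload  # every protocol round-trips
    sizes[protocol] = len(blob)
print(sizes)

# Protocol 2 is the newest protocol that Python 2 can read.
py2_blob = pickle.dumps(payload, protocol=2)
```

The exact byte counts depend on the Python version, so none are listed here.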
Serialize¶
In [3]:
with open('data/bulbasaur.pickle', 'wb') as f:
    pickle.dump(bulbasaur, f, pickle.HIGHEST_PROTOCOL)
De-serialize¶
In [4]:
with open('data/bulbasaur.pickle', 'rb') as f:
    pokemon = pickle.load(f)
In [5]:
pokemon
Out[5]:
{'abilities': [{'ability': {'name': 'chlorophyll',
'url': 'http://pokeapi.co/api/v2/ability/34/'},
'is_hidden': True,
'slot': 3}],
'base_experience': 64,
'height': 7,
'id': 1,
'is_default': True,
'name': 'bulbasaur',
'order': 1,
'weight': 69}
Serialize to byte string¶
This serializes to a bytes object in memory instead of writing to a file, which is useful for sending the data to another machine.
In [6]:
s = pickle.dumps(bulbasaur, pickle.HIGHEST_PROTOCOL)
In [7]:
s
Out[7]:
b'\x80\x04\x95\xd5\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x02id\x94K\x01\x8c\x04name\x94\x8c\tbulbasaur\x94\x8c\x0fbase_experience\x94K@\x8c\x06height\x94K\x07\x8c\nis_default\x94\x88\x8c\x05order\x94K\x01\x8c\x06weight\x94KE\x8c\tabilities\x94]\x94}\x94(\x8c\tis_hidden\x94\x88\x8c\x04slot\x94K\x03\x8c\x07ability\x94}\x94(h\x02\x8c\x0bchlorophyll\x94\x8c\x03url\x94\x8c$http://pokeapi.co/api/v2/ability/34/\x94uuau.'
De-serialize from byte string¶
In [8]:
pokemon2 = pickle.loads(s)
In [9]:
pokemon2
Out[9]:
{'abilities': [{'ability': {'name': 'chlorophyll',
'url': 'http://pokeapi.co/api/v2/ability/34/'},
'is_hidden': True,
'slot': 3}],
'base_experience': 64,
'height': 7,
'id': 1,
'is_default': True,
'name': 'bulbasaur',
'order': 1,
'weight': 69}
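As a quick check on what a round trip produces, the sketch below (an illustration, not part of the original notebook) uses a trimmed-down, hypothetical `original` dictionary. It suggests that `pickle.loads(pickle.dumps(...))` yields an object that compares equal to the source but is a fully independent copy, nested containers included.

```python
import pickle

# Trimmed-down stand-in for the bulbasaur record.
original = {"name": "bulbasaur", "abilities": [{"slot": 3, "is_hidden": True}]}

# Serialize to bytes and immediately rebuild the object.
copy = pickle.loads(pickle.dumps(original, pickle.HIGHEST_PROTOCOL))

same_content = copy == original  # equal values
same_object = copy is original   # but not the same object

# Nested containers are rebuilt as well, so editing the copy
# leaves the original untouched.
copy["abilities"][0]["slot"] = 1

print(same_content, same_object, original["abilities"][0]["slot"])
```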
Using Feather¶
Feather is a new and highly optimized binary serialization format for columnar tabular data that is useful for loading and saving large data frames. It can also be used to share large data frames between Python, R, and Julia.
Installation in Python
pip3 install feather-format
Installation in R
install.packages("feather")
In [10]:
from pandas_datareader import data
import arrow
import feather
Download data from Google Finance¶
In [11]:
start = arrow.get('2010-01-01')
end = arrow.get('2016-12-31')
tickers = ['AAPL', 'MSFT', 'SPY']
data_source = 'google'
panel = data.DataReader(tickers, data_source, start.datetime, end.datetime)
In [12]:
panel.keys()
Out[12]:
Index(['Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')
Format closing prices¶
In [13]:
close = panel.loc['Close']
close = close.reset_index()
close.head()
Out[13]:
 | Date | AAPL | MSFT | SPY |
---|---|---|---|---|
0 | 2016-11-14 | 105.71 | 58.12 | 216.59 |
1 | 2016-11-15 | 107.11 | 58.87 | 218.28 |
2 | 2016-11-16 | 109.99 | 59.65 | 217.87 |
3 | 2016-11-17 | 109.95 | 60.64 | 218.99 |
4 | 2016-11-18 | 110.06 | 60.35 | 218.50 |
Serialize¶
In [14]:
feather.write_dataframe(close, 'data/close.feather')
De-serialize¶
In [15]:
close2 = feather.read_dataframe('data/close.feather')
close2.head()
Out[15]:
 | Date | AAPL | MSFT | SPY |
---|---|---|---|---|
0 | 2016-11-14 | 105.71 | 58.12 | 216.59 |
1 | 2016-11-15 | 107.11 | 58.87 | 218.28 |
2 | 2016-11-16 | 109.99 | 59.65 | 217.87 |
3 | 2016-11-17 | 109.95 | 60.64 | 218.99 |
4 | 2016-11-18 | 110.06 | 60.35 | 218.50 |
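The same round trip can probably also be done without the standalone `feather` package. The sketch below is an assumption-laden alternative, not the notebook's method: it assumes a recent pandas with the `pyarrow` package installed, and uses pandas' own `DataFrame.to_feather` and `pandas.read_feather`. The helper name `feather_roundtrip` and the `sample` frame (a few rows copied from the table above) are introduced only for illustration.

```python
import pandas as pd


def feather_roundtrip(frame, path):
    """Write `frame` to `path` in Feather format and read it back.

    Assumes the pyarrow package is installed, which pandas needs
    for Feather support.
    """
    frame.to_feather(path)
    return pd.read_feather(path)


# A few rows copied from the closing-price table, with a default integer index.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2016-11-14", "2016-11-15", "2016-11-16"]),
    "AAPL": [105.71, 107.11, 109.99],
    "MSFT": [58.12, 58.87, 59.65],
})
```

Calling `feather_roundtrip(sample, 'data/sample.feather')` should return a frame with the same columns and values, provided the `data` directory exists.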
Sharing data frames between R and Python¶
The primary use of feather is to share large amounts of data between Python and R efficiently. Of course, R also has a feather package.
In [16]:
%load_ext rpy2.ipython
In [17]:
%%R
library(feather)
close <- read_feather('data/close.feather')
head(close)
# A tibble: 6 x 4
Date AAPL MSFT SPY
<dttm> <dbl> <dbl> <dbl>
1 2016-11-14 105.71 58.12 216.59
2 2016-11-15 107.11 58.87 218.28
3 2016-11-16 109.99 59.65 217.87
4 2016-11-17 109.95 60.64 218.99
5 2016-11-18 110.06 60.35 218.50
6 2016-11-21 111.73 60.86 220.15
In [18]:
%%R
write_feather(close, 'data/closeR.feather')
In [19]:
close3 = feather.read_dataframe('data/closeR.feather')
close3.head()
Out[19]:
 | Date | AAPL | MSFT | SPY |
---|---|---|---|---|
0 | 2016-11-14 | 105.71 | 58.12 | 216.59 |
1 | 2016-11-15 | 107.11 | 58.87 | 218.28 |
2 | 2016-11-16 | 109.99 | 59.65 | 217.87 |
3 | 2016-11-17 | 109.95 | 60.64 | 218.99 |
4 | 2016-11-18 | 110.06 | 60.35 | 218.50 |