{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Archival and Management (Part 4)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import h5py\n",
"import arrow"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using `pickle` "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pickle` is the standard-library serialization module and probably the default choice for most Python developers. Its main disadvantage is that the format is Python-specific, so pickled data cannot easily be loaded from other languages. It is convenient when your project is Python-only."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pickle"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The example record below is taken from PokeAPI v2."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"bulbasaur = {\n",
"    \"id\": 1,\n",
"    \"name\": \"bulbasaur\",\n",
"    \"base_experience\": 64,\n",
"    \"height\": 7,\n",
"    \"is_default\": True,\n",
"    \"order\": 1,\n",
"    \"weight\": 69,\n",
"    \"abilities\": [\n",
"        {\n",
"            \"is_hidden\": True,\n",
"            \"slot\": 3,\n",
"            \"ability\": {\n",
"                \"name\": \"chlorophyll\",\n",
"                \"url\": \"http://pokeapi.co/api/v2/ability/34/\"\n",
"            }\n",
"        }\n",
"    ]\n",
"}"
]
},
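{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pickle protocols\n",
"\n",
"Python 3.6, used in this notebook, supports five pickle protocols, numbered 0 to 4. In general, use `pickle.HIGHEST_PROTOCOL` (protocol 4 here), as it is the most flexible and supports very large objects. If the data must also be readable from Python 2, use protocol 2, the highest protocol that Python 2 supports. Python 3.8 and later add protocol 5, so `HIGHEST_PROTOCOL` is 5 there."
]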
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Serialize"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Pickle data is binary, so open the file in 'wb' mode\n",
"with open('data/bulbasaur.pickle', 'wb') as f:\n",
"    pickle.dump(bulbasaur, f, pickle.HIGHEST_PROTOCOL)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### De-serialize"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Read the bytes back in 'rb' mode; the protocol is detected automatically\n",
"with open('data/bulbasaur.pickle', 'rb') as f:\n",
"    pokemon = pickle.load(f)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'abilities': [{'ability': {'name': 'chlorophyll',\n",
" 'url': 'http://pokeapi.co/api/v2/ability/34/'},\n",
" 'is_hidden': True,\n",
" 'slot': 3}],\n",
" 'base_experience': 64,\n",
" 'height': 7,\n",
" 'id': 1,\n",
" 'is_default': True,\n",
" 'name': 'bulbasaur',\n",
" 'order': 1,\n",
" 'weight': 69}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pokemon"
]
},
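{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Serialize to byte string\n",
"\n",
"`pickle.dumps` returns the pickled object as a `bytes` string instead of writing it to a file. This is useful for sending the object to another machine. Only unpickle data from a source you trust, since loading a pickle can execute arbitrary code."
]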
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"s = pickle.dumps(bulbasaur, pickle.HIGHEST_PROTOCOL)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"b'\\x80\\x04\\x95\\xd5\\x00\\x00\\x00\\x00\\x00\\x00\\x00}\\x94(\\x8c\\x02id\\x94K\\x01\\x8c\\x04name\\x94\\x8c\\tbulbasaur\\x94\\x8c\\x0fbase_experience\\x94K@\\x8c\\x06height\\x94K\\x07\\x8c\\nis_default\\x94\\x88\\x8c\\x05order\\x94K\\x01\\x8c\\x06weight\\x94KE\\x8c\\tabilities\\x94]\\x94}\\x94(\\x8c\\tis_hidden\\x94\\x88\\x8c\\x04slot\\x94K\\x03\\x8c\\x07ability\\x94}\\x94(h\\x02\\x8c\\x0bchlorophyll\\x94\\x8c\\x03url\\x94\\x8c$http://pokeapi.co/api/v2/ability/34/\\x94uuau.'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### De-serialize from byte string\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pokemon2 = pickle.loads(s)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'abilities': [{'ability': {'name': 'chlorophyll',\n",
" 'url': 'http://pokeapi.co/api/v2/ability/34/'},\n",
" 'is_hidden': True,\n",
" 'slot': 3}],\n",
" 'base_experience': 64,\n",
" 'height': 7,\n",
" 'id': 1,\n",
" 'is_default': True,\n",
" 'name': 'bulbasaur',\n",
" 'order': 1,\n",
" 'weight': 69}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pokemon2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Feather\n",
"\n",
"Feather is a highly optimized binary serialization format for columnar tabular data. It is useful for saving and loading large data frames, and for sharing them between Python, R, and Julia.\n",
"\n",
"Installation in Python\n",
"```bash\n",
"pip3 install feather-format\n",
"```\n",
"\n",
"Installation in R\n",
"```R\n",
"install.packages(\"feather\")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from pandas_datareader import data\n",
"import arrow\n",
"import feather"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download data from Google Finance"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"start = arrow.get('2010-01-01')\n",
"end = arrow.get('2016-12-31')\n",
"tickers = ['AAPL', 'MSFT', 'SPY']\n",
"data_source = 'google'\n",
"panel = data.DataReader(tickers, data_source, start.datetime, end.datetime)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"panel.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Format closing prices"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
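{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Date</th>\n",
"      <th>AAPL</th>\n",
"      <th>MSFT</th>\n",
"      <th>SPY</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>2016-11-14</td>\n",
"      <td>105.71</td>\n",
"      <td>58.12</td>\n",
"      <td>216.59</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>2016-11-15</td>\n",
"      <td>107.11</td>\n",
"      <td>58.87</td>\n",
"      <td>218.28</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>2016-11-16</td>\n",
"      <td>109.99</td>\n",
"      <td>59.65</td>\n",
"      <td>217.87</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>2016-11-17</td>\n",
"      <td>109.95</td>\n",
"      <td>60.64</td>\n",
"      <td>218.99</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>2016-11-18</td>\n",
"      <td>110.06</td>\n",
"      <td>60.35</td>\n",
"      <td>218.50</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],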
"text/plain": [
" Date AAPL MSFT SPY\n",
"0 2016-11-14 105.71 58.12 216.59\n",
"1 2016-11-15 107.11 58.87 218.28\n",
"2 2016-11-16 109.99 59.65 217.87\n",
"3 2016-11-17 109.95 60.64 218.99\n",
"4 2016-11-18 110.06 60.35 218.50"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"close = panel.loc['Close']\n",
"close = close.reset_index()\n",
"close.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Serialize"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"feather.write_dataframe(close, 'data/close.feather')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### De-serialize"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
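{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Date</th>\n",
"      <th>AAPL</th>\n",
"      <th>MSFT</th>\n",
"      <th>SPY</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>2016-11-14</td>\n",
"      <td>105.71</td>\n",
"      <td>58.12</td>\n",
"      <td>216.59</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>2016-11-15</td>\n",
"      <td>107.11</td>\n",
"      <td>58.87</td>\n",
"      <td>218.28</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>2016-11-16</td>\n",
"      <td>109.99</td>\n",
"      <td>59.65</td>\n",
"      <td>217.87</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>2016-11-17</td>\n",
"      <td>109.95</td>\n",
"      <td>60.64</td>\n",
"      <td>218.99</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>2016-11-18</td>\n",
"      <td>110.06</td>\n",
"      <td>60.35</td>\n",
"      <td>218.50</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],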
"text/plain": [
" Date AAPL MSFT SPY\n",
"0 2016-11-14 105.71 58.12 216.59\n",
"1 2016-11-15 107.11 58.87 218.28\n",
"2 2016-11-16 109.99 59.65 217.87\n",
"3 2016-11-17 109.95 60.64 218.99\n",
"4 2016-11-18 110.06 60.35 218.50"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"close2 = feather.read_dataframe('data/close.feather')\n",
"close2.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sharing data frames between R and Python\n",
"\n",
"The primary use of Feather is to share large amounts of data between Python and R efficiently. R has its own `feather` package, which reads and writes the same files."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%load_ext rpy2.ipython"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"# A tibble: 6 x 4\n",
" Date AAPL MSFT SPY\n",
" \n",
"1 2016-11-14 105.71 58.12 216.59\n",
"2 2016-11-15 107.11 58.87 218.28\n",
"3 2016-11-16 109.99 59.65 217.87\n",
"4 2016-11-17 109.95 60.64 218.99\n",
"5 2016-11-18 110.06 60.35 218.50\n",
"6 2016-11-21 111.73 60.86 220.15\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%R\n",
"\n",
"library(feather)\n",
"close <- read_feather('data/close.feather')\n",
"head(close)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%R \n",
"\n",
"write_feather(close, 'data/closeR.feather')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
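{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Date</th>\n",
"      <th>AAPL</th>\n",
"      <th>MSFT</th>\n",
"      <th>SPY</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>2016-11-14</td>\n",
"      <td>105.71</td>\n",
"      <td>58.12</td>\n",
"      <td>216.59</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>2016-11-15</td>\n",
"      <td>107.11</td>\n",
"      <td>58.87</td>\n",
"      <td>218.28</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>2016-11-16</td>\n",
"      <td>109.99</td>\n",
"      <td>59.65</td>\n",
"      <td>217.87</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>2016-11-17</td>\n",
"      <td>109.95</td>\n",
"      <td>60.64</td>\n",
"      <td>218.99</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>2016-11-18</td>\n",
"      <td>110.06</td>\n",
"      <td>60.35</td>\n",
"      <td>218.50</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],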
"text/plain": [
" Date AAPL MSFT SPY\n",
"0 2016-11-14 105.71 58.12 216.59\n",
"1 2016-11-15 107.11 58.87 218.28\n",
"2 2016-11-16 109.99 59.65 217.87\n",
"3 2016-11-17 109.95 60.64 218.99\n",
"4 2016-11-18 110.06 60.35 218.50"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"close3 = feather.read_dataframe('data/closeR.feather')\n",
"close3.head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}