{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Archival and Management (Part 3)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import h5py\n", "import arrow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using HDF5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "HDF5 is a file format for storing **large** amounts of **annnotated**, **numerical** data organzied **hierarchically**. It is very useful in numerical work becasue isolate parts fo the data strucrue for processing in memory. It is the default storage format for modern versions of Matlab.\n", "\n", "There are two well known Python pacakges for working with HDF5, `h5py` and `pytables`. \n", "\n", "- `h5py` provides a `numpy` like inerface and is probably easier to use\n", "- `pytables` provides a database table abstrction and is used by `pandas` in its `read_hdf` and `to_hdf` I/O functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving simulaiton data\n", "\n", "We give a simple example where you perform a set of different simulations, and save to an HDF5 file.\n", "\n", "Suppose the simulaions are for infection rates by different pathogens, and for each pathogen, you run a few simulations with diferent assumptions. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simulaiton daa" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Simulations done on 10-11-2017 by Charles Darwin\n", "malaria_asia = np.random.poisson(10, (100, 10))\n", "malaria_africa = np.random.poisson(11, (100, 10))\n", "malaria_america = np.random.poisson(12, (100, 10))\n", "\n", "# Simulations done on 11-11-2017 by Greor Mendel\n", "aids_asia = np.random.gamma(102, 10, (100, 10))\n", "aids_africa = np.random.gamma(101, 11, (200, 10))\n", "aids_america = np.random.gamma(100, 12, (300, 10))\n", "\n", "# Simulations done on 12-11-2017 by Charlied Brown\n", "tb_asia = np.random.normal(5, 1, (100, 10))\n", "tb_africa = np.random.normal(6, 1, (100, 10))\n", "tb_america = np.random.normal(7, 1, (100, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the HDF5 file\n", "\n", "By default the file is open in read/write mdoe it exists, and created if it does not exist." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "f = h5py.File('sim.h5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Polulate HDF5 file with groups, datasets and annotations" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "diseases = ['malaria', 'aids', 'tuberculosis']\n", "creators = ['Charles Dawin', 'Gregor Mendel', 'Charlie Brown']\n", "dates = [arrow.get(date) for date in ['10-11-2017', '11-11-2017', '12-11-2017']]\n", "regions = ['asia', 'africa', 'america']\n", "datasets = [\n", " [malaria_asia, malaria_africa, malaria_america],\n", " [aids_asia, aids_africa, aids_america],\n", " [tb_asia, tb_africa, tb_america],\n", "]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true, "scrolled": false }, "outputs": [], "source": [ "for disease, creator, date, dataset in zip(diseases, creators, dates, datasets):\n", " g = f.create_group(disease)\n", " g.attrs['creaor'] = creator\n", " g.attrs['creation date'] = str(date)\n", " for region, simulation in zip(regions, dataset):\n", " d = g.create_dataset(region, data=simulation)\n", " d.attrs['timestamp'] = str(arrow.now())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Close file when no longer needed" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using an HDF5 file" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "f = h5py.File('sim.h5')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['aids', 'malaria', 'tuberculosis']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(f.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Iteration" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "aids creaor Gregor Mendel\n", "\n", "aids creation date 2017-01-01T00:00:00+00:00\n", "\n", "malaria creaor Charles Dawin\n", "\n", "malaria creation date 2017-01-01T00:00:00+00:00\n", "\n", "tuberculosis creaor Charlie Brown\n", "\n", "tuberculosis creation date 2017-01-01T00:00:00+00:00\n", "\n" ] } ], "source": [ "for group in f:\n", " for key, val in f[group].attrs.items():\n", " print(group, key, val)\n", " print(f[group])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using a visitor" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "aids\n", "aids/africa\n", "aids/america\n", "aids/asia\n", "malaria\n", "malaria/africa\n", "malaria/america\n", "malaria/asia\n", "tuberculosis\n", "tuberculosis/africa\n", "tuberculosis/america\n", "tuberculosis/asia\n" ] } ], "source": [ "f.visit(lambda x: print(x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using a visitor iterator" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "aids\n", "creaor Gregor Mendel\n", "creation date 2017-01-01T00:00:00+00:00\n", "\n", "aids/africa\n", "timestamp 2017-11-13T01:12:42.165815+00:00\n", "\n", "aids/america\n", "timestamp 2017-11-13T01:12:42.166395+00:00\n", "\n", "aids/asia\n", "timestamp 2017-11-13T01:12:42.165222+00:00\n", "\n", "malaria\n", "creaor Charles Dawin\n", "creation date 2017-01-01T00:00:00+00:00\n", "\n", "malaria/africa\n", "timestamp 2017-11-13T01:12:42.163596+00:00\n", "\n", "malaria/america\n", "timestamp 2017-11-13T01:12:42.164182+00:00\n", "\n", "malaria/asia\n", "timestamp 2017-11-13T01:12:42.162954+00:00\n", "\n", "tuberculosis\n", "creaor Charlie Brown\n", "creation date 2017-01-01T00:00:00+00:00\n", "\n", "tuberculosis/africa\n", "timestamp 2017-11-13T01:12:42.168003+00:00\n", "\n", "tuberculosis/america\n", "timestamp 2017-11-13T01:12:42.168549+00:00\n", "\n", "tuberculosis/asia\n", "timestamp 2017-11-13T01:12:42.167439+00:00\n", "\n" ] } ], "source": [ "def view(name, obj):\n", " print(name)\n", " for item in obj.attrs:\n", " print(item, obj.attrs[item])\n", " print()\n", " \n", "f.visititems(view)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load a slice of a data set into memory" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tb_am = f['tuberculosis']['america']\n", "tb_am" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 6.17699582, 5.75843404, 7.10402303],\n", " [ 7.02025947, 5.99105695, 6.46981765],\n", " [ 5.93552647, 6.66634888, 6.99460096],\n", " [ 5.50521143, 8.21152932, 6.19222898],\n", " [ 6.28336856, 6.83224101, 7.98080216]])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tb_am[:5, :3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modifying a dataa set" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tb_am[:5, :3] **= 2" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 38.15527734, 33.15956263, 50.46714314],\n", " [ 49.28404308, 35.8927634 , 41.85854039],\n", " [ 35.23047445, 44.44020735, 48.92444253],\n", " [ 30.3073529 , 67.42921375, 38.34369969],\n", " [ 39.4807205 , 46.67951722, 63.69320318]])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tb_am[:5, :3]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 38.15527734, 33.15956263, 50.46714314],\n", " [ 49.28404308, 35.8927634 , 41.85854039],\n", " [ 35.23047445, 44.44020735, 48.92444253],\n", " [ 30.3073529 , 67.42921375, 38.34369969],\n", " [ 39.4807205 , 46.67951722, 63.69320318]])" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f['tuberculosis']['america'][:5,:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating resizable data sets" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g = f.create_group('dengue')\n", "g.create_dataset('asia', data=np.random.randint(0,10,(10,5)), maxshape=(None, 5))" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dset = f['dengue/asia']\n", "dset.resize((20, 5))\n", "dset[10:20, :] = np.ones((10, 5))" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[3, 2, 4, 0, 7],\n", " [2, 1, 4, 5, 8],\n", " [2, 8, 6, 7, 3],\n", " [2, 1, 5, 7, 3],\n", " [8, 5, 1, 2, 0],\n", " [2, 8, 5, 3, 5],\n", " [5, 5, 0, 9, 6],\n", " [4, 5, 6, 4, 1],\n", " [4, 6, 6, 5, 4],\n", " [6, 2, 1, 5, 6],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1],\n", " [1, 1, 1, 1, 1]])" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dset[:, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Creating compressed data sets\n", "\n", "Compression and decompresssion occurs automatically if the compressoin keyword is given the name of an appropriate compression algorihtm. Use `gzip` for goood compression/moderate speed or `lzf` for moderate compression/fast speed." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 1, 2, 3, 4],\n", " [ 5, 6, 7, 8, 9],\n", " [10, 11, 12, 13, 14],\n", " [15, 16, 17, 18, 19]])" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.create_dataset('america', data=np.arange(20).reshape(-1, 5), \n", " maxshape=(None, 5),\n", " compression = 'lzf')\n", "dset = f['dengue/america']\n", "dset[:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Close file handle when done" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": true }, "outputs": [], "source": [ "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using with `pandas`" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": true }, "outputs": [], "source": [ "states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO',\n", " 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', \n", " 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', \n", " 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', \n", " 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',\n", " 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', \n", " 'OR', 'PA', 'RI', 'SC', 'SD', 'TN',\n", " 'TX', 'UT', 'VT', 'VA', 'WA', 'WV',\n", " 'WI', 'WY']" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pop = np.random.randint(500000, 10000000, 50)\n", "income = np.random.normal(50000, 10000, 50)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.DataFrame(dict(state=states, pop=pop, income=income))" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
incomepopstate
044406.9598984666424AL
144072.9763178842771AK
254005.2430825199134AZ
343124.8201886466072AR
460384.2532246568168CA
\n", "
" ], "text/plain": [ " income pop state\n", "0 44406.959898 4666424 AL\n", "1 44072.976317 8842771 AK\n", "2 54005.243082 5199134 AZ\n", "3 43124.820188 6466072 AR\n", "4 60384.253224 6568168 CA" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Standard dataset" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.to_hdf('simple.h5', 'stats')" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
incomepopstate
044406.9598984666424AL
144072.9763178842771AK
254005.2430825199134AZ
343124.8201886466072AR
460384.2532246568168CA
\n", "
" ], "text/plain": [ " income pop state\n", "0 44406.959898 4666424 AL\n", "1 44072.976317 8842771 AK\n", "2 54005.243082 5199134 AZ\n", "3 43124.820188 6466072 AR\n", "4 60384.253224 6568168 CA" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_hdf('simple.h5', 'stats').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using relatinal featurees \n", "\n", "If you are likely to query a column frequently to retrieve a subset, make that column a data column so that it is indexed." ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.to_hdf('states.h5', 'stats', mode='w', \n", " data_columns=['state'],\n", " format='table')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Retreving via a table query" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
incomepopstate
044406.9598984666424AL
948488.4547116013478GA
3249482.9676766739070NC
\n", "
" ], "text/plain": [ " income pop state\n", "0 44406.959898 4666424 AL\n", "9 48488.454711 6013478 GA\n", "32 49482.967676 6739070 NC" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_hdf('states.h5', 'stats', \n", " where=\"state in ['AL', 'NC', 'GA']\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }