{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Archival and Management (Part 3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import h5py\n",
    "import arrow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using HDF5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "HDF5 is a file format for storing **large** amounts of **annnotated**, **numerical** data organzied **hierarchically**. It is very useful in numerical work becasue isolate parts fo the data strucrue for processing in memory. It is the default storage format for modern versions of Matlab.\n",
    "\n",
    "There are two well known Python pacakges for working with HDF5, `h5py` and `pytables`. \n",
    "\n",
    "- `h5py` provides a `numpy` like inerface and is probably easier to use\n",
    "- `pytables` provides a database table abstrction and is used by `pandas` in its `read_hdf` and `to_hdf` I/O functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving simulaiton data\n",
    "\n",
    "We give a simple example where you perform  a set of different simulations, and save to an HDF5 file.\n",
    "\n",
    "Suppose the simulaions are for infection rates by different pathogens, and for each pathogen, you run a few simulations with diferent assumptions. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simulaiton daa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Simulations done on 10-11-2017 by Charles Darwin\n",
    "malaria_asia = np.random.poisson(10, (100, 10))\n",
    "malaria_africa = np.random.poisson(11, (100, 10))\n",
    "malaria_america = np.random.poisson(12, (100, 10))\n",
    "\n",
    "# Simulations done on 11-11-2017 by Greor Mendel\n",
    "aids_asia = np.random.gamma(102, 10, (100, 10))\n",
    "aids_africa = np.random.gamma(101, 11, (200, 10))\n",
    "aids_america = np.random.gamma(100, 12, (300, 10))\n",
    "\n",
    "# Simulations done on 12-11-2017 by Charlied Brown\n",
    "tb_asia = np.random.normal(5, 1, (100, 10))\n",
    "tb_africa = np.random.normal(6, 1, (100, 10))\n",
    "tb_america = np.random.normal(7, 1, (100, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the HDF5 file\n",
    "\n",
    "By default the file is open in read/write mdoe it exists, and created if it does not exist."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "f = h5py.File('sim.h5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Polulate HDF5 file with groups, datasets and annotations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "diseases = ['malaria', 'aids', 'tuberculosis']\n",
    "creators = ['Charles Dawin', 'Gregor Mendel', 'Charlie Brown']\n",
    "dates = [arrow.get(date) for date in ['10-11-2017', '11-11-2017', '12-11-2017']]\n",
    "regions = ['asia', 'africa', 'america']\n",
    "datasets = [\n",
    "    [malaria_asia, malaria_africa, malaria_america],\n",
    "    [aids_asia, aids_africa, aids_america],\n",
    "    [tb_asia, tb_africa, tb_america],\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "for disease, creator, date, dataset in zip(diseases, creators, dates, datasets):\n",
    "    g = f.create_group(disease)\n",
    "    g.attrs['creaor'] = creator\n",
    "    g.attrs['creation date'] = str(date)\n",
    "    for region, simulation in zip(regions, dataset):\n",
    "        d = g.create_dataset(region, data=simulation)\n",
    "        d.attrs['timestamp'] = str(arrow.now())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Close file when no longer needed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using an HDF5 file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "f = h5py.File('sim.h5')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['aids', 'malaria', 'tuberculosis']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(f.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Iteration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "aids creaor Gregor Mendel\n",
      "<HDF5 group \"/aids\" (3 members)>\n",
      "aids creation date 2017-01-01T00:00:00+00:00\n",
      "<HDF5 group \"/aids\" (3 members)>\n",
      "malaria creaor Charles Dawin\n",
      "<HDF5 group \"/malaria\" (3 members)>\n",
      "malaria creation date 2017-01-01T00:00:00+00:00\n",
      "<HDF5 group \"/malaria\" (3 members)>\n",
      "tuberculosis creaor Charlie Brown\n",
      "<HDF5 group \"/tuberculosis\" (3 members)>\n",
      "tuberculosis creation date 2017-01-01T00:00:00+00:00\n",
      "<HDF5 group \"/tuberculosis\" (3 members)>\n"
     ]
    }
   ],
   "source": [
    "for group in f:\n",
    "    for key, val in f[group].attrs.items():\n",
    "        print(group, key, val)\n",
    "        print(f[group])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Using a visitor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "aids\n",
      "aids/africa\n",
      "aids/america\n",
      "aids/asia\n",
      "malaria\n",
      "malaria/africa\n",
      "malaria/america\n",
      "malaria/asia\n",
      "tuberculosis\n",
      "tuberculosis/africa\n",
      "tuberculosis/america\n",
      "tuberculosis/asia\n"
     ]
    }
   ],
   "source": [
    "f.visit(lambda x: print(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Using a visitor iterator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "aids\n",
      "creaor Gregor Mendel\n",
      "creation date 2017-01-01T00:00:00+00:00\n",
      "\n",
      "aids/africa\n",
      "timestamp 2017-11-13T01:12:42.165815+00:00\n",
      "\n",
      "aids/america\n",
      "timestamp 2017-11-13T01:12:42.166395+00:00\n",
      "\n",
      "aids/asia\n",
      "timestamp 2017-11-13T01:12:42.165222+00:00\n",
      "\n",
      "malaria\n",
      "creaor Charles Dawin\n",
      "creation date 2017-01-01T00:00:00+00:00\n",
      "\n",
      "malaria/africa\n",
      "timestamp 2017-11-13T01:12:42.163596+00:00\n",
      "\n",
      "malaria/america\n",
      "timestamp 2017-11-13T01:12:42.164182+00:00\n",
      "\n",
      "malaria/asia\n",
      "timestamp 2017-11-13T01:12:42.162954+00:00\n",
      "\n",
      "tuberculosis\n",
      "creaor Charlie Brown\n",
      "creation date 2017-01-01T00:00:00+00:00\n",
      "\n",
      "tuberculosis/africa\n",
      "timestamp 2017-11-13T01:12:42.168003+00:00\n",
      "\n",
      "tuberculosis/america\n",
      "timestamp 2017-11-13T01:12:42.168549+00:00\n",
      "\n",
      "tuberculosis/asia\n",
      "timestamp 2017-11-13T01:12:42.167439+00:00\n",
      "\n"
     ]
    }
   ],
   "source": [
    "def view(name, obj):\n",
    "    print(name)\n",
    "    for item in obj.attrs:\n",
    "        print(item, obj.attrs[item])\n",
    "    print()\n",
    "        \n",
    "f.visititems(view)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load a slice of a data set into memory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<HDF5 dataset \"america\": shape (100, 10), type \"<f8\">"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tb_am = f['tuberculosis']['america']\n",
    "tb_am"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 6.17699582,  5.75843404,  7.10402303],\n",
       "       [ 7.02025947,  5.99105695,  6.46981765],\n",
       "       [ 5.93552647,  6.66634888,  6.99460096],\n",
       "       [ 5.50521143,  8.21152932,  6.19222898],\n",
       "       [ 6.28336856,  6.83224101,  7.98080216]])"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tb_am[:5, :3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Modifying a dataa set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tb_am[:5, :3] **= 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 38.15527734,  33.15956263,  50.46714314],\n",
       "       [ 49.28404308,  35.8927634 ,  41.85854039],\n",
       "       [ 35.23047445,  44.44020735,  48.92444253],\n",
       "       [ 30.3073529 ,  67.42921375,  38.34369969],\n",
       "       [ 39.4807205 ,  46.67951722,  63.69320318]])"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tb_am[:5, :3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 38.15527734,  33.15956263,  50.46714314],\n",
       "       [ 49.28404308,  35.8927634 ,  41.85854039],\n",
       "       [ 35.23047445,  44.44020735,  48.92444253],\n",
       "       [ 30.3073529 ,  67.42921375,  38.34369969],\n",
       "       [ 39.4807205 ,  46.67951722,  63.69320318]])"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f['tuberculosis']['america'][:5,:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating resizable data sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<HDF5 dataset \"asia\": shape (10, 5), type \"<i8\">"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "g = f.create_group('dengue')\n",
    "g.create_dataset('asia', data=np.random.randint(0,10,(10,5)), maxshape=(None, 5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dset = f['dengue/asia']\n",
    "dset.resize((20, 5))\n",
    "dset[10:20, :] = np.ones((10, 5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[3, 2, 4, 0, 7],\n",
       "       [2, 1, 4, 5, 8],\n",
       "       [2, 8, 6, 7, 3],\n",
       "       [2, 1, 5, 7, 3],\n",
       "       [8, 5, 1, 2, 0],\n",
       "       [2, 8, 5, 3, 5],\n",
       "       [5, 5, 0, 9, 6],\n",
       "       [4, 5, 6, 4, 1],\n",
       "       [4, 6, 6, 5, 4],\n",
       "       [6, 2, 1, 5, 6],\n",
       "       [1, 1, 1, 1, 1],\n",
       "       [1, 1, 1, 1, 1],\n",
       "       [1, 1, 1, 1, 1],\n",
       "       [1, 1, 1, 1, 1],\n",
       "       [1, 1, 1, 1, 1],\n",
       "       [1, 1, 1, 1, 1],\n",
       "       [1, 1, 1, 1, 1],\n",
       "       [1, 1, 1, 1, 1],\n",
       "       [1, 1, 1, 1, 1],\n",
       "       [1, 1, 1, 1, 1]])"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dset[:, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Creating compressed data sets\n",
    "\n",
    "Compression and decompresssion occurs automatically if the compressoin keyword is given the name of an appropriate compression algorihtm. Use `gzip` for goood compression/moderate speed or `lzf` for moderate compression/fast speed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0,  1,  2,  3,  4],\n",
       "       [ 5,  6,  7,  8,  9],\n",
       "       [10, 11, 12, 13, 14],\n",
       "       [15, 16, 17, 18, 19]])"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "g.create_dataset('america', data=np.arange(20).reshape(-1, 5), \n",
    "                 maxshape=(None, 5),\n",
    "                 compression = 'lzf')\n",
    "dset = f['dengue/america']\n",
    "dset[:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Close file handle when done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using with `pandas`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO',\n",
    "          'CT', 'DE', 'FL', 'GA', 'HI', 'ID', \n",
    "          'IL', 'IN', 'IA', 'KS', 'KY', 'LA', \n",
    "          'ME', 'MD', 'MA', 'MI', 'MN', 'MS', \n",
    "          'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',\n",
    "          'NM', 'NY', 'NC', 'ND', 'OH', 'OK', \n",
    "          'OR', 'PA', 'RI', 'SC', 'SD', 'TN',\n",
    "          'TX', 'UT', 'VT', 'VA', 'WA', 'WV',\n",
    "          'WI', 'WY']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pop = np.random.randint(500000, 10000000, 50)\n",
    "income = np.random.normal(50000, 10000, 50)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df = pd.DataFrame(dict(state=states, pop=pop, income=income))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>income</th>\n",
       "      <th>pop</th>\n",
       "      <th>state</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>44406.959898</td>\n",
       "      <td>4666424</td>\n",
       "      <td>AL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>44072.976317</td>\n",
       "      <td>8842771</td>\n",
       "      <td>AK</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>54005.243082</td>\n",
       "      <td>5199134</td>\n",
       "      <td>AZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>43124.820188</td>\n",
       "      <td>6466072</td>\n",
       "      <td>AR</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>60384.253224</td>\n",
       "      <td>6568168</td>\n",
       "      <td>CA</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         income      pop state\n",
       "0  44406.959898  4666424    AL\n",
       "1  44072.976317  8842771    AK\n",
       "2  54005.243082  5199134    AZ\n",
       "3  43124.820188  6466072    AR\n",
       "4  60384.253224  6568168    CA"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Standard dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df.to_hdf('simple.h5', 'stats')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>income</th>\n",
       "      <th>pop</th>\n",
       "      <th>state</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>44406.959898</td>\n",
       "      <td>4666424</td>\n",
       "      <td>AL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>44072.976317</td>\n",
       "      <td>8842771</td>\n",
       "      <td>AK</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>54005.243082</td>\n",
       "      <td>5199134</td>\n",
       "      <td>AZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>43124.820188</td>\n",
       "      <td>6466072</td>\n",
       "      <td>AR</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>60384.253224</td>\n",
       "      <td>6568168</td>\n",
       "      <td>CA</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         income      pop state\n",
       "0  44406.959898  4666424    AL\n",
       "1  44072.976317  8842771    AK\n",
       "2  54005.243082  5199134    AZ\n",
       "3  43124.820188  6466072    AR\n",
       "4  60384.253224  6568168    CA"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_hdf('simple.h5', 'stats').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Using relatinal featurees \n",
    "\n",
    "If you are likely to query a column frequently to retrieve a subset, make that column a data column so that it is indexed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df.to_hdf('states.h5', 'stats', mode='w', \n",
    "          data_columns=['state'],\n",
    "          format='table')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Retreving via a table query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>income</th>\n",
       "      <th>pop</th>\n",
       "      <th>state</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>44406.959898</td>\n",
       "      <td>4666424</td>\n",
       "      <td>AL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>48488.454711</td>\n",
       "      <td>6013478</td>\n",
       "      <td>GA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>49482.967676</td>\n",
       "      <td>6739070</td>\n",
       "      <td>NC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          income      pop state\n",
       "0   44406.959898  4666424    AL\n",
       "9   48488.454711  6013478    GA\n",
       "32  49482.967676  6739070    NC"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_hdf('states.h5', 'stats', \n",
    "            where=\"state in ['AL', 'NC', 'GA']\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}