Data Archival and Management (Part 3)

In [1]:
import numpy as np
import h5py
import arrow

Using HDF5

HDF5 is a file format for storing large amounts of annotated, numerical data organized hierarchically. It is very useful in numerical work because you can isolate parts of the data structure and load only those parts into memory for processing. It is also the default storage format for modern versions of MATLAB.

There are two well-known Python packages for working with HDF5, h5py and pytables.

  • h5py provides a numpy-like interface and is probably easier to use
  • pytables provides a database table abstraction and is used by pandas in its read_hdf and to_hdf I/O functions
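
A minimal sketch of this hierarchy using h5py (the file name example.h5 and the names experiment1/trial1 are just placeholders): groups behave like directories, datasets hold the arrays, and attributes attach metadata to either.

with h5py.File('example.h5', 'w') as hf:
    grp = hf.create_group('experiment1')                      # group: behaves like a directory
    dset = grp.create_dataset('trial1', data=np.arange(10))   # dataset: holds an array
    grp.attrs['description'] = 'toy example'                  # attribute on a group
    dset.attrs['units'] = 'arbitrary'                         # attribute on a dataset

with h5py.File('example.h5', 'r') as hf:
    print(hf['experiment1/trial1'][:5])                       # numpy-like slicing into the file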

Saving simulation data

We give a simple example where you perform a set of different simulations and save the results to an HDF5 file.

Suppose the simulations are for infection rates by different pathogens, and for each pathogen you run a few simulations with different assumptions.

Simulation data

In [2]:
# Simulations done on 10-11-2017 by Charles Darwin
malaria_asia = np.random.poisson(10, (100, 10))
malaria_africa = np.random.poisson(11, (100, 10))
malaria_america = np.random.poisson(12, (100, 10))

# Simulations done on 11-11-2017 by Gregor Mendel
aids_asia = np.random.gamma(102, 10, (100, 10))
aids_africa = np.random.gamma(101, 11, (200, 10))
aids_america = np.random.gamma(100, 12, (300, 10))

# Simulations done on 12-11-2017 by Charlie Brown
tb_asia = np.random.normal(5, 1, (100, 10))
tb_africa = np.random.normal(6, 1, (100, 10))
tb_america = np.random.normal(7, 1, (100, 10))

Create the HDF5 file

In append mode ('a') the file is opened read/write if it exists, and created if it does not. Older versions of h5py used this as the default mode; newer versions default to read-only, so it is safest to pass the mode explicitly.

In [3]:
f = h5py.File('sim.h5', 'a')
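
For reference, the standard h5py mode strings (a quick summary, shown as comments rather than run against the already-open handle):

# Common h5py.File modes:
#   'r'  : read-only, the file must already exist
#   'r+' : read/write, the file must already exist
#   'w'  : create a new file, truncating any existing file with the same name
#   'w-' : create a new file, failing if it already exists
#   'a'  : read/write if the file exists, create it otherwise (used above)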

Populate HDF5 file with groups, datasets and annotations

In [4]:
diseases = ['malaria', 'aids', 'tuberculosis']
creators = ['Charles Darwin', 'Gregor Mendel', 'Charlie Brown']
dates = [arrow.get(date, 'DD-MM-YYYY') for date in ['10-11-2017', '11-11-2017', '12-11-2017']]
regions = ['asia', 'africa', 'america']
datasets = [
    [malaria_asia, malaria_africa, malaria_america],
    [aids_asia, aids_africa, aids_america],
    [tb_asia, tb_africa, tb_america],
]
In [5]:
for disease, creator, date, dataset in zip(diseases, creators, dates, datasets):
    g = f.create_group(disease)
    g.attrs['creator'] = creator
    g.attrs['creation date'] = str(date)
    for region, simulation in zip(regions, dataset):
        d = g.create_dataset(region, data=simulation)
        d.attrs['timestamp'] = str(arrow.now())

Close file when no longer needed

In [6]:
f.close()

Using an HDF5 file

In [7]:
f = h5py.File('sim.h5', 'a')
In [8]:
list(f.keys())
Out[8]:
['aids', 'malaria', 'tuberculosis']

Iteration

In [9]:
for group in f:
    for key, val in f[group].attrs.items():
        print(group, key, val)
        print(f[group])
aids creator Gregor Mendel
<HDF5 group "/aids" (3 members)>
aids creation date 2017-11-11T00:00:00+00:00
<HDF5 group "/aids" (3 members)>
malaria creator Charles Darwin
<HDF5 group "/malaria" (3 members)>
malaria creation date 2017-11-10T00:00:00+00:00
<HDF5 group "/malaria" (3 members)>
tuberculosis creator Charlie Brown
<HDF5 group "/tuberculosis" (3 members)>
tuberculosis creation date 2017-11-12T00:00:00+00:00
<HDF5 group "/tuberculosis" (3 members)>

Using a visitor

In [17]:
f.visit(lambda x: print(x))
aids
aids/africa
aids/america
aids/asia
malaria
malaria/africa
malaria/america
malaria/asia
tuberculosis
tuberculosis/africa
tuberculosis/america
tuberculosis/asia
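
visit stops and returns as soon as the callable returns something other than None, so it can also be used to search the hierarchy. A small sketch (the search term 'asia' is just an example):

# Return the first path whose name contains 'asia'
f.visit(lambda name: name if 'asia' in name else None)   # e.g. 'aids/asia'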

Using a visitor that also receives the object

In [24]:
def view(name, obj):
    print(name)
    for item in obj.attrs:
        print(item, obj.attrs[item])
    print()

f.visititems(view)
aids
creator Gregor Mendel
creation date 2017-11-11T00:00:00+00:00

aids/africa
timestamp 2017-11-13T01:12:42.165815+00:00

aids/america
timestamp 2017-11-13T01:12:42.166395+00:00

aids/asia
timestamp 2017-11-13T01:12:42.165222+00:00

malaria
creator Charles Darwin
creation date 2017-11-10T00:00:00+00:00

malaria/africa
timestamp 2017-11-13T01:12:42.163596+00:00

malaria/america
timestamp 2017-11-13T01:12:42.164182+00:00

malaria/asia
timestamp 2017-11-13T01:12:42.162954+00:00

tuberculosis
creator Charlie Brown
creation date 2017-11-12T00:00:00+00:00

tuberculosis/africa
timestamp 2017-11-13T01:12:42.168003+00:00

tuberculosis/america
timestamp 2017-11-13T01:12:42.168549+00:00

tuberculosis/asia
timestamp 2017-11-13T01:12:42.167439+00:00

Load a slice of a data set into memory

In [32]:
tb_am = f['tuberculosis']['america']
tb_am
Out[32]:
<HDF5 dataset "america": shape (100, 10), type "<f8">
In [33]:
tb_am[:5, :3]
Out[33]:
array([[ 6.17699582,  5.75843404,  7.10402303],
       [ 7.02025947,  5.99105695,  6.46981765],
       [ 5.93552647,  6.66634888,  6.99460096],
       [ 5.50521143,  8.21152932,  6.19222898],
       [ 6.28336856,  6.83224101,  7.98080216]])
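
Only the requested slice is read from disk. To pull the entire dataset into memory as a regular numpy array, index with [...] (a quick sketch):

# Materialise the whole dataset as an in-memory numpy array
tb_am_arr = tb_am[...]
type(tb_am_arr), tb_am_arr.shape   # (numpy.ndarray, (100, 10))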

Modifying a data set

In [34]:
tb_am[:5, :3] **= 2
In [35]:
tb_am[:5, :3]
Out[35]:
array([[ 38.15527734,  33.15956263,  50.46714314],
       [ 49.28404308,  35.8927634 ,  41.85854039],
       [ 35.23047445,  44.44020735,  48.92444253],
       [ 30.3073529 ,  67.42921375,  38.34369969],
       [ 39.4807205 ,  46.67951722,  63.69320318]])
In [36]:
f['tuberculosis']['america'][:5,:3]
Out[36]:
array([[ 38.15527734,  33.15956263,  50.46714314],
       [ 49.28404308,  35.8927634 ,  41.85854039],
       [ 35.23047445,  44.44020735,  48.92444253],
       [ 30.3073529 ,  67.42921375,  38.34369969],
       [ 39.4807205 ,  46.67951722,  63.69320318]])

Creating resizable data sets

In [37]:
g = f.create_group('dengue')
g.create_dataset('asia', data=np.random.randint(0,10,(10,5)), maxshape=(None, 5))
Out[37]:
<HDF5 dataset "asia": shape (10, 5), type "<i8">
In [41]:
dset = f['dengue/asia']
dset.resize((20, 5))
dset[10:20, :] = np.ones((10, 5))
In [43]:
dset[:, :]
Out[43]:
array([[3, 2, 4, 0, 7],
       [2, 1, 4, 5, 8],
       [2, 8, 6, 7, 3],
       [2, 1, 5, 7, 3],
       [8, 5, 1, 2, 0],
       [2, 8, 5, 3, 5],
       [5, 5, 0, 9, 6],
       [4, 5, 6, 4, 1],
       [4, 6, 6, 5, 4],
       [6, 2, 1, 5, 6],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])
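
Because maxshape lets the first dimension grow without bound, a common pattern is to append new rows as they arrive, resizing just before each write. A sketch, assuming fresh results come in blocks of 10 rows:

# Hypothetical append pattern: grow the dataset, then write the new block
new_block = np.random.randint(0, 10, (10, 5))
n = dset.shape[0]
dset.resize((n + new_block.shape[0], 5))   # extend along the unlimited axis
dset[n:, :] = new_block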

Creating compressed data sets

Compression and decompression occur automatically if the compression keyword is given the name of an appropriate compression algorithm. Use gzip for good compression/moderate speed or lzf for moderate compression/fast speed.

In [45]:
g.create_dataset('america', data=np.arange(20).reshape(-1, 5),
                 maxshape=(None, 5),
                 compression = 'lzf')
dset = f['dengue/america']
dset[:]
Out[45]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
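
With gzip the compression level can also be tuned via compression_opts (0-9, higher is smaller but slower). A sketch, where the dataset name 'africa' and level 6 are just illustrative choices:

g.create_dataset('africa', data=np.random.poisson(5, (1000, 5)),
                 maxshape=(None, 5),
                 compression='gzip', compression_opts=6)
f['dengue/africa'].compression, f['dengue/africa'].compression_opts   # ('gzip', 6)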

Close file handle when done

In [46]:
f.close()

Using with pandas

In [54]:
import pandas as pd
In [65]:
states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO',
          'CT', 'DE', 'FL', 'GA', 'HI', 'ID',
          'IL', 'IN', 'IA', 'KS', 'KY', 'LA',
          'ME', 'MD', 'MA', 'MI', 'MN', 'MS',
          'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
          'NM', 'NY', 'NC', 'ND', 'OH', 'OK',
          'OR', 'PA', 'RI', 'SC', 'SD', 'TN',
          'TX', 'UT', 'VT', 'VA', 'WA', 'WV',
          'WI', 'WY']
In [66]:
pop = np.random.randint(500000, 10000000, 50)
income = np.random.normal(50000, 10000, 50)
In [67]:
df = pd.DataFrame(dict(state=states, pop=pop, income=income))
In [68]:
df.head()
Out[68]:
         income      pop state
0  44406.959898  4666424    AL
1  44072.976317  8842771    AK
2  54005.243082  5199134    AZ
3  43124.820188  6466072    AR
4  60384.253224  6568168    CA

Standard dataset

In [74]:
df.to_hdf('simple.h5', 'stats')
In [76]:
pd.read_hdf('simple.h5', 'stats').head()
Out[76]:
         income      pop state
0  44406.959898  4666424    AL
1  44072.976317  8842771    AK
2  54005.243082  5199134    AZ
3  43124.820188  6466072    AR
4  60384.253224  6568168    CA

Using relational features

If you are likely to query a column frequently to retrieve a subset, make that column a data column so that it is indexed.

In [70]:
df.to_hdf('states.h5', 'stats', mode='w',
          data_columns=['state'],
          format='table')

Retrieving via a table query

In [73]:
pd.read_hdf('states.h5', 'stats',
            where="state in ['AL', 'NC', 'GA']")
Out[73]:
          income      pop state
0   44406.959898  4666424    AL
9   48488.454711  6013478    GA
32  49482.967676  6739070    NC
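
Numeric columns can be queried the same way, provided they are also declared as data columns when the table is written. A sketch (the 5,000,000 threshold is arbitrary):

df.to_hdf('states.h5', 'stats', mode='w',
          data_columns=['state', 'pop'],
          format='table')
pd.read_hdf('states.h5', 'stats', where='pop > 5000000').head()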