Data Archival and Management (Part 3)

In [1]:
import numpy as np
import h5py
import arrow

Using HDF5

HDF5 is a file format for storing large amounts of annotated, numerical data organized hierarchically. It is very useful in numerical work because you can isolate parts of the data structure and load only those parts into memory for processing. It is also the default storage format for modern versions of MATLAB.

There are two well-known Python packages for working with HDF5, h5py and pytables.

  • h5py provides a numpy-like interface and is probably easier to use
  • pytables provides a database table abstraction and is used by pandas in its read_hdf and to_hdf I/O functions
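
A minimal sketch of this hierarchy using h5py (the file name example.h5 and the names experiment1/trial1 are just placeholders): groups behave like directories, datasets hold the arrays, and attributes attach metadata to either.

with h5py.File('example.h5', 'w') as hf:
    grp = hf.create_group('experiment1')                      # group: behaves like a directory
    dset = grp.create_dataset('trial1', data=np.arange(10))   # dataset: holds an array
    grp.attrs['description'] = 'toy example'                  # attribute on a group
    dset.attrs['units'] = 'arbitrary'                         # attribute on a dataset

with h5py.File('example.h5', 'r') as hf:
    print(hf['experiment1/trial1'][:5])                       # numpy-like slicing into the file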

Saving simulation data

We give a simple example where you perform a set of different simulations and save the results to an HDF5 file.

Suppose the simulations are for infection rates by different pathogens, and for each pathogen you run a few simulations with different assumptions.

Simulation data

In [2]:
# Simulations done on 10-11-2017 by Charles Darwin
malaria_asia = np.random.poisson(10, (100, 10))
malaria_africa = np.random.poisson(11, (100, 10))
malaria_america = np.random.poisson(12, (100, 10))

# Simulations done on 11-11-2017 by Gregor Mendel
aids_asia = np.random.gamma(102, 10, (100, 10))
aids_africa = np.random.gamma(101, 11, (200, 10))
aids_america = np.random.gamma(100, 12, (300, 10))

# Simulations done on 12-11-2017 by Charlie Brown
tb_asia = np.random.normal(5, 1, (100, 10))
tb_africa = np.random.normal(6, 1, (100, 10))
tb_america = np.random.normal(7, 1, (100, 10))

Create the HDF5 file

In append mode ('a') the file is opened read/write if it exists, and created if it does not. Older versions of h5py used this as the default mode; newer versions default to read-only, so it is safest to pass the mode explicitly.

In [3]:
f = h5py.File('sim.h5', 'a')
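
For reference, the standard h5py mode strings (a quick summary, shown as comments rather than run against the already-open handle):

# Common h5py.File modes:
#   'r'  : read-only, the file must already exist
#   'r+' : read/write, the file must already exist
#   'w'  : create a new file, truncating any existing file with the same name
#   'w-' : create a new file, failing if it already exists
#   'a'  : read/write if the file exists, create it otherwise (used above)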

Populate HDF5 file with groups, datasets and annotations

In [4]:
diseases = ['malaria', 'aids', 'tuberculosis']
creators = ['Charles Darwin', 'Gregor Mendel', 'Charlie Brown']
dates = [arrow.get(date, 'DD-MM-YYYY') for date in ['10-11-2017', '11-11-2017', '12-11-2017']]
regions = ['asia', 'africa', 'america']
datasets = [
    [malaria_asia, malaria_africa, malaria_america],
    [aids_asia, aids_africa, aids_america],
    [tb_asia, tb_africa, tb_america],
]
In [5]:
for disease, creator, date, dataset in zip(diseases, creators, dates, datasets):
    g = f.create_group(disease)
    g.attrs['creator'] = creator
    g.attrs['creation date'] = str(date)
    for region, simulation in zip(regions, dataset):
        d = g.create_dataset(region, data=simulation)
        d.attrs['timestamp'] = str(arrow.now())

Close file when no longer needed

In [6]:
f.close()

Using an HDF5 file

In [7]:
f = h5py.File('sim.h5', 'a')
In [8]:
list(f.keys())
Out[8]:
['aids', 'malaria', 'tuberculosis']

Iteration

In [9]:
for group in f:
    for key, val in f[group].attrs.items():
        print(group, key, val)
        print(f[group])
aids creator Gregor Mendel
<HDF5 group "/aids" (3 members)>
aids creation date 2017-11-11T00:00:00+00:00
<HDF5 group "/aids" (3 members)>
malaria creator Charles Darwin
<HDF5 group "/malaria" (3 members)>
malaria creation date 2017-11-10T00:00:00+00:00
<HDF5 group "/malaria" (3 members)>
tuberculosis creator Charlie Brown
<HDF5 group "/tuberculosis" (3 members)>
tuberculosis creation date 2017-11-12T00:00:00+00:00
<HDF5 group "/tuberculosis" (3 members)>

Using a visitor

In [17]:
f.visit(lambda x: print(x))
aids
aids/africa
aids/america
aids/asia
malaria
malaria/africa
malaria/america
malaria/asia
tuberculosis
tuberculosis/africa
tuberculosis/america
tuberculosis/asia
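
visit stops and returns as soon as the callable returns something other than None, so it can also be used to search the hierarchy. A small sketch (the search term 'asia' is just an example):

# Return the first path whose name contains 'asia'
f.visit(lambda name: name if 'asia' in name else None)   # e.g. 'aids/asia'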

Using a visitor that also receives the object

In [24]:
def view(name, obj):
    print(name)
    for item in obj.attrs:
        print(item, obj.attrs[item])
    print()

f.visititems(view)
aids
creator Gregor Mendel
creation date 2017-11-11T00:00:00+00:00

aids/africa
timestamp 2017-11-13T01:12:42.165815+00:00

aids/america
timestamp 2017-11-13T01:12:42.166395+00:00

aids/asia
timestamp 2017-11-13T01:12:42.165222+00:00

malaria
creator Charles Darwin
creation date 2017-11-10T00:00:00+00:00

malaria/africa
timestamp 2017-11-13T01:12:42.163596+00:00

malaria/america
timestamp 2017-11-13T01:12:42.164182+00:00

malaria/asia
timestamp 2017-11-13T01:12:42.162954+00:00

tuberculosis
creator Charlie Brown
creation date 2017-11-12T00:00:00+00:00

tuberculosis/africa
timestamp 2017-11-13T01:12:42.168003+00:00

tuberculosis/america
timestamp 2017-11-13T01:12:42.168549+00:00

tuberculosis/asia
timestamp 2017-11-13T01:12:42.167439+00:00

Load a slice of a data set into memory

In [32]:
tb_am = f['tuberculosis']['america']
tb_am
Out[32]:
<HDF5 dataset "america": shape (100, 10), type "<f8">
In [33]:
tb_am[:5, :3]
Out[33]:
array([[ 6.17699582,  5.75843404,  7.10402303],
       [ 7.02025947,  5.99105695,  6.46981765],
       [ 5.93552647,  6.66634888,  6.99460096],
       [ 5.50521143,  8.21152932,  6.19222898],
       [ 6.28336856,  6.83224101,  7.98080216]])
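
Only the requested slice is read from disk. To pull the entire dataset into memory as a regular numpy array, index with [...] (a quick sketch):

# Materialise the whole dataset as an in-memory numpy array
tb_am_arr = tb_am[...]
type(tb_am_arr), tb_am_arr.shape   # (numpy.ndarray, (100, 10))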

Modifying a data set

In [34]:
tb_am[:5, :3] **= 2
In [35]:
tb_am[:5, :3]
Out[35]:
array([[ 38.15527734,  33.15956263,  50.46714314],
       [ 49.28404308,  35.8927634 ,  41.85854039],
       [ 35.23047445,  44.44020735,  48.92444253],
       [ 30.3073529 ,  67.42921375,  38.34369969],
       [ 39.4807205 ,  46.67951722,  63.69320318]])
In [36]:
f['tuberculosis']['america'][:5,:3]
Out[36]:
array([[ 38.15527734,  33.15956263,  50.46714314],
       [ 49.28404308,  35.8927634 ,  41.85854039],
       [ 35.23047445,  44.44020735,  48.92444253],
       [ 30.3073529 ,  67.42921375,  38.34369969],
       [ 39.4807205 ,  46.67951722,  63.69320318]])

Creating resizable data sets

In [37]:
g = f.create_group('dengue')
g.create_dataset('asia', data=np.random.randint(0,10,(10,5)), maxshape=(None, 5))
Out[37]:
<HDF5 dataset "asia": shape (10, 5), type "<i8">
In [41]:
dset = f['dengue/asia']
dset.resize((20, 5))
dset[10:20, :] = np.ones((10, 5))
In [43]:
dset[:, :]
Out[43]:
array([[3, 2, 4, 0, 7],
       [2, 1, 4, 5, 8],
       [2, 8, 6, 7, 3],
       [2, 1, 5, 7, 3],
       [8, 5, 1, 2, 0],
       [2, 8, 5, 3, 5],
       [5, 5, 0, 9, 6],
       [4, 5, 6, 4, 1],
       [4, 6, 6, 5, 4],
       [6, 2, 1, 5, 6],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])
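
Because maxshape lets the first dimension grow without bound, a common pattern is to append new rows as they arrive, resizing just before each write. A sketch, assuming fresh results come in blocks of 10 rows:

# Hypothetical append pattern: grow the dataset, then write the new block
new_block = np.random.randint(0, 10, (10, 5))
n = dset.shape[0]
dset.resize((n + new_block.shape[0], 5))   # extend along the unlimited axis
dset[n:, :] = new_block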

Creating compressed data sets

Compression and decompression occur automatically if the compression keyword is given the name of an appropriate compression algorithm. Use gzip for good compression/moderate speed or lzf for moderate compression/fast speed.

In [45]:
g.create_dataset('america', data=np.arange(20).reshape(-1, 5),
                 maxshape=(None, 5),
                 compression = 'lzf')
dset = f['dengue/america']
dset[:]
Out[45]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
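
With gzip the compression level can also be tuned via compression_opts (0-9, higher is smaller but slower). A sketch, where the dataset name 'africa' and level 6 are just illustrative choices:

g.create_dataset('africa', data=np.random.poisson(5, (1000, 5)),
                 maxshape=(None, 5),
                 compression='gzip', compression_opts=6)
f['dengue/africa'].compression, f['dengue/africa'].compression_opts   # ('gzip', 6)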

Close file handle when done

In [46]:
f.close()

Using with pandas

In [54]:
import pandas as pd
In [65]:
states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO',
          'CT', 'DE', 'FL', 'GA', 'HI', 'ID',
          'IL', 'IN', 'IA', 'KS', 'KY', 'LA',
          'ME', 'MD', 'MA', 'MI', 'MN', 'MS',
          'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
          'NM', 'NY', 'NC', 'ND', 'OH', 'OK',
          'OR', 'PA', 'RI', 'SC', 'SD', 'TN',
          'TX', 'UT', 'VT', 'VA', 'WA', 'WV',
          'WI', 'WY']
In [66]:
pop = np.random.randint(500000, 10000000, 50)
income = np.random.normal(50000, 10000, 50)
In [67]:
df = pd.DataFrame(dict(state=states, pop=pop, income=income))
In [68]:
df.head()
Out[68]:
         income      pop state
0  44406.959898  4666424    AL
1  44072.976317  8842771    AK
2  54005.243082  5199134    AZ
3  43124.820188  6466072    AR
4  60384.253224  6568168    CA

Standard dataset

In [74]:
df.to_hdf('simple.h5', 'stats')
In [76]:
pd.read_hdf('simple.h5', 'stats').head()
Out[76]:
         income      pop state
0  44406.959898  4666424    AL
1  44072.976317  8842771    AK
2  54005.243082  5199134    AZ
3  43124.820188  6466072    AR
4  60384.253224  6568168    CA

Using relational features

If you are likely to query a column frequently to retrieve a subset, make that column a data column so that it is indexed.

In [70]:
df.to_hdf('states.h5', 'stats', mode='w',
          data_columns=['state'],
          format='table')

Retrieving via a table query

In [73]:
pd.read_hdf('states.h5', 'stats',
            where="state in ['AL', 'NC', 'GA']")
Out[73]:
          income      pop state
0   44406.959898  4666424    AL
9   48488.454711  6013478    GA
32  49482.967676  6739070    NC
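
Numeric columns can be queried the same way, provided they are also declared as data columns when the table is written. A sketch (the 5,000,000 threshold is arbitrary):

df.to_hdf('states.h5', 'stats', mode='w',
          data_columns=['state', 'pop'],
          format='table')
pd.read_hdf('states.h5', 'stats', where='pop > 5000000').head()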