Data Archival and Management (Part 3)¶
In [1]:
import numpy as np
import h5py
import arrow
Using HDF5¶
HDF5 is a file format for storing large amounts of annotated, numerical data organized hierarchically. It is very useful in numerical work because it lets you isolate parts of the data structure for processing in memory. It is the default storage format for modern versions of MATLAB.
There are two well-known Python packages for working with HDF5, h5py and pytables.

  * h5py provides a numpy-like interface and is probably easier to use
  * pytables provides a database table abstraction and is used by pandas in its read_hdf and to_hdf I/O functions
Saving simulation data¶
We give a simple example: you perform a set of different simulations and save the results to an HDF5 file.
Suppose the simulations are for infection rates of different pathogens, and for each pathogen you run a few simulations under different assumptions.
Simulation data¶
In [2]:
# Simulations done on 10-11-2017 by Charles Darwin
malaria_asia = np.random.poisson(10, (100, 10))
malaria_africa = np.random.poisson(11, (100, 10))
malaria_america = np.random.poisson(12, (100, 10))
# Simulations done on 11-11-2017 by Gregor Mendel
aids_asia = np.random.gamma(102, 10, (100, 10))
aids_africa = np.random.gamma(101, 11, (200, 10))
aids_america = np.random.gamma(100, 12, (300, 10))
# Simulations done on 12-11-2017 by Charlie Brown
tb_asia = np.random.normal(5, 1, (100, 10))
tb_africa = np.random.normal(6, 1, (100, 10))
tb_america = np.random.normal(7, 1, (100, 10))
Create the HDF5 file¶
Passing mode 'a' opens the file in read/write mode if it exists, and creates it if it does not. (Older versions of h5py used this as the default; newer versions require an explicit mode.)
In [3]:
f = h5py.File('sim.h5', 'a')
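Opening the file and closing it later works, but a context manager guarantees the handle is closed even if an exception occurs mid-write. A minimal sketch (the file name demo.h5 is just for illustration):

```python
import numpy as np
import h5py

# 'w' creates the file (truncating any existing one); 'r' is read-only.
# The with-block closes the file automatically on exit.
with h5py.File('demo.h5', 'w') as f:
    f.create_dataset('x', data=np.arange(5))

with h5py.File('demo.h5', 'r') as f:
    x = f['x'][:]   # copy the data out before the handle closes
print(x)
```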
Populate the HDF5 file with groups, datasets and annotations¶
In [4]:
diseases = ['malaria', 'aids', 'tuberculosis']
creators = ['Charles Darwin', 'Gregor Mendel', 'Charlie Brown']
dates = [arrow.get(date, 'DD-MM-YYYY') for date in ['10-11-2017', '11-11-2017', '12-11-2017']]
regions = ['asia', 'africa', 'america']
datasets = [
    [malaria_asia, malaria_africa, malaria_america],
    [aids_asia, aids_africa, aids_america],
    [tb_asia, tb_africa, tb_america],
]
In [5]:
for disease, creator, date, dataset in zip(diseases, creators, dates, datasets):
    g = f.create_group(disease)
    g.attrs['creator'] = creator
    g.attrs['creation date'] = str(date)
    for region, simulation in zip(regions, dataset):
        d = g.create_dataset(region, data=simulation)
        d.attrs['timestamp'] = str(arrow.now())
Close file when no longer needed¶
In [6]:
f.close()
Using an HDF5 file¶
In [7]:
f = h5py.File('sim.h5', 'a')
In [8]:
list(f.keys())
Out[8]:
['aids', 'malaria', 'tuberculosis']
Iteration¶
In [9]:
for group in f:
    for key, val in f[group].attrs.items():
        print(group, key, val)
        print(f[group])
aids creator Gregor Mendel
<HDF5 group "/aids" (3 members)>
aids creation date 2017-11-11T00:00:00+00:00
<HDF5 group "/aids" (3 members)>
malaria creator Charles Darwin
<HDF5 group "/malaria" (3 members)>
malaria creation date 2017-11-10T00:00:00+00:00
<HDF5 group "/malaria" (3 members)>
tuberculosis creator Charlie Brown
<HDF5 group "/tuberculosis" (3 members)>
tuberculosis creation date 2017-11-12T00:00:00+00:00
<HDF5 group "/tuberculosis" (3 members)>
Using a visitor¶
In [17]:
f.visit(lambda x: print(x))
aids
aids/africa
aids/america
aids/asia
malaria
malaria/africa
malaria/america
malaria/asia
tuberculosis
tuberculosis/africa
tuberculosis/america
tuberculosis/asia
Using a visitor iterator¶
In [24]:
def view(name, obj):
    print(name)
    for item in obj.attrs:
        print(item, obj.attrs[item])
    print()
f.visititems(view)
aids
creator Gregor Mendel
creation date 2017-11-11T00:00:00+00:00
aids/africa
timestamp 2017-11-13T01:12:42.165815+00:00
aids/america
timestamp 2017-11-13T01:12:42.166395+00:00
aids/asia
timestamp 2017-11-13T01:12:42.165222+00:00
malaria
creator Charles Darwin
creation date 2017-11-10T00:00:00+00:00
malaria/africa
timestamp 2017-11-13T01:12:42.163596+00:00
malaria/america
timestamp 2017-11-13T01:12:42.164182+00:00
malaria/asia
timestamp 2017-11-13T01:12:42.162954+00:00
tuberculosis
creator Charlie Brown
creation date 2017-11-12T00:00:00+00:00
tuberculosis/africa
timestamp 2017-11-13T01:12:42.168003+00:00
tuberculosis/america
timestamp 2017-11-13T01:12:42.168549+00:00
tuberculosis/asia
timestamp 2017-11-13T01:12:42.167439+00:00
Load a slice of a data set into memory¶
In [32]:
tb_am = f['tuberculosis']['america']
tb_am
Out[32]:
<HDF5 dataset "america": shape (100, 10), type "<f8">
In [33]:
tb_am[:5, :3]
Out[33]:
array([[ 6.17699582, 5.75843404, 7.10402303],
[ 7.02025947, 5.99105695, 6.46981765],
[ 5.93552647, 6.66634888, 6.99460096],
[ 5.50521143, 8.21152932, 6.19222898],
[ 6.28336856, 6.83224101, 7.98080216]])
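A Dataset object is only a handle to the data on disk: slicing it reads just the requested region, while `[...]` pulls the entire array into memory as a numpy array. A small self-contained sketch (slices.h5 is a throwaway file name):

```python
import numpy as np
import h5py

with h5py.File('slices.h5', 'w') as f:
    d = f.create_dataset('big', data=np.arange(1000).reshape(100, 10))
    corner = d[:2, :3]   # only these 6 values are read from disk
    whole = d[...]       # the full (100, 10) array, now in memory
print(corner)
print(whole.shape)
```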
Modifying a data set¶
In [34]:
tb_am[:5, :3] **= 2
In [35]:
tb_am[:5, :3]
Out[35]:
array([[ 38.15527734, 33.15956263, 50.46714314],
[ 49.28404308, 35.8927634 , 41.85854039],
[ 35.23047445, 44.44020735, 48.92444253],
[ 30.3073529 , 67.42921375, 38.34369969],
[ 39.4807205 , 46.67951722, 63.69320318]])
In [36]:
f['tuberculosis']['america'][:5, :3]
Out[36]:
array([[ 38.15527734, 33.15956263, 50.46714314],
[ 49.28404308, 35.8927634 , 41.85854039],
[ 35.23047445, 44.44020735, 48.92444253],
[ 30.3073529 , 67.42921375, 38.34369969],
[ 39.4807205 , 46.67951722, 63.69320318]])
Creating resizable data sets¶
In [37]:
g = f.create_group('dengue')
g.create_dataset('asia', data=np.random.randint(0,10,(10,5)), maxshape=(None, 5))
Out[37]:
<HDF5 dataset "asia": shape (10, 5), type "<i8">
In [41]:
dset = f['dengue/asia']
dset.resize((20, 5))
dset[10:20, :] = np.ones((10, 5))
In [43]:
dset[:, :]
Out[43]:
array([[3, 2, 4, 0, 7],
[2, 1, 4, 5, 8],
[2, 8, 6, 7, 3],
[2, 1, 5, 7, 3],
[8, 5, 1, 2, 0],
[2, 8, 5, 3, 5],
[5, 5, 0, 9, 6],
[4, 5, 6, 4, 1],
[4, 6, 6, 5, 4],
[6, 2, 1, 5, 6],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1]])
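The resize-then-assign pattern above generalizes to an append helper: grow the dataset along its unlimited axis, then write the new rows into the freshly added space. A sketch assuming the dataset was created with maxshape=(None, ...); append_rows and grow.h5 are illustrative names:

```python
import numpy as np
import h5py

def append_rows(dset, rows):
    """Grow a resizable dataset along axis 0 and write new rows at the end."""
    n = dset.shape[0]
    dset.resize(n + rows.shape[0], axis=0)
    dset[n:] = rows

with h5py.File('grow.h5', 'w') as f:
    d = f.create_dataset('log', data=np.zeros((2, 3)), maxshape=(None, 3))
    append_rows(d, np.ones((4, 3)))
    shape = d.shape
print(shape)
```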
Creating compressed data sets¶
Compression and decompression occur automatically if the compression keyword is given the name of an appropriate compression algorithm. Use gzip for good compression at moderate speed, or lzf for moderate compression at high speed.
In [45]:
g.create_dataset('america', data=np.arange(20).reshape(-1, 5),
                 maxshape=(None, 5),
                 compression='lzf')
dset = f['dengue/america']
dset[:]
Out[45]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
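gzip also accepts a compression level (0-9) through the compression_opts keyword, and compressed data is always stored in chunks, whose shape can be set explicitly with chunks=. A sketch (packed.h5 is an illustrative file name):

```python
import numpy as np
import h5py

with h5py.File('packed.h5', 'w') as f:
    d = f.create_dataset('grid',
                         data=np.zeros((1000, 100)),
                         compression='gzip',
                         compression_opts=9,  # max compression, slowest
                         chunks=(100, 100))   # chunk shape on disk
    info = (d.compression, d.compression_opts, d.chunks)
print(info)
```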
Close file handle when done¶
In [46]:
f.close()
Using with pandas¶
In [54]:
import pandas as pd
In [65]:
states = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO',
'CT', 'DE', 'FL', 'GA', 'HI', 'ID',
'IL', 'IN', 'IA', 'KS', 'KY', 'LA',
'ME', 'MD', 'MA', 'MI', 'MN', 'MS',
'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
'NM', 'NY', 'NC', 'ND', 'OH', 'OK',
'OR', 'PA', 'RI', 'SC', 'SD', 'TN',
'TX', 'UT', 'VT', 'VA', 'WA', 'WV',
'WI', 'WY']
In [66]:
pop = np.random.randint(500000, 10000000, 50)
income = np.random.normal(50000, 10000, 50)
In [67]:
df = pd.DataFrame(dict(state=states, pop=pop, income=income))
In [68]:
df.head()
Out[68]:
|   | income | pop | state |
|---|---|---|---|
| 0 | 44406.959898 | 4666424 | AL |
| 1 | 44072.976317 | 8842771 | AK |
| 2 | 54005.243082 | 5199134 | AZ |
| 3 | 43124.820188 | 6466072 | AR |
| 4 | 60384.253224 | 6568168 | CA |
Standard dataset¶
In [74]:
df.to_hdf('simple.h5', key='stats')
In [76]:
pd.read_hdf('simple.h5', 'stats').head()
Out[76]:
|   | income | pop | state |
|---|---|---|---|
| 0 | 44406.959898 | 4666424 | AL |
| 1 | 44072.976317 | 8842771 | AK |
| 2 | 54005.243082 | 5199134 | AZ |
| 3 | 43124.820188 | 6466072 | AR |
| 4 | 60384.253224 | 6568168 | CA |
Using relational features¶
If you are likely to query a column frequently to retrieve subsets, list it in data_columns so that it is indexed on disk.
In [70]:
df.to_hdf('states.h5', key='stats', mode='w',
          data_columns=['state'],
          format='table')
Retrieving via a table query¶
In [73]:
pd.read_hdf('states.h5', 'stats',
            where="state in ['AL', 'NC', 'GA']")
Out[73]:
|    | income | pop | state |
|---|---|---|---|
| 0  | 44406.959898 | 4666424 | AL |
| 9  | 48488.454711 | 6013478 | GA |
| 32 | 49482.967676 | 6739070 | NC |
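Numeric columns listed in data_columns can be queried the same way, with comparison operators rather than membership tests. A small self-contained sketch (query.h5 and the toy population figures are made up):

```python
import pandas as pd

df = pd.DataFrame({'state': ['AL', 'AK', 'AZ'],
                   'pop': [4_900_000, 730_000, 7_300_000]})

# Any column listed in data_columns can appear in a where= expression,
# including numeric comparisons, not just 'in' membership tests.
df.to_hdf('query.h5', key='stats', mode='w',
          data_columns=['state', 'pop'], format='table')

big = pd.read_hdf('query.h5', 'stats', where='pop > 1000000')
print(big['state'].tolist())
```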