h5py: reading and writing HDF5 files in Python
If you’re storing large amounts of data that you need quick access to, your standard text file isn’t going to cut it. The kinds of cosmological simulations that I run generate huge amounts of data, and to analyse them I need to be able to access exactly the data I want, quickly and painlessly.
HDF5 is one answer. It’s a powerful binary data format with no upper limit on file size. It provides parallel I/O, and carries out a bunch of low-level optimisations under the hood to make queries faster and storage requirements smaller.
Here’s a quick intro to the h5py package, which provides a Python interface to the HDF5 data format. We’ll create an HDF5 file, query it, create a group, and save compressed data.
You’ll need HDF5 installed, which can be a pain. Getting h5py is relatively painless in comparison; just use your favourite package manager.
Creating HDF5 files
We first load the numpy and h5py modules:
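import numpy as np
import h5py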
Now mock up some simple dummy data to save to our file.
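Something along these lines (d1 and d2 are just placeholder names):

d1 = np.random.random(size=(1000, 20))   # dummy 2D arrays of random floats
d2 = np.random.random(size=(1000, 200))

print(d1.shape, d2.shape)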
(1000, 20) (1000, 200)
The first step to creating an HDF5 file is to initialise it, with a syntax much like opening an ordinary file in Python. The first argument provides the filename and location, the second the mode; we’re writing the file, so we provide w for write access.
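For example, with data.h5 as a stand-in filename:

hf = h5py.File('data.h5', 'w')   # open a new file in write mode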
This creates a file object, hf, which has a bunch of associated methods. One is create_dataset, which does what it says on the tin: just provide a name for the dataset, and the numpy array.
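With dataset_1 and dataset_2 as placeholder names, that looks something like this (in an interactive session the last call echoes the dataset object it returns):

hf.create_dataset('dataset_1', data=d1)
hf.create_dataset('dataset_2', data=d2)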
<HDF5 dataset "dataset_2": shape (1000, 200), type "<f8">
All we need to do now is close the file, which will write all of our work to disk.
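hf.close()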
Reading HDF5 files
To open and read data we use the same File method, in read mode, r:
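hf = h5py.File('data.h5', 'r')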
To see what data is in this file, we can call the keys() method on the file object:
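hf.keys()   # in newer h5py versions this returns a view; wrap it in list() for a plain list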
[u'dataset_1', u'dataset_2']
We can then grab each dataset we created above using the get method, specifying the name:
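n1 = hf.get('dataset_1')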
This returns an HDF5 dataset object, not an array. To convert it to an array, just call numpy’s array method:
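n1 = np.array(n1)
n1.shape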
(1000, 20)
Groups
Groups are the basic container mechanism in an HDF5 file, allowing hierarchical organisation of the data. Groups are created similarly to datasets, and datasets are then added using the group object.
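Here’s a rough sketch, starting a fresh file (the array shapes and names are just for illustration):

d1 = np.random.random(size=(100, 33))    # new dummy data
d2 = np.random.random(size=(100, 33))

hf = h5py.File('data.h5', 'w')

g1 = hf.create_group('group1')           # create a group at the root of the file
g1.create_dataset('data1', data=d1)      # datasets are added via the group object
g1.create_dataset('data2', data=d2)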
<HDF5 dataset "data2": shape (100, 33), type "<f8">
We can also create subfolders; just specify the group name as a directory-style path:
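d3 = np.random.random(size=(100, 3333))   # another dummy array, shape for illustration
g2 = hf.create_group('group2/subfolder')  # intermediate groups are created automatically
g2.create_dataset('data3', data=d3)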
<HDF5 dataset "data3": shape (100, 3333), type "<f8">
As before, to read data in directories and subdirectories, use the get method with the full subdirectory path:
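group2 = hf.get('group2/subfolder')
group2.items()   # list the (name, object) pairs in this group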
[(u'data3', <HDF5 dataset "data3": shape (100, 3333), type "<f8">)]
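And similarly for the first group:

group1 = hf.get('group1')
group1.items()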
[(u'data1', <HDF5 dataset "data1": shape (100, 33), type "<f8">),
(u'data2', <HDF5 dataset "data2": shape (100, 33), type "<f8">)]
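Individual datasets can then be pulled out of the group object and converted to arrays as before:

n1 = group1.get('data1')
np.array(n1).shape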
(100, 33)
Compression
To save on disk space, at the cost of read speed, you can compress the data. Just add the compression argument, which can be either gzip, lzf or szip. gzip is the most portable, as it’s available with every HDF5 install; lzf is the fastest, but doesn’t compress as effectively as gzip; and szip is a NASA format that is patent-encumbered. If you don’t know about szip, chances are your organisation doesn’t hold the patent licence, so avoid it.
For gzip you can also specify the additional compression_opts argument, which sets the compression level. The default is 4, but it can be any integer between 0 and 9.
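For example, a minimal sketch reusing the dummy data from above (the dataset name is a placeholder):

hf = h5py.File('data.h5', 'w')
hf.create_dataset('dataset_compressed', data=d1,
                  compression='gzip', compression_opts=9)   # maximum compression level
hf.close()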