HDF5 stands for Hierarchical Data Format 5, and it is developed by the HDF Group. From their website:
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.
Various programming languages have APIs for interacting with HDF5-formatted files; for example, there are libraries for Python and R, which I will briefly cover. The HDF Group also develops a set of command-line tools; I will talk a little about h5ls and h5dump.
My goal here is just to give a little taste. The true power of HDF5 is not apparent until you look at real use cases. For example, the Python package vcfnp converts a VCF file into an HDF5 file, allowing you to quickly access different parts of the VCF.
For all of these tools to work, you need to install the HDF5 software from the HDF Group. On Debian or Ubuntu:
sudo apt-get update
sudo apt-get install h5utils hdf5-tools hdfview libhdf5-dev
On OSX, take a look at MacPorts.
For Linux, OSX, and Windows, you can also download and install from the HDF Group.
There are two major Python packages for interacting with HDF5 files: PyTables and h5py. The two packages have slightly different interfaces. I will go over a quick example usage of PyTables, h5py, and Pandas + PyTables.
You will need to have installed:
PyTables can be installed using pip:
pip install tables --user
or using your Python distribution's package manager.
In [4]:
# Import packages
import numpy as np
import tables as pt # PyTables
import h5py as hp # h5py
import pandas as pd
import rpy2
%load_ext rpy2.ipython
In [5]:
# Create a New HDF5 File
h5file = pt.open_file('test.h5', mode='w', title='Test file')
HDF5 is organized in a hierarchical structure and syntax is similar to the Linux/OSX file structure. A group can be thought of as a folder.
In [6]:
# Create new group
group = h5file.create_group('/', 'pytables', 'PyTables Test')
print(group)
While a table can be thought of as a file in a folder.
In [7]:
# Create new table
class HgSnpCall(pt.IsDescription):
    chrom = pt.StringCol(16)  # 16-character string
    start = pt.UInt32Col()    # Unsigned 32-bit integer
    end = pt.UInt32Col()      # Unsigned 32-bit integer
    call = pt.StringCol(16)   # 16-character string

table = h5file.create_table(group, 'hg19', HgSnpCall, 'Human SNP Calls')
In [8]:
# Add a row of data to the table.
position = table.row
position['chrom'] = 'chr4'
position['start'] = 10023
position['end'] = 10024
position['call'] = 'A/G'
position.append()
# Flush the table's buffered rows to disk, similar to a commit in SQL
table.flush()
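Once rows have been flushed, they can be read back or filtered. The following is a minimal sketch, not part of the original notebook: it rebuilds the same one-row table in a temporary file and filters it with `Table.read_where`. The file name `sketch_pytables.h5` and the helper `build_and_query` are made up for illustration. Under Python 3, string columns come back as bytes, so the sketch decodes them.

```python
import os
import tempfile

import tables as pt


class HgSnpCall(pt.IsDescription):
    chrom = pt.StringCol(16)  # 16-character string
    start = pt.UInt32Col()    # Unsigned 32-bit integer
    end = pt.UInt32Col()      # Unsigned 32-bit integer
    call = pt.StringCol(16)   # 16-character string


def build_and_query(path):
    """Hypothetical helper: write one SNP call, then select rows with start > 10000."""
    with pt.open_file(path, mode='w', title='Sketch file') as h5file:
        group = h5file.create_group('/', 'pytables', 'PyTables Test')
        table = h5file.create_table(group, 'hg19', HgSnpCall, 'Human SNP Calls')
        row = table.row
        row['chrom'] = 'chr4'
        row['start'] = 10023
        row['end'] = 10024
        row['call'] = 'A/G'
        row.append()
        table.flush()
        # read_where returns a NumPy structured array of the matching rows
        matches = table.read_where('start > 10000')
        hits = [(r['chrom'].decode(), int(r['start'])) for r in matches]
        return int(table.nrows), hits


with tempfile.TemporaryDirectory() as tmpdir:
    nrows, hits = build_and_query(os.path.join(tmpdir, 'sketch_pytables.h5'))

print(nrows, hits)
```

The condition string is evaluated inside PyTables, so only the matching rows are materialized as a NumPy array.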
In [9]:
%%bash
# Let's look at the table we created using an external utility
hdfview 'test.h5'
In [10]:
# Close the h5file
h5file.close()
In [11]:
# Create a DataFrame
df_snp = pd.DataFrame({'chrom': ['chr4', 'chr4', 'chr2', 'chr2'],
                       'start': [10023, 3020, 40404, 20202],
                       'end': [10024, 3023, 40405, 20203],
                       'call': ['A/G', 'AA/G', 'T/C', 'A/C']},
                      columns=['chrom', 'start', 'end', 'call'])
print(df_snp)
In [12]:
# Save to hdf5 file
hdf = pd.HDFStore('test.h5')
hdf.put('pandas_test', df_snp, format='table', data_columns=True)
hdf.close()
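Storing the frame with `format='table'` and `data_columns=True` makes its columns queryable on disk. As a rough sketch (my addition, written against a temporary file named `sketch_pandas.h5` rather than `test.h5`), `pd.read_hdf` can take a `where` condition so that only the matching rows are loaded:

```python
import os
import tempfile

import pandas as pd

df_snp = pd.DataFrame({'chrom': ['chr4', 'chr4', 'chr2', 'chr2'],
                       'start': [10023, 3020, 40404, 20202],
                       'end': [10024, 3023, 40405, 20203],
                       'call': ['A/G', 'AA/G', 'T/C', 'A/C']},
                      columns=['chrom', 'start', 'end', 'call'])

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch
    path = os.path.join(tmpdir, 'sketch_pandas.h5')

    # The context manager closes the store when the block ends
    with pd.HDFStore(path, mode='w') as hdf:
        hdf.put('pandas_test', df_snp, format='table', data_columns=True)

    # Each query selects rows on disk using one of the data columns
    chr4_calls = pd.read_hdf(path, 'pandas_test', where='chrom == "chr4"')
    far_calls = pd.read_hdf(path, 'pandas_test', where='start > 10000')

print(chr4_calls)
print(far_calls)
```

This kind of query only works for the table format; the default fixed format is faster to write but does not support `where`.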
In [13]:
%%bash
# Now let's look at it again
hdfview 'test.h5'
As I have mentioned, there are libraries for reading HDF5 files in R. Now we can open this file in R using the following:
In [16]:
%%R
library(rhdf5)
library(bit64)
data = h5read('test.h5', 'pandas_test/table', bit64conversion='bit64')
print(data)
h5py can be installed using pip:
pip install h5py --user
or using your Python distribution's package manager.
While Pandas + PyTables is very useful for traditional tabular data sets, HDF5 can store a variety of data types. The Python package h5py offers higher-level access to an HDF5 file and can quickly add and store arrays and lists.
In [17]:
# Open the HDF5 file in append mode ('a'): read/write if it exists, create it otherwise
hdf = hp.File('test.h5', 'a')
In [18]:
# Create a new group
group = hdf.create_group('h5py_test')
In [19]:
# Create a new dataset object
dat = group.create_dataset('matrix', shape=(100, 100), dtype='i')
In [20]:
# I made a 100 x 100 matrix
dat[...]
In [21]:
# We can then do things to this matrix
dat[0,0] = 999
print(dat[...])
hdf.close()
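Beyond an empty matrix, h5py can store existing lists and NumPy arrays directly, and attach small pieces of metadata as attributes. The sketch below is my own illustration, not from the original notebook; it writes to a temporary file, and the dataset names `positions` and `identity` and the attribute `genome` are made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch
    path = os.path.join(tmpdir, 'sketch_h5py.h5')

    with h5py.File(path, 'w') as hdf:
        group = hdf.create_group('h5py_test')
        # A plain Python list is converted to a 1-D dataset
        group.create_dataset('positions', data=[10023, 3020, 40404, 20202])
        # A NumPy array keeps its shape and dtype
        group.create_dataset('identity', data=np.eye(3, dtype='i'))
        # Attributes hold metadata on a group or dataset
        group['positions'].attrs['genome'] = 'hg19'

    with h5py.File(path, 'r') as hdf:
        positions = hdf['h5py_test/positions'][...].tolist()
        diagonal = hdf['h5py_test/identity'][...].diagonal().tolist()
        genome = hdf['h5py_test/positions'].attrs['genome']

print(positions, diagonal, genome)
```

Using `with` blocks instead of an explicit `hdf.close()` ensures the file is closed even if an error occurs partway through.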
In [22]:
%%bash
hdfview test.h5
In [23]:
%%bash
# On the command line we can also list the contents of an hdf5 file
h5ls test.h5
In [20]:
%%bash
# On the command line we can also dump part of a dataset: start at (0,0), count 5 x 15
h5dump -d /h5py_test/matrix -s "0,0" -c "5,15" test.h5
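The same kind of inspection can be sketched from Python with h5py. This is an illustrative equivalent under my own assumptions, not the output of the tools above: `visititems` walks the file recursively (closer to `h5ls -r` than to plain `h5ls`), and a NumPy-style slice reads the same 5 x 15 block that the `-s`/`-c` options select. The helper `list_contents` and the file name `sketch_inspect.h5` are hypothetical.

```python
import os
import tempfile

import h5py


def list_contents(hdf):
    """Hypothetical helper: collect (path, kind) for every group and dataset."""
    found = []

    def visit(name, obj):
        kind = 'Dataset' if isinstance(obj, h5py.Dataset) else 'Group'
        found.append(('/' + name, kind))
        # Returning None tells h5py to keep walking

    hdf.visititems(visit)
    return found


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sketch_inspect.h5')

    # Recreate the 100 x 100 matrix from the h5py example
    with h5py.File(path, 'w') as hdf:
        dat = hdf.create_group('h5py_test').create_dataset(
            'matrix', shape=(100, 100), dtype='i')
        dat[0, 0] = 999

    with h5py.File(path, 'r') as hdf:
        contents = list_contents(hdf)
        # Start at (0, 0) and take a 5 x 15 block, like -s "0,0" -c "5,15"
        block = hdf['/h5py_test/matrix'][0:5, 0:15]

print(contents)
print(block.shape, block[0, 0])
```

Only the requested block is read from disk; the rest of the matrix is never loaded into memory.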
In [24]:
%%bash
# Clean up our mess
#rm test.h5