HDF5 is a direct, easy path to "big" (or just annoyingly larger than RAM) data in scientific python. Using the HDF5 file format and careful coding of your algorithms, it is quite possible to process "big-ish" data on modest hardware, and really push your resources to the limit before moving to larger, fancier distributed platforms.
There are two primary options for working with HDF5 files in python: H5Py and PyTables. While many people enjoy H5Py, I am much more familiar with PyTables and prefer it for most tasks. This post will show a variety of PyTables examples for several potential applications, including standard numpy nd-array replacement, variable length (ragged) arrays, iterator rewriting for simpler retrieval of data, and on-the-fly compression/decompression.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
To test our various techniques for storing data in HDF5, first we need some data to store. Here I have created a quick function which will generate random data from four clusters, as well as returning the cluster label for each datapoint. This data generator should be sufficient for testing all the PyTables techniques I will show in this post.
In [2]:
random_state = np.random.RandomState(1999)
def make_random_cluster_points(n_samples, random_state=random_state):
    mu_options = np.array([(-1, -1), (1, 1), (1, -1), (-1, 1)])
    sigma = 0.2
    mu_choices = random_state.randint(0, len(mu_options), size=n_samples)
    means = mu_options[mu_choices]
    # use the passed random_state (not the global np.random) so results are reproducible
    return means + random_state.randn(n_samples, 2) * sigma, mu_choices

def plot_clusters(data, clusters, name):
    plt.figure()
    colors = ["#9b59b6", "#3498db", "#e74c3c", "#2ecc71"]
    for i in np.unique(clusters):
        plt.scatter(data[clusters==i, 0], data[clusters==i, 1], color=colors[i])
    plt.axis('off')
    plt.title('Plot from %s' % name)
data, clusters = make_random_cluster_points(10000)
plot_clusters(data, clusters, "data in memory")
First, a brief high level discussion of how HDF5 and PyTables work. I think the website says it best:
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free [...] One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space (specially if on-flight compression is used) than other solutions such as relational or object oriented databases.
PyTables allows us to quickly and easily deal with large volumes of on-disk data, while largely keeping the complexity of the data storage invisible to the downstream processing in numpy, scipy, theano, etc. Depending on the algorithms used downstream, you can sometimes drop in hdf5 data to replace a numpy array and everything will keep working exactly like before - when this happens it is quite an amazing feeling.
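To make the "drop in" idea concrete, here is a small hedged sketch (not from the original post): a function written against plain numpy-style indexing that would also accept a PyTables array node, since the node supports the same slicing syntax. The node name in the comment is hypothetical.
import numpy as np

def first_column_mean(arr):
    """Works on a numpy array or, equally, a PyTables Array/CArray/EArray node."""
    # Slicing a PyTables node returns a plain numpy array holding just that column
    return arr[:, 0].mean()

print(first_column_mean(np.random.randn(100, 2)))
# first_column_mean(hdf5_file.root.data) would work the same way (hypothetical node)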
First we will start with the simplest use for HDF5 - data persistence. We will show this using two types of Arrays provided by PyTables - Array and CArray.
We take the following steps below: import tables, create an Array to hold the data, and store our data in it.
In [3]:
import tables
sample_data, sample_clusters = make_random_cluster_points(10000)
hdf5_path = "my_data.hdf5"
hdf5_file = tables.openFile(hdf5_path, mode='w')
data_storage = hdf5_file.createArray(hdf5_file.root, 'data', sample_data)
clusters_storage = hdf5_file.createArray(hdf5_file.root, 'clusters', sample_clusters)
hdf5_file.close()
Now we can open the file, and use the data in it for the same things that we used the old arrays for. Make sure to notice that we are slicing (using [:]) the data and clusters back into memory so that we can use these variables for plotting. More on this in a moment.
In [4]:
hdf5_path = "my_data.hdf5"
read_hdf5_file = tables.openFile(hdf5_path, mode='r')
# Here we slice [:] all the data back into memory, then operate on it
hdf5_data = read_hdf5_file.root.data[:]
hdf5_clusters = read_hdf5_file.root.clusters[:]
read_hdf5_file.close()
plot_clusters(hdf5_data, hdf5_clusters, "PyTables Array")
So why did we need to slice the data in order to use it? This comes from the way that PyTables manages on-disk versus in memory data.
A PyTables array lives on disk until some data is sliced out using standard numpy notation. At that point, the data slice is read into memory from the disk, for the lifetime of that slice. Once the slice is finished being used, the python garbage collector will try to free the memory of the slice and any changes (if you made them) can be written back to the PyTables array on disk.
This makes PyTables a nearly transparent drop-in for numpy arrays which are larger than memory but only need to be processed a little bit at a time. For my own usage, I typically have some separate scripts to preprocess and write data to an HDF5 file, and all algorithmic processing scripts open that file as read-only - very similar to what we have seen above.
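As a minimal hedged sketch of that "a little bit at a time" pattern (not from the original post, using the same older PyTables API names as the rest of this post and the my_data.hdf5 file created above), here is a chunked mean over the on-disk array - only chunk_size rows are ever in memory at once.
import numpy as np
import tables

chunk_size = 1000
f = tables.openFile("my_data.hdf5", mode='r')
node = f.root.data
total = np.zeros(node.shape[1])
for start in range(0, node.shape[0], chunk_size):
    chunk = node[start:start + chunk_size]   # only this slice is read into memory
    total += chunk.sum(axis=0)
print(total / node.shape[0])
f.close()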
Now that we have seen a basic array, we can take a look at using a compressed array, the CArray. PyTables makes it really easy to store and read compressed data from disk.
In [5]:
sample_data, sample_clusters = make_random_cluster_points(10000)
hdf5_path = "my_compressed_data.hdf5"
hdf5_file = tables.openFile(hdf5_path, mode='w')
filters = tables.Filters(complevel=5, complib='blosc')
data_storage = hdf5_file.createCArray(hdf5_file.root, 'data',
                                      tables.Atom.from_dtype(sample_data.dtype),
                                      shape=sample_data.shape,
                                      filters=filters)
clusters_storage = hdf5_file.createCArray(hdf5_file.root, 'clusters',
                                          tables.Atom.from_dtype(sample_clusters.dtype),
                                          shape=sample_clusters.shape,
                                          filters=filters)
data_storage[:] = sample_data
clusters_storage[:] = sample_clusters
hdf5_file.close()
Here we have introduced a few more key PyTables constructs: compression filters and Atom types.
Compression filters are incredibly easy to use, and can provide a massive benefit to throughput and disk usage. Usage is as simple as what I showed above - simply define a filter, then use it when creating any of the compressible arrays (CArray, or the EArray introduced below).
filters = tables.Filters(complevel=5, complib='blosc')
Arguments for complib are blosc, zlib, and lzo, with blosc being my go-to compressor. There is a very in-depth discussion of different compression algorithms here. If you are really tuning PyTables and HDF5 for your application, I highly advise you to look there - but in general blosc is great, and can make your disk I/O effectively faster than streaming raw bytes.
Atoms are a special construct of PyTables which allow you to specify dtypes when creating arrays. There are helpers such as tables.Atom.from_dtype which can build an Atom from an existing numpy dtype, or you can manually specify one using one of the listed types such as tables.Float32Atom().
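For reference, here is a small sketch (my addition, not part of the original notebook) showing two equivalent ways to build the same Atom, plus an alternative Filters configuration for comparison.
import numpy as np
import tables

# Two equivalent ways to describe a 32-bit float element type
atom_from_dtype = tables.Atom.from_dtype(np.dtype('float32'))
atom_direct = tables.Float32Atom()

# A different filter choice for comparison: zlib at maximum compression level
zlib_filters = tables.Filters(complevel=9, complib='zlib')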
In [6]:
hdf5_path = "my_compressed_data.hdf5"
compressed_hdf5_file = tables.openFile(hdf5_path, mode='r')
# Here we slice [:] all the data back into memory, then operate on it
uncompressed_hdf5_data = compressed_hdf5_file.root.data[:]
uncompressed_hdf5_clusters = compressed_hdf5_file.root.clusters[:]
compressed_hdf5_file.close()
plot_clusters(uncompressed_hdf5_data, uncompressed_hdf5_clusters, "CArray")
So far, so good (so what). I can already anticipate questions such as "My data won't fit into memory at all... how can I get it into a PyTables array?"
Fear not, the EArray is here to save the day! This is the workhorse array type, and the one I use for 99% of my PyTables usage. It is compressible and extendable using the append method. Barring a few special exceptions, the EArray can take care of most data storage needs - let's see how it works.
In [7]:
hdf5_path = "my_extendable_compressed_data.hdf5"
hdf5_file = tables.openFile(hdf5_path, mode='w')
filters = tables.Filters(complevel=5, complib='blosc')
data_storage = hdf5_file.createEArray(hdf5_file.root, 'data',
                                      tables.Atom.from_dtype(sample_data.dtype),
                                      shape=(0, sample_data.shape[-1]),
                                      filters=filters,
                                      expectedrows=len(sample_data))
clusters_storage = hdf5_file.createEArray(hdf5_file.root, 'clusters',
                                          tables.Atom.from_dtype(sample_clusters.dtype),
                                          shape=(0,),
                                          filters=filters,
                                          expectedrows=len(sample_clusters))
for n, (d, c) in enumerate(zip(sample_data, sample_clusters)):
    data_storage.append(sample_data[n][None])
    clusters_storage.append(sample_clusters[n][None])
hdf5_file.close()
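As an aside, append does not have to be called one row at a time - it accepts any block whose non-extendable dimensions match, which is usually much faster. A hedged sketch (my addition; the file name is hypothetical and kept separate so it does not interfere with the file above):
import tables

block_path = "my_block_appended_data.hdf5"  # hypothetical file name
block_file = tables.openFile(block_path, mode='w')
filters = tables.Filters(complevel=5, complib='blosc')
block_storage = block_file.createEArray(block_file.root, 'data',
                                        tables.Atom.from_dtype(sample_data.dtype),
                                        shape=(0, sample_data.shape[-1]),
                                        filters=filters,
                                        expectedrows=len(sample_data))
block_storage.append(sample_data)        # append all rows in one call
block_storage.append(sample_data[:100])  # or any smaller block with matching width
block_file.close()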
After adding data and (losslessly) compressing it with EArray, we want to be sure that data retrieval is easy. Plotting the clusters should take care of this, and we do it just as we did for the CArray. We can also slice subsets of the EArray (e.g. hdf5_file.root.data[10:100]), which brings only that portion of the data into memory. For very large datasets this is extremely useful: you control memory usage by how much you slice out at a time, python cleans up each slice once it is no longer used, and the data stays on disk so you can fetch it again if needed.
In [8]:
hdf5_path = "my_extendable_compressed_data.hdf5"
extendable_hdf5_file = tables.openFile(hdf5_path, mode='r')
extendable_hdf5_data = extendable_hdf5_file.root.data[:]
extendable_hdf5_clusters = extendable_hdf5_file.root.clusters[:]
plot_clusters(extendable_hdf5_file.root.data[10:100],
              extendable_hdf5_file.root.clusters[10:100], "EArray subset")
plot_clusters(extendable_hdf5_data, extendable_hdf5_clusters, "full EArray")
extendable_hdf5_file.close()
We can also open old HDF5 files created with PyTables and add new data to an EArray. For example, if you were processing daily data from a point-of-sale application, an HDF5 file could be used to store the new information as it comes in. As a bonus, your sysadmin/DBA people will have a heart attack because you are not using databases/backups... but I digress. HDF5 files are really great for anything that is basically a numpy array, but much too large to fit in memory.
In [9]:
hdf5_path = "my_extendable_compressed_data.hdf5"
extendable_hdf5_file = tables.openFile(hdf5_path, mode='a')
extendable_hdf5_data = extendable_hdf5_file.root.data
extendable_hdf5_clusters = extendable_hdf5_file.root.clusters
print("Length of current data: %i" % len(extendable_hdf5_data))
print("Length of current cluster labels: %i" % len(extendable_hdf5_clusters))
n_added = 5
print("Now adding %i elements to each" % n_added)
for n, (d, c) in enumerate(zip(sample_data[:n_added], sample_clusters[:n_added])):
    extendable_hdf5_data.append(sample_data[n][None])
    extendable_hdf5_clusters.append(sample_clusters[n][None])
extendable_hdf5_file.close()
extendable_hdf5_file = tables.openFile(hdf5_path, mode='r')
print("Length of current data: %i" % len(extendable_hdf5_file.root.data))
print("Length of current cluster labels: %i" % len(extendable_hdf5_file.root.clusters))
extendable_hdf5_file.close()
EArray is incredibly flexible, and gives easy numpy-style access to on-disk data. There are some issues with variable length storage, which can be solved either by something called VLArray, or by fancy methods on top of EArray which I will discuss later.
That brings us to one more type of array which is very useful for variable length data like speech (utterances) or text (sentences) - the VLArray. Though it does not support compression of the data (see the note in the VLArray documentation), it is still useful in many cases, and is definitely the simplest way to handle variable length data.
In [10]:
hdf5_path = "my_variable_length_data.hdf5"
hdf5_file = tables.openFile(hdf5_path, mode='w')
data_storage = hdf5_file.createVLArray(hdf5_file.root, 'data',
                                       tables.Float32Atom(shape=()))
clusters_storage = hdf5_file.createVLArray(hdf5_file.root, 'clusters',
                                           tables.Int32Atom(shape=()))
random_state = np.random.RandomState(1999)
for n in range(1000):
    length = int(100 * random_state.randn() ** 2)
    data_storage.append(random_state.randn(length,))
    clusters_storage.append([length % random_state.randint(1, 5)])
hdf5_file.close()
To test the variable length approach we have done a few things above. For data_storage, we created random length float vectors with lengths between 0 and ~2000, to simulate variable length data whose lengths are distributed approximately exponentially - common for things like text (google Zipf's law for more details). The clusters remained fixed size, but needed to be a sequence, so we wrapped the value in [] brackets to make a one-element list (the length modulo a random integer from 1 to 4). This data doesn't really mean anything, but plotting the lengths clearly shows that we can store variable length data in an HDF5 container quite easily. This is very useful for raw text and raw speech, though it has limitations as well - namely that it does not handle multi-dimensional arrays!
In [11]:
hdf5_path = "my_variable_length_data.hdf5"
variable_hdf5_file = tables.openFile(hdf5_path, mode='r')
variable_hdf5_data = variable_hdf5_file.root.data
variable_hdf5_clusters = variable_hdf5_file.root.clusters
all_lengths = np.array([len(d) for d in variable_hdf5_data])
plt.hist(all_lengths, color="steelblue")
plt.title("Lengths of fake variable length data")
plt.figure()
clusters = variable_hdf5_clusters
colors = ["#9b59b6", "#3498db", "#e74c3c", "#2ecc71"]
for i in np.unique(clusters):
    plt.scatter([len(d) for d in variable_hdf5_data[(clusters==i).ravel()]],
                clusters[(clusters==i).ravel()], color=colors[i])
plt.title("Length vs. Class")
plt.show()
variable_hdf5_file.close()
We have seen EArray, which provides an array extendable in one dimension that can also be compressed and decompressed on the fly to improve "effective" disk speed and reduce size on the hard disk drive. We have also seen VLArray, which allows for storing "ragged" data whose length varies along the second dimension. Is there a way to store variable-sized data and retain compression?
Yes - by overriding __getitem__ and some other special tricks, we can do a number of interesting things with PyTables arrays (primarily EArray), including compressed arbitrary-shape data, memory caching, and more. First up: reshaping on the fly to store ragged arrays with multiple dimensions in a VLArray.
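Before touching PyTables, here is a toy, non-PyTables illustration of what overriding __getitem__ buys us (the class below is purely hypothetical, my addition): whatever happens inside __getitem__ runs every time the bracket operator is used.
import numpy as np

class Doubler(object):
    """Toy example: [] access is routed through __getitem__."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, key):
        # any transformation can happen here before the caller sees the data
        return 2 * self.data[key]

d = Doubler(np.arange(5))
print(d[3])     # 6
print(d[1:4])   # [2 4 6]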
In [12]:
hdf5_path = "my_nd_variable_length_data.hdf5"
hdf5_file = tables.openFile(hdf5_path, mode='w')
data_storage = hdf5_file.createVLArray(hdf5_file.root, 'data',
                                       tables.Float32Atom(shape=()))
data_shapes_storage = hdf5_file.createEArray(hdf5_file.root, 'data_shape',
                                             tables.Int32Atom(),
                                             shape=(0, 2),
                                             expectedrows=1000)
random_state = np.random.RandomState(1999)
for n in range(1000):
    shape = (int(100 * random_state.randn() ** 2), random_state.randint(2, 10))
    data_storage.append(random_state.randn(*shape).ravel())
    data_shapes_storage.append(np.array(shape)[None])
hdf5_file.close()
Thinking about the way numpy actually stores data, we can use the same trick to store n-dimensional variables in a VLArray.
If you are not familiar with the way numpy handles data internally, it is pretty awesome. Basically, there are two parts to a numpy array: the flat data buffer, and the metadata (shape and strides) describing how to step through that buffer. The data is stored "flat" regardless of the value of shape, which means that any valid reshape simply rewrites the metadata rather than copying the data.
<img src="../figures/numpy_data_layout.png" align="right" width="50%" height="50%">
Each of the colored cells has the same actual value, but there are several valid "views" onto the same data. We can create these views onto the underlying data using reshape.
Knowing this, we can apply the same trick to VLArray. By storing a flattened version of the numpy array (using ravel or flatten) along with the shape associated with that index (which is usually a constant size - here 2, since the data is 2D), we can reconstruct arbitrarily shaped data by overriding __getitem__, the special python method that controls what happens when the bracket operator ([]) is used.
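A quick hedged check of this "flat data plus a view" claim in plain numpy (my addition): reshape returns a new view over the same buffer, so modifying the view modifies the original.
import numpy as np

a = np.arange(6)        # flat buffer: [0 1 2 3 4 5]
b = a.reshape(2, 3)     # a new view onto the same buffer - no data is copied
b[0, 0] = 99
print(a)                # [99  1  2  3  4  5] - the underlying data is shared
print(b.ravel())        # back to the flat layout, shape (6,)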
In [13]:
hdf5_path = "my_nd_variable_length_data.hdf5"
nd_variable_hdf5_file = tables.openFile(hdf5_path, mode='r')
nd_variable_hdf5_data = nd_variable_hdf5_file.root.data
nd_variable_hdf5_shape = nd_variable_hdf5_file.root.data_shape
old_data_getter = nd_variable_hdf5_file.root.data.__getitem__
shape_getter = nd_variable_hdf5_file.root.data_shape.__getitem__
First we want to open the hdf5 file and get some shorter names for the things we are going to use. We also want to "save" the old way that data fetching used to work (old_data_getter) so we can build our new data getter, which does the reshaping for you. Next we will define a new getter which fetches data and reshapes it internally. The first argument (self) hints that eventually we will attach this as a method of a class in order to make the reshape transparent - but first we need to define the behavior we want.
In [14]:
import numbers
def getter(self, key):
    if isinstance(key, numbers.Integral) or isinstance(key, np.integer):
        start, stop, step = self._processRange(key, key, 1)
        if key < 0:
            key = start
        return old_data_getter(key).reshape(shape_getter(key))
    elif isinstance(key, slice):
        start, stop, step = self._processRange(key.start, key.stop, key.step)
        raise ValueError("Variable length - what should we do?")
Now we can check that the getter fetches data from the old getter and reshapes it, and see that the shape matches what we originally stored.
In [15]:
print(getter(nd_variable_hdf5_data, -1).shape)
print(nd_variable_hdf5_data[-1].shape)
print(nd_variable_hdf5_shape[-1])
We can fetch one key - no problem. But what should be done to fetch a slice (chunk) of variable length data? We can't really put the results into a 2D array unless we pad them all to the same length, or we can pass back a list of arrays. The method taken depends on your bias, but I personally prefer the list of arrays approach, as you can always pad later. Given this bias, let's update the getter and get it "attached" to the VLArray instance to override __getitem__.
In [16]:
def getter(self, key):
    if isinstance(key, numbers.Integral) or isinstance(key, np.integer):
        start, stop, step = self._processRange(key, key, 1)
        if key < 0:
            key = start
        return old_data_getter(key).reshape(shape_getter(key))
    elif isinstance(key, slice):
        start, stop, step = self._processRange(key.start, key.stop, key.step)
        return [old_data_getter(k).reshape(shape_getter(k)) for k in range(start, stop, step)]
For now, we have to use a python slice() object to test. Once we attach this getter to the data we will be able to use the familiar slice syntax - but for now just know that slice(start, stop, step) is equivalent to [start:stop:step]. This will return a list of arrays which we can then process using whatever algorithm we choose.
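If slice objects are unfamiliar, the equivalence is easy to check on a plain list (a tiny aside of mine, not from the original post):
x = list(range(10))
print(x[slice(1, 8, 2)])   # [1, 3, 5, 7]
print(x[1:8:2])            # identical - the bracket syntax builds the same slice object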
In [17]:
list_of_ragged_arrays = getter(nd_variable_hdf5_data, slice(-5, -1, 1))
print([d.shape for d in list_of_ragged_arrays])
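Since the slice form hands back a list of ragged arrays, padding to a dense batch can always be done afterwards if an algorithm needs it. Here is a hedged sketch of one way to do that; the pad_to_dense helper is hypothetical and my own addition.
import numpy as np

def pad_to_dense(list_of_arrays, fill=0.):
    """Pad ragged 2D arrays up to a common (n_samples, max_rows, max_cols) block."""
    max_r = max(a.shape[0] for a in list_of_arrays)
    max_c = max(a.shape[1] for a in list_of_arrays)
    out = np.full((len(list_of_arrays), max_r, max_c), fill,
                  dtype=list_of_arrays[0].dtype)
    for i, a in enumerate(list_of_arrays):
        out[i, :a.shape[0], :a.shape[1]] = a
    return out

padded = pad_to_dense(list_of_ragged_arrays)
print(padded.shape)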
Having done this work, we now need to add this functionality to the VLArray instance holding our data. It turns out this is pretty tricky (for reasons involving monkeypatching, c-extensions, and other Python business) - the only way I have found is to create a fake subclass, then change all instances of that subclass (which should be only this data) to fetch data using this special getter function. It is definitely not perfect, but it works.
Here it is also important to note that the shape of this ragged array storage is actually uninformative - len(data) is the best way to see the number of samples, and since every example has a different shape you can't really look at shape for the data as a whole. This is a somewhat obvious point in hindsight, but it can bite you in your code, so be sure to be aware of it.
In [18]:
class _my_VLArray_subclass(tables.VLArray):
    pass
nd_variable_hdf5_file.root.data.__class__ = _my_VLArray_subclass
_my_VLArray_subclass.__getitem__ = getter
print(nd_variable_hdf5_data[-1].shape)
list_of_ragged_arrays = nd_variable_hdf5_data[-5:-1]
print([d.shape for d in list_of_ragged_arrays])
We can see that the shapes here are the same as the shapes above - by implementing a custom getter we have invisibly allowed ourselves to store "multi-ragged" (for lack of a better term) data in VLArray.
Now that we have figured out how to "patch" objects to use custom getters, how can we use this to store variable length data (like text) in a compressible way? It turns out that we can introduce another "side column" that lets us do what VLArray does, except with EArray! Let's see it in action.
In [19]:
hdf5_path = "my_earray_variable_length_data.hdf5"
hdf5_file.close()
hdf5_file = tables.openFile(hdf5_path, mode='w')
filters = tables.Filters(complevel=5, complib='blosc')
data_storage = hdf5_file.createEArray(hdf5_file.root, 'data',
                                      tables.Int32Atom(),
                                      shape=(0,),
                                      filters=filters,
                                      # guess that there will be a mean of 50
                                      # numbers per sample
                                      expectedrows=50 * 1000)
data_start_and_stop_storage = hdf5_file.createEArray(hdf5_file.root,
                                                     'data_start_and_stop',
                                                     tables.Int32Atom(),
                                                     shape=(0, 2),
                                                     filters=filters)
data_shape_storage = hdf5_file.createEArray(hdf5_file.root,
                                            'data_shape',
                                            tables.Int32Atom(),
                                            shape=(0, 2),
                                            filters=filters)
random_state = np.random.RandomState(1999)
start = 0
stop = 0
for n in range(1000):
    shape = random_state.randint(2, 10, 2)
    length = shape[0] * shape[1]
    # fake 2D array of ints (pseudo images)
    fake_image = random_state.randint(0, 256, length)
    for i in range(len(fake_image)):
        data_storage.append(fake_image[i][None])
    stop = start + length  # Not inclusive!
    data_start_and_stop_storage.append(np.array((start, stop))[None])
    data_shape_storage.append(np.array((shape))[None])
    start = stop
hdf5_file.close()
Now that we have created some fake data, we will follow the same procedure - open the file, create a new getter, and use that getter to transform the data into the representation we want.
The key difference between this approach and the earlier one is that we need something that works with EArray. The most obvious way (once again borrowing concepts from numpy's memory layout) is to put the entire dataset into one long flat chunk, index into that chunk to slice out a flattened sample, and then reshape it based on the true shape to recover the actual values. By tracking the start and stop of each flattened sample as well as its true shape, we can achieve this.
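Before wrapping this in a getter, the bookkeeping can be checked by hand. A minimal hedged sketch (my addition) that reconstructs one sample directly from the three arrays written above:
import tables

f = tables.openFile("my_earray_variable_length_data.hdf5", mode='r')
k = 3  # arbitrary sample index
start, stop = f.root.data_start_and_stop[k]
shape = f.root.data_shape[k]
sample = f.root.data[start:stop].reshape(shape)
print(sample.shape)
f.close()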
In [20]:
hdf5_path = "my_earray_variable_length_data.hdf5"
earray_variable_hdf5_file = tables.openFile(hdf5_path, mode='r')
earray_variable_hdf5_data = earray_variable_hdf5_file.root.data
earray_variable_hdf5_start_and_stop = earray_variable_hdf5_file.root.data_start_and_stop
earray_variable_hdf5_shape = earray_variable_hdf5_file.root.data_shape
print("Length of flattened data (*not* n_samples!) %i" % len(earray_variable_hdf5_data))
print("Length of shapes (actual n_samples) %i" % len(earray_variable_hdf5_shape))
print("Length of start and stop (actual n_samples) %i" % len(earray_variable_hdf5_shape))
In [21]:
old_data_getter = earray_variable_hdf5_file.root.data.__getitem__
shape_getter = earray_variable_hdf5_file.root.data_shape.__getitem__
start_and_stop_getter = earray_variable_hdf5_file.root.data_start_and_stop.__getitem__
def getter(self, key):
    if isinstance(key, numbers.Integral) or isinstance(key, np.integer):
        if key < 0:
            key = len(earray_variable_hdf5_shape) + key
        data_start, data_stop = start_and_stop_getter(key)
        shape = shape_getter(key)
        return old_data_getter(slice(data_start, data_stop, 1)).reshape(shape)
    elif isinstance(key, slice):
        start = key.start if key.start is not None else 0
        stop = key.stop if key.stop is not None else len(earray_variable_hdf5_shape)
        step = key.step if key.step is not None else 1
        pos_keys = [len(earray_variable_hdf5_shape) + k if k < 0 else k
                    for k in range(start, stop, step)]
        starts_and_stops = [start_and_stop_getter(k) for k in pos_keys]
        shapes = [shape_getter(k) for k in pos_keys]
        return [old_data_getter(slice(dstrt, dstp, 1)).reshape(shp)
                for ((dstrt, dstp), shp) in zip(starts_and_stops, shapes)]
print("Testing single key %s" % str(getter(earray_variable_hdf5_data, -1).shape))
print("Testing slice %i" % len(getter(earray_variable_hdf5_data, slice(-20, -1, 1))))
class _my_EArray_subclass(tables.EArray):
    pass
earray_variable_hdf5_file.root.data.__class__ = _my_EArray_subclass
_my_EArray_subclass.__getitem__ = getter
_my_EArray_subclass.__len__ = earray_variable_hdf5_shape.__len__
print(earray_variable_hdf5_data[-2].shape)
print(len(earray_variable_hdf5_data))
print(len(earray_variable_hdf5_data[-20:-1]))
In [22]:
all_points = sum([shp[0] * shp[1] for shp in earray_variable_hdf5_shape])
print("Total number of datapoints %i" % all_points)
print("kB for raw storage %f" % ((all_points * 4.) / float(1E3)))
!ls -lh my_earray_variable*.hdf5
A naive calculation shows that uncompressed storage of 29960 int32 values would be (4 bytes per int32 x 29960 values) = 119.840 kB of raw storage. Looking at the actual file, we use less than 50% of that even with the extra arrays needed for start, stop, and shape. That storage saving also translates into effectively faster reads from disk - awesome! It's like buying new hardware without spending any money.
The only downside is that decompressing on the fly uses more CPU than reading raw data from disk - but in general this overhead is much smaller than the gain from reading roughly half as many bytes. We usually have far more spare CPU than spare disk throughput, though this starts to change if you have an SSD.
There is one more area we can optimize - memory usage. By reading more data into memory at a time, we can improve processing times by hitting the disk less often and doing longer sequential reads. For a spinning disk this is the best way to read a file: the bits come off in order, rather than the head seeking back and forth.
HDF5/PyTables supports some of this in-memory processing directly, and the rest can be done using getter-override tricks like the ones we have seen above. First, let's see how to load an HDF5 file entirely into memory using the H5FD_CORE driver.
In [23]:
sample_data, sample_clusters = make_random_cluster_points(10000)
hdf5_path = "my_inmemory_data.hdf5"
hdf5_file = tables.openFile(hdf5_path, mode='w')
data_storage = hdf5_file.createArray(hdf5_file.root, 'data', sample_data)
clusters_storage = hdf5_file.createArray(hdf5_file.root, 'clusters', sample_clusters)
hdf5_file.close()
hdf5_path = "my_inmemory_data.hdf5"
read_hdf5_file = tables.openFile(hdf5_path, mode='r', driver='H5FD_CORE')
hdf5_data = read_hdf5_file.root.data[:]
hdf5_clusters = read_hdf5_file.root.clusters[:]
read_hdf5_file.close()
plot_clusters(hdf5_data, hdf5_clusters, "in memory PyTables array")
Now, we would also like to be able to trade off memory usage for access speed - something in between the H5FD_CORE driver and slicing [:] from the dataset. For sequential access (like iterating minibatches for KMeans, neural networks, etc.) this can be achieved using some special getter tricks. One example of this can be seen in my library dagbldr. I will post a function from that library here for completeness, but in general the latest version (and other tricks) will be found in the library itself.
In [24]:
def add_memory_swapper(earray, mem_size):
    class _cEArray(tables.EArray):
        pass

    # Filthy hack to override getter which is a cextension...
    earray.__class__ = _cEArray

    earray._in_mem_size = int(float(mem_size))
    assert earray._in_mem_size >= 1E6  # anything smaller than 1MB is pretty worthless
    earray._in_mem_slice = np.empty([1] * len(earray.shape)).astype("float32")
    earray._in_mem_limits = [np.inf, -np.inf]

    old_getter = earray.__getitem__

    def _check_in_mem(earray, start, stop):
        lower = earray._in_mem_limits[0]
        upper = earray._in_mem_limits[1]
        if start < lower or stop > upper:
            return False
        else:
            return True

    def _load_in_mem(earray, start, stop):
        # start and stop are slice indices desired - we calculate different
        # sizes to put in memory
        n_bytes_per_entry = earray._in_mem_slice.dtype.itemsize
        n_entries = earray._in_mem_size / float(n_bytes_per_entry)
        n_samples = earray.shape[0]
        n_other = earray.shape[1:]
        n_samples_that_fit = int(n_entries / np.prod(n_other))
        assert n_samples_that_fit > 0
        # handle - index case later
        assert start >= 0
        assert stop >= 0
        assert stop >= start
        slice_size = stop - start
        if slice_size > n_samples_that_fit:
            err_str = "Slice from [%i:%i] (size %i) too large! " % (start, stop, slice_size)
            err_str += "Max slice size %i" % n_samples_that_fit
            raise ValueError(err_str)
        slice_limit = [start, stop]
        earray._in_mem_limits = slice_limit
        if earray._in_mem_slice.shape[0] == 1:
            # allocate memory
            print("Allocating %i bytes of memory for EArray swap buffer" % earray._in_mem_size)
            earray._in_mem_slice = np.empty((n_samples_that_fit,) + n_other, dtype=earray.dtype)
        # handle edge case when last chunk is smaller than what slice will
        # return
        limit = min([slice_limit[1] - slice_limit[0], n_samples - slice_limit[0]])
        earray._in_mem_slice[:limit] = old_getter(
            slice(slice_limit[0], slice_limit[1], 1))

    def getter(self, key):
        if isinstance(key, numbers.Integral) or isinstance(key, np.integer):
            start, stop, step = self._processRange(key, key, 1)
            if key < 0:
                key = start
            if _check_in_mem(self, key, key):
                lower = self._in_mem_limits[0]
            else:
                # slice into memory...
                _load_in_mem(self, key, key)
                lower = self._in_mem_limits[0]
            return self._in_mem_slice[key - lower]
        elif isinstance(key, slice):
            start, stop, step = self._processRange(key.start, key.stop, key.step)
            if _check_in_mem(self, start, stop):
                lower = self._in_mem_limits[0]
            else:
                # slice into memory...
                _load_in_mem(self, start, stop)
                lower = self._in_mem_limits[0]
            return self._in_mem_slice[start - lower:stop - lower:step]
    # This line is critical...
    _cEArray.__getitem__ = getter
    return earray
After defining this function, we can add a memory swapper onto an EArray and transparently control the amount of CPU memory used to speed up processing - giving us a memory versus processing speed trade-off, one of the most common trade-offs in programming. Depending on your hardware, anything from "only a slice at a time" (as in regular HDF5) to "all in memory" (by passing a large value to add_memory_swapper, or by using the H5FD_CORE driver when opening the file) is available to trade through.
In [25]:
hdf5_path = "my_memory_extendable_compressed_data.hdf5"
hdf5_file = tables.openFile(hdf5_path, mode='w')
filters = tables.Filters(complevel=5, complib='blosc')
data_storage = hdf5_file.createEArray(hdf5_file.root, 'data',
                                      tables.Atom.from_dtype(sample_data.dtype),
                                      shape=(0, sample_data.shape[-1]),
                                      filters=filters,
                                      expectedrows=len(sample_data))
clusters_storage = hdf5_file.createEArray(hdf5_file.root, 'clusters',
                                          tables.Atom.from_dtype(sample_clusters.dtype),
                                          shape=(0,),
                                          filters=filters,
                                          expectedrows=len(sample_clusters))
for n, (d, c) in enumerate(zip(sample_data, sample_clusters)):
    data_storage.append(sample_data[n][None])
    clusters_storage.append(sample_clusters[n][None])
hdf5_file.close()
hdf5_path = "my_memory_extendable_compressed_data.hdf5"
memory_extendable_hdf5_file = tables.openFile(hdf5_path, mode='r')
memory_extendable_hdf5_data = add_memory_swapper(memory_extendable_hdf5_file.root.data, 10E6)
memory_extendable_hdf5_clusters = memory_extendable_hdf5_file.root.clusters
plot_clusters(memory_extendable_hdf5_file.root.data[10:100],
              memory_extendable_hdf5_file.root.clusters[10:100], "EArray subset")
After reading the data, we can look at the memory limits used internally by the memory swapper. After taking different slices we see the _in_mem_limits move - this means the swapper is reading new data from disk into memory. However, taking a slice which is already in memory will not trigger another copy from disk.
In [26]:
print("Current memory limits %s" % str(memory_extendable_hdf5_data._in_mem_limits))
memory_extendable_hdf5_data[-100:]
print("Moved memory limits %s" % str(memory_extendable_hdf5_data._in_mem_limits))
memory_extendable_hdf5_data[-5:]
print("Unchanged memory limits %s" % str(memory_extendable_hdf5_data._in_mem_limits))
memory_extendable_hdf5_file.close()
We can see that using HDF5 with PyTables makes almost any operation possible, even on low-memory hardware. By playing some advanced tricks with __getitem__ it is possible to create containers which compress lots of data into small storage, giving more effective disk throughput for the system. We can also use these same tricks to trade memory for speed with uniform 2D arrays, with easy extension to arbitrary nd-array processing.