HDF5 files with Pandas

TODO

  • ...

In [ ]:
# import python packages here...

In [ ]:
import numpy as np
import pandas as pd

In [ ]:
FILE = "test.h5"

In [ ]:
!rm -f test.h5  # remove any leftover file from a previous run

Read/write HDF5 files using the HDFStore object API

HDFStore is a dict-like object that reads and writes pandas objects to the high-performance HDF5 format using the PyTables library.

Make data to save/load


In [ ]:
s = pd.Series([3.14, 2.72, np.nan], index=['pi', 'e', 'nan'])
s

In [ ]:
df = pd.DataFrame(np.array([[3, 1, 4],[2, 7, np.nan]]).T,
                  index=pd.date_range('1/1/2000', periods=3),
                  columns=['A', 'B'])
df

Write to HDF5 file


In [ ]:
store = pd.HDFStore(FILE)

Objects can be written to the file just like adding key-value pairs to a dict:


In [ ]:
store['df'] = df      # the equivalent of: store.put('df', df)
store['series'] = s   # the equivalent of: store.put('series', s)

Closing a Store:


In [ ]:
store.close()

In [ ]:
del df
del s
del store

Read from HDF5 file


In [ ]:
with pd.HDFStore(FILE) as store:
    print(store.keys())
    df = store['df']      # the equivalent of: store.get('df')
    s = store['series']   # the equivalent of: store.get('series')

In [ ]:
df

In [ ]:
s

Read/write HDF5 files using the to_hdf()/read_hdf() top-level API

pandas also provides a top-level API: read_hdf() for reading and to_hdf() for writing.
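As a side note, to_hdf() accepts a mode parameter, so the shell-level rm used in this notebook can be avoided by truncating the file on the first write. A minimal sketch (the file path and keys below are illustrative):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# illustrative path in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, np.nan]})

df.to_hdf(path, key="df", mode="w")   # mode='w' truncates any existing file
df.to_hdf(path, key="df2")            # default mode='a' adds a new key to the same file

with pd.HDFStore(path) as store:
    keys = store.keys()               # ['/df', '/df2']
```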

Make data to save/load


In [ ]:
!rm -f test.h5  # remove any leftover file from a previous run

In [ ]:
s = pd.Series([3.14, 2.72, np.nan], index=['pi', 'e', 'nan'])
s

In [ ]:
df = pd.DataFrame(np.array([[3, 1, 4],[2, 7, np.nan]]).T,
                  index=pd.date_range('1/1/2000', periods=3),
                  columns=['A', 'B'])
df

Write data to an HDF5 file


In [ ]:
s.to_hdf(FILE, key='series')
df.to_hdf(FILE, key='df')

Useful parameters:

format : 'fixed(f)|table(t)', default is 'fixed'
    fixed(f) : Fixed format
               Fast writing/reading. Not-appendable, nor searchable
    table(t) : Table format
               Write as a PyTables Table structure which may perform
               worse but allow more flexible operations like searching
               / selecting subsets of the data
append : boolean, default False
    For Table formats, append the input data to the existing
data_columns :  list of columns, or True, default None
    List of columns to create as indexed data columns for on-disk
    queries, or True to use all columns. By default only the axes
    of the object are indexed. See `here
    <http://pandas.pydata.org/pandas-docs/stable/io.html#query-via-data-columns>`__.
    Applicable only to format='table'.
complevel : int, 0-9, default None
    Specifies a compression level for data.
    A value of 0 disables compression.
complib : {'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib'
    Specifies the compression library to be used.
    As of v0.20.2 these additional compressors for Blosc are supported
    (default if no compressor specified: 'blosc:blosclz'):
    {'blosc:blosclz', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy',
    'blosc:zlib', 'blosc:zstd'}.
    Specifying a compression library which is not available issues
    a ValueError.
fletcher32 : bool, default False
    If applying compression use the fletcher32 checksum
dropna : boolean, default False.
    If true, ALL nan rows will not be written to store.
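A hedged sketch of the table format described above: writing with format='table' and data_columns=True makes the dataset appendable and queryable on disk with a where expression (requires PyTables; the path and column names below are illustrative):

```python
import os
import tempfile

import pandas as pd

# illustrative path in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "table_demo.h5")

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({"A": [4, 5], "B": [40.0, 50.0]})

# 'table' format: slower than 'fixed', but appendable and searchable
df1.to_hdf(path, key="df", format="table", data_columns=True)
df2.to_hdf(path, key="df", format="table", append=True)

# on-disk selection: only the matching rows are read into memory
subset = pd.read_hdf(path, key="df", where="A > 3")
```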

In [ ]:
del df
del s

Read data from an HDF5 file


In [ ]:
s = pd.read_hdf(FILE, key='series')  # the `key` param can be omitted if the HDF file contains a single pandas object
s

In [ ]:
df = pd.read_hdf(FILE, key='df')  # the `key` param can be omitted if the HDF file contains a single pandas object
df

In [ ]:
!rm -f test.h5  # remove any leftover file from a previous run

Read/write a compressed HDF5 file


In [ ]:
a = np.random.randint(10, size=(1000, 1000))
df = pd.DataFrame(a)
del a

In [ ]:
df.to_hdf(FILE, key='df')

In [ ]:
!ls -lh test.h5

In [ ]:
df.to_hdf(FILE,
          key='df',
          complevel=9,      # compression level 0-9 (0 disables compression, default None)
          complib='zlib')   # 'zlib' (default), 'lzo', 'bzip2', 'blosc', or a 'blosc:*' variant

In [ ]:
!ls -lh test.h5

In [ ]:
df = pd.read_hdf(FILE, key='df')  # the `key` param can be omitted if the HDF file contains a single pandas object
df.memory_usage().sum()
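The shell listings above can also be checked from Python. A minimal sketch using os.path.getsize to compare the uncompressed and compressed files (the file names are illustrative, and the exact ratio depends on the data):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# illustrative paths in a fresh temporary directory
tmpdir = tempfile.mkdtemp()
raw = os.path.join(tmpdir, "raw.h5")
packed = os.path.join(tmpdir, "packed.h5")

df = pd.DataFrame(np.random.randint(10, size=(1000, 1000)))

df.to_hdf(raw, key="df")                                  # no compression
df.to_hdf(packed, key="df", complevel=9, complib="zlib")  # zlib, max level

# small integers are highly redundant, so the ratio should be well below 1
ratio = os.path.getsize(packed) / os.path.getsize(raw)
```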

In [ ]:
!rm -f test.h5  # clean up