Reading and writing pandas objects with HDF5
In [ ]:
import numpy as np
import pandas as pd
In [ ]:
FILE = "test.h5"
In [ ]:
!rm -f test.h5  # -f: no error if the file does not exist yet
HDFStore is a dict-like object that reads and writes pandas objects in the high-performance HDF5 format via the PyTables library.
In [ ]:
s = pd.Series([3.14, 2.72, np.nan], index=['pi', 'e', 'nan'])
s
In [ ]:
df = pd.DataFrame(np.array([[3, 1, 4],[2, 7, np.nan]]).T,
index=pd.date_range('1/1/2000', periods=3),
columns=['A', 'B'])
df
In [ ]:
store = pd.HDFStore(FILE)
Objects can be written to the file just like adding key-value pairs to a dict:
In [ ]:
store['df'] = df # the equivalent of: store.put('df', df)
store['series'] = s # the equivalent of: store.put('series', s)
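Because the store is dict-like, the other mapping-style operations work as well; a minimal sketch using a throwaway file (so it does not touch test.h5 or the store opened above):

```python
import os
import tempfile

import pandas as pd

# a separate temporary file, so this sketch leaves the open store alone
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with pd.HDFStore(demo_path) as demo:
    demo["df"] = pd.DataFrame({"A": [1.0, 2.0]})
    print("df" in demo)   # membership test works like a dict
    print(demo.keys())    # keys are stored as '/'-prefixed paths, e.g. '/df'
    del demo["df"]        # the equivalent of: demo.remove('df')
    print(len(demo))      # number of objects left in the store
```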
Closing a Store:
In [ ]:
store.close()
In [ ]:
del df
del s
del store
In [ ]:
with pd.HDFStore(FILE) as store:
print(store.keys())
df = store['df'] # the equivalent of: store.get('df')
s = store['series'] # the equivalent of: store.get('series')
In [ ]:
df
In [ ]:
s
HDFStore also supports a top-level API: read_hdf for reading and to_hdf for writing.
Documentation: https://pandas.pydata.org/pandas-docs/stable/io.html#id2
In [ ]:
!rm -f test.h5
In [ ]:
s = pd.Series([3.14, 2.72, np.nan], index=['pi', 'e', 'nan'])
s
In [ ]:
df = pd.DataFrame(np.array([[3, 1, 4],[2, 7, np.nan]]).T,
index=pd.date_range('1/1/2000', periods=3),
columns=['A', 'B'])
df
In [ ]:
s.to_hdf(FILE, key='series')
df.to_hdf(FILE, key='df')
Useful parameters:
format : 'fixed(f)|table(t)', default 'fixed'
fixed(f) : Fixed format
Fast writing/reading. Not appendable, not searchable
table(t) : Table format
Write as a PyTables Table structure, which may perform
worse but allows more flexible operations like searching
/ selecting subsets of the data
append : boolean, default False
For Table formats, append the input data to the existing
data_columns : list of columns, or True, default None
List of columns to create as indexed data columns for on-disk
queries, or True to use all columns. By default only the axes
of the object are indexed. See `here
<http://pandas.pydata.org/pandas-docs/stable/io.html#query-via-data-columns>`__.
Applicable only to format='table'.
complevel : int, 0-9, default None
Specifies a compression level for data.
A value of 0 disables compression.
complib : {'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib'
Specifies the compression library to be used.
As of v0.20.2 these additional compressors for Blosc are supported
(default if no compressor specified: 'blosc:blosclz'):
{'blosc:blosclz', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy',
'blosc:zlib', 'blosc:zstd'}.
Specifying a compression library which is not available raises
a ValueError.
fletcher32 : bool, default False
If applying compression, use the fletcher32 checksum.
dropna : boolean, default False
If True, rows that are entirely NaN are not written to the store.
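A short sketch of the table format in practice (file name, column names, and values below are illustrative, not from the cells above): data_columns makes a column queryable on disk via the where argument, and append=True grows the stored table.

```python
import os
import tempfile

import numpy as np
import pandas as pd

tbl_path = os.path.join(tempfile.mkdtemp(), "table_demo.h5")

tdf = pd.DataFrame({"A": np.arange(10.0), "B": np.arange(10.0) * 2},
                   index=pd.date_range("1/1/2000", periods=10))

# format='table' writes a searchable, appendable PyTables Table;
# data_columns=['A'] indexes column A for on-disk queries
tdf.to_hdf(tbl_path, key="df", format="table", data_columns=["A"])

# read back only the rows matching the query, not the whole frame
subset = pd.read_hdf(tbl_path, key="df", where="A > 6")
print(len(subset))   # 3 rows: A == 7, 8, 9

# append more rows to the existing table on disk (schemas must match)
extra = pd.DataFrame({"A": [100.0], "B": [200.0]},
                     index=pd.date_range("1/11/2000", periods=1))
extra.to_hdf(tbl_path, key="df", format="table", append=True)
print(len(pd.read_hdf(tbl_path, key="df")))   # 11
```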
In [ ]:
del df
del s
In [ ]:
s = pd.read_hdf(FILE, key='series') # the `key` param can be omitted if the HDF file contains a single pandas object
s
In [ ]:
df = pd.read_hdf(FILE, key='df') # the `key` param can be omitted if the HDF file contains a single pandas object
df
In [ ]:
!rm -f test.h5
In [ ]:
a = np.random.randint(10, size=(1000, 1000))
df = pd.DataFrame(a)
del a
In [ ]:
df.to_hdf(FILE, key='df')
In [ ]:
!ls -lh test.h5
In [ ]:
df.to_hdf(FILE,
key='df',
complevel=9, # 0-9, default None; 0 disables compression
complib='zlib') # 'zlib', 'lzo', 'bzip2', 'blosc' (plus the 'blosc:*' variants); default 'zlib'
In [ ]:
!ls -lh test.h5
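The same size comparison can be done programmatically with os.path.getsize; a sketch using throwaway files (the file and variable names are arbitrary):

```python
import os
import tempfile

import numpy as np
import pandas as pd

tmp = tempfile.mkdtemp()
raw_path = os.path.join(tmp, "raw.h5")
packed_path = os.path.join(tmp, "packed.h5")

data = pd.DataFrame(np.random.randint(10, size=(1000, 1000)))

data.to_hdf(raw_path, key="df")                                 # no compression
data.to_hdf(packed_path, key="df", complevel=9, complib="zlib") # zlib, max level

# low-entropy integers (0-9 stored as int64) compress well,
# so the packed file should be much smaller
print(f"uncompressed: {os.path.getsize(raw_path):,} bytes")
print(f"zlib level 9: {os.path.getsize(packed_path):,} bytes")
```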
In [ ]:
df = pd.read_hdf(FILE, key='df') # the `key` param can be omitted if the HDF file contains a single pandas object
df.memory_usage().sum()
In [ ]:
!rm -f test.h5