PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data. It is built on top of the HDF5 library, using the Python language and the NumPy package. PyTables is an alternative to h5py, which was discussed before.
Features it has to offer:
structure your data in a hierarchical fashion (like in folders/directories) and add user-specific metadata to each group/node
stores tables of typed columns as well as NumPy arrays
optimized for I/O speed
apply out-of-core (out of memory) mathematical operations on the data
file based
mainly meant for reading; be careful when writing
well integrated within the Python ecosystem (NumPy, pandas, ...)
is not a replacement for a relational SQL database
A graphical interface to PyTables data sets exists at ViTables (which also supports HDF5 files).
Goal: group GPS data by date and store each day as a separate table with PyTables.
In [9]:
from datetime import datetime
import tables as tbl
import pandas as pd
import numpy as np
Defining the structure of our PyTables file
In [10]:
class TableDefs:

    class data(tbl.IsDescription):
        Time = tbl.Int64Col()
        Latitude_N = tbl.Float32Col()
        Longitude_E = tbl.Float32Col()
        orientation = tbl.Float16Col()
        memo = tbl.StringCol(16)

    class statistics(tbl.IsDescription):
        memo = tbl.StringCol(16)
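An `IsDescription` subclass like the one above maps each attribute to a typed, fixed-size column. The same layout can be expressed as a NumPy structured dtype (which PyTables also accepts as a table description); a minimal sketch, with the column order and sizes assumed to match the class above:

```python
import numpy as np

# Structured dtype roughly equivalent to TableDefs.data above.
# (float16 support can depend on your HDF5/PyTables versions.)
gps_dtype = np.dtype([
    ('Time', np.int64),          # 8 bytes
    ('Latitude_N', np.float32),  # 4 bytes
    ('Longitude_E', np.float32), # 4 bytes
    ('orientation', np.float16), # 2 bytes
    ('memo', 'S16'),             # 16-byte string, like tbl.StringCol(16)
])

# one empty record with this layout
row = np.zeros(1, dtype=gps_dtype)
print(gps_dtype.itemsize, row.dtype.names)
```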
Creating, opening and closing an HDF5 PyTables file
In [11]:
def get_db(fpath):
    # if the database does not exist, create one
    try:
        h5f = tbl.openFile(fpath, mode="r+", title="gps_db")
        print 'h5file opened:', fpath
    except IOError:
        # the file does not exist yet: create it with the correct layout
        # (note that mode "w" would overwrite an existing file)
        h5f = tbl.openFile(fpath, mode="w", title="gps_db")
        # lay out the global structure
        data_group = h5f.createGroup("/", 'data', 'GPS Data')
        # besides the data as a NumPy array, we will also save it as a table
        descr = 'GPS data imported from CSV file'
        table = h5f.createTable(data_group, 'csv', TableDefs.data, descr)
        table.flush()
        # subgroup: data sorted per month
        h5f.createGroup("/data", 'months', 'Months')
        # a table that could hold statistical data per month
        descr = 'Monthly statistics on GPS data'
        table = h5f.createTable(data_group, 'statistics', TableDefs.statistics, descr)
        # flushing makes sure the (still empty) tables are written to disk
        table.flush()
        print 'empty h5file created:', fpath
    return h5f
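Recent PyTables releases spell the API with underscores (`open_file`, `create_group`, ...) instead of the camelCase names used above. The open-or-create pattern can be sketched with the modern names as follows (the file path and group name here are illustrative):

```python
import os
import tempfile

import tables as tbl


def open_or_create(fpath, title='gps_db'):
    # mode "r+" fails with IOError if the file does not exist;
    # fall back to creating it with the initial layout
    try:
        return tbl.open_file(fpath, mode='r+', title=title)
    except IOError:
        h5f = tbl.open_file(fpath, mode='w', title=title)
        h5f.create_group('/', 'data', 'GPS Data')
        return h5f


path = os.path.join(tempfile.mkdtemp(), 'demo.h5')

h5f = open_or_create(path)        # first call: creates the file
created = '/data' in h5f
h5f.close()

h5f = open_or_create(path)        # second call: re-opens the same file
reopened = '/data' in h5f
h5f.close()
print(created, reopened)
```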
Get-or-create helpers: how to build in some robustness against missing nodes
In [12]:
def get_node(h5f, node, descr=''):
    """
    Return the group node, creating it first if it does not exist yet.
    """
    try:
        group_node = h5f.getNode('/' + node)
    except tbl.exceptions.NoSuchNodeError:
        group_node = h5f.createGroup("/", node, descr)
    except Exception as exception:
        # close the file before re-raising so it is not left open
        h5f.close()
        raise exception
    return group_node
In [13]:
def get_table(group, table_name, descr=''):
    try:
        table = getattr(group, table_name)
    except tbl.exceptions.NoSuchNodeError:
        # _v_file points back to the file object the group belongs to
        table = group._v_file.createTable(group, table_name, TableDefs.data, descr)
        table.flush()
    except Exception as exception:
        group._v_file.close()
        raise exception
    return table
In [14]:
h5f = get_db('gps_db')
data_node = get_node(h5f, 'data')
Let's save the GPS data as found in the CSV file. First convert the datetime object to a Unix time stamp. There are two common ways to do that, and one of them is wrong:
date = '20131211 18:29:30'
dt = datetime.strptime(date, "%Y%m%d %H:%M:%S")
print dt.strftime('%s') # wrong result
import calendar
print calendar.timegm(dt.utctimetuple())
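The difference is the timezone: `strftime('%s')` (a platform-specific extension) interprets the naive datetime in the machine's *local* timezone, while `calendar.timegm` treats the time tuple as UTC and therefore gives a reproducible result. A minimal standard-library sketch:

```python
import calendar
import time
from datetime import datetime

date = '20131211 18:29:30'
dt = datetime.strptime(date, "%Y%m%d %H:%M:%S")

# timegm interprets the time tuple as UTC -> unambiguous result
utc_stamp = calendar.timegm(dt.utctimetuple())
print(utc_stamp)

# mktime (which is roughly what '%s' does) interprets the same tuple as
# local time, so the result shifts with the machine's timezone setting
local_stamp = int(time.mktime(dt.timetuple()))
print(local_stamp - utc_stamp)  # the machine's UTC offset in seconds
```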
Use the example from the pandas tutorial with the user-defined date parser; the dates will be converted to Unix time stamps when they are stored in the table.
In [15]:
def date_parser(date):
    # create a datetime instance, assuming constant formatting
    # format: 20131211 18:29:30
    return datetime(int(date[:4]), int(date[4:6]), int(date[6:8]),
                    int(date[9:11]), int(date[12:14]), int(date[15:17]))

df = pd.read_csv('gps.csv', header=0, skiprows=[1], parse_dates=[0],
                 date_parser=date_parser)
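The parser relies on the fixed-width format, so it can be checked in isolation on a sample string (the same parser as above, repeated here so the snippet is self-contained):

```python
from datetime import datetime


def date_parser(date):
    # assumes the constant format '20131211 18:29:30'
    return datetime(int(date[:4]), int(date[4:6]), int(date[6:8]),
                    int(date[9:11]), int(date[12:14]), int(date[15:17]))


dt = date_parser('20131211 18:29:30')
print(dt.isoformat())
```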
In [16]:
#np.genfromtxt('gps.csv', converters={0: lambda d: float(d or 0)})
In [17]:
data = df.values[:,1:]
# dummy data of the same shape, only for testing purposes:
#data = np.ones((10,1000))
In [18]:
h5f.createArray(data_node, 'gpsdata', data, title='gps data')
h5f.close()
Add all the data to the csv table
In [19]:
# the file was closed in the previous cell, so re-open it first
h5f = get_db('gps_db')
csv_table = get_table(h5f.root.data, 'csv')
In [20]:
import calendar

tablerow = csv_table.row
for dfrow in df.values:
    # convert the pandas Timestamp to a Unix time stamp (see above)
    tablerow['Time'] = calendar.timegm(dfrow[0].utctimetuple())
    tablerow['Latitude_N'] = float(dfrow[1])
    tablerow['Longitude_E'] = float(dfrow[2])
    tablerow['orientation'] = float(dfrow[3])
    tablerow.append()
csv_table.flush()
h5f.close()
Example: store a DataFrame in PyTables
Benchmark idea: how fast is storing with the HDFStore compared to native PyTables and the other examples we have seen?
Assume you have a huge CSV file that is too big to load into memory. Store it safely in a PyTables table.
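One way to do this is pandas' `HDFStore`, which is itself backed by PyTables: read the CSV in chunks with `read_csv(..., chunksize=...)` and append each chunk to the store, so only one chunk is ever in memory. A sketch under the assumption that the columns match our GPS data (the file names and the small generated CSV here are purely illustrative stand-ins for the "huge" file):

```python
import os
import tempfile

import numpy as np
import pandas as pd

tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, 'gps_demo.csv')
h5_path = os.path.join(tmpdir, 'gps_demo.h5')

# build a small CSV to stand in for the huge file (illustrative only)
demo = pd.DataFrame({
    'Latitude_N': np.random.rand(100).astype('float32'),
    'Longitude_E': np.random.rand(100).astype('float32'),
    'orientation': np.random.rand(100).astype('float32'),
})
demo.to_csv(csv_path, index=False)

# read the CSV chunk by chunk and append each chunk to a
# PyTables-backed store, keeping memory use bounded
with pd.HDFStore(h5_path, mode='w') as store:
    for chunk in pd.read_csv(csv_path, chunksize=25):
        store.append('gps', chunk, index=False)

# read the full table back to check the round trip
with pd.HDFStore(h5_path, mode='r') as store:
    out = store['gps']
print(len(out))
```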