PyTables

PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data. It is built on top of the HDF5 library, using the Python language and the NumPy package. PyTables is an alternative to h5py, which was discussed earlier.

Features it has to offer:

  • structure your data in a hierarchical fashion (like folders/directories) and add user-specific metadata to each group/node

  • store tables or NumPy arrays, or a table of NumPy arrays

  • optimized for I/O speed

  • apply out-of-memory mathematical operations on the data (see the sketch after this list)

  • file based

  • mainly meant for reading, be careful when writing

  • well integrated within the Python ecosystem (NumPy, pandas, ...)

  • is not a replacement for a relational SQL database
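
As a taste of the out-of-memory math mentioned in the list above, here is a minimal sketch using tables.Expr. It assumes a file gps_db with a large array node /data/gpsdata, like the one created later in this notebook; the expression 2*x + 1 and the output node name gpsdata_scaled are made up for illustration.

import tables as tbl

h5f = tbl.openFile('gps_db', mode='a')
x = h5f.root.data.gpsdata          # disk-based array, never loaded as a whole

# the expression is evaluated blockwise through numexpr
expr = tbl.Expr('2 * x + 1')

# direct the result to a disk-based array as well, so neither the input
# nor the output has to fit into memory
out = h5f.createCArray('/data', 'gpsdata_scaled', tbl.Float64Atom(), x.shape)
expr.setOutput(out)
expr.eval()

h5f.close()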

Home page · GitHub dev pages · Documentation · Tutorial

A graphical interface to PyTables data sets exists at ViTables (which also supports HDF5 files).

Goal: group the GPS data by date and store each day as a table with PyTables.


In [9]:
from datetime import datetime
import tables as tbl
import pandas as pd
import numpy as np

Defining the structure of our PyTables file


In [10]:
class TableDefs:
    class data(tbl.IsDescription):
        Time        = tbl.Int64Col()      # unix time stamp
        Latitude_N  = tbl.Float32Col()    # latitude in degrees north
        Longitude_E = tbl.Float32Col()    # longitude in degrees east
        orientation = tbl.Float16Col()
        memo        = tbl.StringCol(16)   # 16-character string
    class statistics(tbl.IsDescription):
        memo        = tbl.StringCol(16)

Creating, opening and closing an HDF5 PyTables file


In [11]:
def get_db(fpath):
    # if the database does not exist, create one
    try:
        #h5f = open(src).close()
        h5f = tbl.openFile(fpath, mode = "r+", title = "gps_db")
        print 'h5file opened:', fpath
    except IOError:
        # if the file does not exist, create it with the correct columns
        # note that the mode w will delete any existing one
        h5f = tbl.openFile(fpath, mode = "w", title = "gps_db")
        
        # layout global structure
        descr = 'GPS Data'
        data_group = h5f.createGroup("/", 'data', descr)
        
        # besides the data as a numpy array, we will also save it as a table
        descr = 'GPS data imported from CSV file'
        table = h5f.createTable(data_group, 'csv', TableDefs.data, descr)
        table.flush()
        
        # subgroup: data sorted per month
        descr = 'Months'
        h5f.createGroup("/data", 'months', descr)
        
        # A table that could hold statistical data per month
        descr = 'Monthly Statistics on GPS Data'
        table = h5f.createTable(data_group, 'statistics', TableDefs.statistics, descr)
        table.flush()
        
        # The file handle is returned open; the caller is responsible for closing it.
        
        print 'empty h5file created:', fpath
    return h5f

How to build in some concurrency: only create groups and tables when they do not exist yet


In [12]:
def get_node(h5f, node, descr=''):
    """
    open in write mode with some concurrency build in
    """
    try:
        group_node = h5f.getNode('/'+node)
    except tbl.exceptions.NoSuchNodeError:
        group_node = h5f.createGroup("/", node, descr)
    except Exception as exception:
        h5f.close()
        raise exception
    return group_node

In [13]:
def get_table(group, table_name, descr=''):
    try:
        table = getattr(group, table_name)
    except tbl.exceptions.NoSuchNodeError:
        # the group knows which file it lives in, so no global h5f is needed
        table = group._v_file.createTable(group, table_name, TableDefs.data, descr)
        table.flush()
    except Exception as exception:
        group._v_file.close()
        raise exception
    return table

In [14]:
h5f = get_db('gps_db')
data_node = get_node(h5f, 'data')


h5file opened: gps_db

Let's save the GPS data as found in the CSV file. Convert the datetime object to a unix time stamp. There are two ways to do that, and one of them is wrong:

date = '20131211 18:29:30'
dt = datetime.strptime(date, "%Y%m%d %H:%M:%S")
print dt.strftime('%s') # wrong result: '%s' is platform dependent and uses the local timezone
import calendar
print calendar.timegm(dt.utctimetuple())

Use the example from the pandas tutorial with the user-defined date parser; the conversion to a unix time stamp is done later, when the rows are written to the table.


In [15]:
def date_parser(date):
    # create a datetime instance and assume constant formatting
    # format: 20131211 18:29:30
    return datetime(int(date[:4]), int(date[4:6]), int(date[6:8]),
                    int(date[9:11]), int(date[12:14]), int(date[15:17]))
df = pd.read_csv('gps.csv', header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser)

In [16]:
#np.genfromtxt('gps.csv', converters={0: lambda d: float(s or 0)})

In [17]:
data = df.values[:,1:]
# NOTE: the line below discards the parsed GPS values and replaces them with
# a dummy 10x1000 array of ones, presumably just to have something to store
data = np.ones((10,1000))

In [18]:
h5f.createArray(data_node, 'gpsdata', data, title='gps data')
# keep the file open: it is still needed for filling the csv table below
h5f.flush()

Add all the data to the csv table


In [19]:
#table = h5f.root.data.csv
csv_table = get_table(h5f.root.data, 'csv')

In [20]:
import calendar

tablerow = csv_table.row
for dfrow in df.values:
    # the Time column is an Int64Col, so store the datetime as a unix time stamp
    tablerow['Time'] = calendar.timegm(dfrow[0].utctimetuple())
    tablerow['Latitude_N'] = float(dfrow[1])
    tablerow['Longitude_E'] = float(dfrow[2])
    tablerow['orientation'] = float(dfrow[3])
    tablerow.append()
csv_table.flush()
h5f.close()
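
To check what ended up in the file, the table can be read back after reopening; a short sketch (the latitude threshold of 52.0 is just an arbitrary example value):

h5f = tbl.openFile('gps_db', mode='r')
csv_table = h5f.root.data.csv

# read the complete table back as a NumPy structured array
rows = csv_table.read()
print rows[:5]

# or run an in-kernel query that is evaluated on disk, not in Python
north = csv_table.readWhere('Latitude_N > 52.0')
print len(north)

h5f.close()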

Example: store DataFrame in PyTables
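
A minimal sketch of what this could look like, using the pandas HDFStore (which is built on top of PyTables); the file name gps_store.h5 and the key 'gps' are made up for illustration, and df is the DataFrame parsed above.

store = pd.HDFStore('gps_store.h5')
store['gps'] = df          # write the DataFrame as a group of PyTables nodes
df_back = store['gps']     # read it back into a DataFrame
store.close()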

Exercise:

  • benchmark: how fast is storing with the pandas HDFStore compared to native PyTables and the other examples we have seen?

  • Assume you have a huge CSV file that is too big to load into memory. Store it safely in a PyTables table (one possible approach is sketched below).
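
For the second exercise, one possible approach, sketched under the assumption of a hypothetical file huge_gps.csv with the same columns as gps.csv, and reusing date_parser and get_table from above: read the CSV in chunks with pandas, so only one chunk is in memory at a time, and append each chunk to the PyTables table.

import calendar

h5f = tbl.openFile('gps_db', mode='r+', title='gps_db')
table = get_table(h5f.root.data, 'csv')

# chunksize makes read_csv return an iterator over DataFrame chunks
reader = pd.read_csv('huge_gps.csv', header=0, skiprows=[1],
                     parse_dates=[0], date_parser=date_parser,
                     chunksize=100000)
row = table.row
for chunk in reader:
    for dfrow in chunk.values:
        row['Time'] = calendar.timegm(dfrow[0].utctimetuple())
        row['Latitude_N'] = float(dfrow[1])
        row['Longitude_E'] = float(dfrow[2])
        row['orientation'] = float(dfrow[3])
        row.append()
    table.flush()      # flush after every chunk so rows are written to disk
h5f.close()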


In [21]:
date = '20131211 18:29:30'
dt = datetime.strptime(date, "%Y%m%d %H:%M:%S")
print dt.strftime('%s')
import calendar
print calendar.timegm(dt.utctimetuple())


1386782970
1386786570