PyTables

PyTables is a package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data. It is built on top of the HDF5 library, using the Python language and the NumPy package. PyTables is an alternative to h5py, which was discussed earlier.

Features it has to offer:

  • structure your data in a hierarchical fashion (like folders/directories) and attach user-specific metadata to each group/node

  • stores tables and NumPy arrays, or even tables of NumPy arrays

  • optimized for I/O speed

  • apply out-of-memory mathematical operations to the data (see the sketch below this list)

  • file based

  • mainly meant for reading; be careful when writing

  • well integrated within the Python ecosystem (NumPy, pandas, ...)

  • is not a replacement for a relational SQL database
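
As a small illustration of the out-of-memory operations mentioned in the list, PyTables can evaluate expressions blockwise on disk-based arrays with tables.Expr. A minimal sketch, assuming the PyTables 2.x API used throughout this notebook (the file name and array contents are made up):

import numpy as np
import tables as tbl

h5f = tbl.openFile('expr_demo.h5', mode='w')
x = h5f.createArray('/', 'x', np.arange(1e6))
y = h5f.createArray('/', 'y', np.arange(1e6))

# the expression is evaluated block by block, so neither the inputs nor the
# result have to fit into memory all at once
expr = tbl.Expr('3*x + y**2', uservars={'x': x, 'y': y})
out = h5f.createCArray('/', 'out', tbl.Float64Atom(), x.shape)
expr.setOutput(out)
expr.eval()
h5f.close()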

Home page GitHub dev pages Documentation Tutorial

A graphical interface to PyTables data sets is available in ViTables (which also supports plain HDF5 files).

Goal: group the GPS data by date and store each day as a separate table with PyTables.


In [1]:
from datetime import datetime
import tables as tbl
import pandas as pd
import numpy as np

Defining the structure of our PyTables file


In [2]:
class TableDefs:
    """Column layouts for the tables in the file."""
    class data(tbl.IsDescription):
        Time        = tbl.Int64Col()    # Unix time stamp
        Latitude_N  = tbl.Float32Col()
        Longitude_E = tbl.Float32Col()
        orientation = tbl.Float16Col()
        memo        = tbl.StringCol(16) # strings longer than 16 bytes are truncated
    class statistics(tbl.IsDescription):
        memo        = tbl.StringCol(16)

Creating, opening and closing an HDF5 PyTables file


In [3]:
def get_db(fpath):
    # if the database already exists, open it; otherwise create it
    try:
        h5f = tbl.openFile(fpath, mode = "r+", title = "gps_db")
        print 'h5file opened:', fpath
    except IOError:
        # the file does not exist yet, so create it with the correct layout
        # note that mode "w" would overwrite any existing file
        h5f = tbl.openFile(fpath, mode = "w", title = "gps_db")

        # lay out the global structure
        descr = 'GPS Data'
        data_group = h5f.createGroup("/", 'data', descr)

        # besides the data as a NumPy array, we will also save it as a table
        descr = 'GPS data imported from CSV file'
        table = h5f.createTable(data_group, 'csv', TableDefs.data, descr)
        table.flush()

        # subgroup: data sorted per month
        descr = 'Months'
        h5f.createGroup("/data", 'months', descr)

        # a table that could hold statistical data per month
        descr = 'Monthly Statistics on GPS Data'
        table = h5f.createTable(data_group, 'statistics', TableDefs.statistics, descr)
        table.flush()

        print 'empty h5file created:', fpath
    return h5f
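
Relying on the IOError raised by openFile works for PyTables 2.x, but an explicit existence check reads more clearly. A hypothetical variant (group and table creation omitted for brevity):

import os

def get_db_alt(fpath):
    # open in append mode if the file exists, otherwise create it
    mode = "r+" if os.path.isfile(fpath) else "w"
    return tbl.openFile(fpath, mode=mode, title="gps_db")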

How to build in some concurrency


In [4]:
def get_node(h5f, node, descr=''):
    """
    Return the group node, creating it first if it does not exist yet.
    This makes the cell safe to re-run.
    """
    try:
        group_node = h5f.getNode('/'+node)
    except tbl.exceptions.NoSuchNodeError:
        group_node = h5f.createGroup("/", node, descr)
    except Exception as exception:
        # close the file before re-raising, so it is not left open
        h5f.close()
        raise exception
    return group_node

In [5]:
def get_table(group, table_name, descr=''):
    """Return the table in group, creating it first if it does not exist yet."""
    h5f = group._v_file    # the file the group belongs to
    try:
        table = getattr(group, table_name)
    except tbl.exceptions.NoSuchNodeError:
        table = h5f.createTable(group, table_name, TableDefs.data, descr)
        table.flush()
    except Exception as exception:
        h5f.close()
        raise exception
    return table
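
With these helpers the cells below can be re-run without errors: on the first run the nodes are created, on later runs the existing ones are returned. A quick usage sketch, reusing the file name from above:

h5f = get_db('gps_db.hdf5')
data_node = get_node(h5f, 'data')          # created on the first run only
csv_table = get_table(data_node, 'csv')    # idem
h5f.close()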

Let's save the GPS data as found in the CSV file. The datetime objects have to be converted to Unix time stamps first. There are two obvious ways to do that, and one of them is wrong:

date = '20131211 18:29:30'
dt = datetime.strptime(date, "%Y%m%d %H:%M:%S")
print dt.strftime('%s') # wrong: '%s' is platform dependent and interprets dt in the local time zone
import calendar
print calendar.timegm(dt.utctimetuple()) # correct: interprets dt as UTC
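
The timegm result also round-trips correctly, which is an easy way to convince yourself it is the right variant:

ts = calendar.timegm(dt.utctimetuple())
print datetime.utcfromtimestamp(ts)  # 2013-12-11 18:29:30, the original date again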

Use the example from the pandas tutorial with the user-defined date parser; the resulting datetime objects can be converted to Unix time stamps as shown above when they are stored.


In [6]:
# read the information from the CSV file:
def date_parser(date):
    # create a datetime instance and assume constant formatting
    # format: 20131211 18:29:30
    return datetime(int(date[:4]), int(date[4:6]), int(date[6:8]), int(date[9:11]), int(date[12:14]), int(date[15:17]))
df = pd.read_csv('gps.csv', header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser)

# create (or open) the HDF5 file 'gps_db.hdf5' and get its /data group
h5f = get_db('gps_db.hdf5')
data_node = get_node(h5f, 'data')

# put the Latitude, Longitude and orientation columns in a NumPy array;
# the array needs a numerical dtype (float) so it can be written to the HDF5 file
data = np.array(df.values[:,1:], dtype='f')

# write the array to the file with the createArray method
h5f.createArray(data_node, 'gpsdata', data, title='gps data');

# close the file
h5f.close()


empty h5file created: gps_db.hdf5
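
As a quick sanity check we can read the array back; a sketch using the names created above:

h5f = tbl.openFile('gps_db.hdf5', mode='r')
gps = h5f.root.data.gpsdata.read()   # load the whole array as a NumPy array
print gps.shape
h5f.close()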

Add all the data to the csv table


In [7]:
# read the information from the CSV file:
df = pd.read_csv('gps.csv', header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser)

# create (or open) the HDF5 file 'gps_db2.hdf5' and get its /data group
h5f2 = get_db('gps_db2.hdf5')
data_node = get_node(h5f2, 'data')

# get the csv table from the /data group in the file
csv_table = get_table(h5f2.root.data, 'csv')

# csv_table.row is a buffer used to fill and append new rows
tablerow = csv_table.row

# copy the rows of the pandas DataFrame df into the table
for dfrow in df.values:
    #tablerow['Time'] = dfrow[0]   # the time stamp column is left empty here
    tablerow['Latitude_N'] = float(dfrow[1])
    tablerow['Longitude_E'] = float(dfrow[2])
    tablerow['orientation'] = float(dfrow[3])
    tablerow.append()
csv_table.flush()

# close the file
h5f2.close()


h5file opened: gps_db2.hdf5
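
One advantage of a real table over a plain array is that it supports in-kernel queries: the condition is evaluated while reading, without loading the whole table into memory. A sketch (the threshold 52.0 is an arbitrary made-up value):

h5f2 = tbl.openFile('gps_db2.hdf5', mode='r')
table = h5f2.root.data.csv
lats = [row['Latitude_N'] for row in table.where('Latitude_N > 52.0')]
print len(lats), 'rows match'
h5f2.close()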

Example: store DataFrame in PyTables

Exercise:

  • benchmark: how fast is storing with the HDFStore compared to native PyTables and the other examples we have seen?

  • Assume you have a huge CSV file that is too big to load into memory. Store it safely in a PyTables table.


In [8]:
%load_ext logit
user_name = 'FcoH'

In [9]:
#%%logit save_BM_NativePyTableStorage, gps_BM1.hdf5

# Store the CSV file values in PyTables. We assume the CSV file is too big to be
# loaded into memory (RAM), so we read and save the information row by row. It is
# assumed that both the CSV file and the new PyTables file fit on the local hard disk.

# BENCHMARK 1: APPEND ROWS INTO THE PYTABLE (NATIVE FORM).

# create (or open) the HDF5 file 'gps_BM1.hdf5';
# the root of the object tree is available as the file's root attribute
h5f = get_db('gps_BM1.hdf5')

# get the csv table from the /data group in the file
csv_table = get_table(h5f.root.data, 'csv')

# csv_table.row is a buffer used to fill and append new rows
tablerow = csv_table.row

# read the CSV file manually, line by line, and store it into gps_BM1.hdf5;
# iterating over the file object avoids loading the whole file into memory
with open('gps.csv') as f:
    f.readline()              # read and discard the column names
    f.readline()              # read and discard the units
    for line in f:            # read the data one row at a time

        # split the row into strings and parse all but the first (date/time)
        # column into a 32-bit float NumPy array
        aux = np.array(line.strip().split(","))[1:].astype(np.float32)

        # fill the row buffer and append it to the csv_table:
        tablerow['Latitude_N'] = aux[0]
        tablerow['Longitude_E'] = aux[1]
        tablerow['orientation'] = aux[2]
        tablerow.append()

# after all the data is processed, flush the table's I/O buffer so everything
# is written to disk; flushing once per row would be much slower
csv_table.flush()

# close the file
h5f.close()


h5file opened: gps_BM1.hdf5

In [10]:
#%%logit save_BM_HDFStorage, gps_BM2.hdf5

# BENCHMARK 2: APPEND ROWS INTO THE PYTABLE (HDFStore FORM).

# create an HDFStore object backed by the HDF5 file 'gps_BM2.hdf5'
store = pd.HDFStore('gps_BM2.hdf5', mode = "w")

# read the CSV file manually, line by line, and store it into gps_BM2.hdf5
with open('gps.csv') as f:
    f.readline()           # read and discard the column names
    f.readline()           # read and discard the units
    indexer = 0            # row index for the DataFrame
    for line in f:         # read the data one row at a time

        # split the row into strings and parse all but the first (date/time)
        # column into a 32-bit float NumPy array
        aux = np.array(line.strip().split(","))[1:].astype(np.float32)

        # wrap the row in a one-row DataFrame and append it to the /data group
        store.append('/data', pd.DataFrame(data={'Latitude_N':[aux[0]], 'Longitude_E':[aux[1]], 'orientation':[aux[2]]}, index=[indexer]))

        # update the row index
        indexer += 1

# after all the data is processed, flush the store's I/O buffer so everything
# is written to disk
store.flush()

# close the store
store.close()
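
A middle ground between benchmark 2 (one row at a time, slow) and benchmark 3 below (everything in memory) is to let pandas read the CSV file in chunks that do fit in memory. A sketch; the chunk size and the file name gps_BM2b.hdf5 are made up:

store = pd.HDFStore('gps_BM2b.hdf5', mode="w")
reader = pd.read_csv('gps.csv', header=0, skiprows=[1],
                     parse_dates=[0], date_parser=date_parser,
                     chunksize=10000)
for chunk in reader:
    # each chunk is an ordinary DataFrame, appended in one go
    store.append('/data', chunk)
store.flush()
store.close()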

In [11]:
#%%logit save_BM_HDFStorage_Fast, gps_BM3.hdf5

# How much does the storage process improve when the information can be loaded
# completely into memory (RAM) before flushing it?
# BENCHMARK 3: APPEND ROWS INTO THE PYTABLE (HDFStore FORM with completely loaded info).

# create an HDFStore object backed by the HDF5 file 'gps_BM3.hdf5'
store = pd.HDFStore('gps_BM3.hdf5', mode = "w")

# function for date parsing:
def date_parser(date):
    # create a datetime instance and assume constant formatting
    # format: 20131211 18:29:30
    return datetime(int(date[:4]), int(date[4:6]), int(date[6:8]), int(date[9:11]), int(date[12:14]), int(date[15:17]))

# read the information from the CSV file:
df = pd.read_csv('gps.csv', header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser)

# append the whole DataFrame to the /data group in one go
store.append('/data', df)

# flush the store's I/O buffer so everything is written to disk
store.flush()

# close the store
store.close()

# the file will be larger than in the previous benchmarks since the dates are included


save_BM_HDFStorage_Fast :
14-02-24 20:41:02
File size: 5.62045192719 MB
Creation time: 1.02662579961 sec
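
Reading the result back is just as easy; HDFStore.select also accepts query conditions for table-format stores. A sketch using the file from benchmark 3:

store = pd.HDFStore('gps_BM3.hdf5', mode='r')
df_back = store.select('/data')   # read the whole table back as a DataFrame
store.close()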

In [ ]:
#Any other method that does not involve the creation of tables (such as the ones studied in the Simple data storage IPython notebook)
#will perform better.