The goal of the ticdat package is to facilitate solve engines that are modular and robust. For example, the multicommodity netflow.py engine can read and write from a variety of file types when run from the command line. It can also be run from a Python script that contains embedded static data, or from a script that reads and writes from a system-of-record data source such as an ERP system.
With regard to the latter, we should note that Python is one of the most popular "glue" languages. The market has recognized that Python scripts are easy to write, manage data with intuitive programming syntax, and can be connected to nearly any data source.
The ticdat package can easily be used in any Python glue script. One way to do this is to exploit ticdat's ability to recognize data tables as list-of-lists. The inner lists contain data values in the field order defined by the PanDatFactory (i.e. netflow.input_schema).
For example, suppose the netflow engine needs to connect to an Oracle database for a daily automated solve. The integration engineer can use the cx_Oracle package (or something equivalent) to turn system data into a list-of-lists for each input table. These data structures can then be used to create a PanDat object that can be passed as input data to netflow.solve. The solution PanDat object returned by netflow.solve can then be converted back into a list-of-lists representation of each solution report table. (The list-of-lists strategy is just one approach. It might make sense to convert system-of-record data into pandas.DataFrame objects, and then use these DataFrames to build the PanDat object.)
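To make this concrete, here is a minimal sketch of what the cx_Oracle side of such a script might look like. The connection credentials and the commodities query are purely hypothetical; only the general pattern (query, fetch, convert to list-of-lists) is the point.
import cx_Oracle

connection = cx_Oracle.connect(user="erp_user", password="erp_password",
                               dsn="erp_host/erp_service")  # hypothetical credentials
cursor = connection.cursor()
# Fields must come back in the order netflow.input_schema expects (here Name, Volume).
cursor.execute("SELECT name, volume FROM commodities")  # hypothetical source table
commodities = [list(row) for row in cursor.fetchall()]
connection.close()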
We demonstrate this approach below without explicit references to cx_Oracle. By showing that ticdat is compatible with list-of-lists/DataFrame table representations, we show that it is compatible with any data source that can be connected to Python, as well as with human-readable static data.
In [1]:
commodities = [['Pencils', 0.5], ['Pens', 0.2125]]
# a one column table can just be a simple list
nodes = ['Boston', 'Denver', 'Detroit', 'New York', 'Seattle']
cost = [['Pencils', 'Denver', 'Boston', 10.0],
        ['Pencils', 'Denver', 'New York', 10.0],
        ['Pencils', 'Denver', 'Seattle', 7.5],
        ['Pencils', 'Detroit', 'Boston', 2.5],
        ['Pencils', 'Detroit', 'New York', 5.0],
        ['Pencils', 'Detroit', 'Seattle', 15.0],
        ['Pens', 'Denver', 'Boston', 15.0],
        ['Pens', 'Denver', 'New York', 17.5],
        ['Pens', 'Denver', 'Seattle', 7.5],
        ['Pens', 'Detroit', 'Boston', 5.0],
        ['Pens', 'Detroit', 'New York', 5.0],
        ['Pens', 'Detroit', 'Seattle', 20.0]]
inflow = [['Pencils', 'Boston', -200],
          ['Pencils', 'Denver', 240],
          ['Pencils', 'Detroit', 200],
          ['Pencils', 'New York', -200],
          ['Pencils', 'Seattle', -40],
          ['Pens', 'Boston', -160],
          ['Pens', 'Denver', 160],
          ['Pens', 'Detroit', 240],
          ['Pens', 'New York', -120],
          ['Pens', 'Seattle', -120]]
An integration engineer might prefer to copy system-of-record data into pandas.DataFrame objects. Note that pandas is itself capable of reading directly from various SQL databases, although it usually needs a supporting package like cx_Oracle.
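As a rough sketch of what that might look like (the connection string and the arcs query here are hypothetical, not part of the netflow example):
import pandas as pd
import sqlalchemy

# hypothetical Oracle connection string using the cx_Oracle dialect
engine = sqlalchemy.create_engine(
    "oracle+cx_oracle://erp_user:erp_password@erp_host/?service_name=erp_service")
arcs = pd.read_sql("SELECT source, destination, capacity FROM arcs", engine)  # hypothetical table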
In [2]:
from pandas import DataFrame
arcs = DataFrame({"Source": ["Denver", "Denver", "Denver", "Detroit", "Detroit", "Detroit",],
"Destination": ["Boston", "New York", "Seattle", "Boston", "New York",
"Seattle"],
"Capacity": [120, 120, 120, 100, 80, 120]})
# PanDatFactory doesn't require the fields to be in order so long as the field names are supplied
arcs = arcs[["Destination", "Source", "Capacity"]]
arcs
Out[2]:
Next we create a PanDat input data object from the list-of-lists/DataFrame representations.
In [3]:
%env PATH = PATH:/Users/petercacioppi/ampl/ampl
from netflow import input_schema, solve, solution_schema
dat = input_schema.PanDat(commodities=commodities, nodes=nodes, cost=cost, arcs=arcs,
                          inflow=inflow)
We now create a PanDat solution data object by calling solve.
In [4]:
sln = solve(dat)
We now create a list-of-lists representation of the solution data object.
In [5]:
sln_lists = {t: list(map(list, getattr(sln, t).itertuples(index=False)))
             for t in solution_schema.all_tables}
Here we demonstrate that sln_lists is a dictionary mapping each table name to a list-of-lists of solution report data.
In [6]:
import pprint
for sln_table_name, sln_table_data in sln_lists.items():
    print("\n\n**\nSolution Table %s\n**" % sln_table_name)
    pprint.pprint(sln_table_data)
Of course, the solution data object itself contains DataFrames, if that representation is preferred.
In [7]:
sln.flow
Out[7]:
The preceding section demonstrated how we can use ticdat to build modular engines. We now demonstrate how we can use ticdat to build engines that check solve pre-conditions, and are thus robust with respect to data integrity problems.
First, let's violate our (somewhat artificial) rule that the commodity volume must be positive.
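For context, this kind of rule is declared as a data type on the input schema inside netflow.py. A sketch of the sort of declaration that would enforce it, not necessarily the engine's exact code:
# sketch of a positive-volume rule; netflow.py may phrase its declaration differently
input_schema.set_data_type("commodities", "Volume", min=0, max=float("inf"),
                           inclusive_min=False, inclusive_max=False)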
In [8]:
dat.commodities.loc[dat.commodities["Name"] == "Pencils", "Volume"] = 0
dat.commodities
Out[8]:
The input_schema can not only flag this problem, but also give us a useful data structure to examine.
In [9]:
data_type_failures = input_schema.find_data_type_failures(dat)
data_type_failures
Out[9]:
In [10]:
data_type_failures['commodities', 'Volume']
Out[10]:
Next, let's add a cost record for a non-existent commodity and see how input_schema flags this problem.
In [11]:
dat.cost = dat.cost.append({'Commodity': 'Crayons', 'Source': 'Detroit',
                            'Destination': 'Seattle', 'Cost': 10},
                           ignore_index=True)
fk_failures = input_schema.find_foreign_key_failures(dat, verbosity="Low")
fk_failures
Out[11]:
In [12]:
fk_failures['cost', 'commodities', ('Commodity', 'Name')]
Out[12]:
In real life, data integrity failures can typically be grouped into a small number of categories. However, the number of failures in each category might be quite large. ticdat creates data structures for each of these categories that can themselves be examined programmatically. As a result, an analyst can leverage the power of Python and pandas to detect patterns in the data integrity problems.
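For example, a minimal sketch of how such programmatic examination might start, assuming (as the cells above suggest) that the failure dictionaries map each category to a DataFrame of offending records:
# summarize how many offending records fall into each integrity category
for (table, field), bad_records in data_type_failures.items():
    print("%s data type failures in %s.%s" % (len(bad_records), table, field))
for fk_key, bad_records in fk_failures.items():
    print("%s foreign key failures for %s" % (len(bad_records), fk_key))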