Basic orientation to ticdat, pandas, and developing engines for Opalytics

One of the advantages of Python is that it has "batteries included". That is to say, there is a rich set of libraries available for installation. Of course, with such a large collection of libraries to choose from, it's natural to wonder how different libraries relate to each other, and which to choose for a given situation.

This notebook addresses the ticdat and pandas libraries. It is a good starting point if you are a pythonic and pandonic programmer who wishes to develop Opalytics-ready data science engines as quickly as possible.

ticdat was developed to promote modular solve engine development. It facilitates the pattern under which a solve function publishes its input and output data formats.

Specifically, a solve engine creates two TicDatFactory objects. One defines the input schema and the other the output schema. Although you are encouraged to add as many data integrity rules as possible to these objects (particularly the input object), you only need to specify the table and field names, and to organize the fields into primary key fields and data fields.

For example, in the diet example, the dietmodel.py file has the following lines.


In [1]:
from ticdat import TicDatFactory, freeze_me

In [2]:
dataFactory = TicDatFactory(
     categories = [["name"],["minNutrition", "maxNutrition"]],
     foods  = [["name"],["cost"]],
     nutritionQuantities = [["food", "category"], ["qty"]])

Here, the dataFactory object defines an input schema. This schema has three tables (categories, foods, and nutritionQuantities). The categories table is indexed by a single field (name) and has two data fields (minNutrition and maxNutrition). The foods table is also indexed by name and has a single data field (cost). The nutritionQuantities table is indexed by two fields (food and category) and has one data field (qty).
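
Although this example keeps the schema to bare table and field definitions, this is also the natural place to attach data integrity rules. Here is a minimal sketch; the specific rules below are illustrative assumptions of mine rather than part of the diet example.

# Illustrative sketch (not from dietmodel.py): attach type constraints and
# foreign-key relationships to the input schema.
dataFactory.set_data_type("categories", "minNutrition", min=0, max=float("inf"),
                          inclusive_min=True, inclusive_max=False)
dataFactory.set_data_type("foods", "cost", min=0, max=float("inf"),
                          inclusive_min=True, inclusive_max=False)
# every nutritionQuantities record should reference a known food and a known category
dataFactory.add_foreign_key("nutritionQuantities", "foods", ("food", "name"))
dataFactory.add_foreign_key("nutritionQuantities", "categories", ("category", "name"))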

Any code wishing to run the solve function can learn what type of data object to pass as input by examining the dataFactory object. The dietcsvdata.py, dietstaticdata.py and dietxls.py scripts demonstrate this pattern by sourcing data from a sub-directory of csv files, a static data instance, and an xls file, respectively. Were Opalytics to deploy dietmodel, it would perform work roughly analogous to that performed by these three files, except Opalytics would source the input data from the Opalytics Cloud Platform.
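
For instance, a caller holding a directory of csv files or an xls workbook could construct the input data roughly as follows; the paths below are placeholders, not files shipped with the example.

# Sketch of how a caller might build a TicDat object; the paths are placeholders.
dietDataFromCsv = dataFactory.csv.create_tic_dat("diet_sample_data")      # directory of csv files
dietDataFromXls = dataFactory.xls.create_tic_dat("diet_sample_data.xls")  # Excel workbook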

Let's examine what a TicDat object created by dataFactory looks like. To do this, we're going to pull in some sample testing data hard coded in the ticdat testing code.


In [3]:
import ticdat.testing.ticdattestutils as tictest
_tmp = tictest.dietData()
dietData = dataFactory.TicDat(categories = _tmp.categories, foods = _tmp.foods, 
                              nutritionQuantities = _tmp.nutritionQuantities)

dietData is a TicDat object. It is an instance of the schema defined by dataFactory. By default, it stores its data in a "dict of dicts" format.


In [4]:
dietData.categories


Out[4]:
{'calories': _td:{'minNutrition': 1800, 'maxNutrition': 2200},
 'fat': _td:{'minNutrition': 0, 'maxNutrition': 65},
 'protein': _td:{'minNutrition': 91, 'maxNutrition': inf},
 'sodium': _td:{'minNutrition': 0, 'maxNutrition': 1779}}

In [5]:
dietData.nutritionQuantities


Out[5]:
{('chicken', 'calories'): _td:{'qty': 420},
 ('chicken', 'fat'): _td:{'qty': 10},
 ('chicken', 'protein'): _td:{'qty': 32},
 ('chicken', 'sodium'): _td:{'qty': 1190},
 ('fries', 'calories'): _td:{'qty': 380},
 ('fries', 'fat'): _td:{'qty': 19},
 ('fries', 'protein'): _td:{'qty': 4},
 ('fries', 'sodium'): _td:{'qty': 270},
 ('hamburger', 'calories'): _td:{'qty': 410},
 ('hamburger', 'fat'): _td:{'qty': 26},
 ('hamburger', 'protein'): _td:{'qty': 24},
 ('hamburger', 'sodium'): _td:{'qty': 730},
 ('hot dog', 'calories'): _td:{'qty': 560},
 ('hot dog', 'fat'): _td:{'qty': 32},
 ('hot dog', 'protein'): _td:{'qty': 20},
 ('hot dog', 'sodium'): _td:{'qty': 1800},
 ('ice cream', 'calories'): _td:{'qty': 330},
 ('ice cream', 'fat'): _td:{'qty': 10},
 ('ice cream', 'protein'): _td:{'qty': 8},
 ('ice cream', 'sodium'): _td:{'qty': 180},
 ('macaroni', 'calories'): _td:{'qty': 320},
 ('macaroni', 'fat'): _td:{'qty': 10},
 ('macaroni', 'protein'): _td:{'qty': 12},
 ('macaroni', 'sodium'): _td:{'qty': 930},
 ('milk', 'calories'): _td:{'qty': 100},
 ('milk', 'fat'): _td:{'qty': 2.5},
 ('milk', 'protein'): _td:{'qty': 8},
 ('milk', 'sodium'): _td:{'qty': 125},
 ('pizza', 'calories'): _td:{'qty': 320},
 ('pizza', 'fat'): _td:{'qty': 12},
 ('pizza', 'protein'): _td:{'qty': 15},
 ('pizza', 'sodium'): _td:{'qty': 820},
 ('salad', 'calories'): _td:{'qty': 320},
 ('salad', 'fat'): _td:{'qty': 12},
 ('salad', 'protein'): _td:{'qty': 31},
 ('salad', 'sodium'): _td:{'qty': 1230}}
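
This "dict of dicts" format is handy inside the solve function itself. A model-building loop might traverse the data along these lines (an illustrative sketch, not code taken from dietmodel.py).

# Outer keys are the primary key values; inner rows map data field names to values.
for food, row in dietData.foods.items():
    cost = row["cost"]                  # data fields are accessed by field name
for (food, category), row in dietData.nutritionQuantities.items():
    qty = row["qty"]                    # composite primary keys appear as tuples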

However, since you are pandonic, you might prefer to have a copy of this data in pandas format. This is easy to do.


In [6]:
panDiet = dataFactory.copy_to_pandas(dietData)

In [7]:
panDiet.categories


Out[7]:
          minNutrition  maxNutrition
name
calories          1800   2200.000000
fat                  0     65.000000
protein             91           inf
sodium               0   1779.000000

In [8]:
panDiet.nutritionQuantities


Out[8]:
                       qty
food      category
chicken   calories   420.0
          fat         10.0
          protein     32.0
          sodium    1190.0
fries     calories   380.0
          fat         19.0
          protein      4.0
          sodium     270.0
hamburger calories   410.0
          fat         26.0
          protein     24.0
          sodium     730.0
hot dog   calories   560.0
          fat         32.0
          protein     20.0
          sodium    1800.0
ice cream calories   330.0
          fat         10.0
          protein      8.0
          sodium     180.0
macaroni  calories   320.0
          fat         10.0
          protein     12.0
          sodium     930.0
milk      calories   100.0
          fat          2.5
          protein      8.0
          sodium     125.0
pizza     calories   320.0
          fat         12.0
          protein     15.0
          sodium     820.0
salad     calories   320.0
          fat         12.0
          protein     31.0
          sodium    1230.0

Note that these aren't "raw" DataFrame objects. Instead, ticdat has inferred sensible indexes for you from the primary key field designations in dataFactory. The nutritionQuantities table has a MultiIndex, while the foods and categories tables each have a simple index.

By default, copy_to_pandas will drop the columns that are used to populate the index, unless doing so would result in a DataFrame with no columns at all. However, if you wish for no columns to be dropped under any circumstances, you can use the optional drop_pk_columns argument. This is illustrated below.


In [9]:
panDietNoDrop = dataFactory.copy_to_pandas(dietData, drop_pk_columns=False)
panDietNoDrop.categories


Out[9]:
              name  minNutrition  maxNutrition
name
calories  calories          1800   2200.000000
fat            fat             0     65.000000
protein    protein            91           inf
sodium      sodium             0   1779.000000

Let's review.

  • dataFactory describes the input schema
  • The solve function doesn't know where its input data is coming from. It only knows that it will conform to the schema defined by dataFactory. (All of my examples include at least one assert statement double-checking this assumption; a sketch of such a check follows this list.)
  • By default, the input tables will be in the "dict of dicts" format. However, it's easy to create a copy of the data with a DataFrame for each table.
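
Such a check might look like the following. This is a minimal sketch; good_tic_dat_object is a TicDatFactory method.

def solve(dietData):
    # defensive check that the caller passed data conforming to the input schema
    assert dataFactory.good_tic_dat_object(dietData), "bad input data"
    # ... build and solve the optimization model here ...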

This summarizes how a solve function can specify its input data and reformat this data as needed. Let's now examine how solve will return data.

The following code specifies a return schema.


In [10]:
solutionFactory = TicDatFactory(
        parameters = [[],["totalCost"]],
        buyFood = [["food"],["qty"]],
        consumeNutrition = [["category"],["qty"]])

This schema has three tables (parameters, buyFood, consumeNutrition). The parameters table has no primary key fields at all, and just a single data field. (It is assumed that this table will have at most one record). The buyFood table is indexed by the food field, and has a single data field indicating how much of that food is to be consumed. consumeNutrition is similar, except it defines the quantity consumed for each nutrition type.

(As an aside, only the buyFood table is really needed. The total cost and the quantities of nutrition consumed for each nutrition type can be inferred from the consumption of food and the input data. However, it often makes good sense for the solve routine to compute mathematically redundant tables purely for reporting purposes).
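
To make that redundancy concrete, the total cost could be recomputed from the food purchases and the input data along these lines (a sketch using hypothetical names, not code from the diet example).

# Sketch: recompute the objective from the purchase quantities and the input
# costs, purely as a reporting cross-check.
def total_cost(dietData, buyFood):
    return sum(dietData.foods[food]["cost"] * row["qty"]
               for food, row in buyFood.items())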

How can the solve code return an object of this type? The easiest way is to create an empty TicDat object, and populate it row by row. This is particularly easy for this schema because all the tables have but one data field. (We're going to skip populating the parameters table because "no primary key" tables are a little different).


In [11]:
soln = solutionFactory.TicDat()
soln.buyFood["hamburger"] = 0.6045138888888888
soln.buyFood["ice cream"] = 2.591319444444
soln.buyFood["milk"] = 6.9701388888
soln.consumeNutrition["calories"]= 1800.0
soln.consumeNutrition["fat"]=59.0559
soln.consumeNutrition["protein"]=91.
soln.consumeNutrition["sodium"]=1779.

ticdat overrides __setitem__ for tables with a single data field, so the assignments above create the following.


In [12]:
soln.buyFood


Out[12]:
{'hamburger': _td:{'qty': 0.6045138888888888},
 'ice cream': _td:{'qty': 2.591319444444},
 'milk': _td:{'qty': 6.9701388888}}

In [13]:
soln.consumeNutrition


Out[13]:
{'calories': _td:{'qty': 1800.0},
 'fat': _td:{'qty': 59.0559},
 'protein': _td:{'qty': 91.0},
 'sodium': _td:{'qty': 1779.0}}

Here are a couple of other, equivalent ways to populate these seven records.


In [14]:
soln = solutionFactory.TicDat()
soln.buyFood["hamburger"]["qty"] = 0.6045138888888888
soln.buyFood["ice cream"]["qty"] = 2.591319444444
soln.buyFood["milk"]["qty"] = 6.9701388888
soln.consumeNutrition["calories"]["qty"] = 1800.0
soln.consumeNutrition["fat"]["qty"] = 59.0559
soln.consumeNutrition["protein"]["qty"] = 91.
soln.consumeNutrition["sodium"]["qty"] = 1779.

In [15]:
soln = solutionFactory.TicDat()
soln.buyFood["hamburger"] = {"qty" : 0.6045138888888888}
soln.buyFood["ice cream"] = {"qty" : 2.591319444444}
soln.buyFood["milk"] = {"qty" : 6.9701388888}
soln.consumeNutrition["calories"] = {"qty" : 1800.0}
soln.consumeNutrition["fat"] = {"qty" : 59.0559}
soln.consumeNutrition["protein"] = {"qty" : 91.}
soln.consumeNutrition["sodium"] = {"qty" : 1779.}

But wait! You're pandonic! Fair enough. Here are a few ways to initialize a TicDat object with Series and DataFrame objects.

First, let's make two DataFrames for the two output tables.


In [16]:
from pandas import Series, DataFrame
buyDf = DataFrame({"food":['hamburger', 'ice cream', 'milk'], 
                   "qty":[0.6045138888888888, 2.591319444444, 6.9701388888]}).set_index("food")
consumeDf = DataFrame({"category" : ["calories", "fat", "protein", "sodium"], 
                       "qty": [1800.0, 59.0559, 91., 1779.]}).set_index("category")

As you can see, these DataFrames are consistent with the format expected by solutionFactory.


In [17]:
buyDf


Out[17]:
                qty
food
hamburger  0.604514
ice cream  2.591319
milk       6.970139

In [18]:
consumeDf


Out[18]:
                qty
category
calories  1800.0000
fat         59.0559
protein     91.0000
sodium    1779.0000

As a result, they can be used to create a solutionFactory.TicDat object. Just pass the DataFrame objects as the correct named arguments when creating the TicDat.


In [19]:
soln = solutionFactory.TicDat(buyFood = buyDf, consumeNutrition = consumeDf)
soln.buyFood


Out[19]:
{'hamburger': _td:{'qty': 0.6045138888888888},
 'ice cream': _td:{'qty': 2.5913194444440002},
 'milk': _td:{'qty': 6.9701388888000002}}

But wait! There's even more. Because the data tables here have but a single data field, they can accept properly formatted Series objects as well.


In [20]:
buyS = buyDf.qty
consumeS = consumeDf.qty
assert isinstance(buyS, Series) and isinstance(consumeS, Series)
soln = solutionFactory.TicDat(buyFood = buyS, consumeNutrition = consumeS)
soln.consumeNutrition


Out[20]:
{'calories': _td:{'qty': 1800.0},
 'fat': _td:{'qty': 59.055900000000001},
 'protein': _td:{'qty': 91.0},
 'sodium': _td:{'qty': 1779.0}}

Thanks for reading!