One of the advantages of Python is that it has "batteries included" — a rich standard library, complemented by an even larger ecosystem of libraries available for installation. Of course, with such a large collection of libraries to choose from, it's natural to wonder how different libraries relate to each other, and which to choose for a given situation.
This notebook addresses the ticdat and pandas libraries. It is a good starting point if you are a pythonic and pandonic programmer who wishes to develop Opalytics-ready data science engines as quickly as possible.
ticdat was developed to promote modular solve engine development. It facilitates the pattern under which a solve function publishes its input and output data formats.
Specifically, a solve engine creates two TicDatFactory objects. One defines the input schema and the other the output schema. Although you are encouraged to add as many data integrity rules as possible to these objects (particularly the input object), you only need to specify the table and field names, and to organize the fields into primary key fields and data fields.
For example, in the diet example, the dietmodel.py file has the following lines.
In [1]:
from ticdat import TicDatFactory, freeze_me
In [2]:
dataFactory = TicDatFactory(
    categories = [["name"], ["minNutrition", "maxNutrition"]],
    foods = [["name"], ["cost"]],
    nutritionQuantities = [["food", "category"], ["qty"]])
Here, the dataFactory object defines an input schema. This schema has three tables (categories, foods, and nutritionQuantities). The categories table is indexed by a single field (name) and has two data fields (minNutrition and maxNutrition). The foods table is also indexed by a single field (name) and has one data field (cost). The nutritionQuantities table is indexed by two fields (food and category) and has one data field (qty).
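To make the schema concrete, here is a plain-Python sketch of the "dict of dicts" shape these table definitions imply (the sample values are made up for illustration; this mirrors, rather than uses, ticdat):

```python
# Table indexed by one primary key field ("name"):
# primary key value -> {data field: value}
categories = {
    "calories": {"minNutrition": 1800, "maxNutrition": 2200},
    "protein":  {"minNutrition": 91,   "maxNutrition": float("inf")},
}

# Table indexed by two primary key fields ("food", "category"):
# the key becomes a tuple of primary key values
nutritionQuantities = {
    ("hamburger", "calories"): {"qty": 410},
    ("hamburger", "protein"):  {"qty": 24},
}

assert categories["calories"]["maxNutrition"] == 2200
assert nutritionQuantities[("hamburger", "protein")]["qty"] == 24
```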
Any code wishing to run the solve function can learn what type of data object to pass as input by examining the dataFactory object. The dietcsvdata.py, dietstaticdata.py and dietxls.py scripts demonstrate this pattern by sourcing data from a sub-directory of csv files, a static data instance, and an xls file, respectively. Were Opalytics to deploy dietmodel, it would perform work roughly analogous to that performed by these three files, except Opalytics would source the input data from the Opalytics Cloud Platform.
Let's examine what a TicDat object created by dataFactory looks like. To do this, we're going to pull in some sample testing data hard coded in the ticdat testing code.
In [3]:
import ticdat.testing.ticdattestutils as tictest
_tmp = tictest.dietData()
dietData = dataFactory.TicDat(categories = _tmp.categories, foods = _tmp.foods,
                              nutritionQuantities = _tmp.nutritionQuantities)
dietData is a TicDat object. It is an instance of the schema defined by dataFactory. By default, it stores its data in a "dict of dicts" format.
In [4]:
dietData.categories
Out[4]:
In [5]:
dietData.nutritionQuantities
Out[5]:
However, since you are pandonic, you might prefer to have a copy of this data in pandas format. This is easy to do.
In [6]:
panDiet = dataFactory.copy_to_pandas(dietData)
In [7]:
panDiet.categories
Out[7]:
In [8]:
panDiet.nutritionQuantities
Out[8]:
Note that these aren't "raw" DataFrame objects. Instead, ticdat has inferred sensible indexes for you from the primary key field designations in dataFactory. The nutritionQuantities table has a MultiIndex, and the foods and categories tables each have a simple index.
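To see what this indexing looks like, here is a hand-rolled equivalent in plain pandas (a sketch with made-up numbers, not ticdat's actual implementation): indexing by two primary key fields yields a MultiIndex, while a single field yields a simple index.

```python
import pandas as pd

# Two primary key fields ("food", "category") -> MultiIndex
nq = pd.DataFrame({"food": ["hamburger", "hamburger"],
                   "category": ["calories", "protein"],
                   "qty": [410, 24]}).set_index(["food", "category"])

# One primary key field ("name") -> simple index
foods = pd.DataFrame({"name": ["hamburger", "milk"],
                      "cost": [2.49, 0.89]}).set_index("name")

assert isinstance(nq.index, pd.MultiIndex)
assert not isinstance(foods.index, pd.MultiIndex)
# Rows of the MultiIndexed table are addressed by a key tuple
assert nq.loc[("hamburger", "protein"), "qty"] == 24
```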
By default, copy_to_pandas will drop the columns that are used to populate the index, unless doing so would result in a DataFrame with no columns at all. However, if you wish for no columns to be dropped under any circumstances, you can use the optional drop_pk_columns argument. This is illustrated below.
In [9]:
panDietNoDrop = dataFactory.copy_to_pandas(dietData, drop_pk_columns=False)
panDietNoDrop.categories
Out[9]:
Let's review.
- dataFactory describes the input schema.
- The solve function doesn't know where its input data is coming from. It only knows that it will conform to the schema defined by dataFactory. (All of my examples include at least one assert statement double checking this assumption.)
- copy_to_pandas creates a DataFrame for each table.
This summarizes how a solve function can specify its input data and reformat this data as needed. Let's now examine how solve will return data.
The following code specifies a return schema.
In [10]:
solutionFactory = TicDatFactory(
    parameters = [[], ["totalCost"]],
    buyFood = [["food"], ["qty"]],
    consumeNutrition = [["category"], ["qty"]])
This schema has three tables (parameters, buyFood, consumeNutrition). The parameters table has no primary key fields at all, and just a single data field. (It is assumed that this table will have at most one record). The buyFood table is indexed by the food field, and has a single data field indicating how much of that food is to be consumed. consumeNutrition is similar, except it defines the quantity consumed for each nutrition type.
(As an aside, only the buyFood table is really needed. The total cost and the quantities of nutrition consumed for each nutrition type can be inferred from the consumption of food and the input data. However, it often makes good sense for the solve routine to compute mathematically redundant tables purely for reporting purposes.)
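For instance, the redundant totalCost value is just the cost-weighted sum of the buyFood quantities. With hypothetical cost and quantity numbers (made up here for illustration), the computation is simply:

```python
# Hypothetical input costs (from the foods table) and a solution's
# buyFood quantities.
cost = {"hamburger": 2.49, "ice cream": 1.59, "milk": 0.89}
buy = {"hamburger": 0.6, "ice cream": 2.6, "milk": 7.0}

# totalCost is mathematically redundant: it can always be recomputed
# from the buyFood table and the input cost data.
total_cost = sum(qty * cost[food] for food, qty in buy.items())
# total_cost == 0.6*2.49 + 2.6*1.59 + 7.0*0.89 == 11.858
```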
How can the solve code return an object of this type? The easiest way is to create an empty TicDat object and populate it row by row. This is particularly easy for this schema because all the tables have but one data field. (We're going to skip populating the parameters table because "no primary key" tables are a little different.)
In [11]:
soln = solutionFactory.TicDat()
soln.buyFood["hamburger"] = 0.6045138888888888
soln.buyFood["ice cream"] = 2.591319444444
soln.buyFood["milk"] = 6.9701388888
soln.consumeNutrition["calories"] = 1800.0
soln.consumeNutrition["fat"] = 59.0559
soln.consumeNutrition["protein"] = 91.
soln.consumeNutrition["sodium"] = 1779.
ticdat overrides __setitem__ for single data field tables so as to create the following.
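The mechanics resemble the following simplified sketch (this is an illustration of the idea, not ticdat's actual code): a dict subclass whose __setitem__ wraps a scalar into a row keyed by the table's single data field.

```python
class SingleFieldTable(dict):
    """Sketch of a one-data-field table: assigning a scalar to a
    primary key stores it as {data_field: value}."""
    def __init__(self, data_field):
        super().__init__()
        self._data_field = data_field

    def __setitem__(self, pk, value):
        # Scalars get wrapped; explicit row dicts pass through unchanged.
        if not isinstance(value, dict):
            value = {self._data_field: value}
        super().__setitem__(pk, value)

buyFood = SingleFieldTable("qty")
buyFood["hamburger"] = 0.6045138888888888   # scalar gets wrapped...
buyFood["milk"] = {"qty": 6.9701388888}     # ...dicts pass through

assert buyFood["hamburger"] == {"qty": 0.6045138888888888}
assert buyFood["milk"]["qty"] == 6.9701388888
```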
In [12]:
soln.buyFood
Out[12]:
In [13]:
soln.consumeNutrition
Out[13]:
Here are a couple of other, equivalent ways to populate these seven records.
In [14]:
soln = solutionFactory.TicDat()
soln.buyFood["hamburger"]["qty"] = 0.6045138888888888
soln.buyFood["ice cream"]["qty"] = 2.591319444444
soln.buyFood["milk"]["qty"] = 6.9701388888
soln.consumeNutrition["calories"]["qty"] = 1800.0
soln.consumeNutrition["fat"]["qty"] = 59.0559
soln.consumeNutrition["protein"]["qty"] = 91.
soln.consumeNutrition["sodium"]["qty"] = 1779.
In [15]:
soln = solutionFactory.TicDat()
soln.buyFood["hamburger"] = {"qty" : 0.6045138888888888}
soln.buyFood["ice cream"] = {"qty" : 2.591319444444}
soln.buyFood["milk"] = {"qty" : 6.9701388888}
soln.consumeNutrition["calories"] = {"qty" : 1800.0}
soln.consumeNutrition["fat"] = {"qty" : 59.0559}
soln.consumeNutrition["protein"] = {"qty" : 91.}
soln.consumeNutrition["sodium"] = {"qty" : 1779.}
But wait! You're pandonic! Fair enough. Here are a few ways to initialize a TicDat object with Series and DataFrame objects.
First, let's make two DataFrames for the two output tables.
In [16]:
from pandas import Series, DataFrame
buyDf = DataFrame({"food": ['hamburger', 'ice cream', 'milk'],
                   "qty": [0.6045138888888888, 2.591319444444, 6.9701388888]}).set_index("food")
consumeDf = DataFrame({"category": ["calories", "fat", "protein", "sodium"],
                       "qty": [1800.0, 59.0559, 91., 1779.]}).set_index("category")
As you can see, these DataFrames are consistent with the format expected by solutionFactory.
In [17]:
buyDf
Out[17]:
In [18]:
consumeDf
Out[18]:
As a result, they can be used to create a solutionFactory.TicDat object. Just pass the DataFrame objects as the correct named arguments when creating the TicDat.
In [19]:
soln = solutionFactory.TicDat(buyFood = buyDf, consumeNutrition = consumeDf)
soln.buyFood
Out[19]:
But wait! There's even more. Because the data tables here have but a single data field, they can accept properly formatted Series objects as well.
In [20]:
buyS = buyDf.qty
consumeS = consumeDf.qty
assert isinstance(buyS, Series) and isinstance(consumeS, Series)
soln = solutionFactory.TicDat(buyFood = buyS, consumeNutrition = consumeS)
soln.consumeNutrition
Out[20]:
Thanks for reading!