One of the advantages of Python is that it has "batteries included" — a rich standard library, complemented by an even larger ecosystem of libraries available for installation. Of course, with such a large collection of libraries to choose from, it's natural to wonder how different libraries relate to each other, and which to choose for a given situation.
This notebook addresses the ticdat and pandas libraries. It is a good starting point if you are a pythonic and pandonic programmer who wishes to develop Opalytics-ready data science engines as quickly as possible.
ticdat was developed to promote modular solve engine development. It facilitates the pattern under which a solve function publishes its input and output data formats.
Specifically, a solve engine creates two TicDatFactory objects. One defines the input schema and the other the output schema. Although you are encouraged to add as many data integrity rules as possible to these objects (particularly the input object), you only need to specify the table and field names, and to organize the fields into primary key fields and data fields.
For example, in the diet example, the dietmodel.py file has the following lines.
In [1]:
from ticdat import TicDatFactory, freeze_me
In [2]:
dataFactory = TicDatFactory(
    categories = [["name"], ["minNutrition", "maxNutrition"]],
    foods = [["name"], ["cost"]],
    nutritionQuantities = [["food", "category"], ["qty"]])
Here, the dataFactory object defines an input schema. This schema has three tables (categories, foods, and nutritionQuantities). The categories table is indexed by a single field (name) and has two data fields (minNutrition and maxNutrition). The foods table is also indexed by a single field (name) and has one data field (cost). The nutritionQuantities table is indexed by two fields (food and category) and has one data field (qty).
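To make the schema concrete, here is a plain-Python sketch of the "dict of dicts" shape these table definitions imply (the sample values are made up for illustration; this mirrors, rather than uses, ticdat):

```python
# Table indexed by one primary key field ("name"):
# primary key value -> {data field: value}
categories = {
    "calories": {"minNutrition": 1800, "maxNutrition": 2200},
    "protein":  {"minNutrition": 91,   "maxNutrition": float("inf")},
}

# Table indexed by two primary key fields ("food", "category"):
# the key becomes a tuple of primary key values
nutritionQuantities = {
    ("hamburger", "calories"): {"qty": 410},
    ("hamburger", "protein"):  {"qty": 24},
}

assert categories["calories"]["maxNutrition"] == 2200
assert nutritionQuantities[("hamburger", "protein")]["qty"] == 24
```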
Any code wishing to run the solve function can learn what type of data object to pass as input by examining the dataFactory object. The dietcsvdata.py, dietstaticdata.py and dietxls.py scripts demonstrate this pattern by sourcing data from a sub-directory of csv files, a static data instance, and an xls file, respectively. Were Opalytics to deploy dietmodel, it would perform work roughly analogous to that performed by these three files, except Opalytics would source the input data from the Opalytics Cloud Platform.
Let's examine what a TicDat object created by dataFactory looks like. To do this, we're going to pull in some sample testing data hard coded in the ticdat testing code.
In [3]:
import ticdat.testing.ticdattestutils as tictest
_tmp = tictest.dietData()
dietData = dataFactory.TicDat(categories = _tmp.categories, foods = _tmp.foods,
                              nutritionQuantities = _tmp.nutritionQuantities)
dietData is a TicDat object. It is an instance of the schema defined by dataFactory. By default, it stores its data in a "dict of dicts" format.
In [4]:
dietData.categories
Out[4]:
In [5]:
dietData.nutritionQuantities
Out[5]:
However, since you are pandonic, you might prefer to have a copy of this data in pandas format. This is easy to do.
In [6]:
panDiet = dataFactory.copy_to_pandas(dietData)
In [7]:
panDiet.categories
Out[7]:
In [8]:
panDiet.nutritionQuantities
Out[8]:
Note that these aren't "raw" DataFrame objects. Instead, ticdat has inferred sensible indexes for you from the primary key field designations in dataFactory. The nutritionQuantities table has a MultiIndex, and the foods and categories tables each have a simple index.
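To see what this indexing looks like, here is a hand-rolled equivalent in plain pandas (a sketch with made-up numbers, not ticdat's actual implementation): indexing by two primary key fields yields a MultiIndex, while a single field yields a simple index.

```python
import pandas as pd

# Two primary key fields ("food", "category") -> MultiIndex
nq = pd.DataFrame({"food": ["hamburger", "hamburger"],
                   "category": ["calories", "protein"],
                   "qty": [410, 24]}).set_index(["food", "category"])

# One primary key field ("name") -> simple index
foods = pd.DataFrame({"name": ["hamburger", "milk"],
                      "cost": [2.49, 0.89]}).set_index("name")

assert isinstance(nq.index, pd.MultiIndex)
assert not isinstance(foods.index, pd.MultiIndex)
# Rows of the MultiIndexed table are addressed by a key tuple
assert nq.loc[("hamburger", "protein"), "qty"] == 24
```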
By default, copy_to_pandas will drop the columns that are used to populate the index, unless doing so would result in a DataFrame with no columns at all. However, if you wish for no columns to be dropped under any circumstances, you can use the optional drop_pk_columns argument. This is illustrated below.
In [9]:
panDietNoDrop = dataFactory.copy_to_pandas(dietData, drop_pk_columns=False)
panDietNoDrop.categories
Out[9]:
Let's review.
- dataFactory describes the input schema.
- The solve function doesn't know where its input data is coming from. It only knows that it will conform to the schema defined by dataFactory. (All of my examples include at least one assert statement double checking this assumption.)
- copy_to_pandas creates a DataFrame for each table.
This summarizes how a solve function can specify its input data and reformat this data as needed. Let's now examine how solve will return data.
The following code specifies a return schema.
In [10]:
solutionFactory = TicDatFactory(
    parameters = [[], ["totalCost"]],
    buyFood = [["food"], ["qty"]],
    consumeNutrition = [["category"], ["qty"]])
This schema has three tables (parameters, buyFood, consumeNutrition). The parameters table has no primary key fields at all, and just a single data field. (It is assumed that this table will have at most one record). The buyFood table is indexed by the food field, and has a single data field indicating how much of that food is to be consumed. consumeNutrition is similar, except it defines the quantity consumed for each nutrition type.
(As an aside, only the buyFood table is really needed. The total cost and the quantities of nutrition consumed for each nutrition type can be inferred from the consumption of food and the input data. However, it often makes good sense for the solve routine to compute mathematically redundant tables purely for reporting purposes.)
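For instance, the redundant totalCost value is just the cost-weighted sum of the buyFood quantities. With hypothetical cost and quantity numbers (made up here for illustration), the computation is simply:

```python
# Hypothetical input costs (from the foods table) and a solution's
# buyFood quantities.
cost = {"hamburger": 2.49, "ice cream": 1.59, "milk": 0.89}
buy = {"hamburger": 0.6, "ice cream": 2.6, "milk": 7.0}

# totalCost is mathematically redundant: it can always be recomputed
# from the buyFood table and the input cost data.
total_cost = sum(qty * cost[food] for food, qty in buy.items())
# total_cost == 0.6*2.49 + 2.6*1.59 + 7.0*0.89 == 11.858
```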
How can the solve code return an object of this type? The easiest way is to create an empty TicDat object and populate it row by row. This is particularly easy for this schema because all the tables have but one data field. (We're going to skip populating the parameters table because "no primary key" tables are a little different.)
In [11]:
soln = solutionFactory.TicDat()
soln.buyFood["hamburger"] = 0.6045138888888888
soln.buyFood["ice cream"] = 2.591319444444
soln.buyFood["milk"] = 6.9701388888
soln.consumeNutrition["calories"] = 1800.0
soln.consumeNutrition["fat"] = 59.0559
soln.consumeNutrition["protein"] = 91.
soln.consumeNutrition["sodium"] = 1779.
ticdat overrides __setitem__ for single data field tables so as to create the following.
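The mechanics resemble the following simplified sketch (this is an illustration of the idea, not ticdat's actual code): a dict subclass whose __setitem__ wraps a scalar into a row keyed by the table's single data field.

```python
class SingleFieldTable(dict):
    """Sketch of a one-data-field table: assigning a scalar to a
    primary key stores it as {data_field: value}."""
    def __init__(self, data_field):
        super().__init__()
        self._data_field = data_field

    def __setitem__(self, pk, value):
        # Scalars get wrapped; explicit row dicts pass through unchanged.
        if not isinstance(value, dict):
            value = {self._data_field: value}
        super().__setitem__(pk, value)

buyFood = SingleFieldTable("qty")
buyFood["hamburger"] = 0.6045138888888888   # scalar gets wrapped...
buyFood["milk"] = {"qty": 6.9701388888}     # ...dicts pass through

assert buyFood["hamburger"] == {"qty": 0.6045138888888888}
assert buyFood["milk"]["qty"] == 6.9701388888
```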
In [12]:
soln.buyFood
Out[12]:
In [13]:
soln.consumeNutrition
Out[13]:
Here are a couple of other, equivalent ways to populate these seven records.
In [14]:
soln = solutionFactory.TicDat()
soln.buyFood["hamburger"]["qty"] = 0.6045138888888888
soln.buyFood["ice cream"]["qty"] = 2.591319444444
soln.buyFood["milk"]["qty"] = 6.9701388888
soln.consumeNutrition["calories"]["qty"] = 1800.0
soln.consumeNutrition["fat"]["qty"] = 59.0559
soln.consumeNutrition["protein"]["qty"] = 91.
soln.consumeNutrition["sodium"]["qty"] = 1779.
In [15]:
soln = solutionFactory.TicDat()
soln.buyFood["hamburger"] = {"qty" : 0.6045138888888888}
soln.buyFood["ice cream"] = {"qty" : 2.591319444444}
soln.buyFood["milk"] = {"qty" : 6.9701388888}
soln.consumeNutrition["calories"] = {"qty" : 1800.0}
soln.consumeNutrition["fat"] = {"qty" : 59.0559}
soln.consumeNutrition["protein"] = {"qty" : 91.}
soln.consumeNutrition["sodium"] = {"qty" : 1779.}
But wait! You're pandonic! Fair enough. Here are a few ways to initialize a TicDat object with Series and DataFrame objects.
First, let's make two DataFrames for the two output tables.
In [16]:
from pandas import Series, DataFrame
buyDf = DataFrame({"food": ['hamburger', 'ice cream', 'milk'],
                   "qty": [0.6045138888888888, 2.591319444444, 6.9701388888]}).set_index("food")
consumeDf = DataFrame({"category": ["calories", "fat", "protein", "sodium"],
                       "qty": [1800.0, 59.0559, 91., 1779.]}).set_index("category")
As you can see, these DataFrames are consistent with the format expected by solutionFactory.
In [17]:
buyDf
Out[17]:
In [18]:
consumeDf
Out[18]:
As a result, they can be used to create a solutionFactory.TicDat object. Just pass the DataFrame objects as the correct named arguments when creating the TicDat.
In [19]:
soln = solutionFactory.TicDat(buyFood = buyDf, consumeNutrition = consumeDf)
soln.buyFood
Out[19]:
But wait! There's even more. Because the data tables here have but a single data field, they can accept properly formatted Series objects as well.
In [20]:
buyS = buyDf.qty
consumeS = consumeDf.qty
assert isinstance(buyS, Series) and isinstance(consumeS, Series)
soln = solutionFactory.TicDat(buyFood = buyS, consumeNutrition = consumeS)
soln.consumeNutrition
Out[20]:
Thanks for reading!