Using ticdat to wrangle data from the diet problem

This example is based on the classic diet problem. We will demonstrate how the ticdat data library can be used for wrangling problematic input data.

Please note that this notebook references a dietmodel.py demonstration file. That file is in turn based on Gurobi's own dietmodel.py example. The latter uses Python dicts to store the input and output data, whereas the former uses TicDat objects.

First, import TicDatFactory. This class will allow us to define a schema for the diet problem.


In [1]:
from ticdat import TicDatFactory

Next, we create a TicDatFactory object that defines the tables, primary key fields, and data fields for the diet problem.


In [2]:
dataFactory = TicDatFactory(
     categories=[["name"], ["minNutrition", "maxNutrition"]],
     foods=[["name"], ["cost"]],
     nutritionQuantities=[["food", "category"], ["qty"]])

Add the foreign key relationships; the nutritionQuantities table is a child of both the foods and categories tables.


In [3]:
dataFactory.add_foreign_key("nutritionQuantities", "foods", ["food", "name"])

In [4]:
dataFactory.add_foreign_key("nutritionQuantities", "categories", ["category", "name"])

The "default default" of zero is fine everywhere except for maxNutrition.


In [5]:
dataFactory.set_default_value("categories", "maxNutrition", float("inf"))

By default, none of the data fields are typed. We want to add types to our data fields to facilitate data wrangling; that is, to recognize data entries of the wrong type. (TicDat objects will accept data of any type, regardless of field type specifications.)

First, we simply call set_data_type for every data field with the default arguments. This specifies the most common data type - a non-negative, non-infinite number that has no integrality restrictions.


In [6]:
for table, fields in dataFactory.data_fields.items():
    for field in fields:
        dataFactory.set_data_type(table, field)

The maxNutrition field for the categories table is different. Here, we want to accept positive infinity as valid. (The Atkins diet has taught us that we can eat basically as much protein and fat as we want, so long as we don't eat too many calories.)


In [7]:
dataFactory.set_data_type("categories", "maxNutrition", max=float("inf"), 
                                                        inclusive_max=True)

We've specified our tables, fields, foreign keys, and data types. We're ready to wrangle!

The moreFood.xls file has categories, foods, and nutritionQuantities sheets that we think can populate a sound diet model. First, we inspect this file to ensure there are no duplicated records.


In [8]:
dups = dataFactory.xls.find_duplicates("moreFood.xls")

In [9]:
dups


Out[9]:
{'nutritionQuantities': {(u'pizza', u'protein'): 2, (u'salad', u'protein'): 2}}

So we found two duplicate rows. The amount of protein in pizza and the amount of protein in salad have each been specified twice. There is, in general, no magic solution for dealing with this sort of problem. (They call it "data wrangling" because you should expect to exert some effort.) In this instance, it's easy enough to verify that the protein quantity is the same for both duplicates, and thus there is no harm in simply ignoring the duplication.
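
For example, one quick way to verify this is a pandas check along the following lines (a sketch, assuming pandas is installed and the sheet's column headers match the field names we declared):

import pandas as pd

df = pd.read_excel("moreFood.xls", sheet_name="nutritionQuantities")
dup_rows = df[df.duplicated(["food", "category"], keep=False)]
# Every duplicated (food, category) pair should carry exactly one distinct qty;
# if so, the duplication is harmless and we can safely ignore it.
assert dup_rows.groupby(["food", "category"])["qty"].nunique().eq(1).all()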

Next, we read all three tables into a TicDat data object.


In [10]:
dat = dataFactory.xls.create_tic_dat("moreFood.xls")

Some basic Python lets us get a row count for each table. Since 4*37==148, it looks like the nutritionQuantities table is fully populated, but there is an easy ticdat way to be sure.


In [11]:
{t:len(getattr(dat, t)) for t in dataFactory.all_tables}


Out[11]:
{'categories': 4, 'foods': 37, 'nutritionQuantities': 148}

To be fully confident that nutritionQuantities is populated correctly, we will check for foreign key reference failures. Typically, databases will simply reject data records that fail to match the required foreign keys. ticdat is a relational data library, not a database, so it accepts such records and then helps you find any failed matches.


In [12]:
foreign_key_failed = dataFactory.find_foreign_key_failures(dat)

In [13]:
foreign_key_failed


Out[13]:
{ForeignKey(native_table='nutritionQuantities', foreign_table='foods', mapping=FKMapping(native_field='food', foreign_field='name'), cardinality='many-to-one'): ForeignKeyFailures(native_values=(u'Baked Cods',), native_pks=((u'Baked Cods', u'protein'), (u'Baked Cods', u'sodium'), (u'Baked Cods', u'calories'), (u'Baked Cods', u'fat')))}

So there are foreign key failures. Good thing we checked! The foreign_key_failed dictionary doesn't render very nicely here in IPython. That's because it's using namedtuples for both keys and values. We can make it prettier by filtering out the self-documentation.


In [14]:
{tuple(k[:2]):v.native_values for k,v in foreign_key_failed.items()}


Out[14]:
{('nutritionQuantities', 'foods'): (u'Baked Cods',)}

In [15]:
{tuple(k[:2]):v.native_pks for k,v in foreign_key_failed.items()}


Out[15]:
{('nutritionQuantities', 'foods'): ((u'Baked Cods', u'protein'),
  (u'Baked Cods', u'sodium'),
  (u'Baked Cods', u'calories'),
  (u'Baked Cods', u'fat'))}

I think I see what's happening here. When the nutritionQuantities table was trying to match into the foods table, the "Baked Cods" string didn't match any of the food names. "Baked Cods" is the only string that failed to match, but it failed 4 times (once for every nutrition category). Let's poke around a bit to see if we can figure out what "Baked Cods" is supposed to match.


In [16]:
[k for k in dat.foods if "cod" in k.lower()]


Out[16]:
[u'2 Cod Meal',
 u'Baked Cod & Shrimp Platter',
 u'Cod & Shrimp Platter',
 u'Baked Cod']

Aha! I see what's going on. "Baked Cods" is just misspelled everywhere. It's supposed to be "Baked Cod". (Did I already mention that it's a good thing we checked? Without fixing it, the MIP I run below would throw an exception, or, if using an unfrozen TicDat object, simply assign "Baked Cod" the default value of zero for all nutrition categories.) Luckily, this is just the sort of wrangling Python is good at. Let's replace dat with a copy that fixes this misspelling.


In [17]:
dat = dataFactory.TicDat(categories = dat.categories, 
                         foods = dat.foods,
                         nutritionQuantities = { 
                                ("Baked Cod" if f == "Baked Cods" else f,c):v
                                for (f,c),v in dat.nutritionQuantities.items()})

Let's just double-check that this fixes all the foreign key problems. "Only The Paranoid Survive", amirite?


In [18]:
assert not dataFactory.find_foreign_key_failures(dat)

Might as well double-check the row counts.


In [19]:
{t:len(getattr(dat, t)) for t in dataFactory.all_tables}


Out[19]:
{'categories': 4, 'foods': 37, 'nutritionQuantities': 148}

Duplicates, foreign keys, what's left to check? Oh, that's right, the actual data.


In [20]:
data_failed = dataFactory.find_data_type_failures(dat)

In [21]:
data_failed


Out[21]:
{TableField(table='categories', field='maxNutrition'): ValuesPks(bad_values=(u'GRB.INFINITY',), pks=(u'protein',)),
 TableField(table='nutritionQuantities', field='qty'): ValuesPks(bad_values=('',), pks=((u'sunnyside egg', u'sodium'),))}

So there are data field problems as well. As before, the data_failed dictionary doesn't render very nicely in IPython because of the (self-documenting but cosmetically irritating) use of namedtuples. It's easy enough to pretty it up.


In [22]:
{tuple(k):v.bad_values for k,v in data_failed.items()}


Out[22]:
{('categories', 'maxNutrition'): (u'GRB.INFINITY',),
 ('nutritionQuantities', 'qty'): ('',)}

In [23]:
{tuple(k):len(v.pks) for k,v in data_failed.items()}


Out[23]:
{('categories', 'maxNutrition'): 1, ('nutritionQuantities', 'qty'): 1}

There are two types of data field problems. The first is the 'GRB.INFINITY' string in the maxNutrition field of the categories table. (This isn't a weird ticdat.xls bug; ticdat can recognize Excel's representation of infinity. The QA engineer who built the moreFood.xls file put a string based on a gurobipy constant in there hoping to trip me up.)

The second data problem is the blank string in the qty field of the nutritionQuantities table. Both of these problems show up for only one record apiece, so I feel pretty confident they are just typos of some sort. The sensible thing to do here is to replace the 'GRB.INFINITY' string with positive infinity, and to replace the blank string with the number zero. That is equivalent to replacing the bad data values with the appropriate default values. (Remember, the default default is zero for all data fields, and we overrode this for the maxNutrition field earlier.)

Lucky for us, there is a ticdat function just for this purpose.


In [24]:
dataFactory.replace_data_type_failures(dat)


Out[24]:
td:('nutritionQuantities', 'foods', 'categories')

In [25]:
assert not dataFactory.find_data_type_failures(dat)

Might as well inspect the replacement values populated by replace_data_type_failures.


In [26]:
{dat.categories[f]['maxNutrition'] for f in data_failed['categories', 'maxNutrition'].pks}


Out[26]:
{inf}

In [27]:
{dat.nutritionQuantities[f,c]['qty'] for f,c in 
                                     data_failed['nutritionQuantities', 'qty'].pks}


Out[27]:
{0}

Now that the data is all cleaned up, we can solve the model with a MIP. We happen to have the dietmodel.py file handy right here. As a well-modularized solve engine, this file publishes both its input and output schema along with a solve function. However, we know our dat object is consistent with the dietmodel input schema, so we can go straight to solve.


In [28]:
from dietmodel import solve
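
Since dietmodel publishes its input schema, we could sanity-check that the schema we built here matches the one solve expects. A minimal sketch, assuming dietmodel exposes its input schema as a TicDatFactory attribute named dataFactory (a hypothetical name; use whatever dietmodel actually publishes):

import dietmodel
# schema() returns a plain dict describing tables, primary key fields, and data fields
assert dietmodel.dataFactory.schema() == dataFactory.schema()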

Now, I'm going to be a little fancy here and freeze my dat object before passing it to solve. I don't want solve to even try to edit the input data; all solve should do is read from dat. After I freeze dat, any attempt to write to dat (deliberate or accidental) will throw an exception. If solve runs without throwing an exception, I'll know there weren't any such attempts to edit the input data.
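
To see concretely what freezing buys us, here is a small illustration (not part of the original notebook run) of a frozen TicDat rejecting a write:

frozen = dataFactory.freeze_me(dat)
try:
    frozen.foods["bacon"]["cost"] = 0  # deliberate write attempt
except Exception as e:  # ticdat raises on writes to frozen data; the exact exception type may vary
    print("write rejected:", e)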


In [29]:
soln = solve(dataFactory.freeze_me(dat))


Optimize a model with 4 rows, 41 columns and 151 nonzeros
Presolve removed 0 rows and 3 columns
Presolve time: 0.00s
Presolved: 4 rows, 38 columns, 148 nonzeros

Iteration    Objective       Primal Inf.    Dual Inf.      Time
       0    0.0000000e+00   3.179688e+00   0.000000e+00      0s
       4    6.8838775e+00   0.000000e+00   0.000000e+00      0s

Solved in 4 iterations and 0.00 seconds
Optimal objective  6.883877480e+00

Good, no exception thrown, even when running on frozen input. We've got some nice diagnostic output from gurobipy, which is also reassuring. Let's see what solution was created.


In [30]:
soln.buyFood


Out[30]:
{u'bacon': _td:{'qty': 1.0210393585062434},
 u'boiled egg': _td:{'qty': 2.757684125701712},
 u'hot dog': _td:{'qty': 2.1451544549453345}}

Ok, bacon, eggs, and hot dogs. It's an Atkins paradise. Say no more, optimizer, this is a diet I can live with!
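
If we wanted to archive this solution or hand it off, ticdat can write it back out in the same formats it reads. A minimal sketch, assuming dietmodel publishes its solution schema as a TicDatFactory named solutionFactory (a guess at the name; use whatever dietmodel actually exposes):

from dietmodel import solutionFactory
# writes one sheet per solution table (e.g. buyFood) to an .xls file
solutionFactory.xls.write_file(soln, "dietSolution.xls", allow_overwrite=True)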

Before we conclude, I'd like to emphasize a point. The somewhat casual way we ignored duplicates and made best-guess replacements for the various data integrity problems we found in moreFood.xls was intended as a demonstration of the data wrangling power of Python and ticdat. It wasn't intended as professional guidance for how to run a data-science-based consultancy.

In real life, data integrity problems like this generally represent some sort of breakdown in communication between different professionals. Perhaps someone described the nature of the input data incorrectly to the MIP practitioner. Or perhaps the MIP practitioner made some incorrect assumptions based on what was, in hindsight, a somewhat ambiguous conversation.

Regardless, it's critically important to address these sorts of problems as early as possible in the development cycle. The last thing you want to do is to go live with a custom-built analytics component automatically pulling data from a system-of-record, only to discover that the entire component has been built around a faulty assumption. It's axiomatic to good software development practice that the farther upstream a bug is caught, the less harm it causes. We believe ticdat can play an important role in catching these sorts of bugs at the earliest possible stage.