This example is based on the classic diet problem. We will demonstrate how the ticdat
data library can be used for wrangling problematic input data.
Please note that this notebook references a dietmodel.py demonstration file. This file is in turn based on Gurobi's own dietmodel.py example. The latter uses Python dicts to store the input and output data, whereas the former uses a TicDat object.
First, import a TicDatFactory. This will allow us to generate a schema for the diet problem.
In [1]:
from ticdat import TicDatFactory
Next, we create a TicDatFactory object that defines the tables, primary key fields, and data fields for the diet problem.
In [2]:
dataFactory = TicDatFactory(
    categories=[["name"], ["minNutrition", "maxNutrition"]],
    foods=[["name"], ["cost"]],
    nutritionQuantities=[["food", "category"], ["qty"]])
Add the foreign key relationships - the nutritionQuantities table is a child of both the foods and categories tables.
In [3]:
dataFactory.add_foreign_key("nutritionQuantities", "foods", ["food", "name"])
In [4]:
dataFactory.add_foreign_key("nutritionQuantities", "categories", ["category", "name"])
The default default of zero is fine everywhere, except for maxNutrition.
In [5]:
dataFactory.set_default_value("categories", "maxNutrition", float("inf"))
By default, none of the data fields are typed. We want to add types to our data fields to facilitate data wrangling - that is, to recognize data entries of the wrong type. (TicDat objects will accept data of any type, regardless of field type specifications.)
First, we simply call set_data_type for every data field with the default arguments. This specifies the most common data type - a non-negative, finite number with no integrality restrictions.
In [6]:
for table, fields in dataFactory.data_fields.items():
    for field in fields:
        dataFactory.set_data_type(table, field)
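To make the default type concrete, here is a plain-Python sketch of the check it implies (the predicate name is my own, not part of ticdat):

```python
import math

def fits_default_type(x):
    # ticdat's default data type, restated: a non-negative, finite
    # number with no integrality restriction.
    return (isinstance(x, (int, float)) and not isinstance(x, bool)
            and x >= 0 and math.isfinite(x))

assert fits_default_type(2.5) and fits_default_type(0)
assert not fits_default_type(-1)              # negatives rejected
assert not fits_default_type(float("inf"))    # infinity rejected by default
assert not fits_default_type("GRB.INFINITY")  # strings are never valid numbers
```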
The maxNutrition field of the categories table is different. Here, we want to accept positive infinity as valid. (The Atkins diet has taught us that we can eat essentially as much protein and fat as we want, so long as we don't eat too many calories.)
In [7]:
dataFactory.set_data_type("categories", "maxNutrition", max=float("inf"),
                          inclusive_max=True)
We've specified our tables, fields, foreign keys, and data types. We're ready to wrangle!
The moreFood.xls file has categories, foods, and nutritionQuantities sheets that we think can populate a sound diet model. First, we inspect this file to ensure there are no duplicated records.
In [8]:
dups = dataFactory.xls.find_duplicates("moreFood.xls")
In [9]:
dups
Out[9]:
So we found two duplicate rows. The amount of protein in pizza and the amount of protein in salad have each been specified twice. There is, in general, no magic solution for dealing with this sort of problem. (They call it "data wrangling" because you should expect to exert some effort.) In this instance, it's easy enough to verify that the protein quantity is the same for both copies of each duplicate, and thus there is no harm in simply ignoring the duplication.
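For illustration, here is the kind of plain-Python check that justifies ignoring the duplicates. The raw rows and qty values below are invented stand-ins, not the actual contents of moreFood.xls; the point is the pattern: group the duplicated keys and confirm every copy agrees before discarding the extras.

```python
from collections import defaultdict

# Hypothetical raw rows from a nutritionQuantities sheet (invented values).
raw_rows = [
    ("pizza", "protein", 15.0),
    ("pizza", "protein", 15.0),   # duplicate key, same qty
    ("salad", "protein", 3.1),
    ("salad", "protein", 3.1),    # duplicate key, same qty
    ("hamburger", "protein", 24.0),
]

values_by_key = defaultdict(set)
for food, category, qty in raw_rows:
    values_by_key[food, category].add(qty)

# Duplicates are harmless only if every copy carries the same value.
conflicts = {k: v for k, v in values_by_key.items() if len(v) > 1}
assert not conflicts  # safe to ignore the duplication
```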
Next, we read all three tables into a TicDat data object.
In [10]:
dat = dataFactory.xls.create_tic_dat("moreFood.xls")
Some basic Python lets us get a row count for each table. Since 4*37 == 148, it looks like the nutritionQuantities table is fully populated, but there is an easy ticdat way to be sure.
In [11]:
{t:len(getattr(dat, t)) for t in dataFactory.all_tables}
Out[11]:
To be fully confident that nutritionQuantities is populated correctly, we will check for foreign key reference failures. Typically, databases will simply reject data records that fail to match the required foreign keys. ticdat is a relational data library, not a database, so it accepts such records and then helps you find any failed matches.
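At its core, a foreign key check is just a membership test against the parent table. A toy sketch (the tables here are invented for illustration, not the moreFood.xls data):

```python
# Toy parent table and child rows, invented for illustration.
foods = {"pizza", "salad", "hamburger"}
nutrition_rows = [("pizza", "protein"),
                  ("tofu", "protein"),
                  ("salad", "fat")]

# A child row fails its foreign key if the food has no parent record.
failures = [(f, c) for f, c in nutrition_rows if f not in foods]
assert failures == [("tofu", "protein")]
```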
In [12]:
foreign_key_failed = dataFactory.find_foreign_key_failures(dat)
In [13]:
foreign_key_failed
Out[13]:
So there are foreign key failures. Good thing we checked! The foreign_key_failed dictionary doesn't render very nicely here in IPython, because it uses a namedtuple for both keys and values. We can make it prettier by filtering out the self-documentation.
In [14]:
{tuple(k[:2]):v.native_values for k,v in foreign_key_failed.items()}
Out[14]:
In [15]:
{tuple(k[:2]):v.native_pks for k,v in foreign_key_failed.items()}
Out[15]:
I think I see what's happening here. When the nutritionQuantities table tried to match into the foods table, the "Baked Cods" string didn't match any of the food names. "Baked Cods" is the only string that failed to match, but it failed four times (once for every nutrition category). Let's poke around a bit to see if we can figure out what "Baked Cods" is supposed to match.
In [16]:
[k for k in dat.foods if "cod" in k.lower()]
Out[16]:
Aha! I see what's going on. "Baked Cods" is just misspelled everywhere; it's supposed to be "Baked Cod". (Did I mention already that it's a good thing we checked? Without fixing it, the MIP I run below would throw an exception, or, if using an unfrozen TicDat object, would simply assign "Baked Cod" the default value of zero for all nutrition categories.) Luckily, this is just the sort of wrangling Python is good at. Let's replace dat with a copy that fixes the misspelling.
In [17]:
dat = dataFactory.TicDat(categories=dat.categories,
                         foods=dat.foods,
                         nutritionQuantities={
                             ("Baked Cod" if f == "Baked Cods" else f, c): v
                             for (f, c), v in dat.nutritionQuantities.items()})
Let's just double-check that this fixes all the foreign key problems. "Only the paranoid survive," amirite?
In [18]:
assert not dataFactory.find_foreign_key_failures(dat)
Might as well double check the row counts.
In [19]:
{t:len(getattr(dat, t)) for t in dataFactory.all_tables}
Out[19]:
Duplicates, foreign keys, what's left to check? Oh, that's right, the actual data.
In [20]:
data_failed = dataFactory.find_data_type_failures(dat)
In [21]:
data_failed
Out[21]:
So there are data field problems as well. As before, the data_failed dictionary doesn't render very nicely in IPython because of the (self-documenting but cosmetically irritating) use of namedtuple. Easy enough to pretty it up.
In [22]:
{tuple(k):v.bad_values for k,v in data_failed.items()}
Out[22]:
In [23]:
{tuple(k):len(v.pks) for k,v in data_failed.items()}
Out[23]:
There are two types of data field problems. The first is the 'GRB.INFINITY' string in the maxNutrition field of the categories table. (This isn't a weird ticdat.xls bug - ticdat can recognize infinity in Excel. The QA engineer who built the moreFood.xls file put a string based on a gurobipy constant in there, hoping to trip me up.)
The second data problem is the blank string in the qty field of the nutritionQuantities table. Each of these problems shows up for only one record apiece, so I feel pretty confident they are just typos of some sort. The sensible thing to do here is to replace the 'GRB.INFINITY' string with positive infinity, and the blank string with the number zero. That is equivalent to replacing bad data values with the appropriate default values. (Remember, the default default is zero for all data fields, and we overrode this for the maxNutrition field earlier.)
Lucky for us, there is a ticdat function just for this purpose.
In [24]:
dataFactory.replace_data_type_failures(dat)
Out[24]:
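As a rough plain-Python sketch of what this replacement amounts to (the helper function and the defaults dict below are mine; ticdat's real logic consults each field's type and default settings):

```python
# Field defaults mirroring the schema above: infinity for maxNutrition,
# the default default of zero for qty.
defaults = {"maxNutrition": float("inf"), "qty": 0}

def replace_bad_value(value, field):
    # Keep values that are genuine numbers; fall back to the
    # field's default otherwise.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value
    return defaults[field]

assert replace_bad_value("GRB.INFINITY", "maxNutrition") == float("inf")
assert replace_bad_value("", "qty") == 0
assert replace_bad_value(2.5, "qty") == 2.5
```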
In [25]:
assert not dataFactory.find_data_type_failures(dat)
Might as well inspect the replacement values populated by replace_data_type_failures.
In [26]:
{dat.categories[f]['maxNutrition'] for f in data_failed['categories', 'maxNutrition'].pks}
Out[26]:
In [27]:
{dat.nutritionQuantities[f,c]['qty'] for f,c in
data_failed['nutritionQuantities', 'qty'].pks}
Out[27]:
Now that the data is all cleaned up, we can solve it with a MIP. We happen to have the dietmodel.py file handy right here. As a well-modularized solve engine, this file publishes its input and output schemas along with a solve function. However, we know our dat object is consistent with the dietmodel input schema, so we can go straight to solve.
In [28]:
from dietmodel import solve
Now, I'm going to be a little fancy here and freeze my dat object before passing it to solve. I don't want solve to even try to edit the input data; all solve should do is read from dat. After I freeze dat, any attempt to write to it (deliberate or accidental) will throw an exception. If solve runs without throwing an exception, I'll know there weren't any such attempts to edit the input data.
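The freeze idea is analogous to Python's built-in read-only mapping view. This isn't how freeze_me is implemented, just an illustration of reads-allowed, writes-forbidden (example data invented):

```python
from types import MappingProxyType

# A read-only view over a plain dict (invented example data).
costs = {"hamburger": 2.49, "pizza": 1.99}
frozen_costs = MappingProxyType(costs)

assert frozen_costs["pizza"] == 1.99  # reads work as usual

write_blocked = False
try:
    frozen_costs["salad"] = 2.49      # any write raises TypeError
except TypeError:
    write_blocked = True
assert write_blocked
```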
In [29]:
soln = solve(dataFactory.freeze_me(dat))
Good - no exception thrown, even when running on frozen input. We've also got some nice diagnostic output from gurobipy, which is reassuring. Let's see what solution was created.
In [30]:
soln.buyFood
Out[30]:
Ok, bacon, eggs, and hot dogs. It's an Atkins paradise. Say no more, optimizer, this is a diet I can live with!
Before we conclude, I'd like to emphasize a point. The somewhat casual way we ignored duplicates and made best-guess replacements for the various data integrity problems we found in moreFood.xls was intended as a demonstration of the data wrangling power of Python and ticdat. It wasn't intended as professional guidance for how to run a data-science-based consultancy.
In real life, data integrity problems like this generally represent some sort of breakdown in communication between different professionals. Perhaps someone described the nature of the input data incorrectly to the MIP practitioner. Or perhaps the MIP practitioner made some incorrect assumptions based on what was, in hindsight, a somewhat ambiguous conversation.
Regardless, it's critically important to address these sorts of problems as early as possible in the development cycle. The last thing you want to do is go live with a custom-built analytics component automatically pulling data from a system-of-record, only to discover that the entire component has been built around a faulty assumption. It's axiomatic to good software development practice that the farther upstream a bug is caught, the less harm it causes. We believe ticdat can play an important role in catching these sorts of bugs at the earliest possible stage.