pandas and ticdat

pandas is arguably the most successful data library in history, not just for Python but across all languages. That said, in the context of prescriptive analytics in general, and Mixed-Integer Programming in particular, there are a few reasons to use pandas.DataFrame with caution. For one, DataFrame can present a level of complexity that novice Python developers find intimidating. For another, there are common MIP idioms (such as an empty slice returning an empty set) that aren't supported by DataFrame. In fact, such "iterate over indices and capture slices of other tables" coding strategies are unlikely to be considered pandonic (i.e., idiomatic pandas code). Last (and probably not least), a DataFrame might not be the most computationally efficient data structure if your primary method of accessing data is via primary-key based slices.

In sum, as Jeff says, "iterating over values is not idiomatic at all and completely non performant". This is not a ringing endorsement for a data science community that is used to expressing constraints precisely by iterating over values - either as part of an inner summation or in order to generate a collection of similar constraints.
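
To make the contrast concrete, here is a minimal sketch (plain Python, with hypothetical data) of the "iterate over keys and capture slices" style that MIP constraint-building code tends to use.

```python
# Hypothetical cost data keyed by (commodity, source, destination),
# mimicking the dict-of-dicts style a TicDat object presents.
cost = {
    ('Pencils', 'Denver', 'Boston'): 40,
    ('Pencils', 'Detroit', 'Boston'): 10,
    ('Pens', 'Denver', 'Boston'): 60,
}

# A typical MIP idiom: for each destination, capture the slice of
# (commodity, source, destination) keys that ship there. An absent
# destination simply yields an empty slice rather than an error.
def inbound_keys(destination):
    return [k for k in cost if k[2] == destination]

# Such slices typically feed an inner summation inside a constraint.
total_inbound_cost = sum(cost[k] for k in inbound_keys('Boston'))
```

This sort of loop is natural in optimization code, but, per the quote above, it is exactly the access pattern pandas discourages.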

Regardless, ticdat wants to stand in solidarity with pandas. We all share the same goal - to unify data science with a common high level programming language. To that end, ticdat supports pandas in two ways - either by creating TicDat objects from properly formatted DataFrame objects, or by creating a DataFrame-based copy of a TicDat object.

This notebook focuses on the latter technique. Here, we dive into the nuances of the copy_to_pandas routine.

To begin, let's use the ticdat testing section to get some data tables. (As always, the testing section is not appropriate for production code, but is fine for demonstration code like this.)


In [1]:
import ticdat.testing.testutils as tdu
from ticdat import TicDatFactory
tdf = TicDatFactory(**tdu.netflowSchema())
dat = tdf.copy_tic_dat(tdu.netflowData())

Now let's look at a couple of different types of tables, and see how copy_to_pandas handles different types of data. Here is the "cost" table, both in dict-of-dicts format and as a DataFrame.


In [2]:
dat.cost


Out[2]:
{('Pencils', 'Denver', 'Boston'): _td:{'cost': 40},
 ('Pencils', 'Denver', 'New York'): _td:{'cost': 40},
 ('Pencils', 'Denver', 'Seattle'): _td:{'cost': 30},
 ('Pencils', 'Detroit', 'Boston'): _td:{'cost': 10},
 ('Pencils', 'Detroit', 'New York'): _td:{'cost': 20},
 ('Pencils', 'Detroit', 'Seattle'): _td:{'cost': 60},
 ('Pens', 'Denver', 'Boston'): _td:{'cost': 60},
 ('Pens', 'Denver', 'New York'): _td:{'cost': 70},
 ('Pens', 'Denver', 'Seattle'): _td:{'cost': 30},
 ('Pens', 'Detroit', 'Boston'): _td:{'cost': 20},
 ('Pens', 'Detroit', 'New York'): _td:{'cost': 20},
 ('Pens', 'Detroit', 'Seattle'): _td:{'cost': 80}}

In [3]:
df_cost = tdf.copy_to_pandas(dat).cost
df_cost


Out[3]:
cost
commodity source destination
Pencils Denver Boston 40
New York 40
Seattle 30
Detroit Boston 10
New York 20
Seattle 60
Pens Denver Boston 60
New York 70
Seattle 30
Detroit Boston 20
New York 20
Seattle 80

It's important to emphasize what's happening. The only column in the DataFrame is "cost". This is because the cost table only has a single data field (also named "cost"). The "commodity", "source", "destination" columns are all primary key columns. To avoid redundant data in the DataFrame, copy_to_pandas includes this information only as part of the index of the DataFrame.


In [4]:
df_cost.index


Out[4]:
MultiIndex(levels=[[u'Pencils', u'Pens'], [u'Denver', u'Detroit'], [u'Boston', u'New York', u'Seattle']],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=[u'commodity', u'source', u'destination'])
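
Since the primary keys live in the index, primary-key based slices come via .loc on the MultiIndex. Here is a self-contained sketch (rebuilding a small piece of the cost DataFrame by hand, so the example doesn't depend on ticdat; the real frame comes from tdf.copy_to_pandas(dat).cost).

```python
import pandas as pd

# Hand-built subset of the cost table, indexed the way copy_to_pandas does it.
idx = pd.MultiIndex.from_tuples(
    [('Pencils', 'Denver', 'Boston'),
     ('Pencils', 'Denver', 'Seattle'),
     ('Pens', 'Denver', 'Boston')],
    names=['commodity', 'source', 'destination'])
df_cost = pd.DataFrame({'cost': [40, 30, 60]}, index=idx)

# Slice by the leading primary-key fields: all rows for Pencils out of Denver.
# The result is a DataFrame indexed by the remaining level, 'destination'.
pencils_from_denver = df_cost.loc[('Pencils', 'Denver')]
```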

This is also a good time to point out that asking "is this row present" is completely different for the "dict-of-dicts" representation than for the DataFrame representation. Choices are good - use whatever version will make your code more readable. Personally, I'd be a little worried that someone might forget to include .index when looking for row presence in pandas, but I also think an experienced pandas developer would be pretty unlikely to make that mistake.


In [5]:
('Pens', 'Denver', 'Seattle') in df_cost


Out[5]:
False

In [6]:
('Pens', 'Denver', 'Seattle') in df_cost.index


Out[6]:
True

In [7]:
('Pens', 'Denver', 'Seattle') in dat.cost


Out[7]:
True

Now, let's look at the nodes table.


In [8]:
dat.nodes


Out[8]:
{'Boston': {}, 'Denver': {}, 'Detroit': {}, 'New York': {}, 'Seattle': {}}

In [9]:
tdf.copy_to_pandas(dat).nodes


Out[9]:
name
name
Boston Boston
Denver Denver
Detroit Detroit
New York New York
Seattle Seattle

Wait, why is the "name" information duplicated here? Shouldn't the primary key just be part of the index? The answer is that, by default, it depends. If the table has no data fields, then copy_to_pandas will, by default, duplicate the primary key data in both the index of the DataFrame and in the column(s) of the DataFrame. In other words, by default, copy_to_pandas will duplicate data only to avoid creating "no column" DataFrames. But you can override this behavior however you want by passing a non-None boolean as the drop_pk_columns argument.


In [10]:
tdf.copy_to_pandas(dat, drop_pk_columns=True).nodes


Out[10]:
name
Boston
Denver
Detroit
New York
Seattle

This is a DataFrame with an .index but no column. It seems like a strange thing to have, but you can create one just fine if you want.
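
For instance, a column-free DataFrame can be built directly with plain pandas (a quick sketch, independent of ticdat):

```python
import pandas as pd

# A DataFrame with a named index and zero columns -- unusual, but legal.
nodes = pd.DataFrame(index=pd.Index(
    ['Boston', 'Denver', 'Detroit', 'New York', 'Seattle'], name='name'))
```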

Here I go back and create a version of the "cost" table that doesn't drop the primary key fields from the columns of the DataFrame.


In [11]:
tdf.copy_to_pandas(dat, drop_pk_columns=False).cost


Out[11]:
commodity source destination cost
commodity source destination
Pencils Denver Boston Pencils Denver Boston 40
New York Pencils Denver New York 40
Seattle Pencils Denver Seattle 30
Detroit Boston Pencils Detroit Boston 10
New York Pencils Detroit New York 20
Seattle Pencils Detroit Seattle 60
Pens Denver Boston Pens Denver Boston 60
New York Pens Denver New York 70
Seattle Pens Denver Seattle 30
Detroit Boston Pens Detroit Boston 20
New York Pens Detroit New York 20
Seattle Pens Detroit Seattle 80

This sort of seems like it was created by the Department of Redundancy Department. However, you might find it handy for performing .join operations, as I did when creating a truly pandonic implementation of a multi-commodity network flow model here.
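
To illustrate why retaining the key columns can pay off, here is a self-contained sketch with made-up data. Because the primary-key fields exist as ordinary columns (the drop_pk_columns=False style), combining tables on a shared key is a direct merge, with no index gymnastics required. (The table names and numbers below are hypothetical, not from the netflow data above.)

```python
import pandas as pd

# Hypothetical arcs table, with the key fields retained as columns.
arcs = pd.DataFrame({
    'source': ['Denver', 'Denver', 'Detroit'],
    'destination': ['Boston', 'Seattle', 'Boston'],
    'capacity': [100, 120, 80],
})

# Hypothetical per-destination demand table.
demand = pd.DataFrame({'destination': ['Boston', 'Seattle'],
                       'demand': [60, 30]})

# The shared 'destination' column makes the combination a one-liner.
merged = arcs.merge(demand, on='destination')
```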