pandas and ticdat

pandas is arguably the most successful data library in history, not just for Python but across all languages. That said, in the context of prescriptive analytics in general, and Mixed-Integer Programming in particular, there are a few reasons to use pandas.DataFrame with caution. For one, DataFrame can present a level of complexity that novice Python developers find intimidating. For another, there are common MIP idioms (such as an empty slice returning an empty set) that aren't supported by DataFrame. In fact, such "iterate over indices and capture slices of other tables" coding strategies are unlikely to be considered pandonic (i.e., idiomatic pandas code). Last (and probably not least), a DataFrame might not be the most computationally efficient data structure if your primary method of accessing data is via primary-key based slices.

In sum, as Jeff says, "iterating over values is not idiomatic at all and completely non performant". This is not a ringing endorsement for a data science community that is used to expressing constraints precisely by iterating over values - either as part of an inner summation or in order to generate a collection of similar constraints.
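
To make the contrast concrete, here is a minimal sketch (plain Python, with hypothetical data) of the "iterate over keys and capture slices" style that MIP constraint-building code tends to use.

```python
# Hypothetical cost data keyed by (commodity, source, destination),
# mimicking the dict-of-dicts style a TicDat object presents.
cost = {
    ('Pencils', 'Denver', 'Boston'): 40,
    ('Pencils', 'Detroit', 'Boston'): 10,
    ('Pens', 'Denver', 'Boston'): 60,
}

# A typical MIP idiom: for each destination, capture the slice of
# (commodity, source, destination) keys that ship there. An absent
# destination simply yields an empty slice rather than an error.
def inbound_keys(destination):
    return [k for k in cost if k[2] == destination]

# Such slices typically feed an inner summation inside a constraint.
total_inbound_cost = sum(cost[k] for k in inbound_keys('Boston'))
```

This sort of loop is natural in optimization code, but, per the quote above, it is exactly the access pattern pandas discourages.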

Regardless, ticdat wants to stand in solidarity with pandas. We all share the same goal - to unify data science with a common high level programming language. To that end, ticdat supports pandas in two ways - either by creating TicDat objects from properly formatted DataFrame objects, or by creating a DataFrame-based copy of a TicDat object.

This notebook focuses on the latter technique. Here, we dive into the nuances of the copy_to_pandas routine.

To begin, let's use the ticdat testing section to get some data tables. (As always, the testing section is not appropriate for production code, but is fine for demonstration code like this.)


In [1]:
import ticdat.testing.testutils as tdu
from ticdat import TicDatFactory
tdf = TicDatFactory(**tdu.netflowSchema())
dat = tdf.copy_tic_dat(tdu.netflowData())

Now let's look at a couple of different types of tables, and see how copy_to_pandas handles different types of data. Here is the "cost" table, both in dict-of-dicts format and as a DataFrame.


In [2]:
dat.cost


Out[2]:
{('Pencils', 'Denver', 'Boston'): _td:{'cost': 40},
 ('Pencils', 'Denver', 'New York'): _td:{'cost': 40},
 ('Pencils', 'Denver', 'Seattle'): _td:{'cost': 30},
 ('Pencils', 'Detroit', 'Boston'): _td:{'cost': 10},
 ('Pencils', 'Detroit', 'New York'): _td:{'cost': 20},
 ('Pencils', 'Detroit', 'Seattle'): _td:{'cost': 60},
 ('Pens', 'Denver', 'Boston'): _td:{'cost': 60},
 ('Pens', 'Denver', 'New York'): _td:{'cost': 70},
 ('Pens', 'Denver', 'Seattle'): _td:{'cost': 30},
 ('Pens', 'Detroit', 'Boston'): _td:{'cost': 20},
 ('Pens', 'Detroit', 'New York'): _td:{'cost': 20},
 ('Pens', 'Detroit', 'Seattle'): _td:{'cost': 80}}

In [3]:
df_cost = tdf.copy_to_pandas(dat).cost
df_cost


Out[3]:
cost
commodity source destination
Pencils Denver Boston 40
New York 40
Seattle 30
Detroit Boston 10
New York 20
Seattle 60
Pens Denver Boston 60
New York 70
Seattle 30
Detroit Boston 20
New York 20
Seattle 80

It's important to emphasize what's happening. The only column in the DataFrame is "cost". This is because the cost table only has a single data field (also named "cost"). The "commodity", "source", "destination" columns are all primary key columns. To avoid redundant data in the DataFrame, copy_to_pandas includes this information only as part of the index of the DataFrame.


In [4]:
df_cost.index


Out[4]:
MultiIndex(levels=[[u'Pencils', u'Pens'], [u'Denver', u'Detroit'], [u'Boston', u'New York', u'Seattle']],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=[u'commodity', u'source', u'destination'])
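
Since the primary keys live in the index, primary-key based slices come via .loc on the MultiIndex. Here is a self-contained sketch (rebuilding a small piece of the cost DataFrame by hand, so the example doesn't depend on ticdat; the real frame comes from tdf.copy_to_pandas(dat).cost).

```python
import pandas as pd

# Hand-built subset of the cost table, indexed the way copy_to_pandas does it.
idx = pd.MultiIndex.from_tuples(
    [('Pencils', 'Denver', 'Boston'),
     ('Pencils', 'Denver', 'Seattle'),
     ('Pens', 'Denver', 'Boston')],
    names=['commodity', 'source', 'destination'])
df_cost = pd.DataFrame({'cost': [40, 30, 60]}, index=idx)

# Slice by the leading primary-key fields: all rows for Pencils out of Denver.
# The result is a DataFrame indexed by the remaining level, 'destination'.
pencils_from_denver = df_cost.loc[('Pencils', 'Denver')]
```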

This is also a good time to point out that asking "is this row present" is completely different for the "dict-of-dicts" representation than for the DataFrame representation. Choices are good - use whatever version will make your code more readable. Personally, I'd be a little worried that someone might forget to include .index when looking for row presence in pandas, but I also think an experienced pandas developer would be pretty unlikely to make that mistake.


In [5]:
('Pens', 'Denver', 'Seattle') in df_cost


Out[5]:
False

In [6]:
('Pens', 'Denver', 'Seattle') in df_cost.index


Out[6]:
True

In [7]:
('Pens', 'Denver', 'Seattle') in dat.cost


Out[7]:
True

Now, let's look at the nodes table.


In [8]:
dat.nodes


Out[8]:
{'Boston': {}, 'Denver': {}, 'Detroit': {}, 'New York': {}, 'Seattle': {}}

In [9]:
tdf.copy_to_pandas(dat).nodes


Out[9]:
name
name
Boston Boston
Denver Denver
Detroit Detroit
New York New York
Seattle Seattle

Wait, why is the "name" information duplicated here? Shouldn't the primary key just be part of the index? The answer is that, by default, it depends. If the table has no data fields, then copy_to_pandas will, by default, duplicate the primary key data in both the index of the DataFrame and in the column(s) of the DataFrame. In other words, by default, copy_to_pandas will duplicate data only to avoid creating "no column" DataFrames. But you can override this behavior however you want by passing a non-None boolean as the drop_pk_columns argument.


In [10]:
tdf.copy_to_pandas(dat, drop_pk_columns=True).nodes


Out[10]:
name
Boston
Denver
Detroit
New York
Seattle

This is a DataFrame with an .index but no column. It seems like a strange thing to have, but you can create one just fine if you want.
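
For instance, a column-free DataFrame can be built directly with plain pandas (a quick sketch, independent of ticdat):

```python
import pandas as pd

# A DataFrame with a named index and zero columns -- unusual, but legal.
nodes = pd.DataFrame(index=pd.Index(
    ['Boston', 'Denver', 'Detroit', 'New York', 'Seattle'], name='name'))
```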

Here I go back and create a version of the "cost" table that doesn't drop the primary key fields from the columns of the DataFrame.


In [11]:
tdf.copy_to_pandas(dat, drop_pk_columns=False).cost


Out[11]:
commodity source destination cost
commodity source destination
Pencils Denver Boston Pencils Denver Boston 40
New York Pencils Denver New York 40
Seattle Pencils Denver Seattle 30
Detroit Boston Pencils Detroit Boston 10
New York Pencils Detroit New York 20
Seattle Pencils Detroit Seattle 60
Pens Denver Boston Pens Denver Boston 60
New York Pens Denver New York 70
Seattle Pens Denver Seattle 30
Detroit Boston Pens Detroit Boston 20
New York Pens Detroit New York 20
Seattle Pens Detroit Seattle 80

This sort of seems like it was created by the Department of Redundancy Department. However, you might find it handy for performing .join operations, as I did when creating a truly pandonic implementation of a multi-commodity network flow model here.
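
To illustrate why retaining the key columns can pay off, here is a self-contained sketch with made-up data. Because the primary-key fields exist as ordinary columns (the drop_pk_columns=False style), combining tables on a shared key is a direct merge, with no index gymnastics required. (The table names and numbers below are hypothetical, not from the netflow data above.)

```python
import pandas as pd

# Hypothetical arcs table, with the key fields retained as columns.
arcs = pd.DataFrame({
    'source': ['Denver', 'Denver', 'Detroit'],
    'destination': ['Boston', 'Seattle', 'Boston'],
    'capacity': [100, 120, 80],
})

# Hypothetical per-destination demand table.
demand = pd.DataFrame({'destination': ['Boston', 'Seattle'],
                       'demand': [60, 30]})

# The shared 'destination' column makes the combination a one-liner.
merged = arcs.merge(demand, on='destination')
```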