Slicing with pandas, gurobipy.tuplelist, and O(n) slicing

Run a little Python script that sets up the performance comparisons.


In [1]:
run prep_for_different_slicings.py

The slicing will be over small, medium, and large tables.


In [2]:
[len(getattr(td, "childTable")) for td in (smallTd, medTd, bigTd)]


Out[2]:
[1200, 31800, 270000]

We will run three series of four tests each.

Each series tests

  1. slicing with .sloc and pandas
  2. slicing with gurobipy.tuplelist
  3. slicing with ticdat.Slicer (with the gurobipy enhancement disabled)
  4. O(n) slicing
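The prep script that defines these helpers isn't shown here, but the essential difference between strategies 2–4 can be sketched in plain Python. Roughly speaking, `gurobipy.tuplelist` and `ticdat.Slicer` build a hash-based index over the keys, so each slice is a dictionary lookup rather than a full scan. The names below (`o_n_slice`, `indexed_slice`) are illustrative stand-ins, not the actual benchmark code:

```python
from collections import defaultdict

# Toy "child table": rows keyed by two parent fields plus a data value.
rows = [(p1, p2, p1 * p2) for p1 in range(100) for p2 in range(100)]

def o_n_slice(rows, p1):
    # O(n) slicing: scan every row to find the matching ones.
    return [r for r in rows if r[0] == p1]

# tuplelist/Slicer-style slicing: build a hash index once, after which
# each slice costs a dictionary lookup plus the size of the result.
index = defaultdict(list)
for r in rows:
    index[r[0]].append(r)

def indexed_slice(index, p1):
    return index.get(p1, [])

assert o_n_slice(rows, 7) == indexed_slice(index, 7)
```

The one-time cost of building the index is repaid as soon as the code slices over many different keys, which is exactly the access pattern of typical model-building loops.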

First, we see that with a small table (1,200 rows), the pandas slicing is only somewhat faster than the O(n) slicing, while Slicer slicing is quite a bit faster and tuplelist faster still.


In [3]:
%timeit checkChildDfLen(smallChildDf, *smallChk)


1000 loops, best of 3: 1.59 ms per loop

In [4]:
%timeit checkTupleListLen(smallSmartTupleList, *smallChk)


The slowest run took 5.71 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 5.26 µs per loop

In [5]:
%timeit checkSlicerLen(smallSlicer, *smallChk)


10000 loops, best of 3: 23.3 µs per loop

In [6]:
%timeit checkTupleListLen(smallDumbTupleList, *smallChk)


100 loops, best of 3: 5.71 ms per loop

Next, we see that with a table of 31,800 rows, pandas slicing is now ~100X faster than O(n) slicing (but tuplelist and Slicer are still the fastest by far).


In [7]:
%timeit checkChildDfLen(medChildDf, *medChk)


1000 loops, best of 3: 1.87 ms per loop

In [8]:
%timeit checkTupleListLen(medSmartTupleList, *medChk)


The slowest run took 4.05 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 5.17 µs per loop

In [9]:
%timeit checkSlicerLen(medSlicer, *medChk)


10000 loops, best of 3: 25 µs per loop

In [10]:
%timeit checkTupleListLen(medDumbTupleList, *medChk)


10 loops, best of 3: 164 ms per loop

Finally, we see that with a table of 270,000 rows, pandas slicing is ~1000X faster than O(n) slicing. Here, tuplelist is blindingly fast - its improvement over pandas is nearly as large as pandas's improvement over O(n) slicing. Slicer again comes in a respectably close second.


In [11]:
%timeit checkChildDfLen(bigChildDf, *bigChk)


100 loops, best of 3: 4.06 ms per loop

In [12]:
%timeit checkTupleListLen(bigSmartTupleList, *bigChk)


100000 loops, best of 3: 5.27 µs per loop

In [13]:
%timeit checkSlicerLen(bigSlicer, *bigChk)


10000 loops, best of 3: 69 µs per loop

In [14]:
%timeit checkTupleListLen(bigDumbTupleList, *bigChk)


1 loops, best of 3: 1.42 s per loop

Bottom line? pandas isn't really designed with "iterating over indices and slicing" in mind, so it isn't the absolute fastest way to write this sort of code. However, pandas also doesn't implement naive O(n) slicing.

For most instances, the .sloc approach to slicing will be fast enough. In general, so long as you use the optimal big-O subroutines, the time to solve a MIP or LP model will be larger than the time to formulate the model. However, in those instances where the slicing is the bottleneck operation, gurobipy.tuplelist or ticdat.Slicer can be used, or the model building code can be refactored to be more pandonic.
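What "refactoring to be more pandonic" means in practice is replacing a loop of per-key slices with a single vectorized operation. A minimal sketch, using made-up column names (`parent`, `qty`) rather than anything from the benchmark:

```python
import pandas as pd

df = pd.DataFrame({"parent": [1, 1, 2, 2, 2], "qty": [10, 20, 5, 5, 5]})

# Slicing style: one .loc per key (slow when there are many keys).
df_indexed = df.set_index("parent").sort_index()
per_key = {k: df_indexed.loc[[k], "qty"].sum()
           for k in df_indexed.index.unique()}

# Pandonic style: a single groupby replaces the whole slicing loop.
totals = df.groupby("parent")["qty"].sum()

assert per_key == totals.to_dict()
```

When the aggregation can be expressed this way, pandas does all the "slices" in one pass, so the per-slice overhead measured above never enters the picture.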

Addendum

There was a request to check sum as well as len. Here the results vindicate pandas, inasmuch as all three "smart" strategies are roughly equivalent.
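This convergence makes sense: once each matching row's data value must be visited to compute a sum, the O(result-size) aggregation dominates the cost of locating the rows, so a fast lookup no longer sets a strategy apart. A toy illustration (the helper names are illustrative, not the benchmark's):

```python
from collections import defaultdict

rows = [(p, c, p + c) for p in range(300) for c in range(300)]

# Index data values by the first key field, built once up front.
index = defaultdict(list)
for p, c, v in rows:
    index[p].append(v)

def indexed_sum(index, p):
    # The O(result-size) summation dominates the O(1) lookup,
    # which is why the "smart" strategies converge for sum.
    return sum(index[p])

def o_n_sum(rows, p):
    # The O(n) version still scans everything, so it stays slow.
    return sum(v for p_, _c, v in rows if p_ == p)

assert indexed_sum(index, 7) == o_n_sum(rows, 7)
```

Only the O(n) strategy remains an outlier, since it still pays for a full scan before it can aggregate anything.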


In [15]:
%timeit checkChildDfSum(bigChildDf, *bigChk)


100 loops, best of 3: 4.22 ms per loop

In [16]:
%timeit checkTupleListSum(bigSmartTupleList, bigTd, *bigChk)


100 loops, best of 3: 2.9 ms per loop

In [17]:
%timeit checkSlicerSum(bigSlicer, bigTd, *bigChk)


100 loops, best of 3: 2.94 ms per loop

In [18]:
%timeit checkTupleListSum(bigDumbTupleList, bigTd, *bigChk)


1 loops, best of 3: 1.42 s per loop