Run a little python script that sets up the performance comparisons.
In [1]:
run prep_for_different_slicings.py
The slicing will be over small, medium, and large tables.
In [2]:
[len(getattr(td, "childTable")) for td in (smallTd, medTd, bigTd)]
Out[2]:
We will run three series of four tests each.
Each series tests
.sloc and pandasgurobipy.tuplelistticdat.Slicer (with the gurobipy enhancement disabled)First, we see that with a small table (1,200) rows, the pandas slicing is only somewhat faster than the O(n) slicing, while Slicer slicing is quite a bit faster and tuplelist faster still.
In [3]:
%timeit checkChildDfLen(smallChildDf, *smallChk)
In [4]:
%timeit checkTupleListLen(smallSmartTupleList, *smallChk)
In [5]:
%timeit checkSlicerLen(smallSlicer, *smallChk)
In [6]:
%timeit checkTupleListLen(smallDumbTupleList, *smallChk)
Next we see that with a table of 31,800 rows, pandas slicing is now ~100 faster than O(n) slicing (but tuplelist and Slicer are still the fastest by far).
In [7]:
%timeit checkChildDfLen(medChildDf, *medChk)
In [8]:
%timeit checkTupleListLen(medSmartTupleList, *medChk)
In [9]:
%timeit checkSlicerLen(medSlicer, *medChk)
In [10]:
%timeit checkTupleListLen(medDumbTupleList, *medChk)
Finally, we see that with a table of 270,000 rows, pandas slicing is ~1000X faster than O(n) slicing. Here, tuplelist is blindingly fast - nearly as much an improvement shows over pandas as pandas shows over O(n). Slicer again comes in a respectably close second.
In [11]:
%timeit checkChildDfLen(bigChildDf, *bigChk)
In [12]:
%timeit checkTupleListLen(bigSmartTupleList, *bigChk)
In [13]:
%timeit checkSlicerLen(bigSlicer, *bigChk)
In [14]:
%timeit checkTupleListLen(bigDumbTupleList, *bigChk)
Bottom line? pandas isn't really designed with "iterating over indicies and slicing" in mind, so it isn't the absolutely fastest way to write this sort of code. However, pandas also doesn't implement naive O(n) slicing.
For most instances, the .sloc approach to slicing will be fast enough. In general, so long as you use the optimal big-O subroutines, the time to solve a MIP or LP model will be larger than the time to formulate the model. However, in those instances where the slicing is the bottleneck operation, gurobipy.tuplelist or ticdat.Slicer can be used, or the model building code can be refactored to be more pandonic.
There was a request to check sum as well as len. Here the results vindicate pandas, in as much as all three "smart" strategies are roughly equivalent.
In [15]:
%timeit checkChildDfSum(bigChildDf, *bigChk)
In [16]:
%timeit checkTupleListSum(bigSmartTupleList, bigTd, *bigChk)
In [17]:
%timeit checkSlicerSum(bigSlicer, bigTd, *bigChk)
In [18]:
%timeit checkTupleListSum(bigDumbTupleList, bigTd, *bigChk)