Run a small Python script that sets up the performance comparisons.
In [1]:
run prep_for_different_slicings.py
The slicing will be over small, medium, and large tables.
In [2]:
[len(getattr(td, "childTable")) for td in (smallTd, medTd, bigTd)]
Out[2]:
We will run three series of four tests each. Each series tests slicing with .sloc and pandas, with gurobipy.tuplelist, with ticdat.Slicer (with the gurobipy enhancement disabled), and with naive O(n) slicing.
First, we see that with a small table (1,200 rows), pandas slicing is only somewhat faster than O(n) slicing, while Slicer slicing is quite a bit faster and tuplelist is faster still.
In [3]:
%timeit checkChildDfLen(smallChildDf, *smallChk)
In [4]:
%timeit checkTupleListLen(smallSmartTupleList, *smallChk)
In [5]:
%timeit checkSlicerLen(smallSlicer, *smallChk)
In [6]:
%timeit checkTupleListLen(smallDumbTupleList, *smallChk)
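The difference between naive O(n) slicing and index-backed slicing can be sketched in plain Python. This is only an illustration of the two strategies, not a description of how gurobipy.tuplelist or ticdat.Slicer are implemented; the names naive_slice, build_index, and indexed_slice, and the sample keys, are hypothetical.

```python
from collections import defaultdict

# Hypothetical two-field keys (parent, child), standing in for the
# primary keys of a child table.
keys = [(p, c) for p in range(4) for c in range(3)]


def naive_slice(keys, parent):
    """O(n) slicing: scan every key on every call."""
    return [k for k in keys if k[0] == parent]


def build_index(keys):
    """One O(n) pass that groups the keys by their first field."""
    index = defaultdict(list)
    for k in keys:
        index[k[0]].append(k)
    return index


def indexed_slice(index, parent):
    """Dictionary lookup: the cost no longer grows with the table size."""
    return index.get(parent, [])


index = build_index(keys)
print(naive_slice(keys, 2))
print(indexed_slice(index, 2))
```

Both calls return the same keys. The naive version repeats the full scan for every slice, while the indexed version pays for the scan once, which is consistent with the gap widening as the tables grow.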
Next, we see that with a table of 31,800 rows, pandas slicing is now ~100X faster than O(n) slicing (but tuplelist and Slicer are still the fastest by far).
In [7]:
%timeit checkChildDfLen(medChildDf, *medChk)
In [8]:
%timeit checkTupleListLen(medSmartTupleList, *medChk)
In [9]:
%timeit checkSlicerLen(medSlicer, *medChk)
In [10]:
%timeit checkTupleListLen(medDumbTupleList, *medChk)
Finally, we see that with a table of 270,000 rows, pandas slicing is ~1000X faster than O(n) slicing. Here, tuplelist is blindingly fast: it shows nearly as much improvement over pandas as pandas shows over O(n). Slicer again comes in a respectably close second.
In [11]:
%timeit checkChildDfLen(bigChildDf, *bigChk)
In [12]:
%timeit checkTupleListLen(bigSmartTupleList, *bigChk)
In [13]:
%timeit checkSlicerLen(bigSlicer, *bigChk)
In [14]:
%timeit checkTupleListLen(bigDumbTupleList, *bigChk)
Bottom line? pandas isn't really designed with "iterating over indices and slicing" in mind, so it isn't the absolute fastest way to write this sort of code. However, pandas also doesn't implement naive O(n) slicing.
For most instances, the .sloc approach to slicing will be fast enough. In general, so long as you use the optimal big-O subroutines, the time to solve a MIP or LP model will be larger than the time to formulate the model. However, in those instances where slicing is the bottleneck operation, gurobipy.tuplelist or ticdat.Slicer can be used, or the model-building code can be refactored to be more pandonic.
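As one possible reading of "more pandonic", the sketch below replaces a loop that slices once per parent with a single grouped aggregation. The column names parent, child, and qty and the sample data are hypothetical, and ordinary boolean-mask selection with .loc is used here in place of .sloc.

```python
import pandas as pd

# Hypothetical child table: each row belongs to one parent.
child_df = pd.DataFrame({
    "parent": [1, 1, 2, 2, 2, 3],
    "child": [10, 11, 20, 21, 22, 30],
    "qty": [5, 7, 1, 2, 3, 4],
})

# Iterate-and-slice style: one selection per parent value.
totals_loop = {
    p: int(child_df.loc[child_df["parent"] == p, "qty"].sum())
    for p in child_df["parent"].unique()
}

# Vectorized style: a single grouped aggregation over the whole table.
totals_grouped = child_df.groupby("parent")["qty"].sum().to_dict()

print(totals_loop)
print(totals_grouped)
```

Both dictionaries hold the same per-parent totals. The grouped form lets pandas do the partitioning in one pass, so no per-parent slice is needed at all.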
There was a request to check sum as well as len. Here the results vindicate pandas, inasmuch as all three "smart" strategies are roughly equivalent.
In [15]:
%timeit checkChildDfSum(bigChildDf, *bigChk)
In [16]:
%timeit checkTupleListSum(bigSmartTupleList, bigTd, *bigChk)
In [17]:
%timeit checkSlicerSum(bigSlicer, bigTd, *bigChk)
In [18]:
%timeit checkTupleListSum(bigDumbTupleList, bigTd, *bigChk)
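The check functions timed above are defined in prep_for_different_slicings.py and are not shown here. As a rough sketch of how a len check and a sum check can differ, assuming a hypothetical table that maps (parent, child) keys to a numeric field, the first only counts the sliced keys while the second also reads a value for every matched row. The names child_table, by_parent, check_len, and check_sum are illustrative only.

```python
from collections import defaultdict

# Hypothetical child table: (parent, child) key -> a numeric data field.
child_table = {(p, c): p * 10 + c for p in range(1, 4) for c in range(1, 5)}

# Index the keys by parent once, so each slice is a dictionary lookup.
by_parent = defaultdict(list)
for key in child_table:
    by_parent[key[0]].append(key)


def check_len(parent):
    """Count the keys in the slice; no row data is read."""
    return len(by_parent.get(parent, []))


def check_sum(parent):
    """Sum a data field over the slice; every matched row is read."""
    return sum(child_table[key] for key in by_parent.get(parent, []))


print(check_len(2), check_sum(2))
```

In a sum check the per-row work is shared by every strategy, which may be part of why the "smart" approaches end up closer together than they do for len.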