One important issue that became evident to us (NFQ Solutions) during the last two months is that the Pandas Dataframe is not a very good container for data. This notebook is a short explanation of why we may have to reduce our now-intensive use of the Pandas Dataframe (it is now almost everywhere) and explore some other solutions. None of this is a criticism of Pandas. It is a game changer, and I am a strong advocate for its use. It's just that we hit one of its limitations: some simple operations just have too much overhead.

To be more precise, let's start by importing Pandas


In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.arange(1E6), 'b': np.arange(1E6)})

We have just created a relatively large dataframe with some dummy data, enough to prove my initial point. Let's see how much time it takes to add the two columns and insert the result into a third one.


In [4]:
%timeit -o df.c = df.a + df.b


3.1 ms ± 60 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Out[4]:
<TimeitResult : 3.1 ms ± 60 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)>

Is that fast or slow? Well, let's make the very same computation in a slightly different manner.


In [3]:
a = df.a.values
b = df.b.values

In [4]:
%%timeit
c = a + b


2.06 ms ± 26.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Compared to a simple sum of two numpy arrays, it is pretty fast. But we are adding two relatively large arrays. Let's try the exact same thing with smaller arrays.


In [5]:
df = pd.DataFrame({'a': np.arange(100), 'b': np.arange(100)})

In [6]:
%%timeit
df.c = df.a + df.b


141 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]:
a = df.a.values
b = df.b.values

In [8]:
%%timeit
c = a + b


896 ns ± 2.79 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Now things have changed quite a lot. Just adding two arrays takes two orders of magnitude less time than adding them through the Pandas Dataframe. But this comparison is not fair at all. Those 141 µs are not spent waiting: Pandas does lots of things with the value of the Series resulting from the sum before it inserts it into the dataframe. If we profile the execution of that simple sum, we'll see that almost a fifth of the time is spent in a function called _sanitize_array.


In [9]:
from IPython.display import Image
Image(filename='_static/snakeviz_add.png')


Out[9]:
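If you want to reproduce that kind of profile yourself without snakeviz, a minimal sketch with the standard library profiler could look like the following (the exact breakdown and the internal function names depend on the pandas version):

    import cProfile
    import pstats

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': np.arange(100), 'b': np.arange(100)})

    # Time many column assignments so the per-call overhead accumulates.
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(10000):
        df['c'] = df['a'] + df['b']
    profiler.disable()

    # Sort by cumulative time; helpers such as _sanitize_array should then
    # appear in the listing, as in the snakeviz view above.
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)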

The most important characteristic of Pandas is that it always does what it is supposed to do with your data, regardless of how dirty, heterogeneous, or sparse (you name it) that data is. And it does an amazing job at that. But the price we have to pay is those two orders of magnitude in time.
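As a small, hypothetical illustration of what that sanitization buys you: pandas will happily absorb a ragged, mixed-type input and still give you missing-value handling, something a plain numpy array does not attempt.

    import numpy as np
    import pandas as pd

    dirty = [1, 2.5, 'three', None]      # mixed types and a missing value

    s = pd.Series(dirty)                 # pandas sanitizes the input
    print(s.dtype)                       # object
    print(s.isna().tolist())             # [False, False, False, True]

    a = np.array(dirty, dtype=object)    # numpy just stores the objects
    print(a.dtype)                       # object, no notion of "missing"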

That is exactly what impacted the performance of our last project. The Dataframe is a very convenient container because it always does something that makes sense, so you have to code very little. Take, for instance, the join method of a dataframe: it does just what it has to do, and that is definitely not trivial. Unfortunately, that overhead is too much for us.
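To make "it does just what it has to do" concrete, here is a tiny sketch of join aligning two frames on their indexes and filling the gaps, which is exactly the kind of bookkeeping that costs time:

    import pandas as pd

    left = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
    right = pd.DataFrame({'y': [10.0, 30.0]}, index=['a', 'c'])

    # join aligns on the index and inserts NaN where there is no match.
    print(left.join(right))
    #      x     y
    # a  1.0  10.0
    # b  2.0   NaN
    # c  3.0  30.0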

We are in the typical situation where abstractions are not free. The higher the level, the slower the computation; a kind of second law of thermodynamics applied to numerical computing. And there are abstractions that are tremendously useful to us. A Dataframe is not a dictionary of arrays: it can be indexed by row and by column, it can operate as a whole and on any imaginable portion of it, and it can sort, group, join, merge... you name it. But if you want to compute the payment schedule of all the securities of an entire bank, you may need thousands of processors to have it done in less than six hours.

This is where I started thinking. There must be something in between: something that is fast, but not just a dictionary of numpy arrays. So I started designing gtable.


In [9]:
from gtable import Table

tb = Table({'a': np.arange(1E6), 'b': np.arange(1E6)})
tb


Out[9]:
<Table[ a[1000000] <float64>, b[1000000] <float64> ] object at 0x7f6caafd5240>

In [14]:
%%timeit
tb.c = tb.a + tb.b


3.76 ms ± 37.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

You can see that for large arrays the computation time overshadows the overhead. Let's see how well it does with smaller arrays.


In [8]:
tb = Table({'a': np.arange(100), 'b': np.arange(100)})

In [9]:
%%timeit
tb.c = tb.a + tb.b


10.9 µs ± 67.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

We have improved by more than a factor of ten (141 µs down to about 11 µs), which is crucial if that's the difference between running on one server or on ten. We can still improve the computation a little more if we fall back into some kind of "I know what I am doing" mode and reuse memory to avoid allocations:


In [16]:
%%timeit
tb['a'] = tb['a'] + tb['b']


2.15 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
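For reference, with plain numpy arrays the same memory-reuse idea can be taken all the way: np.add accepts an out argument, so the result is written into an existing buffer and no new array is allocated. This is a sketch of the general technique, not something gtable does automatically.

    import numpy as np

    a = np.arange(100, dtype=float)
    b = np.arange(100, dtype=float)

    # Write the sum directly into a's buffer: no allocation, but the original
    # contents of a are overwritten, so use with care.
    np.add(a, b, out=a)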

Now the performance of arithmetic operations with gtable is much closer to operating on plain arrays than to the overhead-driven performance of Pandas. You can seriously break the table if you really don't know what you are doing, but for obvious reasons having this kind of performance trick available is key to us.

Of course, these speedups come at a cost: features. Gtable is in its infancy. There are literally two afternoons of work in it, and the whole module fits within a single file with less than 300 lines of code. It is pure Python, and I have not started to seriously tune its performance. But the idea of having something in between a Dataframe and a dictionary of arrays, with support for sparse information, is appealing to say the least.

Let me demo the capabilities of this tiny module of mine a little. Assume that we start with a small table with two columns:


In [17]:
tb = Table({'a': pd.date_range('2000-01-01', freq='M', periods=10),
            'b': np.random.randn(10)})
tb


Out[17]:
<Table[ a[10] <datetime64[ns]>, b[10] <float64> ] object at 0x7f62cbc20908>

In [18]:
tb.add_column('schedule', np.array(['first ']))
tb


Out[18]:
<Table[ a[10] <datetime64[ns]>, b[10] <float64>, schedule[1] <<U6> ] object at 0x7f62cbc20908>

I have been able to concatenate a full column in the horizontal direction with a single value, and that is part of the information that the printed representation of the table gives. Storing the data and the indexes separately is a nice and efficient way of dealing with sparse data. We can take another look at the table to see this reflected in its representation:


In [19]:
tb


Out[19]:
<Table[ a[10] <datetime64[ns]>, b[10] <float64>, schedule[1] <<U6> ] object at 0x7f62cbc20908>

Gtable is not designed as a general tool for data analysis, but as an efficient data container. We can also concatenate data in the vertical direction efficiently, keeping a single copy of the data when necessary:


In [20]:
tb1 = tb.copy()
tb1.schedule.values[0] = 'second'
tb.stitch(tb1)
tb


Out[20]:
<Table[ a[20] <datetime64[ns]>, b[20] <float64>, schedule[2] <<U6> ] object at 0x7f62cbc20908>

If you care a little about how it is done: the internal storage is just a list of arrays plus a bitmap index. The bitmap index is interesting because some computations, like sorting or filtering, only involve the index. The contents of the table are stored in the data, keys, and index attributes.


In [21]:
tb.data


Out[21]:
[array(['2000-01-31T00:00:00.000000000', '2000-02-29T00:00:00.000000000',
        '2000-03-31T00:00:00.000000000', '2000-04-30T00:00:00.000000000',
        '2000-05-31T00:00:00.000000000', '2000-06-30T00:00:00.000000000',
        '2000-07-31T00:00:00.000000000', '2000-08-31T00:00:00.000000000',
        '2000-09-30T00:00:00.000000000', '2000-10-31T00:00:00.000000000',
        '2000-01-31T00:00:00.000000000', '2000-02-29T00:00:00.000000000',
        '2000-03-31T00:00:00.000000000', '2000-04-30T00:00:00.000000000',
        '2000-05-31T00:00:00.000000000', '2000-06-30T00:00:00.000000000',
        '2000-07-31T00:00:00.000000000', '2000-08-31T00:00:00.000000000',
        '2000-09-30T00:00:00.000000000', '2000-10-31T00:00:00.000000000'], dtype='datetime64[ns]'),
 array([ 0.80556054,  2.82697133,  0.23874968,  1.32208886,  0.45994963,
         0.90057559,  0.7262292 , -0.8382196 ,  0.70222322, -1.25277213,
         0.80556054,  2.82697133,  0.23874968,  1.32208886,  0.45994963,
         0.90057559,  0.7262292 , -0.8382196 ,  0.70222322, -1.25277213]),
 array(['first ', 'second'],
       dtype='<U6')]

In [22]:
tb.keys


Out[22]:
['a', 'b', 'schedule']

In [23]:
tb.index


Out[23]:
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
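To make the role of the bitmap concrete, here is a small sketch (my reading of the layout above, not gtable API) of how a dense view of the sparse schedule column could be rebuilt from its values and its row of the index:

    import numpy as np

    # Copied from the outputs above: two stored values and a 20-row bitmap.
    values = np.array(['first ', 'second'], dtype='<U6')
    index_row = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                          1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=np.uint8)

    # A 1 means "this column has a value at this row"; the values themselves
    # are stored densely, in row order.
    dense = np.full(index_row.shape, '', dtype='<U6')
    dense[index_row.astype(bool)] = values
    print(dense[0], dense[10])   # first  second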

We can take advantage of knowing the internal representation of the data to insert new data into the table in an efficient way. Every attribute of the table corresponds to a column, and each column stores its data as a numpy array in values and its piece of the index in index.


In [24]:
b_col = tb.b
b_col


Out[24]:
<Column[ float64 ] object at 0x7f6292d33eb8>

This means that it is relatively simple to make efficient computations with a whole column, for instance to add yet another column:


In [25]:
tb.sum_b = b_col.values.cumsum()

In [26]:
tb.sum_b.values


Out[26]:
array([  0.80556054,   3.63253188,   3.87128156,   5.19337042,
         5.65332005,   6.55389564,   7.28012484,   6.44190524,
         7.14412845,   5.89135633,   6.69691687,   9.5238882 ,
         9.76263788,  11.08472674,  11.54467638,  12.44525197,
        13.17148117,  12.33326156,  13.03548478,  11.78271265])

In [27]:
tb['sum_b']


Out[27]:
array([  0.80556054,   3.63253188,   3.87128156,   5.19337042,
         5.65332005,   6.55389564,   7.28012484,   6.44190524,
         7.14412845,   5.89135633,   6.69691687,   9.5238882 ,
         9.76263788,  11.08472674,  11.54467638,  12.44525197,
        13.17148117,  12.33326156,  13.03548478,  11.78271265])

We'll see where it goes from here.

