DataFrame: Functional Chains for TTrees in Python.


The DataFrame class brings functional chains with caching to trees. This is achieved by identifying the different functions in a chain and building a list of transformations from them. Usability is key. Functional chains are a much simpler way of creating histograms because the user doesn't need to write loops: DataFrame does it for you.
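
The idea of a chain that records transformations and only loops when an action is called can be sketched in plain Python. This is a toy illustration under stated assumptions, not the DataFrame class used in this notebook: the name ToyChain, the use of dictionaries as rows, and the way transformations are stored are all hypothetical.

```python
# Toy illustration only: this is NOT the DataFrame class from this notebook.
# ToyChain and its internals are hypothetical names chosen for the sketch.
class ToyChain:
    """Records filter functions and applies them only when an action runs."""

    def __init__(self, rows, transformations=()):
        self._rows = rows
        self._transformations = tuple(transformations)

    def filter(self, predicate):
        # A transformation: nothing is computed yet, the predicate is only recorded.
        return ToyChain(self._rows, self._transformations + (predicate,))

    def head(self, n):
        # An action: the loops happen here, hidden from the user.
        # The recorded predicates are applied one after another.
        rows = self._rows
        for predicate in self._transformations:
            rows = [row for row in rows if predicate(row)]
        return rows[:n]


rows = [
    {"Age": 49, "Children": 5},
    {"Age": 36, "Children": 5},
    {"Age": 30, "Children": 1},
]
chain = ToyChain(rows).filter(lambda r: r["Children"] > 4).filter(lambda r: r["Age"] < 47)
result = chain.head(5)
```

The user only writes the chain; the loop over the rows lives inside the action.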

Preparation

We import ROOT, DataFrame and the PyTreeReader class. DataFrame uses PyTreeReader for filling histograms and filtering results. Most of the computing is done by PyTreeReader inside the DataFrame class. This will probably be done in a better way later, since how PyTreeReader will be used in ROOT is still unknown. PyTreeReader can be found at https://github.com/dpiparo/pytreereader


In [1]:
import ROOT
from PyTreeReader import PyTreeReader
from functional import DataFrame
from ROOT import TFile


Welcome to JupyROOT 6.07/07

Here we get a tree from a test data file called cernstaff.root


In [2]:
testFile = TFile('cernstaff.root')
testTree = testFile.Get('T')

Here we create the DataFrame object


In [3]:
dataFrame = DataFrame(testTree)


Creating the PyTreeReader

As you can see from the output, this also creates a PyTreeReader. This is why PyTreeReader is mandatory for the class.

Traditional read without cache


In [4]:
%%time
dataFrame.filter(lambda e : e.Children() > 4).head(5)


Category Flag Age Service Children Grade Step Hrweek Cost Division Nation
300 13 49 24 5 9 9 40 10039 L F
201 15 58 26 5 11 9 40 14390 P G
560 14 47 19 5 5 13 20 4624 E F
202 15 47 19 6 11 4 40 13574 P D
102 7 36 5 5 10 1 40 11026 E F
CPU times: user 115 ms, sys: 17 ms, total: 132 ms
Wall time: 126 ms

The same query, but now caching the result first and then rerunning it


In [8]:
dataFrame.resetcache()


Out[8]:
<functional.DataFrame.DataFrame at 0x7f228005df90>

In [9]:
%%time
dataFrame.filter(lambda e : e.Children() > 4).cache().head(5)


Category Flag Age Service Children Grade Step Hrweek Cost Division Nation
300 13 49 24 5 9 9 40 10039 L F
201 15 58 26 5 11 9 40 14390 P G
560 14 47 19 5 5 13 20 4624 E F
202 15 47 19 6 11 4 40 13574 P D
102 7 36 5 5 10 1 40 11026 E F
NOT cached FILTER
new cache = True
CPU times: user 18.1 ms, sys: 10.8 ms, total: 28.9 ms
Wall time: 24 ms

Now we rerun it and use the cached results to print


In [11]:
%%time
dataFrame.filter(lambda e : e.Children() > 4).cache().filter(lambda e : e.Age() < 47).head(5)


Category Flag Age Service Children Grade Step Hrweek Cost Division Nation
102 7 36 5 5 10 1 40 11026 E F
Cached FILTER
new cache = False
CPU times: user 8.79 ms, sys: 6.5 ms, total: 15.3 ms
Wall time: 13.4 ms

There is some caching of the files in the SWAN service, but the point is that the first and second runs differ a lot in speed
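
Why a second run can skip the work is easiest to see in a small plain-Python sketch. It assumes the cache stores the indices of the entries that passed a filter, keyed by a user-supplied string. That layout, the class name CachedFilter, and the evaluations counter are assumptions made for illustration; the real class may store and key its cache differently.

```python
# Toy illustration only; the class name and cache layout are assumptions,
# not the implementation behind dataFrame.cache().
class CachedFilter:
    def __init__(self, rows):
        self._rows = rows
        self._cache = {}      # key -> indices of the rows that passed the filter
        self.evaluations = 0  # counts predicate calls, to make cache hits visible

    def filter(self, key, predicate):
        if key not in self._cache:
            # Not cached: loop over every row once and remember what passed.
            passed = []
            for index, row in enumerate(self._rows):
                self.evaluations += 1
                if predicate(row):
                    passed.append(index)
            self._cache[key] = passed
        # Cached: rebuild the result from the stored indices, no predicate calls.
        return [self._rows[i] for i in self._cache[key]]

    def resetcache(self):
        # Returns the object itself, mirroring the Out[...] cells in this notebook.
        self._cache.clear()
        return self


ages = [{"Age": 49}, {"Age": 58}, {"Age": 36}]
frame = CachedFilter(ages)
first = frame.filter("age>45", lambda r: r["Age"] > 45)
calls_after_first = frame.evaluations
second = frame.filter("age>45", lambda r: r["Age"] > 45)
calls_after_second = frame.evaluations
```

In this sketch the second call evaluates no predicates at all, which is the kind of saving the timings above illustrate.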

Let's reset the cache by calling a function from the class


In [12]:
dataFrame.resetcache()


Out[12]:
<functional.DataFrame.DataFrame at 0x7f228005df90>

Now we can create different histograms and draw them


In [14]:
%%time
dataFrame.filter(lambda e : e.Age() > 45).cache().histo('Age:Cost').Draw('COLZ')
ROOT.gPad.Draw()


Cached FILTER
new cache = True
CPU times: user 28.9 ms, sys: 17.6 ms, total: 46.6 ms
Wall time: 40.3 ms
Warning in <TFile::Append>: Replacing existing TH1: h (Potential memory leak).

Rerun the same analysis, compare the time


In [15]:
%%time
dataFrame.filter(lambda e : e.Age() > 45).cache().histo('Age:Cost').Draw('COLZ')
ROOT.gPad.Draw()


Cached FILTER
new cache = False
CPU times: user 19.9 ms, sys: 7.36 ms, total: 27.3 ms
Wall time: 24.8 ms
Warning in <TFile::Append>: Replacing existing TH1: h (Potential memory leak).

Let's add one more filter after the cache and see how it differs...


In [16]:
%%time
dataFrame.filter(lambda e : e.Age() > 45).cache().filter(lambda e: e.Cost() > 8500).histo('Age:Cost').Draw('COLZ')
ROOT.gPad.Draw()


Cached FILTER
new cache = False
CPU times: user 20.5 ms, sys: 7.13 ms, total: 27.6 ms
Wall time: 23.4 ms
Warning in <TFile::Append>: Replacing existing TH1: h (Potential memory leak).

What can be done more?

This is the first implementation of the class and functional chains.

Usability can be improved by adding more transformations and actions.

A lot can be achieved, with sizable performance improvements, by using the PyTreeReader.

However, there are some minor flaws in the class:

  • Reading more complex trees might need a different approach
  • If PyTreeReader is changed not to use brackets to handle the entries, this program crashes
  • The Map() and FlatMap() functions have a skeleton ready, but it still has to be figured out how and where the new tree should be read into the PyTreeReader
  • TEntryList usage can be optimized more
  • This uses the RDD idea, so it applies the functions one by one. If there are 3 filters in a row, it could run all of them at the same time, so that it doesn't have to go through the loop separately for each of them.
  • Transformations after cache() do not work properly if there is more than one of them.
  • This class is written in Python and should be converted to C++ when possible
  • It has some glitches when reading values the first time, but it works when done a second time
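
The point about consecutive filters can be sketched in plain Python. The two functions below are hypothetical helpers written for this illustration, not part of the DataFrame class: the first makes one full pass per filter, the second checks all filters for each row in a single pass.

```python
# Toy illustration only; both helpers are hypothetical and operate on
# plain dictionaries instead of tree entries.
def filter_one_by_one(rows, predicates):
    """One full pass over the remaining rows per predicate."""
    passes = 0
    for predicate in predicates:
        rows = [row for row in rows if predicate(row)]
        passes += 1
    return rows, passes


def filter_fused(rows, predicates):
    """A single pass that checks every predicate for each row."""
    kept = [row for row in rows if all(p(row) for p in predicates)]
    return kept, 1


rows = [
    {"Age": 49, "Cost": 10039},
    {"Age": 58, "Cost": 14390},
    {"Age": 47, "Cost": 4624},
]
predicates = [
    lambda r: r["Age"] > 45,
    lambda r: r["Cost"] > 8500,
    lambda r: r["Age"] < 55,
]

slow_rows, slow_passes = filter_one_by_one(rows, predicates)
fast_rows, fast_passes = filter_fused(rows, predicates)
```

Both versions select the same rows; the fused one just loops once, and all() stops checking a row as soon as one predicate fails.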

Remember that this is a prototype; it will need optimizing and improvements.

