DataFrame: Functional Chains for TTrees in Python.

The DataFrame class brings functional chains with caching to trees. It does this by recording each function called in a chain as an entry in a list of transformations. Usability is the key goal. Functional chains are a much simpler way of creating histograms because the user does not need to write loops: DataFrame does it for them.
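To make the idea of a "list of transformations" concrete, here is a minimal sketch in plain Python. It is not the DataFrame implementation: the class name `Chain` and its internals are assumptions made for illustration, and it works on any Python iterable instead of a tree.

```python
class Chain:
    """Toy functional chain (hypothetical, not the real DataFrame class)."""

    def __init__(self, source, transformations=()):
        self._source = source
        # The chain only remembers what to do; nothing is evaluated here.
        self._transformations = tuple(transformations)

    def filter(self, predicate):
        # A transformation: record the predicate and return a new chain.
        return Chain(self._source, self._transformations + (predicate,))

    def head(self, n):
        # An action: only now are the recorded transformations applied,
        # one after another, and the loop stays hidden from the user.
        entries = list(self._source)
        for predicate in self._transformations:
            entries = [entry for entry in entries if predicate(entry)]
        return entries[:n]


chain = Chain(range(10)).filter(lambda x: x % 2 == 0).filter(lambda x: x > 2)
first_two = chain.head(2)
print(first_two)
```

The split between transformations (recorded) and actions (executed) is what lets a chain be inspected, cached, or reordered before any entry is read.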

Preparation

We import ROOT, DataFrame and the PyTreeReader class. DataFrame uses PyTreeReader to fill histograms and filter results, so most of the computation is done by PyTreeReader inside the DataFrame class. This will probably be done in a better way later, because it is not yet known how PyTreeReader will be used in ROOT. PyTreeReader is available at https://github.com/dpiparo/pytreereader



In [1]:

    
import ROOT
from ROOT import TFile
from PyTreeReader import PyTreeReader  # reads the tree entries for DataFrame
from functional import DataFrame       # the prototype class shown in this notebook









    














    



Welcome to JupyROOT 6.07/07

Next we open the test data file cernstaff.root and get the tree named T from it.



In [2]:

    
testFile = TFile('cernstaff.root')  # open the test data file
testTree = testFile.Get('T')        # retrieve the tree named 'T'

Here we create the DataFrame object from the tree.



In [3]:

    
dataFrame = DataFrame(testTree)









    



Creating the PyTreeReader

As the output shows, constructing the DataFrame also creates a PyTreeReader. This is why PyTreeReader is a mandatory dependency of the class.

Traditional read without cache



In [4]:

    
%%time
dataFrame.filter(lambda e : e.Children() > 4).head(5)









    






Category	Flag	Age	Service	Children	Grade	Step	Hrweek	Cost	Division	Nation
300	13	49	24	5	9	9	40	10039	L	F
201	15	58	26	5	11	9	40	14390	P	G
560	14	47	19	5	5	13	20	4624	E	F
202	15	47	19	6	11	4	40	13574	P	D
102	7	36	5	5	10	1	40	11026	E	F









    



CPU times: user 115 ms, sys: 17 ms, total: 132 ms
Wall time: 126 ms
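For comparison, the sketch below shows the kind of explicit loop that the chain in the cell above spares the user from writing. It runs on plain dictionaries, not on the tree: the first five records reuse three columns from the rows printed above, the record with Category 999 is made up so that the cut has something to reject, and the function name is hypothetical.

```python
staff = [
    {"Category": 300, "Age": 49, "Children": 5},
    {"Category": 201, "Age": 58, "Children": 5},
    {"Category": 999, "Age": 40, "Children": 1},  # made-up record, fails the cut
    {"Category": 560, "Age": 47, "Children": 5},
    {"Category": 202, "Age": 47, "Children": 6},
    {"Category": 102, "Age": 36, "Children": 5},
]


def head_with_children_above(entries, threshold, limit):
    """Hand-written equivalent of filter(Children > threshold).head(limit)."""
    selected = []
    for entry in entries:
        if len(selected) == limit:
            break  # enough rows collected, stop reading
        if entry["Children"] > threshold:
            selected.append(entry)
    return selected


selected = head_with_children_above(staff, threshold=4, limit=5)
print([row["Category"] for row in selected])
```

Every new cut or row limit means editing this loop, whereas the chain expresses the same selection in a single line.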

Now the same filter again, but this time we cache its result so that a rerun can reuse it. First we reset the cache.



In [8]:

    
dataFrame.resetcache()









    Out[8]:





<functional.DataFrame.DataFrame at 0x7f228005df90>



In [9]:

    
%%time
dataFrame.filter(lambda e : e.Children() > 4).cache().head(5)









    






Category	Flag	Age	Service	Children	Grade	Step	Hrweek	Cost	Division	Nation
300	13	49	24	5	9	9	40	10039	L	F
201	15	58	26	5	11	9	40	14390	P	G
560	14	47	19	5	5	13	20	4624	E	F
202	15	47	19	6	11	4	40	13574	P	D
102	7	36	5	5	10	1	40	11026	E	F










    



NOT cached FILTER
new cache = True
CPU times: user 18.1 ms, sys: 10.8 ms, total: 28.9 ms
Wall time: 24 ms

Now we rerun it, reusing the cached result of the first filter and applying a second filter before printing.



In [11]:

    
%%time
dataFrame.filter(lambda e : e.Children() > 4).cache().filter(lambda e : e.Age() < 47).head(5)









    






Category	Flag	Age	Service	Children	Grade	Step	Hrweek	Cost	Division	Nation
102	7	36	5	5	10	1	40	11026	E	F










    



Cached FILTER
new cache = False
CPU times: user 8.79 ms, sys: 6.5 ms, total: 15.3 ms
Wall time: 13.4 ms

The SWAN service also does some file caching of its own, but the point is that the first and second cached runs differ a lot in speed (24 ms versus 13.4 ms wall time).
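One way such a filter cache could work is sketched below. This is an assumption, not the DataFrame code: the names `filter_key` and `FilterCache` are hypothetical, and keying the cache on the compiled bytecode of the lambda is only one possible way to recognise that the same filter was written again. The cache stores the indices of the passing entries, similar in spirit to an entry list.

```python
def filter_key(predicate):
    """Hashable key built from the predicate's compiled code (assumed scheme)."""
    code = predicate.__code__
    return (code.co_code, code.co_consts, code.co_names)


class FilterCache:
    """Toy cache mapping a filter to the indices of the entries that passed."""

    def __init__(self):
        self._passing = {}

    def passing_indices(self, entries, predicate):
        key = filter_key(predicate)
        is_new = key not in self._passing
        if is_new:
            # Cache miss: loop over the entries once and remember the result.
            self._passing[key] = [
                index for index, entry in enumerate(entries) if predicate(entry)
            ]
        return self._passing[key], is_new

    def reset(self):
        self._passing.clear()


ages = [49, 58, 47, 47, 36]  # Age column of the five rows printed earlier
cache = FilterCache()
first, first_is_new = cache.passing_indices(ages, lambda age: age < 47)
second, second_is_new = cache.passing_indices(ages, lambda age: age < 47)
print(first, first_is_new, second_is_new)
```

A key like this ignores variables captured from the enclosing scope, so two lambdas that differ only in a closed-over value would wrongly share a cache entry. A real implementation would need a stricter key.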

Let's reset the cache by calling the resetcache() method of the class.



In [12]:

    
dataFrame.resetcache()









    Out[12]:





<functional.DataFrame.DataFrame at 0x7f228005df90>

Now we can create histograms from a chain and draw them.



In [14]:

    
%%time
dataFrame.filter(lambda e : e.Age() > 45).cache().histo('Age:Cost').Draw('COLZ')
ROOT.gPad.Draw()









    



Cached FILTER
new cache = True
CPU times: user 28.9 ms, sys: 17.6 ms, total: 46.6 ms
Wall time: 40.3 ms






    



Warning in <TFile::Append>: Replacing existing TH1: h (Potential memory leak).
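ROOT prints the warning above when an object named h is added to a directory that already contains an object with that name. A plausible cause, although this is an assumption about the prototype, is that histo() gives every histogram the same name. A small helper such as the hypothetical `unique_hist_name` below would give each histogram its own name and avoid the clash.

```python
import itertools

# Module-level counter shared by all calls (hypothetical helper state).
_hist_counter = itertools.count()


def unique_hist_name(prefix="h"):
    """Return a histogram name that has not been handed out before."""
    return "{}_{}".format(prefix, next(_hist_counter))


names = [unique_hist_name() for _ in range(3)]
print(names)
```

The generated name would then be passed wherever the histogram is created, in place of the fixed name h.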

Rerun the same analysis and compare the time.



In [15]:

    
%%time
dataFrame.filter(lambda e : e.Age() > 45).cache().histo('Age:Cost').Draw('COLZ')
ROOT.gPad.Draw()









    



Cached FILTER
new cache = False
CPU times: user 19.9 ms, sys: 7.36 ms, total: 27.3 ms
Wall time: 24.8 ms






    



Warning in <TFile::Append>: Replacing existing TH1: h (Potential memory leak).

Let's add one more filter after the cache and see how the timing differs.



In [16]:

    
%%time
dataFrame.filter(lambda e : e.Age() > 45).cache().filter(lambda e: e.Cost() > 8500).histo('Age:Cost').Draw('COLZ')
ROOT.gPad.Draw()









    



Cached FILTER
new cache = False
CPU times: user 20.5 ms, sys: 7.13 ms, total: 27.6 ms
Wall time: 23.4 ms






    



Warning in <TFile::Append>: Replacing existing TH1: h (Potential memory leak).

What more can be done?

This is the first implementation of the class and of functional chains.

Usability can be improved by adding more transformations and actions.

A lot can be achieved, with sizeable performance improvements, by using PyTreeReader.

However, the class has some flaws:

Reading more complex trees might need a different approach.
If PyTreeReader is changed so that it no longer uses brackets to access the entries, this program will crash.
The Map() and FlatMap() functions have a skeleton ready, but it still has to be worked out how and where the new tree should be read into the PyTreeReader.
TEntryList usage can be optimized further.
The class follows the RDD idea, so it applies the functions one by one. If there are 3 filters in a row, it could run them all in the same pass, so that it would not have to loop over the entries separately for each of them.
Transformations after cache() do not work properly if there is more than one of them.
The class is written in Python and should be converted to C++ when possible.
There are some glitches when values are read for the first time, but the second read works.
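The point about running consecutive filters in a single pass can be sketched as follows. This is an illustration in plain Python with hypothetical function names, not a patch for the class: both functions return the surviving entries together with the number of passes they made over the data.

```python
def apply_one_by_one(entries, predicates):
    """Current behaviour as described above: one loop per filter."""
    passes = 0
    for predicate in predicates:
        entries = [entry for entry in entries if predicate(entry)]
        passes += 1
    return entries, passes


def apply_fused(entries, predicates):
    """Proposed behaviour: test all filters on each entry in one loop."""
    # all() stops at the first failing predicate, so an entry rejected by an
    # early filter is never shown to the later ones, as in the sequential case.
    kept = [entry for entry in entries if all(p(entry) for p in predicates)]
    return kept, 1


predicates = [lambda x: x > 2, lambda x: x % 2 == 0, lambda x: x < 9]
data = list(range(12))

sequential, sequential_passes = apply_one_by_one(data, predicates)
fused, fused_passes = apply_fused(data, predicates)
print(sequential, sequential_passes)
print(fused, fused_passes)
```

Both variants select the same entries. With a tree, where each pass means reading entries from a file, the single loop is where the saving would come from.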

Remember that this is a prototype. It will need optimization and improvements.


