Introduction to Deep Learning with Keras

and HEP Data Exploration in Python

Michela Paganini - Yale University

High Energy Phenomenology, Experiment and Cosmology Seminar Series

This is the first of two notebooks for today's tutorial. You will find the second one here.

What is Deep Learning?

Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones.

I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning", MIT Press, 2016.
http://www.deeplearningbook.org/

Representation Learning

Use machine learning to discover not only the mapping from representation to output but also the representation itself.

I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning", MIT Press, 2016.
http://www.deeplearningbook.org/

e.g.:

  • What’s 1753 + 131?
  • What’s MMXVII - MCMXCI?

Our goal:

Learn a function that maps a set of raw electrical signals from the detector all the way to particle identification

From D. Whiteson: Deep learning tackles this by breaking down the task into a series of nested and simpler mappings, which can be described by each different layer of the model. Deeper nets give us the flexibility to sequentially learn more and more abstract features from iterative non-linear transformations of the input data. Higher layers of representation amplify aspects of the input that are important for the given task and suppress irrelevant variations.

Learning Recipe

In parametric, supervised learning:

  • Objective: in the form of a differentiable loss function, to calculate how far your current prediction is from the target
  • Model: parametrized by learnable parameters
  • Training strategy: how to update the parameters
  • Data: enough to learn from & test your performance

Practical Example

From CMS Software Tutorial, developed by Christian Sander and Alexander Schmidt, and available on .

</a> The samples come from a $t\bar{t}$ analysis and they are described in a very clear and accessible way in the documentation. This was chosen to provide you with a full demo of my typical workflow, going from a ROOT file to a fully trained Keras model.

Disclaimer: my applications won't make 100% physical sense -- please focus on the tools!

Before diving into the Deep Learning world, I want to spend a few minutes discussing some data handling techniques I use whenever I get started prototyping my applications.

NumPy


In [1]:
import numpy as np

In [2]:
np.array([[0,1,2], [0,0,0], [1,2,-1]])  #+1 /2 **2 .ravel() etc.


Out[2]:
array([[ 0,  1,  2],
       [ 0,  0,  0],
       [ 1,  2, -1]])

For a nice numpy intro, check out the CERN tutorial Loose your Loops with NumPy (and tons of online material).

Idea: What if those columns represented various branches and every line represented an event/physics object?

From ROOT to numpy: root_numpy

Very easy to turn your .root files into Machine Learning compliant inputs using numpy and `root_numpy`.


In [3]:
from numpy.lib.recfunctions import stack_arrays
from root_numpy import root2array, root2rec
import glob


Welcome to JupyROOT 6.08/02

Using one single function from root_numpy, you can open your .root file and turn it into an ndarray, a Python object equivalent to an n-dimensional matrix. All you need to do is to pass it the name of the file you'd like to open. Other keyword arguments are specified below.

Let's take a look at the MC signal sample from our CMS open dataset:


In [4]:
ttbar = root2array('files/ttbar.root')

In [5]:
# -- display your newly created object
ttbar


Out[5]:
array([ (3, array([ 14.53632641, -22.59884262,  47.94461823], dtype=float32), array([ 81.21743011, -57.12860489, -11.35455418], dtype=float32), array([  69.889328  ,  214.73963928,  103.36489105], dtype=float32), array([ 109.21035004,  223.63397217,  114.65055084], dtype=float32), array([ 2.04609799,  2.28738594, -1.        ], dtype=float32), array([ True,  True,  True], dtype=bool), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=int32), array([], dtype=float32), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=int32), array([], dtype=float32), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), -30.1446991 , -47.48961639, -24.28571701, -59.33085632,  220.39312744,  20.73790359,  83.93949127,  66.30121613,   7.07375669,  10.14721012,  -10.8865633 ,  55.89609146, -12.89020824,  119.76635742, -47.3605957 ,   13.39617157,   34.92552948,  15,   7.22068501, -44.57110214,   42.75763321,  6, False,  0.28620061),
       (6, array([ 28.44177246, -56.91559601,  33.16614532, -56.35861969,
        44.68463516,  -7.48202801], dtype=float32), array([-77.40626526,  27.35497093,  53.65788269,  26.63551712,
       -14.14289093, -34.91702271], dtype=float32), array([-249.8009491 ,  -83.38486481,  -35.81163406,   32.79572296,
         17.63115501,  -16.43756294], dtype=float32), array([ 263.24673462,  105.52606964,   72.99871826,   71.04415131,
         51.12100983,   40.03748703], dtype=float32), array([-1., -1., -1., -1., -1., -1.], dtype=float32), array([ True,  True,  True,  True,  True,  True], dtype=bool), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=int32), array([], dtype=float32), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=int32), array([], dtype=float32), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32),  -3.05776286,  17.73644257,   0.        ,   0.        ,    0.        ,   0.        ,   0.        ,   0.        ,   0.        ,   0.        ,    0.        ,   0.        ,   0.        ,    0.        ,   0.        ,    0.        ,    0.        ,   0,   0.        ,   0.        ,    0.        ,  3, False,  0.27002999),
       (4, array([-16.00138855, -41.80726624,  29.1941433 ,  38.47055435], dtype=float32), array([-59.66587067,  21.09321976, -28.67432213,   7.80847073], dtype=float32), array([-49.96570206, -29.62834167,  38.02173615, -14.96393204], dtype=float32), array([ 79.99531555,  56.04853821,  56.24638748,  42.51010513], dtype=float32), array([-1.        , -1.        ,  1.70723951, -1.        ], dtype=float32), array([ True,  True,  True,  True], dtype=bool), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=int32), array([], dtype=float32), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=int32), array([], dtype=float32), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), -13.46125984,   5.44226217, -12.39086437,  22.39970207,   85.69178009, -18.29978943, -57.43149948, -47.72479248,  59.78998184,  13.00851154,  -26.30075264,  28.60722542, -30.50980568,   35.85360718, -11.64190197,  -32.56383896, -263.4397583 , -13,  -6.770576  ,  36.39264679, -135.82580566,  6, False,  0.29352903),
       ...,
       (1, array([ 9.45587254], dtype=float32), array([ 56.32987595], dtype=float32), array([ 58.52116013], dtype=float32), array([ 83.46845245], dtype=float32), array([-1.], dtype=float32), array([ True], dtype=bool), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=int32), array([], dtype=float32), 2, array([ 123.97548676,  -41.90577698], dtype=float32), array([-20.59002304, -41.98944473], dtype=float32), array([ 25.7717514 ,  -1.04125142], dtype=float32), array([ 128.28894043,   59.33204651], dtype=float32), array([ 1, -1], dtype=int32), array([  1.47950327,  11.97175503], dtype=float32), 1, array([ 122.57876587], dtype=float32), array([-21.20843315], dtype=float32), array([ 25.54430962], dtype=float32), array([ 126.99552155], dtype=float32), array([ 3.75910282], dtype=float32), -73.85587311,  -4.27852106,   0.        ,   0.        ,    0.        ,   0.        ,   0.        ,   0.        ,   0.        ,   0.        ,    0.        ,   0.        ,   0.        ,    0.        ,   0.        ,    0.        ,    0.        ,   0,   0.        ,   0.        ,    0.        ,  4, False,  0.27138647),
       (3, array([ 34.38261795, -40.7835083 ,  38.38739395], dtype=float32), array([-36.50298691,  -6.54035282,  -3.93633842], dtype=float32), array([-39.89178467, -50.6091423 , -69.5358429 ], dtype=float32), array([ 65.95163727,  65.82172394,  79.7787323 ], dtype=float32), array([-1., -1., -1.], dtype=float32), array([ True,  True,  True], dtype=bool), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=int32), array([], dtype=float32), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=int32), array([], dtype=float32), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32),  14.25508308,  50.85000992, -57.41909027,  -8.61177063,  -68.42054749, -25.27488518,  12.40928173,  20.68931198,  53.04928207,  -6.79531813, -100.73949432,  26.85243416, -53.96256638,  -23.28963089,  38.08544922,    4.27271748,  -61.93172455,  15, -31.4263649 ,  42.42398071, -140.16157532, 11, False,  0.22419348),
       (3, array([  1.28875291,  54.24991226, -54.32022858], dtype=float32), array([-84.01944733, -25.61555099,  -8.0340786 ], dtype=float32), array([ 21.79267883,  59.1702919 ,  48.22095108], dtype=float32), array([ 87.86754608,  84.76728058,  73.52315521], dtype=float32), array([-1.        , -1.        ,  3.45734024], dtype=float32), array([ True,  True,  True], dtype=bool), 0, array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=float32), array([], dtype=int32), array([], dtype=float32), 1, array([ 10.35683537], dtype=float32), array([ 152.02658081], dtype=float32), array([-128.6199646], dtype=float32), array([ 199.4052124], dtype=float32), array([1], dtype=int32), array([ 3.53812528], dtype=float32), 1, array([ 12.24383163], dtype=float32), array([ 155.64831543], dtype=float32), array([-131.72909546], dtype=float32), array([ 204.27644348], dtype=float32), array([ 3.8334322], dtype=float32), -39.6864357 ,  -7.80921078, -64.75705719, -12.6721344 ,   52.86598587,  30.71007347, -14.85277843,  10.30566978,   1.36320496, -91.56234741,   25.71873665,  44.26504898, -22.72837067,   51.07799149,  10.75428677,  151.41537476, -129.31864929, -11, -15.81485939,  -3.38039279,   -6.43942213,  8, False,  0.29903701)],
      dtype=[('NJet', '<i4'), ('Jet_Px', 'O'), ('Jet_Py', 'O'), ('Jet_Pz', 'O'), ('Jet_E', 'O'), ('Jet_btag', 'O'), ('Jet_ID', 'O'), ('NMuon', '<i4'), ('Muon_Px', 'O'), ('Muon_Py', 'O'), ('Muon_Pz', 'O'), ('Muon_E', 'O'), ('Muon_Charge', 'O'), ('Muon_Iso', 'O'), ('NElectron', '<i4'), ('Electron_Px', 'O'), ('Electron_Py', 'O'), ('Electron_Pz', 'O'), ('Electron_E', 'O'), ('Electron_Charge', 'O'), ('Electron_Iso', 'O'), ('NPhoton', '<i4'), ('Photon_Px', 'O'), ('Photon_Py', 'O'), ('Photon_Pz', 'O'), ('Photon_E', 'O'), ('Photon_Iso', 'O'), ('MET_px', '<f4'), ('MET_py', '<f4'), ('MChadronicBottom_px', '<f4'), ('MChadronicBottom_py', '<f4'), ('MChadronicBottom_pz', '<f4'), ('MCleptonicBottom_px', '<f4'), ('MCleptonicBottom_py', '<f4'), ('MCleptonicBottom_pz', '<f4'), ('MChadronicWDecayQuark_px', '<f4'), ('MChadronicWDecayQuark_py', '<f4'), ('MChadronicWDecayQuark_pz', '<f4'), ('MChadronicWDecayQuarkBar_px', '<f4'), ('MChadronicWDecayQuarkBar_py', '<f4'), ('MChadronicWDecayQuarkBar_pz', '<f4'), ('MClepton_px', '<f4'), ('MClepton_py', '<f4'), ('MClepton_pz', '<f4'), ('MCleptonPDGid', '<i4'), ('MCneutrino_px', '<f4'), ('MCneutrino_py', '<f4'), ('MCneutrino_pz', '<f4'), ('NPrimaryVertices', '<i4'), ('triggerIsoMu24', '?'), ('EventWeight', '<f4')])

In [6]:
# -- what data type is it?
type(ttbar)


Out[6]:
numpy.ndarray

In [7]:
# -- how many events are present?
ttbar.shape


Out[7]:
(36941,)

In [8]:
# -- what are the names of the branches?
ttbar.dtype.names


Out[8]:
('NJet',
 'Jet_Px',
 'Jet_Py',
 'Jet_Pz',
 'Jet_E',
 'Jet_btag',
 'Jet_ID',
 'NMuon',
 'Muon_Px',
 'Muon_Py',
 'Muon_Pz',
 'Muon_E',
 'Muon_Charge',
 'Muon_Iso',
 'NElectron',
 'Electron_Px',
 'Electron_Py',
 'Electron_Pz',
 'Electron_E',
 'Electron_Charge',
 'Electron_Iso',
 'NPhoton',
 'Photon_Px',
 'Photon_Py',
 'Photon_Pz',
 'Photon_E',
 'Photon_Iso',
 'MET_px',
 'MET_py',
 'MChadronicBottom_px',
 'MChadronicBottom_py',
 'MChadronicBottom_pz',
 'MCleptonicBottom_px',
 'MCleptonicBottom_py',
 'MCleptonicBottom_pz',
 'MChadronicWDecayQuark_px',
 'MChadronicWDecayQuark_py',
 'MChadronicWDecayQuark_pz',
 'MChadronicWDecayQuarkBar_px',
 'MChadronicWDecayQuarkBar_py',
 'MChadronicWDecayQuarkBar_pz',
 'MClepton_px',
 'MClepton_py',
 'MClepton_pz',
 'MCleptonPDGid',
 'MCneutrino_px',
 'MCneutrino_py',
 'MCneutrino_pz',
 'NPrimaryVertices',
 'triggerIsoMu24',
 'EventWeight')

Pandas Dataframes


In [9]:
import pandas as pd

One way of manipulating your data (slicing, filtering, removing variables, creating new features, taking operations of branches) in a simple, visually appealing way is to use `pandas` dataframes, a beatiful and efficient Python data structure library. Recommended for exploratory data analysis, probably not for high performance applications.


In [10]:
# -- how to turn an ndarray into a pandas dataframe
df = pd.DataFrame(ttbar)

In [11]:
# -- better way of displaying your data
df.head() # print the first few entries


Out[11]:
NJet Jet_Px Jet_Py Jet_Pz Jet_E Jet_btag Jet_ID NMuon Muon_Px Muon_Py ... MClepton_px MClepton_py MClepton_pz MCleptonPDGid MCneutrino_px MCneutrino_py MCneutrino_pz NPrimaryVertices triggerIsoMu24 EventWeight
0 3 [14.5363, -22.5988, 47.9446] [81.2174, -57.1286, -11.3546] [69.8893, 214.74, 103.365] [109.21, 223.634, 114.651] [2.0461, 2.28739, -1.0] [True, True, True] 0 [] [] ... -47.360596 13.396172 34.925529 15 7.220685 -44.571102 42.757633 6 False 0.286201
1 6 [28.4418, -56.9156, 33.1661, -56.3586, 44.6846... [-77.4063, 27.355, 53.6579, 26.6355, -14.1429,... [-249.801, -83.3849, -35.8116, 32.7957, 17.631... [263.247, 105.526, 72.9987, 71.0442, 51.121, 4... [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0] [True, True, True, True, True, True] 0 [] [] ... 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 3 False 0.270030
2 4 [-16.0014, -41.8073, 29.1941, 38.4706] [-59.6659, 21.0932, -28.6743, 7.80847] [-49.9657, -29.6283, 38.0217, -14.9639] [79.9953, 56.0485, 56.2464, 42.5101] [-1.0, -1.0, 1.70724, -1.0] [True, True, True, True] 0 [] [] ... -11.641902 -32.563839 -263.439758 -13 -6.770576 36.392647 -135.825806 6 False 0.293529
3 2 [-40.8037, 37.5208] [34.4222, -35.3076] [-47.5731, -67.0822] [72.2569, 85.3261] [3.63089, 3.02456] [True, True] 1 [55.77] [12.2147] ... 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 10 False 0.192474
4 3 [-67.6013, 70.1426, 27.3418] [83.3236, -71.9012, 50.1827] [78.5999, -8.07146, 22.9956] [135.263, 101.793, 62.2964] [-1.0, -1.0, 3.74993] [True, True, True] 0 [] [] ... -49.375671 -30.740866 57.628201 11 2.897522 -13.882012 93.436157 16 False 0.036131

5 rows × 51 columns


In [12]:
# -- ... or the last few
df.tail()


Out[12]:
NJet Jet_Px Jet_Py Jet_Pz Jet_E Jet_btag Jet_ID NMuon Muon_Px Muon_Py ... MClepton_px MClepton_py MClepton_pz MCleptonPDGid MCneutrino_px MCneutrino_py MCneutrino_pz NPrimaryVertices triggerIsoMu24 EventWeight
36936 4 [130.752, -88.6566, 52.0555, -38.6537] [-25.4033, 73.8251, -74.2651, -30.6417] [-120.235, -287.902, -124.879, -128.15] [180.098, 310.715, 154.604, 137.425] [-1.0, 2.17192, -1.0, -1.0] [True, True, True, True] 1 [-44.8889] [42.0714] ... -42.990280 40.279636 -78.633194 -13 13.631536 -7.189492 9.089453 12 True 0.019638
36937 7 [103.327, -76.9673, -1.77329, -36.4286, 40.127... [28.3643, 19.0497, -52.6751, -36.7993, -30.191... [147.432, 428.396, 70.9709, 98.6322, 31.8465, ... [183.936, 435.736, 88.666, 111.896, 59.8123, 1... [1.61034, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0] [True, True, True, True, True, True, True] 0 [] [] ... 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 5 False 0.293529
36938 1 [9.45587] [56.3299] [58.5212] [83.4685] [-1.0] [True] 0 [] [] ... 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 4 False 0.271386
36939 3 [34.3826, -40.7835, 38.3874] [-36.503, -6.54035, -3.93634] [-39.8918, -50.6091, -69.5358] [65.9516, 65.8217, 79.7787] [-1.0, -1.0, -1.0] [True, True, True] 0 [] [] ... 38.085449 4.272717 -61.931725 15 -31.426365 42.423981 -140.161575 11 False 0.224193
36940 3 [1.28875, 54.2499, -54.3202] [-84.0194, -25.6156, -8.03408] [21.7927, 59.1703, 48.221] [87.8675, 84.7673, 73.5232] [-1.0, -1.0, 3.45734] [True, True, True] 0 [] [] ... 10.754287 151.415375 -129.318649 -11 -15.814859 -3.380393 -6.439422 8 False 0.299037

5 rows × 51 columns


In [13]:
# -- check the shape: it should be [nb_events, nb_variables]
df.shape


Out[13]:
(36941, 51)

In [14]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36941 entries, 0 to 36940
Data columns (total 51 columns):
NJet                           36941 non-null int32
Jet_Px                         36941 non-null object
Jet_Py                         36941 non-null object
Jet_Pz                         36941 non-null object
Jet_E                          36941 non-null object
Jet_btag                       36941 non-null object
Jet_ID                         36941 non-null object
NMuon                          36941 non-null int32
Muon_Px                        36941 non-null object
Muon_Py                        36941 non-null object
Muon_Pz                        36941 non-null object
Muon_E                         36941 non-null object
Muon_Charge                    36941 non-null object
Muon_Iso                       36941 non-null object
NElectron                      36941 non-null int32
Electron_Px                    36941 non-null object
Electron_Py                    36941 non-null object
Electron_Pz                    36941 non-null object
Electron_E                     36941 non-null object
Electron_Charge                36941 non-null object
Electron_Iso                   36941 non-null object
NPhoton                        36941 non-null int32
Photon_Px                      36941 non-null object
Photon_Py                      36941 non-null object
Photon_Pz                      36941 non-null object
Photon_E                       36941 non-null object
Photon_Iso                     36941 non-null object
MET_px                         36941 non-null float32
MET_py                         36941 non-null float32
MChadronicBottom_px            36941 non-null float32
MChadronicBottom_py            36941 non-null float32
MChadronicBottom_pz            36941 non-null float32
MCleptonicBottom_px            36941 non-null float32
MCleptonicBottom_py            36941 non-null float32
MCleptonicBottom_pz            36941 non-null float32
MChadronicWDecayQuark_px       36941 non-null float32
MChadronicWDecayQuark_py       36941 non-null float32
MChadronicWDecayQuark_pz       36941 non-null float32
MChadronicWDecayQuarkBar_px    36941 non-null float32
MChadronicWDecayQuarkBar_py    36941 non-null float32
MChadronicWDecayQuarkBar_pz    36941 non-null float32
MClepton_px                    36941 non-null float32
MClepton_py                    36941 non-null float32
MClepton_pz                    36941 non-null float32
MCleptonPDGid                  36941 non-null int32
MCneutrino_px                  36941 non-null float32
MCneutrino_py                  36941 non-null float32
MCneutrino_pz                  36941 non-null float32
NPrimaryVertices               36941 non-null int32
triggerIsoMu24                 36941 non-null bool
EventWeight                    36941 non-null float32
dtypes: bool(1), float32(21), int32(6), object(23)
memory usage: 10.3+ MB

In [15]:
df.keys() #df.columns


Out[15]:
Index([u'NJet', u'Jet_Px', u'Jet_Py', u'Jet_Pz', u'Jet_E', u'Jet_btag',
       u'Jet_ID', u'NMuon', u'Muon_Px', u'Muon_Py', u'Muon_Pz', u'Muon_E',
       u'Muon_Charge', u'Muon_Iso', u'NElectron', u'Electron_Px',
       u'Electron_Py', u'Electron_Pz', u'Electron_E', u'Electron_Charge',
       u'Electron_Iso', u'NPhoton', u'Photon_Px', u'Photon_Py', u'Photon_Pz',
       u'Photon_E', u'Photon_Iso', u'MET_px', u'MET_py',
       u'MChadronicBottom_px', u'MChadronicBottom_py', u'MChadronicBottom_pz',
       u'MCleptonicBottom_px', u'MCleptonicBottom_py', u'MCleptonicBottom_pz',
       u'MChadronicWDecayQuark_px', u'MChadronicWDecayQuark_py',
       u'MChadronicWDecayQuark_pz', u'MChadronicWDecayQuarkBar_px',
       u'MChadronicWDecayQuarkBar_py', u'MChadronicWDecayQuarkBar_pz',
       u'MClepton_px', u'MClepton_py', u'MClepton_pz', u'MCleptonPDGid',
       u'MCneutrino_px', u'MCneutrino_py', u'MCneutrino_pz',
       u'NPrimaryVertices', u'triggerIsoMu24', u'EventWeight'],
      dtype='object')

To summarize, if you want to go directly from .root files to pandas dataframes, you can do so in 3 lines of Python code. I like to use this function below in all my application whenever I load in data from a ROOT file. Feel free to copy it and use it!


In [16]:
def root2pandas(files_path, tree_name, **kwargs):
    '''
    Args:
    -----
        files_path: a string like './data/*.root', for example
        tree_name: a string like 'Collection_Tree' corresponding to the name of the folder inside the root 
                   file that we want to open
        kwargs: arguments taken by root2array, such as branches to consider, start, stop, step, etc
    Returns:
    --------    
        output_panda: a pandas dataframe like allbkg_df in which all the info from the root file will be stored
    
    Note:
    -----
        if you are working with .root files that contain different branches, you might have to mask your data
        in that case, return pd.DataFrame(ss.data)
    '''
    # -- create list of .root files to process
    files = glob.glob(files_path)
    
    # -- process ntuples into rec arrays
    ss = stack_arrays([root2array(fpath, tree_name, **kwargs).view(np.recarray) for fpath in files])

    try:
        return pd.DataFrame(ss)
    except Exception:
        return pd.DataFrame(ss.data)

In [17]:
# -- usage of root2pandas
singletop = root2pandas('./files/single_top.root', 'events')

We just turned a HEP-specific ROOT file into a standard data format that can be used by any ML expert. You can now save your data out to widely accepted data formats such as HDF5 or even CSV, share it with your collaborators from the ML community without them having to learn how to use the ROOT library or other CERN-specific analysis tools.


In [28]:
# -- save a pandas df to hdf5 (better to first convert it back to ndarray, to be fair)
import deepdish.io as io
io.save('ttbar.h5', df)


/Users/mp744/venvs/keras2tf1/lib/python2.7/site-packages/ipykernel_launcher.py:3: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block3_values] [items->['Jet_Px', 'Jet_Py', 'Jet_Pz', 'Jet_E', 'Jet_btag', 'Jet_ID', 'Muon_Px', 'Muon_Py', 'Muon_Pz', 'Muon_E', 'Muon_Charge', 'Muon_Iso', 'Electron_Px', 'Electron_Py', 'Electron_Pz', 'Electron_E', 'Electron_Charge', 'Electron_Iso', 'Photon_Px', 'Photon_Py', 'Photon_Pz', 'Photon_E', 'Photon_Iso']]

  This is separate from the ipykernel package so we can avoid doing imports until

In [29]:
singletop.to_hdf('try.h5', 'branches')


/Users/mp744/venvs/keras2tf1/lib/python2.7/site-packages/pandas/core/generic.py:1138: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block3_values] [items->['Jet_Px', 'Jet_Py', 'Jet_Pz', 'Jet_E', 'Jet_btag', 'Jet_ID', 'Muon_Px', 'Muon_Py', 'Muon_Pz', 'Muon_E', 'Muon_Charge', 'Muon_Iso', 'Electron_Px', 'Electron_Py', 'Electron_Pz', 'Electron_E', 'Electron_Charge', 'Electron_Iso', 'Photon_Px', 'Photon_Py', 'Photon_Pz', 'Photon_E', 'Photon_Iso']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

In [30]:
# -- let's load it back in to make sure it actually worked!
new_df = io.load('ttbar.h5')
new_df.head()


Out[30]:
NJet Jet_Px Jet_Py Jet_Pz Jet_E Jet_btag Jet_ID NMuon Muon_Px Muon_Py ... MClepton_px MClepton_py MClepton_pz MCleptonPDGid MCneutrino_px MCneutrino_py MCneutrino_pz NPrimaryVertices triggerIsoMu24 EventWeight
0 3 [14.5363, -22.5988, 47.9446] [81.2174, -57.1286, -11.3546] [69.8893, 214.74, 103.365] [109.21, 223.634, 114.651] [2.0461, 2.28739, -1.0] [True, True, True] 0 [] [] ... -47.360596 13.396172 34.925529 15 7.220685 -44.571102 42.757633 6 False 0.286201
1 6 [28.4418, -56.9156, 33.1661, -56.3586, 44.6846... [-77.4063, 27.355, 53.6579, 26.6355, -14.1429,... [-249.801, -83.3849, -35.8116, 32.7957, 17.631... [263.247, 105.526, 72.9987, 71.0442, 51.121, 4... [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0] [True, True, True, True, True, True] 0 [] [] ... 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 3 False 0.270030
2 4 [-16.0014, -41.8073, 29.1941, 38.4706] [-59.6659, 21.0932, -28.6743, 7.80847] [-49.9657, -29.6283, 38.0217, -14.9639] [79.9953, 56.0485, 56.2464, 42.5101] [-1.0, -1.0, 1.70724, -1.0] [True, True, True, True] 0 [] [] ... -11.641902 -32.563839 -263.439758 -13 -6.770576 36.392647 -135.825806 6 False 0.293529
3 2 [-40.8037, 37.5208] [34.4222, -35.3076] [-47.5731, -67.0822] [72.2569, 85.3261] [3.63089, 3.02456] [True, True] 1 [55.77] [12.2147] ... 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 10 False 0.192474
4 3 [-67.6013, 70.1426, 27.3418] [83.3236, -71.9012, 50.1827] [78.5999, -8.07146, 22.9956] [135.263, 101.793, 62.2964] [-1.0, -1.0, 3.74993] [True, True, True] 0 [] [] ... -49.375671 -30.740866 57.628201 11 2.897522 -13.882012 93.436157 16 False 0.036131

5 rows × 51 columns


In [31]:
# -- check the shape again -- nice check to run every time you create a df
new_df.shape


Out[31]:
(36941, 51)

Now, let's create a new dataframe that contains only jet-related branches by slicing our pre-existing ttbar dataframe


In [19]:
# slice the dataframe
jet_df = df[[key for key in df.keys() if key.startswith('Jet')]]
jet_df.head()


Out[19]:
Jet_Px Jet_Py Jet_Pz Jet_E Jet_btag Jet_ID
0 [14.5363, -22.5988, 47.9446] [81.2174, -57.1286, -11.3546] [69.8893, 214.74, 103.365] [109.21, 223.634, 114.651] [2.0461, 2.28739, -1.0] [True, True, True]
1 [28.4418, -56.9156, 33.1661, -56.3586, 44.6846... [-77.4063, 27.355, 53.6579, 26.6355, -14.1429,... [-249.801, -83.3849, -35.8116, 32.7957, 17.631... [263.247, 105.526, 72.9987, 71.0442, 51.121, 4... [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0] [True, True, True, True, True, True]
2 [-16.0014, -41.8073, 29.1941, 38.4706] [-59.6659, 21.0932, -28.6743, 7.80847] [-49.9657, -29.6283, 38.0217, -14.9639] [79.9953, 56.0485, 56.2464, 42.5101] [-1.0, -1.0, 1.70724, -1.0] [True, True, True, True]
3 [-40.8037, 37.5208] [34.4222, -35.3076] [-47.5731, -67.0822] [72.2569, 85.3261] [3.63089, 3.02456] [True, True]
4 [-67.6013, 70.1426, 27.3418] [83.3236, -71.9012, 50.1827] [78.5999, -8.07146, 22.9956] [135.263, 101.793, 62.2964] [-1.0, -1.0, 3.74993] [True, True, True]

This would be useful if you wanted to classify your events only by using the properties of jets in each event.

What if your application involved classifying jets, instead of events? In this case, you might want to turn your dataset from event-flat to jet-flat, i.e. a dataframe in which every row represents a jet and every column is a property of this jet. This is extremely easy to do using pandas and numpy:


In [20]:
def flatten(column):
    '''
    Args:
    -----
        column: a column of a pandas df whose entries are lists (or regular entries -- in which case nothing is done)
                e.g.: my_df['some_variable'] 

    Returns:
    --------    
        flattened out version of the column. 

        For example, it will turn:
        [1791, 2719, 1891]
        [1717, 1, 0, 171, 9181, 537, 12]
        [82, 11]
        ...
        into:
        1791, 2719, 1891, 1717, 1, 0, 171, 9181, 537, 12, 82, 11, ...
    '''
    try:
        return np.array([v for e in column for v in e])
    except (TypeError, ValueError):
        return column

In [21]:
# -- ok, let's try it out!
df_flat = pd.DataFrame({k: flatten(c) for k, c in jet_df.iteritems()})
df_flat.head()


Out[21]:
Jet_E Jet_ID Jet_Px Jet_Py Jet_Pz Jet_btag
0 109.210350 True 14.536326 81.217430 69.889328 2.046098
1 223.633972 True -22.598843 -57.128605 214.739639 2.287386
2 114.650551 True 47.944618 -11.354554 103.364891 -1.000000
3 263.246735 True 28.441772 -77.406265 -249.800949 -1.000000
4 105.526070 True -56.915596 27.354971 -83.384865 -1.000000

Plot your data using Matplotlib

Using pandas in conjunction with matplotlib, you can also inspect your variables super quickly. Check out the following cells for a quick example.


In [22]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [23]:
# iterate through the columns
for key in df_flat.keys():
    # plotting settings
    matplotlib.rcParams.update({'font.size': 16})
    fig = plt.figure(figsize=(5, 5), dpi=100)
    bins = np.linspace(min(df_flat[key]), max(df_flat[key]), 30)
    # plot!
    _ = plt.hist(df_flat[key], bins=bins, histtype='step', label=r'$t\overline{t}$')
    # decorate
    plt.xlabel(key)
    plt.ylabel('Number of Jets')
    plt.legend()
    plt.plot()


It's really easy and intuitive to add new columns to a dataframe. You can also define them as functions of other columns. This is great if you need to build your own hand-crafted variables.


In [24]:
df['Jet_P'] = (df['Jet_Px']**2 + df['Jet_Py']**2 + df['Jet_Pz']**2)**(0.5)

In [25]:
# -- again, you can easily slice dataframes by specifying the names of the branches you would like to select 
df[['Jet_P', 'Jet_Px', 'Jet_Py', 'Jet_Pz', 'Jet_E']].head()


Out[25]:
Jet_P Jet_Px Jet_Py Jet_Pz Jet_E
0 [108.13, 223.355, 114.507] [14.5363, -22.5988, 47.9446] [81.2174, -57.1286, -11.3546] [69.8893, 214.74, 103.365] [109.21, 223.634, 114.651]
1 [263.061, 104.598, 72.5371, 70.4365, 50.0759, ... [28.4418, -56.9156, 33.1661, -56.3586, 44.6846... [-77.4063, 27.355, 53.6579, 26.6355, -14.1429,... [-249.801, -83.3849, -35.8116, 32.7957, 17.631... [263.247, 105.526, 72.9987, 71.0442, 51.121, 4...
2 [79.4521, 55.4131, 55.8585, 42.0104] [-16.0014, -41.8073, 29.1941, 38.4706] [-59.6659, 21.0932, -28.6743, 7.80847] [-49.9657, -29.6283, 38.0217, -14.9639] [79.9953, 56.0485, 56.2464, 42.5101]
3 [71.5054, 84.584] [-40.8037, 37.5208] [34.4222, -35.3076] [-47.5731, -67.0822] [72.2569, 85.3261]
4 [133.006, 100.772, 61.6009] [-67.6013, 70.1426, 27.3418] [83.3236, -71.9012, 50.1827] [78.5999, -8.07146, 22.9956] [135.263, 101.793, 62.2964]

In [26]:
# -- you can also build four vectors and store them in a new column in 1 line of code 
from rootpy.vector import LorentzVector
df['Jet_4V'] = [map(lambda args: LorentzVector(*args), zip(px, py, pz, e)) for 
                (_, (px, py, pz, e)) in df[['Jet_Px', 'Jet_Py', 'Jet_Pz', 'Jet_E']].iterrows()]

In [27]:
# -- look at the 4-vectors of the jets in the first 5 events
[_ for _ in df['Jet_4V'].head()]


Out[27]:
[[LorentzVector(px=14.536326, py=81.217430, pz=69.889328, E=109.210350),
  LorentzVector(px=-22.598843, py=-57.128605, pz=214.739639, E=223.633972),
  LorentzVector(px=47.944618, py=-11.354554, pz=103.364891, E=114.650551)],
 [LorentzVector(px=28.441772, py=-77.406265, pz=-249.800949, E=263.246735),
  LorentzVector(px=-56.915596, py=27.354971, pz=-83.384865, E=105.526070),
  LorentzVector(px=33.166145, py=53.657883, pz=-35.811634, E=72.998718),
  LorentzVector(px=-56.358620, py=26.635517, pz=32.795723, E=71.044151),
  LorentzVector(px=44.684635, py=-14.142891, pz=17.631155, E=51.121010),
  LorentzVector(px=-7.482028, py=-34.917023, pz=-16.437563, E=40.037487)],
 [LorentzVector(px=-16.001389, py=-59.665871, pz=-49.965702, E=79.995316),
  LorentzVector(px=-41.807266, py=21.093220, pz=-29.628342, E=56.048538),
  LorentzVector(px=29.194143, py=-28.674322, pz=38.021736, E=56.246387),
  LorentzVector(px=38.470554, py=7.808471, pz=-14.963932, E=42.510105)],
 [LorentzVector(px=-40.803673, py=34.422157, pz=-47.573055, E=72.256897),
  LorentzVector(px=37.520752, py=-35.307610, pz=-67.082169, E=85.326126)],
 [LorentzVector(px=-67.601349, py=83.323586, pz=78.599915, E=135.262543),
  LorentzVector(px=70.142593, py=-71.901230, pz=-8.071459, E=101.792641),
  LorentzVector(px=27.341812, py=50.182686, pz=22.995554, E=62.296364)]]

In [28]:
# -- calculate the mass (or any other property) of all the jets in the first event
[jet.M() for jet in df['Jet_4V'][0]]


Out[28]:
[15.323401924031147, 11.164929207443834, 5.730252080241989]

There is obviously lots you can do with your data once you turn it into a standard Python object and move away from ROOT-specific classes. You can now take advantage of state-of-the-art Data Science and Machine Learning libraries to transform your data, while still recovering the functionalities you're used to.

You can now move on to the second notebook.


In [ ]: