This is the first of two notebooks for today's tutorial. You will find the second one here.
Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones.
The idea is to use machine learning to discover not only the mapping from representation to output, but also the representation itself.
e.g.:
Our goal:
Learn a function that maps a set of raw electrical signals from the detector all the way to particle identification.
From D. Whiteson:
In parametric, supervised learning:
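In the standard formulation (stated here just for orientation): given a training set of labeled examples $(x_i, y_i)$, we pick a parametric family of functions $f_\theta$ and choose the parameters that minimize a loss over the training set,

$$\theta^* = \arg\min_\theta \sum_i L\big(f_\theta(x_i),\, y_i\big).$$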
From CMS Software Tutorial, developed by Christian Sander and Alexander Schmidt, and available on
In this tutorial, we will go all the way from a ROOT file to a fully trained Keras model.
Disclaimer: my applications won't make 100% physical sense -- please focus on the tools!
Before diving into the Deep Learning world, I want to spend a few minutes discussing some data handling techniques I use whenever I get started prototyping my applications.
In [1]:
import numpy as np
In [2]:
np.array([[0, 1, 2], [0, 0, 0], [1, 2, -1]])  # try appending +1, /2, **2, .ravel(), etc.
Out[2]:
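To see what the trailing comment above hints at, here's a minimal sketch of those elementwise numpy operations (the variable name `a` is just for illustration):
a = np.array([[0, 1, 2], [0, 0, 0], [1, 2, -1]])
a + 1       # elementwise addition
a / 2.0     # elementwise division
a ** 2      # elementwise square
a.ravel()   # flatten the 2D array into a 1D array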
For a nice numpy intro, check out the CERN tutorial Lose your Loops with NumPy (and the tons of material available online).
Idea: What if those columns represented various branches and every line represented an event/physics object?
It's very easy to turn your .root files into machine-learning-ready inputs using numpy and `root_numpy`.
In [3]:
from numpy.lib.recfunctions import stack_arrays
from root_numpy import root2array, root2rec
import glob
Using a single function from root_numpy, you can open your .root file and turn it into an ndarray, a Python object equivalent to an n-dimensional matrix. All you need to do is pass it the name of the file you'd like to open. Other keyword arguments are described below.
Let's take a look at the MC signal sample from our CMS open dataset:
In [4]:
ttbar = root2array('files/ttbar.root')
In [5]:
# -- display your newly created object
ttbar
Out[5]:
In [6]:
# -- what data type is it?
type(ttbar)
Out[6]:
In [7]:
# -- how many events are present?
ttbar.shape
Out[7]:
In [8]:
# -- what are the names of the branches?
ttbar.dtype.names
Out[8]:
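Since root2array returns a structured array, each branch can be pulled out by name, just like a column. For example, using one of the jet branches from this ntuple:
# -- access a single branch as an array of per-event values
ttbar['Jet_Px']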
In [9]:
import pandas as pd
One way of manipulating your data (slicing, filtering, removing variables, creating new features, applying operations to branches) in a simple, visually appealing way is to use `pandas` dataframes, from a beautiful and efficient Python data structure library. It's recommended for exploratory data analysis, though probably not for high-performance applications.
In [10]:
# -- how to turn an ndarray into a pandas dataframe
df = pd.DataFrame(ttbar)
In [11]:
# -- better way of displaying your data
df.head() # print the first few entries
Out[11]:
In [12]:
# -- ... or the last few
df.tail()
Out[12]:
In [13]:
# -- check the shape: it should be [nb_events, nb_variables]
df.shape
Out[13]:
In [14]:
df.info()
In [15]:
df.keys() #df.columns
Out[15]:
To summarize, if you want to go directly from .root files to pandas dataframes, you can do so in 3 lines of Python code. I like to use the function below in all my applications whenever I load in data from a ROOT file. Feel free to copy it and use it!
In [16]:
def root2pandas(files_path, tree_name, **kwargs):
    '''
    Args:
    -----
        files_path: a string like './data/*.root', for example
        tree_name: a string like 'Collection_Tree' corresponding to the name of the tree inside the
                   root file that we want to open
        kwargs: arguments taken by root2array, such as branches to consider, start, stop, step, etc.
    Returns:
    --------
        output_panda: a pandas dataframe like allbkg_df in which all the info from the root file will be stored
    Note:
    -----
        if you are working with .root files that contain different branches, you might have to mask your data
        in that case, return pd.DataFrame(ss.data)
    '''
    # -- create list of .root files to process
    files = glob.glob(files_path)

    # -- process ntuples into rec arrays
    ss = stack_arrays([root2array(fpath, tree_name, **kwargs).view(np.recarray) for fpath in files])

    try:
        return pd.DataFrame(ss)
    except Exception:
        return pd.DataFrame(ss.data)
In [17]:
# -- usage of root2pandas
singletop = root2pandas('./files/single_top.root', 'events')
We just turned a HEP-specific ROOT file into a standard data format that any ML practitioner can use. You can now save your data out to widely accepted formats such as HDF5 or even CSV, and share it with your collaborators from the ML community without them having to learn ROOT or other CERN-specific analysis tools.
In [28]:
# -- save a pandas df to hdf5 (better to first convert it back to ndarray, to be fair)
import deepdish.io as io
io.save('ttbar.h5', df)
In [29]:
singletop.to_hdf('try.h5', 'branches')
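If you ever need a plain-text format instead, pandas can also write CSV directly. Just note that columns whose entries are per-event arrays won't round-trip cleanly through CSV, so this works best on flat dataframes:
# -- save a dataframe to CSV
singletop.to_csv('single_top.csv', index=False)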
In [30]:
# -- let's load it back in to make sure it actually worked!
new_df = io.load('ttbar.h5')
new_df.head()
Out[30]:
In [31]:
# -- check the shape again -- nice check to run every time you create a df
new_df.shape
Out[31]:
Now, let's create a new dataframe that contains only jet-related branches by slicing our pre-existing ttbar dataframe.
In [19]:
# slice the dataframe
jet_df = df[[key for key in df.keys() if key.startswith('Jet')]]
jet_df.head()
Out[19]:
This would be useful if you wanted to classify your events only by using the properties of jets in each event.
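Row selections (i.e., event-level cuts) work just as easily with boolean masks. As a minimal sketch, here's how you could keep only events containing at least two jets, using the length of each per-event Jet_Px array as the jet count:
# -- boolean mask: keep only events with at least two jets
two_jet_events = df[df['Jet_Px'].apply(len) >= 2]
two_jet_events.shape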
What if your application involved classifying jets, instead of events? In this case, you might want to turn your dataset from event-flat to jet-flat, i.e. a dataframe in which every row represents a jet and every column is a property of this jet. This is extremely easy to do using pandas and numpy:
In [20]:
def flatten(column):
    '''
    Args:
    -----
        column: a column of a pandas df whose entries are lists (or regular entries -- in which case
                nothing is done)
                e.g.: my_df['some_variable']
    Returns:
    --------
        flattened out version of the column.
        For example, it will turn:
        [1791, 2719, 1891]
        [1717, 1, 0, 171, 9181, 537, 12]
        [82, 11]
        ...
        into:
        1791, 2719, 1891, 1717, 1, 0, 171, 9181, 537, 12, 82, 11, ...
    '''
    try:
        return np.array([v for e in column for v in e])
    except (TypeError, ValueError):
        return column
In [21]:
# -- ok, let's try it out!
df_flat = pd.DataFrame({k: flatten(c) for k, c in jet_df.iteritems()})
df_flat.head()
Out[21]:
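As usual, check the shape: the number of rows should now be the total number of jets across all events, which will be larger than the number of events we started with.
# -- check the shape: it should now be [nb_jets, nb_variables]
df_flat.shape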
Using pandas in conjunction with matplotlib, you can also inspect your variables super quickly. Check out the following cells for a quick example.
In [22]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
In [23]:
# iterate through the columns
for key in df_flat.keys():
    # plotting settings
    matplotlib.rcParams.update({'font.size': 16})
    fig = plt.figure(figsize=(5, 5), dpi=100)
    bins = np.linspace(min(df_flat[key]), max(df_flat[key]), 30)
    # plot!
    _ = plt.hist(df_flat[key], bins=bins, histtype='step', label=r'$t\overline{t}$')
    # decorate
    plt.xlabel(key)
    plt.ylabel('Number of Jets')
    plt.legend()
    plt.show()
It's really easy and intuitive to add new columns to a dataframe. You can also define them as functions of other columns. This is great if you need to build your own hand-crafted variables.
In [24]:
df['Jet_P'] = (df['Jet_Px']**2 + df['Jet_Py']**2 + df['Jet_Pz']**2)**(0.5)
In [25]:
# -- again, you can easily slice dataframes by specifying the names of the branches you would like to select
df[['Jet_P', 'Jet_Px', 'Jet_Py', 'Jet_Pz', 'Jet_E']].head()
Out[25]:
In [26]:
# -- you can also build four vectors and store them in a new column in 1 line of code
from rootpy.vector import LorentzVector
df['Jet_4V'] = [[LorentzVector(*args) for args in zip(px, py, pz, e)]
                for (_, (px, py, pz, e)) in df[['Jet_Px', 'Jet_Py', 'Jet_Pz', 'Jet_E']].iterrows()]
In [27]:
# -- look at the 4-vectors of the jets in the first 5 events
[_ for _ in df['Jet_4V'].head()]
Out[27]:
In [28]:
# -- calculate the mass (or any other property) of all the jets in the first event
[jet.M() for jet in df['Jet_4V'][0]]
Out[28]:
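If you'd rather avoid rootpy entirely, you can cross-check those masses with plain numpy using the Jet_P column we defined above. This is just a sketch: it relies on each entry being a numpy array (so the operations act elementwise), and rounding can occasionally push the argument of the square root slightly negative for near-massless jets.
# -- numpy-only cross-check: m = sqrt(E^2 - |p|^2) for the jets in the first event
((df['Jet_E']**2 - df['Jet_P']**2) ** 0.5)[0]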
There is obviously lots you can do with your data once you turn it into a standard Python object and move away from ROOT-specific classes. You can now take advantage of state-of-the-art Data Science and Machine Learning libraries to transform your data, while still recovering the functionalities you're used to.
You can now move on to the second notebook.