This tutorial is meant to teach a beginning HEP under-graduate or graduate student who may be unfamiliar with python or data science in python to train Deep Learning models using Keras. Usually, tutorials attempt to familiarize you with a software package by leading you through the steps of a few tasks. Like most tutorials, there are many sections where you can simply follow the instructions and execute example code like a robot (usualy via copy/paste into a terminal, but in this case using Jupyter notebooks). But this tutorial also aims to teach you key concepts in scientific computing, Machine Learning, and Deep Learning in python through exercises that require you to slow down, think critically, and apply what you read. The exercises were derived from labs for a Deep Learning in HEP course for undergrads (taught at University of Texas at Arlington by Amir Farbin).
The tutorial is divided into three sections:
A. Basics
C. DeepLearning
You are very likely to find that the first sections of this tutorial are very basic. If you have some familiarity with data science in python, we suggest you skip what you know and only follow sections B.1, B.2, and C.
Please quickly skim the beginning sections to find the appropriate starting point for you. If there isn't sufficient time for you to finish the exercises during the tutorial session please read through the explanations, execute the cells containing the examples, and think about how you would solve the exercises. You can go back and try the exercises at home.
Data Science in python usually starts with loading hdf5 files into numpy tensors for manipulation in an interactive python session. While you can run the session in a terminal, Jupyter provides a nice web-based alternative environment for data analysis. As you can see (since you should be running this in Jupyter Notebbok), it allows you to combine text, code, and results all in one interactive document in your browser. There are many excellent primers already out there for python, numpy, h5py, and jupyter. You are encouraged to study them on your own as needed.
If you are seeing this page, you have successfully connected to a python server via ssh tunnel and navigated to this notebook. Jupyter notebooks consist of cells that can hold text or code (usually python). This text that you are reading, was written into a text cell as simple text "coding" language known as mark-down. When this cell is run (either automatically at start of the notebook or manually by pressing shift-enter), the mark-down text is interpreted into nice looking text. Running a code cell will execute the code in that cell and give you the results. If you make a mistake, you can usually simply change the cell and re-run. But be aware that since you ran the mistaken cell already, whatever code was properly executed before your mistake/error, was already executed and has therefore changed your current python environment accordingly. In some cases this situation will be problematic, and you will need to rerun the notebook from the start by pressing the "reload" botton (next to the "stop" button) above.
You are encouraged to add cells to this notebook (using the "+" button on the tool bar) and play around a bit. If you don't want to mess up this notebook, you can work in a copy of this notebook by selecting Make Copy from the File menu.
Here we are assuming you have some basic level of python knowledge, such as the syntax. There are many great python tutorials available. For an introductory level interactive tutorial you can try this one: http://www.learnpython.org/
We will lead you through exercises that show you common fundamental problems you might face when doing a deep learning problem. Lets start with generating some fake random data. You can get a random number between 0 and 1 using the python random module as follow: (remember to execute this in Jupyter notebooks click in the cell and hit shift-enter)
In [ ]:
import random
x=random.random()
print "The Value of x is", x
Using the random method (shown above), write a function GenerateData(N, mymin, mymax), that returns a python list containing N random numbers between a specified minimum and maximum value. Note that you may want to quickly work out on paper how to turn numbers between 0 and 1 to other values.
In [ ]:
# Skeleton
def GenerateData(N,min,max):
out = []
### BEGIN SOLUTION
# Fill in your solution here
### END SOLUTION
return out
Data=GenerateData(1000,-10,10)
print "Data Type:", type(Data)
print "Data Length:", len(Data)
if len(Data)>0:
print "Type of Data Contents:", type(Data[0])
print "Data Minimum:", min(Data)
print "Data Maximum:", max(Data)
In [ ]:
# Skeleton
def mean(Data):
m=0
### BEGIN SOLUTION
# Fill in your solution here
### END SOLUTION
return m
print "Mean of Data:", mean(Data)
Write a function the applies a booling function (that returns true/false) to every element in data, and returns a list of indices of elements where the result was true. Use this function to find the indices of positive entries. (This might be something you want to do if you are applying some selection criteria to your dataset and only want to keep events/entries/examples that pass the criteria.)
In [ ]:
def where(mylist,myfunc):
out= []
### BEGIN SOLUTION
# Fill in your solution here
### END SOLUTION
return out
The inrange(mymin,mymax) function below returns a function that tests if it's input is between the specified values. Use this function, in conjunction to your solution to Exercise 1.3, to demonstrate that your data is "flat". Hint: pick several sub-ranges and show that the number of data point divided by the size of the range is roughly constant.
In [ ]:
def inrange(mymin,mymax):
def testrange(x):
return x<mymax and x>=mymin
return testrange
# Examples:
F1=inrange(0,10)
F2=inrange(10,20)
print F1(0), F1(1), F1(10), F1(15), F1(20)
print F2(0), F2(1), F2(10), F2(15), F2(20)
print "Number of Entries passing F1:", len(where(Data,F1))
print "Number of Entries passing F2:", len(where(Data,F2))
In [ ]:
### BEGIN SOLUTION
# Fill in your solution here
### END SOLUTION
Write a new function called GenerateDataFromFunction(N,mymin,mymax,myfunc), that instead of generating a flat distribution, generates a distribution with a functional form coded in myfunc. Note that for this exercise myfunc should always be > 0.
For this exercise, let us make myfunc a Gaussian distribution, which is given below. Generate 1000 numbers that follow this Gaussian distribution. Confirm that the mean of the generated data is close to mean you specified when building the Gaussian.
Hint: A simple, but slow, solution to generate data with a given distribution is to a draw random number, let's call it test_x within the specified range (mymin, mymax) and another number p between the minimum and maximum values of the function myfunc (which you will have to determine). If p<=function(test_x), then place test_x on the output. If not, repeat the process, drawing two new numbers. Repeat until you have the specified number of generated numbers, N, in the output. This method is often called "the accept and reject sampling method". For this problem, it's OK to determine the min and max by numerically sampling the function.
In [ ]:
def GenerateDataFromFunction(N,mymin,mymax,myfunc):
out = []
### BEGIN SOLUTION
# Fill in your solution here
### END SOLUTION
return out
In [ ]:
import math
def gaussian(mean, sigma):
def f(x):
return (1/math.sqrt(2*math.pi*sigma**2))*math.exp(-( (x-mean)**2)/(2*(sigma**2) ))
return f
# Example Instantiation
g1=gaussian(0,1)
g2=gaussian(10,3)
### BEGIN SOLUTION
# Fill in your solution here
### END SOLUTION
Numpy is the tensor manipulation package most commonly used in python-based scientific computing. Numpy tensor interface is also adopted by all packages that provide tensors (e.g. h5py, theano, TensorFlow, ...).
Let start with some basic reshape manipulations. Consider a classification task. We can imagine the training data X consisting of N examples each with M inputs, so the shape of X is (M,N). The output of the Neural Network for the training sample encodes the true class of each of the N examples in X, in a "one-hot" matrix of shape (N,C), where C is the number of classes and each row corresponds to the true class for the corresponding example in X. So for a given row Y[i], all elements are 0 except for the column corresponding to the true class.
For example consider a classification task of separating between 4 classes. We'll call them A, B, C, and D.
In [ ]:
import numpy as np
Y=np.array( [ [0, 1, 0, 0], # Class B
[1, 0, 0, 0], # Class A
[0, 0, 0, 1], # Class C
[0, 0, 1, 0] # Class D
])
print "Shape of Y:", Y.shape
Lets imagine that we want to change to 2 classes instead by combining classes A with B and C with D. Use np.reshape and np.sum to create a new vector Y1. Hint: change the shape of Y into (8,2), sum along the correct axes, and change shape to (4,2). LH: solution given
In [ ]:
print "Transpose:", np.transpose(Y)
print "Reshape 8,2:", np.transpose(Y).reshape((8,2))
print "Sum:", np.sum(np.transpose(Y).reshape((8,2)),axis=1)
Y1= np.sum(np.transpose(Y)
.reshape((8,2)),axis=1).reshape(4,2)
print "Answer: ",Y1
In [ ]:
X=np.random.normal(4,10,1000)
print np.mean(X)
print np.min(X)
print np.max(X)
print np.var(X)
In [ ]:
import math
X1=(X-np.mean(X))/math.sqrt(np.var(X)) # Replace X with your answer
print np.mean(X1)
print np.var(X1)
In [ ]:
X0=np.random.random(1000)
def CheckFlatness(D,steps=10):
maxD=np.max(D)
minD=np.min(D)
i=minD
stepsize=(maxD-minD)/steps
while i<maxD:
print i,i+stepsize,":",np.shape(np.where((D<=(i+stepsize)) & (D>i) ))
i+=stepsize
CheckFlatness(X0)
CheckFlatness(X)
HDF5 is a "data model, library, and file format for storing and managing data." It is also the most common storage format in data science. h5py provides a python API for HDF5. In most cases, you do not need to know very much about HDF5 or h5py, just how to read/write tensors into/from files, which you can easily pick up from the h5py Quick Start. We won't be using HDF5 for this tutorial. This section is here as reference for when you do encounter an HDF5 file.
For the remainder of this tutorial, we will attempt to follow the first paper on Deep Learning in High Energy physics P. Baldi, et al. This paper demonstrates that Deep Neural Networks can learn from raw data the features that are typically used for searches for exotics particles. The authors publically provide the two benchmark scenarios considered in the paper. We will focus on the SUSY benchmark.
The data is distributed as a comma separated values (CSV) file. If you are running on lxplus, you can find a local copy to use as input. Otherwise, download the ~ GB compressed file from UCI's ML Archive, use gunzip
to decompress it, and change the path to the file below accordingly.
In [ ]:
filename="/data/tutorial/amir/SUSY.csv"
# print out the first 5 lines using unix head command (note in jupyter ! => shell command)
!head -5 "/data/tutorial/amir/SUSY.csv"
Each row represents a LHC collision event. Each column contains some observable from that event. The variable names are:
In [ ]:
VarNames=["signal", "l_1_pT", "l_1_eta","l_1_phi", "l_2_pT", "l_2_eta", "l_2_phi", "MET", "MET_phi", "MET_rel", "axial_MET", "M_R", "M_TR_2", "R", "MT2", "S_R", "M_Delta_R", "dPhi_r_b", "cos_theta_r1"]
Some of these variables represent the "raw" kinematics of the observed final state particles, while others are "features" that are derived from these raw quantities:
In [ ]:
RawNames=["l_1_pT", "l_1_eta","l_1_phi", "l_2_pT", "l_2_eta", "l_2_phi"]
FeatureNames=[ "MET", "MET_phi", "MET_rel", "axial_MET", "M_R", "M_TR_2", "R", "MT2", "S_R", "M_Delta_R", "dPhi_r_b", "cos_theta_r1"]
We will use pandas to read in the file, and matplotlib to make plots. Pandas provides "data structures and data analysis tools for the Python Programming Language". Many machine learning tasks can be accomplished with numpy tensors and h5py files. In this case, pandas just makes it very easy to read a CSV file.
The following ensures pandas is installed and sets everything up:
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Now we can read the data into a pandas dataframe. It's a ~GB file, so be patient.
In [ ]:
df = pd.read_csv(filename, dtype='float64', names=VarNames)
Another nice feature of pandas is that you can see the data in Jupyter by just evaluating the dataframe:
In [ ]:
df
The first column stores the "truth" label of whether an event was signal or background. Pandas makes it easy to create dataframes that store only the signal or background events:
In [ ]:
df_sig=df[df.signal==1]
df_bkg=df[df.signal==0]
The following example plots the signal and background distributions of every variable. Note that we use VarNames[1:] to skip the first variable, which was the true label.
We will use matplotlib for plotting. There are lots of tutorials and primers out there that you can find searching the web. A good tutorial can be found in the Scipy Lectures. Look through these on your own time, it is not necessary for doing these exercise. The code below is all you need to know for making histograms with matplotlib.
In [ ]:
for var in VarNames[1:]:
print var
plt.figure()
plt.hist(df_sig[var],bins=100,histtype="step", color="red",label="background",stacked=True)
plt.hist(df_bkg[var],bins=100,histtype="step", color="blue", label="signal",stacked=True)
plt.legend(loc='upper right')
plt.show()
Scikit-learn is a rich python library for data science, including machine learning. As an example, we can easily build a Fisher Discriminant (aka Linear Discriminant Analysis, or LDA). The LDA Documentation does as great job explaining this classifier. Here's how we instanciate the classifier:
In [ ]:
import sklearn.discriminant_analysis as DA
Fisher=DA.LinearDiscriminantAnalysis()
Lets separate the data into inputs (X) vs outputs (Y) and training vs testing samples:
In [ ]:
N_Train=4000000
Train_Sample=df[:N_Train]
Test_Sample=df[N_Train:]
X_Train=Train_Sample[VarNames[1:]]
y_Train=Train_Sample["signal"]
X_Test=Test_Sample[VarNames[1:]]
y_Test=Test_Sample["signal"]
Test_sig=Test_Sample[Test_Sample.signal==1]
Test_bkg=Test_Sample[Test_Sample.signal==0]
We can train the classifier as follow:
In [ ]:
Fisher.fit(X_Train,y_Train)
We can plot the output, comparing signal and background:
In [ ]:
plt.figure()
plt.hist(Fisher.decision_function(Test_sig[VarNames[1:]]),bins=100,histtype="step", color="blue", label="signal",stacked=True)
plt.hist(Fisher.decision_function(Test_bkg[VarNames[1:]]),bins=100,histtype="step", color="red", label="background",stacked=True)
plt.legend(loc='upper right')
plt.show()
And we can make a ROC curve and evaluate the AUC:
In [ ]:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_Test, Fisher.decision_function(X_Test))
roc_auc = auc(fpr, tpr)
plt.plot(fpr,tpr,color='darkorange',label='ROC curve (area = %0.2f)' % roc_auc)
plt.legend(loc="lower right")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
In [ ]:
X_Train_Raw=Train_Sample[RawNames]
X_Test_Raw=Test_Sample[RawNames]
X_Train_Features=Train_Sample[FeatureNames]
X_Test_Features=Test_Sample[FeatureNames]
In [ ]:
def TrainFisher(X_Train,X_Test,y_Train):
Fisher=DA.LinearDiscriminantAnalysis()
Fisher.fit(X_Train,y_Train)
fpr, tpr, _ = roc_curve(y_Test, Fisher.decision_function(X_Test))
roc_auc = auc(fpr, tpr)
plt.plot(fpr,tpr,color='darkorange',label='ROC curve (area = %0.2f)' % roc_auc)
plt.legend(loc="lower right")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
return Fisher
RawFisher=TrainFisher(X_Train_Raw,X_Test_Raw,y_Train)
FeatureFisher=TrainFisher(X_Train_Features,X_Test_Features,y_Train)
Select 3 different classifiers from the techniques listed here. Note that you can use the multi-layer peceptron to build a deep network, though training may be prohibitively slow, so avoid this technique. Perform the comparison in exercise 1 for each classifier. Compare your conclusions for your selected techniques to the paper.
In [ ]:
def PlotSignificance(N_S,N_B, N_S_min=1):
plt.figure()
eff_sig,bins_sig,p_sig=plt.hist(Fisher.decision_function(Test_sig[VarNames[1:]]),bins=100,histtype="step", color="blue", label="signal",cumulative=-1,stacked=True,normed=True)
eff_bkg,bins_bkg,p_bkg=plt.hist(Fisher.decision_function(Test_bkg[VarNames[1:]]),bins=100,histtype="step", color="red", label="background",cumulative=-1,stacked=True,normed=True)
plt.legend(loc='upper right')
plt.show()
good_bins = np.where(eff_sig*N_S>=N_S_min)
print len(good_bins[0])
if len(good_bins[0])<1:
print "Insufficient Signal."
return 0,0,0
significance=(N_S*eff_sig)/np.sqrt((N_B*eff_bkg)+(N_S*eff_sig))
plt.figure()
plt.plot(bins_sig[:-1],significance)
max_sign=np.max(significance[good_bins])
max_signI=np.argmax(significance[good_bins])
plt.show()
print "Max significance at ", bins_sig[max_signI], " of", max_sign
return bins_sig[max_signI],max_sign, max_signI
PlotSignificance(1000000,1e11)
Answer the following questions:
Read the Baldi, et al. paper and attempt to reproduce the results, as closely as possible, using scikit-learn. Try using the multi-layer peceptron to build a deep network. Or if you are capable, try it using Keras Scikit-learn interface.
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
filename="/data/tutorial/amir/SUSY.csv"
VarNames=["signal", "l_1_pT", "l_1_eta","l_1_phi", "l_2_pT", "l_2_eta", "l_2_phi", "MET", "MET_phi", "MET_rel", "axial_MET", "M_R", "M_TR_2", "R", "MT2", "S_R", "M_Delta_R", "dPhi_r_b", "cos_theta_r1"]
RawNames=["l_1_pT", "l_1_eta","l_1_phi", "l_2_pT", "l_2_eta", "l_2_phi"]
FeatureNames=[ "MET", "MET_phi", "MET_rel", "axial_MET", "M_R", "M_TR_2", "R", "MT2", "S_R", "M_Delta_R", "dPhi_r_b", "cos_theta_r1"]
df = pd.read_csv(filename, dtype='float64', names=VarNames)
Now lets define training and test samples. Note that DNNs take very long to train, so for testing purposes we will use only about 10% of the 5 million events in the training/validation sample. Once you get everything working, you can go back and make the final version of your plots with the full sample.
Also note that Keras had trouble with the Pandas tensors, so after doing all of the nice manipulation that Pandas enables, we convert the Tensor to a regular numpy tensor.
In [ ]:
N_Max=550000
N_Train=500000
Train_Sample=df[:N_Train]
Test_Sample=df[N_Train:N_Max]
X_Train=np.array(Train_Sample[VarNames[1:]])
y_Train=np.array(Train_Sample["signal"])
X_Test=np.array(Test_Sample[VarNames[1:]])
y_Test=np.array(Test_Sample["signal"])
Training Deep Learning models can take a very long time. If you have access to a GPU, training with the GPU will be about 2 orders of magnitude faster that training with just the CPU. Unforunately, there are no GPUs on lxplus. But, if you are running this notebook on a system with NVidia GPU(s) properly setup, you can tell Keras to use a specific GPU:
Now we will build a simple model. Note that this is a very small model, so things run fast. You should attempt more ambitious models.
In [ ]:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(12, input_dim=X_Train.shape[1], init='uniform', activation='relu'))
model.add(Dense(8, init='uniform', activation='relu'))
model.add(Dense(1, init='uniform', activation='sigmoid'))
The model has to be compiled. At this time we set the loss function and the optimizer too:
In [ ]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Now we train. We are running only 10 epochs in this example. Models may need hundreds of epochs before they stop improving.
In [ ]:
history=model.fit(X_Train, y_Train, validation_data=(X_Test,y_Test), nb_epoch=10, batch_size=2048)
The model history keeps track of the loss and accuracy for each epoch. Note that the training above was setup to run on the validation sample at the end of each epoch:
In [ ]:
print history.history
You can plot the loss versus epoch:
In [ ]:
loss_history=history.history["loss"]
plt.plot(range(len(loss_history)),loss_history)
You will need to create several models and make sure they are properly trained. Write a function that takes this history and plots the values versus epoch. For every model that you train in the remainder of this lab, assess:
* Has you model's performance plateaued? If not train for more epochs.
* Compare the performance on training versus test sample. Are you over training?
In [ ]:
## Your Solution Here
We can evaluate how the trained model does on the test sample as follows:
In [ ]:
scores = model.evaluate(X_Test, y_Test)
print scores
And we can make ROC curves as before:
In [ ]:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_Test, model.predict(X_Test))
roc_auc = auc(fpr, tpr)
plt.plot(fpr,tpr,color='darkorange',label='ROC curve (area = %0.2f)' % roc_auc)
plt.legend(loc="lower right")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
In [ ]:
## Your solution here
In [ ]:
## Your solution here
Write a function that evaluates the performance (AUC) as a function of a given input variable. You will need to bin the test data in the variable (i.e. make sub-samples for events which have the particular variable in a range), evaluate the performance in each bin, and plot the results.
Apply your function to each input variable.
In [ ]:
## Your solution here