An Introduction to Deep Learning with Keras

This tutorial is meant to teach a beginning HEP undergraduate or graduate student, who may be unfamiliar with python or data science in python, to train Deep Learning models using Keras. Usually, tutorials attempt to familiarize you with a software package by leading you through the steps of a few tasks. Like most tutorials, this one has many sections where you can simply follow the instructions and execute example code like a robot (usually via copy/paste into a terminal, but in this case using Jupyter notebooks). But it also aims to teach you key concepts in scientific computing, Machine Learning, and Deep Learning in python through exercises that require you to slow down, think critically, and apply what you read. The exercises were derived from labs for a Deep Learning in HEP course for undergraduates (taught at the University of Texas at Arlington by Amir Farbin).

The tutorial is divided into three sections:

A. Basics

  0. Jupyter
  1. Python
  2. Numpy
  3. h5py

B. Machine Learning

  1. Dataset
  2. Pandas
  3. Scikit-learn

C. Deep Learning

  1. Keras

You are very likely to find the first sections of this tutorial very basic. If you have some familiarity with data science in python, we suggest you skip what you already know and only follow sections B.1, B.2, and C.

Please quickly skim the beginning sections to find the appropriate starting point for you. If there isn't sufficient time for you to finish the exercises during the tutorial session, read through the explanations, execute the cells containing the examples, and think about how you would solve the exercises. You can go back and try the exercises at home.

A. Basics

Data Science in python usually starts with loading hdf5 files into numpy tensors for manipulation in an interactive python session. While you can run the session in a terminal, Jupyter provides a nice web-based alternative environment for data analysis. As you can see (since you should be running this in a Jupyter notebook), it allows you to combine text, code, and results in one interactive document in your browser. There are many excellent primers already out there for python, numpy, h5py, and Jupyter, and you are encouraged to study them on your own as needed.

0. Jupyter

If you are seeing this page, you have successfully connected to a python server via ssh tunnel and navigated to this notebook. Jupyter notebooks consist of cells that can hold text or code (usually python). This text that you are reading was written into a text cell in a simple markup language known as Markdown. When a cell is run (either automatically at the start of the notebook or manually by pressing shift-enter), the Markdown source is rendered into nicely formatted text. Running a code cell executes the code in that cell and shows you the results. If you make a mistake, you can usually simply change the cell and re-run. But be aware that whatever code executed before your mistake has already changed the state of the python environment accordingly. In some cases this will be a problem, and you will need to rerun the notebook from the start by pressing the "restart" button (next to the "stop" button) above.

You are encouraged to add cells to this notebook (using the "+" button on the toolbar) and play around a bit. If you don't want to mess up this notebook, you can work in a copy of it by selecting Make a Copy from the File menu.

1. Python

Here we are assuming you have some basic level of python knowledge, such as familiarity with the syntax. There are many great python tutorials available. For an introductory-level interactive tutorial, you can try this one: http://www.learnpython.org/

We will lead you through exercises that show you common fundamental problems you might face when doing a deep learning project. Let's start with generating some fake random data. You can get a random number between 0 and 1 using the python random module as follows (remember, to execute a cell in a Jupyter notebook, click in the cell and hit shift-enter):


In [ ]:
import random
x=random.random()
print "The Value of x is", x

Exercise A.1.1

Using the random function (shown above), write a function GenerateData(N, mymin, mymax) that returns a python list containing N random numbers between the specified minimum and maximum values. Note that you may want to quickly work out on paper how to map numbers between 0 and 1 onto an arbitrary range.


In [ ]:
# Skeleton
def GenerateData(N,mymin,mymax):
    out = []
    ### BEGIN SOLUTION

    # Fill in your solution here 
    
    ### END SOLUTION
    return out

Data=GenerateData(1000,-10,10)
print "Data Type:", type(Data)
print "Data Length:", len(Data)
if len(Data)>0: 
    print "Type of Data Contents:", type(Data[0])
    print "Data Minimum:", min(Data)
    print "Data Maximum:", max(Data)

Exercise A.1.2

Write a function that computes the mean of values in a list.


In [ ]:
# Skeleton
def mean(Data):
    m=0
    ### BEGIN SOLUTION

    # Fill in your solution here        
    
    ### END SOLUTION   
    return m

print "Mean of Data:", mean(Data)

Exercise A.1.3

Write a function that applies a boolean function (one that returns True/False) to every element in a list, and returns a list of indices of the elements where the result was True. Use this function to find the indices of the positive entries in your data. (This might be something you want to do if you are applying some selection criteria to your dataset and only want to keep events/entries/examples that pass the criteria.)


In [ ]:
def where(mylist,myfunc):
    out= []
    ### BEGIN SOLUTION

    # Fill in your solution here        
    
    ### END SOLUTION    
    return out

Exercise A.1.4

The inrange(mymin,mymax) function below returns a function that tests if its input is between the specified values. Use it, in conjunction with your solution to Exercise A.1.3, to demonstrate that your data is "flat". Hint: pick several sub-ranges and show that the number of data points divided by the size of the range is roughly constant.


In [ ]:
def inrange(mymin,mymax):
    def testrange(x):
        return x<mymax and x>=mymin
    return testrange

# Examples:
F1=inrange(0,10)
F2=inrange(10,20)

print F1(0), F1(1), F1(10), F1(15), F1(20)
print F2(0), F2(1), F2(10), F2(15), F2(20)

print "Number of Entries passing F1:", len(where(Data,F1))
print "Number of Entries passing F2:", len(where(Data,F2))

Exercise A.1.5

Repeat Exercise A.1.4 using the built-in python functions sum and map instead of your solution to Exercise A.1.3.


In [ ]:
### BEGIN SOLUTION

# Fill in your solution here        
    
### END SOLUTION

Exercise A.1.6

Write a new function called GenerateDataFromFunction(N,mymin,mymax,myfunc) that, instead of generating a flat distribution, generates a distribution with the functional form coded in myfunc. Note that for this exercise myfunc should always be > 0.

For this exercise, let us make myfunc the Gaussian distribution given below. Generate 1000 numbers that follow this Gaussian distribution. Confirm that the mean of the generated data is close to the mean you specified when building the Gaussian.

Hint: A simple, but slow, way to generate data with a given distribution is to draw a random number, let's call it test_x, within the specified range (mymin, mymax), and another number p between the minimum and maximum values of the function myfunc (which you will have to determine). If p <= myfunc(test_x), then append test_x to the output. If not, repeat the process, drawing two new numbers. Repeat until you have the specified number of generated numbers, N, in the output. This method is usually called "accept-reject sampling" (or rejection sampling). For this problem, it's OK to determine the min and max of the function by numerically sampling it.


In [ ]:
def GenerateDataFromFunction(N,mymin,mymax,myfunc):
    out = []
    ### BEGIN SOLUTION

    # Fill in your solution here        
    
    ### END SOLUTION   
    return out

In [ ]:
import math

def gaussian(mean, sigma):
    def f(x):
        return (1/math.sqrt(2*math.pi*sigma**2))*math.exp(-((x-mean)**2)/(2*(sigma**2)))
    return f

# Example Instantiation
g1=gaussian(0,1)
g2=gaussian(10,3)

### BEGIN SOLUTION

# Fill in your solution here        
    
### END SOLUTION

2. Numpy

Numpy is the tensor (multi-dimensional array) manipulation package most commonly used in python-based scientific computing. The numpy array interface is also adopted by essentially all python packages that provide tensors (e.g. h5py, Theano, TensorFlow, ...).
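
If numpy is new to you, here is a minimal sketch of the operations this section relies on: array creation, inspecting the shape, reshaping, and summing along an axis (the values are arbitrary):


In [ ]:
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2x3 tensor
print(a.shape)                        # (2, 3)
b = a.reshape((3, 2))                 # same 6 entries, new shape (3, 2)
print(b)
print(a.sum(axis=0))                  # sum over rows: [5 7 9]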

Exercise A.2.1

Let's start with some basic reshape manipulations. Consider a classification task. We can imagine the training data X consisting of N examples, each with M inputs, so the shape of X is (N,M). The output of the Neural Network for the training sample encodes the true class of each of the N examples in X in a "one-hot" matrix of shape (N,C), where C is the number of classes and each row corresponds to the true class of the corresponding example in X. So for a given row Y[i], all elements are 0 except for the column corresponding to the true class.

For example consider a classification task of separating between 4 classes. We'll call them A, B, C, and D.


In [ ]:
import numpy as np

Y=np.array( [ [0, 1, 0, 0], # Class B
              [1, 0, 0, 0], # Class A
              [0, 0, 0, 1], # Class D
              [0, 0, 1, 0]  # Class C
            ])

print "Shape of Y:", Y.shape

Let's imagine that we want to change to 2 classes instead, by combining class A with B and class C with D. Use np.reshape and np.sum to create a new vector Y1. Hint: change the shape of Y into (8,2), sum along the correct axis, and change the shape back to (4,2). (The solution is given below.)


In [ ]:
print "Transpose:", np.transpose(Y)
print "Reshape 8,2:", np.transpose(Y).reshape((8,2))
print "Sum:", np.sum(np.transpose(Y).reshape((8,2)),axis=1)

Y1= np.sum(np.transpose(Y)
           .reshape((8,2)),axis=1).reshape(4,2)
print "Answer: ",Y1

Exercise A.2.2

Often we find that neural networks work best when their inputs lie mostly between 0 and 1. Below, we create a random dataset that is normally distributed (mean of 4, sigma of 10). Shift and scale the data so that the mean is 0.5 and 68% of the data lies between 0 and 1. (The solution is given below.)


In [ ]:
X=np.random.normal(4,10,1000)
print np.mean(X)
print np.min(X)
print np.max(X)
print np.var(X)

In [ ]:
import math
X1=0.5+0.5*(X-np.mean(X))/math.sqrt(np.var(X)) # shift to mean 0.5, scale to sigma 0.5, so ~68% of the data lies in [0,1]

print np.mean(X1)
print np.var(X1)

Exercise A.2.3

Use np.random.random and np.random.normal to generate two datasets. Then use np.where to repeat Exercise A.1.4, showing that one creates a flat distribution and the other does not, by binning the data by hand.


In [ ]:
X0=np.random.random(1000)

def CheckFlatness(D,steps=10):
    maxD=np.max(D)
    minD=np.min(D)
    i=minD
    stepsize=(maxD-minD)/steps
    while i<maxD:
        print i,i+stepsize,":",np.shape(np.where((D<=(i+stepsize)) & (D>i) ))
        i+=stepsize
        
CheckFlatness(X0)
CheckFlatness(X)

3. h5py

HDF5 is a "data model, library, and file format for storing and managing data." It is one of the most common storage formats in data science. h5py provides a python API for HDF5. In most cases you do not need to know very much about HDF5 or h5py, just how to read/write tensors into/from files, which you can easily pick up from the h5py Quick Start. We won't be using HDF5 in this tutorial; this section is here as a reference for when you do encounter an HDF5 file.
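
For reference, here is a minimal sketch of writing and reading a numpy tensor with h5py; the file name example.h5 and the dataset name mydata are arbitrary choices:


In [ ]:
import numpy as np
import h5py

# Write a tensor into an HDF5 file
with h5py.File("example.h5", "w") as f:
    f.create_dataset("mydata", data=np.random.random((100, 5)))

# Read it back into a numpy array
with h5py.File("example.h5", "r") as f:
    X = f["mydata"][:]  # [:] pulls the full dataset into memory

print(X.shape)  # (100, 5)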

B. Machine Learning

For the remainder of this tutorial, we will attempt to follow the first paper on Deep Learning in High Energy Physics: P. Baldi, P. Sadowski, and D. Whiteson, "Searching for Exotic Particles in High-Energy Physics with Deep Learning", Nature Communications 5, 4308 (2014). This paper demonstrates that Deep Neural Networks can learn, from raw data, the features that are typically used in searches for exotic particles. The authors publicly provide the datasets for the two benchmark scenarios considered in the paper. We will focus on the SUSY benchmark.

1. The Dataset

The data is distributed as a comma-separated values (CSV) file. If you are running on lxplus, you can find a local copy to use as input. Otherwise, download the ~GB compressed file from UCI's ML Archive, use gunzip to decompress it, and change the path to the file below accordingly.
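
If you need to download it yourself, shell commands along these lines should work; the URL below is a placeholder, so use whatever download link the UCI ML Archive page for the SUSY dataset gives you:


In [ ]:
# Replace <SUSY-csv-gz-URL> with the download link from the UCI ML Archive
!wget <SUSY-csv-gz-URL>
!gunzip SUSY.csv.gz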


In [ ]:
filename="/data/tutorial/amir/SUSY.csv"
# print out the first 5 lines using unix head command (note in jupyter ! => shell command)
!head -5  "/data/tutorial/amir/SUSY.csv"

Each row represents an LHC collision event. Each column contains some observable from that event. The variable names are:


In [ ]:
VarNames=["signal", "l_1_pT", "l_1_eta","l_1_phi", "l_2_pT", "l_2_eta", "l_2_phi", "MET", "MET_phi", "MET_rel", "axial_MET", "M_R", "M_TR_2", "R", "MT2", "S_R", "M_Delta_R", "dPhi_r_b", "cos_theta_r1"]

Some of these variables represent the "raw" kinematics of the observed final state particles, while others are "features" that are derived from these raw quantities:


In [ ]:
RawNames=["l_1_pT", "l_1_eta","l_1_phi", "l_2_pT", "l_2_eta", "l_2_phi"]
FeatureNames=[ "MET", "MET_phi", "MET_rel", "axial_MET", "M_R", "M_TR_2", "R", "MT2", "S_R", "M_Delta_R", "dPhi_r_b", "cos_theta_r1"]

2. Pandas

We will use pandas to read in the file, and matplotlib to make plots. Pandas provides "data structures and data analysis tools for the Python Programming Language". Many machine learning tasks can be accomplished with numpy tensors and h5py files alone; in this case, pandas just makes it very easy to read a CSV file.

The following ensures pandas is installed and sets everything up:


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Now we can read the data into a pandas dataframe. It's a ~GB file, so be patient.


In [ ]:
df = pd.read_csv(filename, dtype='float64', names=VarNames)
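
If you just want to try out the rest of the notebook quickly, pandas can read only the first part of the file; a sketch (the row count is an arbitrary choice):


In [ ]:
# Read only the first 100k events for a fast test run
df = pd.read_csv(filename, dtype='float64', names=VarNames, nrows=100000)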

Another nice feature of pandas is that you can see the data in Jupyter by just evaluating the dataframe:


In [ ]:
df
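
Beyond displaying the (truncated) table, pandas can also summarize the data: df.head() shows the first few rows, and describe() computes per-column statistics:


In [ ]:
df.describe()  # per-column count, mean, std, min/max, and quartiles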

The first column stores the "truth" label of whether an event was signal or background. Pandas makes it easy to create dataframes that store only the signal or background events:


In [ ]:
df_sig=df[df.signal==1]
df_bkg=df[df.signal==0]

The following example plots the signal and background distributions of every variable. Note that we use VarNames[1:] to skip the first variable, which was the true label.

We will use matplotlib for plotting. There are lots of tutorials and primers out there that you can find by searching the web; a good one can be found in the Scipy Lectures. Look through these on your own time, they are not necessary for doing these exercises. The code below is all you need to know for making histograms with matplotlib.


In [ ]:
for var in VarNames[1:]:
    print var
    plt.figure()
    plt.hist(df_sig[var],bins=100,histtype="step", color="blue", label="signal")
    plt.hist(df_bkg[var],bins=100,histtype="step", color="red", label="background")
    plt.legend(loc='upper right')
    plt.show()

3. Scikit-learn

Scikit-learn is a rich python library for data science, including machine learning. As an example, we can easily build a Fisher Discriminant (aka Linear Discriminant Analysis, or LDA). The LDA documentation does a great job explaining this classifier. Here's how we instantiate it:


In [ ]:
import sklearn.discriminant_analysis as DA
Fisher=DA.LinearDiscriminantAnalysis()

Let's separate the data into inputs (X) vs outputs (Y), and into training vs testing samples:


In [ ]:
N_Train=4000000

Train_Sample=df[:N_Train]
Test_Sample=df[N_Train:]

X_Train=Train_Sample[VarNames[1:]]
y_Train=Train_Sample["signal"]

X_Test=Test_Sample[VarNames[1:]]
y_Test=Test_Sample["signal"]

Test_sig=Test_Sample[Test_Sample.signal==1]
Test_bkg=Test_Sample[Test_Sample.signal==0]

We can train the classifier as follows:


In [ ]:
Fisher.fit(X_Train,y_Train)

We can plot the output, comparing signal and background:


In [ ]:
plt.figure()
plt.hist(Fisher.decision_function(Test_sig[VarNames[1:]]),bins=100,histtype="step", color="blue", label="signal",stacked=True)
plt.hist(Fisher.decision_function(Test_bkg[VarNames[1:]]),bins=100,histtype="step", color="red", label="background",stacked=True)
plt.legend(loc='upper right')
plt.show()

And we can make a ROC curve and evaluate the AUC:


In [ ]:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_Test, Fisher.decision_function(X_Test))

roc_auc = auc(fpr, tpr)

plt.plot(fpr,tpr,color='darkorange',label='ROC curve (area = %0.2f)' % roc_auc)
plt.legend(loc="lower right")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.show()

Exercise B.3.1

Train the Fisher using the raw kinematics, the derived features, and raw+features as input. Compare the performance on a single plot. Add cells to this notebook as needed, or start new notebooks.


In [ ]:
X_Train_Raw=Train_Sample[RawNames]
X_Test_Raw=Test_Sample[RawNames]

X_Train_Features=Train_Sample[FeatureNames]
X_Test_Features=Test_Sample[FeatureNames]

In [ ]:
def TrainFisher(X_Train,X_Test,y_Train,y_Test):
    Fisher=DA.LinearDiscriminantAnalysis()
    Fisher.fit(X_Train,y_Train)

    fpr, tpr, _ = roc_curve(y_Test, Fisher.decision_function(X_Test))
    roc_auc = auc(fpr, tpr)

    plt.plot(fpr,tpr,color='darkorange',label='ROC curve (area = %0.2f)' % roc_auc)
    plt.legend(loc="lower right")
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.show()
    
    return Fisher

RawFisher=TrainFisher(X_Train_Raw,X_Test_Raw,y_Train,y_Test)
FeatureFisher=TrainFisher(X_Train_Features,X_Test_Features,y_Train,y_Test)

Exercise B.3.2

Select 3 different classifiers from the techniques listed here. (Note that, while you could use the multi-layer perceptron to build a deep network, training it may be prohibitively slow, so avoid that technique here.) Perform the comparison of Exercise B.3.1 for each classifier. Compare the conclusions for your selected techniques to the paper.

Exercise B.3.3

The following function calculates the significance of the observation of the signal given the number of expected signal and background events, using the simple formula $\sigma_S= \frac{N_S}{\sqrt{N_S+N_B}}$. For example, $N_S=100$ expected signal events on top of $N_B=10000$ background events gives $\sigma_S = 100/\sqrt{10100} \approx 1$. Read through the code carefully.


In [ ]:
def PlotSignificance(N_S,N_B, N_S_min=1):
    plt.figure()
    eff_sig,bins_sig,p_sig=plt.hist(Fisher.decision_function(Test_sig[VarNames[1:]]),bins=100,histtype="step", color="blue", label="signal",cumulative=-1,stacked=True,normed=True)
    eff_bkg,bins_bkg,p_bkg=plt.hist(Fisher.decision_function(Test_bkg[VarNames[1:]]),bins=100,histtype="step", color="red", label="background",cumulative=-1,stacked=True,normed=True)
    plt.legend(loc='upper right')
    plt.show()
    
    good_bins = np.where(eff_sig*N_S>=N_S_min)

    print len(good_bins[0])
    if len(good_bins[0])<1:
        print "Insufficient Signal."
        return 0,0,0
    
    significance=(N_S*eff_sig)/np.sqrt((N_B*eff_bkg)+(N_S*eff_sig))

    plt.figure()
    plt.plot(bins_sig[:-1],significance)
    
    max_signI=good_bins[0][np.argmax(significance[good_bins])] # index into the full bin array
    max_sign=significance[max_signI]
    
    plt.show()
    print "Max significance at ", bins_sig[max_signI], " of", max_sign
    return bins_sig[max_signI],max_sign, max_signI
    
PlotSignificance(1000000,1e11)

Answer the following questions:

  • What are we computing when making a normalized cumulative plot?
  • Assume that the experiment produces 1 signal event for every $10^{11}$ background events. For each of your classifiers, how many signal events need to be produced to be able to make a $5\sigma$ discovery claim?

Exercise B.3.4

Read the Baldi et al. paper and attempt to reproduce the results, as closely as possible, using scikit-learn. Try using the multi-layer perceptron to build a deep network. Or, if you are capable, try it using the Keras scikit-learn interface.

C. Deep Learning

This section is meant to get you started in using Keras to design Deep Neural Networks. The goal here is to simply repeat section B with Deep Learning.

If you are starting here and have not run the cells above that load the data, you will need to run the following cell:


In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

filename="/data/tutorial/amir/SUSY.csv"
VarNames=["signal", "l_1_pT", "l_1_eta","l_1_phi", "l_2_pT", "l_2_eta", "l_2_phi", "MET", "MET_phi", "MET_rel", "axial_MET", "M_R", "M_TR_2", "R", "MT2", "S_R", "M_Delta_R", "dPhi_r_b", "cos_theta_r1"]
RawNames=["l_1_pT", "l_1_eta","l_1_phi", "l_2_pT", "l_2_eta", "l_2_phi"]
FeatureNames=[ "MET", "MET_phi", "MET_rel", "axial_MET", "M_R", "M_TR_2", "R", "MT2", "S_R", "M_Delta_R", "dPhi_r_b", "cos_theta_r1"]

df = pd.read_csv(filename, dtype='float64', names=VarNames)

Now let's define training and test samples. Note that DNNs take a very long time to train, so for testing purposes we will use only about 10% of the 5 million events in the training/validation sample. Once you get everything working, you can go back and make the final version of your plots with the full sample.

Also note that Keras had trouble with the pandas tensors, so after doing all of the nice manipulation that pandas enables, we convert them to regular numpy tensors.


In [ ]:
N_Max=550000
N_Train=500000

Train_Sample=df[:N_Train]
Test_Sample=df[N_Train:N_Max]

X_Train=np.array(Train_Sample[VarNames[1:]])
y_Train=np.array(Train_Sample["signal"])

X_Test=np.array(Test_Sample[VarNames[1:]])
y_Test=np.array(Test_Sample["signal"])

1. Keras

Training Deep Learning models can take a very long time. If you have access to a GPU, training with the GPU will be about 2 orders of magnitude faster than training with just the CPU. Unfortunately, there are no GPUs on lxplus. But if you are running this notebook on a system with NVidia GPU(s) properly set up, you can tell Keras to use a specific GPU, as sketched below.
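
A minimal sketch, assuming the TensorFlow backend: the standard way to pin a process to a particular GPU is the CUDA_VISIBLE_DEVICES environment variable, set before Keras (and its backend) are imported. The device index 0 below is an arbitrary choice.


In [ ]:
import os
# Must be set before importing keras / the backend
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # use GPU 0 only; "" would force CPU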

Now we will build a simple model. Note that this is a very small model, so things run fast. You should attempt more ambitious models.


In [ ]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# Two ReLU hidden layers (12 and 8 nodes) and a single sigmoid output for binary classification
model.add(Dense(12, input_dim=X_Train.shape[1], init='uniform', activation='relu'))
model.add(Dense(8, init='uniform', activation='relu'))
model.add(Dense(1, init='uniform', activation='sigmoid'))

The model has to be compiled. At this point we also set the loss function and the optimizer:


In [ ]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Now we train. We run only 10 epochs in this example; models may need hundreds of epochs before they stop improving.


In [ ]:
history=model.fit(X_Train, y_Train, validation_data=(X_Test,y_Test), nb_epoch=10, batch_size=2048)

The model history keeps track of the loss and accuracy for each epoch. Note that the training above was set up to run on the validation sample at the end of each epoch:


In [ ]:
print history.history

You can plot the loss versus epoch:


In [ ]:
loss_history=history.history["loss"]
plt.plot(range(len(loss_history)),loss_history)

Exercise C.1.1

You will need to create several models and make sure they are properly trained. Write a function that takes a history and plots the values versus epoch. For every model that you train in the remainder of this lab, assess:

* Has your model's performance plateaued? If not, train for more epochs.

* Compare the performance on the training versus the test sample. Are you overtraining?

In [ ]:
## Your Solution Here

We can evaluate how the trained model does on the test sample as follows:


In [ ]:
scores = model.evaluate(X_Test, y_Test)
print scores

And we can make ROC curves as before:


In [ ]:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_Test, model.predict(X_Test))
                        
roc_auc = auc(fpr, tpr)

plt.plot(fpr,tpr,color='darkorange',label='ROC curve (area = %0.2f)' % roc_auc)
plt.legend(loc="lower right")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.show()

Exercise C.1.2

Following section B, compare the performance of models trained with the raw, feature, and raw+feature data.


In [ ]:
## Your solution here

Exercise C.1.3

Again following section B, design and implement at least 3 different DNN models. Train them and compare their performance. Try different architectures, loss functions, and optimizers to see whether they have an effect.


In [ ]:
## Your solution here

Exercise C.1.4

Write a function that evaluates the performance (AUC) as a function of a given input variable. You will need to bin the test data in that variable (i.e. make sub-samples of events for which the variable falls in a given range), evaluate the performance in each bin, and plot the results.

Apply your function to each input variable.


In [ ]:
## Your solution here