Neural networks

Neural networks inside hep_ml are very simple, but flexible. They are based on the theano library.

hep_ml.nnet also provides tools to optimize any continuous expression as a decision function (see below).

Downloading the dataset

Downloading the dataset from the UCI repository and splitting it into train and test parts.


In [1]:
!cd toy_datasets; wget -O ../data/MiniBooNE_PID.txt -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt


/bin/sh: line 0: cd: toy_datasets: No such file or directory
File '../data/MiniBooNE_PID.txt' already exists; not retrieving.

In [2]:
import numpy, pandas
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score

data = pandas.read_csv('../data/MiniBooNE_PID.txt', sep='\s*', skiprows=[0], header=None, engine='python')
labels = pandas.read_csv('../data/MiniBooNE_PID.txt', sep=' ', nrows=1, header=None)
labels = [1] * labels[1].values[0] + [0] * labels[2].values[0]
data.columns = ['feature_{}'.format(key) for key in data.columns]

train_data, test_data, train_labels, test_labels = train_test_split(data, labels, train_size=0.5, random_state=42)
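
As a quick sanity check of the split, one can print the sizes of the two parts and the fraction of signal events (this uses only the variables defined above):

# quick sanity check: sizes of the splits and the signal fraction
print 'train size:', len(train_data), ', test size:', len(test_data)
print 'signal fraction in train:', numpy.mean(train_labels)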

Example of training a network

Training a multilayer perceptron with one hidden layer of 5 neurons.


In [3]:
from hep_ml.nnet import MLPClassifier
from sklearn.metrics import roc_auc_score

clf = MLPClassifier(layers=[5], epochs=500)
clf.fit(train_data, train_labels)


Out[3]:
MLPClassifier(epochs=500, layers=[5], loss='log_loss', random_state=None,
       scaler='standard', trainer='irprop-', trainer_parameters=None)
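
All constructor arguments shown in the output above (layers, epochs, loss, random_state, scaler, trainer, trainer_parameters) can be varied. As a sketch, assuming layers accepts one entry per hidden layer, a deeper network with a fixed random seed could be configured like this:

# a sketch of a deeper configuration (assumes layers takes one entry per hidden layer)
deep_clf = MLPClassifier(layers=[10, 10], epochs=300, random_state=42)
deep_clf.fit(train_data, train_labels)
print 'Test quality:', roc_auc_score(test_labels, deep_clf.predict_proba(test_data)[:, 1])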

In [4]:
proba = clf.predict_proba(test_data)
print 'Test quality:', roc_auc_score(test_labels, proba[:, 1])


Test quality: 0.971250859023

In [5]:
proba = clf.predict_proba(train_data)
print 'Train quality:', roc_auc_score(train_labels, proba[:, 1])


Train quality: 0.971777980506

Creating your own neural network

To create your own neural network, you should define the parameters of the network and provide an activation function.

You are not limited to any particular structure in this function; hep_ml.nnet will treat it as a black box for optimization.

The simplest way is to override the prepare method of AbstractNeuralNetworkClassifier.


In [6]:
from hep_ml.nnet import AbstractNeuralNetworkClassifier
from theano import tensor as T

class SimpleNeuralNetwork(AbstractNeuralNetworkClassifier):
    def prepare(self):
        # getting number of units in input, hidden and output layers
        # note that we support only one hidden layer here
        n1, n2, n3 = self.layers_
        
        # creating parameters of neural network
        W1 = self._create_matrix_parameter('W1', n1, n2)
        W2 = self._create_matrix_parameter('W2', n2, n3)
        
        # defining activation function
        def activation(input):
            first = T.nnet.sigmoid(T.dot(input, W1))
            return T.dot(first, W2)

        return activation

In [7]:
clf = SimpleNeuralNetwork(layers=[5], epochs=500)
clf.fit(train_data, train_labels)
print 'Test quality:', roc_auc_score(test_labels, clf.predict_proba(test_data)[:, 1])


Test quality: 0.967173363583

Using a specific neural network

This network has one hidden layer, but of a rather unusual structure.


In [8]:
from hep_ml.nnet import PairwiseNeuralNetwork
clf = PairwiseNeuralNetwork(layers=[5], epochs=500)
clf.fit(train_data, train_labels)
print 'Test quality:', roc_auc_score(test_labels, clf.predict_proba(test_data)[:, 1])


Test quality: 0.972384121561

Creating very specific expression estimators

We can use hep_ml.nnet to optimize any expression as a black box. For simplicity, let's assume we have only three variables: $\text{var}_1, \text{var}_2, \text{var}_3.$

And from physical intuition we are sure that this is a good expression to discriminate signal from background: $$\text{output} = c_1 \text{var}_1 + c_2 \log[ \exp(\text{var}_2 + \text{var}_3) + \exp(c_3)] + c_4 \dfrac{\text{var}_3}{\text{var}_2} + c_5 $$

Note: I have written some random expression here; in practice it comes from physical intuition (or from looking at the data).
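
To make the expression concrete, here it is written in plain numpy for a single event with arbitrary (untrained) coefficient values; during training, only $c_1, \dots, c_5$ are adjusted:

# the same expression in plain numpy with arbitrary coefficients;
# only c1..c5 are free parameters that the network will tune
c1, c2, c3, c4, c5 = 1.0, 0.5, -1.0, 2.0, 0.1
var1, var2, var3 = 0.3, 1.2, -0.7
output = c1 * var1 + c2 * numpy.log(numpy.exp(var2 + var3) + numpy.exp(c3)) + c4 * var3 / var2 + c5
print 'output for one event:', output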


In [9]:
class CustomNeuralNetwork(AbstractNeuralNetworkClassifier):
    def prepare(self):
        # getting number of units in input, hidden and output layers
        # (the hidden layer is not used in this expression)
        n1, n2, n3 = self.layers_        
        # checking that we have three variables in input + constant
        assert n1 == 3 + 1 
        # creating parameters
        c1, c2, c3, c4, c5 = self._create_scalar_parameters('c1', 'c2', 'c3', 'c4', 'c5')
        
        # defining activation function
        def activation(input):
            v1, v2, v3 = input[:, 0], input[:, 1], input[:, 2]
            return c1 * v1 + c2 * T.log(T.exp(v2 + v3) + T.exp(c3)) + c4 * v3 / v2 + c5
        
        return activation

Writing a custom pretransformer

A very simple scikit-learn transformer which transforms each feature to be uniformly distributed over the range [0, 1].


In [10]:
from sklearn.base import BaseEstimator, TransformerMixin
from rep.utils import Flattener

class Uniformer(BaseEstimator, TransformerMixin):
    # flattens each variable to be uniformly distributed on [0, 1]
    def fit(self, X, y=None):
        self.transformers = []
        X = numpy.array(X, dtype=float)
        for column in range(X.shape[1]):
            self.transformers.append(Flattener(X[:, column]))
        return self
        
    def transform(self, X):
        X = numpy.array(X, dtype=float)
        assert X.shape[1] == len(self.transformers)
        for column, trans in enumerate(self.transformers):
            X[:, column] = trans(X[:, column])
        return X
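
As a quick check (an illustrative sketch using the first three features), each transformed column should lie in [0, 1] after fitting:

# sketch: verify that transformed features lie in [0, 1]
uniformer = Uniformer().fit(train_data[train_data.columns[:3]])
transformed = uniformer.transform(train_data[train_data.columns[:3]])
print 'min:', transformed.min(axis=0)
print 'max:', transformed.max(axis=0)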

In [11]:
# selecting three features to train: 
train_features = train_data.columns[:3]

clf = CustomNeuralNetwork(layers=[5], epochs=1000, scaler=Uniformer())
clf.fit(train_data[train_features], train_labels)

print 'Test quality:', roc_auc_score(test_labels, clf.predict_proba(test_data[train_features])[:, 1])


Test quality: 0.914575996678

Ensembling neural networks

Let's run the AdaBoost algorithm over neural networks.


In [12]:
from sklearn.ensemble import AdaBoostClassifier

base_nnet = MLPClassifier(layers=[5], scaler=Uniformer())
clf = AdaBoostClassifier(base_estimator=base_nnet, n_estimators=10)
clf.fit(train_data, train_labels)


Out[12]:
AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=MLPClassifier(epochs=100, layers=[5], loss='log_loss', random_state=None,
       scaler=Uniformer(), trainer='irprop-', trainer_parameters=None),
          learning_rate=1.0, n_estimators=10, random_state=None)

In [13]:
print 'Test quality:', roc_auc_score(test_labels, clf.predict_proba(test_data)[:, 1])


Test quality: 0.977154302238