Using sklearn's Iris Dataset with neon

Tony Reina
28 JUNE 2017

Here's an example of how we can load one of the standard sklearn datasets into a neon model. We'll be using the iris dataset, a classification problem in which we try to predict the iris flower species (Setosa, Versicolour, or Virginica) from 4 continuous features: sepal length, sepal width, petal length, and petal width. The dataset comes from Ronald Fisher's 1936 paper describing linear discriminant analysis and is now considered one of the gold standards for benchmarking new classification methods.

In this notebook, we'll walk through loading the data from sklearn into neon's ArrayIterator class and then passing that to a simple multi-layer perceptron model. We should get a misclassification rate of 2% to 8%.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Load the iris dataset from sklearn


In [1]:
from sklearn import datasets

In [2]:
iris = datasets.load_iris()
X = iris.data  
Y = iris.target

nClasses = len(iris.target_names)  # Setosa, Versicolour, and Virginica iris species
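
A quick look at what we just loaded (optional, but handy for confirming shapes and class names; this is only an illustrative check):

print(X.shape, Y.shape)        # (150, 4) feature matrix and 150 labels
print(iris.target_names)       # ['setosa' 'versicolor' 'virginica']
print(iris.feature_names)      # sepal/petal length and width, in cm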

Use sklearn to split the data into training and testing sets


In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33) # 67% training, 33% testing

Make sure that the features are scaled to a mean of 0 and a standard deviation of 1

This is standard pre-processing for multi-layered perceptron inputs.
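
For reference, each feature $x$ is transformed as $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are that feature's mean and standard deviation computed on the training set. We fit the scaler on the training data only and reuse those statistics on the test data, so no information from the test set leaks into the model.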


In [4]:
from sklearn.preprocessing import StandardScaler

scl = StandardScaler()

X_train = scl.fit_transform(X_train)  # Fit the scaler on the training set and scale it
X_test = scl.transform(X_test)        # Scale the test set using the training set's statistics

Generate a backend for neon to use

This sets up either a CPU or GPU backend for neon. If we don't create a backend first, ArrayIterator won't run.

We're asking neon to use the CPU, but you can change that to a GPU if one is available. Batch size refers to how many data points are processed in each gradient descent step. Here's a primer on gradient descent.

Technical note: your batch size should always be much smaller than the number of points in your dataset. So if you have 50 points, set your batch size to something well under 50; I'd suggest no more than 10% of the number of data points. You can always set your batch size to 1, in which case you are no longer performing mini-batch gradient descent but standard stochastic gradient descent.
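
To make the arithmetic concrete, here is a quick sanity check (a sketch only; the actual backend setup is in the next cell) of how a 10% batch size plays out on this dataset:

# Rough arithmetic check (not needed for training): with roughly 100 training
# points after the split, a 10% batch size gives about 10 mini-batches per epoch.
n_train = X_train.shape[0]
batch_size = max(1, n_train // 10)
print('{} training points, batch size {}, ~{} batches per epoch'.format(
    n_train, batch_size, n_train // batch_size))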


In [5]:
from neon.data import ArrayIterator
from neon.backends import gen_backend

be = gen_backend(backend='cpu', batch_size=X_train.shape[0]//10)  # Change to 'gpu' if you have gpu support

Let's pass the data to neon

We pass our data (both features and labels) into neon's ArrayIterator class. By default, ArrayIterator one-hot encodes the labels (which saves us a step). Once we have our ArrayIterators, we can pass them directly into neon models.


In [6]:
training_data = ArrayIterator(X=X_train, y=y_train, nclass=nClasses, make_onehot=True)
testing_data = ArrayIterator(X=X_test, y=y_test, nclass=nClasses, make_onehot=True)
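
To see what the one-hot encoding looks like, here is a minimal numpy sketch (for illustration only; ArrayIterator does this for us internally):

import numpy as np

sample_labels = np.array([0, 2, 1])        # Three example iris labels
onehot = np.eye(nClasses)[sample_labels]   # One row per label, one column per class
print(onehot)
# [[ 1.  0.  0.]
#  [ 0.  0.  1.]
#  [ 0.  1.  0.]]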

In [7]:
print('I am using this backend: {}'.format(be))


I am using this backend: <neon.backends.nervanacpu.NervanaCPU object at 0x7f0b6ff35410>

Import the neon libraries we need for this MLP


In [8]:
from neon.initializers import GlorotUniform, Gaussian 
from neon.layers import GeneralizedCost, Affine, Dropout
from neon.models import Model 
from neon.optimizers import GradientDescentMomentum, Adam
from neon.transforms import Softmax, CrossEntropyMulti, Rectlin, Tanh
from neon.callbacks.callbacks import Callbacks, EarlyStopCallback
from neon.transforms import Misclassification

Initialize the weights and bias variables

We could use numbers from the Gaussian distribution ($\mu=0, \sigma=0.3$) to initialize the weights and bias terms for our model. However, we can also use other initializations like GlorotUniform.
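
GlorotUniform draws each weight from a uniform distribution whose range depends on the layer's fan-in and fan-out (roughly $W \sim U\left(-\sqrt{6/(n_{in}+n_{out})},\ \sqrt{6/(n_{in}+n_{out})}\right)$), which helps keep the scale of activations and gradients roughly constant across layers at the start of training.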


In [9]:
init = GlorotUniform()    # Alternative: Gaussian(loc=0, scale=0.3)

Define a multi-layered perceptron (MLP) model

We just use a simple Python list to add our different layers to the model. The nice thing is that we've already put our data into a neon ArrayIterator. That means the model will automatically know how to handle the input layer.

In this model, the input layer feeds into a 5-neuron affine layer with rectified linear unit (ReLU) activation. That feeds into an 8-neuron hyperbolic tangent (Tanh) layer, followed by 50% dropout. Finally, that feeds into a softmax output layer with nClasses neurons. We'll predict the class with the largest softmax output (the argmax).

I've just thrown together a model haphazardly. There is no reason the model has to be like this. In fact, I'd suggest experimenting with different layers, different numbers of neurons, and different activation functions to see if you can get a better model. What's nice about neon is that we can easily alter the model architecture without much change to our code.


In [10]:
layers = [ 
          Affine(nout=5, init=init, bias=init, activation=Rectlin()), # Affine layer with 5 neurons (ReLU activation)
          Affine(nout=8, init=init, bias=init, activation=Tanh()), # Affine layer with 8 neurons (Tanh activation)
          Dropout(0.5),  # Dropout layer
          Affine(nout=nClasses, init=init, bias=init, activation=Softmax()) # Affine layer with softmax
         ]
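
As a quick size check, the affine layers contribute $4 \times 5 + 5 = 25$, $5 \times 8 + 8 = 48$, and $8 \times 3 + 3 = 27$ weights and biases respectively, so this MLP has only 100 trainable parameters in total.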

In [11]:
mlp = Model(layers=layers)

Cost function

How "close" is the model's prediction is to the true value? For the case of multi-class prediction we typically use Cross Entropy.


In [12]:
cost = GeneralizedCost(costfunc=CrossEntropyMulti())

Gradient descent

All of our models will use gradient descent. We will iteratively update the model weights and biases in order to minimize the cost of the model.

There are many optimization algorithms we can use for gradient descent. Here we'll use Adam.
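
Briefly, Adam keeps running (exponentially decayed) estimates of the gradient's mean, $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, and of its uncentered variance, $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, and updates each weight by the bias-corrected ratio $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ scaled by the learning rate. The beta_1 and beta_2 arguments in the cell below are exactly those decay rates.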


In [13]:
#optimizer = GradientDescentMomentum(0.1, momentum_coef=0.2) 

optimizer = Adam(learning_rate=0.1, beta_1=0.9, beta_2=0.999)

Callbacks

Callbacks allow us to run custom code at certain points during the training. For example, in the code below we want to find out how well the model is performing against the testing data every 2 epochs of training. If the cross entropy error on the testing data stops improving, then we stop the training early; otherwise, we risk overfitting the model to the training set.

I've added a patience parameter to the early stopping. If the model's performance has not improved after a certain number of callbacks, then we will stop training early.


In [14]:
# define stopping function
# it takes as input a tuple (State,val[t])
# which describes the cumulative validation state (generated by this function)
# and the validation error at time t
# and returns as output a tuple (State', Bool),
# which represents the new state and whether to stop

def stop_func(s, v):
    
    patience = 4  # If model performance has not improved in this many callbacks, then early stop.
    
    if s is None:
        return ([v], False)
    
    if (all(v < i for i in s)):  # Check whether this value is smaller than every value in the history
        history = [v]  # New value is smaller so let's reset the history
        print('Model improved performance: {}'.format(v))
    else:
        history = s + [v]   # New value is not smaller, so let's add to current history
        print('Model has not improved in {} callbacks.'.format(len(history)-1))
            
    if len(history) > patience:  # If our history is greater than the patience, then early terminate.
        stop = True
        print('Stopping training early.')
    else:
        stop = False   # Otherwise, keep training.
    
    return (history, stop)   

# The model trains on the training set, but every 2 epochs we calculate
# its performance against the testing set. If the testing error stops
# improving, we want to stop early because we may be overfitting the model.
callbacks = Callbacks(mlp, eval_set=testing_data, eval_freq=2)  # Run the callback every 2 epochs
callbacks.add_callback(EarlyStopCallback(stop_func)) # Add our early stopping function call

Run the model

This starts gradient descent. The number of epochs is how many passes we want to make over the entire training dataset. So 100 epochs means that we run gradient descent over the training data up to 100 times in a row (early stopping may end training sooner).
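
With roughly 100 training points and a batch size of 10, each epoch performs about 10 weight updates, so 100 epochs corresponds to at most about 1,000 updates; the early stopping callback will usually end training well before that, as it does in the run below.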


In [15]:
mlp.fit(training_data, optimizer=optimizer, num_epochs=100, cost=cost, callbacks=callbacks)


Epoch 0   [Train |████████████████████|   10/10   batches, 0.63 cost, 0.04s]
Epoch 1   [Train |████████████████████|   10/10   batches, 0.33 cost, 0.03s] [CrossEntropyMulti Loss 0.31, 0.01s]
Epoch 2   [Train |████████████████████|   10/10   batches, 0.30 cost, 0.03s]
Epoch 3   [Train |████████████████████|   10/10   batches, 0.23 cost, 0.04s] [CrossEntropyMulti Loss 0.13, 0.00s]
Model improved performance: 0.134086459875
Epoch 4   [Train |████████████████████|   10/10   batches, 0.24 cost, 0.03s]
Epoch 5   [Train |████████████████████|   10/10   batches, 0.18 cost, 0.03s] [CrossEntropyMulti Loss 0.15, 0.00s]
Model has not improved in 1 callbacks.
Epoch 6   [Train |████████████████████|   10/10   batches, 0.14 cost, 0.03s]
Epoch 7   [Train |████████████████████|   10/10   batches, 0.23 cost, 0.03s] [CrossEntropyMulti Loss 0.11, 0.00s]
Model improved performance: 0.108739070594
Epoch 8   [Train |████████████████████|   10/10   batches, 0.34 cost, 0.03s]
Epoch 9   [Train |████████████████████|   10/10   batches, 0.22 cost, 0.03s] [CrossEntropyMulti Loss 0.17, 0.00s]
Model has not improved in 1 callbacks.
Epoch 10  [Train |████████████████████|   10/10   batches, 0.24 cost, 0.03s]
Epoch 11  [Train |████████████████████|   10/10   batches, 0.20 cost, 0.03s] [CrossEntropyMulti Loss 0.15, 0.00s]
Model has not improved in 2 callbacks.
Epoch 12  [Train |████████████████████|   10/10   batches, 0.13 cost, 0.03s]
Epoch 13  [Train |████████████████████|   10/10   batches, 0.23 cost, 0.04s] [CrossEntropyMulti Loss 0.15, 0.00s]
Model has not improved in 3 callbacks.
Epoch 14  [Train |████████████████████|   10/10   batches, 0.08 cost, 0.03s]
Epoch 15  [Train |████████████████████|   10/10   batches, 0.08 cost, 0.03s] [CrossEntropyMulti Loss 0.14, 0.00s]
Model has not improved in 4 callbacks.
Stopping training early.

Run the model on the testing data

Let's run the model on the testing data and get the predictions. We can then compare those predictions with the true values to see how well our model has performed.


In [16]:
results = mlp.get_outputs(testing_data)   # Class probabilities for each test example
prediction = results.argmax(1)            # Predicted class = index of the largest probability

error_pct = 100 * mlp.eval(testing_data, metric=Misclassification())[0]
print('The model misclassified {:.2f}% of the test data.'.format(error_pct))


The model misclassified 6.00% of the test data.
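
As a sanity check, you can compute roughly the same number directly from the predictions. This is a sketch only; it assumes get_outputs returns one row of class probabilities per test example, in the same order as y_test:

import numpy as np

# Hypothetical double-check of the Misclassification metric above.
manual_error_pct = 100 * np.mean(prediction[:len(y_test)] != y_test)
print('Manual misclassification estimate: {:.2f}%'.format(manual_error_pct))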

Save the model

Let's save the model and the parameters.


In [17]:
mlp.save_params('iris_model.prm')
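
If you want to reuse the trained weights later, the saved file can be loaded back into a model with the same architecture. A minimal sketch, assuming neon's Model.load_params accepts the path written by save_params above:

# Restore the saved weights into a model with the same architecture.
# (load_params is assumed here; see neon's serialization docs for details.)
mlp.load_params('iris_model.prm')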

Here's the text description of the model.

You could use this to draw a graph of the network.


In [18]:
mlp.get_description()['model']


Out[18]:
{'config': {'layers': [{'config': {'init': {'config': {},
      'type': 'neon.initializers.initializer.GlorotUniform'},
     'name': 'Linear_0',
     'nout': 5},
    'type': 'neon.layers.layer.Linear'},
   {'config': {'init': {'config': {},
      'type': 'neon.initializers.initializer.GlorotUniform'},
     'name': 'Linear_0_bias'},
    'type': 'neon.layers.layer.Bias'},
   {'config': {'name': 'Linear_0_Rectlin',
     'transform': {'config': {'name': 'Rectlin_0'},
      'type': 'neon.transforms.activation.Rectlin'}},
    'type': 'neon.layers.layer.Activation'},
   {'config': {'init': {'config': {},
      'type': 'neon.initializers.initializer.GlorotUniform'},
     'name': 'Linear_1',
     'nout': 8},
    'type': 'neon.layers.layer.Linear'},
   {'config': {'init': {'config': {},
      'type': 'neon.initializers.initializer.GlorotUniform'},
     'name': 'Linear_1_bias'},
    'type': 'neon.layers.layer.Bias'},
   {'config': {'name': 'Linear_1_Tanh',
     'transform': {'config': {'name': 'Tanh_0'},
      'type': 'neon.transforms.activation.Tanh'}},
    'type': 'neon.layers.layer.Activation'},
   {'config': {'name': 'Dropout_0'}, 'type': 'neon.layers.layer.Dropout'},
   {'config': {'init': {'config': {},
      'type': 'neon.initializers.initializer.GlorotUniform'},
     'name': 'Linear_2',
     'nout': 3},
    'type': 'neon.layers.layer.Linear'},
   {'config': {'init': {'config': {},
      'type': 'neon.initializers.initializer.GlorotUniform'},
     'name': 'Linear_2_bias'},
    'type': 'neon.layers.layer.Bias'},
   {'config': {'name': 'Linear_2_Softmax',
     'transform': {'config': {'name': 'Softmax_0'},
      'type': 'neon.transforms.activation.Softmax'}},
    'type': 'neon.layers.layer.Activation'}],
  'name': 'Sequential_0'},
 'container': True,
 'type': 'neon.layers.container.Sequential'}
