Previously, I compared several different methods for analyzing the Abalone dataset from the UCI Machine Learning Repository: multiple regression (http://ericstrong.org/predicting-abalone-rings-part-1/), principal component analysis (http://ericstrong.org/predicting-abalone-rings-part-2/), and neural networks (http://ericstrong.org/predicting-abalone-rings-part-3-multilayer-perceptron/). In Part 3, the neural network was modeled using scikit-learn. However, in this Jupyter notebook, I'll be exploring a new Python neural network library, Keras.

The dataset used in this post was obtained from the UCI Machine Learning Repository (UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science), at the following link:

http://archive.ics.uci.edu/ml/datasets/Abalone

The data file from the above link was renamed to “abalone.csv”. Otherwise, no other changes were made.

First, I'll preprocess the data, as before. As a reminder, this will load the Abalone dataset into a pandas DataFrame, transform the "sex" variable into Male, Female, and Infant using one-hot encoding, and split the data into testing and training datasets.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
import keras
%matplotlib inline

# Load the data from the CSV file
abalone_df = pd.read_csv('abalone.csv',names=['Sex','Length','Diameter','Height',
                                             'Whole Weight','Shucked Weight',
                                             'Viscera Weight','Shell Weight',
                                             'Rings'])

# Transform sex into a dummy variable using one-hot encoding
abalone_df['Male'] = (abalone_df['Sex']=='M').astype(int)
abalone_df['Female'] = (abalone_df['Sex']=='F').astype(int)
abalone_df['Infant'] = (abalone_df['Sex']=='I').astype(int)
# Remove observations with an invalid (zero) height
abalone_df = abalone_df[abalone_df['Height']>0]

# Split the data into training and testing
# Don't make the mistake I did and try a pandas DataFrame here; it must be a
# numpy array
train, test = train_test_split(abalone_df, train_size=0.7)
x_train = train.drop(['Rings','Sex'], axis=1).values
y_train = pd.DataFrame(train['Rings']).values
x_test = test.drop(['Rings','Sex'], axis=1).values
y_test = pd.DataFrame(test['Rings']).values


Using TensorFlow backend.

I read over the documentation for constructing a neural network in Keras, and I found the examples for a "Sequential" model to be straightforward and clear. The idea is that each "layer" in the neural network is defined sequentially, as if the layers were stacked one after another, which is sufficient for most simple neural network architectures.

Note that initially I tried to give the Sequential model pandas DataFrames as inputs, which it did not like. It's important to use DataFrame.values to supply a numpy array to the Sequential model, not a DataFrame itself.

In the code below, I'll construct a neural network with the first hidden layer having 20 nodes, and the second hidden layer having 5 nodes. This neural network is intended for regression, not classification (which I will be exploring in a later post). If you remember from Part 3 of the prior Abalone dataset investigation, a 20/5 hidden layer configuration was the neural network architecture that achieved the best results. This time, I'll use a "tanh" activation function, which achieved fairly similar results to the "logistic" activation function. You can review why these parameters are important in my previous post (http://ericstrong.org/predicting-abalone-rings-part-3-multilayer-perceptron/).

I thought the code below was fairly readable on its own. Essentially, the Sequential model is passed a list of layers (the fully-connected "Dense" layer being the most common), each with an optional activation function. It's important that the input_dim of the first layer equal the number of columns in the input dataset, which is 10 in this case.


In [2]:
abalone_model = Sequential([
    Dense(20, input_dim=10, activation='tanh'),
    Dense(5, activation='tanh'),
    Dense(1),
])

Next, Keras requires that the model be "compiled", which essentially finalizes the model to be used for training. When compiling the model, you should choose an optimizer, which determines how the model updates its weights after each training step. For example, a gradient descent optimizer looks at the "slope" of the error surface and moves in the direction that minimizes the error. Here, I've chosen 'rmsprop' (RMSprop), an adaptive variant of stochastic gradient descent.
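As a side note (my own sketch, not part of the original analysis), Keras also accepts an optimizer object rather than a string, which exposes hyperparameters such as the learning rate. The values below are only the library defaults, shown for illustration:

from keras.optimizers import SGD, RMSprop

# Illustrative only: optimizer objects expose hyperparameters such as the
# learning rate, which the 'sgd'/'rmsprop' string shortcuts leave at their
# library defaults.
plain_sgd = SGD(lr=0.01)   # classic stochastic gradient descent
rms = RMSprop(lr=0.001)    # RMSprop with its default learning rate

# The compile call below could then use optimizer=rms instead of 'rmsprop'.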

The "loss" parameter tells the neural network how the results of each epoch should be scored. Mean squared error is a standard choice, especially for regression, given its use in Ordinary Least Squares.

To compare the results from this neural network to the one previously constructed, our output metric will be the mean absolute error.
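To make the difference between the loss and the reported metric concrete, here is a small, made-up illustration of both calculations using numpy (the numbers are not from the Abalone data):

import numpy as np

# Toy example (values are made up, not from the Abalone data)
y_true = np.array([9, 10, 7, 12])
y_pred = np.array([8, 11, 9, 12])

mse = np.mean((y_true - y_pred)**2)      # squared errors: 1, 1, 4, 0 -> 1.5
mae = np.mean(np.abs(y_true - y_pred))   # absolute errors: 1, 1, 2, 0 -> 1.0
print(mse, mae)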


In [3]:
abalone_model.compile(optimizer='rmsprop',loss='mse',
                      metrics=['mean_absolute_error'])

Finally, the neural network will actually be run using the "fit" and "evaluate" methods. nb_epoch specifies the number of epochs that should be used to train the neural network. In this case, I chose 200 epochs, after some trial-and-error investigation. 10 epochs was definitely too few, and model performance stagnated after about 20 epochs, but I wanted to plot an extended graph of the performance over time. There are many other methods for choosing stopping criteria (one is sketched below), but for this first neural network in Keras, I wanted a quick, simple approach.
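For instance, Keras provides an EarlyStopping callback that halts training once the monitored quantity stops improving. A minimal sketch follows; it was not used for the results below, and the patience value is only illustrative:

# Sketch of an alternative stopping criterion (not used in this post):
# stop training once the loss hasn't improved for 10 consecutive epochs.
early_stop = keras.callbacks.EarlyStopping(monitor='loss', patience=10)
# results = abalone_model.fit(x_train, y_train, nb_epoch=200, verbose=0,
#                             callbacks=[early_stop])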

Note that you should pass "verbose=0" to the fit method, so that Keras doesn't print the results from 200 epochs in the console.


In [4]:
tb = keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=0, 
                                 write_graph=True, write_images=False)
results = abalone_model.fit(x_train, y_train, nb_epoch=200, verbose=0, 
                            callbacks=[tb])
score = abalone_model.evaluate(x_test, y_test)
# The second entry in the array is the MAE
print("\nTesting MAE: {}".format(score[1]))


  32/1253 [..............................] - ETA: 0s
Testing MAE: 1.4873103530332743
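Since the point of running 200 epochs was to look at performance over time, the history returned by fit can be plotted with matplotlib. A minimal sketch follows; note that 'loss' is always recorded in results.history, while the exact key name for the MAE metric varies by Keras version:

# Sketch: plot the per-epoch training loss recorded by fit().
# results.history is a dict of per-epoch values; 'loss' is always present.
plt.plot(results.history['loss'])
plt.xlabel('Epoch')
plt.ylabel('Training loss (MSE)')
plt.show()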

The final MAE of the neural network on the testing data is 1.487. For comparison, the best result achieved in the previous post was 1.42. However, we have not yet begun optimizing the parameters of this neural network. That will be investigated in a following post.

Notice that a "TensorBoard" callback was passed to the fit method above; it writes training logs to the ./logs directory. We will investigate those results in the following post.
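In the meantime, the log files written to ./logs can typically be viewed by launching TensorBoard from the command line (this assumes the tensorboard utility that ships with TensorFlow is installed) and opening the address it prints in a browser:

tensorboard --logdir=./logs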