Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano.
It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. ref: https://keras.io/
The Otto Group is one of the world’s biggest e-commerce companies, and a consistent analysis of the performance of its products is crucial. However, due to a diverse global infrastructure, many identical products get classified differently. For this competition, the organizers provide a dataset with 93 features for more than 200,000 products. The objective is to build a predictive model that is able to distinguish between the main product categories. Each row corresponds to a single product. There are a total of 93 numerical features, which represent counts of different events. All features have been obfuscated and will not be defined any further.
https://www.kaggle.com/c/otto-group-product-classification-challenge/data
Despite its name, logistic regression has little in common with ordinary linear regression: it is an algorithm for solving classification problems (supervised learning). To estimate the dependent variable, it uses the so-called logistic function, or sigmoid, and it is precisely this function that gives the algorithm its name.
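Concretely, for an input vector $x$ with weight vector $w$ and bias $b$, binary logistic regression models the probability of the positive class as
$$P(y = 1 \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$
which is exactly the p_1 expression constructed in the Theano code below.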
In [1]:
from kaggle_data import load_data, preprocess_data, preprocess_labels
import numpy as np
import matplotlib.pyplot as plt
In [2]:
X_train, labels = load_data('../data/kaggle_ottogroup/train.csv', train=True)
X_train, scaler = preprocess_data(X_train)
Y_train, encoder = preprocess_labels(labels)
X_test, ids = load_data('../data/kaggle_ottogroup/test.csv', train=False)
X_test, _ = preprocess_data(X_test, scaler)
nb_classes = Y_train.shape[1]
print(nb_classes, 'classes')
dims = X_train.shape[1]
print(dims, 'dims')
In [3]:
np.unique(labels)
Out[3]:
In [4]:
Y_train # one-hot encoding
Out[4]:
In [5]:
import theano as th
import theano.tensor as T
In [6]:
#Based on example from DeepLearning.net
rng = np.random
N = 400
feats = 93
training_steps = 10
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = th.shared(rng.randn(feats), name="w")
b = th.shared(0., name="b")
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b)) # Probability that target = 1
prediction = p_1 > 0.5 # The prediction thresholded
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum() # The cost to minimize
gw, gb = T.grad(cost, [w, b]) # Compute the gradient of the cost
# Compile
train = th.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)),
    allow_input_downcast=True)
predict = th.function(inputs=[x], outputs=prediction, allow_input_downcast=True)
# Transform for class 1
y_class1 = []
for i in Y_train:
    y_class1.append(i[0])
y_class1 = np.array(y_class1)
# Train
for i in range(training_steps):
    print('Epoch %s' % (i + 1,))
    pred, err = train(X_train, y_class1)
print("target values for Data:")
print(y_class1)
print("prediction on training set:")
print(predict(X_train))
In [7]:
import tensorflow as tf
In [8]:
# Parameters
learning_rate = 0.01
training_epochs = 25
display_step = 1
In [9]:
# tf Graph Input
x = tf.placeholder("float", [None, dims])
y = tf.placeholder("float", [None, nb_classes])
In [10]:
x
Out[10]:
In [11]:
# Construct (linear) model
with tf.name_scope("model") as scope:
    # Set model weights
    W = tf.Variable(tf.zeros([dims, nb_classes]))
    b = tf.Variable(tf.zeros([nb_classes]))
    activation = tf.nn.softmax(tf.matmul(x, W) + b)  # Softmax

    # Add summary ops to collect data
    w_h = tf.summary.histogram("weights_histogram", W)
    b_h = tf.summary.histogram("biases_histograms", b)
    tf.summary.scalar('mean_weights', tf.reduce_mean(W))
    tf.summary.scalar('mean_bias', tf.reduce_mean(b))

# Minimize error using cross entropy
# Note: More name scopes will clean up graph representation
with tf.name_scope("cost_function") as scope:
    cross_entropy = y * tf.log(activation)
    cost = tf.reduce_mean(-tf.reduce_sum(cross_entropy, reduction_indices=1))
    # Create a summary to monitor the cost function
    tf.summary.scalar("cost_function", cost)
    tf.summary.histogram("cost_histogram", cost)

with tf.name_scope("train") as scope:
    # Set the Optimizer
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
In [12]:
with tf.name_scope('Accuracy') as scope:
    correct_prediction = tf.equal(tf.argmax(activation, 1), tf.argmax(y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    # Create a summary to monitor the accuracy
    tf.summary.scalar("accuracy", accuracy)
In [13]:
LOGDIR = "/tmp/logistic_logs"
import os, shutil
if os.path.isdir(LOGDIR):
    shutil.rmtree(LOGDIR)
os.mkdir(LOGDIR)
# Plug TensorBoard Visualisation
writer = tf.summary.FileWriter(LOGDIR, graph=tf.get_default_graph())
In [14]:
for var in tf.get_collection(tf.GraphKeys.SUMMARIES):
    print(var.name)

summary_op = tf.summary.merge_all()
print('Summary Op: ', summary_op)
In [15]:
# Launch the graph
with tf.Session() as session:
    # Initializing the variables
    session.run(tf.global_variables_initializer())

    cost_epochs = []
    # Training cycle
    for epoch in range(training_epochs):
        _, summary, c = session.run(fetches=[optimizer, summary_op, cost],
                                    feed_dict={x: X_train, y: Y_train})
        cost_epochs.append(c)
        writer.add_summary(summary=summary, global_step=epoch)
        print("accuracy epoch {}: {}".format(epoch, accuracy.eval({x: X_train, y: Y_train})))

    print("Training phase finished")

    # plotting
    plt.plot(range(len(cost_epochs)), cost_epochs, 'o', label='Logistic Regression Training phase')
    plt.ylabel('cost')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()

    prediction = tf.argmax(activation, 1)
    print(prediction.eval({x: X_test}))
In [16]:
%%bash
python -m tensorflow.tensorboard --logdir=/tmp/logistic_logs
In [17]:
from keras.models import Sequential
from keras.layers import Dense, Activation
In [18]:
dims = X_train.shape[1]
print(dims, 'dims')
print("Building model...")
nb_classes = Y_train.shape[1]
print(nb_classes, 'classes')
model = Sequential()
model.add(Dense(nb_classes, input_shape=(dims,)))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.fit(X_train, Y_train)
Out[18]:
Simplicity is pretty impressive, right? :)
The two backends differ in their default image dimension ordering:
Theano: shape = (channels, rows, cols)
TensorFlow: shape = (rows, cols, channels)
The image_data_format setting selects between channels_last and channels_first.
In [19]:
!cat ~/.keras/keras.json
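The exact contents depend on your local installation; as a rough sketch, a default keras.json typically looks something like the following (backend and values may differ on your machine):

{
    "epsilon": 1e-07,
    "floatx": "float32",
    "image_data_format": "channels_last",
    "backend": "tensorflow"
}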
Now let's understand what we just did:
The core data structure of Keras is a model, a way to organize layers. The main type of model is the Sequential model, a linear stack of layers.
What we did here is stack a Fully Connected (Dense) layer of trainable weights from the input to the output, and an Activation layer on top of that weights layer.
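As a minimal sketch, reusing the nb_classes and dims variables defined earlier, the same stack can be written either with a separate Activation layer (as above) or, equivalently, by passing the activation argument directly to Dense:

from keras.models import Sequential
from keras.layers import Dense, Activation

# Dense layer followed by an explicit Activation layer
model_a = Sequential()
model_a.add(Dense(nb_classes, input_shape=(dims,)))
model_a.add(Activation('softmax'))

# Equivalent: the activation passed directly to Dense
model_b = Sequential()
model_b.add(Dense(nb_classes, input_shape=(dims,), activation='softmax'))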
from keras.layers.core import Dense
Dense(units, activation=None, use_bias=True,
kernel_initializer='glorot_uniform', bias_initializer='zeros',
kernel_regularizer=None, bias_regularizer=None,
activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
units
: int > 0.
kernel_initializer
: name of the initialization function for the weights of the layer (see initializers), or alternatively, a callable to use for weights initialization. This parameter is only relevant if you don't pass a weights argument.
activation
: name of the activation function to use (see activations), or alternatively, an element-wise function. If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).
weights
: list of Numpy arrays to set as initial weights. The list should have 2 elements, of shape (input_dim, output_dim) and (output_dim,) for weights and biases respectively.
kernel_regularizer
: regularizer instance (e.g. L1 or L2 regularization from keras.regularizers), applied to the main weights matrix.
bias_regularizer
: regularizer instance, applied to the bias.
activity_regularizer
: regularizer instance, applied to the output of the layer (its "activation").
kernel_constraint
: instance of the constraints module (e.g. max_norm, non_neg), applied to the main weights matrix.
bias_constraint
: instance of the constraints module, applied to the bias.
use_bias
: whether to include a bias (i.e. make the layer affine rather than linear).
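As a hedged sketch of how these arguments combine (the layer size, regularization factor, and constraint value below are arbitrary illustrative choices):

from keras.layers import Dense
from keras import regularizers, constraints

layer = Dense(64,
              activation='relu',
              kernel_initializer='glorot_uniform',
              bias_initializer='zeros',
              kernel_regularizer=regularizers.l2(0.01),   # L2 penalty on the weight matrix
              kernel_constraint=constraints.max_norm(2.)) # cap the norm of each weight vector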
keras.layers.core.Flatten()
keras.layers.core.Reshape(target_shape)
keras.layers.core.Permute(dims)
model = Sequential()
model.add(Permute((2, 1), input_shape=(10, 64)))
# now: model.output_shape == (None, 64, 10)
# note: `None` is the batch dimension
keras.layers.core.Lambda(function, output_shape=None, arguments=None)
keras.layers.core.ActivityRegularization(l1=0.0, l2=0.0)
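As a small sketch in the same spirit as the Permute example above (the target shape and the Lambda function are arbitrary illustrative choices):

from keras.models import Sequential
from keras.layers import Reshape, Lambda

model = Sequential()
model.add(Reshape((3, 31), input_shape=(93,)))  # now: model.output_shape == (None, 3, 31)
model.add(Lambda(lambda x: x ** 2))             # apply an arbitrary element-wise function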
Credits: Yam Peleg (@Yampeleg)
from keras.layers.core import Activation
Activation(activation)
Supported Activations : [https://keras.io/activations/]
Advanced Activations: [https://keras.io/layers/advanced-activations/]
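Advanced activations are used as standalone layers rather than as activation strings; a minimal sketch (the hidden size and alpha are arbitrary illustrative choices):

from keras.models import Sequential
from keras.layers import Dense
from keras.layers.advanced_activations import LeakyReLU

mlp = Sequential()
mlp.add(Dense(64, input_shape=(dims,)))
mlp.add(LeakyReLU(alpha=0.1))  # leaky ReLU instead of a plain 'relu' string
mlp.add(Dense(nb_classes, activation='softmax'))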
If you need to, you can further configure your optimizer. A core principle of Keras is to make things reasonably simple, while allowing the user to be fully in control when they need to (the ultimate control being the easy extensibility of the source code). Here we used SGD (stochastic gradient descent) as an optimization algorithm for our trainable weights.
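For instance, instead of the string 'sgd' we could pass a configured optimizer instance; the learning rate and momentum values below are just illustrative:

from keras.optimizers import SGD

sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])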
What we did here is nice; however, in the real world it is not directly usable because of overfitting. Let's try to address it with a held-out validation set.
In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.
To avoid overfitting, we will first split our data into a training set and a validation set, and evaluate our model on the validation set. Next, we will use two of Keras's callbacks: EarlyStopping and ModelCheckpoint.
Let's first look at the model we implemented.
In [20]:
model.summary()
In [21]:
from sklearn.model_selection import train_test_split
from keras.callbacks import EarlyStopping, ModelCheckpoint
In [22]:
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.15, random_state=42)
fBestModel = 'best_model.h5'
early_stop = EarlyStopping(monitor='val_loss', patience=2, verbose=1)
best_model = ModelCheckpoint(fBestModel, verbose=0, save_best_only=True)
model.fit(X_train, Y_train, validation_data = (X_val, Y_val), epochs=50,
batch_size=128, verbose=True, callbacks=[best_model, early_stop])
Out[22]:
Q: How hard can it be to build a Multi-Layer Fully-Connected Network with Keras?
A: It is basically the same, just add more layers!
In [23]:
model = Sequential()
model.add(Dense(100, input_shape=(dims,)))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.summary()
In [24]:
model.fit(X_train, Y_train, validation_data = (X_val, Y_val), epochs=20,
batch_size=128, verbose=True)
Out[24]:
Take a couple of minutes and try to play with the number of layers and the number of parameters in the layers to get the best results.
In [25]:
model = Sequential()
model.add(Dense(100, input_shape=(dims,)))
# ...
# ...
# Play with it! Add as many layers as you want and try to get better results.
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.summary()
In [26]:
model.fit(X_train, Y_train, validation_data = (X_val, Y_val), epochs=20,
batch_size=128, verbose=True)
Out[26]:
Building a question answering system, an image classification model, a Neural Turing Machine, a word2vec embedder or any other model is just as fast. The ideas behind deep learning are simple, so why should their implementation be painful?
Much has been studied about the depth of neural nets. It has been proven mathematically [1] and empirically that convolutional neural networks benefit from depth!
[1] On the Expressive Power of Deep Learning: A Tensor Analysis - Cohen et al., 2015
One much-quoted theorem about neural networks states that:
The universal approximation theorem states [1] that a feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron) can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters; however, it does not touch upon the algorithmic learnability of those parameters.
[1] Approximation Capabilities of Multilayer Feedforward Networks - Kurt Hornik, 1991
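In one common formulation (a paraphrase, not Hornik's exact wording), the theorem says that for a suitable activation $\sigma$, any continuous function $f$ on a compact set $K \subset \mathbb{R}^n$, and any $\varepsilon > 0$, there exist a width $N$ and parameters $v_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^n$ such that
$$F(x) = \sum_{i=1}^{N} v_i \, \sigma\left(w_i^\top x + b_i\right) \quad \text{satisfies} \quad \sup_{x \in K} \left| F(x) - f(x) \right| < \varepsilon.$$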