This notebook explores the dataset from Kaggle's Digit Recognizer competition and uses a number of neural networks to demonstrate their usage and utility.
In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
# Input data files are available in the current directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list those files
from subprocess import check_output
print(check_output(["ls", "."]).decode("utf8"))
# Any results you write to the current directory are saved as output.
In [3]:
myTrainDf = pd.read_csv('./train.csv')
myTestDf = pd.read_csv('./test.csv')
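A quick look at the shapes helps confirm the layout before going further (a minimal sanity check added here, not in the original notebook; in this competition the training file carries a label column plus 784 pixel columns):
In [ ]:
# Sanity check (sketch): train should have 785 columns ('label' plus 784 pixels),
# test should have the 784 pixel columns only.
print(myTrainDf.shape, myTestDf.shape)
print(myTrainDf.columns[:3].tolist())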
In [5]:
# Display some digits to understand how the data is structured
def formatDigit(aLine):
    return np.array(aLine).reshape((28, 28))

NUM_EXAMPLES = 3
for i in (myTrainDf.shape[0] * np.random.rand(NUM_EXAMPLES)).astype(int):
    myImage = formatDigit(myTrainDf.iloc[i, 1:])
    plt.matshow(myImage, cmap='gray')
    plt.colorbar()
    plt.title('Example for digit ' + str(myTrainDf.iloc[i, 0]))
    plt.show()
Now that we see how the data is organized, let's use an MLP with an architecture like the one taught in Stanford's "Machine Learning" course on coursera.org to recognize the digits.
The MLP will have 3 layers: the 784-unit input layer, a 25-unit hidden layer, and the 10-unit output layer.
Note that the sizes of the input and output layers are given by the X and Y datasets.
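Since those two sizes are fixed by the data, only the hidden layer needs to be chosen. A minimal check of the input and output sizes (a sketch added here, not in the original notebook):
In [ ]:
# Input layer size = number of pixel columns; output layer size = number of classes.
n_inputs = myTrainDf.shape[1] - 1          # 784 = 28*28 pixels
n_outputs = myTrainDf['label'].nunique()   # 10 digit classes
print(n_inputs, n_outputs)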
In [6]:
from sklearn.neural_network import MLPClassifier
myX = myTrainDf[myTrainDf.columns[1:]]
myY = myTrainDf[myTrainDf.columns[0]]
# Use the 'adam' solver, which works well on large datasets; alpha is the L2 regularization term.
# verbose=True prints the loss at each optimization iteration.
myClf = MLPClassifier(hidden_layer_sizes=25, activation='logistic', solver='adam',
alpha=1e-5, verbose=True)
myClf.fit(myX, myY)
Out[6]:
In [7]:
# Get the training error
myPredY = myClf.predict(myX)
In [8]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
# Generic function to assess performance on a dataset
def showPerformance(aY, aYpred):
    print('*** Performance Statistics ***')
    print('Accuracy: ', accuracy_score(aY, aYpred))
    print('Precision: ', precision_score(aY, aYpred, average='micro'))
    print('Recall: ', recall_score(aY, aYpred, average='micro'))
    print('F1: ', f1_score(aY, aYpred, average='micro'))
showPerformance(myY, myPredY)
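Note that for a single-label multiclass problem, micro-averaged precision, recall, and F1 all coincide with accuracy, so a confusion matrix gives a more informative per-digit breakdown; a minimal sketch (not in the original notebook):
In [ ]:
from sklearn.metrics import confusion_matrix
# Rows are the true digits, columns the predicted digits (training set).
print(confusion_matrix(myY, myPredY))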
The results are quite good, as expected. Now let's make a prediction for the test set.
In [17]:
myYtestPred = myClf.predict(myTestDf)
myOutDf = pd.DataFrame(index=myTestDf.index+1, data=myYtestPred)
myOutDf.reset_index().to_csv('submission.csv', header=['ImageId', 'Label'],index=False)
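Before submitting, it is worth eyeballing the file (Kaggle expects an ImageId column starting at 1 and one Label per test row); a quick check, added here as a sketch:
In [ ]:
# Sanity check (sketch): first rows of the submission and the prediction count.
print(pd.read_csv('submission.csv').head())
print(len(myYtestPred), 'predictions')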
In [49]:
REG_ARRAY = [100, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 0]
def splitDataset(aDf, aFrac):
    aTrainDf = aDf.sample(frac=aFrac)
    aXvalDf = aDf.drop(aTrainDf.index)  # the rows not sampled into the training split
    return aTrainDf, aXvalDf
mySampleTrainDf, mySampleXvalDf = splitDataset(myTrainDf, 0.8)
myAccuracyDf = pd.DataFrame(index=REG_ARRAY, columns=['Accuracy'])
for myAlpha in REG_ARRAY:
    print('Training with regularization param', myAlpha)
    myClf = MLPClassifier(hidden_layer_sizes=25, activation='logistic', solver='adam',
                          alpha=myAlpha, verbose=False)
    myClf.fit(mySampleTrainDf[mySampleTrainDf.columns[1:]], mySampleTrainDf['label'])
    myYpred = myClf.predict(mySampleXvalDf[mySampleXvalDf.columns[1:]])
    myAccuracyDf.loc[myAlpha, 'Accuracy'] = accuracy_score(mySampleXvalDf['label'], myYpred)
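To see the sweep at a glance, the accuracies can be plotted against alpha (a sketch added here, not in the original notebook; alpha=0 is dropped because it cannot sit on a log axis):
In [ ]:
# Plot validation accuracy versus regularization strength on a log scale.
myPlotDf = myAccuracyDf.drop(index=0).sort_index()
plt.semilogx(myPlotDf.index, myPlotDf['Accuracy'].astype(float), marker='o')
plt.xlabel('alpha (L2 regularization strength)')
plt.ylabel('validation accuracy')
plt.show()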
In [50]:
myAccuracyDf
Out[50]:
From here one can tell that the regularization value used earlier (around 1e-5) is already a reasonable choice for this model.
In [55]:
REG_ARRAY = [100, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 0]
myAccuracyDf = pd.DataFrame(index=REG_ARRAY, columns=['Accuracy'])
for myAlpha in REG_ARRAY:
    print('Training with regularization param', myAlpha)
    myClf = MLPClassifier(hidden_layer_sizes=[400, 400, 100, 25], activation='logistic',
                          solver='adam', alpha=myAlpha, verbose=False)
    myClf.fit(mySampleTrainDf[mySampleTrainDf.columns[1:]], mySampleTrainDf['label'])
    myYpred = myClf.predict(mySampleXvalDf[mySampleXvalDf.columns[1:]])
    myAccuracyDf.loc[myAlpha, 'Accuracy'] = accuracy_score(mySampleXvalDf['label'], myYpred)
In [59]:
myAccuracyDf
Out[59]:
Let's produce a new output file with no regularization and a more complex 784x400x400x100x25x10 MLP.
In [61]:
myClf = MLPClassifier(hidden_layer_sizes=[400, 400, 100, 25], activation='logistic', solver='adam', alpha=0, verbose=True)
myClf.fit(myTrainDf[myTrainDf.columns[1:]], myTrainDf['label'])
myYtestPred = myClf.predict(myTestDf)
myOutDf = pd.DataFrame(index=myTestDf.index+1, data=myYtestPred)
myOutDf.reset_index().to_csv('submission2.csv', header=['ImageId', 'Label'],index=False)