CSAL4243: Introduction to Machine Learning

Muhammad Mudassir Khan (mudasssir.khan@ucp.edu.pk)

Assignment 2:

Digits Recognition using Logistic Regression & Neural Networks

In this assignment you will use Logistic Regression and Neural Networks on the digits dataset from the Digit Recognizer competition on Kaggle. The first task is to train a scikit-learn logistic regression model on the training dataset, predict the labels of the given test dataset, and submit the predictions to Kaggle. Next, you will vary the regularization parameter of logistic regression and check whether it affects your results. You will then train a scikit-learn neural network on the same dataset, use the trained model to predict the labels of the test dataset, and submit those results to Kaggle as well. You need to report the neural network results too. Lastly, you will create some handwritten digits, either with drawing software such as MS Paint or by writing them on paper and taking a picture, and see how well your trained model works on them.

Note:

The given images are grey scale and have the digits written in white, so make sure the digits you create are in the same format.

Overview



Digit Recognizer Dataset

The dataset used in this assignment is called Digit Recognizer and is available on Kaggle. To download it, go to the data tab of the dataset page and download the 'train.csv', 'test.csv' and 'sample_submission.csv.gz' files. 'train.csv' is used to train the model. 'test.csv' is used to test the model, i.e. to check how well it generalizes. 'sample_submission.csv.gz' contains a sample of the submission file that you need to generate and submit to Kaggle.
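
The sketch below shows one way the three files could fit together. It assumes that 'train.csv' has a `label` column followed by one column per pixel, and that the submission file has `ImageId` (starting at 1) and `Label` columns; check both against the files you download. The helper names and the tiny in-memory table are illustrative only and are not part of the dataset.

```python
import io

import numpy as np
import pandas as pd


def split_labels_and_pixels(train_frame):
    """Split a training table into (images, labels) NumPy arrays.

    Assumes a 'label' column; every other column is treated as a pixel.
    """
    labels = train_frame["label"].to_numpy()
    images = train_frame.drop(columns="label").to_numpy()
    return images, labels


def make_submission(predictions):
    """Build a table with the assumed columns ImageId (1-based) and Label."""
    predictions = np.asarray(predictions)
    return pd.DataFrame({
        "ImageId": np.arange(1, len(predictions) + 1),
        "Label": predictions,
    })


# Tiny stand-in for train.csv (4 pixels instead of 784) so the sketch runs
# without the download; in the assignment use pd.read_csv("train.csv").
demo_csv = io.StringIO(
    "label,pixel0,pixel1,pixel2,pixel3\n"
    "3,0,255,128,0\n"
    "7,0,0,64,255\n"
)
demo_images, demo_labels = split_labels_and_pixels(pd.read_csv(demo_csv))
demo_submission = make_submission(demo_labels)
print(demo_submission.to_csv(index=False))
```

With a trained model, something like `make_submission(model.predict(test_images)).to_csv("submission.csv", index=False)` would write the file to upload; the file name here is just an example.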

Note:

There are some tutorials in the dataset's tutorial section that you can use as a starting point, especially "A beginner’s approach to classification", which uses scikit-learn's SVM classifier. You can replace the SVM with logistic regression and a neural network. To download the notebook, click fork notebook first and then the download button.


Tasks

  1. Use scikit-learn's logistic regression to train on the Digit Recognizer dataset from the Kaggle competition. Submit your best result to the competition and report the result.
  2. Try different values of the regularization parameter in logistic regression (the parameter C, which is the inverse of the regularization parameter, i.e. $C = \frac{1}{\lambda}$) and report the effect.
  3. Use a scikit-learn neural network to train on the Digit Recognizer dataset and submit your best result.
  4. Draw digits by hand in any drawing software, using a black background and white font, test them on the models trained above, and report the results.
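
As a rough starting point for tasks 2 and 3, the sketch below compares a few values of C and one small neural network on a held-out validation split. To stay runnable without the Kaggle download, it uses scikit-learn's bundled 8x8 digits as stand-in data, so the accuracies it prints say nothing about the assignment dataset. The C values, the hidden layer size and the helper name `validation_accuracy` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data: scikit-learn's bundled 8x8 digits, NOT the Kaggle files.
digits = load_digits()
X = digits.data / 16.0  # pixel values in this stand-in range from 0 to 16
y = digits.target
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)


def validation_accuracy(model):
    """Fit the model on the training split and score it on the held-out split."""
    model.fit(X_train, y_train)
    return model.score(X_val, y_val)


# Task 2: a smaller C means stronger regularization (C = 1 / lambda).
c_values = [0.01, 1.0, 100.0]  # illustrative values only
logreg_scores = {
    c: validation_accuracy(LogisticRegression(C=c, max_iter=2000))
    for c in c_values
}

# Task 3: a small multi-layer perceptron; the layer size is an arbitrary choice.
mlp_score = validation_accuracy(
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
)

for c, score in logreg_scores.items():
    print(f"LogisticRegression C={c}: validation accuracy {score:.3f}")
print(f"MLPClassifier: validation accuracy {mlp_score:.3f}")
```

Scoring on a validation split lets you compare settings locally before spending a Kaggle submission on each one.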

Note:

  1. If training takes too long on your system, reduce the training data. Around 5000 examples are enough to get a good classifier.
  2. It is a good idea to convert the images to binary values, i.e. 0's and 1's.
  3. Since the dataset contains images of $28\times 28$ dimensions, you should use the OpenCV library to resize your images if needed in task 4. You can download it as an Anaconda package.
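
For note 1, one possible way to cut the training data down is to pick a random subset of rows. The helper `subsample` and its parameters below are hypothetical names for illustration, and the arrays are fake data standing in for the real training images and labels.

```python
import numpy as np


def subsample(images, labels, n_examples=5000, seed=0):
    """Return a random subset of at most n_examples (image, label) pairs."""
    images = np.asarray(images)
    labels = np.asarray(labels)
    n_examples = min(n_examples, len(images))
    rng = np.random.default_rng(seed)
    # Sample row indices without replacement so no example is repeated.
    chosen = rng.choice(len(images), size=n_examples, replace=False)
    return images[chosen], labels[chosen]


# Fake arrays stand in for the real training data.
fake_images = np.zeros((6000, 784), dtype=np.uint8)
fake_labels = np.arange(6000) % 10
small_images, small_labels = subsample(fake_images, fake_labels)
print(small_images.shape, small_labels.shape)
```

Sampling at random, rather than taking the first 5000 rows, avoids depending on the order in which the file happens to store the examples.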

Image resize using opencv


In [ ]:
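import cv2

# Read the image as a single-channel grey scale array (flag 0 is cv2.IMREAD_GRAYSCALE).
img = cv2.imread('test.png', cv2.IMREAD_GRAYSCALE)
# Shrink to the 28x28 size of the dataset images; INTER_AREA suits downscaling.
resized_image = cv2.resize(img, (28, 28), interpolation=cv2.INTER_AREA)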
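
A resized image is still a $28\times 28$ array, while a model trained on flattened rows expects one row of 784 values. The sketch below is one way to bridge that gap for task 4. The function name `to_feature_row` is hypothetical, and the brightness check that inverts dark-on-light images is a simple heuristic assumed here, not part of the assignment.

```python
import numpy as np


def to_feature_row(resized_image, binarize=True):
    """Turn a 28x28 grey scale array into a (1, 784) row for model.predict."""
    image = np.asarray(resized_image)
    if image.shape != (28, 28):
        raise ValueError(f"expected a 28x28 image, got {image.shape}")
    # Heuristic (an assumption): a mostly bright image is probably a dark
    # digit on a light background, so invert it to white-on-black.
    if image.mean() > 127:
        image = 255 - image
    if binarize:
        # Same rule as for the dataset: every non-zero pixel becomes 1.
        image = (image > 0).astype(np.uint8)
    return image.reshape(1, -1)


# Synthetic example: a crude white vertical stroke on a black background.
demo = np.zeros((28, 28), dtype=np.uint8)
demo[4:24, 13:15] = 255
row = to_feature_row(demo)
print(row.shape, int(row.sum()))
```

Only binarize the drawn digit if the model was trained on binarized images. For photographs of paper, the background is rarely exactly 0 after inversion, so a threshold higher than 0 may be needed there.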

Convert grey scale image to binary

Convert every non-zero value to one.


In [ ]:
# Set every non-zero pixel to 1; zero (background) pixels stay 0.
test_images[test_images > 0] = 1
train_images[train_images > 0] = 1

Credits

Raschka, Sebastian. Python machine learning. Birmingham, UK: Packt Publishing, 2015. Print.

Andrew Ng, Machine Learning, Coursera

Scikit Learn Linear Regression