ML Projects - Main

For this week's project, we'll be using the Digits dataset from Scikit-Learn. This dataset contains the pixel values of 8x8 images of handwritten digits, from 0 to 9.

Purpose

  1. We will use this dataset to train some of the algorithms that we covered in the last couple of weeks, i.e. the Naive Bayes Classifier, Decision Trees, and k-Nearest Neighbors.
  2. Each person will be assigned one main algorithm. S/he will write her/his own machine learning code and use the Digits dataset to train it.
  3. Each person will present his/her results at the next session (Feb. 17, 2016).
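
Since everyone's code will be trained on the same data, it may help to agree on a common shape for the classifiers. A minimal sketch of a fit/predict interface (the class name `MyClassifier` and the majority-class placeholder are my own illustration, not part of the assignment):

```python
from collections import Counter

class MyClassifier(object):
    """Skeleton for the assigned algorithm (Naive Bayes, Decision Tree, or kNN).
    As a placeholder, fit() just memorizes the most common label and
    predict() always returns it -- replace both with the real algorithm."""

    def fit(self, X, y):
        # X: list/array of feature vectors; y: list/array of labels
        self.most_common_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Return one predicted label per input sample
        return [self.most_common_ for _ in X]

clf = MyClassifier().fit([[0, 1], [1, 0], [1, 1]], [5, 5, 3])
print(clf.predict([[2, 2]]))  # the majority label: [5]
```

Keeping to a shared fit/predict pattern makes it easy to swap in each person's algorithm when comparing results.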

Instructions

The code that you write can be uploaded to the Github Repository under the folder 'profiles/your-name'.

Steps

  1. You first need to have access to the VandyAstroML group. Otherwise, you won't be able to push your code to the repository. If you don't have access, let me (Victor) know.
  2. Make a copy (clone) of the repository onto your local machine.
  3. Create a folder under 'profiles/your-name'.
  4. Write the code and save it in that directory.
  5. Update your local version before pushing your changes (using git pull).
  6. Add, commit, and push your changes.

Git Commands

Before you begin, make sure Git is set up correctly with your name and email:


In [ ]:
$ git config --global user.name "John Doe"
$ git config --global user.email "johndoe@example.com"

To clone the repository onto your local machine:


In [ ]:
$ git clone https://github.com/VandyAstroML/Vandy_AstroML.git

To create your folder under 'profiles'


In [ ]:
$ cd profiles
$ mkdir your_name_folder
$ cd your_name_folder

To add a file 'README.md' to the repository (assuming you have already created the file and saved it in 'profiles/your_name_folder/'):


In [ ]:
$ git pull
$ git add README.md
$ git commit -am "Some useful message here..."
$ git push origin master

You should now see a message telling you that the upload was successful!

Congrats, you just pushed your changes to the repository!

Using the dataset

Now we are ready to use the dataset 'digits' from scikit-learn. For further details, you can see: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits


In [1]:
%matplotlib inline
import matplotlib
import numpy as num
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits



Loading the dataset and showing the number of features, sample size, etc.


In [12]:
# Loading dataset
digits = load_digits()

You can read the description of the dataset by using the 'DESCR' key:


In [13]:
print(digits['DESCR'])


Optical Recognition of Handwritten Digits Data Set
===================================================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

References
----------
  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.
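
The 32x32 → 8x8 block-counting step described above is easy to reproduce. A sketch with NumPy, using a random bitmap rather than actual NIST data:

```python
import numpy as np

rng = np.random.RandomState(0)
bitmap = rng.randint(0, 2, size=(32, 32))  # a fake 32x32 binary bitmap

# Split into non-overlapping 4x4 blocks and count the 'on' pixels in each.
# Axes 1 and 3 index the position *within* each block, so summing over
# them yields one count per block.
blocks = bitmap.reshape(8, 4, 8, 4)
features = blocks.sum(axis=(1, 3))

print(features.shape)        # (8, 8)
print(features.max() <= 16)  # each 4x4 block has at most 16 'on' pixels
```

This is exactly why each of the 64 features in the dataset is an integer in the range 0..16.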

You can also see the structure and data that is included in the dataset


In [14]:
# Displaying the different keys/attributes
# of the dataset
print('Keys:', digits.keys())

# Loading data
# This includes the pixel values for each of the samples
digits_data = digits['data']
print('Data for 1st element:', digits_data[0])

# Targets
# This is the actual number for each sample, i.e. the 'truth'
digits_targetnames = digits['target_names']
print('Target names:', digits_targetnames)

digits_target = digits['target']
print('Targets:', digits_target)


Keys: ['images', 'data', 'target_names', 'DESCR', 'target']
Data for 1st element: [  0.   0.   5.  13.   9.   1.   0.   0.   0.   0.  13.  15.  10.  15.   5.
   0.   0.   3.  15.   2.   0.  11.   8.   0.   0.   4.  12.   0.   0.   8.
   8.   0.   0.   5.   8.   0.   0.   9.   8.   0.   0.   4.  11.   0.   1.
  12.   7.   0.   0.   2.  14.   5.  10.  12.   0.   0.   0.   0.   6.  13.
  10.   0.   0.   0.]
Target names: [0 1 2 3 4 5 6 7 8 9]
Targets: [0 1 2 ..., 8 9 8]

This means that you have 1797 samples, and each of them is characterized by 64 different features (pixel values).
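
You can confirm those numbers directly from the arrays' shapes (a quick sanity check, not part of the original notebook):

```python
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)    # (1797, 64): 1797 samples x 64 pixel features
print(digits.images.shape)  # (1797, 8, 8): the same data, as 8x8 images
print(digits.target.shape)  # (1797,): one true label per sample
```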

We can also visualize some of the data, using the 'images' key:


In [15]:
# Choosing a colormap
color_map_used = plt.get_cmap('autumn')

In [16]:
# Visualizing some of the targets
fig, axes = plt.subplots(2,5, sharex=True, sharey=True, figsize=(20,12))
axes_f = axes.flatten()
for ii in range(len(axes_f)):
    axes_f[ii].imshow(digits['images'][ii], cmap = color_map_used)
    axes_f[ii].text(1, -1, 'Target: {0}'.format(digits_target[ii]), fontsize=30)
plt.show()


The algorithm will be able to use the pixel values to determine that the first image is a '0', the second a '1', and so on up to '9'.

Let's see some examples of the number 2:


In [17]:
IDX2 = num.where(digits_target == 2)[0]
print('There are {0} samples of the number 2 in the dataset'.format(IDX2.size))

fig, axes = plt.subplots(2,5, sharex=True, sharey=True, figsize=(20,12))
axes_f = axes.flatten()
for ii in range(len(axes_f)):
    axes_f[ii].imshow(digits['images'][IDX2][ii], cmap = color_map_used)
    axes_f[ii].text(1, -1, 'Target: {0}'.format(digits_target[IDX2][ii]), fontsize=30)
plt.show()


There are 177 samples of the number 2 in the dataset

In [18]:
print('And now the number 4\n')
IDX4 = num.where(digits_target == 4)[0]
fig, axes = plt.subplots(2,5, sharex=True, sharey=True, figsize=(20,12))
axes_f = axes.flatten()
for ii in range(len(axes_f)):
    axes_f[ii].imshow(digits['images'][IDX4][ii], cmap = color_map_used)
    axes_f[ii].text(1, -1, 'Target: {0}'.format(digits_target[IDX4][ii]), fontsize=30)
plt.show()


And now the number 4

You can see how different the inputs can be by subtracting one sample from another. Here, I'm subtracting two images that both represent the number '4':


In [19]:
# Difference between two samples of the number 4
plt.imshow(digits['images'][IDX4][1] - digits['images'][IDX4][8], cmap=color_map_used)
plt.show()


This figure shows how different two samples can be from each other.
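
Before writing your own implementation, it can be useful to get a baseline from scikit-learn's built-in k-Nearest Neighbors classifier on the same data; your hand-written version should land in the same ballpark. A sketch (the 50/50 split and k=3 are my own choices, not part of the assignment):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()

# Hold out half the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# Fit a 3-nearest-neighbors classifier and score it on the held-out half
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
score = knn.score(X_test, y_test)
print('kNN accuracy: {0:.3f}'.format(score))
```

Even this off-the-shelf classifier scores well above 90% on the held-out digits, so it makes a good sanity check for your own code.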