For this week's project, we'll be using the Digits dataset from Scikit-Learn. Each sample in this dataset contains the pixel values of an 8x8 image of a handwritten digit, from 0 to 9.
The code that you write can be uploaded to the GitHub repository under the folder 'profiles/your-name'.
Before you begin, make sure git is configured correctly:
In [ ]:
$ git config --global user.name "John Doe"
$ git config --global user.email "johndoe@example.com"
To clone the repository to your local machine:
In [ ]:
$ git clone https://github.com/VandyAstroML/Vandy_AstroML.git
To create your folder under 'profiles':
In [ ]:
$ cd Vandy_AstroML/profiles
$ mkdir your_name_folder
$ cd your_name_folder
To add a file 'README.md' to the repository (assuming you have already created the file and saved it in /profiles/your_name_folder/):
In [ ]:
$ git pull
$ git add README.md
$ git commit -am "Some useful message here..."
$ git push origin master
You should now see a message telling you that the upload was successful!
Congrats, you just pushed your changes to the repository!
Now we are ready to use the dataset 'digits' from scikit-learn. For further details, you can see: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits
In [1]:
%matplotlib inline
import matplotlib
import numpy as num
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
Loading the dataset and showing the number of features, sample size, etc
In [12]:
# Loading dataset
digits = load_digits()
You can read the description of the dataset by using the 'DESCR' key:
In [13]:
print(digits['DESCR'])
You can also see the structure and data that is included in the dataset
In [14]:
# Displaying the different keys/attributes
# of the dataset
print('Keys: {0}'.format(list(digits.keys())))
# Loading data
# This includes the pixel values for each of the samples
digits_data = digits['data']
print('Data for 1st element: {0}'.format(digits_data[0]))
# Targets
# This is the actual number that each sample represents, i.e. the 'truth'
digits_targetnames = digits['target_names']
print('Target names: {0}'.format(digits_targetnames))
digits_target = digits['target']
print('Targets: {0}'.format(digits_target))
This means that you have 1797 samples, and each of them is characterized by 64 different features (pixel values).
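The sample and feature counts can be read directly from the array shapes. The short sketch below reloads the dataset so that it runs on its own; the variable names `n_samples` and `n_features` are only illustrative.

```python
from sklearn.datasets import load_digits

# Reload the dataset so this snippet is self-contained
digits = load_digits()

# 'data' is a 2D array: one row per sample, one column per pixel
n_samples, n_features = digits['data'].shape
print('Number of samples: {0}'.format(n_samples))
print('Number of features per sample: {0}'.format(n_features))

# There is exactly one target (the true digit) per sample
print('Shape of targets: {0}'.format(digits['target'].shape))
```

The 64 features are simply the 8x8 pixel grid of each image flattened into a single row.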
We can also visualize some of the data, using the 'images' key:
In [15]:
# Choosing a colormap
color_map_used = plt.get_cmap('autumn')
In [16]:
# Visualizing some of the targets
fig, axes = plt.subplots(2,5, sharex=True, sharey=True, figsize=(20,12))
axes_f = axes.flatten()
for ii in range(len(axes_f)):
    # Show the 8x8 image and label it with its true digit
    axes_f[ii].imshow(digits['images'][ii], cmap=color_map_used)
    axes_f[ii].text(1, -1, 'Target: {0}'.format(digits_target[ii]), fontsize=30)
plt.show()
A classification algorithm will be able to use the pixel values to determine that, for example, the first number is '0' and the fifth one is '4'.
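As a rough illustration of what such an algorithm could look like, here is a minimal sketch that trains a k-nearest-neighbors classifier on part of the data and checks it on the rest. The choice of classifier, the 75/25 split, `n_neighbors=3`, and `random_state=0` are all assumptions made for this example; the project does not prescribe a particular method.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()

# Hold out 25% of the samples to test the classifier on unseen data
# (split size and random seed are arbitrary choices for this sketch)
X_train, X_test, y_train, y_test = train_test_split(
    digits['data'], digits['target'], test_size=0.25, random_state=0)

# k-nearest neighbors: one possible classifier, chosen only as an example
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Fraction of held-out samples whose digit was predicted correctly
accuracy = clf.score(X_test, y_test)
print('Accuracy on held-out samples: {0:.3f}'.format(accuracy))

# Predicted digit for the very first sample in the dataset
first_prediction = clf.predict(digits['data'][:1])[0]
print('Prediction for the first sample: {0}'.format(first_prediction))
```

Scoring on held-out samples gives a fairer estimate than scoring on the data the classifier was trained on.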
Let's see some examples of the number 2:
In [17]:
# Indices of all samples whose target is the number 2
IDX2 = num.where(digits_target == 2)[0]
print('There are {0} samples of the number 2 in the dataset'.format(IDX2.size))
fig, axes = plt.subplots(2,5, sharex=True, sharey=True, figsize=(20,12))
axes_f = axes.flatten()
for ii in range(len(axes_f)):
    axes_f[ii].imshow(digits['images'][IDX2][ii], cmap=color_map_used)
    axes_f[ii].text(1, -1, 'Target: {0}'.format(digits_target[IDX2][ii]), fontsize=30)
plt.show()
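The count for the number 2 can be generalized to every digit at once. This sketch uses `numpy.bincount`, which is a choice made here for brevity and not part of the original notebook; the name `counts` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
target = digits['target']

# bincount returns, for each integer value 0..9, how often it occurs
counts = np.bincount(target)
for digit, count in enumerate(counts):
    print('Digit {0}: {1} samples'.format(digit, count))
```

The classes turn out to be roughly balanced, which is convenient when comparing classifiers later.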
In [18]:
print('And now the number 4\n')
# Indices of all samples whose target is the number 4
IDX4 = num.where(digits_target == 4)[0]
fig, axes = plt.subplots(2,5, sharex=True, sharey=True, figsize=(20,12))
axes_f = axes.flatten()
for ii in range(len(axes_f)):
    axes_f[ii].imshow(digits['images'][IDX4][ii], cmap=color_map_used)
    axes_f[ii].text(1, -1, 'Target: {0}'.format(digits_target[IDX4][ii]), fontsize=30)
plt.show()
You can see how different two inputs are by subtracting one image from another. Here, I'm subtracting two images that both represent the number '4':
In [19]:
# Difference between two samples of the number 4
plt.imshow(digits['images'][IDX4][1] - digits['images'][IDX4][8], cmap=color_map_used)
plt.show()
This figure shows how different two samples can be from each other.
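The difference can also be summarized as a single number. The sketch below takes the same two samples of the number 4 and computes the sum of absolute pixel differences and the Euclidean distance. These two measures are just examples of how one might quantify the difference; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
images = digits['images']

# Indices of all samples of the number 4
idx4 = np.where(digits['target'] == 4)[0]

# The same two samples of '4' that are subtracted in the figure
img_a = images[idx4[1]]
img_b = images[idx4[8]]
diff = img_a - img_b

# Two simple ways to collapse the 8x8 difference into one number
abs_diff = np.abs(diff).sum()             # sum of absolute differences
euclidean = np.sqrt((diff ** 2).sum())    # Euclidean (L2) distance
print('Sum of absolute differences: {0}'.format(abs_diff))
print('Euclidean distance: {0:.2f}'.format(euclidean))
```

A value of zero would mean the two images are identical pixel by pixel; larger values mean the two handwritten samples differ more.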