Python library for data mining and data analysis
Built on NumPy, SciPy, and matplotlib
Documentation: http://scikit-learn.org/stable/documentation.html
scikit-learn comes with a few standard datasets (for example, iris and digits)
each dataset is a dictionary-like object holding the data and its metadata
.data member stores the data as an array of shape (n_samples, n_features)
.target member stores the response variables
In [1]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
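A minimal sketch of what "dictionary-like" means in practice, using the iris dataset loaded above. The variable names (n_samples, n_features, same_object) are only illustrative.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The returned object is dictionary-like: it has keys, and each key
# is also reachable as an attribute.
print(sorted(iris.keys()))

# .data is a 2-D array of shape (n_samples, n_features)
n_samples, n_features = iris.data.shape
print(n_samples, n_features)

# .target holds one response value per sample
print(iris.target.shape)

# Dictionary-style and attribute-style access return the same array
same_object = iris["data"] is iris.data
print(same_object)
```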
.data member contains the features used to classify the digit sample images
each pixel is a feature
there are 1797 image samples, each has 64 features (8x8 pixels)
this is the same data that appears in the .images member, but with each 8x8 image flattened into a row of 64 values
In [2]:
print(digits.data.shape)
print(digits.data[0])
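The relationship between .data and .images can be checked directly. This is a small sketch that flattens each 8x8 image with NumPy and compares the result with .data. The names flattened and matches are illustrative.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image into a row of 64 values
n_samples = len(digits.images)
flattened = digits.images.reshape((n_samples, -1))

# Compare element by element with the .data member
matches = np.array_equal(flattened, digits.data)
print(flattened.shape, matches)
```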
.target member contains the truth label for each image
There are 1797 target values, one for each image
In [3]:
print(digits.target.shape)
print(digits.target)
.target_names member has the name of each target class (the digits 0-9)
In [4]:
print(digits.target_names.shape)
print(digits.target_names)
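To see how the 1797 labels are spread across the ten classes, one option is to count them with NumPy. This is only a sketch. np.bincount is an assumption of convenience here and is not part of the original notes.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Count how many samples carry each label 0-9
counts = np.bincount(digits.target)

# Pair each class name with its sample count
for name, count in zip(digits.target_names, counts):
    print(name, count)
```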
.images member contains the actual image samples
there are 1797 images which are 8x8 pixels
this is the same data as the .data member, but with each sample reshaped into an 8x8 array
In [5]:
print(digits.images.shape)
digits.images[0]
Out[5]:
Plot the above array as an image
Plot it a second time with a grayscale colormap
In the grayscale plot, light pixels are 0 in the array and dark pixels are closer to 15
use pyplot to display the first and last images
the gray_r colormap changes only how the values are rendered, not the data itself
In [9]:
from matplotlib import pyplot as plt
plt.subplot(1,4,1)
plt.imshow(digits.images[0])
plt.subplot(1,4,2)
plt.imshow(digits.images[0], cmap=plt.cm.gray_r)
plt.subplot(1,4,3)
plt.imshow(digits.images[1796])
plt.subplot(1,4,4)
plt.imshow(digits.images[1796], cmap=plt.cm.gray_r)
plt.show()
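The same four panels can also be drawn with plt.subplots, which makes it easy to add the truth label as a title. This is a sketch, not the notebook's original approach. The Agg backend line is an assumption for running outside a notebook, and the names panels, index, and cmap are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
from matplotlib import pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

# One row of four axes: first and last image, default and gray_r colormaps
fig, axes = plt.subplots(1, 4, figsize=(8, 2.5))
panels = [(0, None), (0, plt.cm.gray_r), (-1, None), (-1, plt.cm.gray_r)]

for ax, (index, cmap) in zip(axes, panels):
    ax.imshow(digits.images[index], cmap=cmap)
    ax.set_title("label %d" % digits.target[index])  # truth label from .target
    ax.axis("off")

fig.tight_layout()
# In an interactive session, call plt.show() here to display the figure.
```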
an estimator is fit with training data in order to predict unseen samples
in scikit-learn an estimator is a Python object that implements the fit(X, y) and predict(T) methods
support vector classification (sklearn.svm.SVC) is an example classification estimator
SVC takes hyperparameters, including gamma and C, which are set in the constructor
In [10]:
from sklearn import svm
classifier = svm.SVC(gamma=0.001, C=100.)
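A short sketch of how the hyperparameters can be inspected and changed after the estimator is constructed, using the get_params and set_params methods that scikit-learn estimators provide. The value C=10.0 is an arbitrary illustration, not a recommended setting.

```python
from sklearn import svm

classifier = svm.SVC(gamma=0.001, C=100.)

# Hyperparameters passed to the constructor can be read back
params = classifier.get_params()
print(params["gamma"], params["C"])

# fit and predict are ordinary methods on the estimator object
print(callable(classifier.fit), callable(classifier.predict))

# set_params changes a hyperparameter in place; it applies on the next fit
classifier.set_params(C=10.0)  # arbitrary value, for illustration only
updated_C = classifier.get_params()["C"]
print(updated_C)
```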
the SVC is trained (fit) with all of the data except the last item
the fitted SVC is then used to predict the class of the last item
In [11]:
classifier.fit(digits.data[:-1], digits.target[:-1])
Out[11]:
In [12]:
classifier.predict(digits.data[-1:])
Out[12]:
In [13]:
digits.target[1796]
Out[13]:
The prediction matches the label stored in .target, so the SVC classifies the last image correctly
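A single held-out image is weak evidence of how well the model generalizes. As a hedged sketch, the same SVC can be scored on a larger held-out set. The use of train_test_split, the 25% test size, and random_state=0 are assumptions made for this illustration and are not part of the original notes.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out a quarter of the samples (assumed split, for illustration)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Same hyperparameters as the classifier above
classifier = svm.SVC(gamma=0.001, C=100.)
classifier.fit(X_train, y_train)

# Predict the held-out samples and compare with their truth labels
predicted = classifier.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(accuracy)
```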