This notebook will introduce you to scikit-learn, the preferred Python package for classical machine learning. Scikit-learn implements all the standard machine learning approaches and uses a common syntax, making it easy to try different methods. It is very popular and well documented, with great tutorials (I'm using several of them for this course!) and advice on use cases for the different methods, so it will be easy to get help if you ever get stuck. If you want a popular, well-established machine learning approach (i.e. not deep learning, or something that just came out), scikit-learn is where you should start. For more details be sure to check out the website: http://scikit-learn.org/stable/index.html
Scikit-learn comes pre-installed in the environment you are using for this course. On other machines you may need to install it; see here for instructions: http://scikit-learn.org/stable/install.html. Once it is installed, you can simply import the library like any other.
Like most large Python libraries, scikit-learn is organized hierarchically, so you only need to import the parts relevant to you. For example, to get started you would:
In [1]:
from sklearn import datasets # for various test data sets
from sklearn import decomposition # for PCA, ICA, NMF
from sklearn import manifold # for MDS, tSNE
# then to use it you might do:
digits = datasets.load_digits(n_class=6)
# While we're at it, let's load a few other things we will need
import numpy as np
import matplotlib.pyplot as plt
As mentioned earlier, scikit-learn does a great job of using a common syntax shared across its different approaches. This syntax can be divided into the steps of applying a machine learning method:
1. Initialize the model, setting any hyperparameters:
   - `nnmf = decomposition.NMF(n_components=2)` would initialize a 2-component NMF model
   - `logistic = linear_model.LogisticRegression(C=1e5)` is the syntax for logistic regression with inverse regularization strength C
   - PCA, NMF and FastICA all accept the `n_components=` argument
2. Fit the model to your data with `model.fit(data1, ...)` (where `model` is whatever you named it when initializing, e.g. `nnmf` or `logistic` above):
   - `nnmf.fit(X)` where `X` is a numpy array containing your data
   - `logistic.fit(trainingData, trainingLabels)` where the two inputs are arrays containing training data and the corresponding training labels, respectively
3. Apply the fitted model to data:
   - `output = model.transform(data)` for most unsupervised methods
   - `predictedLabel = model.predict(data)` for most supervised methods

Note: for unsupervised methods you can use `fit_transform` to perform the model fit and produce the analysis result on that data in one step.
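To make the supervised initialize/fit/predict pattern concrete, here is a minimal sketch using logistic regression on the digits data (the `max_iter=1000` argument is my addition so the solver converges cleanly; it is not part of the pattern itself):

```python
from sklearn import datasets, linear_model

digits = datasets.load_digits(n_class=6)

# 1. Initialize the model with its hyperparameters
logistic = linear_model.LogisticRegression(C=1e5, max_iter=1000)

# 2. Fit it to (training data, training labels)
logistic.fit(digits.data, digits.target)

# 3. Apply it: predict a label for each row of data
predictedLabel = logistic.predict(digits.data)
```

Note that here we predict on the same data we trained on, which is only for illustration; in practice you would hold out a test set.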
In [21]:
# Here is an unsupervised example with PCA
# Let's work with the digits data, which contains 8x8 pixel images of handwritten digits
X = digits.data # load data for the digits, each 8x8 image is stored as a 64 dimensional row in the array
y = digits.target # the true number represented by the image
print(X.shape)
plt.imshow(X[0,:].reshape(8,8),cmap=plt.cm.binary)
plt.title('An example of a single image')
plt.show()
# For your exercise you will need to replace these 3 lines
pca = decomposition.PCA(n_components=2)
pca.fit(X)
outCoords=pca.transform(X)
plt.figure()
ax = plt.subplot(111)
x_min, x_max = np.min(outCoords, 0), np.max(outCoords, 0)
outCoordsScaled = (outCoords - x_min) / (x_max - x_min)
for i in range(X.shape[0]):
    plt.text(outCoordsScaled[i, 0], outCoordsScaled[i, 1], str(y[i]),
             color=plt.cm.tab20(y[i] / 10.), fontdict={'weight': 'bold', 'size': 12})
plt.title('PCA applied to digits: each number represents the true label of a single image')
plt.show()
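As noted above, for unsupervised methods `fit_transform` collapses the fit and transform steps into a single call. A small sketch of the equivalence on the digits data (I pin `svd_solver='full'` here, an extra setting of my own, so both runs are exactly deterministic and comparable):

```python
from sklearn import datasets, decomposition
import numpy as np

X = datasets.load_digits(n_class=6).data

# Two-step version: fit, then transform
pca = decomposition.PCA(n_components=2, svd_solver='full')
pca.fit(X)
twoStep = pca.transform(X)

# One-step version: fit_transform does both at once
oneStep = decomposition.PCA(n_components=2, svd_solver='full').fit_transform(X)

# Both produce the same 2-D coordinates (up to floating point error)
sameResult = np.allclose(twoStep, oneStep)
```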