There are two kinds of machine learning we will talk about today: supervised learning and unsupervised learning.
supervised: classification, regression,...
unsupervised: clustering, dimension reduction,...
Scikit-learn strives to have a uniform interface across all objects. Given a scikit-learn estimator named model, the following methods are available:
Available in all Estimators
model.fit(): fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, fit takes only a single argument, the data X (e.g. model.fit(X)).

Available in supervised estimators
model.predict(): given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
model.fit_predict(): fits the model and predicts the labels for the training data in a single step; this is mainly provided by unsupervised estimators such as the clusterer used below.
model.predict_proba(): for classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
model.score(): an indication of how well the model fits the training data. Scores are between 0 and 1, with a larger score indicating a better fit.

Data in scikit-learn, with very few exceptions, is assumed to be stored as a two-dimensional array of size [n_samples, n_features]. Many algorithms also accept scipy.sparse matrices of the same shape.
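As a minimal sketch of this uniform interface (the toy data and the choice of KNeighborsClassifier here are purely illustrative; any classifier exposes the same methods):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data: 6 samples, 2 features -> a [n_samples, n_features] array
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y = np.array([0, 0, 0, 1, 1, 1])

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                    # supervised fit: data + labels
X_new = np.array([[0.1, 0.2], [1.1, 0.9]])
print(model.predict(X_new))        # one learned label per sample
print(model.predict_proba(X_new))  # per-class probabilities
print(model.score(X, y))           # accuracy on the training data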
What if you have categorical features? For example, imagine there is data on the color of each iris:
color in [red, blue, purple]
You might be tempted to assign numbers to these features, i.e. red=1, blue=2, purple=3, but in general this is a bad idea. Estimators tend to operate under the assumption that numerical features lie on some continuous scale, so, for example, 1 and 2 are more alike than 1 and 3, and this is often not the case for categorical features.
A better strategy is to give each category its own dimension.
The enriched iris feature set would hence be, in this case: sepal length, sepal width, petal length, petal width, color=purple, color=blue, color=red.
Note that using many of these categorical features may result in data which is better represented as a sparse matrix, as we'll see with the text classification example below.
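As a small sketch of this expansion (the color values below are made up for illustration, and OneHotEncoder is just one of several ways to do it):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hypothetical color annotation for four iris samples
color = np.array([['red'], ['blue'], ['purple'], ['red']])

enc = OneHotEncoder()
expanded = enc.fit_transform(color)  # sparse matrix with one column per category
print(enc.categories_)               # categories found in the data
print(expanded.toarray())            # the boolean expansion as a dense array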
When the source data is encoded as a list of dicts where the values are either string names for categories or numerical values, you can use the DictVectorizer class to compute the boolean expansion of the categorical features while leaving the numerical features unaffected:
In [2]:
measurements = [
{'city': 'Dubai', 'temperature': 33.},
{'city': 'London', 'temperature': 12.},
{'city': 'San Francisco', 'temperature': 18.},
]
In [3]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
tf_measurements = vec.fit_transform(measurements)
tf_measurements.toarray()
Out[3]:
array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])
In [4]:
vec.get_feature_names()
Out[4]:
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
In [5]:
# disable some annoying warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [6]:
# load the iris dataset
import sklearn.datasets
data = sklearn.datasets.load_iris()
data.data.shape
Out[6]:
(150, 4)
In [7]:
from sklearn.cluster import KMeans
iris_pred = KMeans(n_clusters=3, random_state=102).fit_predict(data.data)
In [8]:
plt.figure(figsize=(12, 12))
colors = sns.color_palette()
plt.subplot(211)
plt.scatter(data.data[:, 0], data.data[:, 1], c=[colors[i] for i in iris_pred], s=40)
plt.title('KMeans-3 clusterer')
plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])
plt.subplot(212)
plt.scatter(data.data[:, 0], data.data[:, 1], c=[colors[i] for i in data.target], s=40)
plt.title('Ground Truth')
plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])
Out[8]:
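The visual comparison can be complemented with a quantitative one; as a small sketch, adjusted_rand_score from sklearn.metrics measures the agreement between the cluster assignment and the ground-truth labels (1.0 is perfect agreement, values near 0 correspond to random labeling):

from sklearn.metrics import adjusted_rand_score
print(adjusted_rand_score(data.target, iris_pred))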
In [9]:
import sklearn.model_selection
data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(
    data.data, data.target, test_size=0.20, random_state=5)
print(data.data.shape, data_train.shape, data_test.shape)
Now, we use a DecisionTreeClassifier to learn a model and test the result.
In [10]:
from sklearn.tree import DecisionTreeClassifier
instance = DecisionTreeClassifier()
r = instance.fit(data_train, target_train)
target_predict = instance.predict(data_test)
from sklearn.metrics import accuracy_score
print('Prediction accuracy: ', accuracy_score(target_test, target_predict))
Pretty good, isn't it?
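The predict_proba() and score() methods described earlier are available on the trained tree as well; a quick sketch:

# per-class probabilities for a few held-out samples
print(instance.predict_proba(data_test)[:5])
# mean accuracy on the test set (same value as accuracy_score above)
print(instance.score(data_test, target_test))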
If we go back to our K-Means example, the clustering doesn't really make sense. However, we are just looking at two out of four dimensions, so we can't really see the true distances/similarities between items. Dimension reduction techniques reduce the number of dimensions while preserving the inner structure of the higher-dimensional data. We take a look at two of them: Multidimensional Scaling (MDS) and Principal Component Analysis (PCA).
In [11]:
from sklearn import manifold
# create an MDS instance
mds = manifold.MDS(n_components=2, random_state=5)
# fit the model and get the embedded coordinates
pos = mds.fit(data.data).embedding_
plt.scatter(pos[:, 0], pos[:, 1], s=20, c=[colors[i] for i in data.target])
# since we have a single scatter plot (not three), fake the legend using patches
import matplotlib.patches as mpatches
patches = [mpatches.Patch(color=colors[i], label=data.target_names[i]) for i in range(3)]
plt.legend(handles=patches)
In [12]:
# compare with, e.g., PCA
from sklearn import decomposition
pca = decomposition.PCA(n_components=2)
pca_pos = pca.fit(data.data).transform(data.data)
mds_pos = mds.fit(data.data).embedding_
plt.figure(figsize=[20,7])
plt.subplot(121)
plt.scatter(mds_pos[:, 0], mds_pos[:, 1], s=30, c=[colors[i] for i in data.target])
plt.title('MDS')
plt.subplot(122)
plt.scatter(pca_pos[:, 0], pca_pos[:, 1], s=30, c=[colors[i] for i in data.target])
plt.title('PCA')
Out[12]:
It seems that versicolor and virginica are more similar to each other than to setosa.
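To quantify how much of the four-dimensional structure the two PCA components retain, we can inspect the fitted PCA's explained_variance_ratio_ attribute; a quick sketch:

print(pca.explained_variance_ratio_)                  # variance explained per component
print('total:', pca.explained_variance_ratio_.sum())  # fraction captured by 2 components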
In [13]:
from ipywidgets import interact
colors = sns.color_palette(n_colors=10)
Thanks!