scikit-learn is the most widely used Python library for machine learning (though you might want to look into TensorFlow or Theano if you are interested in deep learning). It includes both supervised and unsupervised learning algorithms, clustering, dimensionality reduction (e.g. PCA), and several pre-processing utilities and performance metrics.
Despite the great number of algorithms implemented in the library, the interface is similar for all of them, making it easy to use them and plug them together. Advanced usage includes pipelines and advanced cross-validation strategies.
Key components of scikit-learn are the following:
- Imputer, which imputes missing values
- MinMaxScaler, which rescales each feature of the given dataset to a fixed range
- LinearRegression, which, as you might have guessed, fits a linear regression

All of these components (actually classes) have consistent characteristics:
- a fit method to train the model or learn some parameters
- a transform (or fit_transform) method to transform the data
- a predict method to carry out the predictions
- a score method to assess the goodness of the predictions
- learned attributes, whose names by convention end with an underscore (e.g. Ridge's coef_ attribute)

This consistency allows great flexibility: different machine-learning methods and algorithms can be swapped easily, and pipelines can be easily built, since each class can feed its result into the next.
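As a quick illustration of this shared interface, here is a minimal sketch on a tiny invented dataset (all values and variable names below are just for illustration):
In [ ]:
# Minimal sketch of the common fit / transform / predict pattern (toy data).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])     # shape (n_samples, n_features)
y = np.array([2.0, 4.1, 5.9, 8.0])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)             # fit + transform in one call
reg = LinearRegression().fit(X_scaled, y)      # fit learns the parameters
print(reg.predict(scaler.transform([[5.0]])))  # predict on new, scaled data
print(reg.coef_)                               # learned attributes end with "_"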
Example: a predictor of breast cancer malignancy
Datasets in scikit-learn follow the scheme (n_samples, n_features), using numpy.ndarrays. If you want greater flexibility, it is advisable to use pandas dataframes.
This dataset is available inside scikit-learn as a test dataset. Playing with it will provide a good introduction to the main concepts of this package.
In [ ]:
# Plotting imports
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
In [ ]:
import numpy as np
import pandas as pd
from sklearn import datasets
In [ ]:
d = datasets.load_breast_cancer()
In [ ]:
print(d['DESCR'])
In [ ]:
d.keys()
In [ ]:
d['data']
In [ ]:
d['target']
In [ ]:
d['target_names']
In [ ]:
df = pd.DataFrame(d['data'],
columns=d['feature_names'])
In [ ]:
target = pd.Series(d['target'],
name='target')
In [ ]:
df.info()
All the samples seem to have all the features, so we will not need to impute or drop any of them.
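Had there been missing values, imputation would have looked roughly like the sketch below (in recent scikit-learn versions the imputer lives in sklearn.impute as SimpleImputer; this cell is only illustrative, since our dataset is complete):
In [ ]:
# Illustrative only: impute missing values with the column median.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
df_imputed = pd.DataFrame(imputer.fit_transform(df),
                          columns=df.columns)
df_imputed.info()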
Even though it is not strictly part of scikit-learn, getting a broad overview of the input dataset is a mandatory step in any machine learning project. We can use either pandas' plotting capabilities or matplotlib/seaborn.
In [ ]:
df.hist(bins=50, figsize=(20,15));
Our features are not on the same scale, despite all following similar distributions. This could be a problem for some algorithms. We will therefore need to perform feature scaling, of which there are two flavours:
- rescaling each feature between a min and a max value (i.e. through the MinMaxScaler transformer; see the short sketch below)
- standardisation, i.e. removing the mean and scaling to unit variance (through the StandardScaler transformer, used later on)

Another routine operation is to verify the relationship between features, so that highly correlated ones are either combined or selected. Again, we can get a quick sense using either seaborn or pandas.
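As a quick aside, this toy cell sketches how the two scaling flavours mentioned above behave (illustrative values only):
In [ ]:
# Toy illustration of the two scaling flavours.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
toy = np.array([[1.0], [5.0], [10.0]])
print(MinMaxScaler().fit_transform(toy).ravel())    # rescaled into [0, 1]
print(StandardScaler().fit_transform(toy).ravel())  # zero mean, unit variance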
In [ ]:
sns.pairplot(data=df[df.columns[:5]], height=2);
In [ ]:
from pandas.plotting import scatter_matrix
scatter_matrix(df[df.columns[:5]], figsize=(10, 10));
We can leverage seaborn's clustermap method to visualize the correlation matrix of all features.
In [ ]:
mclust = sns.clustermap(df.corr(),
figsize=(10, 10),
cmap='RdBu')
mclust.ax_heatmap.set_yticklabels(mclust.ax_heatmap.get_yticklabels(),
rotation=0);
We can conclude that it might be a good idea to either drop some features or create new ones (feature engineering) to improve our predictor. For instance, we could highlight those features that have a correlation above 0.7 and randomly pick a representative.
In [ ]:
feat_corr = df.corr()
feat_corr.stack().head(5)
In [ ]:
feat_corr = df.corr()
# ignore the diagonal, obviously
np.fill_diagonal(feat_corr.values, np.nan)
feat_corr = feat_corr.stack()
In [ ]:
feat_corr[feat_corr > 0.7].head(5)
In [ ]:
# can you think of a smarter way to perform this operation?
high_corr = feat_corr[feat_corr > 0.7]
discarded = set()
saved = set()
for feat1 in {x[0] for x in high_corr.index}:
    if feat1 in discarded:
        continue
    saved.add(feat1)
    for feat2 in high_corr.loc[feat1].index:
        discarded.add(feat2)
In [ ]:
saved
In [ ]:
discarded
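As for the comment above asking for a smarter way: one possible vectorised alternative (a sketch, which may keep a different but equally valid set of representatives) is to look only at the upper triangle of the absolute correlation matrix:
In [ ]:
# Sketch of an alternative: drop any column whose absolute correlation with an
# earlier column is above 0.7 (the result may differ from the loop above).
corr_abs = df.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
{col for col in upper.columns if (upper[col] > 0.7).any()}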
In [ ]:
sns.pairplot(data=df.loc[:, sorted(set(df.columns) - discarded)], height=2);
In [ ]:
df = df.loc[:, sorted(set(df.columns) - discarded)]
To avoid the risk of confirmation bias, and most importantly overfitting, it is also good practice to split our dataset early into two: a training set and a test set.
You should avoid at all costs the temptation to use the test set until the very end of the project, and only to choose which model gives the best trade-offs.
In [ ]:
from sklearn import model_selection
In [ ]:
df_train, df_test = model_selection.train_test_split(df,
test_size=0.2)
df_train.head(5)
In [ ]:
target_train, target_test = target.iloc[df_train.index], target.iloc[df_test.index]
target_train.head(5)
In [ ]:
target_train[target_train == 0].shape[0] / target_train[target_train == 1].shape[0]
In [ ]:
target_test[target_test == 0].shape[0] / target_test[target_test == 1].shape[0]
We might have a problem here: the proportion of benign and malignant tumours is uneven between the train and test sets! This might introduce a bias in our predictor, which would be reflected in poorer performance on our final predictions.
In [ ]:
cv = model_selection.StratifiedShuffleSplit(n_splits=1,
test_size=0.2,
random_state=42)
The StratifiedShuffleSplit class is a more advanced way to generate a test set, such that the proportions of each target class are roughly the same in the train and test sets.
We are defining a random_state so that the random number generator seed is the same each time we run this notebook. This ensures that we will always get the same train/test sets.
In [ ]:
cv.split(df, target)
In [ ]:
train_idx, test_idx = next(cv.split(df, target))
train_idx
In [ ]:
df_train, target_train = df.iloc[train_idx], target.iloc[train_idx]
df_test, target_test = df.iloc[test_idx], target.iloc[test_idx]
df_train.head(5)
In [ ]:
target_train[target_train == 0].shape[0] / target_train[target_train == 1].shape[0]
In [ ]:
target_test[target_test == 0].shape[0] / target_test[target_test == 1].shape[0]
We are now going to focus exclusively on the train set. If we end up applying any transformation to the data, we'll make sure to apply the same transformation to the test set, using the parameters learned on the train set.
In [ ]:
from sklearn import preprocessing
In [ ]:
std_scaler = preprocessing.StandardScaler()
std_scaler.fit(df_train)
As you can see, whenever we call the fit method, the returned value is the same instance of the transformer we used. To make things easier, we can use the fit_transform method, which fits the scaler on the provided data and then applies the transformation to that same input data.
In [ ]:
std_scaler.fit_transform(df_train)
In [ ]:
# let's bring it back to a dataframe (keeping the original index so the rows
# stay aligned with target_train)
df_scaled = pd.DataFrame(std_scaler.fit_transform(df_train),
                         columns=df_train.columns,
                         index=df_train.index)
In [ ]:
df_scaled.hist(bins=50, figsize=(20,15));
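For later reference, this is what re-using the parameters learned on the train set would look like on the test set: we call transform (not fit_transform) with the already fitted scaler. We will not actually use this result until the very end.
In [ ]:
# Sketch: scale the test set with the statistics learned on the train set.
df_test_scaled = pd.DataFrame(std_scaler.transform(df_test),
                              columns=df_test.columns,
                              index=df_test.index)
df_test_scaled.head()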
Based on the cheat-sheet above, we are facing a classification problem, meaning that we want a predictor able to take our features as input and output a class label. This particular dataset is easy in this sense, because we are going to work with a binary classifier. The cheat-sheet seems to indicate that, given the limited amount of data points, the best classifier might be a support vector machine with a linear kernel (LinearSVC).
The above scheme highlights the strengths of the different classifiers available, which is also a powerful reminder that it's always a good idea to visualize your data: different structures in the data will dictate different optimal strategies and therefore different algorithms.
In [ ]:
from sklearn import svm
In [ ]:
svm.LinearSVC()
In [ ]:
clf = svm.LinearSVC()
clf = clf.fit(df_scaled, target_train)
This time we are passing both the features and the labels to the fit method. The predict method will apply the learned decision boundaries to the input features to produce the predicted labels.
In [ ]:
clf.predict(df_scaled)
In [ ]:
predict_train = pd.Series(clf.predict(df_scaled),
index=target_train.index,
name='prediction')
In [ ]:
combined = target_train.to_frame().join(predict_train.to_frame())
combined.head(10)
In [ ]:
combined[combined['target'] == combined['prediction']].shape[0] / combined.shape[0]
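The fraction computed above is simply the accuracy, which scikit-learn can also compute directly:
In [ ]:
# Same value as the manual computation above.
from sklearn import metrics
metrics.accuracy_score(combined['target'], combined['prediction'])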
Let's first compute some more evaluation metrics, such as a ROC curve and a precision-recall curve.
In [ ]:
from sklearn import metrics
In [ ]:
plt.figure(figsize=(6, 3))
plt.subplot(121)
fpr, tpr, thresholds = metrics.roc_curve(combined['target'],
combined['prediction'])
plt.plot(fpr, tpr,
'r-')
plt.plot([0, 1],
[0, 1],
'--',
color='grey')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
plt.subplot(122)
prec, rec, thresholds = metrics.precision_recall_curve(combined['target'],
combined['prediction'])
plt.plot(rec, prec,
'r-')
plt.xlabel('recall')
plt.ylabel('precision')
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
plt.tight_layout()
metrics.roc_auc_score(combined['target'],
combined['prediction'])
98% accuracy and 0.98 ROC AUC! Well done...
...but what if we are actually overfitting the input data? This might be especially true when many features are used, leading to the curse of dimensionality. We are also using the default parameters of our classifier, which might contribute to overfitting as well. We might therefore need to use cross-validation on the train set, further splitting it into a train and a validation set. Luckily scikit-learn offers a utility function to do this easily.
In [ ]:
model_selection.cross_val_score(clf, df_scaled, target_train,
cv=10,
scoring=metrics.make_scorer(metrics.roc_auc_score))
So it seems that we are not actually overfitting our data, at least judging from this simple analysis.
But what about picking the best hyperparameters? We could keep it simple and test a range of values for the C and loss parameters, using another useful class of scikit-learn: GridSearchCV, which will explore all combinations of the parameters, using a scoring function to assess their impact on the predictions.
In [ ]:
param_grid = {'C': np.linspace(0.01, 10),
'loss': ['hinge', 'squared_hinge']}
clf = svm.LinearSVC()
grid_search = model_selection.GridSearchCV(clf,
param_grid=param_grid,
cv=10,
scoring=metrics.make_scorer(metrics.roc_auc_score))
grid_search.fit(df_scaled, target_train)
In [ ]:
grid_search.cv_results_.keys()
In [ ]:
plt.figure(figsize=(7, 3))
for mean_score, params in zip(grid_search.cv_results_["mean_test_score"],
                              grid_search.cv_results_["params"]):
    if params['loss'] == 'hinge':
        plt.plot(params['C'],
                 mean_score,
                 'bo')
    else:
        plt.plot(params['C'],
                 mean_score,
                 'ro')
plt.xlabel('C')
plt.ylabel('ROC AUC');
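GridSearchCV also stores the best parameter combination and its cross-validated score directly:
In [ ]:
# Best combination found by the grid search and its mean cross-validated score.
grid_search.best_params_, grid_search.best_score_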
It seems that the combination of loss = hinge and C = 2 produces the best results. There are other hyperparameters that we could tune and better metrics we could use, but we'll stop here and test our trained model on the test dataset.
We'll do it in one go using another very useful feature of scikit-learn: pipelines. Pipelines allow the user to chain several transformers (every step of the pipeline but the last has to implement the fit_transform method) with a final estimator. This way we can keep the process both modular and concise, and we can pass to the final estimator (or to any extra transformer we might have) the hyperparameters we selected above.
In [ ]:
from sklearn import pipeline
In [ ]:
breast_pipeline = pipeline.Pipeline([('scaler', preprocessing.StandardScaler()),
('classifier', svm.LinearSVC(C=2., loss='hinge'))])
In [ ]:
breast_pipeline = breast_pipeline.fit(df_train, target_train)
breast_pipeline.predict(df_test)
In [ ]:
predict_test = pd.Series(breast_pipeline.predict(df_test),
index=target_test.index,
name='prediction')
combined = target_test.to_frame().join(predict_test.to_frame())
combined.head(10)
In [ ]:
plt.figure(figsize=(6, 3))
plt.subplot(121)
fpr, tpr, thresholds = metrics.roc_curve(combined['target'],
combined['prediction'])
plt.plot(fpr, tpr,
'r-')
plt.plot([0, 1],
[0, 1],
'--',
color='grey')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
plt.subplot(122)
prec, rec, thresholds = metrics.precision_recall_curve(combined['target'],
combined['prediction'])
plt.plot(rec, prec,
'r-')
plt.xlabel('recall')
plt.ylabel('precision')
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
plt.tight_layout()
metrics.roc_auc_score(combined['target'],
combined['prediction'])
Our model doesn't perform quite as well as on the training set, which is a good sign that we have not produced an overfitted model. Our predictions are still very good!
K-means clustering example
In [ ]:
iris = datasets.load_iris()
In [ ]:
# rows are observations, columns are features
iris.data
In [ ]:
# true label for each observation
iris.target
In [ ]:
# label names
iris.target_names
In [ ]:
# the simplest preprocessing is to standardize the data
std_scaler = preprocessing.StandardScaler()
iris.data = std_scaler.fit_transform(iris.data)
In [ ]:
from sklearn.cluster import KMeans
In [ ]:
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(iris.data)
In [ ]:
kmeans.labels_
In [ ]:
# once the model has been fitted, we can take a new observation and predict
# which cluster it belongs to; since the model was fitted on standardised
# data, the new observation has to be scaled the same way
kmeans.predict(std_scaler.transform([[5.8, 2.7, 4.0, 1.25]]))
We are going to plot the dataset (using only two features at a time for simplicity), together with the true labels (glyphs) and the clusters (colours).
In [ ]:
plt.figure(figsize=(15, 7))
plt.subplot(121)
for label, glyph in zip(set(iris.target), ('o', 'D', '^')):
    for cluster, color in zip(set(kmeans.labels_), ('b', 'r', 'g')):
        plt.plot(iris.data[(iris.target == label) & (kmeans.labels_ == cluster)][:, 0],
                 iris.data[(iris.target == label) & (kmeans.labels_ == cluster)][:, 1],
                 marker=glyph,
                 linestyle='',
                 color=color,
                 label='{0} - {1}'.format(iris.target_names[label],
                                          cluster))
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.subplot(122)
for label, glyph in zip(set(iris.target), ('o', 'D', '^')):
    for cluster, color in zip(set(kmeans.labels_), ('b', 'r', 'g')):
        plt.plot(iris.data[(iris.target == label) & (kmeans.labels_ == cluster)][:, 2],
                 iris.data[(iris.target == label) & (kmeans.labels_ == cluster)][:, 3],
                 marker=glyph,
                 linestyle='',
                 color=color,
                 label='{0} - {1}'.format(iris.target_names[label],
                                          cluster))
plt.xlabel('feature 2')
plt.ylabel('feature 3')
plt.legend(loc=(1, 0));
The clustering separates the three categories with decent discriminative power; we can use some of the metrics implemented in scikit-learn to be more precise. We are going to use the homogeneity score to measure how "pure" each cluster is.
In [ ]:
metrics.homogeneity_score(iris.target, kmeans.labels_)
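Homogeneity is often read alongside other clustering scores; purely as an optional extra check, here are two of them (also available in sklearn.metrics):
In [ ]:
# Optional complementary metrics: completeness and the adjusted Rand index.
(metrics.completeness_score(iris.target, kmeans.labels_),
 metrics.adjusted_rand_score(iris.target, kmeans.labels_))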
Linear model classifier example
In a real-world scenario we would need to divide our dataset into a training and a test set; for that purpose the sklearn.model_selection module (already used above) should be used, as sketched below.
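A minimal sketch of what that split would look like (not used in the cells below, where for simplicity the model is fitted and evaluated on the full dataset):
In [ ]:
# Sketch only: stratified train/test split of the iris data.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target,
    test_size=0.2,
    random_state=42,
    stratify=iris.target)
X_train.shape, X_test.shape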
In [ ]:
from sklearn.linear_model import RidgeClassifier
In [ ]:
ridge = RidgeClassifier(alpha=1.0)
ridge.fit(iris.data, iris.target)
In [ ]:
predictions = ridge.predict(iris.data)
In [ ]:
metrics.f1_score(iris.target, predictions, average=None)
With this alpha parameter we are probably not overfitting the model, as we are not even predicting the training data perfectly!
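If we wanted to check how sensitive the result is to alpha, a quick and purely illustrative cross-validated scan (over arbitrary values) could look like this:
In [ ]:
# Illustrative scan over a few arbitrary alpha values.
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    scores = model_selection.cross_val_score(RidgeClassifier(alpha=alpha),
                                             iris.data, iris.target,
                                             cv=5)
    print(alpha, round(scores.mean(), 3))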
In [ ]:
plt.figure(figsize=(15, 7))
plt.subplot(121)
plt.plot(iris.data[iris.target == predictions][:, 0],
iris.data[iris.target == predictions][:, 1],
'k.',
label='correct predictions')
plt.plot(iris.data[iris.target != predictions][:, 0],
iris.data[iris.target != predictions][:, 1],
'ro',
label='incorrect predictions')
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.subplot(122)
plt.plot(iris.data[iris.target == predictions][:, 2],
iris.data[iris.target == predictions][:, 3],
'k.',
label='correct predictions')
plt.plot(iris.data[iris.target != predictions][:, 2],
iris.data[iris.target != predictions][:, 3],
'ro',
label='incorrect predictions')
plt.xlabel('feature 2')
plt.ylabel('feature 3')
plt.legend(loc=(1, 0));