In [1]:
!pip install --upgrade scikit-learn
!pip install --upgrade seaborn
In [2]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
In previous weeks we looked at the steps needed to prepare different types of data for use by machine learning algorithms. This week we will consider two problems: regression and classification. In both cases we need a dataset for training a regressor or classifier. The dataset includes one or several numeric or categorical variables as input data (also called independent or explanatory variables) and a response variable that we are trying to predict from the input data.
Depending on the number of explanatory variables we distinguish simple regression (one explanatory variable) and multiple regression (several explanatory variables).
When there are multiple dependent variables, the regression is called multivariate.
In classification problems we distinguish binary classification (two possible classes) and multiclass classification (more than two classes).
In [3]:
from sklearn import datasets
diabetes = datasets.load_diabetes()
# Description at http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
# Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements
# were obtained for each of n = 442 diabetes patients,
# as well as the response of interest, a quantitative measure of disease progression one year after baseline.
X = diabetes.data # independent variables
y = diabetes.target # dependent (response) variable
print(X.shape)
print(y.shape)
print(X[0], y[0])
All the different models in scikit-learn follow a consistent structure: an estimator is created with its parameters, fitted to training data with fit(X, y), and used to make predictions with predict(X).
These are the basic components, with additional methods added when needed. For example, classifiers also have predict_proba, which returns the estimated probability of each class, and both regressors and classifiers have a score method.
In [4]:
# General format:
# reg = Regression(initialization_params)
from sklearn import linear_model
bmi = X[:, 2].reshape(-1, 1)
outcome = y
reg = linear_model.LinearRegression()
reg.fit(bmi, outcome)
predicted_outcome = reg.predict(bmi)
# The coefficients
print('Coefficients: \n', reg.coef_)
# for the first data point
print("INPUT DATA", bmi[0])
print("PREDICTED", predicted_outcome[0])
print("ACTUAL", outcome[0])
# Plot outputs
plt.scatter(bmi, outcome, color='black')
plt.plot(bmi, predicted_outcome, 'b-', linewidth=3)
plt.xlabel("BMI (normalized)")
plt.ylabel("Clinical outcome")
plt.show()
Before we consider whether we have a good model, or which model to choose, we must first decide on how we will evaluate our models.
As part of our evaluation, having a single number with which to compare models can be very useful. Choosing a metric that represents our goal as closely as possible enables many models to be compared automatically, which is important when choosing model parameters or comparing different types of algorithm.
Even if we have a metric we feel is reasonable, it can be worthwhile examining the predictions made by a model in detail. Some questions to ask: are the errors spread evenly across samples, or is the model consistently wrong for a particular subset? Are occasional large errors acceptable, or must every prediction stay close to the true value?
The metric we decide to use will depend on the type of problem we have (regression or classification) and what aspects of the prediction are most important to us. For example, we might have to decide between penalizing large errors disproportionately heavily and treating every unit of error the same.
For these two situations in a regression task we might choose mean_squared_error and mean_absolute_error respectively; the toy example below illustrates the difference.
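A toy sketch with made-up numbers (nothing from the diabetes data): the mean absolute error cannot distinguish four small errors from one large one, whereas the mean squared error penalizes the single large error heavily.
In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
y_true = np.array([10.0, 10.0, 10.0, 10.0])
y_small = np.array([11.0, 9.0, 11.0, 9.0])    # four errors of size 1
y_large = np.array([10.0, 10.0, 10.0, 14.0])  # a single error of size 4
print("MAE, four small errors:", mean_absolute_error(y_true, y_small))  # 1.0
print("MAE, one large error:  ", mean_absolute_error(y_true, y_large))  # 1.0
print("MSE, four small errors:", mean_squared_error(y_true, y_small))   # 1.0
print("MSE, one large error:  ", mean_squared_error(y_true, y_large))   # 4.0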
Scikit-learn provides lists of regression metrics and classification metrics in its model evaluation documentation (http://scikit-learn.org/stable/modules/model_evaluation.html).
We can apply the mean_squared_error metric to the linear regression model on the diabetes dataset:
In [5]:
error = predicted_outcome - outcome
# The mean squared error
print("Mean squared error: %.2f" % np.mean(error ** 2))
# The score method returns the R^2 coefficient of determination: 1 is perfect prediction
print('R^2 score: %.2f' % reg.score(bmi, outcome))
In [6]:
# use a scikit learn function
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import explained_variance_score
print(mean_squared_error(outcome, predicted_outcome))  # arguments are (y_true, y_pred)
In [7]:
import pandas as pd
data = pd.DataFrame(X, columns=['age', 'sex', 'bmi', 'map',
'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu'])
data.info()
In [8]:
pd.plotting.scatter_matrix(data, figsize=(20, 20))
plt.show()
In [9]:
mult_reg = linear_model.LinearRegression()
mult_reg.fit(X, outcome)
predicted_outcome = mult_reg.predict(X)
# The coefficients
print('Coefficients: \n', mult_reg.coef_)
# for the first data point
print("INPUT DATA", X[0,])
print("PREDICTED", predicted_outcome[0])
print("ACTUAL", outcome[0])
# Plot outputs
plt.scatter(outcome, predicted_outcome, color='black')
plt.xlabel("Clinical outcome")
plt.ylabel("Predicted clinical outcome")
plt.show()
print(r2_score(outcome, predicted_outcome))
print(explained_variance_score(outcome, predicted_outcome))
# Note that for least-squares linear regression R2 equals the explained variance score
In [10]:
# exploring dependence between BMI and outcome stratified by sex
all_data = data.copy()  # copy so the feature DataFrame is not modified in place
all_data['outcome'] = y
g = sns.lmplot(x="bmi", y="outcome", col="sex", data=all_data, height=4)
# Note that sex is a categorical variable, but it was normalized, so its values are numeric (approximately -0.05 and +0.05)
plt.show()
Scikit-learn includes a variety of different models. The most commonly used algorithms probably include the following:
We have already seen several examples of simple and multiple linear regression.
Support vector machines attempt to project samples into a higher-dimensional space in which they can be separated by a hyperplane.
Nearest neighbor methods identify a number of samples from the training set that are close to the new sample and then return the average or most common value depending on the task.
Decision trees attempt to predict the value of a new sample by learning simple rules from the training samples.
Ensembles are combinations of other models. Combining several models can improve performance by improving generalization; an average or most common value from the individual models is returned.
Boosting builds one model and then attempts to reduce its errors with the next model. At each stage the bias of the combined model is reduced; in this way many weak predictors can be combined into one much more powerful predictor.
I often begin with an ensemble or boosting approach, as they typically give very good performance without needing to be carefully optimized; many of the other algorithms are sensitive to their parameters.
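As a minimal sketch of these two approaches on the diabetes data (the models and parameter values here are illustrative defaults, not tuned choices):
In [ ]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
# An averaging ensemble: many trees fitted independently, predictions averaged
forest = RandomForestRegressor(n_estimators=100)
forest.fit(X, y)
print("Random forest R2 on training data:   ", r2_score(y, forest.predict(X)))
# Boosting: each new tree is fitted to the residual errors of the current model
boost = GradientBoostingRegressor(n_estimators=100)
boost.fit(X, y)
print("Gradient boosting R2 on training data:", r2_score(y, boost.predict(X)))
# Training-data scores like these are optimistic; held-out evaluation comes later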
In [11]:
# https://en.wikipedia.org/wiki/Support_vector_machine#Regression
from sklearn import svm
reg = svm.SVR()
reg.fit(data, y)
y_hat = reg.predict(data)
plt.plot(y, y_hat, 'k.')
plt.show()
print(r2_score(y, y_hat))
print(explained_variance_score(y, y_hat))
# Note that R2 is NOT equal to % explained variance
In [12]:
# Baseline check: fit the model repeatedly to randomly shuffled outcomes.
# The distribution of MSE values shows what performance looks like when there is no real signal.
reg = linear_model.LinearRegression()
errors = []
for i in range(1000):
    y_copy = y.copy()
    np.random.shuffle(y_copy)
    reg.fit(data, y_copy)
    y_predicted = reg.predict(data)
    mse = mean_squared_error(y_copy, y_predicted)
    errors.append(mse)
ee = pd.DataFrame(errors)
ee[0].hist(cumulative=True, density=True, bins=100)
plt.ylim(0, 1)
plt.ylabel("Probability")
plt.xlabel("MSE")
plt.show()
reg.fit(data, y)
y_predicted = reg.predict(data)
mse = mean_squared_error(y, y_predicted)
print("MSE =", mse)
When evaluating different models, the approach taken above (scoring on the training data) is not going to work. Models with high variance overfit the training data: they achieve very good performance on the data they were trained on but perform no better than chance on new data.
For example, a DecisionTreeRegressor grown to full depth or a KNeighborsRegressor with n_neighbors=1 will simply memorize a one-to-one mapping of the data it is trained on.
In [14]:
from sklearn import neighbors
from sklearn import tree
print("KNN Regressor")
reg = neighbors.KNeighborsRegressor(n_neighbors=1)
reg.fit(data, y)
y_predicted = reg.predict(data)
plt.plot(y, y_predicted, 'k.')
plt.show()
print("EVAR", explained_variance_score(y, y_predicted))
################
print("\n\nDecision Tree Regressor")
reg = tree.DecisionTreeRegressor()  # the default criterion is squared error
reg.fit(data, y)
y_predicted = reg.predict(data)
plt.plot(y, y_predicted, 'k.')
plt.show()
print("EVAR", explained_variance_score(y, y_predicted))
To understand how our model truly performs, we need to evaluate it on previously unseen samples. The general approach is to divide a dataset into training, validation, and test datasets. Each model is trained on the training dataset. Multiple models can then be compared by evaluating them against the validation dataset. There is still the potential of choosing, by chance, a model that happens to perform well on the validation dataset, so a final check is made against the test dataset.
This unfortunately means that part of our often expensively gathered data can't be used to train our model. Although it is important to keep a held-out test dataset, an alternative approach can be used for the validation dataset: rather than building just one model, we can build multiple models, each time leaving out a different validation dataset. Our validation score is then the average across the models. This is known as cross-validation.
Scikit-learn provides classes to support cross-validation but a simple solution can also be implemented directly. Below we will separate out a test dataset to evaluate the nearest neighbor model.
http://scikit-learn.org/stable/modules/cross_validation.html
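Before the manual version, here is a minimal sketch using scikit-learn's built-in cross_val_score helper, which handles the fold splitting for us (for regressors the default score is R2):
In [ ]:
try:
    from sklearn.model_selection import cross_val_score  # scikit-learn 0.18+
except ImportError:
    from sklearn.cross_validation import cross_val_score
# Five-fold cross-validation of linear regression on the diabetes data
scores = cross_val_score(linear_model.LinearRegression(), X, y, cv=5)
print("R2 per fold:", scores)
print("Mean R2:", scores.mean())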
In [15]:
try:
    # scikit-learn 0.18 and later
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split

def split_demo(clf):
    # Repeatedly split the data, train on one part, and evaluate on the held-out part
    evars = []
    for i in range(100):
        # SPLIT INTO TRAIN & VALIDATION
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)
        # TRAIN ON THE TRAINING SPLIT
        clf.fit(X_train, y_train)
        # EVALUATE PERFORMANCE ON THE HELD-OUT SPLIT
        y_predicted = clf.predict(X_val)
        evars.append(round(explained_variance_score(y_val, y_predicted), 2))
    return evars

evars_lr = split_demo(linear_model.LinearRegression())
pd.DataFrame(evars_lr).hist()
plt.xlabel("Explained variance")
plt.show()
evars_knnr = split_demo(neighbors.KNeighborsRegressor(n_neighbors=5))
pd.DataFrame(evars_knnr).hist()
plt.xlabel("Explained variance")
plt.show()
In [16]:
# Various ways to slice an array
a = np.arange(20) + 5
print(a)
print(a[[0,2,4]])
print(a[np.random.random(20) > 0.5])
mask = a < 10
print(mask)
print(a[mask])
Many of the models require several parameters to be specified. Their performance is typically heavily influenced by these parameters, and choosing good values is vital in developing the best model.
Some models have alternative implementations that handle parameter selection in an efficient way.
An example is LassoCV.
In [17]:
clf = linear_model.LassoCV(cv=5)
clf.fit(X, y)
print(clf.coef_.shape)
print('Alpha chosen was ', clf.alpha_)
plt.plot(y, clf.predict(X), 'k.')
plt.show()
print("Variance explained", explained_variance_score(y, clf.predict(X)))
There is an expanded example in the documentation.
There are also general classes to handle parameter selection for situations where dedicated classes are not available. As we will often have parameters in preprocessing steps, these general classes will be used much more often; a pipeline sketch follows the grid search example below.
In [18]:
from sklearn.model_selection import GridSearchCV
split = np.random.random(y.shape) > 0.3
X_train_test = X[split]
y_train_test = y[split]
X_validation = X[np.logical_not(split)]
y_validation = y[np.logical_not(split)]
print("TRAIN & TEST", X_train_test.shape)
print("VALIDATION", X_validation.shape)
knn = neighbors.KNeighborsRegressor(n_neighbors=1)
parameters = [{'n_neighbors': list(range(1, 21))}]
g = GridSearchCV(knn, parameters, cv=5)
# Use Train_Test dataset. Grid Search will use cross-validation to split it into Train and Test subsets
g.fit(X_train_test, y_train_test)
best = g.best_params_
cv = g.cv_results_
print('GridSearch found optimal number of neighbors:', best['n_neighbors'])
print('Mean CV test scores are:', cv['mean_test_score'])
knn = neighbors.KNeighborsRegressor(n_neighbors = best['n_neighbors'])
knn.fit(X_train_test, y_train_test)
# Use VALIDATION dataset
print('The MSE for the model is:', mean_squared_error(y_validation, knn.predict(X_validation)))
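When preprocessing steps have parameters of their own, the steps can be combined with the model in a Pipeline and searched jointly. A minimal sketch (the step names 'scale' and 'knn' and the small grids are illustrative choices, not recommendations):
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Pipeline parameters are addressed as <step_name>__<parameter_name>
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', neighbors.KNeighborsRegressor())])
param_grid = [{'scale__with_mean': [True, False],
               'knn__n_neighbors': [5, 10, 15]}]
g = GridSearchCV(pipe, param_grid, cv=5)
g.fit(X_train_test, y_train_test)
print('Best parameters found:', g.best_params_)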
In [19]:
import sklearn.datasets
digits = sklearn.datasets.load_digits()
# X - how digits are handwritten
X = digits['data']
# y - what these digits actually are
y = digits['target']
print(X.shape)
print(y.shape)
print("Total 1797 examples")
print("MSE: example for digit 0 compared to another example for digit 0", mean_squared_error(X[0], X[10]))
print("MSE: example for digit 0 compared to examle for digit 4", mean_squared_error(X[0], X[400]))
print(y[20])
plt.imshow(X[20].reshape((8, 8)), cmap=plt.cm.gray)
plt.show()
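The digits dataset sets up a classification task. As a closing sketch, the same fit/predict structure applies to classifiers, along with the extra predict_proba method mentioned earlier (the nearest-neighbor classifier is an arbitrary illustrative choice):
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
# Hold out 30% of the digits for evaluation
Xd_train, Xd_val, yd_train, yd_val = train_test_split(X, y, test_size=0.3)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(Xd_train, yd_train)
# For classifiers, score returns accuracy
print("Accuracy on held-out digits:", clf.score(Xd_val, yd_val))
# predict_proba returns one probability per class (ten classes here)
print("Class probabilities, first held-out sample:", clf.predict_proba(Xd_val[:1]))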