Model Evaluation

Pipeline and Feature Unions

It is always a good idea to make your code as readable as possible, not only so that others can pick it up and use it easily, but so that you don't end up hating your past self when you revisit something you wrote several months prior. Pipelines are a great tool for integrating and reusing a series of data transformations and fits in a single workflow. In this exercise you'll build some pipelines and feature unions using the Concrete Compressive Strength Data Set.
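
A minimal sketch of the idea before diving in (the names here are illustrative): a Pipeline chains transformers with a final estimator, so a single fit or predict call runs every step in order.

In [ ]:
# hypothetical two-step pipeline: standardize, then fit a linear model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

model = Pipeline([
    ('scl', StandardScaler()),  # step 1: transformer
    ('lr', LinearRegression())  # step 2: final estimator
])
# model.fit(X, y) scales X and fits the regression in one call;
# model.predict(Xnew) applies the same scaling before predicting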

1 - Head over to the Machine Learning Repository, download the Concrete Compressive Strength Data Set, put it into a dataframe, and split into training and test sets. Be sure to familiarize yourself with the data before proceeding.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import resample
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.feature_selection import f_regression, SelectKBest, SelectFromModel
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve, GridSearchCV,\
                                    learning_curve, StratifiedKFold, LeaveOneOut, KFold
from sklearn.metrics import mean_squared_error, confusion_matrix, precision_score, accuracy_score, recall_score,\
                            f1_score, precision_recall_curve, roc_curve, roc_auc_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression, LassoCV, LogisticRegression, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

In [17]:
concrete = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls')
concrete.columns = ['cement', 'blast_furnace_slag', 'fly_ash', 'water', 'superplasticizer',
                    'coarse_aggregate', 'fine_aggregate', 'age', 'concrete_compressive_strength']

In [3]:
concrete.head()


Out[3]:
cement blast_furnace_slag fly_ash water superplasticizer coarse_aggregate fine_aggregate age concrete_compressive_strength
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28 79.986111
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.887366
2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270 40.269535
3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365 41.052780
4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360 44.296075

In [4]:
concrete.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
cement                           1030 non-null float64
blast_furnace_slag               1030 non-null float64
fly_ash                          1030 non-null float64
water                            1030 non-null float64
superplasticizer                 1030 non-null float64
coarse_aggregate                 1030 non-null float64
fine_aggregate                   1030 non-null float64
age                              1030 non-null int64
concrete_compressive_strength    1030 non-null float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB

In [5]:
concrete.describe().T


Out[5]:
count mean std min 25% 50% 75% max
cement 1030.0 281.165631 104.507142 102.000000 192.375000 272.900000 350.000000 540.000000
blast_furnace_slag 1030.0 73.895485 86.279104 0.000000 0.000000 22.000000 142.950000 359.400000
fly_ash 1030.0 54.187136 63.996469 0.000000 0.000000 0.000000 118.270000 200.100000
water 1030.0 181.566359 21.355567 121.750000 164.900000 185.000000 192.000000 247.000000
superplasticizer 1030.0 6.203112 5.973492 0.000000 0.000000 6.350000 10.160000 32.200000
coarse_aggregate 1030.0 972.918592 77.753818 801.000000 932.000000 968.000000 1029.400000 1145.000000
fine_aggregate 1030.0 773.578883 80.175427 594.000000 730.950000 779.510000 824.000000 992.600000
age 1030.0 45.662136 63.169912 1.000000 7.000000 28.000000 56.000000 365.000000
concrete_compressive_strength 1030.0 35.817836 16.705679 2.331808 23.707115 34.442774 46.136287 82.599225

In [18]:
Xtrain, Xtest, ytrain, ytest = train_test_split(concrete.drop('concrete_compressive_strength', axis=1),
                                                concrete.concrete_compressive_strength, 
                                                test_size=0.2, random_state=42)

2 - Build a pipeline for polynomial fitting, fit polynomials of degree 1 up to the number of features, and plot your training and testing errors for each. Comment on your results.


In [19]:
# scaling+pol+linear regression
pipe = Pipeline([
    ('scl', StandardScaler()),
    ('poly', PolynomialFeatures()),
    ('lr', LinearRegression())
])

In [22]:
scores = []
# looping through degrees
for n in range(1, Xtrain.shape[1]+1):
    # set pipeline params
    pipe.set_params(poly__degree=n)
    # fit and score
    pipe.fit(Xtrain, ytrain)
    scores.append([n, pipe.score(Xtrain, ytrain), pipe.score(Xtest, ytest),
                   mean_squared_error(pipe.predict(Xtrain), ytrain), mean_squared_error(pipe.predict(Xtest), ytest)])

scores = pd.DataFrame(scores)
scores.columns = ['degree', 'train score', 'test score', 'train mse', 'test mse']

In [23]:
scores


Out[23]:
degree train score test score train mse test mse
0 1 0.610459 6.275416e-01 110.661771 9.597548e+01
1 2 0.813116 7.842685e-01 53.090518 5.558993e+01
2 3 0.931086 8.437174e-01 19.577340 4.027107e+01
3 4 0.983277 -2.352121e+01 4.750605 6.318652e+03
4 5 0.993065 -8.443128e+16 1.970058 2.175634e+19
5 6 0.997491 -4.868762e+10 0.712772 1.254588e+13
6 7 0.997542 -2.671983e+10 0.698360 6.885195e+12
7 8 0.997542 -2.194118e+10 0.698360 5.653827e+12

In [19]:
# plot on two separate axes because the scales are very different
fig, ax = plt.subplots(2, figsize=(10, 8))
ax[0].plot(scores.degree, scores['train score'], label='train score')
ax[1].plot(scores.degree, scores['test score'], label='test score');



In [25]:
# single plot for n=1,2,3
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(scores.iloc[:3, 0], scores.iloc[:3, 1], label='train score')
ax.plot(scores.iloc[:3, 0], scores.iloc[:3, 2], label='test score');


Increasing the polynomial degree increases the training score, but generalization quickly becomes very bad (with an extreme spike at the fifth degree). The best test score is for the third degree (84%).
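
To make the degree choice less dependent on this particular test set, here is a minimal sketch that cross-validates each degree on the training set only (the 5-fold CV and the degrees shown are arbitrary choices):

In [ ]:
# hedged check: mean CV score per degree, using the poly pipeline above
for n in (1, 2, 3, 4):
    pipe.set_params(poly__degree=n)
    cv_scores = cross_val_score(pipe, Xtrain, ytrain, cv=5)
    print(n, round(cv_scores.mean(), 3))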

3 - Build a pipeline that will perform feature selection on the dataset using the F-Statistic, producing the same plots as in part (2). Comment on your results.


In [26]:
# scaling, select k best and linear regression
pipe = Pipeline([
    ('scl', StandardScaler()),
    ('best', SelectKBest(f_regression)),
    ('lr', LinearRegression())
])

In [28]:
scores = []
# looping over number of features
for n in range(1, Xtrain.shape[1]+1):
    # set pipe params
    pipe.set_params(best__k=n)
    # fit and score
    pipe.fit(Xtrain, ytrain)
    scores.append([n, pipe.score(Xtrain, ytrain), pipe.score(Xtest, ytest),
                   mean_squared_error(pipe.predict(Xtrain), ytrain), mean_squared_error(pipe.predict(Xtest), ytest)])

scores = pd.DataFrame(scores)
scores.columns = ['n_features', 'train score', 'test score', 'train mse', 'test mse']

In [29]:
scores


Out[29]:
n_features train score test score train mse test mse
0 1 0.246699 0.251831 213.999800 192.788897
1 2 0.346373 0.371407 185.684137 161.976448
2 3 0.469642 0.533248 150.665682 120.273091
3 4 0.485188 0.556152 146.249264 114.371259
4 5 0.495800 0.550043 143.234484 115.945336
5 6 0.555148 0.634291 126.374789 94.236333
6 7 0.594330 0.598502 115.243705 103.458495
7 8 0.610459 0.627542 110.661771 95.975484

In [30]:
# train and test on a single plot
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(scores.n_features, scores['train score'], label='train score')
ax.plot(scores.n_features, scores['test score'], label='test score')
ax.legend();


After adding the sixth feature there is little further improvement in the results.
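
As a sanity check, a small sketch (assuming the pipeline above) to see which columns SelectKBest keeps at k=6, via its get_support() mask:

In [ ]:
# refit at k=6 and list the selected feature names
pipe.set_params(best__k=6)
pipe.fit(Xtrain, ytrain)
print(list(Xtrain.columns[pipe.named_steps['best'].get_support()]))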

4 - Build a pipeline that standardizes your data, performs feature selection via regularization, and then fits a model of your choice. Produce the same plots as above and comment on your results.


In [37]:
# scaling, select from model using lassoCV and lassoCV
pipe = Pipeline([
    ('scl', StandardScaler()),
    ('sfmod', SelectFromModel(LassoCV())),
    # ('lasso', LassoCV())
    ('forest', RandomForestRegressor()) # used in solution
])

In [38]:
scores = []
# looping through threshold values
for c in np.arange(0.1, 2.1, 0.1):
    pipe.set_params(sfmod__threshold=str(c) + '*mean')
    pipe.fit(Xtrain, ytrain)
    scores.append([c, pipe.score(Xtrain, ytrain), pipe.score(Xtest, ytest),
                   mean_squared_error(pipe.predict(Xtrain), ytrain), mean_squared_error(pipe.predict(Xtest), ytest)])
#     sel = pipe.named_steps['sfmod']
#     print(Xtrain.columns[sel.transform(np.arange(len(Xtrain.columns)).reshape(1, -1))])

scores = pd.DataFrame(scores)
scores.columns = ['threshold', 'train score', 'test score', 'train mse', 'test mse']

In [39]:
scores


Out[39]:
threshold train score test score train mse test mse
0 0.1 0.982412 0.866411 4.996542 34.423295
1 0.2 0.980343 0.858554 5.584130 36.447952
2 0.3 0.981444 0.851837 5.271501 38.178921
3 0.4 0.979916 0.847017 5.705384 39.420932
4 0.5 0.979708 0.846753 5.764719 39.488738
5 0.6 0.965539 0.822644 9.789734 45.701326
6 0.7 0.967667 0.849198 9.185124 38.858850
7 0.8 0.964817 0.835547 9.994747 42.376311
8 0.9 0.969069 0.833785 8.786909 42.830404
9 1.0 0.963735 0.765613 10.302199 60.397106
10 1.1 0.963730 0.764703 10.303707 60.631655
11 1.2 0.963066 0.747343 10.492248 65.105029
12 1.3 0.961866 0.754774 10.833086 63.190129
13 1.4 0.625286 0.300748 106.449734 180.183890
14 1.5 0.625075 0.319639 106.509672 175.316227
15 1.6 0.623982 0.329514 106.820310 172.771632
16 1.7 0.623647 0.286774 106.915466 183.784758
17 1.8 0.564522 0.249740 123.711833 193.327918
18 1.9 0.569354 0.252948 122.339139 192.501278
19 2.0 0.563951 0.266557 123.873916 188.994321

In [40]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(scores.threshold, scores['train score'], label='train score')
ax.plot(scores.threshold, scores['test score'], label='test score')
ax.legend();



In [41]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(scores.threshold, scores['train mse'], label='train mse')
ax.plot(scores.threshold, scores['test mse'], label='test mse')
ax.legend();


With LassoCV as the final estimator the test score was actually better than the training score; with the random forest we get more overfitting, but also better overall performance than all the previous methods. The best choice for the threshold seems to be around 0.5.
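
A similar hedged sketch to inspect which features survive the threshold that looked best above (0.5*mean):

In [ ]:
# refit at the chosen threshold and list the surviving features
pipe.set_params(sfmod__threshold='0.5*mean')
pipe.fit(Xtrain, ytrain)
print(list(Xtrain.columns[pipe.named_steps['sfmod'].get_support()]))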

5 - Create two pipelines for feature selection using techniques of your choice, scaling the data beforehand. Then join these two pipelines with a FeatureUnion, and fit a polynomial model, also in a pipeline. Comment on your results.


In [42]:
# joining select from model using lasso and select k best using the best parameters from before
pipe = Pipeline([
    ('scl', StandardScaler()),
    ('featsel', FeatureUnion([
        ('sfmod', SelectFromModel(LassoCV(), threshold='0.5*mean')),
        ('best', SelectKBest(k=6))
    ])),
    ('poly', PolynomialFeatures(degree=3)),
    ('lr', LinearRegression())
])

In [43]:
pipe.fit(Xtrain, ytrain)
print(pipe.score(Xtrain, ytrain))
print(pipe.score(Xtest, ytest))


0.894859371926
0.817670756555

In [45]:
print(mean_squared_error(pipe.predict(Xtrain), ytrain))
print(mean_squared_error(pipe.predict(Xtest), ytest))


29.868636857
46.9827978423

In [44]:
pipe.named_steps['poly'].n_input_features_


Out[44]:
11

By joining the pipelines, the FeatureUnion concatenates the outputs of both feature selections, so features chosen by both appear twice (which is quite strange as a model), but I get a very good test score of 81% (still worse, though, than the 84% obtained with polynomial features alone).
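
To see the duplication explicitly, a minimal sketch (assuming pipe has been fit as above) that lists what each selector inside the FeatureUnion contributed; the union simply concatenates the two outputs, which is why poly received 11 input features:

In [ ]:
# after fitting, transformer_list holds the fitted selectors
for name, sel in pipe.named_steps['featsel'].transformer_list:
    print(name, list(Xtrain.columns[sel.get_support()]))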

Evaluation Metrics

It is very important that you have more than one tool in your toolbox for evaluating the usefulness of your model, as in different contexts different metrics are preferred. For example, suppose a new medical test is developed for detecting cancer that has a 0.25 probability of incorrectly labeling a patient as having cancer when they in fact do not, but only a 0.001 probability of labeling a cancer patient as cancer free. With this sort of test, you can be sure that those who do have cancer will almost certainly be classified correctly, but a positive does not necessarily mean that the patient has cancer, so additional tests are in order. These metrics have different names, and depending on the situation you may be interested in minimizing different quantities, which is the topic we will explore in this exercise.
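
To make the example concrete, a quick Bayes calculation of the probability of actually having cancer given a positive result; the 1% prevalence is an assumption for illustration, not part of the example above:

In [ ]:
sens = 1 - 0.001  # P(positive | cancer), from the 0.001 miss probability
fpr = 0.25        # P(positive | no cancer)
prev = 0.01       # assumed prevalence in the tested population
ppv = sens * prev / (sens * prev + fpr * (1 - prev))
print('P(cancer | positive) = {:.3f}'.format(ppv))  # roughly 0.04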

1 - Head over to the Machine Learning Repository and download the Breast Cancer Diagnostic Dataset, put it in a dataframe, and split into testing and training sets.


In [46]:
breast = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                     header=None)

breast.columns = ['id','clump_thickness','uniformity_cell_size','uniformity_cell_shape','marginal_adhesion',
                'single_epithelial_cell_size','bare_nuclei','bland_chromatin','normal_nucleoli','mitoses','class']

In [4]:
breast.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
id                             699 non-null int64
clump_thickness                699 non-null int64
uniformity_cell_size           699 non-null int64
uniformity_cell_shape          699 non-null int64
marginal_adhesion              699 non-null int64
single_epithelial_cell_size    699 non-null int64
bare_nuclei                    699 non-null object
bland_chromatin                699 non-null int64
normal_nucleoli                699 non-null int64
mitoses                        699 non-null int64
class                          699 non-null int64
dtypes: int64(10), object(1)
memory usage: 60.1+ KB

Bare nuclei is an object column: this is due to the presence of ? for missing values, which I'm going to replace with 1s (the most frequent value).


In [5]:
breast.bare_nuclei.value_counts()


Out[5]:
1     402
10    132
5      30
2      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: bare_nuclei, dtype: int64

In [47]:
breast.bare_nuclei = breast.bare_nuclei.apply(lambda x: x.replace('?', '1'))
breast.bare_nuclei = pd.to_numeric(breast.bare_nuclei)

Also, I'm encoding the class column with 0s and 1s:


In [48]:
# 2 for benign, 4 for malignant
le = LabelEncoder()
le.fit([2, 4])
breast['class'] = le.transform(breast['class'])

In [21]:
breast.describe().T


Out[21]:
count mean std min 25% 50% 75% max
id 699.0 1.071704e+06 617095.729819 61634.0 870688.5 1171710.0 1238298.0 13454352.0
clump_thickness 699.0 4.417740e+00 2.815741 1.0 2.0 4.0 6.0 10.0
uniformity_cell_size 699.0 3.134478e+00 3.051459 1.0 1.0 1.0 5.0 10.0
uniformity_cell_shape 699.0 3.207439e+00 2.971913 1.0 1.0 1.0 5.0 10.0
marginal_adhesion 699.0 2.806867e+00 2.855379 1.0 1.0 1.0 4.0 10.0
single_epithelial_cell_size 699.0 3.216023e+00 2.214300 1.0 2.0 2.0 4.0 10.0
bare_nuclei 699.0 3.486409e+00 3.621929 1.0 1.0 1.0 5.0 10.0
bland_chromatin 699.0 3.437768e+00 2.438364 1.0 2.0 3.0 5.0 10.0
normal_nucleoli 699.0 2.866953e+00 3.053634 1.0 1.0 1.0 4.0 10.0
mitoses 699.0 1.589413e+00 1.715078 1.0 1.0 1.0 1.0 10.0
class 699.0 3.447783e-01 0.475636 0.0 0.0 0.0 1.0 1.0

In [49]:
Xtrain, Xtest, ytrain, ytest = train_test_split(breast.drop('class', axis=1), breast['class'], test_size=0.2, random_state=0)

2 - Using a classification algorithm of your choice, fit a model to the data and predict the results, making your results as accurate as possible.


In [50]:
# scale and KNN
pipe = Pipeline([
    ('scl', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

In [51]:
scores = []
# looping through number of neighbors
for k in range(1, 20):
    pipe.set_params(knn__n_neighbors=k)
    temp = cross_val_score(estimator=pipe, X=Xtrain, y=ytrain, cv=10)
    scores.append([k, np.mean(temp), np.std(temp)])

scores = pd.DataFrame(scores)
scores.columns = ['k', 'kfold mean', 'kfold std']

In [52]:
scores.sort_values(by='kfold mean', ascending=False)


Out[52]:
k kfold mean kfold std
6 7 0.964564 0.032230
16 17 0.962746 0.037112
14 15 0.962746 0.037112
7 8 0.962715 0.031764
18 19 0.960960 0.036562
12 13 0.960960 0.036562
15 16 0.960928 0.037468
4 5 0.960928 0.030001
5 6 0.960928 0.031083
11 12 0.960897 0.031133
9 10 0.960897 0.031133
10 11 0.960897 0.030052
13 14 0.959142 0.036834
17 18 0.959142 0.036834
8 9 0.955570 0.034248
3 4 0.955506 0.029510
2 3 0.955474 0.023740
0 1 0.946449 0.022355
1 2 0.932159 0.030242

Seven is the number of neighbors with the best CV score:


In [53]:
pipe.set_params(knn__n_neighbors=7)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)


Out[53]:
0.97857142857142854

The accuracy is almost 98%, which is quite good!

3 - Using your model in part (2), compute the following quantities, without using sklearn.metrics.

- True Positives
- True Negatives
- False Positives
- False Negatives

In [27]:
ypred = pipe.predict(Xtest)

In [28]:
TP = sum((ypred[ytest==1]==ytest.values[ytest.values==1]))
TN = sum((ypred[ytest==0]==ytest.values[ytest.values==0]))
FP = sum((ypred[ytest==0]!=ytest.values[ytest.values==0]))
FN = sum((ypred[ytest==1]!=ytest.values[ytest.values==1]))

In [29]:
print('True Positives: {}\nTrue Negatives: {}\nFalse Positives: {}\nFalse Negatives: {}'.format(TP, TN, FP, FN))


True Positives: 54
True Negatives: 83
False Positives: 2
False Negatives: 1

4 - Using your results in part (3), compute the following quantities.

- Sensitivity, recall, hit rate, or true positive rate (TPR)
- Specificity or true negative rate (TNR)
- Miss rate or false negative rate (FNR)
- Fall-out or false positive rate (FPR)
- Precision
- F1
- Accuracy

In [30]:
TPR = TP / (FN+TP)
FPR = FP / (FP+TN)
TNR = 1 - FPR
FNR = 1 - TPR
PRE = TP / (TP+FP)
F1 = 2*PRE*TPR / (PRE+TPR)
ACC = (TP+TN) / len(ypred)

In [31]:
print('TPR: {}\nTNR: {}\nFNR: {}\nFPR: {}\nPRE: {}\nF1: {}\nACC: {}'.format(TPR, TNR, FNR, FPR, PRE, F1, ACC))


TPR: 0.9818181818181818
TNR: 0.9764705882352941
FNR: 0.018181818181818188
FPR: 0.023529411764705882
PRE: 0.9642857142857143
F1: 0.972972972972973
ACC: 0.9785714285714285

5 - Check your results in part (4) using sklearn.


In [32]:
print(confusion_matrix(y_true=ytest, y_pred=ypred))
print('TPR', recall_score(ytest, ypred))
print('PRE', precision_score(y_true=ytest, y_pred=ypred))
print('F1', f1_score(y_true=ytest, y_pred=ypred))
print('ACC', accuracy_score(y_true=ytest, y_pred=ypred))


[[83  2]
 [ 1 54]]
TPR 0.981818181818
PRE 0.964285714286
F1 0.972972972973
ACC 0.978571428571

6 - Plot the precision and recall curve for your fit.


In [34]:
ypredprob = pipe.predict_proba(Xtest)

In [44]:
# get precision and recall values
precision, recall, _ = precision_recall_curve(ytest, ypredprob[:, 1])
# plot them and fill the curve
plt.figure(figsize=(16, 8))
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve');


7 - Plot the ROC curve for your fit.


In [49]:
# getting FPR and TPR for the ROC curve and the area under it
fpr, tpr, _ = roc_curve(ytest, ypredprob[:, 1])
roc_auc = roc_auc_score(ytest, ypredprob[:, 1])
# plot and fill the curve
plt.figure(figsize=(16, 8))
plt.step(fpr, tpr, color='b', alpha=0.2, where='post')
plt.fill_between(fpr, tpr, step='post', alpha=0.2, color='b')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('ROC curve, area = {:.5f}'.format(roc_auc));


Learning and Evaluation Curves, Hyperparameter Tuning, and Bootstrapping

A problem that you will see crop up time and time again in Data Science is overfitting. Much like people can sometimes see structure where there is none, machine learning algorithms can suffer from the same affliction. If you have overfit your model to your data, it has learned a "pattern" in the noise rather than the signal you were looking for, and thus will not generalize well to data it has not seen.

Consider the LASSO Regression model which we have used previously. As with all parametric models, fitting a LASSO Regression model reduces to the problem of finding the set of $\hat{\theta_i}$ which minimize the cost function given the data. But notice that unlike a standard linear regression model, LASSO Regression has an additional regularization parameter, so the $\hat{\theta_i}$ depend not only on our data, but also on this additional hyperparameter.
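
For reference, the cost function being minimized (in the form used by sklearn's Lasso) is

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - x_i^T \theta\right)^2 + \alpha \sum_{j} |\theta_j|,$$

where $n$ is the number of samples and $\alpha \geq 0$ is the regularization strength; larger values of $\alpha$ push more of the $\hat{\theta_i}$ exactly to zero.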

So now we have three different problems to juggle while fitting our models: overfitting, underfitting, and hyperparameter tuning. A common technique for dealing with all three is cross-validation, which we will explore in this exercise.

Also, you may find yourself in a situation where you do not have enough data to build a good model. Again, with some simple assumptions, we can "pull ourselves up by our bootstraps", and get by reasonably well with the data we have using a method called bootstrapping.
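
A minimal bootstrap sketch on synthetic data (all numbers are illustrative): resample the small dataset with replacement many times, and use the spread of the statistic across resamples as an estimate of its uncertainty.

In [ ]:
rng = np.random.RandomState(0)
small = rng.normal(loc=10, scale=2, size=30)  # stand-in for a small sample
# one bootstrap mean per same-size resample of the data
boot_means = [resample(small, random_state=b).mean() for b in range(1000)]
print(np.mean(boot_means))  # bootstrap estimate of the mean
print(np.std(boot_means))   # bootstrap estimate of its standard error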

In this exercise, you'll explore these topics using the wine dataset.

1 - Head over to the Machine Learning Repository, download the Wine Dataset, and put it in a dataframe, being sure to label the columns properly.


In [54]:
wine = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
# wine = pd.read_csv('wine.data', header=None)
wine.columns = ['class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_ash', 'magnesium',
                'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity',
                'hue', 'OD280_OD315', 'proline']

In [4]:
wine.head()


Out[4]:
class alcohol malic_acid ash alcalinity_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue OD280_OD315 proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

In [5]:
wine.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
class                   178 non-null int64
alcohol                 178 non-null float64
malic_acid              178 non-null float64
ash                     178 non-null float64
alcalinity_ash          178 non-null float64
magnesium               178 non-null int64
total_phenols           178 non-null float64
flavanoids              178 non-null float64
nonflavanoid_phenols    178 non-null float64
proanthocyanins         178 non-null float64
color_intensity         178 non-null float64
hue                     178 non-null float64
OD280_OD315             178 non-null float64
proline                 178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB

In [6]:
wine.describe().T


Out[6]:
count mean std min 25% 50% 75% max
class 178.0 1.938202 0.775035 1.00 1.0000 2.000 3.0000 3.00
alcohol 178.0 13.000618 0.811827 11.03 12.3625 13.050 13.6775 14.83
malic_acid 178.0 2.336348 1.117146 0.74 1.6025 1.865 3.0825 5.80
ash 178.0 2.366517 0.274344 1.36 2.2100 2.360 2.5575 3.23
alcalinity_ash 178.0 19.494944 3.339564 10.60 17.2000 19.500 21.5000 30.00
magnesium 178.0 99.741573 14.282484 70.00 88.0000 98.000 107.0000 162.00
total_phenols 178.0 2.295112 0.625851 0.98 1.7425 2.355 2.8000 3.88
flavanoids 178.0 2.029270 0.998859 0.34 1.2050 2.135 2.8750 5.08
nonflavanoid_phenols 178.0 0.361854 0.124453 0.13 0.2700 0.340 0.4375 0.66
proanthocyanins 178.0 1.590899 0.572359 0.41 1.2500 1.555 1.9500 3.58
color_intensity 178.0 5.058090 2.318286 1.28 3.2200 4.690 6.2000 13.00
hue 178.0 0.957449 0.228572 0.48 0.7825 0.965 1.1200 1.71
OD280_OD315 178.0 2.611685 0.709990 1.27 1.9375 2.780 3.1700 4.00
proline 178.0 746.893258 314.907474 278.00 500.5000 673.500 985.0000 1680.00

In [47]:
# Xtrain, Xtest, ytrain, ytest = train_test_split(wine.drop('class', axis=1),
#                                                 wine['class'],
#                                                 test_size=0.2,
#                                                 random_state=5,
#                                                 stratify=wine['class'])

2 - Separate your data into train and test sets, with portions ranging from test_size = 0.99 to test_size = 0.01, fitting a logistic regression model to each and computing the training and test errors. Plot the errors of the training and test sets on the same plot. Do this without using sklearn.model_selection.learning_curve. Comment on your results.


In [55]:
# scaling and logistic regression
pipe = Pipeline([
        ('scl', StandardScaler()),
        ('lr', LogisticRegression())
    ])

scores = []
# looping through test sizes from 0.01 to 0.98 (0.99 gave an error because not all the classes were present in the split)
# for tsize in np.arange(0.98, 0.01, -0.01):
for tsize in np.arange(0.01, 0.99, 0.01):
    # splitting
    Xtr, Xts, ytr, yts = train_test_split(wine.drop('class', axis=1),
                                            wine['class'],
                                            test_size=tsize,
                                            random_state=1,
                                            # stratify=wine['class'] # not used in solutions
                                         )
    # fit and score
    pipe.fit(Xtr, ytr)
    scores.append([tsize, pipe.score(Xtr, ytr), pipe.score(Xts, yts)])

scores = np.array(scores)

In [56]:
# plot train and test scores by increasing train size (1-test size)
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(1 - scores[:, 0], scores[:, 1], label='Train')
ax.plot(1 - scores[:, 0], scores[:, 2], label='Test')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.05])
ax.set_xlabel('Train size (%)')
ax.set_ylabel('Accuracy')
ax.set_title('Learning Curves')
ax.legend();


As expected, the test accuracy goes up as the train size increases; strangely, we always get 100% accuracy on the training set.

In the solution, scaling is not used and the results are worse; let's try it out:


In [57]:
lr = LogisticRegression()

scores = []
for tsize in np.arange(0.01, 0.99, 0.01):
    # splitting
    Xtr, Xts, ytr, yts = train_test_split(wine.drop('class', axis=1),
                                            wine['class'],
                                            test_size=tsize,
                                            random_state=1,
                                         )
    # fit and score
    lr.fit(Xtr, ytr)
    scores.append([tsize, lr.score(Xtr, ytr), lr.score(Xts, yts)])

scores = np.array(scores)

In [58]:
# plot train and test scores by increasing train size (1-test size)
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(1 - scores[:, 0], scores[:, 1], label='Train')
ax.plot(1 - scores[:, 0], scores[:, 2], label='Test')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.05])
ax.set_xlabel('Train size (%)')
ax.set_ylabel('Accuracy')
ax.set_title('Learning Curves')
ax.legend();


The results are still a bit different from those in the solution...

3 - Repeat part (2) but this time using sklearn.model_selection.learning_curve. Comment on your results.


In [77]:
train_sizes, train_scores, test_scores = learning_curve(pipe,
                                                        wine.drop('class', axis=1),
                                                        wine['class'],
                                                        train_sizes=np.arange(0.35, 0.99, 0.01),
                                                        cv=10,
                                                        random_state=5)

In [130]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(train_sizes, train_scores.mean(axis=1), label='Train')
ax.plot(train_sizes, test_scores.mean(axis=1), label='Test')
ax.set_ylim([0.4, 1.05])
ax.set_xlabel('Train size')
ax.set_ylabel('Accuracy')
ax.set_title('Learning Curves')
ax.legend();


Using learning_curve, the accuracy on the training set is always perfect, while the test accuracy increases without reaching 100%: this may be because learning_curve uses cross-validation, so we are less dependent on a single split.

4 - Use K-Fold cross validation to tune the regularization strength parameter of the logistic regression. Do this for values of k ranging from 2 to 5, make relevant plots, and comment on your results. Do this without using sklearn.model_selection.validation_curve.


In [146]:
# using grid search which uses k-fold cv by default
param_range = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]
param_grid = [{'lr__C': param_range}]
gs = GridSearchCV(estimator=pipe, 
                  param_grid=param_grid)

scores = {}
best_scores = {}
# looping through number of folds
for k in [2, 3, 4, 5]:
    # set params
    gs.set_params(cv=k)
    # fit
    gs.fit(wine.drop('class', axis=1), wine['class'])
    # saving best scores and mean scores for all params
    best_scores[k] = [gs.best_params_['lr__C'], gs.best_score_]
    scores[k] = np.array([param_range, gs.cv_results_['mean_train_score'], gs.cv_results_['mean_test_score']]).T

In [155]:
best_scores


Out[155]:
{2: [1000, 0.9831460674157303],
 3: [30, 0.97752808988764039],
 4: [0.1, 0.9719101123595506],
 5: [1, 0.9831460674157303]}

In [148]:
# creating 4 plots, one for each number of fold
fig, ax = plt.subplots(2, 2, figsize=(16, 8), sharex='col', sharey='row', subplot_kw=dict(xscale='log'))
i, j = 0, 0

for k in scores:
    # plotting train and test scores and the best score
    ax[i, j].plot(scores[k][:, 0], scores[k][:, 1], label='Train')
    ax[i, j].plot(scores[k][:, 0], scores[k][:, 2], label='Test')
    ax[i, j].plot(best_scores[k][0], best_scores[k][1], 'ro', label='Best Score')
    ax[i, j].set_title('{}-fold CV'.format(k))
    if j == 0:
        ax[i, j].set_ylabel('Accuracy')
    if i == 1:
        ax[i, j].set_xlabel('C value')
    ax[i, j].legend()
    j += 1
    if j == 2:
        j = 0
        i += 1


Training scores are very similar in each of the plots; there are some differences between the test accuracy curves, where the best value for C seems to move towards 1 as the number of folds increases.


In [63]:
# solution:
X = wine.drop('class', axis=1)
y = wine['class']

k_vals = np.arange(2, 6, 1)
c_vals = np.arange(0.5, 1.5, 0.05)
err = []
for k in k_vals:
    for c in c_vals:
        lm = LogisticRegression(C=c)
        temp_err = cross_val_score(lm, X=X, y=y, cv=k)
        
        err_dict = {}
        err_dict['k'] = k
        err_dict['c'] = c
        err_dict['acc'] = temp_err.mean()
        
        err.append(err_dict)
    
err = pd.DataFrame(err)

In [64]:
for k in err.k.unique():
    plt.plot(c_vals, err.acc[err.k == k], label='k = {}'.format(k))
plt.title('Accuracy vs Regularization Strength')
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.legend();


5 - Tune the regularization strength parameter as in part (4), but this time using sklearn.model_selection.validation_curve. Comment on your results.


In [142]:
scores = {}
best_scores = {}

for k in [2, 3, 4, 5]:
    # validation_curve performs CV and takes parameters similarly to grid search
    train_scores, test_scores = validation_curve(estimator=pipe,
                                                X=wine.drop('class', axis=1),
                                                y=wine['class'],
                                                param_name='lr__C',
                                                param_range=param_range,
                                                cv=k)
    # saving mean scores
    mean_train_scores = train_scores.mean(axis=1)
    mean_test_scores = test_scores.mean(axis=1)
    # and best scores
    best_scores[k] = [param_range[mean_test_scores.argmax()], mean_test_scores[mean_test_scores.argmax()]]
    scores[k] = np.array([param_range, mean_train_scores, mean_test_scores]).T

In [154]:
best_scores


Out[154]:
{2: [1000, 0.9831460674157303],
 3: [30, 0.97752808988764039],
 4: [0.1, 0.9719101123595506],
 5: [1, 0.9831460674157303]}

In [143]:
fig, ax = plt.subplots(2, 2, figsize=(16, 8), sharex='col', sharey='row', subplot_kw=dict(xscale='log'))
i, j = 0, 0

for k in scores:
    ax[i, j].plot(scores[k][:, 0], scores[k][:, 1], label='Train')
    ax[i, j].plot(scores[k][:, 0], scores[k][:, 2], label='Test')
    ax[i, j].plot(best_scores[k][0], best_scores[k][1], 'ro', label='Best Score')
    ax[i, j].set_title('{}-fold CV'.format(k))
    if j == 0:
        ax[i, j].set_ylabel('Accuracy')
    if i == 1:
        ax[i, j].set_xlabel('C value')
    ax[i, j].legend()
    j += 1
    if j == 2:
        j = 0
        i += 1


The plots are the same as above.

6 - Fit another classification algorithm to the data, tuning the parameter using LOOCV. Comment on your results.


In [153]:
pipe = Pipeline([
    ('scl', StandardScaler()),
    ('svm', SVC())
])
# using grid search with LeaveOneOut as cv
param_range = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]
param_grid = [{'svm__C': param_range}]

# in the solution, cross_val_score(pipe, X, y, cv=LeaveOneOut()) is used, looping over parameter values
gs = GridSearchCV(estimator=pipe, 
                  param_grid=param_grid,
                  cv=LeaveOneOut())

gs.fit(wine.drop('class', axis=1), wine['class'])
gs.best_params_['svm__C'], gs.best_score_


Out[153]:
(0.3, 0.9887640449438202)

Using LOOCV I get 0.3 as the best choice for C and an accuracy of almost 99%, which is slightly higher than before.
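
For reference, a hedged sketch of the approach the solution comment above describes: scoring one C value at a time with cross_val_score and LeaveOneOut (the three C values are illustrative; LOOCV fits one model per sample, so this is slow):

In [ ]:
for c in (0.1, 0.3, 1):
    pipe.set_params(svm__C=c)
    acc = cross_val_score(pipe, wine.drop('class', axis=1), wine['class'],
                          cv=LeaveOneOut()).mean()
    print(c, acc)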

7 - Suppose that the wine data we received was incomplete, containing, say, only 20% of the full set, but due to a fast-approaching deadline we still need to compute some statistics and fit a model. Use bootstrapping to compute the mean and variance of the features, and fit the classification model you used in part (4), comparing and commenting on your results with those from the full dataset.


In [6]:
# resampling the data to 5 times its original size
wine_strapped = resample(wine, n_samples=len(wine)*5, random_state=10)

In [67]:
# solution: leaves 80% of the data out and bootstraps the remaining 20%:

data_sub_test, data_sub = train_test_split(wine, test_size=0.2, random_state=0)
data_boot = data_sub.sample(n=150, replace=True, random_state=0)

In [69]:
# still from solution:
means = {'original': wine.drop('class', axis=1).mean(),
        '20%': data_sub.drop('class', axis=1).mean(),
        'bootstrap': data_boot.drop('class', axis=1).mean()}

var = {'original': wine.drop('class', axis=1).var(),
        '20%': data_sub.drop('class', axis=1).var(),
        'bootstrap': data_boot.drop('class', axis=1).var()}

means = pd.DataFrame(means)
var = pd.DataFrame(var)

In [70]:
means


Out[70]:
20% bootstrap original
alcohol 13.062778 13.120800 13.000618
malic_acid 2.193333 2.079067 2.336348
ash 2.365000 2.335200 2.366517
alcalinity_ash 19.258333 18.923333 19.494944
magnesium 98.472222 99.346667 99.741573
total_phenols 2.438889 2.473600 2.295112
flavanoids 2.345278 2.393733 2.029270
nonflavanoid_phenols 0.355278 0.335400 0.361854
proanthocyanins 1.531111 1.590000 1.590899
color_intensity 5.108889 5.156467 5.058090
hue 0.973611 0.980067 0.957449
OD280_OD315 2.711944 2.740733 2.611685
proline 764.888889 794.380000 746.893258

In [72]:
var


Out[72]:
20% bootstrap original
alcohol 0.703249 0.638096 0.659062
malic_acid 1.273366 1.259058 1.248015
ash 0.087626 0.077578 0.075265
alcalinity_ash 8.585357 8.545962 11.152686
magnesium 211.570635 211.744787 203.989335
total_phenols 0.446822 0.454517 0.391690
flavanoids 1.079951 1.063656 0.997719
nonflavanoid_phenols 0.012568 0.013073 0.015489
proanthocyanins 0.243930 0.275603 0.327595
color_intensity 4.348679 3.965832 5.374449
hue 0.050915 0.046402 0.052245
OD280_OD315 0.432508 0.444665 0.504086
proline 129304.844444 114052.492215 99166.717355

In [79]:
# last from solution:

pipe = Pipeline([
        ('scl', StandardScaler()),
        ('lr', LogisticRegression())
    ])

pipe.set_params(lr__C=1)
pipe.fit(wine.drop('class', axis=1), wine['class'])
print(cross_val_score(pipe, wine.drop('class', axis=1), wine['class'], cv=5).mean())

pipe.fit(data_boot.drop('class', axis=1), data_boot['class'])
print(cross_val_score(pipe, data_sub_test.drop('class', axis=1), data_sub_test['class'], cv=5).mean())


0.983483483483
0.951231527094

In [7]:
wine.describe().T


Out[7]:
count mean std min 25% 50% 75% max
class 178.0 1.938202 0.775035 1.00 1.0000 2.000 3.0000 3.00
alcohol 178.0 13.000618 0.811827 11.03 12.3625 13.050 13.6775 14.83
malic_acid 178.0 2.336348 1.117146 0.74 1.6025 1.865 3.0825 5.80
ash 178.0 2.366517 0.274344 1.36 2.2100 2.360 2.5575 3.23
alcalinity_ash 178.0 19.494944 3.339564 10.60 17.2000 19.500 21.5000 30.00
magnesium 178.0 99.741573 14.282484 70.00 88.0000 98.000 107.0000 162.00
total_phenols 178.0 2.295112 0.625851 0.98 1.7425 2.355 2.8000 3.88
flavanoids 178.0 2.029270 0.998859 0.34 1.2050 2.135 2.8750 5.08
nonflavanoid_phenols 178.0 0.361854 0.124453 0.13 0.2700 0.340 0.4375 0.66
proanthocyanins 178.0 1.590899 0.572359 0.41 1.2500 1.555 1.9500 3.58
color_intensity 178.0 5.058090 2.318286 1.28 3.2200 4.690 6.2000 13.00
hue 178.0 0.957449 0.228572 0.48 0.7825 0.965 1.1200 1.71
OD280_OD315 178.0 2.611685 0.709990 1.27 1.9375 2.780 3.1700 4.00
proline 178.0 746.893258 314.907474 278.00 500.5000 673.500 985.0000 1680.00

In [8]:
wine_strapped.describe().T


Out[8]:
count mean std min 25% 50% 75% max
class 890.0 1.926966 0.779557 1.00 1.0000 2.00 3.00 3.00
alcohol 890.0 13.012371 0.817597 11.03 12.3700 13.05 13.71 14.83
malic_acid 890.0 2.305820 1.080373 0.74 1.5900 1.83 3.10 5.80
ash 890.0 2.363079 0.275447 1.36 2.2100 2.36 2.56 3.23
alcalinity_ash 890.0 19.367978 3.310979 10.60 17.0000 19.00 21.50 30.00
magnesium 890.0 99.713483 14.150085 70.00 88.0000 98.00 108.00 162.00
total_phenols 890.0 2.331596 0.634766 0.98 1.7825 2.42 2.85 3.88
flavanoids 890.0 2.064978 1.014961 0.34 1.1000 2.17 2.92 5.08
nonflavanoid_phenols 890.0 0.356202 0.126038 0.13 0.2600 0.32 0.43 0.66
proanthocyanins 890.0 1.616697 0.577095 0.41 1.2500 1.56 1.97 3.58
color_intensity 890.0 5.180742 2.372478 1.28 3.2550 4.92 6.38 13.00
hue 890.0 0.961290 0.230392 0.48 0.7800 0.96 1.12 1.71
OD280_OD315 890.0 2.612213 0.697673 1.27 1.9600 2.78 3.17 4.00
proline 890.0 760.420225 319.671029 278.00 510.0000 675.00 1020.00 1680.00

In [9]:
pipe = Pipeline([
        ('scl', StandardScaler()),
        ('lr', LogisticRegression())
    ])
# using grid search to fit the best logistic regression
param_range = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]
param_grid = [{'lr__C': param_range}]
gs = GridSearchCV(estimator=pipe, 
                  param_grid=param_grid)

scores = {}
best_scores = {}
# looping through number of folds
for k in [2, 3, 4, 5]:
    gs.set_params(cv=k)
    gs.fit(wine_strapped.drop('class', axis=1), wine_strapped['class'])
    best_scores[k] = [gs.best_params_['lr__C'], gs.best_score_]
    scores[k] = np.array([param_range, gs.cv_results_['mean_train_score'], gs.cv_results_['mean_test_score']]).T

In [10]:
best_scores


Out[10]:
{2: [1, 1.0], 3: [1, 1.0], 4: [1, 1.0], 5: [1, 1.0]}

In all the cases I get 100% accuracy! This is expected rather than impressive: resampling with replacement duplicates rows, so identical samples end up in both the training and validation folds during cross-validation, inflating the score.
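
A quick sketch of why this happens (assuming wine_strapped as built above): counting how many of the resampled rows are exact duplicates that can leak across CV folds.

In [ ]:
# with 5x resampling most rows appear more than once, so training and
# validation folds share identical samples and CV accuracy is inflated
n_dup = wine_strapped.duplicated().sum()
print('{:.0%} of the bootstrapped rows are duplicates'.format(n_dup / len(wine_strapped)))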

Model Selection

All that has been covered in the previous sections is part of the much broader topic known as model selection. Model selection is not only about choosing the right machine learning algorithm to use, but also about tuning parameters while keeping overfitting and scalability issues in mind. In this exercise, we'll build models for a couple of different datasets, using all of the concepts you've worked with previously.

Classification

1 - Head over to the Machine Learning Repository, download the Mushroom Data Set, and put it into a dataframe. Be sure to familiarize yourself with the data before proceeding. Break the data into training and testing sets. Be sure to keep this testing set the same for the duration of this exercise, as we will be using it to test various algorithms!


In [80]:
shrooms = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', header=None)
shrooms.columns = ['poisonous', 'cap_shape', 'cap_surface', 'cap_color', 'bruises', 'odor', 'gill_attachment', 'gill_spacing',
                   'gill_size', 'gill_color', 'stalk_shape', 'stalk_root', 'stalk_surface_above_ring',
                   'stalk_surface_below_ring', 'stalk_color_above_ring', 'stalk_color_below_ring', 'veil_type',
                   'veil_color', 'ring_number', 'ring_type', 'spore_print_color', 'population', 'habitat']

In [49]:
shrooms.head()


Out[49]:
poisonous cap_shape cap_surface cap_color bruises odor gill_attachment gill_spacing gill_size gill_color ... stalk_surface_below_ring stalk_color_above_ring stalk_color_below_ring veil_type veil_color ring_number ring_type spore_print_color population habitat
0 p x s n t p f c n k ... s w w p w o p k s u
1 e x s y t a f c b k ... s w w p w o p n n g
2 e b s w t l f c b n ... s w w p w o p n n m
3 p x y w t p f c n n ... s w w p w o p k s u
4 e x s g f n f w b k ... s w w p w o e n a g

5 rows × 23 columns


In [50]:
shrooms.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
poisonous                   8124 non-null object
cap_shape                   8124 non-null object
cap_surface                 8124 non-null object
cap_color                   8124 non-null object
bruises                     8124 non-null object
odor                        8124 non-null object
gill_attachment             8124 non-null object
gill_spacing                8124 non-null object
gill_size                   8124 non-null object
gill_color                  8124 non-null object
stalk_shape                 8124 non-null object
stalk_root                  8124 non-null object
stalk_surface_above_ring    8124 non-null object
stalk_surface_below_ring    8124 non-null object
stalk_color_above_ring      8124 non-null object
stalk_color_below_ring      8124 non-null object
veil_type                   8124 non-null object
veil_color                  8124 non-null object
ring_number                 8124 non-null object
ring_type                   8124 non-null object
spore_print_color           8124 non-null object
population                  8124 non-null object
habitat                     8124 non-null object
dtypes: object(23)
memory usage: 1.4+ MB

In [51]:
shrooms.describe().T


Out[51]:
count unique top freq
poisonous 8124 2 e 4208
cap_shape 8124 6 x 3656
cap_surface 8124 4 y 3244
cap_color 8124 10 n 2284
bruises 8124 2 f 4748
odor 8124 9 n 3528
gill_attachment 8124 2 f 7914
gill_spacing 8124 2 c 6812
gill_size 8124 2 b 5612
gill_color 8124 12 b 1728
stalk_shape 8124 2 t 4608
stalk_root 8124 5 b 3776
stalk_surface_above_ring 8124 4 s 5176
stalk_surface_below_ring 8124 4 s 4936
stalk_color_above_ring 8124 9 w 4464
stalk_color_below_ring 8124 9 w 4384
veil_type 8124 1 p 8124
veil_color 8124 4 w 7924
ring_number 8124 3 o 7488
ring_type 8124 5 p 3968
spore_print_color 8124 9 w 2388
population 8124 6 v 4040
habitat 8124 7 d 3148

We have 2480 missing values for stalk_root (denoted by ?), which I'm going to keep as they are, since ? can simply be treated as another category.

I'm dropping veil_type instead because it has only one value.


In [81]:
shrooms.drop('veil_type', axis=1, inplace=True)

In [82]:
Xtrain, Xtest, ytrain, ytest = train_test_split(shrooms.drop('poisonous', axis=1),
                                                shrooms.poisonous,
                                                test_size=0.2,
                                                random_state=42)

Also, I'm relabeling the poisonous mushrooms as 1 and the non-poisonous ones as 0.


In [83]:
ytrain = ytrain.apply(lambda x: 1 if x == 'p' else 0)
ytest = ytest.apply(lambda x: 1 if x == 'p' else 0)

2 - Fit a machine learning algorithm of your choice to the data, tuning hyperparameters using the Better Holdout Method. Generate training and validation plots and comment on your results.


In [ ]:
# in the solution, all variables are encoded at once using df.apply(le.fit_transform)

This is a class to perform label encoding on multiple columns:


In [84]:
class MultiLabelEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
    
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.iteritems():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

#     def fit_transform(self, X, y=None):
#         return self.fit(X, y).transform(X)
    
#     def get_params(self):
#         return 'columns'
    
#     def set_params(self, param, value):
#         self.param = value
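
A quick hedged usage check of the transformer above, using the mushroom training set from before:

In [ ]:
enc = MultiLabelEncoder()
encoded = enc.fit_transform(Xtrain)  # every object column becomes integer codes
print(encoded.dtypes.unique())       # expect only integer dtypes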

In [85]:
# encode, one hot encode and then KNN
pipe = Pipeline([
    ('mle', MultiLabelEncoder()),
    ('ohe', OneHotEncoder()),
    ('knn', KNeighborsClassifier())
])

In [57]:
# better holdout
Xtr, Xcv, ytr, ycv = train_test_split(Xtrain, ytrain, test_size=0.2, random_state=42)
scores = []
# looping through various numbers of neighbors
for k in range(1, 21):
    pipe.set_params(knn__n_neighbors=k)
    pipe.fit(Xtr, ytr)
    tr_score = pipe.score(Xtr, ytr)
    cv_score = pipe.score(Xcv, ycv)
    scores.append([k, tr_score, cv_score])

scores = np.array(scores)

In [58]:
scores


Out[58]:
array([[  1.        ,   1.        ,   1.        ],
       [  2.        ,   1.        ,   1.        ],
       [  3.        ,   1.        ,   1.        ],
       [  4.        ,   1.        ,   1.        ],
       [  5.        ,   1.        ,   1.        ],
       [  6.        ,   1.        ,   1.        ],
       [  7.        ,   1.        ,   1.        ],
       [  8.        ,   0.99980766,   0.99846154],
       [  9.        ,   1.        ,   0.99846154],
       [ 10.        ,   0.99884593,   0.99692308],
       [ 11.        ,   0.99903828,   0.99692308],
       [ 12.        ,   0.99884593,   0.99615385],
       [ 13.        ,   0.99903828,   0.99615385],
       [ 14.        ,   0.99846124,   0.99615385],
       [ 15.        ,   0.99884593,   0.99692308],
       [ 16.        ,   0.99846124,   0.99615385],
       [ 17.        ,   0.99865359,   0.99615385],
       [ 18.        ,   0.99846124,   0.99538462],
       [ 19.        ,   0.99846124,   0.99538462],
       [ 20.        ,   0.99846124,   0.99538462]])

In [59]:
fig, ax = plt.subplots(figsize=(16, 8))
# plotting train and holdout scores
ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='Hold Out')
ax.set_title('KNN with Different # of Neighbors')
ax.set_ylabel('Accuracy')
ax.set_xlabel('# of Neighbors')
ax.legend();



In [60]:
pipe.set_params(knn__n_neighbors=5)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)


Out[60]:
1.0

The best results are for 1 to 7 neighbors (100% accuracy on both the training and holdout sets), so I'm going with the default of 5.

3 - Repeat part (2), this time using cross validation. Comment on your results.


In [86]:
# from solution:

train_scores, test_scores = validation_curve(pipe, Xtrain, ytrain, param_name='knn__n_neighbors', param_range=range(1, 21))

In [89]:
# solution continues:

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.plot(range(1, 21), train_scores_mean, label='Train')
plt.fill_between(range(1, 21), train_scores_mean - train_scores_std, train_scores_mean + train_scores_std)
plt.plot(range(1, 21), test_scores_mean, label='Test')
plt.fill_between(range(1, 21), test_scores_mean - test_scores_std, test_scores_mean + test_scores_std)
plt.legend();



In [61]:
scores = []
# same as above, only using kfold
for k in range(1, 11):
    kfold = KFold(n_splits=10, random_state=1).split(Xtrain, ytrain) # the split generator is exhausted after one
                                                                     # pass, so it has to be recreated for every k
    fold_scores = []
    pipe.set_params(knn__n_neighbors=k)
    print('{} Neighbors'.format(k))
    # looping through the folds
    for i, (train, test) in enumerate(kfold):
        # creating train and cv sets
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        # fit and score
        pipe.fit(Xtr, ytr)
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        # append fold scores
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))
    fold_scores = np.array(fold_scores)
    # append mean and std score of the folds
    scores.append([k, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])

scores = np.array(scores)


1 Neighbors
Fold 1 done
Fold 2 done
...
Fold 10 done
[the same ten "Fold n done" lines repeat for 2 through 10 Neighbors]

In [62]:
scores


Out[62]:
array([[  1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  2.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  3.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  4.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  5.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  6.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  7.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  8.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  9.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  1.00000000e+01,   9.99897427e-01,   9.99384378e-01,
          2.05145741e-04,   1.41017865e-03]])

In [63]:
fig, ax = plt.subplots(figsize=(16, 8))
# plotting train and cv mean scores and filling with std
ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
                 scores[:, 1] + scores[:, 3],
                 scores[:, 1] - scores[:, 3],
                 alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
                 scores[:, 2] + scores[:, 4],
                 scores[:, 2] - scores[:, 4],
                 alpha=0.15, color='orange')
ax.set_title('KNN with Different # of Neighbors')
ax.set_ylabel('Accuracy')
ax.set_xlabel('# of Neighbors')
ax.legend();


We get perfect accuracy for any number of neighbors from 1 to 9 this time; we can also see that there is zero variance across folds in this range.

4 - Repeat part (3) using a different machine learning algorithm. Comment on your results.


In [64]:
# encoder and random forest
pipe = Pipeline([
    ('mle', MultiLabelEncoder()),
    #('ohe', OneHotEncoder()),
    ('forest', RandomForestClassifier())
])

In [65]:
scores = []

for k in np.arange(2, 50, 2):
    kfold = KFold(n_splits=10, random_state=1).split(Xtrain, ytrain)
    fold_scores = []
    pipe.set_params(forest__n_estimators=k)
    print('{} Trees'.format(k))
    for i, (train, test) in enumerate(kfold):
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        pipe.fit(Xtr, ytr)
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))

    fold_scores = np.array(fold_scores)
    scores.append([k, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])

scores = np.array(scores)


2 Trees
Fold 1 done
Fold 2 done
...
Fold 10 done
[the same ten "Fold n done" lines repeat for 4 through 48 Trees]

In [66]:
scores


Out[66]:
array([[  2.00000000e+00,   9.99948709e-01,   9.98000000e-01,
          1.53872457e-04,   3.51160376e-03],
       [  4.00000000e+00,   1.00000000e+00,   9.87384615e-01,
          0.00000000e+00,   3.37957667e-02],
       [  6.00000000e+00,   1.00000000e+00,   9.98923077e-01,
          0.00000000e+00,   2.75638044e-03],
       [  8.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  1.00000000e+01,   1.00000000e+00,   9.99538462e-01,
          0.00000000e+00,   1.38461538e-03],
       [  1.20000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  1.40000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  1.60000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  1.80000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  2.00000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  2.20000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  2.40000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  2.60000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  2.80000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  3.00000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  3.20000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  3.40000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  3.60000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  3.80000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  4.00000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  4.20000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  4.40000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  4.60000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00],
       [  4.80000000e+01,   1.00000000e+00,   1.00000000e+00,
          0.00000000e+00,   0.00000000e+00]])

In [67]:
fig, ax = plt.subplots(figsize=(16, 8))

ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
                 scores[:, 1] + scores[:, 3],
                 scores[:, 1] - scores[:, 3],
                 alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
                 scores[:, 2] + scores[:, 4],
                 scores[:, 2] - scores[:, 4],
                 alpha=0.15, color='orange')
ax.set_title('Random Forest with different # of Trees')
ax.set_ylabel('Accuracy')
ax.set_xlabel('# of Trees')
ax.legend();



In [68]:
pipe.set_params(forest__n_estimators=10)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)


Out[68]:
1.0

We get perfect accuracy from about 8 trees onward, so I'm going with the default value of 10.
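Incidentally, since a random forest bags its trees, it can estimate its generalization accuracy from the out-of-bag samples without any CV loop. A minimal sketch using the pipeline above (oob_score is a standard RandomForestClassifier parameter; the 'forest' step name matches the pipeline):

# out-of-bag estimate: each tree is scored on the samples it never saw during bagging
pipe.set_params(forest__n_estimators=10, forest__oob_score=True)
pipe.fit(Xtrain, ytrain)
pipe.named_steps['forest'].oob_score_  # OOB accuracy, roughly comparable to the CV score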

5 - Whichever of your two algorithms in parts (3) and (4) performed more poorly, perform a variable selection to see if you can improve your results.


In [70]:
mle = MultiLabelEncoder()
pca = PCA()
pca.fit(mle.fit_transform(Xtrain), ytrain)


Out[70]:
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [71]:
pca.explained_variance_ratio_


Out[71]:
array([  3.39467519e-01,   1.64901173e-01,   1.22427133e-01,
         6.88847905e-02,   5.85212974e-02,   5.08395611e-02,
         4.59054537e-02,   3.89389040e-02,   3.56630776e-02,
         2.54799438e-02,   2.01696746e-02,   1.01237071e-02,
         5.57913274e-03,   4.14982185e-03,   3.39576708e-03,
         2.04664212e-03,   1.50135444e-03,   1.23698691e-03,
         4.89268153e-04,   2.23930935e-04,   5.48609943e-05])

Let's try 7 components:


In [72]:
pipe = Pipeline([
    ('mle', MultiLabelEncoder()),
    ('sel', PCA(n_components=7)),
    ('forest', RandomForestClassifier())
])

In [74]:
scores = []

for k in np.arange(2, 22, 2):
    kfold = KFold(n_splits=10).split(Xtrain, ytrain)  # random_state has no effect without shuffle=True
    fold_scores = []
    pipe.set_params(forest__n_estimators=k)
    print('{} Trees'.format(k))
    for i, (train, test) in enumerate(kfold):
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        pipe.fit(Xtr, ytr)
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))

    fold_scores = np.array(fold_scores)
    scores.append([k, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])

scores = np.array(scores)


2 Trees
Fold 1 done
...
Fold 10 done
4 Trees
...
20 Trees
Fold 1 done
...
Fold 10 done

In [75]:
fig, ax = plt.subplots(figsize=(16, 8))

ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
                 scores[:, 1] + scores[:, 3],
                 scores[:, 1] - scores[:, 3],
                 alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
                 scores[:, 2] + scores[:, 4],
                 scores[:, 2] - scores[:, 4],
                 alpha=0.15, color='orange')
ax.set_title('Random Forest with different # of Trees')
ax.set_ylabel('Accuracy')
ax.set_xlabel('# of Trees')
ax.legend();


Well, it's actually worse! PCA keeps the directions of largest variance, which are not necessarily the most predictive ones.

6 - Pick a classification algorithm that has at least two hyperparameters and tune them using GridSearchCV. Comment on your results.

I'm trying logistic regression with RFE, tuning the number of selected features and the regularization strength:

In the solution he uses a random forest, tuning n_estimators and max_features!


In [8]:
from sklearn.feature_selection import RFE

Apparently GridSearchCV gets stuck when I use custom scorers or multithreading, or something along those lines, so I'm going to encode my labels outside the pipeline...


In [9]:
mle = MultiLabelEncoder()
Xtrain_enc = mle.fit_transform(Xtrain)
Xtest_enc = mle.transform(Xtest)

In [13]:
pipe = Pipeline([
    ('sel', RFE(estimator=LogisticRegression())),
    ('lr', LogisticRegression())
])

# C_range = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300]
C_range = [0.01, 0.03, 0.1, 0.3, 1, 3]
comp_range = np.arange(1, 16)
# using grid search to tune regularization and the number of features kept by RFE
param_grid = [{'sel__n_features_to_select':comp_range, 'lr__C': C_range}]
gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  cv=10)
# this could take a very long time...
gs.fit(Xtrain_enc, ytrain)


Out[13]:
GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('sel', RFE(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
  n_f...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'sel__n_features_to_select': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]), 'lr__C': [0.01, 0.03, 0.1, 0.3, 1, 3]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [14]:
print(gs.best_score_)
print(gs.best_params_)


0.954146791814
{'lr__C': 0.3, 'sel__n_features_to_select': 13}

In [16]:
pipe.set_params(sel__n_features_to_select=13, lr__C=0.3)
pipe.fit(Xtrain_enc, ytrain)
pipe.score(Xtest_enc, ytest)


Out[16]:
0.95323076923076921

We get 95% accuracy with simple logistic regression, not bad!
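Note that because GridSearchCV defaults to refit=True, gs.best_estimator_ has already been refit on the full training set, so the manual set_params/fit above isn't strictly necessary; scoring the search object directly should give the same number:

# uses gs.best_estimator_, which was refit on Xtrain_enc with the best parameters
gs.score(Xtest_enc, ytest)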

Regression

1 - Head over to the Machine Learning Repository, download the Parkinson's Telemonitoring Data Set, and put it into a dataframe. Be sure to familiarize yourself with the data before proceeding, removing the columns related to time, age, and sex. We're going to be predicting motor_UPDRS, so drop the total_UPDRS variable as well. Break the data into training and testing sets. Be sure to keep this testing set the same for the duration of this exercise, as we will be using it to test various algorithms!


In [90]:
parkinson = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data')
parkinson.drop(['age', 'sex', 'test_time', 'total_UPDRS'], axis=1, inplace=True)

In the solution he uses dummies for subject#, which I don't think is really "honest": it lets the model key on individual patients.


In [4]:
parkinson.head()


Out[4]:
subject# motor_UPDRS Jitter(%) Jitter(Abs) Jitter:RAP Jitter:PPQ5 Jitter:DDP Shimmer Shimmer(dB) Shimmer:APQ3 Shimmer:APQ5 Shimmer:APQ11 Shimmer:DDA NHR HNR RPDE DFA PPE
0 1 28.199 0.00662 0.000034 0.00401 0.00317 0.01204 0.02565 0.230 0.01438 0.01309 0.01662 0.04314 0.014290 21.640 0.41888 0.54842 0.16006
1 1 28.447 0.00300 0.000017 0.00132 0.00150 0.00395 0.02024 0.179 0.00994 0.01072 0.01689 0.02982 0.011112 27.183 0.43493 0.56477 0.10810
2 1 28.695 0.00481 0.000025 0.00205 0.00208 0.00616 0.01675 0.181 0.00734 0.00844 0.01458 0.02202 0.020220 23.047 0.46222 0.54405 0.21014
3 1 28.905 0.00528 0.000027 0.00191 0.00264 0.00573 0.02309 0.327 0.01106 0.01265 0.01963 0.03317 0.027837 24.445 0.48730 0.57794 0.33277
4 1 29.187 0.00335 0.000020 0.00093 0.00130 0.00278 0.01703 0.176 0.00679 0.00929 0.01819 0.02036 0.011625 26.126 0.47188 0.56122 0.19361

In [5]:
parkinson.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5875 entries, 0 to 5874
Data columns (total 18 columns):
subject#         5875 non-null int64
motor_UPDRS      5875 non-null float64
Jitter(%)        5875 non-null float64
Jitter(Abs)      5875 non-null float64
Jitter:RAP       5875 non-null float64
Jitter:PPQ5      5875 non-null float64
Jitter:DDP       5875 non-null float64
Shimmer          5875 non-null float64
Shimmer(dB)      5875 non-null float64
Shimmer:APQ3     5875 non-null float64
Shimmer:APQ5     5875 non-null float64
Shimmer:APQ11    5875 non-null float64
Shimmer:DDA      5875 non-null float64
NHR              5875 non-null float64
HNR              5875 non-null float64
RPDE             5875 non-null float64
DFA              5875 non-null float64
PPE              5875 non-null float64
dtypes: float64(17), int64(1)
memory usage: 826.2 KB

In [6]:
parkinson.describe().T


Out[6]:
count mean std min 25% 50% 75% max
subject# 5875.0 21.494128 12.372279 1.000000 10.000000 22.000000 33.000000 42.000000
motor_UPDRS 5875.0 21.296229 8.129282 5.037700 15.000000 20.871000 27.596500 39.511000
Jitter(%) 5875.0 0.006154 0.005624 0.000830 0.003580 0.004900 0.006800 0.099990
Jitter(Abs) 5875.0 0.000044 0.000036 0.000002 0.000022 0.000035 0.000053 0.000446
Jitter:RAP 5875.0 0.002987 0.003124 0.000330 0.001580 0.002250 0.003290 0.057540
Jitter:PPQ5 5875.0 0.003277 0.003732 0.000430 0.001820 0.002490 0.003460 0.069560
Jitter:DDP 5875.0 0.008962 0.009371 0.000980 0.004730 0.006750 0.009870 0.172630
Shimmer 5875.0 0.034035 0.025835 0.003060 0.019120 0.027510 0.039750 0.268630
Shimmer(dB) 5875.0 0.310960 0.230254 0.026000 0.175000 0.253000 0.365000 2.107000
Shimmer:APQ3 5875.0 0.017156 0.013237 0.001610 0.009280 0.013700 0.020575 0.162670
Shimmer:APQ5 5875.0 0.020144 0.016664 0.001940 0.010790 0.015940 0.023755 0.167020
Shimmer:APQ11 5875.0 0.027481 0.019986 0.002490 0.015665 0.022710 0.032715 0.275460
Shimmer:DDA 5875.0 0.051467 0.039711 0.004840 0.027830 0.041110 0.061735 0.488020
NHR 5875.0 0.032120 0.059692 0.000286 0.010955 0.018448 0.031463 0.748260
HNR 5875.0 21.679495 4.291096 1.659000 19.406000 21.920000 24.444000 37.875000
RPDE 5875.0 0.541473 0.100986 0.151020 0.469785 0.542250 0.614045 0.966080
DFA 5875.0 0.653240 0.070902 0.514040 0.596180 0.643600 0.711335 0.865600
PPE 5875.0 0.219589 0.091498 0.021983 0.156340 0.205500 0.264490 0.731730

In [91]:
# I'm dropping subject# because it is the patient id
Xtrain, Xtest, ytrain, ytest = train_test_split(parkinson.drop(['subject#', 'motor_UPDRS'], axis=1),
                                                parkinson.motor_UPDRS,
                                                test_size=0.2,
                                                random_state=54)
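One caveat: each subject contributes many rows, so a purely random split puts the same patient in both train and test. A group-aware split would keep each subject entirely on one side; here is a minimal sketch assuming sklearn's GroupShuffleSplit (not imported above), with the groups taken from subject# before it is dropped:

from sklearn.model_selection import GroupShuffleSplit

# keep every subject's rows entirely in train or entirely in test
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=54)
train_idx, test_idx = next(gss.split(parkinson, groups=parkinson['subject#']))

I'll stick with the plain split for the rest of the exercise, though.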

2 - Fit a machine learning algorithm of your choice to the data, tuning hyperparameters using the Better Holdout Method. Generate training and validation plots and comment on your results.


In [92]:
# scaling and ridge regression
pipe = Pipeline([
    ('stsc', StandardScaler()),
    ('ridge', Ridge())
])

In [93]:
# better hold out
Xtr, Xcv, ytr, ycv = train_test_split(Xtrain, ytrain, test_size=0.2, random_state=42)
scores = []
# looping through regularization values
for a in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]:
    pipe.set_params(ridge__alpha=a)
    pipe.fit(Xtr, ytr)
    tr_score = pipe.score(Xtr, ytr)
    cv_score = pipe.score(Xcv, ycv)
    scores.append([a, tr_score, cv_score])

scores = np.array(scores)

In [94]:
scores


Out[94]:
array([[  1.00000000e-03,   1.16318220e-01,   7.79041095e-02],
       [  3.00000000e-03,   1.16275461e-01,   7.78532625e-02],
       [  1.00000000e-02,   1.16236154e-01,   7.78025200e-02],
       [  3.00000000e-02,   1.16218763e-01,   7.77831332e-02],
       [  1.00000000e-01,   1.16211646e-01,   7.77870354e-02],
       [  3.00000000e-01,   1.16209196e-01,   7.78218305e-02],
       [  1.00000000e+00,   1.16205051e-01,   7.79468802e-02],
       [  3.00000000e+00,   1.16179855e-01,   7.82654107e-02],
       [  1.00000000e+01,   1.16004991e-01,   7.90742483e-02],
       [  3.00000000e+01,   1.15389966e-01,   8.03377924e-02],
       [  1.00000000e+02,   1.13505951e-01,   8.20404736e-02],
       [  3.00000000e+02,   1.08208849e-01,   8.20675091e-02],
       [  1.00000000e+03,   9.12025646e-02,   7.33942476e-02],
       [  3.00000000e+03,   6.45159818e-02,   5.52072122e-02]])

In [95]:
fig, ax = plt.subplots(figsize=(16, 8), subplot_kw=dict(xscale='log'))

ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='Hold Out')
ax.set_title(r'Ridge Regression with different $\alpha$')
ax.set_ylabel('$R^2$')
ax.set_xlabel(r'$\alpha$')
ax.legend();


Let's check the residuals; they don't seem so randomly scattered...


In [97]:
ypred = pipe.predict(Xtrain)
residuals = ypred - ytrain
plt.figure(figsize=(16, 8))
plt.scatter(ypred, residuals)
plt.hlines(y=0, xmin=ypred.min(), xmax=ypred.max(), lw=2, color='red');



In [98]:
pipe.set_params(ridge__alpha=100)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)


Out[98]:
0.091677187638304902

The score is very bad. The best choice for the regularization parameter seems to be around 100, but there is a serious bias problem on this dataset; we may need more features to get accurate predictions.
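To put that R² in perspective, we can compare against a baseline that always predicts the training mean; a quick sketch assuming sklearn's DummyRegressor (not imported above):

from sklearn.dummy import DummyRegressor

# a constant mean predictor scores R^2 close to 0; our ridge model is barely better
baseline = DummyRegressor(strategy='mean')
baseline.fit(Xtrain, ytrain)
baseline.score(Xtest, ytest)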

3 - Repeat part (2), this time using cross validation. Comment on your results.

Again, in the solution he uses validation_curve...


In [14]:
scores = []

for a in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]:
    kfold = KFold(n_splits=10).split(Xtrain, ytrain)  # random_state has no effect without shuffle=True
    fold_scores = []
    pipe.set_params(ridge__alpha=a)
    print('alpha = {}'.format(a))
    for i, (train, test) in enumerate(kfold):
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        pipe.fit(Xtr, ytr)
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))

    fold_scores = np.array(fold_scores)
    scores.append([a, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])

scores = np.array(scores)


alpha = 0.001
Fold 1 done
...
Fold 10 done
alpha = 0.003
...
alpha = 3000
Fold 1 done
...
Fold 10 done

In [15]:
scores


Out[15]:
array([[  1.00000000e-03,   1.10647735e-01,   9.74801567e-02,
          2.68480397e-03,   2.67480515e-02],
       [  3.00000000e-03,   1.10596127e-01,   9.75220009e-02,
          2.67857362e-03,   2.67084848e-02],
       [  1.00000000e-02,   1.10541942e-01,   9.75317132e-02,
          2.67243942e-03,   2.66857620e-02],
       [  3.00000000e-02,   1.10516140e-01,   9.75337815e-02,
          2.66940417e-03,   2.66721342e-02],
       [  1.00000000e-01,   1.10505295e-01,   9.75440898e-02,
          2.66809593e-03,   2.66462125e-02],
       [  3.00000000e-01,   1.10501756e-01,   9.75760533e-02,
          2.66761795e-03,   2.65795742e-02],
       [  1.00000000e+00,   1.10497960e-01,   9.76838590e-02,
          2.66663807e-03,   2.63583311e-02],
       [  3.00000000e+00,   1.10478219e-01,   9.79528837e-02,
          2.66059827e-03,   2.57977963e-02],
       [  1.00000000e+01,   1.10340899e-01,   9.86046213e-02,
          2.62209515e-03,   2.43741017e-02],
       [  3.00000000e+01,   1.09847211e-01,   9.94907422e-02,
          2.51210298e-03,   2.21951834e-02],
       [  1.00000000e+02,   1.08279204e-01,   1.00093253e-01,
          2.31436191e-03,   1.92621905e-02],
       [  3.00000000e+02,   1.03752361e-01,   9.77665731e-02,
          2.10799886e-03,   1.64362035e-02],
       [  1.00000000e+03,   8.86741532e-02,   8.44991341e-02,
          1.80978256e-03,   1.31636954e-02],
       [  3.00000000e+03,   6.40438452e-02,   6.07589895e-02,
          1.39119087e-03,   1.01174318e-02]])

In [16]:
fig, ax = plt.subplots(figsize=(16, 8), subplot_kw=dict(xscale='log'))

ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
                 scores[:, 1] + scores[:, 3],
                 scores[:, 1] - scores[:, 3],
                 alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
                 scores[:, 2] + scores[:, 4],
                 scores[:, 2] - scores[:, 4],
                 alpha=0.15, color='orange')
ax.set_title(r'Ridge Regression with different $\alpha$')
ax.set_ylabel('$R^2$')
ax.set_xlabel(r'$\alpha$')
ax.legend();


The two curves are even closer, but performance is still very poor overall. This plot confirms the high bias on this dataset.
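A learning curve makes the bias diagnosis more explicit: with high bias, the train and CV scores converge quickly to the same low plateau. A minimal sketch using learning_curve (imported at the top), with alpha fixed at its best value:

# learning_curve refits the pipeline on growing subsets of the training data
pipe.set_params(ridge__alpha=100)
train_sizes, tr_scores, cv_scores = learning_curve(pipe, Xtrain, ytrain,
                                                   train_sizes=np.linspace(0.1, 1.0, 10),
                                                   cv=10)
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(train_sizes, tr_scores.mean(axis=1), label='Train')
ax.plot(train_sizes, cv_scores.mean(axis=1), label='CV')
ax.set_xlabel('Training set size')
ax.set_ylabel('$R^2$')
ax.legend();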

4 - Repeat part (3) using a different machine learning algorithm. Comment on your results.


In [104]:
# scaling and random forest
pipe = Pipeline([
    ('scl', StandardScaler()),
    ('forest', RandomForestRegressor())
])

In [105]:
scores = []
# looping through different number of trees
for n in [3, 10, 30, 100, 300]:
    kfold = KFold(n_splits=10).split(Xtrain, ytrain)  # random_state has no effect without shuffle=True
    fold_scores = []
    pipe.set_params(forest__n_estimators=n)
    print('# trees = {}'.format(n))
    # looping through folds
    for i, (train, test) in enumerate(kfold):
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        pipe.fit(Xtr, ytr)
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))

    fold_scores = np.array(fold_scores)
    scores.append([n, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])

scores = np.array(scores)


# trees = 3
Fold 1 done
...
Fold 10 done
# trees = 10
...
# trees = 300
Fold 1 done
...
Fold 10 done

In [106]:
scores


Out[106]:
array([[  3.00000000e+00,   7.75759563e-01,   1.11207568e-01,
          6.20566798e-03,   9.25870932e-02],
       [  1.00000000e+01,   8.73927468e-01,   3.04753759e-01,
          3.50032359e-03,   5.00045767e-02],
       [  3.00000000e+01,   9.02512362e-01,   3.44222628e-01,
          1.40166596e-03,   4.45500648e-02],
       [  1.00000000e+02,   9.11940634e-01,   3.61549622e-01,
          1.40964680e-03,   3.39684626e-02],
       [  3.00000000e+02,   9.14474654e-01,   3.66088495e-01,
          1.15033779e-03,   3.40254413e-02]])

In [107]:
fig, ax = plt.subplots(figsize=(16, 8))

ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
                 scores[:, 1] + scores[:, 3],
                 scores[:, 1] - scores[:, 3],
                 alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
                 scores[:, 2] + scores[:, 4],
                 scores[:, 2] - scores[:, 4],
                 alpha=0.15, color='orange')
ax.set_title('Random Forest Regression with Different # of Trees')
ax.set_ylabel('$R^2$')
ax.set_xlabel('# of Trees')
ax.legend();



In [108]:
pipe.set_params(forest__n_estimators=100)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)


Out[108]:
0.32650193384436466

Using a random forest we get better performance, but now we have a serious high-variance problem: the training score is above 0.9 while the CV score stays around 0.36.
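The usual levers against variance in a forest are capping tree depth or raising the minimum leaf size; a hedged sketch (the max_depth and min_samples_leaf values below are arbitrary guesses, not tuned):

# constraining the trees trades some training fit for lower variance
pipe.set_params(forest__n_estimators=100,
                forest__max_depth=8,          # guess, not tuned
                forest__min_samples_leaf=20)  # guess, not tuned
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)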

5 - Whichever of your two algorithms in parts (3) and (4) performed more poorly, perform a variable selection to see if you can improve your results.

I'm going to use PCA to try and improve the Ridge regression, but I don't think there is much to do...

In the solution he uses LassoCV with SelectFromModel:


In [101]:
scl = StandardScaler()
sfm = SelectFromModel(LassoCV())
Xtrain_sc = scl.fit_transform(Xtrain)
Xtest_sc = scl.transform(Xtest)
Xtrain_lm = sfm.fit_transform(Xtrain_sc, ytrain)
Xtest_lm = sfm.transform(Xtest_sc)

alpha_range = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]
train_scores, test_scores = validation_curve(Ridge(), Xtrain_lm, ytrain,
                                             param_name='alpha',
                                             param_range=alpha_range)

In [103]:
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

fig, ax = plt.subplots(figsize=(16, 8), subplot_kw=dict(xscale='log'))

ax.plot(alpha_range, train_scores_mean, label='Train')
ax.fill_between(alpha_range,
                train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.2)
ax.plot(alpha_range, test_scores_mean, label='Test')
ax.fill_between(alpha_range,
                test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.2)
ax.legend();



In [21]:
pca = PCA()
pca.fit(Xtrain)
pca.explained_variance_ratio_


Out[21]:
array([  9.98053787e-01,   1.13952768e-03,   3.14424702e-04,
         2.75678723e-04,   1.45276689e-04,   6.06695114e-05,
         5.65762585e-06,   2.43417414e-06,   1.91760025e-06,
         3.76194769e-07,   1.95582169e-07,   3.85328969e-08,
         1.59326323e-08,   1.08253999e-11,   3.56549469e-13,
         3.45293868e-13])

The first component dominates here because PCA was fit on the unscaled features (HNR has by far the largest variance); inside the pipeline PCA runs after StandardScaler, so the picture is different there. I guess we can try with 12 components.
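For a less arbitrary choice, we can check the cumulative explained variance on the scaled features, matching what the pipeline actually sees (the 95% cutoff is just a common rule of thumb):

# fit PCA on standardized features, as the pipeline does
evr = PCA().fit(StandardScaler().fit_transform(Xtrain)).explained_variance_ratio_
# smallest number of components covering 95% of the variance
np.argmax(np.cumsum(evr) >= 0.95) + 1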


In [22]:
pipe = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=12)),
    ('ridge', Ridge())
])

In [23]:
scores = []

for a in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]:
    kfold = KFold(n_splits=10).split(Xtrain, ytrain)  # random_state has no effect without shuffle=True
    fold_scores = []
    pipe.set_params(ridge__alpha=a)
    print('alpha = {}'.format(a))
    for i, (train, test) in enumerate(kfold):
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        pipe.fit(Xtr, ytr)
        
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))

    fold_scores = np.array(fold_scores)
    scores.append([a, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])

scores = np.array(scores)


alpha = 0.001
Fold 1 done
...
Fold 10 done
alpha = 0.003
...
alpha = 3000
Fold 1 done
...
Fold 10 done

In [24]:
scores


Out[24]:
array([[  1.00000000e-03,   1.09179570e-01,   9.88752483e-02,
          2.21892150e-03,   2.12792494e-02],
       [  3.00000000e-03,   1.09179570e-01,   9.88753444e-02,
          2.21892150e-03,   2.12791443e-02],
       [  1.00000000e-02,   1.09179570e-01,   9.88756809e-02,
          2.21892150e-03,   2.12787766e-02],
       [  3.00000000e-02,   1.09179569e-01,   9.88766415e-02,
          2.21892148e-03,   2.12777263e-02],
       [  1.00000000e-01,   1.09179565e-01,   9.88799978e-02,
          2.21892126e-03,   2.12740539e-02],
       [  3.00000000e-01,   1.09179528e-01,   9.88895342e-02,
          2.21891930e-03,   2.12635927e-02],
       [  1.00000000e+00,   1.09179116e-01,   9.89223077e-02,
          2.21889739e-03,   2.12273368e-02],
       [  3.00000000e+00,   1.09175672e-01,   9.90110431e-02,
          2.21871233e-03,   2.11266791e-02],
       [  1.00000000e+01,   1.09142253e-01,   9.92733225e-02,
          2.21687510e-03,   2.08039042e-02],
       [  3.00000000e+01,   1.08939442e-01,   9.97506103e-02,
          2.20571012e-03,   2.00585354e-02],
       [  1.00000000e+02,   1.07810164e-01,   1.00099091e-01,
          2.15376794e-03,   1.84107502e-02],
       [  3.00000000e+02,   1.03556978e-01,   9.77169932e-02,
          2.03835879e-03,   1.61554595e-02],
       [  1.00000000e+03,   8.86099895e-02,   8.44638102e-02,
          1.78582427e-03,   1.30906810e-02],
       [  3.00000000e+03,   6.40218648e-02,   6.07434671e-02,
          1.38291098e-03,   1.00943465e-02]])

In [25]:
fig, ax = plt.subplots(figsize=(16, 8), subplot_kw=dict(xscale='log'))

ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
                 scores[:, 1] + scores[:, 3],
                 scores[:, 1] - scores[:, 3],
                 alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
                 scores[:, 2] + scores[:, 4],
                 scores[:, 2] - scores[:, 4],
                 alpha=0.15, color='orange')
ax.set_title(r'Ridge Regression with different $\alpha$')
ax.set_ylabel('$R^2$')
ax.set_xlabel(r'$\alpha$')
ax.legend();


Not much of an improvement...

6 - Pick a regression algorithm that has at least two hyperparameters and tune them using GridSearchCV. Comment on your results.

I'm going to try and add some polynomial features to the Ridge regression to see if it can perform better:


In [44]:
pipe = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=9)),
    ('poly', PolynomialFeatures()),
    ('ridge', Ridge())
])

alpha_range = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
degree_range = np.arange(1, 4)
# using grid search to tune regularization and degree of polynomials
param_grid = [{'poly__degree':degree_range, 'ridge__alpha': alpha_range}]
gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  cv=10,
                  n_jobs=2)

gs.fit(Xtrain, ytrain)


Out[44]:
GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=9, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('ridge', Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid=[{'poly__degree': array([1, 2, 3]), 'ridge__alpha': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [46]:
print(gs.best_score_)
print(gs.best_params_)


0.184314178869
{'poly__degree': 2, 'ridge__alpha': 30}

In [47]:
pipe.set_params(poly__degree=2, ridge__alpha=30)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)


Out[47]:
0.18657109593637167

We have a slight improvement introducing polynomial features, but the score is still very low...

In the solution, using a random forest and tuning n_estimators and max_features, he gets 0.357.
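For reference, a rough sketch of that approach (the parameter grids below are my guesses, not the solution's exact values):

# grid search over tree count and feature subsampling; scaling is unnecessary for forests
rf_gs = GridSearchCV(RandomForestRegressor(random_state=1),
                     param_grid={'n_estimators': [30, 100, 300],
                                 'max_features': [0.3, 0.5, 0.8, 1.0]},
                     cv=10, n_jobs=2)
rf_gs.fit(Xtrain, ytrain)
rf_gs.score(Xtest, ytest)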