Tree Methods

When it comes to machine learning classification algorithms, tree based methods tend to be among the most powerful. Not only do they tend to produce very accurate models, but they are quite easy to interpret. You can think of each branch of the decision tree representing a question, the answer to which tells you which direction to move. This is akin to medical diagnosis -- if white blood cell count is larger than x, then...

Tree methods do not come without their shortfalls however as they are quite susceptible to overfitting. In this exercise, we'll first take a look at a single decision tree, and then expand the concept more broadly to an ensemble of trees.

Decision Trees

Our basic building block here is going to be the decision tree. As discussed above, decision trees are somewhat modeled after the way a doctor diagnoses a disease in a patient. This leads to a model that is both easy to interpret and tends to perform well. In this exercise, we'll take a look at the Breast Cancer data set and fit a decision tree model.

1 - Head over to the Machine Learning Repository, download the Breast Cancer Wisconsin, put it into a dataframe, and split into training and test sets. Be sure to familiarize yourself with the data before proceeding.


In [1]:
import pandas as pd
import numpy as np

Attribute List

  1. Sample code number
  2. Clump Thickness
  3. Uniformity of Cell Size
  4. Uniformity of Cell Shape
  5. Marginal Adhesion
  6. Single Epithelial Cell Size
  7. Bare Nuclei
  8. Bland Chromatin
  9. Normal Nucleoli
  10. Mitoses
  11. Class (2 for benign, 4 for malignant)

In [2]:
bcw = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                 header=None)
bcw.columns = ['id', 'thickness', 'cell_size', 'cell_shape', 'adhesion', 'single_cell_size',
               'bare_nuclei', 'chromatin', 'normal_nucleoli', 'mitoses', 'class']

In [3]:
bcw.head()


Out[3]:
id thickness cell_size cell_shape adhesion single_cell_size bare_nuclei chromatin normal_nucleoli mitoses class
0 1000025 5 1 1 1 2 1 3 1 1 2
1 1002945 5 4 4 5 7 10 3 2 1 2
2 1015425 3 1 1 1 2 2 3 1 1 2
3 1016277 6 8 8 1 3 4 3 7 1 2
4 1017023 4 1 1 3 2 1 3 1 1 2

In [4]:
bcw.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
id                  699 non-null int64
thickness           699 non-null int64
cell_size           699 non-null int64
cell_shape          699 non-null int64
adhesion            699 non-null int64
single_cell_size    699 non-null int64
bare_nuclei         699 non-null object
chromatin           699 non-null int64
normal_nucleoli     699 non-null int64
mitoses             699 non-null int64
class               699 non-null int64
dtypes: int64(10), object(1)
memory usage: 60.1+ KB

In [5]:
bcw.describe().T


Out[5]:
count mean std min 25% 50% 75% max
id 699.0 1.071704e+06 617095.729819 61634.0 870688.5 1171710.0 1238298.0 13454352.0
thickness 699.0 4.417740e+00 2.815741 1.0 2.0 4.0 6.0 10.0
cell_size 699.0 3.134478e+00 3.051459 1.0 1.0 1.0 5.0 10.0
cell_shape 699.0 3.207439e+00 2.971913 1.0 1.0 1.0 5.0 10.0
adhesion 699.0 2.806867e+00 2.855379 1.0 1.0 1.0 4.0 10.0
single_cell_size 699.0 3.216023e+00 2.214300 1.0 2.0 2.0 4.0 10.0
chromatin 699.0 3.437768e+00 2.438364 1.0 2.0 3.0 5.0 10.0
normal_nucleoli 699.0 2.866953e+00 3.053634 1.0 1.0 1.0 4.0 10.0
mitoses 699.0 1.589413e+00 1.715078 1.0 1.0 1.0 1.0 10.0
class 699.0 2.689557e+00 0.951273 2.0 2.0 2.0 4.0 4.0

There are some non numerical values in the bare_nuclei column, let's take a look and fix them:


In [6]:
print(bcw.bare_nuclei.unique())
bcw.bare_nuclei = bcw.bare_nuclei.apply(lambda x: int(x) if x != '?' else np.nan)


['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']

Let's also look at how the two classes are distributed:


In [7]:
bcw['class'].value_counts()


Out[7]:
2    458
4    241
Name: class, dtype: int64

Finally we drop the rows with nas and split the data in train and test:


In [8]:
from sklearn.model_selection import train_test_split

In [9]:
bcw = bcw.dropna()
bcw_train, bcw_test, bcw_target_train, bcw_target_test = train_test_split(bcw.drop(['id', 'class'], axis=1), bcw.loc[:, 'class'], test_size=0.3, random_state=57)

2 - Fit a decision tree model to the data, using the default sklearn parameters, and report the training and testing accuracies. What is the most important feature? Comment on your results.


In [10]:
from sklearn.tree import DecisionTreeClassifier

In [11]:
treeClass = DecisionTreeClassifier(random_state=59)
treeClass.fit(bcw_train, bcw_target_train)
print(treeClass.score(bcw_train, bcw_target_train))
print(treeClass.score(bcw_test, bcw_target_test))


1.0
0.946341463415

In [12]:
print(bcw_train.columns)
print(treeClass.feature_importances_)


Index(['thickness', 'cell_size', 'cell_shape', 'adhesion', 'single_cell_size',
       'bare_nuclei', 'chromatin', 'normal_nucleoli', 'mitoses'],
      dtype='object')
[ 0.03857663  0.76442362  0.0095702   0.01262001  0.00410603  0.1096927
  0.00458909  0.04689664  0.00952508]

The most important feature is cell size. We can observe that the decision tree has a perfect training score but it is overfitting the data since the precision drops to 94.6% for the test set; nonetheless the accuracy of the model is very good.

3 - Now try different values of the max_features parameter and report the training and testing errors. Comment on your results.


In [13]:
for n_features in np.arange(1, 10):
    treeClass.set_params(max_features=n_features)
    treeClass.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} features: {}'.format(n_features, treeClass.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} features: {}'.format(n_features, treeClass.score(bcw_test, bcw_target_test)))
    # print(treeClass.feature_importances_)
    print('\n')


Training accuracy for 1 features: 1.0
Testing accuracy for 1 features: 0.9219512195121952


Training accuracy for 2 features: 1.0
Testing accuracy for 2 features: 0.9463414634146341


Training accuracy for 3 features: 1.0
Testing accuracy for 3 features: 0.926829268292683


Training accuracy for 4 features: 1.0
Testing accuracy for 4 features: 0.926829268292683


Training accuracy for 5 features: 1.0
Testing accuracy for 5 features: 0.9560975609756097


Training accuracy for 6 features: 1.0
Testing accuracy for 6 features: 0.9414634146341463


Training accuracy for 7 features: 1.0
Testing accuracy for 7 features: 0.9707317073170731


Training accuracy for 8 features: 1.0
Testing accuracy for 8 features: 0.9463414634146341


Training accuracy for 9 features: 1.0
Testing accuracy for 9 features: 0.9463414634146341


For all the possibilities we always get 100% accuracy on the training set, while the testing score is always over 92% and it goes up and down with the number of features.

4 - Now try different settings for the min_samples_split parameter. Comment on your results.


In [14]:
treeClass = DecisionTreeClassifier(random_state=59)

for min_samples in np.arange(50, 700, 50):
    treeClass.set_params(min_samples_split=min_samples)
    treeClass.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} samples: {}'.format(min_samples, treeClass.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} samples: {}'.format(min_samples, treeClass.score(bcw_test, bcw_target_test)))
    print('\n')


Training accuracy for 50 samples: 0.9686192468619247
Testing accuracy for 50 samples: 0.9609756097560975


Training accuracy for 100 samples: 0.9686192468619247
Testing accuracy for 100 samples: 0.9609756097560975


Training accuracy for 150 samples: 0.9686192468619247
Testing accuracy for 150 samples: 0.9609756097560975


Training accuracy for 200 samples: 0.9686192468619247
Testing accuracy for 200 samples: 0.9609756097560975


Training accuracy for 250 samples: 0.9686192468619247
Testing accuracy for 250 samples: 0.9609756097560975


Training accuracy for 300 samples: 0.9686192468619247
Testing accuracy for 300 samples: 0.9609756097560975


Training accuracy for 350 samples: 0.9393305439330544
Testing accuracy for 350 samples: 0.9073170731707317


Training accuracy for 400 samples: 0.9393305439330544
Testing accuracy for 400 samples: 0.9073170731707317


Training accuracy for 450 samples: 0.9393305439330544
Testing accuracy for 450 samples: 0.9073170731707317


Training accuracy for 500 samples: 0.6485355648535565
Testing accuracy for 500 samples: 0.6536585365853659


Training accuracy for 550 samples: 0.6485355648535565
Testing accuracy for 550 samples: 0.6536585365853659


Training accuracy for 600 samples: 0.6485355648535565
Testing accuracy for 600 samples: 0.6536585365853659


Training accuracy for 650 samples: 0.6485355648535565
Testing accuracy for 650 samples: 0.6536585365853659


With this settings we lower the overfitting and we get better overall performances for values under 300, while the score starts to get worst and worst over this number. For values between 50 and 300 the scores are the same, so I'll try smaller values (since the default is 2):


In [15]:
for min_samples in np.arange(2, 20, 1):
    treeClass.set_params(min_samples_split=min_samples)
    treeClass.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} samples: {}'.format(min_samples, treeClass.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} samples: {}'.format(min_samples, treeClass.score(bcw_test, bcw_target_test)))
    print('\n')


Training accuracy for 2 samples: 1.0
Testing accuracy for 2 samples: 0.9463414634146341


Training accuracy for 3 samples: 0.99581589958159
Testing accuracy for 3 samples: 0.9317073170731708


Training accuracy for 4 samples: 0.9937238493723849
Testing accuracy for 4 samples: 0.9463414634146341


Training accuracy for 5 samples: 0.9937238493723849
Testing accuracy for 5 samples: 0.9463414634146341


Training accuracy for 6 samples: 0.9937238493723849
Testing accuracy for 6 samples: 0.9463414634146341


Training accuracy for 7 samples: 0.9916317991631799
Testing accuracy for 7 samples: 0.9463414634146341


Training accuracy for 8 samples: 0.9916317991631799
Testing accuracy for 8 samples: 0.9463414634146341


Training accuracy for 9 samples: 0.9895397489539749
Testing accuracy for 9 samples: 0.9463414634146341


Training accuracy for 10 samples: 0.9895397489539749
Testing accuracy for 10 samples: 0.9463414634146341


Training accuracy for 11 samples: 0.9853556485355649
Testing accuracy for 11 samples: 0.9463414634146341


Training accuracy for 12 samples: 0.9853556485355649
Testing accuracy for 12 samples: 0.9463414634146341


Training accuracy for 13 samples: 0.9853556485355649
Testing accuracy for 13 samples: 0.9463414634146341


Training accuracy for 14 samples: 0.9790794979079498
Testing accuracy for 14 samples: 0.9463414634146341


Training accuracy for 15 samples: 0.9769874476987448
Testing accuracy for 15 samples: 0.9512195121951219


Training accuracy for 16 samples: 0.9769874476987448
Testing accuracy for 16 samples: 0.9512195121951219


Training accuracy for 17 samples: 0.9769874476987448
Testing accuracy for 17 samples: 0.9512195121951219


Training accuracy for 18 samples: 0.9769874476987448
Testing accuracy for 18 samples: 0.9512195121951219


Training accuracy for 19 samples: 0.9769874476987448
Testing accuracy for 19 samples: 0.9512195121951219


By increasing the min_samples_split parameter we lower the overfitting and, by the time we get over 15, the performances are slightly better than that of the default value.

5 - Using the models you build in part (4), and taking into consideration what you found in part (3), print out the a graphical representation of the tree for each. Comment on your results.


In [17]:
from sklearn.tree import export_graphviz

for min_samples in np.arange(2, 20, 1):
    treeClass.set_params(min_samples_split=min_samples)
    treeClass.fit(bcw_train, bcw_target_train)
    file = 'dots/tree_{}samples.dot'.format(min_samples)
    export_graphviz(treeClass,
                    out_file=file,
                    feature_names=['thickness', 'cell_size', 'cell_shape', 'adhesion', 'single_cell_size', 'bare_nuclei', 'chromatin' ,'normal_nucleoli', 'mitoses'])

I execute the batch ìcreateGraph.bat to transform the dot files in images.


In [18]:
from IPython.display import Image
from IPython.display import display
    
for min_samples in np.arange(2, 20, 1):
    img = 'figures/tree_{}samples.png'.format(min_samples)
    print('min_split_samples={}'.format(min_samples))
    display(Image(img))


min_split_samples=2
min_split_samples=3
min_split_samples=4
min_split_samples=5
min_split_samples=6
min_split_samples=7
min_split_samples=8
min_split_samples=9
min_split_samples=10
min_split_samples=11
min_split_samples=12
min_split_samples=13
min_split_samples=14
min_split_samples=15
min_split_samples=16
min_split_samples=17
min_split_samples=18
min_split_samples=19

By increasing the parameter the tree is less deep, if I had tried out a value of 25 or more I would have eliminated most of the left branch of the tree. Also, by decreasing the number of features I change the way the tree is splitted.

6 - Fit two other classification models of your choice to the data and comment on your results.


In [19]:
from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier()
KNN.fit(bcw_train, bcw_target_train)
print(KNN.score(bcw_train, bcw_target_train))
print(KNN.score(bcw_test, bcw_target_test))


0.983263598326
0.960975609756

In [20]:
from sklearn.naive_bayes import GaussianNB

NB = GaussianNB()
NB.fit(bcw_train, bcw_target_train)
print(NB.score(bcw_train, bcw_target_train))
print(NB.score(bcw_test, bcw_target_test))


0.960251046025
0.960975609756

Both K-Nearest Neighbors and Gaussian Naive Bayes don't get perfect training accuracy but are less prone to overfitting and get similar if not better results without being tuned; I guess that for this dataset the two classes are easy to separate.

Random Forests

Decision trees are a very handy method for classification, but their real power comes into play when they are grouped together into a forest. Random Forests essentially work by creating an ensemble of decision trees, and having each tree "vote" on what category a datapoint belongs to. Intuitively, this may not sound like it would provide much of a benefit, but in practice, random forests tend to be among the most powerful tools available.

1 - Using the same data as above, fit a Random Forest model using the default settings for the hyperparameters in sklearn. How do your results compare to the single decision tree model?


In [21]:
from sklearn.ensemble import RandomForestClassifier

In [22]:
forest = RandomForestClassifier(random_state=59)
forest.fit(bcw_train, bcw_target_train)
print(forest.score(bcw_train, bcw_target_train))
print(forest.score(bcw_test, bcw_target_test))


0.995815899582
0.956097560976

There is less overfitting and the testing accuracy is 1% better.

2 - Now try different values of the max_features parameter and report the training and testing errors. Comment on your results.


In [23]:
for n_features in np.arange(1, 10):
    forest.set_params(max_features=n_features)
    forest.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} features: {}'.format(n_features, forest.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} features: {}'.format(n_features, forest.score(bcw_test, bcw_target_test)))
    print('\n')


Training accuracy for 1 features: 0.99581589958159
Testing accuracy for 1 features: 0.9463414634146341


Training accuracy for 2 features: 0.9937238493723849
Testing accuracy for 2 features: 0.975609756097561


Training accuracy for 3 features: 0.99581589958159
Testing accuracy for 3 features: 0.9560975609756097


Training accuracy for 4 features: 0.997907949790795
Testing accuracy for 4 features: 0.9365853658536586


Training accuracy for 5 features: 0.997907949790795
Testing accuracy for 5 features: 0.9512195121951219


Training accuracy for 6 features: 0.997907949790795
Testing accuracy for 6 features: 0.9560975609756097


Training accuracy for 7 features: 0.9916317991631799
Testing accuracy for 7 features: 0.9463414634146341


Training accuracy for 8 features: 0.9916317991631799
Testing accuracy for 8 features: 0.9317073170731708


Training accuracy for 9 features: 0.9916317991631799
Testing accuracy for 9 features: 0.9365853658536586


In general we have better generalization than the tree alone; the best result is for 2 features for which we get 97.5% testing accuracy.

3 - Now try out a few different values of the n_estimators parameter. Comment on your results.


In [24]:
forest = RandomForestClassifier(random_state=59)

for n_estimators in [3, 10, 30, 100, 300, 1000]:
    forest.set_params(n_estimators=n_estimators)
    forest.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} estimators: {}'.format(n_estimators, forest.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} estimators: {}'.format(n_estimators, forest.score(bcw_test, bcw_target_test)))
    print('\n')


Training accuracy for 3 estimators: 0.9937238493723849
Testing accuracy for 3 estimators: 0.9463414634146341


Training accuracy for 10 estimators: 0.99581589958159
Testing accuracy for 10 estimators: 0.9560975609756097


Training accuracy for 30 estimators: 1.0
Testing accuracy for 30 estimators: 0.9609756097560975


Training accuracy for 100 estimators: 1.0
Testing accuracy for 100 estimators: 0.9609756097560975


Training accuracy for 300 estimators: 1.0
Testing accuracy for 300 estimators: 0.9658536585365853


Training accuracy for 1000 estimators: 1.0
Testing accuracy for 1000 estimators: 0.9658536585365853


Increasing the number of estimators we get both better training and testing accuracies, so we don't overfit the model while getting perfect training accuracy.

4 - Now try a few different values for the min_samples_split parameter. Then build a final model and comment on your results.


In [25]:
forest = RandomForestClassifier(random_state=59)

for min_samples in 2**np.arange(1, 10):
    forest.set_params(min_samples_split=min_samples)
    forest.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} samples split: {}'.format(min_samples, forest.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} samples split: {}'.format(min_samples, forest.score(bcw_test, bcw_target_test)))
    print('\n')


Training accuracy for 2 samples split: 0.99581589958159
Testing accuracy for 2 samples split: 0.9560975609756097


Training accuracy for 4 samples split: 0.9874476987447699
Testing accuracy for 4 samples split: 0.9658536585365853


Training accuracy for 8 samples split: 0.9790794979079498
Testing accuracy for 8 samples split: 0.9658536585365853


Training accuracy for 16 samples split: 0.9728033472803347
Testing accuracy for 16 samples split: 0.9804878048780488


Training accuracy for 32 samples split: 0.9728033472803347
Testing accuracy for 32 samples split: 0.9609756097560975


Training accuracy for 64 samples split: 0.9748953974895398
Testing accuracy for 64 samples split: 0.9658536585365853


Training accuracy for 128 samples split: 0.9623430962343096
Testing accuracy for 128 samples split: 0.9560975609756097


Training accuracy for 256 samples split: 0.9435146443514645
Testing accuracy for 256 samples split: 0.9414634146341463


Training accuracy for 512 samples split: 0.6485355648535565
Testing accuracy for 512 samples split: 0.6536585365853659


It seems that values between 4 and 64 give the better generalization, with the best result for 16.


In [26]:
forest.set_params(min_samples_split=8, n_estimators=300, max_features=2)
forest.fit(bcw_train, bcw_target_train)
print('Training accuracy: {}'.format(forest.score(bcw_train, bcw_target_train)))
print('Testing accuracy: {}'.format(forest.score(bcw_test, bcw_target_test)))


Training accuracy: 0.9832635983263598
Testing accuracy: 0.975609756097561

97.5% is good, it seems to me that the real benefit comes from the choice of the max_features parameter though.

5 - Determine the most important three features in your model, and then fit a new model using only those three features. Comment on your results.


In [27]:
pd.DataFrame(data=forest.feature_importances_,
             index=list(bcw_train.columns),
             columns=['importance']).sort_values(by='importance', ascending=False)


Out[27]:
importance
cell_size 0.214167
cell_shape 0.162288
bare_nuclei 0.160138
single_cell_size 0.139963
chromatin 0.123559
normal_nucleoli 0.080987
adhesion 0.056929
thickness 0.054763
mitoses 0.007206

In [28]:
bcw_train_red = bcw_train.loc[:, ['cell_size', 'cell_shape', 'bare_nuclei']]
bcw_test_red = bcw_test.loc[:, ['cell_size', 'cell_shape', 'bare_nuclei']]

In [29]:
forest = RandomForestClassifier(random_state=59)
forest.fit(bcw_train_red, bcw_target_train)
print(forest.score(bcw_train_red, bcw_target_train))
print(forest.score(bcw_test_red, bcw_target_test))


0.985355648536
0.956097560976

The result is slightly worse on the training set but the same on the testing set. Let's try the best model from before:


In [30]:
forest.set_params(min_samples_split=8, n_estimators=300, max_features=2)
forest.fit(bcw_train_red, bcw_target_train)
print('Training accuracy: {}'.format(forest.score(bcw_train_red, bcw_target_train)))
print('Testing accuracy: {}'.format(forest.score(bcw_test_red, bcw_target_test)))


Training accuracy: 0.9769874476987448
Testing accuracy: 0.9560975609756097

Here the result is worse, but this isn't optimized for this number of features!

6 - Fit two other classification models of your choice to the reduced data and comment on your results, taking into consideration your results in the Decision Trees section above.


In [31]:
KNN = KNeighborsClassifier()
KNN.fit(bcw_train_red, bcw_target_train)
print(KNN.score(bcw_train_red, bcw_target_train))
print(KNN.score(bcw_test_red, bcw_target_test))

# previous results:
# 0.983263598326
# 0.960975609756


0.966527196653
0.951219512195

In [32]:
NB = GaussianNB()
NB.fit(bcw_train_red, bcw_target_train)
print(NB.score(bcw_train_red, bcw_target_train))
print(NB.score(bcw_test_red, bcw_target_test))

# previous results:
# 0.960251046025
# 0.960975609756


0.964435146444
0.960975609756

With respect to the full data we get very similar results for Gaussian Naive Bayes and slightly worse for K-Nearest Neighbors, both are very similar to the forest accuracies though.

Decision Tree and Random Forest Regression

So far we have only used decision trees for classification problems, but, as it turns out, we can also use them for regression problems. Again, using trees tends to be computationally very expensive, but they are undoubtedly one of the most powerful tools you'll have at your disposal, especially for modeling non-linear data.

1 - Head over to the Machine Learning Repository, download the Concrete Compressive Strength Data Set, put it into a dataframe, and split into training and test sets. Be sure to familiarize yourself with the data before proceeding.


In [33]:
concrete = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls')
concrete.columns = ['cement', 'blast_furnace_slag', 'fly_ash', 'water', 'superplasticizer',
                    'coarse_aggregate', 'fine_aggregate', 'age', 'concrete_compressive_strength']

In [34]:
concrete.head()


Out[34]:
cement blast_furnace_slag fly_ash water superplasticizer coarse_aggregate fine_aggregate age concrete_compressive_strength
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28 79.986111
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.887366
2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270 40.269535
3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365 41.052780
4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360 44.296075

In [35]:
concrete.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
cement                           1030 non-null float64
blast_furnace_slag               1030 non-null float64
fly_ash                          1030 non-null float64
water                            1030 non-null float64
superplasticizer                 1030 non-null float64
coarse_aggregate                 1030 non-null float64
fine_aggregate                   1030 non-null float64
age                              1030 non-null int64
concrete_compressive_strength    1030 non-null float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB

In [36]:
concrete.describe().T


Out[36]:
count mean std min 25% 50% 75% max
cement 1030.0 281.165631 104.507142 102.000000 192.375000 272.900000 350.000000 540.000000
blast_furnace_slag 1030.0 73.895485 86.279104 0.000000 0.000000 22.000000 142.950000 359.400000
fly_ash 1030.0 54.187136 63.996469 0.000000 0.000000 0.000000 118.270000 200.100000
water 1030.0 181.566359 21.355567 121.750000 164.900000 185.000000 192.000000 247.000000
superplasticizer 1030.0 6.203112 5.973492 0.000000 0.000000 6.350000 10.160000 32.200000
coarse_aggregate 1030.0 972.918592 77.753818 801.000000 932.000000 968.000000 1029.400000 1145.000000
fine_aggregate 1030.0 773.578883 80.175427 594.000000 730.950000 779.510000 824.000000 992.600000
age 1030.0 45.662136 63.169912 1.000000 7.000000 28.000000 56.000000 365.000000
concrete_compressive_strength 1030.0 35.817836 16.705679 2.331808 23.707115 34.442774 46.136287 82.599225

In [37]:
Xtrain, Xtest, ytrain, ytest = train_test_split(concrete.iloc[:, :-1], concrete.iloc[:, -1], test_size=0.3, random_state=78)

2 - Fit a multilinear regression model to the data and print the error.


In [38]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr.fit(Xtrain, ytrain)
print(mean_squared_error(lr.predict(Xtrain), ytrain))
print(mean_squared_error(lr.predict(Xtest), ytest))


105.626762698
112.281766745

The model is pretty bad, but as the data description said this relation is highly nonlinear. Also, we haven't normalized the data, which is bad for linear regression.

3 - Fit a decision tree regressor to the data, using the default values for the hyperparameters in sklearn, and print the error. Comment on your results.


In [59]:
from sklearn.tree import DecisionTreeRegressor

treereg = DecisionTreeRegressor(random_state=68)
treereg.fit(Xtrain, ytrain)
print(mean_squared_error(treereg.predict(Xtrain), ytrain))
print(mean_squared_error(treereg.predict(Xtest), ytest))


0.0472989160893
47.5380658415

With this model the training error is very low but we have bad generalization. Still, the model is way better than linear regressione.

4 - Determine the top three features, and fit a new decision tree regressor to the data. Comment on your results.


In [60]:
pd.DataFrame(data=treereg.feature_importances_,
             index=list(Xtrain.columns),
             columns=['importance']).sort_values(by='importance', ascending=False)


Out[60]:
importance
cement 0.374141
age 0.317829
blast_furnace_slag 0.113377
water 0.108268
superplasticizer 0.029799
coarse_aggregate 0.028815
fine_aggregate 0.019587
fly_ash 0.008184

In [61]:
Xtrain_red = Xtrain.loc[:, ['cement', 'age', 'blast_furnace_slag']]
Xtest_red = Xtest.loc[:, ['cement', 'age', 'blast_furnace_slag']]

In [62]:
treereg.fit(Xtrain_red, ytrain)
print(mean_squared_error(treereg.predict(Xtrain_red), ytrain))
print(mean_squared_error(treereg.predict(Xtest_red), ytest))


2.02601713685
71.6936692442

The result is worst, I would maybe include also the fourth best feature which has very similar importance as the third to get similar results to the full model:


In [63]:
Xtrain_red = Xtrain.loc[:, ['cement', 'age', 'blast_furnace_slag', 'water']]
Xtest_red = Xtest.loc[:, ['cement', 'age', 'blast_furnace_slag', 'water']]

In [64]:
treereg.fit(Xtrain_red, ytrain)
print(mean_squared_error(treereg.predict(Xtrain_red), ytrain))
print(mean_squared_error(treereg.predict(Xtest_red), ytest))


1.07688867862
52.1744339809

Yep, it's almost the same now.

5 - Tweak the hyperparameters a bit, until you get a fit with a decision tree that you are comfortable with, then print the error and $R^2$ of your final model. Comment on your results.


In [65]:
scores = []

for min_samples in 2**np.arange(1, 10):
    for max_features in np.arange(1, 9):
        for min_impurity_decrease in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30):
            treereg.set_params(min_samples_split=min_samples, max_features=max_features, min_impurity_decrease=min_impurity_decrease)
            treereg.fit(Xtrain, ytrain)
            train_MSE = mean_squared_error(treereg.predict(Xtrain), ytrain)
            test_MSE = mean_squared_error(treereg.predict(Xtest), ytest)
            scores.append([min_samples, max_features, min_impurity_decrease, train_MSE, test_MSE])

scores_df = pd.DataFrame(data=scores, columns=['min_samples', 'max_features', 'min_impurity_decrease', 'train_MSE', 'test_MSE'])

In [66]:
scores_df.sort_values(by='test_MSE')


Out[66]:
min_samples max_features min_impurity_decrease train_MSE test_MSE
60 2 7 0.001 0.097601 43.659474
32 2 4 0.010 1.561799 43.777403
61 2 7 0.003 0.211171 44.605638
71 2 8 0.003 0.230356 45.889988
151 4 8 0.003 1.830092 46.409812
152 4 8 0.010 2.124584 46.514719
221 8 7 0.003 8.710916 46.631963
130 4 6 0.001 2.621187 47.734303
113 4 4 0.030 5.471112 48.012985
201 8 5 0.003 8.403215 48.071857
200 8 5 0.001 8.402019 48.088031
73 2 8 0.030 2.621335 48.312631
112 4 4 0.010 4.528237 48.440725
63 2 7 0.030 2.878880 48.947704
31 2 4 0.003 1.066663 48.956190
220 8 7 0.001 8.630825 49.023634
42 2 5 0.010 1.106937 49.235677
140 4 7 0.001 1.998403 49.770850
232 8 8 0.010 8.192768 49.880341
202 8 5 0.010 9.097459 49.941711
121 4 5 0.003 3.265616 49.971514
212 8 6 0.010 11.592719 50.059599
144 4 7 0.100 8.743146 50.063722
122 4 5 0.010 3.747044 50.126319
230 8 8 0.001 8.144103 50.158489
231 8 8 0.003 8.145199 50.165906
72 2 8 0.010 0.899309 50.264624
143 4 7 0.030 3.995731 50.319982
131 4 6 0.003 2.524592 50.737483
153 4 8 0.030 3.215245 50.763936
... ... ... ... ... ...
658 512 2 10.000 218.379859 189.462632
659 512 2 30.000 218.379859 189.462632
655 512 2 0.300 218.379859 189.462632
707 512 7 3.000 218.379859 189.462632
708 512 7 10.000 218.379859 189.462632
709 512 7 30.000 218.379859 189.462632
663 512 3 0.030 218.379859 189.462632
109 4 3 30.000 218.379859 189.462632
665 512 3 0.300 218.379859 189.462632
666 512 3 1.000 218.379859 189.462632
29 2 3 30.000 218.379859 189.462632
667 512 3 3.000 218.379859 189.462632
668 512 3 10.000 218.379859 189.462632
664 512 3 0.100 218.379859 189.462632
706 512 7 1.000 218.379859 189.462632
703 512 7 0.030 218.379859 189.462632
704 512 7 0.100 218.379859 189.462632
702 512 7 0.010 218.379859 189.462632
701 512 7 0.003 218.379859 189.462632
700 512 7 0.001 218.379859 189.462632
662 512 3 0.010 218.379859 189.462632
661 512 3 0.003 218.379859 189.462632
189 8 3 30.000 218.379859 189.462632
660 512 3 0.001 218.379859 189.462632
509 128 3 30.000 218.379859 189.462632
269 16 3 30.000 218.379859 189.462632
429 64 3 30.000 218.379859 189.462632
349 32 3 30.000 218.379859 189.462632
705 512 7 0.300 218.379859 189.462632
669 512 3 30.000 218.379859 189.462632

720 rows × 5 columns


In [53]:
from sklearn.metrics import r2_score

In [67]:
treereg.set_params(min_samples_split=2, max_features=7, min_impurity_decrease=0.001)
treereg.fit(Xtrain, ytrain)
print('Training MSE: {}'.format(mean_squared_error(treereg.predict(Xtrain), ytrain)))
print('Testing MSE: {}'.format(mean_squared_error(treereg.predict(Xtest), ytest)))
print('Training R^2: {}'.format(r2_score(treereg.predict(Xtrain), ytrain)))
print('Testing R^2: {}'.format(r2_score(treereg.predict(Xtest), ytest)))
# R^2 and model.score are the same in this case!


Training MSE: 0.09760097128377535
Testing MSE: 43.659474406270945
Training R^2: 0.9996634471752562
Testing R^2: 0.8238521100402227

6 - Fit a polynomial regression model to the data and print the error. Comment on your results.


In [82]:
from sklearn.preprocessing import PolynomialFeatures

quadratic = PolynomialFeatures(degree=2)
cubic = PolynomialFeatures(degree=3)
quartic = PolynomialFeatures(degree=4)

Xtrain2 = quadratic.fit_transform(Xtrain)
Xtest2 = quadratic.fit_transform(Xtest)
Xtrain3 = cubic.fit_transform(Xtrain)
Xtest3 = cubic.fit_transform(Xtest)
Xtrain4 = quartic.fit_transform(Xtrain)
Xtest4 = quartic.fit_transform(Xtest)

In [83]:
lr.fit(Xtrain2, ytrain)
print('Poly of degree 2, training: {}'.format(mean_squared_error(lr.predict(Xtrain2), ytrain)))
print('Poly of degree 2, testing: {}'.format(mean_squared_error(lr.predict(Xtest2), ytest)))

lr.fit(Xtrain3, ytrain)
print('Poly of degree 3, training: {}'.format(mean_squared_error(lr.predict(Xtrain3), ytrain)))
print('Poly of degree 3, testing: {}'.format(mean_squared_error(lr.predict(Xtest3), ytest)))

lr.fit(Xtrain4, ytrain)
print('Poly of degree 4, training: {}'.format(mean_squared_error(lr.predict(Xtrain4), ytrain)))
print('Poly of degree 4, testing: {}'.format(mean_squared_error(lr.predict(Xtest4), ytest)))


Poly of degree 2, training: 53.21344378765793
Poly of degree 2, testing: 68.3780764650379
Poly of degree 3, training: 27.14188882608678
Poly of degree 3, testing: 58.34844809846433
Poly of degree 4, training: 74.23849084236245
Poly of degree 4, testing: 11155.410870626754

We get better generalization for degree 2 and 3 but the result is still worst than the decision tree. For degree 4 we get a lot of overfitting.

7 - Fit a random forest regressor to the data, using the default values for the hyperparameters in sklearn, and print the error. Comment on your results.


In [69]:
from sklearn.ensemble import RandomForestRegressor

forestreg = RandomForestRegressor(random_state=91)
forestreg.fit(Xtrain, ytrain)
print(mean_squared_error(forestreg.predict(Xtrain), ytrain))
print(mean_squared_error(forestreg.predict(Xtest), ytest))


5.91840250844
31.1570891874

The training error is worst (but very good all things considered) but the testing error is way better (better generalization as expected).

8 - Determine the top three features, and fit a new random forest regressor to the data. Comment on your results.


In [70]:
pd.DataFrame(data=forestreg.feature_importances_,
             index=list(Xtrain.columns),
             columns=['importance']).sort_values(by='importance', ascending=False)


Out[70]:
importance
age 0.348812
cement 0.317501
water 0.096903
blast_furnace_slag 0.082603
superplasticizer 0.073507
fine_aggregate 0.035714
coarse_aggregate 0.025927
fly_ash 0.019033

In [71]:
Xtrain_red = Xtrain.loc[:, ['age', 'cement', 'water']]
Xtest_red = Xtest.loc[:, ['age', 'cement', 'water']]

In [74]:
forestreg.fit(Xtrain_red, ytrain)
print(mean_squared_error(forestreg.predict(Xtrain_red), ytrain))
print(mean_squared_error(forestreg.predict(Xtest_red), ytest))


11.3151381863
56.554040365

Again the result is worst, but using the top four features it gets closer to the full model:


In [75]:
Xtrain_red = Xtrain.loc[:, ['age', 'cement', 'water', 'blast_furnace_slag']]
Xtest_red = Xtest.loc[:, ['age', 'cement', 'water', 'blast_furnace_slag']]

In [76]:
forestreg.fit(Xtrain_red, ytrain)
print(mean_squared_error(forestreg.predict(Xtrain_red), ytrain))
print(mean_squared_error(forestreg.predict(Xtest_red), ytest))


6.34327769824
31.8530119792

9 - Tweak the hyperparameters a bit, until you get a fit with a random forest that you are comfortable with, then print the error and $R^2$ of your final model. Comment on your results.


In [79]:
forestreg.set_params(n_jobs=-1)
forest_scores = []

for n_estimators in (10, 30, 100, 300):
    for min_samples in 2**np.arange(1, 8):
        for max_features in np.arange(1, 9):
            for min_impurity_decrease in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3):
                forestreg.set_params(n_estimators=n_estimators, min_samples_split=min_samples, max_features=max_features, min_impurity_decrease=min_impurity_decrease)
                forestreg.fit(Xtrain, ytrain)
                train_MSE = mean_squared_error(forestreg.predict(Xtrain), ytrain)
                test_MSE = mean_squared_error(forestreg.predict(Xtest), ytest)
                forest_scores.append([n_estimators, min_samples, max_features, min_impurity_decrease, train_MSE, test_MSE])
    print('{} estimators done'.format(n_estimators))

forest_scores_df = pd.DataFrame(data=forest_scores, columns=['n_estimators', 'min_samples', 'max_features', 'min_impurity_decrease', 'train_MSE', 'test_MSE'])


10 estimators done
30 estimators done
100 estimators done
300 estimators done

In [80]:
forest_scores_df.sort_values(by='test_MSE')


Out[80]:
n_estimators min_samples max_features min_impurity_decrease train_MSE test_MSE
684 100 2 3 0.001 3.707479 25.749699
1020 300 2 3 0.001 3.430176 25.897050
696 100 2 5 0.001 3.468512 26.164080
1032 300 2 5 0.001 3.340047 26.232232
1027 300 2 4 0.003 3.394135 26.305300
1028 300 2 4 0.010 3.684132 26.434957
1026 300 2 4 0.001 3.249728 26.559349
1021 300 2 3 0.003 3.605562 26.691993
1068 300 4 3 0.001 5.886957 26.699807
1022 300 2 3 0.010 3.846487 26.725852
686 100 2 3 0.010 4.036399 26.728575
685 100 2 3 0.003 3.820184 26.735652
1034 300 2 5 0.010 3.689213 26.803360
1044 300 2 7 0.001 3.392739 26.819310
1039 300 2 6 0.003 3.362704 26.824635
1038 300 2 6 0.001 3.324991 26.877545
356 30 2 4 0.010 4.338079 26.897680
692 100 2 4 0.010 3.828635 26.927900
1029 300 2 4 0.030 4.345660 26.974953
1045 300 2 7 0.003 3.361912 26.975610
1033 300 2 5 0.003 3.381798 26.997225
1035 300 2 5 0.030 4.346827 27.048094
732 100 4 3 0.001 6.070712 27.049624
697 100 2 5 0.003 3.504518 27.129279
691 100 2 4 0.003 3.605182 27.180129
1075 300 4 4 0.003 5.194350 27.189868
1040 300 2 6 0.010 3.620749 27.215332
1050 300 2 8 0.001 3.435451 27.255133
1069 300 4 3 0.003 5.936568 27.294603
687 100 2 3 0.030 4.947183 27.305525
... ... ... ... ... ... ...
294 10 128 2 0.001 114.214072 104.493399
295 10 128 2 0.003 114.214072 104.493399
299 10 128 2 0.300 114.214072 104.493399
296 10 128 2 0.010 114.214072 104.493399
298 10 128 2 0.100 114.214072 104.493399
297 10 128 2 0.030 114.214072 104.493399
960 100 128 1 0.001 126.773099 119.448219
962 100 128 1 0.010 127.496325 120.223397
964 100 128 1 0.100 127.496325 120.223397
961 100 128 1 0.003 127.496325 120.223397
963 100 128 1 0.030 127.496325 120.223397
965 100 128 1 0.300 128.473494 120.907442
627 30 128 1 0.030 128.232470 123.757255
624 30 128 1 0.001 128.232470 123.757255
625 30 128 1 0.003 128.232470 123.757255
626 30 128 1 0.010 128.232470 123.757255
628 30 128 1 0.100 128.232470 123.757255
1296 300 128 1 0.001 132.554078 124.203750
1297 300 128 1 0.003 132.798575 124.463556
1299 300 128 1 0.030 132.798575 124.463556
1298 300 128 1 0.010 132.798575 124.463556
1300 300 128 1 0.100 132.998312 124.615008
293 10 128 1 0.300 132.470412 124.709540
290 10 128 1 0.010 132.470412 124.709540
289 10 128 1 0.003 132.470412 124.709540
288 10 128 1 0.001 132.470412 124.709540
292 10 128 1 0.100 132.470412 124.709540
291 10 128 1 0.030 132.470412 124.709540
629 30 128 1 0.300 130.855701 125.079576
1301 300 128 1 0.300 133.922478 125.459040

1344 rows × 6 columns


In [85]:
# from solution: both here and in the decision tree changing criterion to MAE produced an improvement
forestreg.set_params(n_estimators=100, min_samples_split=2, max_features=3, min_impurity_decrease=0.001)
forestreg.fit(Xtrain, ytrain)
print('Training MSE: {}'.format(mean_squared_error(forestreg.predict(Xtrain), ytrain)))
print('Testing MSE: {}'.format(mean_squared_error(forestreg.predict(Xtest), ytest)))
print('Training R^2: {}'.format(r2_score(forestreg.predict(Xtrain), ytrain)))
print('Testing R^2: {}'.format(r2_score(forestreg.predict(Xtest), ytest)))


Training MSE: 4.821995613930632
Testing MSE: 27.373452174156235
Training R^2: 0.981138576599212
Testing R^2: 0.857550201116013

We get a fairly good model given the nonlinear data with a testing $R^2$ score of over 0.86. This result is way better than our first linear model and its MSE is almost half of our first regression tree.