When it comes to machine learning classification algorithms, tree-based methods tend to be among the most powerful. Not only do they tend to produce very accurate models, but they are also quite easy to interpret: you can think of each branch of a decision tree as representing a question, the answer to which tells you which direction to move. This is akin to medical diagnosis -- if the white blood cell count is larger than x, then...
Tree methods do not come without their shortfalls, however, as they are quite susceptible to overfitting. In this exercise, we'll first take a look at a single decision tree, and then expand the concept more broadly to an ensemble of trees.
Our basic building block here is going to be the decision tree. As discussed above, decision trees are somewhat modeled after the way a doctor diagnoses a disease in a patient, which leads to a model that is easy to interpret and tends to perform well. In this exercise, we'll take a look at the Breast Cancer data set and fit a decision tree model.
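To make the "sequence of questions" idea concrete, here is a minimal hand-written sketch of a two-question decision rule; the feature names and thresholds are invented purely for illustration and have nothing to do with the data set below:

# A hypothetical two-question "tree" written by hand, just to illustrate the
# branch-as-question idea; the thresholds are made up for this example.
def toy_diagnosis(white_cell_count, tumor_size):
    if white_cell_count > 11.0:        # question 1
        if tumor_size > 2.5:           # question 2, asked only on the "yes" branch
            return 'malignant'
        return 'benign'
    return 'benign'

print(toy_diagnosis(white_cell_count=12.3, tumor_size=3.1))  # 'malignant'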
1 - Head over to the Machine Learning Repository, download the Breast Cancer Wisconsin data set, put it into a dataframe, and split into training and test sets. Be sure to familiarize yourself with the data before proceeding.
In [1]:
import pandas as pd
import numpy as np
In [2]:
bcw = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                  header=None)
bcw.columns = ['id', 'thickness', 'cell_size', 'cell_shape', 'adhesion', 'single_cell_size',
               'bare_nuclei', 'chromatin', 'normal_nucleoli', 'mitoses', 'class']
In [3]:
bcw.head()
Out[3]:
In [4]:
bcw.info()
In [5]:
bcw.describe().T
Out[5]:
There are some non-numerical values in the bare_nuclei column; let's take a look and fix them:
In [6]:
print(bcw.bare_nuclei.unique())
bcw.bare_nuclei = bcw.bare_nuclei.apply(lambda x: int(x) if x != '?' else np.nan)
Let's also look at how the two classes are distributed:
In [7]:
bcw['class'].value_counts()
Out[7]:
Finally, we drop the rows containing NAs and split the data into training and test sets:
In [8]:
from sklearn.model_selection import train_test_split
In [9]:
bcw = bcw.dropna()
bcw_train, bcw_test, bcw_target_train, bcw_target_test = train_test_split(bcw.drop(['id', 'class'], axis=1), bcw.loc[:, 'class'], test_size=0.3, random_state=57)
2 - Fit a decision tree model to the data, using the default sklearn parameters, and report the training and testing accuracies. What is the most important feature? Comment on your results.
In [10]:
from sklearn.tree import DecisionTreeClassifier
In [11]:
treeClass = DecisionTreeClassifier(random_state=59)
treeClass.fit(bcw_train, bcw_target_train)
print(treeClass.score(bcw_train, bcw_target_train))
print(treeClass.score(bcw_test, bcw_target_test))
In [12]:
print(bcw_train.columns)
print(treeClass.feature_importances_)
The most important feature is cell size. The decision tree achieves a perfect training score but overfits the data, since the accuracy drops to 94.6% on the test set; nonetheless, the overall accuracy of the model is very good.
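To make the mapping between columns and importance scores easier to read, here is a small convenience snippet (same numbers as above, just labeled and sorted; it assumes treeClass and bcw_train from the cells above are in scope):

# Pair each column name with its importance score from the fitted tree and sort.
importances = pd.Series(treeClass.feature_importances_, index=bcw_train.columns)
print(importances.sort_values(ascending=False))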
3 - Now try different values of the max_features parameter and report the training and testing errors. Comment on your results.
In [13]:
for n_features in np.arange(1, 10):
    treeClass.set_params(max_features=n_features)
    treeClass.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} features: {}'.format(n_features, treeClass.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} features: {}'.format(n_features, treeClass.score(bcw_test, bcw_target_test)))
    # print(treeClass.feature_importances_)
    print('\n')
For every value of max_features we get 100% training accuracy, while the testing accuracy always stays above 92% and fluctuates with the number of features.
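Since these accuracies come from a single train/test split, some of that up-and-down movement may just be noise; a cross-validated estimate on the training set would give a steadier picture. A sketch, reusing the objects defined above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy on the training data for each max_features value.
for n_features in np.arange(1, 10):
    cv_tree = DecisionTreeClassifier(max_features=n_features, random_state=59)
    scores = cross_val_score(cv_tree, bcw_train, bcw_target_train, cv=5)
    print('max_features={}: CV accuracy {:.3f} +/- {:.3f}'.format(
        n_features, scores.mean(), scores.std()))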
4 - Now try different settings for the min_samples_split parameter. Comment on your results.
In [14]:
treeClass = DecisionTreeClassifier(random_state=59)
for min_samples in np.arange(50, 700, 50):
    treeClass.set_params(min_samples_split=min_samples)
    treeClass.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} samples: {}'.format(min_samples, treeClass.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} samples: {}'.format(min_samples, treeClass.score(bcw_test, bcw_target_test)))
    print('\n')
With these settings we reduce the overfitting and get better overall performance for values under 300, while the score gets worse and worse above that. For values between 50 and 300 the scores are identical, so I'll try smaller values (since the default is 2):
In [15]:
for min_samples in np.arange(2, 20, 1):
    treeClass.set_params(min_samples_split=min_samples)
    treeClass.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} samples: {}'.format(min_samples, treeClass.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} samples: {}'.format(min_samples, treeClass.score(bcw_test, bcw_target_test)))
    print('\n')
By increasing the min_samples_split parameter we reduce the overfitting, and once we get above 15 the performance is slightly better than with the default value.
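A more compact way to explore this parameter, as a sketch, is sklearn's validation_curve, which computes cross-validated training and validation scores for each value in one call:

from sklearn.model_selection import validation_curve

param_range = np.arange(2, 20)
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=59),
    bcw_train, bcw_target_train,
    param_name='min_samples_split', param_range=param_range, cv=5)

# Average over the 5 folds for each parameter value.
for p, tr, va in zip(param_range, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print('min_samples_split={}: train {:.3f}, CV {:.3f}'.format(p, tr, va))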
5 - Using the models you built in part (4), and taking into consideration what you found in part (3), print out a graphical representation of the tree for each. Comment on your results.
In [17]:
from sklearn.tree import export_graphviz
for min_samples in np.arange(2, 20, 1):
    treeClass.set_params(min_samples_split=min_samples)
    treeClass.fit(bcw_train, bcw_target_train)
    file = 'dots/tree_{}samples.dot'.format(min_samples)
    export_graphviz(treeClass,
                    out_file=file,
                    feature_names=['thickness', 'cell_size', 'cell_shape', 'adhesion', 'single_cell_size',
                                   'bare_nuclei', 'chromatin', 'normal_nucleoli', 'mitoses'])
I execute the batch file createGraph.bat to convert the .dot files into images.
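I don't reproduce the batch file here; a rough Python equivalent would simply call the Graphviz dot executable on each file (this assumes Graphviz is installed and on the PATH, and that the dots/ and figures/ folders already exist):

import subprocess

# Hypothetical stand-in for createGraph.bat: render each .dot file to a PNG.
for min_samples in np.arange(2, 20, 1):
    src = 'dots/tree_{}samples.dot'.format(min_samples)
    dst = 'figures/tree_{}samples.png'.format(min_samples)
    subprocess.run(['dot', '-Tpng', src, '-o', dst], check=True)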
In [18]:
from IPython.display import Image
from IPython.display import display
for min_samples in np.arange(2, 20, 1):
    img = 'figures/tree_{}samples.png'.format(min_samples)
    print('min_split_samples={}'.format(min_samples))
    display(Image(img))
Increasing the parameter makes the tree less deep; if I had tried a value of 25 or more, I would have eliminated most of the left branch of the tree. Also, decreasing the number of features changes the way the tree is split.
6 - Fit two other classification models of your choice to the data and comment on your results.
In [19]:
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier()
KNN.fit(bcw_train, bcw_target_train)
print(KNN.score(bcw_train, bcw_target_train))
print(KNN.score(bcw_test, bcw_target_test))
In [20]:
from sklearn.naive_bayes import GaussianNB
NB = GaussianNB()
NB.fit(bcw_train, bcw_target_train)
print(NB.score(bcw_train, bcw_target_train))
print(NB.score(bcw_test, bcw_target_test))
Neither K-Nearest Neighbors nor Gaussian Naive Bayes achieves perfect training accuracy, but both are less prone to overfitting and get similar, if not better, results without any tuning; I suspect that for this dataset the two classes are easy to separate.
Decision trees are a very handy method for classification, but their real power comes into play when they are grouped together into a forest. Random Forests essentially work by creating an ensemble of decision trees and having each tree "vote" on which category a data point belongs to. Intuitively, this may not sound like it would provide much of a benefit, but in practice, random forests tend to be among the most powerful tools available.
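The voting can be made explicit. Strictly speaking, sklearn's RandomForestClassifier averages each tree's class probabilities (soft voting) rather than counting hard votes; a sketch using the data from the previous section:

from sklearn.ensemble import RandomForestClassifier

# Fit a small forest, then reproduce its predictions by averaging the class
# probabilities of the individual trees in estimators_.
demo_forest = RandomForestClassifier(n_estimators=25, random_state=59)
demo_forest.fit(bcw_train, bcw_target_train)

avg_proba = np.mean([tree.predict_proba(bcw_test.values) for tree in demo_forest.estimators_], axis=0)
voted = demo_forest.classes_[avg_proba.argmax(axis=1)]
print((voted == demo_forest.predict(bcw_test)).mean())  # agreement with the forest should be 1.0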
1 - Using the same data as above, fit a Random Forest model using the default settings for the hyperparameters in sklearn. How do your results compare to the single decision tree model?
In [21]:
from sklearn.ensemble import RandomForestClassifier
In [22]:
forest = RandomForestClassifier(random_state=59)
forest.fit(bcw_train, bcw_target_train)
print(forest.score(bcw_train, bcw_target_train))
print(forest.score(bcw_test, bcw_target_test))
There is less overfitting and the testing accuracy is about one percentage point higher.
2 - Now try different values of the max_features parameter and report the training and testing errors. Comment on your results.
In [23]:
for n_features in np.arange(1, 10):
    forest.set_params(max_features=n_features)
    forest.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} features: {}'.format(n_features, forest.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} features: {}'.format(n_features, forest.score(bcw_test, bcw_target_test)))
    print('\n')
In general we get better generalization than with the tree alone; the best result is with 2 features, which gives 97.5% testing accuracy.
3 - Now try out a few different values of the n_estimators parameter. Comment on your results.
In [24]:
forest = RandomForestClassifier(random_state=59)
for n_estimators in [3, 10, 30, 100, 300, 1000]:
    forest.set_params(n_estimators=n_estimators)
    forest.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} estimators: {}'.format(n_estimators, forest.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} estimators: {}'.format(n_estimators, forest.score(bcw_test, bcw_target_test)))
    print('\n')
Increasing the number of estimators improves both the training and the testing accuracy, so we reach perfect training accuracy without overfitting the model.
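Since each tree is trained on a bootstrap sample, the rows it never saw give a built-in generalization estimate, the out-of-bag score, without touching the test set. A sketch with the same training data:

from sklearn.ensemble import RandomForestClassifier

# oob_score=True records the accuracy of each tree on the rows left out of its bootstrap sample.
oob_forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=59)
oob_forest.fit(bcw_train, bcw_target_train)
print('Out-of-bag accuracy: {}'.format(oob_forest.oob_score_))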
4 - Now try a few different values for the min_samples_split parameter. Then build a final model and comment on your results.
In [25]:
forest = RandomForestClassifier(random_state=59)
for min_samples in 2**np.arange(1, 10):
    forest.set_params(min_samples_split=min_samples)
    forest.fit(bcw_train, bcw_target_train)
    print('Training accuracy for {} samples split: {}'.format(min_samples, forest.score(bcw_train, bcw_target_train)))
    print('Testing accuracy for {} samples split: {}'.format(min_samples, forest.score(bcw_test, bcw_target_test)))
    print('\n')
It seems that values between 4 and 64 give the best generalization, with the best result at 16.
In [26]:
forest.set_params(min_samples_split=8, n_estimators=300, max_features=2)
forest.fit(bcw_train, bcw_target_train)
print('Training accuracy: {}'.format(forest.score(bcw_train, bcw_target_train)))
print('Testing accuracy: {}'.format(forest.score(bcw_test, bcw_target_test)))
97.5% is good, though it seems to me that the real benefit comes from the choice of the max_features parameter.
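A more systematic alternative to these manual loops would be GridSearchCV, which cross-validates every combination instead of relying on a single train/test split; a sketch over a small grid built from the values explored above:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],
    'max_features': [2, 3, 4],
    'min_samples_split': [2, 8, 16],
}
grid = GridSearchCV(RandomForestClassifier(random_state=59), param_grid, cv=5)
grid.fit(bcw_train, bcw_target_train)
print(grid.best_params_)
print('CV accuracy: {}'.format(grid.best_score_))
print('Test accuracy: {}'.format(grid.score(bcw_test, bcw_target_test)))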
5 - Determine the most important three features in your model, and then fit a new model using only those three features. Comment on your results.
In [27]:
pd.DataFrame(data=forest.feature_importances_,
             index=list(bcw_train.columns),
             columns=['importance']).sort_values(by='importance', ascending=False)
Out[27]:
In [28]:
bcw_train_red = bcw_train.loc[:, ['cell_size', 'cell_shape', 'bare_nuclei']]
bcw_test_red = bcw_test.loc[:, ['cell_size', 'cell_shape', 'bare_nuclei']]
In [29]:
forest = RandomForestClassifier(random_state=59)
forest.fit(bcw_train_red, bcw_target_train)
print(forest.score(bcw_train_red, bcw_target_train))
print(forest.score(bcw_test_red, bcw_target_test))
The result is slightly worse on the training set but the same on the testing set. Let's try the best model from before:
In [30]:
forest.set_params(min_samples_split=8, n_estimators=300, max_features=2)
forest.fit(bcw_train_red, bcw_target_train)
print('Training accuracy: {}'.format(forest.score(bcw_train_red, bcw_target_train)))
print('Testing accuracy: {}'.format(forest.score(bcw_test_red, bcw_target_test)))
Here the result is worse, but these hyperparameters weren't tuned for this reduced set of features!
6 - Fit two other classification models of your choice to the reduced data and comment on your results, taking into consideration your results in the Decision Trees section above.
In [31]:
KNN = KNeighborsClassifier()
KNN.fit(bcw_train_red, bcw_target_train)
print(KNN.score(bcw_train_red, bcw_target_train))
print(KNN.score(bcw_test_red, bcw_target_test))
# previous results:
# 0.983263598326
# 0.960975609756
In [32]:
NB = GaussianNB()
NB.fit(bcw_train_red, bcw_target_train)
print(NB.score(bcw_train_red, bcw_target_train))
print(NB.score(bcw_test_red, bcw_target_test))
# previous results:
# 0.960251046025
# 0.960975609756
Compared to the full data, we get very similar results for Gaussian Naive Bayes and slightly worse results for K-Nearest Neighbors; both are very close to the forest's accuracies, though.
So far we have only used decision trees for classification problems, but, as it turns out, we can also use them for regression problems. Again, using trees tends to be computationally very expensive, but they are undoubtedly one of the most powerful tools you'll have at your disposal, especially for modeling non-linear data.
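As a quick illustration of the regression case (on toy data, separate from the concrete data set below), a regression tree fits a piecewise-constant approximation to a nonlinear function:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D example: fit y = sin(x) with a shallow regression tree.
rng = np.random.RandomState(0)
X_toy = np.sort(5 * rng.rand(80, 1), axis=0)
y_toy = np.sin(X_toy).ravel()

toy_tree = DecisionTreeRegressor(max_depth=3)
toy_tree.fit(X_toy, y_toy)
print(toy_tree.predict([[1.5], [3.0]]))  # step-wise estimates of sin(1.5) and sin(3.0)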
1 - Head over to the Machine Learning Repository, download the Concrete Compressive Strength Data Set, put it into a dataframe, and split into training and test sets. Be sure to familiarize yourself with the data before proceeding.
In [33]:
concrete = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls')
concrete.columns = ['cement', 'blast_furnace_slag', 'fly_ash', 'water', 'superplasticizer',
                    'coarse_aggregate', 'fine_aggregate', 'age', 'concrete_compressive_strength']
In [34]:
concrete.head()
Out[34]:
In [35]:
concrete.info()
In [36]:
concrete.describe().T
Out[36]:
In [37]:
Xtrain, Xtest, ytrain, ytest = train_test_split(concrete.iloc[:, :-1], concrete.iloc[:, -1], test_size=0.3, random_state=78)
2 - Fit a multilinear regression model to the data and print the error.
In [38]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lr = LinearRegression()
lr.fit(Xtrain, ytrain)
print(mean_squared_error(lr.predict(Xtrain), ytrain))
print(mean_squared_error(lr.predict(Xtest), ytest))
The model is pretty bad, but as the data description says, this relationship is highly nonlinear. Also, we haven't normalized the features, although for ordinary least squares that mainly affects the interpretation of the coefficients rather than the quality of the fit.
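As a quick check that scaling is not the main problem here, we can standardize the features and refit; for ordinary least squares the errors should come out essentially unchanged:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Standardize, then refit plain least squares; compare with the unscaled fit above.
scaled_lr = make_pipeline(StandardScaler(), LinearRegression())
scaled_lr.fit(Xtrain, ytrain)
print(mean_squared_error(ytrain, scaled_lr.predict(Xtrain)))
print(mean_squared_error(ytest, scaled_lr.predict(Xtest)))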
3 - Fit a decision tree regressor to the data, using the default values for the hyperparameters in sklearn, and print the error. Comment on your results.
In [59]:
from sklearn.tree import DecisionTreeRegressor
treereg = DecisionTreeRegressor(random_state=68)
treereg.fit(Xtrain, ytrain)
print(mean_squared_error(treereg.predict(Xtrain), ytrain))
print(mean_squared_error(treereg.predict(Xtest), ytest))
With this model the training error is very low but the generalization is poor. Still, the model is far better than linear regression.
4 - Determine the top three features, and fit a new decision tree regressor to the data. Comment on your results.
In [60]:
pd.DataFrame(data=treereg.feature_importances_,
             index=list(Xtrain.columns),
             columns=['importance']).sort_values(by='importance', ascending=False)
Out[60]:
In [61]:
Xtrain_red = Xtrain.loc[:, ['cement', 'age', 'blast_furnace_slag']]
Xtest_red = Xtest.loc[:, ['cement', 'age', 'blast_furnace_slag']]
In [62]:
treereg.fit(Xtrain_red, ytrain)
print(mean_squared_error(treereg.predict(Xtrain_red), ytrain))
print(mean_squared_error(treereg.predict(Xtest_red), ytest))
The result is worse; I would maybe also include the fourth-best feature, which has an importance very similar to the third, to get results closer to the full model:
In [63]:
Xtrain_red = Xtrain.loc[:, ['cement', 'age', 'blast_furnace_slag', 'water']]
Xtest_red = Xtest.loc[:, ['cement', 'age', 'blast_furnace_slag', 'water']]
In [64]:
treereg.fit(Xtrain_red, ytrain)
print(mean_squared_error(treereg.predict(Xtrain_red), ytrain))
print(mean_squared_error(treereg.predict(Xtest_red), ytest))
Yep, it's almost the same now.
5 - Tweak the hyperparameters a bit, until you get a fit with a decision tree that you are comfortable with, then print the error and $R^2$ of your final model. Comment on your results.
In [65]:
scores = []
for min_samples in 2**np.arange(1, 10):
    for max_features in np.arange(1, 9):
        for min_impurity_decrease in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30):
            treereg.set_params(min_samples_split=min_samples, max_features=max_features,
                               min_impurity_decrease=min_impurity_decrease)
            treereg.fit(Xtrain, ytrain)
            train_MSE = mean_squared_error(treereg.predict(Xtrain), ytrain)
            test_MSE = mean_squared_error(treereg.predict(Xtest), ytest)
            scores.append([min_samples, max_features, min_impurity_decrease, train_MSE, test_MSE])
scores_df = pd.DataFrame(data=scores, columns=['min_samples', 'max_features', 'min_impurity_decrease', 'train_MSE', 'test_MSE'])
In [66]:
scores_df.sort_values(by='test_MSE')
Out[66]:
In [53]:
from sklearn.metrics import r2_score
In [67]:
treereg.set_params(min_samples_split=2, max_features=7, min_impurity_decrease=0.001)
treereg.fit(Xtrain, ytrain)
print('Training MSE: {}'.format(mean_squared_error(treereg.predict(Xtrain), ytrain)))
print('Testing MSE: {}'.format(mean_squared_error(treereg.predict(Xtest), ytest)))
# r2_score expects (y_true, y_pred): the argument order doesn't matter for MSE, but it does for R^2
print('Training R^2: {}'.format(r2_score(ytrain, treereg.predict(Xtrain))))
print('Testing R^2: {}'.format(r2_score(ytest, treereg.predict(Xtest))))
# with this argument order, R^2 and model.score are the same!
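As a side note, unlike MSE, R^2 is not symmetric in its arguments, so the (y_true, y_pred) order matters; the estimator's own score method computes the same quantity. A quick check:

# Sanity check: the regressor's score() is R^2 with the correct argument order.
print(treereg.score(Xtest, ytest))
print(r2_score(ytest, treereg.predict(Xtest)))
print(r2_score(treereg.predict(Xtest), ytest))  # swapped arguments give a different value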
6 - Fit a polynomial regression model to the data and print the error. Comment on your results.
In [82]:
from sklearn.preprocessing import PolynomialFeatures
quadratic = PolynomialFeatures(degree=2)
cubic = PolynomialFeatures(degree=3)
quartic = PolynomialFeatures(degree=4)
Xtrain2 = quadratic.fit_transform(Xtrain)
Xtest2 = quadratic.transform(Xtest)  # only transform the test set; the expansion was fit on the training data
Xtrain3 = cubic.fit_transform(Xtrain)
Xtest3 = cubic.transform(Xtest)
Xtrain4 = quartic.fit_transform(Xtrain)
Xtest4 = quartic.transform(Xtest)
In [83]:
lr.fit(Xtrain2, ytrain)
print('Poly of degree 2, training: {}'.format(mean_squared_error(lr.predict(Xtrain2), ytrain)))
print('Poly of degree 2, testing: {}'.format(mean_squared_error(lr.predict(Xtest2), ytest)))
lr.fit(Xtrain3, ytrain)
print('Poly of degree 3, training: {}'.format(mean_squared_error(lr.predict(Xtrain3), ytrain)))
print('Poly of degree 3, testing: {}'.format(mean_squared_error(lr.predict(Xtest3), ytest)))
lr.fit(Xtrain4, ytrain)
print('Poly of degree 4, training: {}'.format(mean_squared_error(lr.predict(Xtrain4), ytrain)))
print('Poly of degree 4, testing: {}'.format(mean_squared_error(lr.predict(Xtest4), ytest)))
We get better generalization for degrees 2 and 3, but the result is still worse than the decision tree's. For degree 4 we get a lot of overfitting.
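A Pipeline would avoid transforming the train and test sets by hand (and makes it impossible to accidentally refit the expansion on the test data); a sketch for the degree-2 case:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Same degree-2 model as above, but the expansion and the regression are chained,
# so the test set is only ever transformed, never refit.
poly2_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly2_model.fit(Xtrain, ytrain)
print(mean_squared_error(ytrain, poly2_model.predict(Xtrain)))
print(mean_squared_error(ytest, poly2_model.predict(Xtest)))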
7 - Fit a random forest regressor to the data, using the default values for the hyperparameters in sklearn, and print the error. Comment on your results.
In [69]:
from sklearn.ensemble import RandomForestRegressor
forestreg = RandomForestRegressor(random_state=91)
forestreg.fit(Xtrain, ytrain)
print(mean_squared_error(forestreg.predict(Xtrain), ytrain))
print(mean_squared_error(forestreg.predict(Xtest), ytest))
The training error is worse (but still very good, all things considered), while the testing error is far better, i.e. better generalization, as expected.
8 - Determine the top three features, and fit a new random forest regressor to the data. Comment on your results.
In [70]:
pd.DataFrame(data=forestreg.feature_importances_,
             index=list(Xtrain.columns),
             columns=['importance']).sort_values(by='importance', ascending=False)
Out[70]:
In [71]:
Xtrain_red = Xtrain.loc[:, ['age', 'cement', 'water']]
Xtest_red = Xtest.loc[:, ['age', 'cement', 'water']]
In [74]:
forestreg.fit(Xtrain_red, ytrain)
print(mean_squared_error(forestreg.predict(Xtrain_red), ytrain))
print(mean_squared_error(forestreg.predict(Xtest_red), ytest))
Again the result is worse, but using the top four features brings it closer to the full model:
In [75]:
Xtrain_red = Xtrain.loc[:, ['age', 'cement', 'water', 'blast_furnace_slag']]
Xtest_red = Xtest.loc[:, ['age', 'cement', 'water', 'blast_furnace_slag']]
In [76]:
forestreg.fit(Xtrain_red, ytrain)
print(mean_squared_error(forestreg.predict(Xtrain_red), ytrain))
print(mean_squared_error(forestreg.predict(Xtest_red), ytest))
9 - Tweak the hyperparameters a bit, until you get a fit with a random forest that you are comfortable with, then print the error and $R^2$ of your final model. Comment on your results.
In [79]:
forestreg.set_params(n_jobs=-1)
forest_scores = []
for n_estimators in (10, 30, 100, 300):
    for min_samples in 2**np.arange(1, 8):
        for max_features in np.arange(1, 9):
            for min_impurity_decrease in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3):
                forestreg.set_params(n_estimators=n_estimators, min_samples_split=min_samples,
                                     max_features=max_features, min_impurity_decrease=min_impurity_decrease)
                forestreg.fit(Xtrain, ytrain)
                train_MSE = mean_squared_error(forestreg.predict(Xtrain), ytrain)
                test_MSE = mean_squared_error(forestreg.predict(Xtest), ytest)
                forest_scores.append([n_estimators, min_samples, max_features, min_impurity_decrease, train_MSE, test_MSE])
    print('{} estimators done'.format(n_estimators))
forest_scores_df = pd.DataFrame(data=forest_scores, columns=['n_estimators', 'min_samples', 'max_features', 'min_impurity_decrease', 'train_MSE', 'test_MSE'])
In [80]:
forest_scores_df.sort_values(by='test_MSE')
Out[80]:
In [85]:
# from solution: both here and in the decision tree changing criterion to MAE produced an improvement
forestreg.set_params(n_estimators=100, min_samples_split=2, max_features=3, min_impurity_decrease=0.001)
forestreg.fit(Xtrain, ytrain)
print('Training MSE: {}'.format(mean_squared_error(forestreg.predict(Xtrain), ytrain)))
print('Testing MSE: {}'.format(mean_squared_error(forestreg.predict(Xtest), ytest)))
print('Training R^2: {}'.format(r2_score(ytrain, forestreg.predict(Xtrain))))  # (y_true, y_pred) order
print('Testing R^2: {}'.format(r2_score(ytest, forestreg.predict(Xtest))))
We get a fairly good model given the nonlinear data, with a testing $R^2$ above 0.86. This result is far better than our first linear model, and its MSE is almost half that of our first regression tree.