In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pickle
%matplotlib inline
For this lecture, we are going to be using these lecture notes from a Data Mining course by Cosma Shalizi (a CMU Stats professor). Be sure to read through them first...
Let's start by creating a very basic dataset with two features (x1 and x2) and a label that takes two possible values (-1 and +1).
In [67]:
d = {'Value': np.random.binomial(1, 0.5, size=(100,))*2 - 1,
     'x1': np.random.permutation(100)*15 + 3000,
     'x2': np.random.permutation(100)}
df = pd.DataFrame(d)
plt.plot(df[df.Value>0].x1,df[df.Value>0].x2, 'b+')
plt.plot(df[df.Value<=0].x1,df[df.Value<=0].x2, 'ro')
plt.show()
It would clearly be difficult for any classifier (even one as flexible as a decision tree) to find a decision boundary that separates the -1s (red circles) from the +1s (blue crosses), since the labels were assigned at random, independently of x1 and x2. This should be reflected in the objective function that is optimized when training the model (i.e., when searching for the right partitioning of the input space).
Let's see how that goes...
First, let's define the function $S$ (shown on page 5 of the PDF), and let's have it calculate the sum of squared errors over the two groups produced by a split. As input, it takes the dataset that reaches that node of the tree (d), the feature we would like to split on (f), and the threshold (t):
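For the two-way splits we will use below, $S$ boils down to the within-group sum of squared deviations of Value on each side of the threshold:

$$S = \sum_{i:\ x_{if} > t} \left(y_i - \bar{y}_{\text{left}}\right)^2 + \sum_{i:\ x_{if} \le t} \left(y_i - \bar{y}_{\text{right}}\right)^2,$$

where $y_i$ is the Value of point $i$ and $\bar{y}_{\text{left}}$, $\bar{y}_{\text{right}}$ are the mean Values of the two groups.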
In [5]:
def S(d, f, t):
    # Split the data on d[f] > t and sum the squared deviations of 'Value'
    # from its mean within each of the two resulting groups
    left = d[d[f] > t]['Value']
    right = d[d[f] <= t]['Value']
    S_left = np.sum((left - np.mean(left))**2)
    S_right = np.sum((right - np.mean(right))**2)
    return S_left + S_right
Now let's put it to use to check which variable we should split first:
In [6]:
x1thresholds = df.x1.unique() # These are the x1 thresholds
x2thresholds = df.x2.unique() # These are the x2 thresholds
# Let's list all possible thresholds and calculate their S value:
Sx1 = [S(df,'x1',t) for t in x1thresholds]
Sx2 = [S(df,'x2',t) for t in x2thresholds]
plt.plot(Sx1,'r--')
plt.plot(Sx2,'b--')
x1t = x1thresholds[np.argmin(Sx1)]
x2t = x2thresholds[np.argmin(Sx2)]
Sx1min = np.min(Sx1)
Sx2min = np.min(Sx2)
plt.plot(np.argmin(Sx1), Sx1min,'or')
plt.plot(np.argmin(Sx2), Sx2min,'or')
plt.legend(['S values for x1', 'S values for x2'])
if Sx2min > Sx1min:
    print("Looks like we should split x1 at {}".format(x1t))
else:
    print("Looks like we should split x2 at {}".format(x2t))
plt.show()
So the value of S over all possible thresholds is clearly very non-smooth for both variables, but the list of candidate thresholds is short enough that we can simply evaluate them all and pick the minimum.
So we go ahead and make the split (on x1 here, for concreteness)... two new filtered datasets:
In [6]:
branch1 = df[df.x1>x1t]
branch2 = df[df.x1<=x1t]
... and you can go from there and repeat the same process on each branch (a rough sketch of that recursion is below). But I'm tired.
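Purely as a sketch of how that recursion might look, reusing the S function above (this is not code from the notes; best_split and grow_tree are ad-hoc helpers, and max_depth / min_leaf are arbitrary stopping rules):
In [ ]:
def best_split(d, features):
    # Exhaustively try every (feature, threshold) pair and keep the one with the smallest S
    best = None
    for f in features:
        for t in d[f].unique():
            if d[d[f] > t].empty or d[d[f] <= t].empty:
                continue  # skip thresholds that leave one branch empty
            s = S(d, f, t)
            if best is None or s < best[2]:
                best = (f, t, s)
    return best

def grow_tree(d, features, depth=0, max_depth=3, min_leaf=5):
    # Stop when the node is pure, too small, or too deep; a leaf predicts the mean Value
    if depth >= max_depth or len(d) <= min_leaf or d['Value'].nunique() == 1:
        return {'predict': d['Value'].mean()}
    split = best_split(d, features)
    if split is None:
        return {'predict': d['Value'].mean()}
    f, t, _ = split
    return {'feature': f, 'threshold': t,
            'left': grow_tree(d[d[f] > t], features, depth + 1, max_depth, min_leaf),
            'right': grow_tree(d[d[f] <= t], features, depth + 1, max_depth, min_leaf)}

grow_tree(df, ['x1', 'x2'])
In practice, though, there is no need to roll our own implementation.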
So we are going to try it out with sklearn (scikit-learn):
In [8]:
from sklearn import tree
In [9]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(df[['x1','x2']], df['Value'])
In terms of training the model (i.e., fitting the tree), we are done; there is nothing else to do. The object clf (the classifier) is a fitted tree, and we can now call its methods to do fancy things:
In [49]:
# Let's look at the importance of the features
# (this is called the Gini importance metric)
print(clf.feature_importances_)
In [68]:
# Let's see how well it scores when feeding it the same dataset it was trained on:
clf.score(df[['x1','x2']], df['Value'])
Out[68]:
1.0
It got a perfect score! How can that be? That tree must be really deep. Can we visualize it? I took some code from here and tweaked it to do just that:
In [69]:
plot_step = 1
X = df[['x1','x2']]
y = df['Value']
# Let's pre-compute the range for our features
x_min, x_max = X['x1'].min() - 1, X['x1'].max() + 1
y_min, y_max = X['x2'].min() - 1, X['x2'].max() + 1
# And create a meshgrid so that we can draw a contour plot on it
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))
# Now we predict the values for all of the cells in the meshgrid
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # If you're curious what this is, read here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.c_.html and https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html
# And we reshape those results to have the same shape as the mesh
Z = Z.reshape(xx.shape)
# Now we can finally contour-plot, using a specific colormap
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
# Let's add labels to the axes
plt.xlabel('x1')
plt.ylabel('x2')
plt.axis("tight")
# Plot the training points
plt.plot(df[df.Value>0].x1,df[df.Value>0].x2, 'b+')
plt.plot(df[df.Value<=0].x1,df[df.Value<=0].x2, 'ro')
plt.axis("tight")
# And add a few more beautification items
plt.suptitle("Decision surface of our decision tree")
plt.legend(['Positives','Negatives'])
plt.show()
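As a quick check on the guess that the tree must be really deep to score perfectly, we can also ask the fitted tree directly about its size (a small aside, assuming scikit-learn 0.21 or newer for get_depth, get_n_leaves and export_text):
In [ ]:
# Query the size of the fitted tree and print its splits as text rules
print("Tree depth:", clf.get_depth())
print("Number of leaves:", clf.get_n_leaves())
print(tree.export_text(clf, feature_names=['x1', 'x2']))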
Now let's try a regression task. To begin, let's use an existing dataset.
In [70]:
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)
df = pd.DataFrame(
    data=np.column_stack((boston.data, boston.target)),
    columns=np.append(boston.feature_names, ['Median_Value'])
)
df.describe()
Now let's pick just one of those features (LSTAT) and use it to predict the target (Median_Value) with a simple regression:
In [71]:
def plot_regdata():
    plt.plot(df['LSTAT'], df['Median_Value'], 'rd')
    plt.xlabel('% lower status of the population [%]')
    plt.ylabel('Median Value [$10^3$ U.S. Dollars]')
plot_regdata()
plt.show()
Now we can fit a regression tree that predicts Median_Value from LSTAT:
In [76]:
x = df[['LSTAT']]
y = df[['Median_Value']]
# 100 evenly spaced LSTAT values on which to evaluate the fitted model
xrange = np.linspace(x['LSTAT'].min(), x['LSTAT'].max(), 100).reshape(-1, 1)
reg = tree.DecisionTreeRegressor(max_depth=2000)  # effectively unlimited depth (the default is max_depth=None); you can tweak this!
reg.fit(x, y)
plot_regdata()
plt.plot(xrange,reg.predict(xrange),'b--',linewidth=3)
plt.show()
So there you go, we fit a regression tree and it produced a very non-linear and non-smooth response over the feature 'LSTAT'. How well does it work? Let's test its score:
In [77]:
print(reg.score(x,y)) # Check out the documentation for the score method, to make sure you understand what it means.
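(For a DecisionTreeRegressor, score returns the coefficient of determination

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

i.e., the fraction of the variance in Median_Value that the model accounts for on the data you pass in; here that is the very same data it was trained on.)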
Clearly, though, this does not mean that using one single variable (% Lower Status of the Population, in this case) can explain 95% of the variance in the Median Value of a house in Boston. More likely, we have overfitted our model to the data. We haven't covered overfitting yet, or ways to prevent it, but can you think of ways to achieve a more balanced model fit?
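We will get to this properly later, but as a sketch (not part of the original notebook): one simple check is to hold out part of the data and compare training and test scores, and one simple remedy is to limit the tree's depth. The 30% test split and max_depth=3 below are arbitrary illustrative choices.
In [ ]:
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows as a test set (fixed random_state for repeatability)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

deep = tree.DecisionTreeRegressor().fit(x_train, y_train)                # unconstrained tree
shallow = tree.DecisionTreeRegressor(max_depth=3).fit(x_train, y_train)  # depth-limited tree

print("Deep tree:    train R^2 = {:.2f}, test R^2 = {:.2f}".format(
    deep.score(x_train, y_train), deep.score(x_test, y_test)))
print("Shallow tree: train R^2 = {:.2f}, test R^2 = {:.2f}".format(
    shallow.score(x_train, y_train), shallow.score(x_test, y_test)))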