Lecture #6


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import pickle

%matplotlib inline

For this lecture, we are going to be using these lecture notes from a Data Mining course by Cosma Shalizi (a CMU Stats professor). Be sure to read through them first...

Let's start by creating a very basic dataset with two features and a target ('Value') that takes two possible values (-1 and 1).


In [67]:
d = {'Value': np.random.binomial(1, 0.5,size=(100,))*2-1,
                  'x1' : np.random.permutation(100)*15+3000,
                  'x2' : np.random.permutation(100)}
df = pd.DataFrame(d)

plt.plot(df[df.Value>0].x1,df[df.Value>0].x2, 'b+')
plt.plot(df[df.Value<=0].x1,df[df.Value<=0].x2, 'ro')
plt.show()


It would clearly be difficult for a classifier (even a flexible one, like a decision tree) to find a decision boundary that separates the -1s (red circles) from the +1s (blue crosses). This difficulty should be reflected in the objective function that is optimized when training the model (i.e., when searching for the right partitioning of the input space).

Let's see how that goes...

First, let's define the function $S$ (shown on page 5 of the PDF), and let's have it calculate the sum of squared errors over the two branches created by a split. As input, it will take the dataset that goes through that node of the tree (d), the feature we would like to use for splitting (f) and the threshold (t):


In [5]:
def S(d,f,t):
    # Split the target values according to whether feature f is above or below threshold t
    left = d[d[f]>t]['Value']
    right = d[d[f]<=t]['Value']

    # Sum of squared errors around each branch's mean
    S_left = np.sum((left-np.mean(left))**2)
    S_right = np.sum((right-np.mean(right))**2)

    return S_left + S_right
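
In symbols (just reconstructing what the code above computes), splitting the data $d$ that reaches a node on feature $f$ at threshold $t$ gives

$$S(d,f,t) = \sum_{i:\ x_{i,f} > t}\left(y_i - \bar{y}_{\mathrm{left}}\right)^2 + \sum_{i:\ x_{i,f} \le t}\left(y_i - \bar{y}_{\mathrm{right}}\right)^2,$$

where $y_i$ is the 'Value' column and $\bar{y}_{\mathrm{left}}$, $\bar{y}_{\mathrm{right}}$ are the mean values in the two branches.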

Now let's put it to use to check which variable we should split on first:


In [6]:
x1thresholds = df.x1.unique() # These are the x1 thresholds
x2thresholds = df.x2.unique() # These are the x2 thresholds

# Let's list all possible thresholds and calculate their S value:
Sx1 = [S(df,'x1',t) for t in x1thresholds]
Sx2 = [S(df,'x2',t) for t in x2thresholds]

plt.plot(Sx1,'r--')
plt.plot(Sx2,'b--')

x1t = x1thresholds[np.argmin(Sx1)]
x2t = x2thresholds[np.argmin(Sx2)]

Sx1min = np.min(Sx1)
Sx2min = np.min(Sx2)

plt.plot(np.argmin(Sx1), Sx1min,'or')
plt.plot(np.argmin(Sx2), Sx2min,'or')
plt.legend(['S values for x1', 'S values for x2'])

if Sx2min > Sx1min:
    print("Looks like we should split x1 at {}".format(x1t))
else:
    print("Looks like we should split x2 at {}".format(x2t))
        
plt.show()


Looks like we should split x1 at 3030

The output of S over all possible thresholds for each variable is clearly very non-smooth, but the list of candidate thresholds is not that long, so we can simply evaluate all of them and pick the minimum.
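
Just to put a number on 'not that long': every unique feature value is a candidate threshold, so the exhaustive search above only needed a couple of hundred evaluations of S. A quick check, using the threshold arrays already defined above:

print(len(x1thresholds) + len(x2thresholds), "candidate thresholds evaluated in total")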

So we go ahead and make the split... Two new filtered datasets:


In [6]:
branch1 = df[df.x1>x1t]
branch2 = df[df.x1<=x1t]

... and you can go from there to repeat the process. But I'm tired.
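
In case you do want to keep going, here is a minimal sketch of that recursion, reusing the function S and the DataFrame df from above. The helper names (best_split, grow_tree) and the stopping rules (max_depth, min_samples) are illustrative choices of mine, not anything defined in the lecture notes:

def best_split(d, features=('x1', 'x2')):
    # Try every feature/threshold pair and keep the one with the smallest S.
    best = None
    for f in features:
        # Skip the largest value so that both branches are guaranteed to be non-empty.
        for t in sorted(d[f].unique())[:-1]:
            s = S(d, f, t)
            if best is None or s < best[0]:
                best = (s, f, t)
    return best  # (S value, feature, threshold), or None if no split is possible

def grow_tree(d, depth=0, max_depth=3, min_samples=5):
    # Stop when the node is pure, too small, or too deep, and predict the mean value.
    if depth >= max_depth or len(d) <= min_samples or d['Value'].nunique() == 1:
        return {'predict': d['Value'].mean()}
    split = best_split(d)
    if split is None:
        return {'predict': d['Value'].mean()}
    _, f, t = split
    return {'feature': f, 'threshold': t,
            'left':  grow_tree(d[d[f] > t], depth + 1, max_depth, min_samples),
            'right': grow_tree(d[d[f] <= t], depth + 1, max_depth, min_samples)}

grown = grow_tree(df)  # a nested dict representing the fitted tree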

So we are going to try it out with sklearn (scikit-learn):


In [8]:
from sklearn import tree

In [9]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(df[['x1','x2']], df['Value'])

In terms of training the model (i.e. fitting the tree) we are done. Nothing else to do. The object clf (classifier) is a tree and it has been trained. We can now call its methods to do fancy things:


In [49]:
# Let's look at the importance of the features 
# (this is called a Gini importance metric)
print(clf.feature_importances_)



In [68]:
# Let's see how well it scores when feeding it the same dataset it was trained on:
clf.score(df[['x1','x2']], df['Value'])


Out[68]:
1.0

It got a perfect score! How can that be? That tree must be really deep. Can we visualize it? I took some code from here and tweaked it to do just that:


In [69]:
plot_step = 1

X = df[['x1','x2']]
y = df['Value']

# Let's pre-compute the range for our features
x_min, x_max = X['x1'].min() - 1, X['x1'].max() + 1
y_min, y_max = X['x2'].min() - 1, X['x2'].max() + 1
# And create a meshgrid so that we can draw a contour plot on it
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

# Now we predict the values for all of the cells in the meshgrid
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # If you're curious what this is, read here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.c_.html and https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html
# And we reshape those results to have the same shape as the mesh
Z = Z.reshape(xx.shape)
# Now we can finally contour-plot, using a specific colormap
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

# Let's add labels to the axes
plt.xlabel('x1')
plt.ylabel('x2')
plt.axis("tight")

# Plot the training points
plt.plot(df[df.Value>0].x1,df[df.Value>0].x2, 'b+')
plt.plot(df[df.Value<=0].x1,df[df.Value<=0].x2, 'ro')
plt.axis("tight")

# And add a few more beautification items
plt.suptitle("Decision surface of our decision tree")
plt.legend(['Positives','Negatives'])
plt.show()
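
To answer the "that tree must be really deep" question more directly, you can also interrogate the fitted tree itself. The calls below assume a reasonably recent scikit-learn (0.21 or newer) and are an extra check I'm adding, not part of the original lecture code:

from sklearn.tree import export_text

# Depth and number of leaves of the fitted classifier
print("depth:", clf.get_depth(), "- leaves:", clf.get_n_leaves())

# The learned split rules, printed as indented text
print(export_text(clf, feature_names=['x1', 'x2']))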


Now let's try a regression task. To begin, let's use an existing dataset.


In [70]:
from sklearn.datasets import load_boston

print(load_boston().DESCR)

df = pd.DataFrame(
        data=np.column_stack((load_boston().data,load_boston().target)),
        columns=np.append(load_boston().feature_names,['Median_Value'])
    )

df.describe()


Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

Out[70]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT Median_Value
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000

Now let's pick just two of those variables to do a simple regression:


In [71]:
def plot_regdata():
    plt.plot(df['LSTAT'],df['Median_Value'],'rd')
    plt.xlabel('% lower status of the population [%]')
    plt.ylabel('Median Value [$10^3$ U.S. Dollars]')

plot_regdata()
plt.show()


Now we can fit a regression tree on these two variables:


In [76]:
x = df[['LSTAT']]
y = df[['Median_Value']]
xrange = np.linspace(x.values.min(), x.values.max(), 100).reshape(-1, 1) # 100 evenly spaced query points spanning the LSTAT range

reg = tree.DecisionTreeRegressor(max_depth=2000) # A max_depth this large effectively lets the tree grow without restriction, though you can tweak these parameters!
reg.fit(x,y)

plot_regdata()
plt.plot(xrange,reg.predict(xrange),'b--',linewidth=3)
plt.show()


So there you go, we fit a regression tree and it produced a very non-linear and non-smooth response over the feature 'LSTAT'. How well does it work? Let's test its score:


In [77]:
print(reg.score(x,y)) # Check out the documentation for the score method, to make sure you understand what it means.


0.959008812687

Clearly, though, this does not mean that a single variable (% lower status of the population, in this case) really explains about 96% of the variance in the Median Value of a house in Boston. More likely, we have overfitted our model to the data. We haven't covered overfitting yet, or ways to prevent it, but can you think of ways to achieve a more balanced model fit?
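
One simple recipe (just a sketch of an answer, using a scikit-learn helper we haven't discussed yet; the 70/30 split and max_depth=3 below are arbitrary illustrative choices) is to hold out part of the data and restrict how deep the tree is allowed to grow:

from sklearn.model_selection import train_test_split

# Keep 30% of the rows aside so we can measure performance on data the tree never saw
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# A shallow tree is far less able to memorize the training data
shallow_reg = tree.DecisionTreeRegressor(max_depth=3)
shallow_reg.fit(x_train, y_train)

print("train R^2:", shallow_reg.score(x_train, y_train))
print("test  R^2:", shallow_reg.score(x_test, y_test))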