We covered a lot of information today and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments along with the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later)

1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)


In [3]:
import pandas as pd
%matplotlib inline
import numpy as np

In [32]:
from sklearn import tree
from sklearn import datasets
from sklearn import cross_validation
from sklearn import metrics

In [7]:
iris = datasets.load_iris() # load iris data set
x = iris.data[:,2:] # the attributes
y = iris.target # the target variable

In [9]:
x_train, x_test, y_train, y_test = cross_validation.train_test_split(x,y,test_size=0.5)

In [11]:
dt = tree.DecisionTreeClassifier()

In [12]:
dt = dt.fit(x_train,y_train)

In [29]:
#from Learning scikit-learn: Machine Learning in Python
def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confussion_matrix=True):
    y_pred=clf.predict(X)
    if show_accuracy:
        print "Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n"
    if show_classification_report:
        print "Classification report"
        print metrics.classification_report(y,y_pred),"\n"
    if show_confussion_matrix:
        print "Confusion matrix"
        print metrics.confusion_matrix(y,y_pred),"\n"

In [16]:
measure_performance(x_train, y_train, dt)


Accuracy:1.000 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        22
          1       1.00      1.00      1.00        27
          2       1.00      1.00      1.00        26

avg / total       1.00      1.00      1.00        75


Confusion matrix
[[22  0  0]
 [ 0 27  0]
 [ 0  0 26]] 


In [17]:
measure_performance(x_test,y_test,dt)


Accuracy:0.960 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        28
          1       0.95      0.91      0.93        23
          2       0.92      0.96      0.94        24

avg / total       0.96      0.96      0.96        75


Confusion matrix
[[28  0  0]
 [ 0 21  2]
 [ 0  1 23]] 


In [ ]:
#pretty good results (96% accuracy, with high precision and recall)

In [18]:
#visualize the model
from sklearn.externals.six import StringIO
import pydotplus #pip install pydotplus

In [19]:
with open("iris_50.dot", 'w') as f: #output the .dot file
    f = tree.export_graphviz(dt, out_file=f)

In [20]:
import os
os.unlink('iris_50.dot') #remove the file from the file path

In [21]:
dot_data = StringIO() 
tree.export_graphviz(dt, out_file=dot_data) #brew install graphviz
graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 
graph.write_pdf("iris_50.pdf")


/Users/richarddunks/anaconda/lib/python2.7/site-packages/pyparsing.py:3546: DeprecationWarning: Operator '<<' is deprecated, use '<<=' instead
  ret << Group( Suppress(opener) + ZeroOrMore( ignoreExpr | ret | content ) + Suppress(closer) )
/Users/richarddunks/anaconda/lib/python2.7/site-packages/pydotplus/parser.py:490: DeprecationWarning: Operator '<<' is deprecated, use '<<=' instead
  'edge_point'
/Users/richarddunks/anaconda/lib/python2.7/site-packages/pydotplus/parser.py:502: DeprecationWarning: Operator '<<' is deprecated, use '<<=' instead
  stmt_list << OneOrMore(stmt + Optional(semi.suppress()))
Out[21]:
True

In [23]:
from wand.image import Image as WImage 
img = WImage(filename='iris_50.pdf')
img


Out[23]:

2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be.


In [28]:
x_train_75, x_test_25, y_train_75, y_test_25 = cross_validation.train_test_split(x,y,train_size=0.75)

In [29]:
dt_75_25 = tree.DecisionTreeClassifier()

In [30]:
dt_75_25 = dt_75_25.fit(x_train_75,y_train_75)

In [31]:
measure_performance(x_train_75, y_train_75, dt)


Accuracy:0.982 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        37
          1       0.97      0.97      0.97        37
          2       0.97      0.97      0.97        38

avg / total       0.98      0.98      0.98       112


Confusion matrix
[[37  0  0]
 [ 0 36  1]
 [ 0  1 37]] 


In [32]:
measure_performance(x_test_25, y_test_25, dt)


Accuracy:0.974 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        13
          1       1.00      0.92      0.96        13
          2       0.92      1.00      0.96        12

avg / total       0.98      0.97      0.97        38


Confusion matrix
[[13  0  0]
 [ 0 12  1]
 [ 0  0 12]] 


In [34]:
#interestingly, our performance deteriorated on our training set with more examples (100% vs 98.2%), 
#but increased on our test set (97% vs 96%), likely due to more examples

3. Perform 10-fold cross validation on the data and compare your results to the hold out method we used in 1 and 2. Take the average of the results. What do you notice about the accuracy measures in each of these?


In [1]:
from sklearn import cross_validation

In [5]:
iris = datasets.load_iris() # load iris data set
x = iris.data[:,2:] # the attributes
y = iris.target # the target variable

In [6]:
dt = tree.DecisionTreeClassifier()

In [7]:
dt = dt.fit(x,y) #build the model on all the data and then test with cross-fold validation

In [12]:
cv = cross_validation.KFold(len(x),10,shuffle=True,random_state=0)

The method above is a more elaborated way of creating the cross-folds. cross_val_score is already doing this under the hood. We're just making it explicit


In [14]:
scores = cross_validation.cross_val_score(dt,x,y,cv=cv)

In [15]:
scores.mean()


Out[15]:
0.94000000000000006

Based on this result, it's likely our model will achieve a 94% accuracy on unseen data, rather than the 97% predicted with the hold-out method

4. Open the seeds_dataset.txt and perform basic exploratory analysis. What attributes to we have? What are we trying to predict?

For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/seeds


In [41]:
df = pd.read_csv("data/seeds_dataset.txt",header=None)

In [42]:
df


Out[42]:
0 1 2 3 4 5 6 7
0 15.26 14.84 0.8710 5.763 3.312 2.2210 5.220 1
1 14.88 14.57 0.8811 5.554 3.333 1.0180 4.956 1
2 14.29 14.09 0.9050 5.291 3.337 2.6990 4.825 1
3 13.84 13.94 0.8955 5.324 3.379 2.2590 4.805 1
4 16.14 14.99 0.9034 5.658 3.562 1.3550 5.175 1
5 14.38 14.21 0.8951 5.386 3.312 2.4620 4.956 1
6 14.69 14.49 0.8799 5.563 3.259 3.5860 5.219 1
7 14.11 14.10 0.8911 5.420 3.302 2.7000 5.000 1
8 16.63 15.46 0.8747 6.053 3.465 2.0400 5.877 1
9 16.44 15.25 0.8880 5.884 3.505 1.9690 5.533 1
10 15.26 14.85 0.8696 5.714 3.242 4.5430 5.314 1
11 14.03 14.16 0.8796 5.438 3.201 1.7170 5.001 1
12 13.89 14.02 0.8880 5.439 3.199 3.9860 4.738 1
13 13.78 14.06 0.8759 5.479 3.156 3.1360 4.872 1
14 13.74 14.05 0.8744 5.482 3.114 2.9320 4.825 1
15 14.59 14.28 0.8993 5.351 3.333 4.1850 4.781 1
16 13.99 13.83 0.9183 5.119 3.383 5.2340 4.781 1
17 15.69 14.75 0.9058 5.527 3.514 1.5990 5.046 1
18 14.70 14.21 0.9153 5.205 3.466 1.7670 4.649 1
19 12.72 13.57 0.8686 5.226 3.049 4.1020 4.914 1
20 14.16 14.40 0.8584 5.658 3.129 3.0720 5.176 1
21 14.11 14.26 0.8722 5.520 3.168 2.6880 5.219 1
22 15.88 14.90 0.8988 5.618 3.507 0.7651 5.091 1
23 12.08 13.23 0.8664 5.099 2.936 1.4150 4.961 1
24 15.01 14.76 0.8657 5.789 3.245 1.7910 5.001 1
25 16.19 15.16 0.8849 5.833 3.421 0.9030 5.307 1
26 13.02 13.76 0.8641 5.395 3.026 3.3730 4.825 1
27 12.74 13.67 0.8564 5.395 2.956 2.5040 4.869 1
28 14.11 14.18 0.8820 5.541 3.221 2.7540 5.038 1
29 13.45 14.02 0.8604 5.516 3.065 3.5310 5.097 1
... ... ... ... ... ... ... ... ...
180 11.41 12.95 0.8560 5.090 2.775 4.9570 4.825 3
181 12.46 13.41 0.8706 5.236 3.017 4.9870 5.147 3
182 12.19 13.36 0.8579 5.240 2.909 4.8570 5.158 3
183 11.65 13.07 0.8575 5.108 2.850 5.2090 5.135 3
184 12.89 13.77 0.8541 5.495 3.026 6.1850 5.316 3
185 11.56 13.31 0.8198 5.363 2.683 4.0620 5.182 3
186 11.81 13.45 0.8198 5.413 2.716 4.8980 5.352 3
187 10.91 12.80 0.8372 5.088 2.675 4.1790 4.956 3
188 11.23 12.82 0.8594 5.089 2.821 7.5240 4.957 3
189 10.59 12.41 0.8648 4.899 2.787 4.9750 4.794 3
190 10.93 12.80 0.8390 5.046 2.717 5.3980 5.045 3
191 11.27 12.86 0.8563 5.091 2.804 3.9850 5.001 3
192 11.87 13.02 0.8795 5.132 2.953 3.5970 5.132 3
193 10.82 12.83 0.8256 5.180 2.630 4.8530 5.089 3
194 12.11 13.27 0.8639 5.236 2.975 4.1320 5.012 3
195 12.80 13.47 0.8860 5.160 3.126 4.8730 4.914 3
196 12.79 13.53 0.8786 5.224 3.054 5.4830 4.958 3
197 13.37 13.78 0.8849 5.320 3.128 4.6700 5.091 3
198 12.62 13.67 0.8481 5.410 2.911 3.3060 5.231 3
199 12.76 13.38 0.8964 5.073 3.155 2.8280 4.830 3
200 12.38 13.44 0.8609 5.219 2.989 5.4720 5.045 3
201 12.67 13.32 0.8977 4.984 3.135 2.3000 4.745 3
202 11.18 12.72 0.8680 5.009 2.810 4.0510 4.828 3
203 12.70 13.41 0.8874 5.183 3.091 8.4560 5.000 3
204 12.37 13.47 0.8567 5.204 2.960 3.9190 5.001 3
205 12.19 13.20 0.8783 5.137 2.981 3.6310 4.870 3
206 11.23 12.88 0.8511 5.140 2.795 4.3250 5.003 3
207 13.20 13.66 0.8883 5.236 3.232 8.3150 5.056 3
208 11.84 13.21 0.8521 5.175 2.836 3.5980 5.044 3
209 12.30 13.34 0.8684 5.243 2.974 5.6370 5.063 3

210 rows × 8 columns


In [43]:
df.describe()


Out[43]:
0 1 2 3 4 5 6 7
count 210.000000 210.000000 210.000000 210.000000 210.000000 210.000000 210.000000 210.000000
mean 14.847524 14.559286 0.870999 5.628533 3.258605 3.700201 5.408071 2.000000
std 2.909699 1.305959 0.023629 0.443063 0.377714 1.503557 0.491480 0.818448
min 10.590000 12.410000 0.808100 4.899000 2.630000 0.765100 4.519000 1.000000
25% 12.270000 13.450000 0.856900 5.262250 2.944000 2.561500 5.045000 1.000000
50% 14.355000 14.320000 0.873450 5.523500 3.237000 3.599000 5.223000 2.000000
75% 17.305000 15.715000 0.887775 5.979750 3.561750 4.768750 5.877000 3.000000
max 21.180000 17.250000 0.918300 6.675000 4.033000 8.456000 6.550000 3.000000

In [44]:
from pandas.tools.plotting import scatter_matrix

In [45]:
scatter_matrix(df,alpha=0.2, figsize=(10, 10), diagonal='kde')


Out[45]:
array([[<matplotlib.axes.AxesSubplot object at 0x10cfaf590>,
        <matplotlib.axes.AxesSubplot object at 0x10bdaad50>,
        <matplotlib.axes.AxesSubplot object at 0x10df0d490>,
        <matplotlib.axes.AxesSubplot object at 0x10df87ad0>,
        <matplotlib.axes.AxesSubplot object at 0x10e98ee90>,
        <matplotlib.axes.AxesSubplot object at 0x10da1bf90>,
        <matplotlib.axes.AxesSubplot object at 0x10db90a50>,
        <matplotlib.axes.AxesSubplot object at 0x10e9e4d10>],
       [<matplotlib.axes.AxesSubplot object at 0x10f08cad0>,
        <matplotlib.axes.AxesSubplot object at 0x10f228810>,
        <matplotlib.axes.AxesSubplot object at 0x10f2ae5d0>,
        <matplotlib.axes.AxesSubplot object at 0x10f313510>,
        <matplotlib.axes.AxesSubplot object at 0x10f397550>,
        <matplotlib.axes.AxesSubplot object at 0x10f337b50>,
        <matplotlib.axes.AxesSubplot object at 0x10f58d2d0>,
        <matplotlib.axes.AxesSubplot object at 0x10f614090>],
       [<matplotlib.axes.AxesSubplot object at 0x10f677850>,
        <matplotlib.axes.AxesSubplot object at 0x10f6fe710>,
        <matplotlib.axes.AxesSubplot object at 0x10f7664d0>,
        <matplotlib.axes.AxesSubplot object at 0x10f7e95d0>,
        <matplotlib.axes.AxesSubplot object at 0x10f96e390>,
        <matplotlib.axes.AxesSubplot object at 0x10f9df290>,
        <matplotlib.axes.AxesSubplot object at 0x10fb67050>,
        <matplotlib.axes.AxesSubplot object at 0x10fbbff50>],
       [<matplotlib.axes.AxesSubplot object at 0x10fd41f90>,
        <matplotlib.axes.AxesSubplot object at 0x10fbe4650>,
        <matplotlib.axes.AxesSubplot object at 0x10fe3ad10>,
        <matplotlib.axes.AxesSubplot object at 0x10febdad0>,
        <matplotlib.axes.AxesSubplot object at 0x10fefdf90>,
        <matplotlib.axes.AxesSubplot object at 0x11007bd50>,
        <matplotlib.axes.AxesSubplot object at 0x1100e5990>,
        <matplotlib.axes.AxesSubplot object at 0x11017b790>],
       [<matplotlib.axes.AxesSubplot object at 0x1101b7910>,
        <matplotlib.axes.AxesSubplot object at 0x110266750>,
        <matplotlib.axes.AxesSubplot object at 0x1102ed510>,
        <matplotlib.axes.AxesSubplot object at 0x110352490>,
        <matplotlib.axes.AxesSubplot object at 0x1103e1410>,
        <matplotlib.axes.AxesSubplot object at 0x11042fe50>,
        <matplotlib.axes.AxesSubplot object at 0x1104ce150>,
        <matplotlib.axes.AxesSubplot object at 0x1107072d0>],
       [<matplotlib.axes.AxesSubplot object at 0x1107b7b10>,
        <matplotlib.axes.AxesSubplot object at 0x11083d8d0>,
        <matplotlib.axes.AxesSubplot object at 0x1108a5950>,
        <matplotlib.axes.AxesSubplot object at 0x1109348d0>,
        <matplotlib.axes.AxesSubplot object at 0x110990350>,
        <matplotlib.axes.AxesSubplot object at 0x110a22610>,
        <matplotlib.axes.AxesSubplot object at 0x110a5b790>,
        <matplotlib.axes.AxesSubplot object at 0x110b0bfd0>],
       [<matplotlib.axes.AxesSubplot object at 0x110b93d90>,
        <matplotlib.axes.AxesSubplot object at 0x110bfae10>,
        <matplotlib.axes.AxesSubplot object at 0x110d89d90>,
        <matplotlib.axes.AxesSubplot object at 0x110de6810>,
        <matplotlib.axes.AxesSubplot object at 0x110e77ad0>,
        <matplotlib.axes.AxesSubplot object at 0x110eb3c50>,
        <matplotlib.axes.AxesSubplot object at 0x110f6f4d0>,
        <matplotlib.axes.AxesSubplot object at 0x110ff5290>],
       [<matplotlib.axes.AxesSubplot object at 0x11115c310>,
        <matplotlib.axes.AxesSubplot object at 0x1111eb290>,
        <matplotlib.axes.AxesSubplot object at 0x111239cd0>,
        <matplotlib.axes.AxesSubplot object at 0x1112cbf90>,
        <matplotlib.axes.AxesSubplot object at 0x111313150>,
        <matplotlib.axes.AxesSubplot object at 0x1113a4390>,
        <matplotlib.axes.AxesSubplot object at 0x1116299d0>,
        <matplotlib.axes.AxesSubplot object at 0x111693d50>]], dtype=object)

In [46]:
df.corr()


Out[46]:
0 1 2 3 4 5 6 7
0 1.000000 0.994341 0.608288 0.949985 0.970771 -0.229572 0.863693 -0.346058
1 0.994341 1.000000 0.529244 0.972422 0.944829 -0.217340 0.890784 -0.327900
2 0.608288 0.529244 1.000000 0.367915 0.761635 -0.331471 0.226825 -0.531007
3 0.949985 0.972422 0.367915 1.000000 0.860415 -0.171562 0.932806 -0.257269
4 0.970771 0.944829 0.761635 0.860415 1.000000 -0.258037 0.749131 -0.423463
5 -0.229572 -0.217340 -0.331471 -0.171562 -0.258037 1.000000 -0.011079 0.577273
6 0.863693 0.890784 0.226825 0.932806 0.749131 -0.011079 1.000000 0.024301
7 -0.346058 -0.327900 -0.531007 -0.257269 -0.423463 0.577273 0.024301 1.000000

based on the various characteristics of the wheat kernel, we're predicting the variety, either Kama, Rosa and Canadian. Some features seem highly correlated and potentially useful for splitting features (area and perimeter), while others don't appear correlated and unlikely to help split features (asymetry coefficient)

5. Using the seeds_dataset.txt, create a classifier to predict the type of seed. Perform the above hold out evaluation (50-50, 75-25, 10-fold cross validation) and discuss the results.


In [65]:
x = np.asarray(df[[0,1,2,3,4,5,6]])
y = np.asarray(df[7])

In [48]:
#50-50 split
x_train_50,x_test_50,y_train_50,y_test_50 = cross_validation.train_test_split(x,y,train_size=0.5)

In [49]:
dt_seeds_50 = tree.DecisionTreeClassifier()

In [50]:
dt_seeds_50 = dt_seeds_50.fit(x_train_50,y_train_50)

In [51]:
measure_performance(x_train_50,y_train_50,dt_seeds_50)


Accuracy:1.000 

Classification report
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        33
          2       1.00      1.00      1.00        33
          3       1.00      1.00      1.00        39

avg / total       1.00      1.00      1.00       105


Confusion matrix
[[33  0  0]
 [ 0 33  0]
 [ 0  0 39]] 


In [52]:
measure_performance(x_test_50,y_test_50,dt_seeds_50)


Accuracy:0.924 

Classification report
             precision    recall  f1-score   support

          1       0.89      0.89      0.89        37
          2       0.95      0.97      0.96        37
          3       0.93      0.90      0.92        31

avg / total       0.92      0.92      0.92       105


Confusion matrix
[[33  2  2]
 [ 1 36  0]
 [ 3  0 28]] 

class 1 and 3 are hard for the classifier to distinguish, which makes sense looking at the scatter matrix results and seeing the overlap between them


In [53]:
#75-25 split
x_train_75,x_test_25,y_train_75,y_test_25 = cross_validation.train_test_split(x,y,train_size=0.75)

In [54]:
dt_seeds_75 = tree.DecisionTreeClassifier()

In [55]:
dt_seeds_75 = dt_seeds_75.fit(x_train_75,y_train_75)

In [56]:
measure_performance(x_train_75,y_train_75,dt_seeds_75)


Accuracy:1.000 

Classification report
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        59
          2       1.00      1.00      1.00        47
          3       1.00      1.00      1.00        51

avg / total       1.00      1.00      1.00       157


Confusion matrix
[[59  0  0]
 [ 0 47  0]
 [ 0  0 51]] 


In [57]:
measure_performance(x_test_25,y_test_25,dt_seeds_75)


Accuracy:0.925 

Classification report
             precision    recall  f1-score   support

          1       0.77      0.91      0.83        11
          2       0.95      0.91      0.93        23
          3       1.00      0.95      0.97        19

avg / total       0.93      0.92      0.93        53


Confusion matrix
[[10  1  0]
 [ 2 21  0]
 [ 1  0 18]] 

Although the precision and recall for class 3 have improved, there's still a problem in having the model distinguish between classes 1 and 3


In [58]:
# 10 fold cross validation
dt_seeds_cv = tree.DecisionTreeClassifier()
dt_seeds_cv = dt_seeds_cv.fit(x,y) #fit the model on all data

In [60]:
scores = cross_validation.cross_val_score(dt_seeds_cv,x,y,cv=10)

In [61]:
scores.mean()


Out[61]:
0.92857142857142849

In [66]:
dt_seeds_cv.feature_importances_


Out[66]:
array([ 0.34860165,  0.0122449 ,  0.0125    ,  0.00714286,  0.02042607,
        0.06731387,  0.53177066])

The decision tree model for this data is likely fairly accurate, approximately 93% on this data.


In [ ]: