We covered a lot of material today, and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments alongside the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later).

1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)

import pandas as pd
%matplotlib inline
import numpy as np

from sklearn import tree
from sklearn import datasets
from sklearn import cross_validation
from sklearn import metrics

iris = datasets.load_iris() # load iris data set
x = iris.data[:,2:] # the attributes: petal length and petal width only (columns 2 and 3)
y = iris.target # the target variable

x_train, x_test, y_train, y_test = cross_validation.train_test_split(x,y,test_size=0.5)

dt = tree.DecisionTreeClassifier()

dt = dt.fit(x_train,y_train)

#from Learning scikit-learn: Machine Learning in Python
def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confussion_matrix=True):
    y_pred = clf.predict(X) #predict once and reuse the predictions for every metric below
    if show_accuracy:
        print "Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n"
    if show_classification_report:
        print "Classification report"
        print metrics.classification_report(y,y_pred),"\n"
    if show_confussion_matrix:
        print "Confusion matrix"
        print metrics.confusion_matrix(y,y_pred),"\n"

measure_performance(x_train, y_train, dt)


Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        22
          1       1.00      1.00      1.00        27
          2       1.00      1.00      1.00        26

avg / total       1.00      1.00      1.00        75

Confusion matrix
[[22  0  0]
 [ 0 27  0]
 [ 0  0 26]] 

measure_performance(x_test, y_test, dt)

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        28
          1       0.95      0.91      0.93        23
          2       0.92      0.96      0.94        24

avg / total       0.96      0.96      0.96        75

Confusion matrix
[[28  0  0]
 [ 0 21  2]
 [ 0  1 23]] 

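#pretty good results on the test set (96% accuracy, with high precision and recall)
#the perfect training score is typical of a fully grown tree and says little about unseen data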
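
The figures in the test report can be recomputed directly from its confusion matrix. The sketch below is a minimal pure-Python version; the helper name `per_class_scores` is made up for this illustration, and rows are assumed to be true classes and columns predicted classes, which is scikit-learn's convention.

```python
# Test-set confusion matrix copied from the report above
# (rows = true class, columns = predicted class).
confusion = [
    [28, 0, 0],
    [0, 21, 2],
    [0, 1, 23],
]

def per_class_scores(matrix):
    """Return (accuracy, [(precision, recall), ...]) for a square confusion matrix."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n_classes))
    scores = []
    for i in range(n_classes):
        predicted_as_i = sum(matrix[r][i] for r in range(n_classes))  # column sum
        truly_i = sum(matrix[i])                                      # row sum
        precision = matrix[i][i] / float(predicted_as_i) if predicted_as_i else 0.0
        recall = matrix[i][i] / float(truly_i) if truly_i else 0.0
        scores.append((precision, recall))
    return correct / float(total), scores

accuracy, scores = per_class_scores(confusion)
print("accuracy: {0:.2f}".format(accuracy))
for label, (precision, recall) in enumerate(scores):
    print("class {0}: precision {1:.2f}, recall {2:.2f}".format(label, precision, recall))
```

Rounded to two decimals, this gives the 0.96 accuracy and the per-class precision and recall shown in the report: 72 of the 75 test examples sit on the diagonal.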

#visualize the model
from sklearn.externals.six import StringIO
import pydotplus #pip install pydotplus

with open("iris_50.dot", 'w') as f: #output the .dot file
    tree.export_graphviz(dt, out_file=f)

import os
os.unlink('iris_50.dot') #remove the .dot file again; the PDF below is built in memory

dot_data = StringIO() 
tree.export_graphviz(dt, out_file=dot_data) #brew install graphviz
graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 
graph.write_pdf("iris_50.pdf") #write the PDF that is loaded below

from wand.image import Image as WImage 
img = WImage(filename='iris_50.pdf')


2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be.

x_train_75, x_test_25, y_train_75, y_test_25 = cross_validation.train_test_split(x,y,train_size=0.75)

dt_75_25 = tree.DecisionTreeClassifier()

dt_75_25 = dt_75_25.fit(x_train_75,y_train_75)

measure_performance(x_train_75, y_train_75, dt_75_25)


Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        37
          1       0.97      0.97      0.97        37
          2       0.97      0.97      0.97        38

avg / total       0.98      0.98      0.98       112

Confusion matrix
[[37  0  0]
 [ 0 36  1]
 [ 0  1 37]] 

measure_performance(x_test_25, y_test_25, dt_75_25)


Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        13
          1       1.00      0.92      0.96        13
          2       0.92      1.00      0.96        12

avg / total       0.98      0.97      0.97        38

Confusion matrix
[[13  0  0]
 [ 0 12  1]
 [ 0  0 12]] 

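#interestingly, training accuracy dropped with more examples (100% vs 98.2%), while test accuracy rose slightly (96% vs 97%)
#the 75/25 test set has only 38 examples, so a single prediction moves accuracy by about 2.6 points; the gain may be noise from the random split
#note: these numbers only describe the new tree if dt_75_25 is the model passed to measure_performance;
#scoring dt (the 50/50 tree) on the new split would mix up the two models and could explain a training accuracy below 100%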
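
One way to judge whether a one-point gap between two hold-out accuracies matters is to look at the sampling noise of each estimate. The sketch below uses the binomial standard error, sqrt(p(1-p)/n), a textbook approximation brought in here for illustration and not something the exercise asked for; `accuracy_standard_error` is a hypothetical helper name.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# (accuracy, test-set size) for the two hold-out runs reported above
runs = {"50/50 split": (0.96, 75), "75/25 split": (0.97, 38)}

for name, (accuracy, n_test) in sorted(runs.items()):
    se = accuracy_standard_error(accuracy, n_test)
    print("{0}: {1:.2f} +/- {2:.3f}".format(name, accuracy, se))
```

Both standard errors come out above two percentage points, which is larger than the one-point difference between the two runs.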

3. Perform 10-fold cross validation on the data and compare your results to the hold out method we used in 1 and 2. Take the average of the results. What do you notice about the accuracy measures in each of these?

from sklearn import cross_validation

iris = datasets.load_iris() # load iris data set
x = iris.data[:,2:] # the attributes: petal length and petal width only (columns 2 and 3)
y = iris.target # the target variable

dt = tree.DecisionTreeClassifier()

dt = dt.fit(x,y) #fit on all the data; cross_val_score below refits its own copy of dt on each fold, so this fit does not affect the scores

cv = cross_validation.KFold(len(x),10,shuffle=True,random_state=0)

The KFold object above spells out how the folds are created. Given just cv=10, cross_val_score would build the folds itself (stratified folds for a classifier); here we're making the split explicit and shuffling it.
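
What a shuffled 10-fold split does can be sketched without scikit-learn. The function below is an illustration only (the name `kfold_indices` and its signature are invented here, and it is not how scikit-learn implements `KFold` internally): it shuffles the row indices once, cuts them into 10 nearly equal folds, and yields each fold in turn as the test set with the remaining rows as the training set.

```python
import random

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_folds) folds get one extra sample.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(150, 10))  # iris has 150 rows
for train, test in folds:
    print("train: {0} rows, test: {1} rows".format(len(train), len(test)))
```

Every row lands in exactly one test fold, so each example is scored once by a model that never saw it during training.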

scores = cross_validation.cross_val_score(dt,x,y,cv=cv)
scores.mean() #average accuracy across the 10 folds

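Based on this result, our model is likely to achieve about 94% accuracy on unseen data, rather than the 96-97% suggested by the hold-out method. The cross-validated figure averages over ten different test folds, so it depends less on one lucky split.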
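
The `sklearn.cross_validation` module used in these exercises was deprecated in scikit-learn 0.18 and removed in 0.20; its functions now live in `sklearn.model_selection`. As a sketch only, assuming a newer scikit-learn is installed, exercise 3 could be written as follows. The `random_state=0` values are a choice made here for repeatability and were not part of the original run, so the scores need not match the 94% quoted above.

```python
from sklearn import datasets, tree
from sklearn.model_selection import KFold, cross_val_score

iris = datasets.load_iris()
x = iris.data[:, 2:]   # petal length and petal width, as in the exercises
y = iris.target

# Newer KFold takes the number of splits; the data is passed to cross_val_score.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
dt = tree.DecisionTreeClassifier(random_state=0)

# cross_val_score fits a fresh copy of dt on each training fold,
# so dt does not need to be fitted beforehand.
scores = cross_val_score(dt, x, y, cv=cv)
print("fold accuracies: {0}".format(scores))
print("mean accuracy: {0:.3f}".format(scores.mean()))
```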

4. Open the seeds_dataset.txt and perform basic exploratory analysis. What attributes do we have? What are we trying to predict?

For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/seeds

df = pd.read_csv("data/seeds_dataset.txt",header=None)

0 1 2 3 4 5 6 7
0 15.26 14.84 0.8710 5.763 3.312 2.2210 5.220 1
1 14.88 14.57 0.8811 5.554 3.333 1.0180 4.956 1
2 14.29 14.09 0.9050 5.291 3.337 2.6990 4.825 1
3 13.84 13.94 0.8955 5.324 3.379 2.2590 4.805 1
4 16.14 14.99 0.9034 5.658 3.562 1.3550 5.175 1
5 14.38 14.21 0.8951 5.386 3.312 2.4620 4.956 1
6 14.69 14.49 0.8799 5.563 3.259 3.5860 5.219 1
7 14.11 14.10 0.8911 5.420 3.302 2.7000 5.000 1
8 16.63 15.46 0.8747 6.053 3.465 2.0400 5.877 1
9 16.44 15.25 0.8880 5.884 3.505 1.9690 5.533 1
10 15.26 14.85 0.8696 5.714 3.242 4.5430 5.314 1
11 14.03 14.16 0.8796 5.438 3.201 1.7170 5.001 1
12 13.89 14.02 0.8880 5.439 3.199 3.9860 4.738 1
13 13.78 14.06 0.8759 5.479 3.156 3.1360 4.872 1
14 13.74 14.05 0.8744 5.482 3.114 2.9320 4.825 1
15 14.59 14.28 0.8993 5.351 3.333 4.1850 4.781 1
16 13.99 13.83 0.9183 5.119 3.383 5.2340 4.781 1
17 15.69 14.75 0.9058 5.527 3.514 1.5990 5.046 1
18 14.70 14.21 0.9153 5.205 3.466 1.7670 4.649 1
19 12.72 13.57 0.8686 5.226 3.049 4.1020 4.914 1
20 14.16 14.40 0.8584 5.658 3.129 3.0720 5.176 1
21 14.11 14.26 0.8722 5.520 3.168 2.6880 5.219 1
22 15.88 14.90 0.8988 5.618 3.507 0.7651 5.091 1
23 12.08 13.23 0.8664 5.099 2.936 1.4150 4.961 1
24 15.01 14.76 0.8657 5.789 3.245 1.7910 5.001 1
25 16.19 15.16 0.8849 5.833 3.421 0.9030 5.307 1
26 13.02 13.76 0.8641 5.395 3.026 3.3730 4.825 1
27 12.74 13.67 0.8564 5.395 2.956 2.5040 4.869 1
28 14.11 14.18 0.8820 5.541 3.221 2.7540 5.038 1
29 13.45 14.02 0.8604 5.516 3.065 3.5310 5.097 1
... ... ... ... ... ... ... ... ...
180 11.41 12.95 0.8560 5.090 2.775 4.9570 4.825 3
181 12.46 13.41 0.8706 5.236 3.017 4.9870 5.147 3
182 12.19 13.36 0.8579 5.240 2.909 4.8570 5.158 3
183 11.65 13.07 0.8575 5.108 2.850 5.2090 5.135 3
184 12.89 13.77 0.8541 5.495 3.026 6.1850 5.316 3
185 11.56 13.31 0.8198 5.363 2.683 4.0620 5.182 3
186 11.81 13.45 0.8198 5.413 2.716 4.8980 5.352 3
187 10.91 12.80 0.8372 5.088 2.675 4.1790 4.956 3
188 11.23 12.82 0.8594 5.089 2.821 7.5240 4.957 3
189 10.59 12.41 0.8648 4.899 2.787 4.9750 4.794 3
190 10.93 12.80 0.8390 5.046 2.717 5.3980 5.045 3
191 11.27 12.86 0.8563 5.091 2.804 3.9850 5.001 3
192 11.87 13.02 0.8795 5.132 2.953 3.5970 5.132 3
193 10.82 12.83 0.8256 5.180 2.630 4.8530 5.089 3
194 12.11 13.27 0.8639 5.236 2.975 4.1320 5.012 3
195 12.80 13.47 0.8860 5.160 3.126 4.8730 4.914 3
196 12.79 13.53 0.8786 5.224 3.054 5.4830 4.958 3
197 13.37 13.78 0.8849 5.320 3.128 4.6700 5.091 3
198 12.62 13.67 0.8481 5.410 2.911 3.3060 5.231 3
199 12.76 13.38 0.8964 5.073 3.155 2.8280 4.830 3
200 12.38 13.44 0.8609 5.219 2.989 5.4720 5.045 3
201 12.67 13.32 0.8977 4.984 3.135 2.3000 4.745 3
202 11.18 12.72 0.8680 5.009 2.810 4.0510 4.828 3
203 12.70 13.41 0.8874 5.183 3.091 8.4560 5.000 3
204 12.37 13.47 0.8567 5.204 2.960 3.9190 5.001 3
205 12.19 13.20 0.8783 5.137 2.981 3.6310 4.870 3
206 11.23 12.88 0.8511 5.140 2.795 4.3250 5.003 3
207 13.20 13.66 0.8883 5.236 3.232 8.3150 5.056 3
208 11.84 13.21 0.8521 5.175 2.836 3.5980 5.044 3
209 12.30 13.34 0.8684 5.243 2.974 5.6370 5.063 3

210 rows × 8 columns

df.describe()

0 1 2 3 4 5 6 7
count 210.000000 210.000000 210.000000 210.000000 210.000000 210.000000 210.000000 210.000000
mean 14.847524 14.559286 0.870999 5.628533 3.258605 3.700201 5.408071 2.000000
std 2.909699 1.305959 0.023629 0.443063 0.377714 1.503557 0.491480 0.818448
min 10.590000 12.410000 0.808100 4.899000 2.630000 0.765100 4.519000 1.000000
25% 12.270000 13.450000 0.856900 5.262250 2.944000 2.561500 5.045000 1.000000
50% 14.355000 14.320000 0.873450 5.523500 3.237000 3.599000 5.223000 2.000000
75% 17.305000 15.715000 0.887775 5.979750 3.561750 4.768750 5.877000 3.000000
max 21.180000 17.250000 0.918300 6.675000 4.033000 8.456000 6.550000 3.000000

from pandas.tools.plotting import scatter_matrix

scatter_matrix(df,alpha=0.2, figsize=(10, 10), diagonal='kde')

df.corr()

0 1 2 3 4 5 6 7
0 1.000000 0.994341 0.608288 0.949985 0.970771 -0.229572 0.863693 -0.346058
1 0.994341 1.000000 0.529244 0.972422 0.944829 -0.217340 0.890784 -0.327900
2 0.608288 0.529244 1.000000 0.367915 0.761635 -0.331471 0.226825 -0.531007
3 0.949985 0.972422 0.367915 1.000000 0.860415 -0.171562 0.932806 -0.257269
4 0.970771 0.944829 0.761635 0.860415 1.000000 -0.258037 0.749131 -0.423463
5 -0.229572 -0.217340 -0.331471 -0.171562 -0.258037 1.000000 -0.011079 0.577273
6 0.863693 0.890784 0.226825 0.932806 0.749131 -0.011079 1.000000 0.024301
7 -0.346058 -0.327900 -0.531007 -0.257269 -0.423463 0.577273 0.024301 1.000000

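Based on seven geometric measurements of each wheat kernel (area, perimeter, compactness, kernel length, kernel width, asymmetry coefficient, and kernel groove length, in columns 0-6), we're predicting the variety in column 7: Kama, Rosa, or Canadian. Several attributes are highly correlated with one another (area and perimeter at 0.99), so they carry overlapping information and a tree will likely need only one of them. The asymmetry coefficient (column 5) is only weakly correlated with the other attributes, but that alone does not make it unhelpful for splitting: its correlation with the class column (0.58) is the largest in the table. Because the class codes are nominal, the correlations with column 7 are only a rough guide.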
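
The correlation table is presumably made of Pearson coefficients (the pandas default for `DataFrame.corr`). As a small worked example, the sketch below computes that coefficient by hand for area and perimeter (columns 0 and 1, assuming the column order in the UCI documentation) using only the first five rows printed above, so the value will differ somewhat from the 0.994 obtained on all 210 rows. `pearson` is a helper written for this illustration.

```python
import math

# First five rows of columns 0 and 1 from the table above
area = [15.26, 14.88, 14.29, 13.84, 16.14]
perimeter = [14.84, 14.57, 14.09, 13.94, 14.99]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson(area, perimeter)
print("area vs perimeter: r = {0:.3f}".format(r))
```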

5. Using the seeds_dataset.txt, create a classifier to predict the type of seed. Perform the above hold out evaluation (50-50, 75-25, 10-fold cross validation) and discuss the results.

x = np.asarray(df[[0,1,2,3,4,5,6]])
y = np.asarray(df[7])

#50-50 split
x_train_50,x_test_50,y_train_50,y_test_50 = cross_validation.train_test_split(x,y,train_size=0.5)

dt_seeds_50 = tree.DecisionTreeClassifier()

dt_seeds_50 = dt_seeds_50.fit(x_train_50,y_train_50)

measure_performance(x_train_50, y_train_50, dt_seeds_50)

Classification report
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        33
          2       1.00      1.00      1.00        33
          3       1.00      1.00      1.00        39

avg / total       1.00      1.00      1.00       105

Confusion matrix
[[33  0  0]
 [ 0 33  0]
 [ 0  0 39]] 

measure_performance(x_test_50, y_test_50, dt_seeds_50)

Classification report
             precision    recall  f1-score   support

          1       0.89      0.89      0.89        37
          2       0.95      0.97      0.96        37
          3       0.93      0.90      0.92        31

avg / total       0.92      0.92      0.92       105

Confusion matrix
[[33  2  2]
 [ 1 36  0]
 [ 3  0 28]] 

Classes 1 and 3 are the hardest for the classifier to distinguish (5 of the 8 test errors), which makes sense given the overlap between them in the scatter matrix. Classes 2 and 3 are never confused with each other.

#75-25 split
x_train_75,x_test_25,y_train_75,y_test_25 = cross_validation.train_test_split(x,y,train_size=0.75)

dt_seeds_75 = tree.DecisionTreeClassifier()

dt_seeds_75 = dt_seeds_75.fit(x_train_75,y_train_75)

measure_performance(x_train_75, y_train_75, dt_seeds_75)

Classification report
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        59
          2       1.00      1.00      1.00        47
          3       1.00      1.00      1.00        51

avg / total       1.00      1.00      1.00       157

Confusion matrix
[[59  0  0]
 [ 0 47  0]
 [ 0  0 51]] 

measure_performance(x_test_25, y_test_25, dt_seeds_75)

Classification report
             precision    recall  f1-score   support

          1       0.77      0.91      0.83        11
          2       0.95      0.91      0.93        23
          3       1.00      0.95      0.97        19

avg / total       0.93      0.92      0.93        53

Confusion matrix
[[10  1  0]
 [ 2 21  0]
 [ 1  0 18]] 

Although the precision and recall for class 3 have improved, class 1 remains the weak point (precision 0.77). On this split, three of the four test errors are between classes 1 and 2, and one is between classes 1 and 3.
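
To see which classes get mixed up, the off-diagonal cells of a confusion matrix can be summed per pair of classes. The sketch below does this for the two seeds test-set matrices above; `confused_pairs` is a hypothetical helper written for this illustration, and rows are assumed to be true classes and columns predicted classes.

```python
from itertools import combinations

labels = [1, 2, 3]

# Test-set confusion matrices copied from the two hold-out runs above
# (rows = true class, columns = predicted class).
test_50 = [[33, 2, 2], [1, 36, 0], [3, 0, 28]]
test_25 = [[10, 1, 0], [2, 21, 0], [1, 0, 18]]

def confused_pairs(matrix, labels):
    """Count errors between each pair of classes, in either direction."""
    counts = {}
    for i, j in combinations(range(len(labels)), 2):
        counts[(labels[i], labels[j])] = matrix[i][j] + matrix[j][i]
    return counts

for name, matrix in [("50/50", test_50), ("75/25", test_25)]:
    pairs = sorted(confused_pairs(matrix, labels).items())
    print("{0}: {1}".format(name, pairs))
```

In the 50/50 run the pair (1, 3) accounts for 5 errors and (1, 2) for 3; in the 75/25 run (1, 2) accounts for 3 and (1, 3) for 1. Class 1 is involved in every error in both runs.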

# 10 fold cross validation
dt_seeds_cv = tree.DecisionTreeClassifier()
dt_seeds_cv = dt_seeds_cv.fit(x,y) #fit the model on all data; note that cross_val_score below refits its own copy on each fold

scores = cross_validation.cross_val_score(dt_seeds_cv,x,y,cv=10)

dt_seeds_cv.feature_importances_ #relative importance of each of the seven attributes in the tree fit on all data

array([ 0.34860165,  0.0122449 ,  0.0125    ,  0.00714286,  0.02042607,
        0.06731387,  0.53177066])
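
The seven values above sum to 1, which is what a fitted tree's `feature_importances_` look like (one value per attribute); they are not the ten cross-validation scores. Taking them as importances, and assuming the columns follow the attribute order in the UCI documentation, the sketch below ranks the attributes. Both of those readings are assumptions.

```python
# Values copied from the array above; treated here as feature importances (an assumption).
importances = [0.34860165, 0.0122449, 0.0125, 0.00714286,
               0.02042607, 0.06731387, 0.53177066]

# Attribute order as listed in the UCI seeds documentation (assumed to match columns 0-6).
names = ["area", "perimeter", "compactness", "kernel length",
         "kernel width", "asymmetry coefficient", "groove length"]

ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print("{0:<22} {1:.3f}".format(name, value))
print("total: {0:.3f}".format(sum(importances)))
```

Under those assumptions, kernel groove length and area together carry most of the weight, while perimeter contributes very little, in line with it being nearly redundant with area.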

The decision tree model is likely to be fairly accurate on unseen seeds data, approximately 93%, which is in line with the roughly 92% accuracy on both hold-out test sets.

