## We covered a lot of information today, and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments along with the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later).

### 1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)

``````

In [3]:

import pandas as pd
%matplotlib inline
import numpy as np

``````
``````

In [32]:

from sklearn import tree
from sklearn import datasets
from sklearn import cross_validation
from sklearn import metrics

``````
``````

In [7]:

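iris = datasets.load_iris() #load the iris dataset bundled with scikit-learn
x = iris.data[:,2:] # the attributes (columns 2 and 3: petal length and petal width)
y = iris.target # the target variable (species encoded as 0, 1, 2)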

``````
``````

In [9]:

x_train, x_test, y_train, y_test = cross_validation.train_test_split(x,y,test_size=0.5)

``````
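As a rough illustration of what a hold-out split does, the sketch below shuffles the row indices and cuts them at the requested fraction. `holdout_split` is a hypothetical helper written for this note, not scikit-learn's implementation, and it makes no attempt to keep the class proportions equal in both halves.

```python
import random

def holdout_split(n_samples, test_size=0.5, seed=0):
    """Return (train_idx, test_idx): a shuffled partition of range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # fixed seed so the split is repeatable
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

# iris has 150 rows, so a 50% hold-out leaves 75 rows on each side
train_idx, test_idx = holdout_split(150, test_size=0.5)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two lists, which is the property that keeps the test score honest.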
``````

In [11]:

dt = tree.DecisionTreeClassifier()

``````
``````

In [12]:

dt = dt.fit(x_train,y_train)

``````
``````

In [29]:

#from Learning scikit-learn: Machine Learning in Python
def measure_performance(X,y,clf, show_accuracy=True, show_classification_report=True, show_confussion_matrix=True):
    #print accuracy, per-class report and confusion matrix for clf on (X, y)
    #single-argument print(...) calls run under both Python 2 and Python 3
    y_pred=clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}\n".format(metrics.accuracy_score(y, y_pred)))
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y,y_pred) + "\n")
    if show_confussion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y,y_pred))
        print("")

``````
``````

In [16]:

measure_performance(x_train, y_train, dt)

``````
``````

Accuracy:1.000

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        22
          1       1.00      1.00      1.00        27
          2       1.00      1.00      1.00        26

avg / total       1.00      1.00      1.00        75

Confusion matrix
[[22  0  0]
 [ 0 27  0]
 [ 0  0 26]]

``````
``````

In [17]:

measure_performance(x_test,y_test,dt)

``````
``````

Accuracy:0.960

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        28
          1       0.95      0.91      0.93        23
          2       0.92      0.96      0.94        24

avg / total       0.96      0.96      0.96        75

Confusion matrix
[[28  0  0]
 [ 0 21  2]
 [ 0  1 23]]

``````
``````

In [ ]:

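#pretty good results: 96% test accuracy with high precision and recall
#the only errors (3 of 75) are between classes 1 and 2; class 0 is separated perfectly
#training accuracy is 100%, so the unpruned tree fits its training data exactly and loses a little on unseen data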

``````
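To see where the numbers in the report come from, here is a small sketch that recomputes accuracy, precision and recall from the test-set confusion matrix printed above. It assumes scikit-learn's convention that rows are the true classes and columns are the predicted classes, which is consistent with the row sums matching the support column.

```python
# Test-set confusion matrix from the 50/50 split (rows = true class, columns = predicted)
cm = [[28, 0, 0],
      [0, 21, 2],
      [0, 1, 23]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))   # diagonal = correct predictions
accuracy = correct / float(total)

precision, recall = [], []
for k in range(n_classes):
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum: everything predicted as k
    actual_k = sum(cm[k])                                   # row sum: everything that really is k
    precision.append(cm[k][k] / float(predicted_k))
    recall.append(cm[k][k] / float(actual_k))

print("accuracy: {0:.3f}".format(accuracy))
for k in range(n_classes):
    print("class {0}: precision {1:.2f}, recall {2:.2f}".format(k, precision[k], recall[k]))
```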
``````

In [18]:

#visualize the model
from sklearn.externals.six import StringIO
import pydotplus #pip install pydotplus

``````
``````

In [19]:

with open("iris_50.dot", 'w') as f: #output the .dot file
    tree.export_graphviz(dt, out_file=f) #write the fitted tree in Graphviz format

``````
``````

In [20]:

import os
os.unlink('iris_50.dot') #remove the file from the file path

``````
``````

In [21]:

dot_data = StringIO()
tree.export_graphviz(dt, out_file=dot_data) #brew install graphviz
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("iris_50.pdf")

``````
``````

Out[21]:

True

``````
``````

In [23]:

from wand.image import Image as WImage
img = WImage(filename='iris_50.pdf')
img

``````
``````

Out[23]:

[Image: the decision tree rendered from iris_50.pdf]

``````

### 2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be.

``````

In [28]:

x_train_75, x_test_25, y_train_75, y_test_25 = cross_validation.train_test_split(x,y,train_size=0.75)

``````
``````

In [29]:

dt_75_25 = tree.DecisionTreeClassifier()

``````
``````

In [30]:

dt_75_25 = dt_75_25.fit(x_train_75,y_train_75)

``````
``````

In [31]:

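measure_performance(x_train_75, y_train_75, dt_75_25) #evaluate the 75/25 model; the output recorded below came from a run that passed dt (the 50/50 model)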

``````
``````

Accuracy:0.982

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        37
          1       0.97      0.97      0.97        37
          2       0.97      0.97      0.97        38

avg / total       0.98      0.98      0.98       112

Confusion matrix
[[37  0  0]
 [ 0 36  1]
 [ 0  1 37]]

``````
``````

In [32]:

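measure_performance(x_test_25, y_test_25, dt_75_25) #evaluate the 75/25 model; the output recorded below came from a run that passed dt (the 50/50 model)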

``````
``````

Accuracy:0.974

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        13
          1       1.00      0.92      0.96        13
          2       0.92      1.00      0.96        12

avg / total       0.98      0.97      0.97        38

Confusion matrix
[[13  0  0]
 [ 0 12  1]
 [ 0  0 12]]

``````
``````

In [34]:

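#training accuracy is lower here than in exercise 1 (98.2% vs 100%), while test accuracy is slightly higher (97.4% vs 96%)
#caution: the recorded outputs above came from calls that passed dt (the model fit on the 50/50 split), not dt_75_25,
#so they score the earlier model on the new split, which would explain a training accuracy below 100%
#with only 38 test examples, one misclassification moves accuracy by about 2.6 points, so the test-set difference is within noise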

``````

### 3. Perform 10-fold cross validation on the data and compare your results to the hold out method we used in 1 and 2. Take the average of the results. What do you notice about the accuracy measures in each of these?

``````

In [1]:

from sklearn import cross_validation

``````
``````

In [5]:

x = iris.data[:,2:] # the attributes
y = iris.target # the target variable

``````
``````

In [6]:

dt = tree.DecisionTreeClassifier()

``````
``````

In [7]:

dt = dt.fit(x,y) #fit on all the data; cross_val_score below clones and refits the model on each fold, so this fit does not affect the scores

``````
``````

In [12]:

cv = cross_validation.KFold(len(x),10,shuffle=True,random_state=0)

``````

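Building the `KFold` object above spells out how the folds are created. `cross_val_score` builds folds itself when given an integer such as `cv=10` (for a classifier it uses stratified folds without shuffling); passing our own `KFold` object makes the shuffling and the random seed explicit.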
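For intuition about how shuffled k-fold splitting partitions the rows, here is a plain-Python sketch. `kfold_indices` is a hypothetical helper written for this note, not scikit-learn's code; it only shows the idea that each row is held out for testing in exactly one fold.

```python
import random

def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # when n_samples is not divisible by n_folds, the first folds get one extra sample
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop

# 150 iris rows in 10 folds: each fold tests on 15 rows and trains on the other 135
folds = list(kfold_indices(150, n_folds=10))
print([len(test) for _, test in folds])
```

A cross-validated score is then the average of the ten test-fold accuracies, with a fresh model fit on each training part.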

``````

In [14]:

scores = cross_validation.cross_val_score(dt,x,y,cv=cv)

``````
``````

In [15]:

scores.mean()

``````
``````

Out[15]:

0.94000000000000006

``````

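Based on this result, we should expect roughly 94% accuracy on unseen data, slightly below the 96% to 97% suggested by the two hold-out splits. Because every observation is used for testing exactly once, the cross-validated figure depends less on the luck of a single split.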
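One way to judge whether these estimates really differ is to attach a rough standard error to each. The sketch below uses the normal approximation for a proportion, which is an assumption on my part: it treats every test prediction as independent, and that is only approximately true for cross validation because the folds share training data.

```python
import math

# (label, accuracy, number of evaluated samples), taken from the outputs above
estimates = [
    ("50/50 hold-out", 0.960, 75),
    ("75/25 hold-out", 0.974, 38),
    ("10-fold CV", 0.940, 150),
]

def binomial_se(p, n):
    """Normal-approximation standard error of a proportion p measured on n samples."""
    return math.sqrt(p * (1.0 - p) / n)

for label, acc, n in estimates:
    se = binomial_se(acc, n)
    print("{0}: {1:.3f} +/- {2:.3f} (one standard error)".format(label, acc, se))
```

The standard errors are around two points each, so the three estimates are compatible with one another; the cross-validated one simply rests on the most test examples.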

### 4. Open the seeds_dataset.txt and perform basic exploratory analysis. What attributes do we have? What are we trying to predict?

For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/seeds

``````

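In [41]:

#this cell's code was not preserved; a likely reconstruction, assuming a whitespace-delimited file with no header row
df = pd.read_csv('seeds_dataset.txt', header=None, sep=r'\s+')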

``````
``````

In [42]:

df

``````
``````

Out[42]:

         0      1       2      3      4       5      6  7
0    15.26  14.84  0.8710  5.763  3.312  2.2210  5.220  1
1    14.88  14.57  0.8811  5.554  3.333  1.0180  4.956  1
2    14.29  14.09  0.9050  5.291  3.337  2.6990  4.825  1
3    13.84  13.94  0.8955  5.324  3.379  2.2590  4.805  1
4    16.14  14.99  0.9034  5.658  3.562  1.3550  5.175  1
5    14.38  14.21  0.8951  5.386  3.312  2.4620  4.956  1
6    14.69  14.49  0.8799  5.563  3.259  3.5860  5.219  1
7    14.11  14.10  0.8911  5.420  3.302  2.7000  5.000  1
8    16.63  15.46  0.8747  6.053  3.465  2.0400  5.877  1
9    16.44  15.25  0.8880  5.884  3.505  1.9690  5.533  1
10   15.26  14.85  0.8696  5.714  3.242  4.5430  5.314  1
11   14.03  14.16  0.8796  5.438  3.201  1.7170  5.001  1
12   13.89  14.02  0.8880  5.439  3.199  3.9860  4.738  1
13   13.78  14.06  0.8759  5.479  3.156  3.1360  4.872  1
14   13.74  14.05  0.8744  5.482  3.114  2.9320  4.825  1
15   14.59  14.28  0.8993  5.351  3.333  4.1850  4.781  1
16   13.99  13.83  0.9183  5.119  3.383  5.2340  4.781  1
17   15.69  14.75  0.9058  5.527  3.514  1.5990  5.046  1
18   14.70  14.21  0.9153  5.205  3.466  1.7670  4.649  1
19   12.72  13.57  0.8686  5.226  3.049  4.1020  4.914  1
20   14.16  14.40  0.8584  5.658  3.129  3.0720  5.176  1
21   14.11  14.26  0.8722  5.520  3.168  2.6880  5.219  1
22   15.88  14.90  0.8988  5.618  3.507  0.7651  5.091  1
23   12.08  13.23  0.8664  5.099  2.936  1.4150  4.961  1
24   15.01  14.76  0.8657  5.789  3.245  1.7910  5.001  1
25   16.19  15.16  0.8849  5.833  3.421  0.9030  5.307  1
26   13.02  13.76  0.8641  5.395  3.026  3.3730  4.825  1
27   12.74  13.67  0.8564  5.395  2.956  2.5040  4.869  1
28   14.11  14.18  0.8820  5.541  3.221  2.7540  5.038  1
29   13.45  14.02  0.8604  5.516  3.065  3.5310  5.097  1
...    ...    ...     ...    ...    ...     ...    ...  ...
180  11.41  12.95  0.8560  5.090  2.775  4.9570  4.825  3
181  12.46  13.41  0.8706  5.236  3.017  4.9870  5.147  3
182  12.19  13.36  0.8579  5.240  2.909  4.8570  5.158  3
183  11.65  13.07  0.8575  5.108  2.850  5.2090  5.135  3
184  12.89  13.77  0.8541  5.495  3.026  6.1850  5.316  3
185  11.56  13.31  0.8198  5.363  2.683  4.0620  5.182  3
186  11.81  13.45  0.8198  5.413  2.716  4.8980  5.352  3
187  10.91  12.80  0.8372  5.088  2.675  4.1790  4.956  3
188  11.23  12.82  0.8594  5.089  2.821  7.5240  4.957  3
189  10.59  12.41  0.8648  4.899  2.787  4.9750  4.794  3
190  10.93  12.80  0.8390  5.046  2.717  5.3980  5.045  3
191  11.27  12.86  0.8563  5.091  2.804  3.9850  5.001  3
192  11.87  13.02  0.8795  5.132  2.953  3.5970  5.132  3
193  10.82  12.83  0.8256  5.180  2.630  4.8530  5.089  3
194  12.11  13.27  0.8639  5.236  2.975  4.1320  5.012  3
195  12.80  13.47  0.8860  5.160  3.126  4.8730  4.914  3
196  12.79  13.53  0.8786  5.224  3.054  5.4830  4.958  3
197  13.37  13.78  0.8849  5.320  3.128  4.6700  5.091  3
198  12.62  13.67  0.8481  5.410  2.911  3.3060  5.231  3
199  12.76  13.38  0.8964  5.073  3.155  2.8280  4.830  3
200  12.38  13.44  0.8609  5.219  2.989  5.4720  5.045  3
201  12.67  13.32  0.8977  4.984  3.135  2.3000  4.745  3
202  11.18  12.72  0.8680  5.009  2.810  4.0510  4.828  3
203  12.70  13.41  0.8874  5.183  3.091  8.4560  5.000  3
204  12.37  13.47  0.8567  5.204  2.960  3.9190  5.001  3
205  12.19  13.20  0.8783  5.137  2.981  3.6310  4.870  3
206  11.23  12.88  0.8511  5.140  2.795  4.3250  5.003  3
207  13.20  13.66  0.8883  5.236  3.232  8.3150  5.056  3
208  11.84  13.21  0.8521  5.175  2.836  3.5980  5.044  3
209  12.30  13.34  0.8684  5.243  2.974  5.6370  5.063  3

210 rows × 8 columns

``````
``````

In [43]:

df.describe()

``````
``````

Out[43]:

                0           1           2           3           4           5           6           7
count  210.000000  210.000000  210.000000  210.000000  210.000000  210.000000  210.000000  210.000000
mean    14.847524   14.559286    0.870999    5.628533    3.258605    3.700201    5.408071    2.000000
std      2.909699    1.305959    0.023629    0.443063    0.377714    1.503557    0.491480    0.818448
min     10.590000   12.410000    0.808100    4.899000    2.630000    0.765100    4.519000    1.000000
25%     12.270000   13.450000    0.856900    5.262250    2.944000    2.561500    5.045000    1.000000
50%     14.355000   14.320000    0.873450    5.523500    3.237000    3.599000    5.223000    2.000000
75%     17.305000   15.715000    0.887775    5.979750    3.561750    4.768750    5.877000    3.000000
max     21.180000   17.250000    0.918300    6.675000    4.033000    8.456000    6.550000    3.000000

``````
``````

In [44]:

from pandas.tools.plotting import scatter_matrix

``````
``````

In [45]:

scatter_matrix(df,alpha=0.2, figsize=(10, 10), diagonal='kde')

``````
``````

Out[45]:

[Figure: scatter matrix of the eight columns of df, with kernel density estimates on the diagonal]

``````
``````

In [46]:

df.corr()

``````
``````

Out[46]:

           0          1          2          3          4          5          6          7
0   1.000000   0.994341   0.608288   0.949985   0.970771  -0.229572   0.863693  -0.346058
1   0.994341   1.000000   0.529244   0.972422   0.944829  -0.217340   0.890784  -0.327900
2   0.608288   0.529244   1.000000   0.367915   0.761635  -0.331471   0.226825  -0.531007
3   0.949985   0.972422   0.367915   1.000000   0.860415  -0.171562   0.932806  -0.257269
4   0.970771   0.944829   0.761635   0.860415   1.000000  -0.258037   0.749131  -0.423463
5  -0.229572  -0.217340  -0.331471  -0.171562  -0.258037   1.000000  -0.011079   0.577273
6   0.863693   0.890784   0.226825   0.932806   0.749131  -0.011079   1.000000   0.024301
7  -0.346058  -0.327900  -0.531007  -0.257269  -0.423463   0.577273   0.024301   1.000000

``````

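Based on seven geometric measurements of each wheat kernel (columns 0 to 6: area, perimeter, compactness, kernel length, kernel width, asymmetry coefficient and kernel groove length), we're predicting the variety in column 7: Kama, Rosa or Canadian. Some features are highly correlated with each other (area and perimeter, r = 0.99), so they carry largely redundant information for splitting. The asymmetry coefficient is only weakly correlated with the other features (|r| of at most 0.33), so whatever it contributes is not already captured by them. Note that a correlation between a feature and the class label in column 7 is only a rough guide, because the labels 1, 2 and 3 are categories rather than quantities.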
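To make the entries of the correlation matrix less of a black box, here is a hand-rolled Pearson correlation applied to the first five rows of columns 0 and 1 from the table above. Calling those columns area and perimeter follows the UCI documentation and is an assumption about this file; five rows are far too few to reproduce the 0.99 computed on all 210.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

# first five rows of columns 0 and 1 (assumed to be area and perimeter)
area = [15.26, 14.88, 14.29, 13.84, 16.14]
perimeter = [14.84, 14.57, 14.09, 13.94, 14.99]

r = pearson(area, perimeter)
print("r = {0:.3f}".format(r))
```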

### 5. Using the seeds_dataset.txt, create a classifier to predict the type of seed. Perform the above hold out evaluation (50-50, 75-25, 10-fold cross validation) and discuss the results.

``````

In [65]:

x = np.asarray(df[[0,1,2,3,4,5,6]])
y = np.asarray(df[7])

``````
``````

In [48]:

#50-50 split
x_train_50,x_test_50,y_train_50,y_test_50 = cross_validation.train_test_split(x,y,train_size=0.5)

``````
``````

In [49]:

dt_seeds_50 = tree.DecisionTreeClassifier()

``````
``````

In [50]:

dt_seeds_50 = dt_seeds_50.fit(x_train_50,y_train_50)

``````
``````

In [51]:

measure_performance(x_train_50,y_train_50,dt_seeds_50)

``````
``````

Accuracy:1.000

Classification report
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        33
          2       1.00      1.00      1.00        33
          3       1.00      1.00      1.00        39

avg / total       1.00      1.00      1.00       105

Confusion matrix
[[33  0  0]
 [ 0 33  0]
 [ 0  0 39]]

``````
``````

In [52]:

measure_performance(x_test_50,y_test_50,dt_seeds_50)

``````
``````

Accuracy:0.924

Classification report
             precision    recall  f1-score   support

          1       0.89      0.89      0.89        37
          2       0.95      0.97      0.96        37
          3       0.93      0.90      0.92        31

avg / total       0.92      0.92      0.92       105

Confusion matrix
[[33  2  2]
 [ 1 36  0]
 [ 3  0 28]]

``````

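Classes 1 and 3 are the hardest pair for the classifier to distinguish (5 of the 8 test errors), which makes sense given the overlap between them in the scatter matrix. The remaining 3 errors are between classes 1 and 2; classes 2 and 3 are never confused with each other.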
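A quick way to check which classes get mixed up is to add the two off-diagonal cells for each pair of classes. This sketch does that for the 50/50 test-set confusion matrix above, again assuming rows are true classes and columns are predicted classes.

```python
from itertools import combinations

labels = [1, 2, 3]
# test-set confusion matrix from the 50/50 seeds split
cm = [[33, 2, 2],
      [1, 36, 0],
      [3, 0, 28]]

# errors between classes a and b = (a predicted as b) + (b predicted as a)
pair_errors = {}
for i, j in combinations(range(len(labels)), 2):
    pair_errors[(labels[i], labels[j])] = cm[i][j] + cm[j][i]

for pair, count in sorted(pair_errors.items(), key=lambda kv: kv[1], reverse=True):
    print("classes {0} and {1}: {2} errors".format(pair[0], pair[1], count))
```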

``````

In [53]:

#75-25 split
x_train_75,x_test_25,y_train_75,y_test_25 = cross_validation.train_test_split(x,y,train_size=0.75)

``````
``````

In [54]:

dt_seeds_75 = tree.DecisionTreeClassifier()

``````
``````

In [55]:

dt_seeds_75 = dt_seeds_75.fit(x_train_75,y_train_75)

``````
``````

In [56]:

measure_performance(x_train_75,y_train_75,dt_seeds_75)

``````
``````

Accuracy:1.000

Classification report
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        59
          2       1.00      1.00      1.00        47
          3       1.00      1.00      1.00        51

avg / total       1.00      1.00      1.00       157

Confusion matrix
[[59  0  0]
 [ 0 47  0]
 [ 0  0 51]]

``````
``````

In [57]:

measure_performance(x_test_25,y_test_25,dt_seeds_75)

``````
``````

Accuracy:0.925

Classification report
             precision    recall  f1-score   support

          1       0.77      0.91      0.83        11
          2       0.95      0.91      0.93        23
          3       1.00      0.95      0.97        19

avg / total       0.93      0.92      0.93        53

Confusion matrix
[[10  1  0]
 [ 2 21  0]
 [ 1  0 18]]

``````

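The precision and recall for class 3 have improved, but class 1 remains the weak spot (precision 0.77). On this split, three of the four test errors are between classes 1 and 2, and the fourth is a class 3 kernel predicted as class 1. With only 53 test examples, these per-class figures are noisy.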

``````

In [58]:

# 10 fold cross validation
dt_seeds_cv = tree.DecisionTreeClassifier()
dt_seeds_cv = dt_seeds_cv.fit(x,y) #fit the model on all data

``````
``````

In [60]:

scores = cross_validation.cross_val_score(dt_seeds_cv,x,y,cv=10)

``````
``````

In [61]:

scores.mean()

``````
``````

Out[61]:

0.92857142857142849

``````
``````

In [66]:

dt_seeds_cv.feature_importances_

``````
``````

Out[66]:

array([ 0.34860165,  0.0122449 ,  0.0125    ,  0.00714286,  0.02042607,
0.06731387,  0.53177066])

``````

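The decision tree model for this data is fairly accurate: the 50/50 hold-out, the 75/25 hold-out and 10-fold cross validation all estimate about 92% to 93% accuracy on unseen data (92.4%, 92.5% and 92.9%).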
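The `feature_importances_` array above is easier to read once each value is paired with a feature name and sorted. The names in this sketch are an assumption taken from the UCI documentation's column order; the file itself has no header row.

```python
# values copied from the feature_importances_ output above (columns 0 to 6)
importances = [0.34860165, 0.0122449, 0.0125, 0.00714286,
               0.02042607, 0.06731387, 0.53177066]
# assumed column names, following the UCI seeds documentation
names = ["area", "perimeter", "compactness", "kernel length",
         "kernel width", "asymmetry coefficient", "groove length"]

ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("{0:<22} {1:.3f}".format(name, score))

# share of the total importance carried by the two highest-ranked features
top_two_share = ranked[0][1] + ranked[1][1]
print("top two features: {0:.1%} of total importance".format(top_two_share))
```

In this particular tree, two features account for most of the splits. Importances from a single unpruned tree can shift noticeably between fits, so treat the ranking as a hint rather than a firm result.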
