In [7]:
from sklearn import datasets
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
In [13]:
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")
In [9]:
iris = datasets.load_iris()
x = iris.data[:, 2:]  # keep only the petal length and petal width columns
y = iris.target
dt = tree.DecisionTreeClassifier()
dt = dt.fit(x,y)
In [ ]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.50,train_size=0.50)
In [15]:
measure_performance(x_test,y_test,dt)
In [14]:
measure_performance(x_train,y_train,dt)
Well, we misclassified one item, putting a sample of the second species into the third species' pile. 98.7% accuracy seems pretty good to me, though.
In [16]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)
In [17]:
measure_performance(x_train,y_train,dt)
We still only missed one, making the exact same mistake. There must be a heavily overlapping region between these two species that confuses the model. Either way, we got the same number of errors with more training data, which seems pretty bad. I wonder how much we can cut the training set before it makes a difference in accuracy?
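To eyeball that overlap claim, here is a minimal sketch (assuming matplotlib is installed; it isn't imported anywhere above) that scatters the two petal columns we kept, colored by species:
In [ ]:
import matplotlib.pyplot as plt

# x holds only columns 2: of iris.data, i.e. petal length and petal width.
# Versicolor and virginica should blend together near the class boundary.
for label, name in enumerate(iris.target_names):
    mask = (y == label)
    plt.scatter(x[mask, 0], x[mask, 1], label=name)
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend()
plt.show()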
In [28]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.95,train_size=0.05)
measure_performance(x_test,y_test, dt)
Weird. Small dataset, I guess? Actually, on reflection: dt was fit on the full dataset back at the top and we never refit it, so every "test" point here was already seen during training. The split sizes can't change anything until the fit happens after the split.
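For comparison, a sketch of the same experiment with the fit moved after the split, so the score reflects genuinely unseen flowers (dt_holdout is my name, to avoid clobbering the dt above):
In [ ]:
# Split first, then fit on the training portion only; now the
# test accuracy is measured on flowers the tree has never seen.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.95, train_size=0.05)
dt_holdout = tree.DecisionTreeClassifier()
dt_holdout = dt_holdout.fit(x_train, y_train)
measure_performance(x_test, y_test, dt_holdout)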
Load the breast cancer dataset (datasets.load_breast_cancer()) and perform basic exploratory analysis. What attributes do we have? What are we trying to predict? For context on the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Attribute Information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
In [44]:
bc = datasets.load_breast_cancer()
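Before modeling, the basic exploration the prompt asks for; a quick sketch using nothing beyond the standard attributes of the sklearn bunch:
In [ ]:
# The bunch stores the 30 numeric nucleus features in .data and the
# diagnosis in .target (0 = malignant, 1 = benign); the raw UCI file's
# ID column is not carried over.
print(bc.data.shape)
print(bc.feature_names)
print(bc.target_names)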
In [47]:
x = bc.data  # unlike the raw UCI file, .data holds only the 30 features; no ID or diagnosis columns to slice off
y = bc.target
dt = tree.DecisionTreeClassifier()
dt = dt.fit(x,y)
In [50]:
dt
Out[50]:
In [51]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.50,train_size=0.50)
In [52]:
measure_performance(x_train,y_train,dt)
In [53]:
measure_performance(x_test,y_test,dt)
In [54]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)
In [55]:
measure_performance(x_train,y_train,dt)
Going to be honest, I barely have any idea what I'm looking at here. We are trying to predict the malignancy of cancer cells, but I mostly just copied what we did in class and applied it here. For what it's worth, bc.data holds only the 30 numeric features (the mean, standard error, and worst value of each of the ten measurements above), and the diagnosis lives separately in bc.target, so we didn't apply the diagnosis to itself as a predictor. Still about as clear as mud. Maybe a visual tree will help.
In [57]:
from io import StringIO  # sklearn.externals.six was removed from newer scikit-learn releases
import pydotplus
dt = tree.DecisionTreeClassifier()
dt = dt.fit(x, y)
with open("bc.dot", 'w') as f:
    tree.export_graphviz(dt, out_file=f)  # write the tree in Graphviz .dot format
In [59]:
import os
os.unlink('bc.dot')
In [60]:
dot_data = StringIO()
tree.export_graphviz(dt, out_file=dot_data) #brew install graphviz
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("bc.pdf")
Out[60]:
In [61]:
from IPython.display import IFrame
IFrame("bc.pdf", width=800, height=800)
Out[61]:
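If the graphviz + pydotplus toolchain is a hassle, a sketch of an alternative: newer scikit-learn releases (0.21+) include tree.plot_tree, which draws the fitted tree with matplotlib alone:
In [ ]:
import matplotlib.pyplot as plt

# plot_tree renders the same structure as the graphviz export,
# without needing the external graphviz binary.
plt.figure(figsize=(20, 10))
tree.plot_tree(dt, feature_names=bc.feature_names, class_names=bc.target_names, filled=True)
plt.show()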
So, I guess we did actually take in the data correctly, but the model was suspiciously accurate for the same reason as with the irises: the tree was fit on the full dataset before we ever split it, so every "test" sample had already been seen during training, and an unpruned decision tree will happily memorize all of it.
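As a sanity check on that explanation, a sketch of the honest version: split first, fit only on the training rows, and compare seen vs. held-out accuracy (dt_honest and the max_depth cap are my additions; the cap keeps the tree from simply memorizing the training set):
In [ ]:
# Honest evaluation on the breast cancer data: fit on the training
# split only, then compare train accuracy to held-out accuracy.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, train_size=0.75)
dt_honest = tree.DecisionTreeClassifier(max_depth=3)  # shallow tree: can't memorize every row
dt_honest = dt_honest.fit(x_train, y_train)
measure_performance(x_train, y_train, dt_honest)
measure_performance(x_test, y_test, dt_honest)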
In [ ]: