We covered a lot of information today, and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments alongside the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later).


In [4]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
from sklearn import datasets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated/removed; model_selection is the current module
from sklearn import tree

1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?).


In [5]:
iris = datasets.load_iris()

In [6]:
x = iris.data[:, 2:]  # attributes: petal length and petal width only
y = iris.target       # target variable: species (0 = setosa, 1 = versicolor, 2 = virginica)

In [7]:
dt = tree.DecisionTreeClassifier()

In [8]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)

In [9]:
dt = dt.fit(x_train,y_train)

In [10]:
from sklearn import metrics

In [11]:
import numpy as np

In [12]:
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")

In [13]:
measure_performance(x_test,y_test,dt)


Accuracy:0.947 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        27
          1       0.87      1.00      0.93        27
          2       1.00      0.81      0.89        21

avg / total       0.95      0.95      0.95        75
 

Confusion matrix
[[27  0  0]
 [ 0 27  0]
 [ 0  4 17]] 


In [14]:
#Accuracy: out of all predictions (true positives, false positives, true negatives, false negatives),
#how many were true positives or true negatives?
#The score of 0.947 means that 94.7 percent of the test cases were classified correctly,
#while 5.3 percent were false positives (said to be in a class when they were not) or
#false negatives (said not to be in a class when they were).

#Precision: of the cases predicted to be in a class (true positives + false positives),
#how many of those predictions were true positives?
#A score of 1 (classes 0 and 2) means every case predicted for that class really belonged to it.
#The score of 0.87 for class 1 means 87 percent of the cases classified as 1 were in class 1 (tp),
#while 13 percent of those classified as 1 were not in class 1 (fp: we said class 1 when it was not).

#Recall: of the cases actually in a class, how many did we predict correctly (true positives)?
#A score of 1 (classes 0 and 1) means every case actually in that class was predicted correctly.
#The score of 0.81 for class 2 means 81 percent of the actual class-2 cases were predicted as class 2,
#while 19 percent were missed (fn: we said it was not class 2 when it was).

#Confusion matrix: rows are the actual class (0, 1, 2),
#columns are the predicted class (0, 1, 2),
#so the matrix tells us that when the class was actually 2, we predicted class 1 four times.
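
As a sanity check on the interpretation above, the per-class precision and recall can be recomputed directly from the confusion matrix. This is a small sketch, not part of the assigned exercise; it reuses dt, x_test, and y_test from the cells above.


In [ ]:
# Recompute precision and recall per class from the confusion matrix
# (rows = actual class, columns = predicted class).
cm = metrics.confusion_matrix(y_test, dt.predict(x_test))
for k in range(cm.shape[0]):
    tp = float(cm[k, k])
    precision = tp / cm[:, k].sum()  # tp / (tp + fp): column k holds everything predicted as class k
    recall = tp / cm[k, :].sum()     # tp / (tp + fn): row k holds everything actually in class k
    print("class {}: precision {:.2f}, recall {:.2f}".format(k, precision, recall))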

2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be.


In [15]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)

In [16]:
dt = dt.fit(x_train,y_train)

In [17]:
measure_performance(x_test,y_test,dt)


Accuracy:0.921 

Classification report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        10
          1       0.89      0.94      0.91        17
          2       0.90      0.82      0.86        11

avg / total       0.92      0.92      0.92        38
 

Confusion matrix
[[10  0  0]
 [ 0 16  1]
 [ 0  2  9]] 


In [18]:
# The scores are a bit worse, even though the model saw more training data.
# Part of this is that the test set is much smaller (38 cases vs. 75), so the accuracy estimate is
# noisier and a few misclassifications swing it more; I also suspect the first model was
# overfitting the data, so its holdout score may have been somewhat flattering.
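
One way to check whether that drop is real or just noise from a single random split is to average over several splits with cross-validation. A minimal sketch (using sklearn.model_selection.cross_val_score on the same iris petal features; not required by the exercise):


In [ ]:
# Average decision-tree accuracy over 5 cross-validation folds to see how much a single split can vary.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree.DecisionTreeClassifier(), x, y, cv=5)
print("accuracy per fold:", np.round(scores, 3))
print("mean {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))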

3. Load the breast cancer dataset (datasets.load_breast_cancer()) and perform basic exploratory analysis. What attributes do we have? What are we trying to predict?

For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29


In [19]:
breast_cancer = datasets.load_breast_cancer()

In [20]:
breast_cancer['feature_names']


Out[20]:
array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'], 
      dtype='<U23')

In [21]:
breast_cancer['target']


Out[21]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
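
A bit more exploration before modelling (a sketch, assuming the standard sklearn copy of the dataset): the 30 numeric features are measurements of cell nuclei from digitized images, and the target is the diagnosis, with 0 = malignant and 1 = benign.


In [ ]:
# Put the features in a DataFrame and check the class balance.
bc_df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
print(bc_df.shape)                        # expected (569, 30)
print(breast_cancer.target_names)         # ['malignant' 'benign'] -> 0 = malignant, 1 = benign
print(np.bincount(breast_cancer.target))  # number of cases in each class
bc_df.describe()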

4. Using the breast cancer data, create a classifier to predict the diagnosis (malignant vs. benign). Perform the above holdout evaluation (50-50 and 75-25) and discuss the results.


In [22]:
x = breast_cancer.data[:, :]  # all 30 attributes
y = breast_cancer.target      # diagnosis: 0 = malignant, 1 = benign

In [23]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)

In [24]:
dt = dt.fit(x_train,y_train)

In [25]:
measure_performance(x_test,y_test,dt)


Accuracy:0.919 

Classification report
             precision    recall  f1-score   support

          0       0.93      0.86      0.89       111
          1       0.91      0.96      0.94       174

avg / total       0.92      0.92      0.92       285
 

Confusion matrix
[[ 95  16]
 [  7 167]] 


In [26]:
# note: in this dataset class 0 is malignant and class 1 is benign (see target_names above)
# when actually malignant (0), 16 cases were predicted benign when they were not
# when actually benign (1), 7 cases were predicted malignant when they were not
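
To keep the class labels straight, the confusion matrix can be printed with the actual class names attached. A quick sketch reusing dt, x_test, and y_test from the 50-50 split above:


In [ ]:
# Label the confusion matrix rows/columns with the dataset's class names.
cm = metrics.confusion_matrix(y_test, dt.predict(x_test))
print(pd.DataFrame(cm,
                   index=["actual " + n for n in breast_cancer.target_names],
                   columns=["predicted " + n for n in breast_cancer.target_names]))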

In [27]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)

In [28]:
dt = dt.fit(x_train,y_train)

In [29]:
measure_performance(x_test,y_test,dt)


Accuracy:0.965 

Classification report
             precision    recall  f1-score   support

          0       0.98      0.93      0.95        56
          1       0.96      0.99      0.97        87

avg / total       0.97      0.97      0.96       143
 

Confusion matrix
[[52  4]
 [ 1 86]] 


In [ ]:
# The model is better with the 75-25 split.
# When the actual case was malignant (0), it was predicted benign 4 times;
# when the actual case was benign (1), it was predicted malignant once.

# Perhaps this is because the data set is larger, so a 75-25 split gives the tree more cases to
# learn from while still leaving a reasonably sized test set; the best training/testing split
# depends in part on how much data there is overall.
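
To probe that last idea, one could vary the test fraction on the breast cancer data and watch the holdout accuracy. A rough sketch (exact numbers will differ from run to run and from the cells above, since those splits were not seeded):


In [ ]:
# Compare holdout accuracy for several test-set fractions on the breast cancer features.
for frac in (0.5, 0.4, 0.3, 0.25, 0.2):
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=frac, random_state=0)
    clf = tree.DecisionTreeClassifier(random_state=0).fit(x_tr, y_tr)
    print("test_size={:.2f}: accuracy {:.3f}".format(frac, clf.score(x_te, y_te)))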