We covered a lot of information today and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments along with the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later)

1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)


In [1]:
from sklearn import datasets, tree, metrics
from sklearn.cross_validation import train_test_split
import numpy as np

dt = tree.DecisionTreeClassifier()

iris = datasets.load_iris()
x = iris.data[:,2:]
y = iris.target

In [2]:
# 50% - 50%

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)
dt = dt.fit(x_train,y_train)

y_pred=dt.predict(x_test)
print("50%-50%")
print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y_test, y_pred)),"\nClassification report:")
print(metrics.classification_report(y_test,y_pred),"\n")
print(metrics.confusion_matrix(y_test,y_pred),"\n")


50%-50%
Accuracy:0.947 
Classification report:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        24
          1       0.96      0.89      0.92        27
          2       0.88      0.96      0.92        24

avg / total       0.95      0.95      0.95        75
 

[[24  0  0]
 [ 0 24  3]
 [ 0  1 23]] 

2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be.


In [3]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.75,train_size=0.25)
dt = dt.fit(x_train,y_train)

In [4]:
y_pred=dt.predict(x_test)
print("75%-25%")
print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y_test, y_pred)),"\n")
print(metrics.classification_report(y_test,y_pred),"\nClassification report:")
print(metrics.confusion_matrix(y_test,y_pred),"\n")


75%-25%
Accuracy:0.965 

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        37
          1       0.93      0.98      0.95        41
          2       0.97      0.91      0.94        35

avg / total       0.97      0.96      0.96       113
 
Classification report:
[[37  0  0]
 [ 0 40  1]
 [ 0  3 32]] 


Comment

  • Maybe the 75-25 model is overfitting

  • Maybe reducing the test set increase the chances of having a high proportion of outliers in this set


3. Load the breast cancer dataset (datasets.load_breast_cancer()) and perform basic exploratory analysis. What attributes to we have? What are we trying to predict?

For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29


In [5]:
cancer = datasets.load_breast_cancer()

In [6]:
print("Here are the attributes we have:\n", cancer['DESCR'][1200:3057])


Here are the attributes we have:
                  Min     Max
    ===================================== ======= ========
    radius (mean):                         6.981   28.11
    texture (mean):                        9.71    39.28
    perimeter (mean):                      43.79   188.5
    area (mean):                           143.5   2501.0
    smoothness (mean):                     0.053   0.163
    compactness (mean):                    0.019   0.345
    concavity (mean):                      0.0     0.427
    concave points (mean):                 0.0     0.201
    symmetry (mean):                       0.106   0.304
    fractal dimension (mean):              0.05    0.097
    radius (standard error):               0.112   2.873
    texture (standard error):              0.36    4.885
    perimeter (standard error):            0.757   21.98
    area (standard error):                 6.802   542.2
    smoothness (standard error):           0.002   0.031
    compactness (standard error):          0.002   0.135
    concavity (standard error):            0.0     0.396
    concave points (standard error):       0.0     0.053
    symmetry (standard error):             0.008   0.079
    fractal dimension (standard error):    0.001   0.03
    radius (worst):                        7.93    36.04
    texture (worst):                       12.02   49.54
    perimeter (worst):                     50.41   251.2
    area (worst):                          185.2   4254.0
    smoothness (worst):                    0.071   0.223
    compactness (worst):                   0.027   1.058
    concavity (worst):                     0.0     1.252
    concave points (worst):                0.0     0.291
    symmetry (worst):                      0.156   0.664
    fractal dimension (worst):             0.055   0.208
    ===================================== ======= ========

In [7]:
x = cancer.data[:,2:] # the attributes
y = cancer.target # the target variable
example_data = [i for i in x[0]]
print("Here's the a sample of these 32 attributes (first data row):")
print(*example_data)
print("We're trying to predict if a subject has cancer or not. Here is a sample of the targets:", y[20:30])


Here's the a sample of these 32 attributes (first data row):
122.8 1001.0 0.1184 0.2776 0.3001 0.1471 0.2419 0.07871 1.095 0.9053 8.589 153.4 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.1189
We're trying to predict if a subject has cancer or not. Here is a sample of the targets: [1 1 0 0 0 0 0 0 0 0]

In [ ]:


In [ ]:

4. Using the breast cancer data, create a classifier to predict the type of seed. Perform the above hold out evaluation (50-50 and 75-25) and discuss the results.


In [8]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)
dt = dt.fit(x_train,y_train)

In [9]:
y_pred=dt.predict(x_test)
print("50%-50%")
print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y_test, y_pred)),"\nClassification report:")
print(metrics.classification_report(y_test,y_pred),"\n")
print(metrics.confusion_matrix(y_test,y_pred),"\n")


50%-50%
Accuracy:0.909 
Classification report:
             precision    recall  f1-score   support

          0       0.88      0.88      0.88       105
          1       0.93      0.93      0.93       180

avg / total       0.91      0.91      0.91       285
 

[[ 92  13]
 [ 13 167]] 


In [10]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.75,train_size=0.25)
dt = dt.fit(x_train,y_train)

In [11]:
y_pred=dt.predict(x_test)
print("75%-25%")
print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y_test, y_pred)),"\nClassification report:")
print(metrics.classification_report(y_test,y_pred),"\n")
print(metrics.confusion_matrix(y_test,y_pred),"\n")


75%-25%
Accuracy:0.916 
Classification report:
             precision    recall  f1-score   support

          0       0.89      0.89      0.89       161
          1       0.93      0.93      0.93       266

avg / total       0.92      0.92      0.92       427
 

[[143  18]
 [ 18 248]] 


In [ ]: