In [7]:
from sklearn import datasets
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
In [13]:
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")
In [9]:
iris = datasets.load_iris()
x = iris.data[:, 2:]  # keep only the petal length and petal width columns
y = iris.target
dt = tree.DecisionTreeClassifier()
dt = dt.fit(x,y)
In [ ]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.50,train_size=0.50)
In [15]:
measure_performance(x_test,y_test,dt)
In [14]:
measure_performance(x_train,y_train,dt)
Well, we misclassified one item, putting a sample of the second species into the third species' pile. 98.7% accuracy seems pretty good to me, though.
In [16]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)
In [17]:
measure_performance(x_train,y_train,dt)
We still only missed one, making the exact same mistake. There must be a heavily overlapping region between these two species that confuses the model. Either way, we got the same number of errors with more training data, which seems pretty bad. I wonder how much we can cut the training set before it makes a difference in accuracy?
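To eyeball that overlap claim, here is a minimal sketch (assuming matplotlib is installed; it isn't imported anywhere above) that scatters the two petal columns we kept, colored by species:
In [ ]:
import matplotlib.pyplot as plt

# x holds only columns 2: of iris.data, i.e. petal length and petal width.
# Versicolor and virginica should blend together near the class boundary.
for label, name in enumerate(iris.target_names):
    mask = (y == label)
    plt.scatter(x[mask, 0], x[mask, 1], label=name)
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend()
plt.show()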
In [28]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.95,train_size=0.05)
measure_performance(x_test,y_test, dt)
Weird. Small dataset, I guess? Actually, on reflection: dt was fit on the full dataset back at the top and we never refit it, so every "test" point here was already seen during training. The split sizes can't change anything until the fit happens after the split.
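For comparison, a sketch of the same experiment with the fit moved after the split, so the score reflects genuinely unseen flowers (dt_holdout is my name, to avoid clobbering the dt above):
In [ ]:
# Split first, then fit on the training portion only; now the
# test accuracy is measured on flowers the tree has never seen.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.95, train_size=0.05)
dt_holdout = tree.DecisionTreeClassifier()
dt_holdout = dt_holdout.fit(x_train, y_train)
measure_performance(x_test, y_test, dt_holdout)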
Load the breast cancer dataset (datasets.load_breast_cancer()) and perform basic exploratory analysis. What attributes do we have? What are we trying to predict? For context on the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Attribute Information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
In [44]:
bc = datasets.load_breast_cancer()
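Before modeling, the basic exploration the prompt asks for; a quick sketch using nothing beyond the standard attributes of the sklearn bunch:
In [ ]:
# The bunch stores the 30 numeric nucleus features in .data and the
# diagnosis in .target (0 = malignant, 1 = benign); the raw UCI file's
# ID column is not carried over.
print(bc.data.shape)
print(bc.feature_names)
print(bc.target_names)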
In [47]:
x = bc.data  # unlike the raw UCI file, .data holds only the 30 features; no ID or diagnosis columns to slice off
y = bc.target
dt = tree.DecisionTreeClassifier()
dt = dt.fit(x,y)
In [50]:
dt
Out[50]:
In [51]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.50,train_size=0.50)
In [52]:
measure_performance(x_train,y_train,dt)
In [53]:
measure_performance(x_test,y_test,dt)
In [54]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)
In [55]:
measure_performance(x_train,y_train,dt)
Going to be honest, I barely have any idea what I'm looking at here. We are trying to predict the malignancy of cancer cells, but I mostly just copied what we did in class and applied it here. For what it's worth, bc.data holds only the 30 numeric features (the mean, standard error, and worst value of each of the ten measurements above), and the diagnosis lives separately in bc.target, so we didn't apply the diagnosis to itself as a predictor. Still about as clear as mud. Maybe a visual tree will help.
In [57]:
from io import StringIO  # sklearn.externals.six was removed from newer scikit-learn releases
import pydotplus
dt = tree.DecisionTreeClassifier()
dt = dt.fit(x, y)
with open("bc.dot", 'w') as f:
    tree.export_graphviz(dt, out_file=f)  # write the tree in Graphviz .dot format
In [59]:
import os
os.unlink('bc.dot')
In [60]:
dot_data = StringIO()
tree.export_graphviz(dt, out_file=dot_data) #brew install graphviz
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("bc.pdf")
Out[60]:
In [61]:
from IPython.display import IFrame
IFrame("bc.pdf", width=800, height=800)
Out[61]:
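If the graphviz + pydotplus toolchain is a hassle, a sketch of an alternative: newer scikit-learn releases (0.21+) include tree.plot_tree, which draws the fitted tree with matplotlib alone:
In [ ]:
import matplotlib.pyplot as plt

# plot_tree renders the same structure as the graphviz export,
# without needing the external graphviz binary.
plt.figure(figsize=(20, 10))
tree.plot_tree(dt, feature_names=bc.feature_names, class_names=bc.target_names, filled=True)
plt.show()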
So, I guess we did actually take in the data correctly, but the model was suspiciously accurate for the same reason as with the irises: the tree was fit on the full dataset before we ever split it, so every "test" sample had already been seen during training, and an unpruned decision tree will happily memorize all of it.
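As a sanity check on that explanation, a sketch of the honest version: split first, fit only on the training rows, and compare seen vs. held-out accuracy (dt_honest and the max_depth cap are my additions; the cap keeps the tree from simply memorizing the training set):
In [ ]:
# Honest evaluation on the breast cancer data: fit on the training
# split only, then compare train accuracy to held-out accuracy.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, train_size=0.75)
dt_honest = tree.DecisionTreeClassifier(max_depth=3)  # shallow tree: can't memorize every row
dt_honest = dt_honest.fit(x_train, y_train)
measure_performance(x_train, y_train, dt_honest)
measure_performance(x_test, y_test, dt_honest)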
In [ ]: