We covered a lot of material today, and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments alongside the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later).
In [19]:
import pandas as pd
%matplotlib inline
import numpy as np
import pydotplus
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from io import StringIO
from sklearn import datasets
from sklearn import tree
from sklearn import metrics
from sklearn.model_selection import train_test_split
In [20]:
iris_info = datasets.load_iris()
iris_info
Out[20]:
{'DESCR': 'Iris Plants Database\n\nNotes\n-----\nData Set Characteristics:\n :Number of Instances: 150 (50 in each of three classes)\n :Number of Attributes: 4 numeric, predictive attributes and the class\n :Attribute Information:\n - sepal length in cm\n - sepal width in cm\n - petal length in cm\n - petal width in cm\n - class:\n - Iris-Setosa\n - Iris-Versicolour\n - Iris-Virginica\n :Summary Statistics:\n\n ============== ==== ==== ======= ===== ====================\n Min Max Mean SD Class Correlation\n ============== ==== ==== ======= ===== ====================\n sepal length: 4.3 7.9 5.84 0.83 0.7826\n sepal width: 2.0 4.4 3.05 0.43 -0.4194\n petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)\n petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)\n ============== ==== ==== ======= ===== ====================\n\n :Missing Attribute Values: None\n :Class Distribution: 33.3% for each of 3 classes.\n :Creator: R.A. Fisher\n :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n :Date: July, 1988\n\nThis is a copy of UCI ML iris datasets.\nhttp://archive.ics.uci.edu/ml/datasets/Iris\n\nThe famous Iris database, first used by Sir R.A Fisher\n\nThis is perhaps the best known database to be found in the\npattern recognition literature. Fisher\'s paper is a classic in the field and\nis referenced frequently to this day. (See Duda & Hart, for example.) The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant. One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\nReferences\n----------\n - Fisher,R.A. "The use of multiple measurements in taxonomic problems"\n Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n Mathematical Statistics" (John Wiley, NY, 1950).\n - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.\n - Dasarathy, B.V. 
(1980) "Nosing Around the Neighborhood: A New System\n Structure and Classification Rule for Recognition in Partially Exposed\n Environments". IEEE Transactions on Pattern Analysis and Machine\n Intelligence, Vol. PAMI-2, No. 1, 67-71.\n - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions\n on Information Theory, May 1972, 431-433.\n - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II\n conceptual clustering system finds 3 classes in the data.\n - Many, many more ...\n',
'data': array([[ 5.1, 3.5, 1.4, 0.2],
[ 4.9, 3. , 1.4, 0.2],
[ 4.7, 3.2, 1.3, 0.2],
[ 4.6, 3.1, 1.5, 0.2],
[ 5. , 3.6, 1.4, 0.2],
[ 5.4, 3.9, 1.7, 0.4],
[ 4.6, 3.4, 1.4, 0.3],
[ 5. , 3.4, 1.5, 0.2],
[ 4.4, 2.9, 1.4, 0.2],
[ 4.9, 3.1, 1.5, 0.1],
[ 5.4, 3.7, 1.5, 0.2],
[ 4.8, 3.4, 1.6, 0.2],
[ 4.8, 3. , 1.4, 0.1],
[ 4.3, 3. , 1.1, 0.1],
[ 5.8, 4. , 1.2, 0.2],
[ 5.7, 4.4, 1.5, 0.4],
[ 5.4, 3.9, 1.3, 0.4],
[ 5.1, 3.5, 1.4, 0.3],
[ 5.7, 3.8, 1.7, 0.3],
[ 5.1, 3.8, 1.5, 0.3],
[ 5.4, 3.4, 1.7, 0.2],
[ 5.1, 3.7, 1.5, 0.4],
[ 4.6, 3.6, 1. , 0.2],
[ 5.1, 3.3, 1.7, 0.5],
[ 4.8, 3.4, 1.9, 0.2],
[ 5. , 3. , 1.6, 0.2],
[ 5. , 3.4, 1.6, 0.4],
[ 5.2, 3.5, 1.5, 0.2],
[ 5.2, 3.4, 1.4, 0.2],
[ 4.7, 3.2, 1.6, 0.2],
[ 4.8, 3.1, 1.6, 0.2],
[ 5.4, 3.4, 1.5, 0.4],
[ 5.2, 4.1, 1.5, 0.1],
[ 5.5, 4.2, 1.4, 0.2],
[ 4.9, 3.1, 1.5, 0.1],
[ 5. , 3.2, 1.2, 0.2],
[ 5.5, 3.5, 1.3, 0.2],
[ 4.9, 3.1, 1.5, 0.1],
[ 4.4, 3. , 1.3, 0.2],
[ 5.1, 3.4, 1.5, 0.2],
[ 5. , 3.5, 1.3, 0.3],
[ 4.5, 2.3, 1.3, 0.3],
[ 4.4, 3.2, 1.3, 0.2],
[ 5. , 3.5, 1.6, 0.6],
[ 5.1, 3.8, 1.9, 0.4],
[ 4.8, 3. , 1.4, 0.3],
[ 5.1, 3.8, 1.6, 0.2],
[ 4.6, 3.2, 1.4, 0.2],
[ 5.3, 3.7, 1.5, 0.2],
[ 5. , 3.3, 1.4, 0.2],
[ 7. , 3.2, 4.7, 1.4],
[ 6.4, 3.2, 4.5, 1.5],
[ 6.9, 3.1, 4.9, 1.5],
[ 5.5, 2.3, 4. , 1.3],
[ 6.5, 2.8, 4.6, 1.5],
[ 5.7, 2.8, 4.5, 1.3],
[ 6.3, 3.3, 4.7, 1.6],
[ 4.9, 2.4, 3.3, 1. ],
[ 6.6, 2.9, 4.6, 1.3],
[ 5.2, 2.7, 3.9, 1.4],
[ 5. , 2. , 3.5, 1. ],
[ 5.9, 3. , 4.2, 1.5],
[ 6. , 2.2, 4. , 1. ],
[ 6.1, 2.9, 4.7, 1.4],
[ 5.6, 2.9, 3.6, 1.3],
[ 6.7, 3.1, 4.4, 1.4],
[ 5.6, 3. , 4.5, 1.5],
[ 5.8, 2.7, 4.1, 1. ],
[ 6.2, 2.2, 4.5, 1.5],
[ 5.6, 2.5, 3.9, 1.1],
[ 5.9, 3.2, 4.8, 1.8],
[ 6.1, 2.8, 4. , 1.3],
[ 6.3, 2.5, 4.9, 1.5],
[ 6.1, 2.8, 4.7, 1.2],
[ 6.4, 2.9, 4.3, 1.3],
[ 6.6, 3. , 4.4, 1.4],
[ 6.8, 2.8, 4.8, 1.4],
[ 6.7, 3. , 5. , 1.7],
[ 6. , 2.9, 4.5, 1.5],
[ 5.7, 2.6, 3.5, 1. ],
[ 5.5, 2.4, 3.8, 1.1],
[ 5.5, 2.4, 3.7, 1. ],
[ 5.8, 2.7, 3.9, 1.2],
[ 6. , 2.7, 5.1, 1.6],
[ 5.4, 3. , 4.5, 1.5],
[ 6. , 3.4, 4.5, 1.6],
[ 6.7, 3.1, 4.7, 1.5],
[ 6.3, 2.3, 4.4, 1.3],
[ 5.6, 3. , 4.1, 1.3],
[ 5.5, 2.5, 4. , 1.3],
[ 5.5, 2.6, 4.4, 1.2],
[ 6.1, 3. , 4.6, 1.4],
[ 5.8, 2.6, 4. , 1.2],
[ 5. , 2.3, 3.3, 1. ],
[ 5.6, 2.7, 4.2, 1.3],
[ 5.7, 3. , 4.2, 1.2],
[ 5.7, 2.9, 4.2, 1.3],
[ 6.2, 2.9, 4.3, 1.3],
[ 5.1, 2.5, 3. , 1.1],
[ 5.7, 2.8, 4.1, 1.3],
[ 6.3, 3.3, 6. , 2.5],
[ 5.8, 2.7, 5.1, 1.9],
[ 7.1, 3. , 5.9, 2.1],
[ 6.3, 2.9, 5.6, 1.8],
[ 6.5, 3. , 5.8, 2.2],
[ 7.6, 3. , 6.6, 2.1],
[ 4.9, 2.5, 4.5, 1.7],
[ 7.3, 2.9, 6.3, 1.8],
[ 6.7, 2.5, 5.8, 1.8],
[ 7.2, 3.6, 6.1, 2.5],
[ 6.5, 3.2, 5.1, 2. ],
[ 6.4, 2.7, 5.3, 1.9],
[ 6.8, 3. , 5.5, 2.1],
[ 5.7, 2.5, 5. , 2. ],
[ 5.8, 2.8, 5.1, 2.4],
[ 6.4, 3.2, 5.3, 2.3],
[ 6.5, 3. , 5.5, 1.8],
[ 7.7, 3.8, 6.7, 2.2],
[ 7.7, 2.6, 6.9, 2.3],
[ 6. , 2.2, 5. , 1.5],
[ 6.9, 3.2, 5.7, 2.3],
[ 5.6, 2.8, 4.9, 2. ],
[ 7.7, 2.8, 6.7, 2. ],
[ 6.3, 2.7, 4.9, 1.8],
[ 6.7, 3.3, 5.7, 2.1],
[ 7.2, 3.2, 6. , 1.8],
[ 6.2, 2.8, 4.8, 1.8],
[ 6.1, 3. , 4.9, 1.8],
[ 6.4, 2.8, 5.6, 2.1],
[ 7.2, 3. , 5.8, 1.6],
[ 7.4, 2.8, 6.1, 1.9],
[ 7.9, 3.8, 6.4, 2. ],
[ 6.4, 2.8, 5.6, 2.2],
[ 6.3, 2.8, 5.1, 1.5],
[ 6.1, 2.6, 5.6, 1.4],
[ 7.7, 3. , 6.1, 2.3],
[ 6.3, 3.4, 5.6, 2.4],
[ 6.4, 3.1, 5.5, 1.8],
[ 6. , 3. , 4.8, 1.8],
[ 6.9, 3.1, 5.4, 2.1],
[ 6.7, 3.1, 5.6, 2.4],
[ 6.9, 3.1, 5.1, 2.3],
[ 5.8, 2.7, 5.1, 1.9],
[ 6.8, 3.2, 5.9, 2.3],
[ 6.7, 3.3, 5.7, 2.5],
[ 6.7, 3. , 5.2, 2.3],
[ 6.3, 2.5, 5. , 1.9],
[ 6.5, 3. , 5.2, 2. ],
[ 6.2, 3.4, 5.4, 2.3],
[ 5.9, 3. , 5.1, 1.8]]),
'feature_names': ['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)'],
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
'target_names': array(['setosa', 'versicolor', 'virginica'],
dtype='<U10')}
In [21]:
x = iris_info.data[:, 2:]  # keep only petal length and petal width
y = iris_info.target
In [22]:
x
Out[22]:
array([[ 1.4, 0.2],
[ 1.4, 0.2],
[ 1.3, 0.2],
[ 1.5, 0.2],
[ 1.4, 0.2],
[ 1.7, 0.4],
[ 1.4, 0.3],
[ 1.5, 0.2],
[ 1.4, 0.2],
[ 1.5, 0.1],
[ 1.5, 0.2],
[ 1.6, 0.2],
[ 1.4, 0.1],
[ 1.1, 0.1],
[ 1.2, 0.2],
[ 1.5, 0.4],
[ 1.3, 0.4],
[ 1.4, 0.3],
[ 1.7, 0.3],
[ 1.5, 0.3],
[ 1.7, 0.2],
[ 1.5, 0.4],
[ 1. , 0.2],
[ 1.7, 0.5],
[ 1.9, 0.2],
[ 1.6, 0.2],
[ 1.6, 0.4],
[ 1.5, 0.2],
[ 1.4, 0.2],
[ 1.6, 0.2],
[ 1.6, 0.2],
[ 1.5, 0.4],
[ 1.5, 0.1],
[ 1.4, 0.2],
[ 1.5, 0.1],
[ 1.2, 0.2],
[ 1.3, 0.2],
[ 1.5, 0.1],
[ 1.3, 0.2],
[ 1.5, 0.2],
[ 1.3, 0.3],
[ 1.3, 0.3],
[ 1.3, 0.2],
[ 1.6, 0.6],
[ 1.9, 0.4],
[ 1.4, 0.3],
[ 1.6, 0.2],
[ 1.4, 0.2],
[ 1.5, 0.2],
[ 1.4, 0.2],
[ 4.7, 1.4],
[ 4.5, 1.5],
[ 4.9, 1.5],
[ 4. , 1.3],
[ 4.6, 1.5],
[ 4.5, 1.3],
[ 4.7, 1.6],
[ 3.3, 1. ],
[ 4.6, 1.3],
[ 3.9, 1.4],
[ 3.5, 1. ],
[ 4.2, 1.5],
[ 4. , 1. ],
[ 4.7, 1.4],
[ 3.6, 1.3],
[ 4.4, 1.4],
[ 4.5, 1.5],
[ 4.1, 1. ],
[ 4.5, 1.5],
[ 3.9, 1.1],
[ 4.8, 1.8],
[ 4. , 1.3],
[ 4.9, 1.5],
[ 4.7, 1.2],
[ 4.3, 1.3],
[ 4.4, 1.4],
[ 4.8, 1.4],
[ 5. , 1.7],
[ 4.5, 1.5],
[ 3.5, 1. ],
[ 3.8, 1.1],
[ 3.7, 1. ],
[ 3.9, 1.2],
[ 5.1, 1.6],
[ 4.5, 1.5],
[ 4.5, 1.6],
[ 4.7, 1.5],
[ 4.4, 1.3],
[ 4.1, 1.3],
[ 4. , 1.3],
[ 4.4, 1.2],
[ 4.6, 1.4],
[ 4. , 1.2],
[ 3.3, 1. ],
[ 4.2, 1.3],
[ 4.2, 1.2],
[ 4.2, 1.3],
[ 4.3, 1.3],
[ 3. , 1.1],
[ 4.1, 1.3],
[ 6. , 2.5],
[ 5.1, 1.9],
[ 5.9, 2.1],
[ 5.6, 1.8],
[ 5.8, 2.2],
[ 6.6, 2.1],
[ 4.5, 1.7],
[ 6.3, 1.8],
[ 5.8, 1.8],
[ 6.1, 2.5],
[ 5.1, 2. ],
[ 5.3, 1.9],
[ 5.5, 2.1],
[ 5. , 2. ],
[ 5.1, 2.4],
[ 5.3, 2.3],
[ 5.5, 1.8],
[ 6.7, 2.2],
[ 6.9, 2.3],
[ 5. , 1.5],
[ 5.7, 2.3],
[ 4.9, 2. ],
[ 6.7, 2. ],
[ 4.9, 1.8],
[ 5.7, 2.1],
[ 6. , 1.8],
[ 4.8, 1.8],
[ 4.9, 1.8],
[ 5.6, 2.1],
[ 5.8, 1.6],
[ 6.1, 1.9],
[ 6.4, 2. ],
[ 5.6, 2.2],
[ 5.1, 1.5],
[ 5.6, 1.4],
[ 6.1, 2.3],
[ 5.6, 2.4],
[ 5.5, 1.8],
[ 4.8, 1.8],
[ 5.4, 2.1],
[ 5.6, 2.4],
[ 5.1, 2.3],
[ 5.1, 1.9],
[ 5.9, 2.3],
[ 5.7, 2.5],
[ 5.2, 2.3],
[ 5. , 1.9],
[ 5.2, 2. ],
[ 5.4, 2.3],
[ 5.1, 1.8]])
In [23]:
y
Out[23]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [24]:
iris_info['feature_names']
Out[24]:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
In [25]:
dt = tree.DecisionTreeClassifier()
In [78]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)
In [79]:
dt = dt.fit(x_train,y_train)
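Once a tree is fitted, it helps to look at the rules it learned before scoring it. The following is a minimal sketch, not part of the original exercise: it assumes a scikit-learn version that provides `tree.export_text`, and the names `iris_data`, `X_petal`, `y_species`, `clf`, and the fixed `random_state=0` are illustrative choices rather than the notebook's own.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

# Load iris and keep only the two petal features, as in the cells above.
iris_data = datasets.load_iris()
X_petal = iris_data.data[:, 2:]
y_species = iris_data.target

# random_state is an assumption added here so the split is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_petal, y_species, test_size=0.5, random_state=0)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Render the learned splits as indented text rules.
rules = tree.export_text(clf, feature_names=iris_data.feature_names[2:])
print(rules)
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```

Reading the printed rules alongside the confusion matrix makes it easier to see which petal thresholds separate versicolor from virginica.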
In [80]:
def output(X, y, clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    """Print the selected evaluation metrics for classifier clf on (X, y)."""
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")
In [34]:
output(x_train,y_train,dt)
Accuracy:0.987
Classification report
             precision    recall  f1-score   support
          0       1.00      1.00      1.00        24
          1       0.96      1.00      0.98        27
          2       1.00      0.96      0.98        24
avg / total       0.99      0.99      0.99        75
Confusion matrix
[[24 0 0]
[ 0 27 0]
[ 0 1 23]]
In [35]:
output(x_test,y_test,dt)
Accuracy:0.933
Classification report
             precision    recall  f1-score   support
          0       1.00      1.00      1.00        26
          1       0.88      0.91      0.89        23
          2       0.92      0.88      0.90        26
avg / total       0.93      0.93      0.93        75
Confusion matrix
[[26 0 0]
[ 0 21 2]
[ 0 3 23]]
In [38]:
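# Re-split 75/25. In the cells shown, dt is not refit after this split, so the
# reports below score the tree that was fitted on the earlier 50/50 split.
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)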
In [39]:
output(x_train,y_train,dt)
Accuracy:0.946
Classification report
             precision    recall  f1-score   support
          0       1.00      1.00      1.00        42
          1       0.89      0.94      0.91        33
          2       0.94      0.89      0.92        37
avg / total       0.95      0.95      0.95       112
Confusion matrix
[[42 0 0]
[ 0 31 2]
[ 0 4 33]]
In [40]:
output(x_test,y_test,dt)
Accuracy:1.000
Classification report
             precision    recall  f1-score   support
          0       1.00      1.00      1.00         8
          1       1.00      1.00      1.00        17
          2       1.00      1.00      1.00        13
avg / total       1.00      1.00      1.00        38
Confusion matrix
[[ 8 0 0]
[ 0 17 0]
[ 0 0 13]]
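The cells above change the split from 50/50 to 75/25 but do not show the tree being fitted again, so the second pair of reports may mix rows the tree has already seen into the "test" set. A hedged sketch of one way to compare the two split sizes fairly is to refit inside a loop. The names `scores`, `X_petal`, `y_species`, and the use of `stratify` and `random_state=0` are assumptions for illustration, not the notebook's settings.

```python
from sklearn import datasets, metrics, tree
from sklearn.model_selection import train_test_split

iris_data = datasets.load_iris()
X_petal, y_species = iris_data.data[:, 2:], iris_data.target

scores = {}
for test_size in (0.5, 0.25):
    # Stratify so each class keeps the same proportion in both parts.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_petal, y_species, test_size=test_size,
        stratify=y_species, random_state=0)
    # Fit a fresh tree on this split's training rows only.
    clf = tree.DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores[test_size] = (
        metrics.accuracy_score(y_tr, clf.predict(X_tr)),  # training accuracy
        metrics.accuracy_score(y_te, clf.predict(X_te)),  # held-out accuracy
    )
    print(test_size, scores[test_size])
```

With a refit per split, the held-out accuracy is measured only on rows the tree never saw during training.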
In [48]:
breast_cancer_data = datasets.load_breast_cancer()
In [49]:
type(breast_cancer_data)
Out[49]:
sklearn.datasets.base.Bunch
In [50]:
breast_cancer_data
Out[50]:
{'DESCR': 'Breast Cancer Wisconsin (Diagnostic) Database\n\nNotes\n-----\nData Set Characteristics:\n :Number of Instances: 569\n\n :Number of Attributes: 30 numeric, predictive attributes and the class\n\n :Attribute Information:\n - radius (mean of distances from center to points on the perimeter)\n - texture (standard deviation of gray-scale values)\n - perimeter\n - area\n - smoothness (local variation in radius lengths)\n - compactness (perimeter^2 / area - 1.0)\n - concavity (severity of concave portions of the contour)\n - concave points (number of concave portions of the contour)\n - symmetry \n - fractal dimension ("coastline approximation" - 1)\n \n The mean, standard error, and "worst" or largest (mean of the three\n largest values) of these features were computed for each image,\n resulting in 30 features. For instance, field 3 is Mean Radius, field\n 13 is Radius SE, field 23 is Worst Radius.\n \n - class:\n - WDBC-Malignant\n - WDBC-Benign\n\n :Summary Statistics:\n\n ===================================== ======= ========\n Min Max\n ===================================== ======= ========\n radius (mean): 6.981 28.11\n texture (mean): 9.71 39.28\n perimeter (mean): 43.79 188.5\n area (mean): 143.5 2501.0\n smoothness (mean): 0.053 0.163\n compactness (mean): 0.019 0.345\n concavity (mean): 0.0 0.427\n concave points (mean): 0.0 0.201\n symmetry (mean): 0.106 0.304\n fractal dimension (mean): 0.05 0.097\n radius (standard error): 0.112 2.873\n texture (standard error): 0.36 4.885\n perimeter (standard error): 0.757 21.98\n area (standard error): 6.802 542.2\n smoothness (standard error): 0.002 0.031\n compactness (standard error): 0.002 0.135\n concavity (standard error): 0.0 0.396\n concave points (standard error): 0.0 0.053\n symmetry (standard error): 0.008 0.079\n fractal dimension (standard error): 0.001 0.03\n radius (worst): 7.93 36.04\n texture (worst): 12.02 49.54\n perimeter (worst): 50.41 251.2\n area (worst): 185.2 4254.0\n smoothness 
(worst): 0.071 0.223\n compactness (worst): 0.027 1.058\n concavity (worst): 0.0 1.252\n concave points (worst): 0.0 0.291\n symmetry (worst): 0.156 0.664\n fractal dimension (worst): 0.055 0.208\n ===================================== ======= ========\n\n :Missing Attribute Values: None\n\n :Class Distribution: 212 - Malignant, 357 - Benign\n\n :Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian\n\n :Donor: Nick Street\n\n :Date: November, 1995\n\nThis is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.\nhttps://goo.gl/U2Uwz2\n\nFeatures are computed from a digitized image of a fine needle\naspirate (FNA) of a breast mass. They describe\ncharacteristics of the cell nuclei present in the image.\nA few of the images can be found at\nhttp://www.cs.wisc.edu/~street/images/\n\nSeparating plane described above was obtained using\nMultisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree\nConstruction Via Linear Programming." Proceedings of the 4th\nMidwest Artificial Intelligence and Cognitive Science Society,\npp. 97-101, 1992], a classification method which uses linear\nprogramming to construct a decision tree. Relevant features\nwere selected using an exhaustive search in the space of 1-4\nfeatures and 1-3 separating planes.\n\nThe actual linear program used to obtain the separating plane\nin the 3-dimensional space is that described in:\n[K. P. Bennett and O. L. Mangasarian: "Robust Linear\nProgramming Discrimination of Two Linearly Inseparable Sets",\nOptimization Methods and Software 1, 1992, 23-34].\n\nThis database is also available through the UW CS ftp server:\n\nftp ftp.cs.wisc.edu\ncd math-prog/cpo-dataset/machine-learn/WDBC/\n\nReferences\n----------\n - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction \n for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on \n Electronic Imaging: Science and Technology, volume 1905, pages 861-870, \n San Jose, CA, 1993. \n - O.L. 
Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and \n prognosis via linear programming. Operations Research, 43(4), pages 570-577, \n July-August 1995.\n - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques\n to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) \n 163-171.\n',
'data': array([[ 1.79900000e+01, 1.03800000e+01, 1.22800000e+02, ...,
2.65400000e-01, 4.60100000e-01, 1.18900000e-01],
[ 2.05700000e+01, 1.77700000e+01, 1.32900000e+02, ...,
1.86000000e-01, 2.75000000e-01, 8.90200000e-02],
[ 1.96900000e+01, 2.12500000e+01, 1.30000000e+02, ...,
2.43000000e-01, 3.61300000e-01, 8.75800000e-02],
...,
[ 1.66000000e+01, 2.80800000e+01, 1.08300000e+02, ...,
1.41800000e-01, 2.21800000e-01, 7.82000000e-02],
[ 2.06000000e+01, 2.93300000e+01, 1.40100000e+02, ...,
2.65000000e-01, 4.08700000e-01, 1.24000000e-01],
[ 7.76000000e+00, 2.45400000e+01, 4.79200000e+01, ...,
0.00000000e+00, 2.87100000e-01, 7.03900000e-02]]),
'feature_names': array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension'],
dtype='<U23'),
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1,
1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1,
0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1,
1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]),
'target_names': array(['malignant', 'benign'],
dtype='<U9')}
In [51]:
breast_cancer_data['feature_names']
Out[51]:
array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension'],
dtype='<U23')
In [52]:
print(type(breast_cancer_data.data))
<class 'numpy.ndarray'>
In [53]:
print(breast_cancer_data.target_names)
['malignant' 'benign']
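The order of `target_names` tells us that label 0 is malignant and label 1 is benign. A small sketch (the variable names `cancer` and `counts` are illustrative) counts the rows per label so the class balance can be checked against the "Class Distribution" line in the dataset description.

```python
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# bincount tallies how many rows carry each integer label (0, 1).
counts = np.bincount(cancer.target)
for name, count in zip(cancer.target_names, counts):
    print(name, count)
```

Knowing that the classes are imbalanced is useful later when reading precision and recall per class.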
In [54]:
print(breast_cancer_data.DESCR)
Breast Cancer Wisconsin (Diagnostic) Database
Notes
-----
Data Set Characteristics:
:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
:Attribute Information:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
- class:
- WDBC-Malignant
- WDBC-Benign
:Summary Statistics:
===================================== ======= ========
Min Max
===================================== ======= ========
radius (mean): 6.981 28.11
texture (mean): 9.71 39.28
perimeter (mean): 43.79 188.5
area (mean): 143.5 2501.0
smoothness (mean): 0.053 0.163
compactness (mean): 0.019 0.345
concavity (mean): 0.0 0.427
concave points (mean): 0.0 0.201
symmetry (mean): 0.106 0.304
fractal dimension (mean): 0.05 0.097
radius (standard error): 0.112 2.873
texture (standard error): 0.36 4.885
perimeter (standard error): 0.757 21.98
area (standard error): 6.802 542.2
smoothness (standard error): 0.002 0.031
compactness (standard error): 0.002 0.135
concavity (standard error): 0.0 0.396
concave points (standard error): 0.0 0.053
symmetry (standard error): 0.008 0.079
fractal dimension (standard error): 0.001 0.03
radius (worst): 7.93 36.04
texture (worst): 12.02 49.54
perimeter (worst): 50.41 251.2
area (worst): 185.2 4254.0
smoothness (worst): 0.071 0.223
compactness (worst): 0.027 1.058
concavity (worst): 0.0 1.252
concave points (worst): 0.0 0.291
symmetry (worst): 0.156 0.664
fractal dimension (worst): 0.055 0.208
===================================== ======= ========
:Missing Attribute Values: None
:Class Distribution: 212 - Malignant, 357 - Benign
:Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
:Donor: Nick Street
:Date: November, 1995
This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2
Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
A few of the images can be found at
http://www.cs.wisc.edu/~street/images/
Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.
The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/
References
----------
- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
San Jose, CA, 1993.
- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
prognosis via linear programming. Operations Research, 43(4), pages 570-577,
July-August 1995.
- W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
163-171.
In [56]:
breast_cancer_df = pd.DataFrame(breast_cancer_data.data, columns=breast_cancer_data['feature_names'])
breast_cancer_df
Out[56]:
|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| 0 | 17.990 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.300100 | 0.147100 | 0.2419 | 0.07871 | ... | 25.380 | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.71190 | 0.26540 | 0.4601 | 0.11890 |
| 1 | 20.570 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.086900 | 0.070170 | 0.1812 | 0.05667 | ... | 24.990 | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.24160 | 0.18600 | 0.2750 | 0.08902 |
| 2 | 19.690 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.197400 | 0.127900 | 0.2069 | 0.05999 | ... | 23.570 | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.45040 | 0.24300 | 0.3613 | 0.08758 |
| 3 | 11.420 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.241400 | 0.105200 | 0.2597 | 0.09744 | ... | 14.910 | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.68690 | 0.25750 | 0.6638 | 0.17300 |
| 4 | 20.290 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.198000 | 0.104300 | 0.1809 | 0.05883 | ... | 22.540 | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.40000 | 0.16250 | 0.2364 | 0.07678 |
| 5 | 12.450 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.157800 | 0.080890 | 0.2087 | 0.07613 | ... | 15.470 | 23.75 | 103.40 | 741.6 | 0.17910 | 0.52490 | 0.53550 | 0.17410 | 0.3985 | 0.12440 |
| 6 | 18.250 | 19.98 | 119.60 | 1040.0 | 0.09463 | 0.10900 | 0.112700 | 0.074000 | 0.1794 | 0.05742 | ... | 22.880 | 27.66 | 153.20 | 1606.0 | 0.14420 | 0.25760 | 0.37840 | 0.19320 | 0.3063 | 0.08368 |
| 7 | 13.710 | 20.83 | 90.20 | 577.9 | 0.11890 | 0.16450 | 0.093660 | 0.059850 | 0.2196 | 0.07451 | ... | 17.060 | 28.14 | 110.60 | 897.0 | 0.16540 | 0.36820 | 0.26780 | 0.15560 | 0.3196 | 0.11510 |
| 8 | 13.000 | 21.82 | 87.50 | 519.8 | 0.12730 | 0.19320 | 0.185900 | 0.093530 | 0.2350 | 0.07389 | ... | 15.490 | 30.73 | 106.20 | 739.3 | 0.17030 | 0.54010 | 0.53900 | 0.20600 | 0.4378 | 0.10720 |
| 9 | 12.460 | 24.04 | 83.97 | 475.9 | 0.11860 | 0.23960 | 0.227300 | 0.085430 | 0.2030 | 0.08243 | ... | 15.090 | 40.68 | 97.65 | 711.4 | 0.18530 | 1.05800 | 1.10500 | 0.22100 | 0.4366 | 0.20750 |
| 10 | 16.020 | 23.24 | 102.70 | 797.8 | 0.08206 | 0.06669 | 0.032990 | 0.033230 | 0.1528 | 0.05697 | ... | 19.190 | 33.88 | 123.80 | 1150.0 | 0.11810 | 0.15510 | 0.14590 | 0.09975 | 0.2948 | 0.08452 |
| 11 | 15.780 | 17.89 | 103.60 | 781.0 | 0.09710 | 0.12920 | 0.099540 | 0.066060 | 0.1842 | 0.06082 | ... | 20.420 | 27.28 | 136.50 | 1299.0 | 0.13960 | 0.56090 | 0.39650 | 0.18100 | 0.3792 | 0.10480 |
| 12 | 19.170 | 24.80 | 132.40 | 1123.0 | 0.09740 | 0.24580 | 0.206500 | 0.111800 | 0.2397 | 0.07800 | ... | 20.960 | 29.94 | 151.70 | 1332.0 | 0.10370 | 0.39030 | 0.36390 | 0.17670 | 0.3176 | 0.10230 |
| 13 | 15.850 | 23.95 | 103.70 | 782.7 | 0.08401 | 0.10020 | 0.099380 | 0.053640 | 0.1847 | 0.05338 | ... | 16.840 | 27.66 | 112.00 | 876.5 | 0.11310 | 0.19240 | 0.23220 | 0.11190 | 0.2809 | 0.06287 |
| 14 | 13.730 | 22.61 | 93.60 | 578.3 | 0.11310 | 0.22930 | 0.212800 | 0.080250 | 0.2069 | 0.07682 | ... | 15.030 | 32.01 | 108.80 | 697.7 | 0.16510 | 0.77250 | 0.69430 | 0.22080 | 0.3596 | 0.14310 |
| 15 | 14.540 | 27.54 | 96.73 | 658.8 | 0.11390 | 0.15950 | 0.163900 | 0.073640 | 0.2303 | 0.07077 | ... | 17.460 | 37.13 | 124.10 | 943.2 | 0.16780 | 0.65770 | 0.70260 | 0.17120 | 0.4218 | 0.13410 |
| 16 | 14.680 | 20.13 | 94.74 | 684.5 | 0.09867 | 0.07200 | 0.073950 | 0.052590 | 0.1586 | 0.05922 | ... | 19.070 | 30.88 | 123.40 | 1138.0 | 0.14640 | 0.18710 | 0.29140 | 0.16090 | 0.3029 | 0.08216 |
| 17 | 16.130 | 20.68 | 108.10 | 798.8 | 0.11700 | 0.20220 | 0.172200 | 0.102800 | 0.2164 | 0.07356 | ... | 20.960 | 31.48 | 136.80 | 1315.0 | 0.17890 | 0.42330 | 0.47840 | 0.20730 | 0.3706 | 0.11420 |
| 18 | 19.810 | 22.15 | 130.00 | 1260.0 | 0.09831 | 0.10270 | 0.147900 | 0.094980 | 0.1582 | 0.05395 | ... | 27.320 | 30.88 | 186.80 | 2398.0 | 0.15120 | 0.31500 | 0.53720 | 0.23880 | 0.2768 | 0.07615 |
| 19 | 13.540 | 14.36 | 87.46 | 566.3 | 0.09779 | 0.08129 | 0.066640 | 0.047810 | 0.1885 | 0.05766 | ... | 15.110 | 19.26 | 99.70 | 711.2 | 0.14400 | 0.17730 | 0.23900 | 0.12880 | 0.2977 | 0.07259 |
| 20 | 13.080 | 15.71 | 85.63 | 520.0 | 0.10750 | 0.12700 | 0.045680 | 0.031100 | 0.1967 | 0.06811 | ... | 14.500 | 20.49 | 96.09 | 630.5 | 0.13120 | 0.27760 | 0.18900 | 0.07283 | 0.3184 | 0.08183 |
| 21 | 9.504 | 12.44 | 60.34 | 273.9 | 0.10240 | 0.06492 | 0.029560 | 0.020760 | 0.1815 | 0.06905 | ... | 10.230 | 15.66 | 65.13 | 314.9 | 0.13240 | 0.11480 | 0.08867 | 0.06227 | 0.2450 | 0.07773 |
| 22 | 15.340 | 14.26 | 102.50 | 704.4 | 0.10730 | 0.21350 | 0.207700 | 0.097560 | 0.2521 | 0.07032 | ... | 18.070 | 19.08 | 125.10 | 980.9 | 0.13900 | 0.59540 | 0.63050 | 0.23930 | 0.4667 | 0.09946 |
| 23 | 21.160 | 23.04 | 137.20 | 1404.0 | 0.09428 | 0.10220 | 0.109700 | 0.086320 | 0.1769 | 0.05278 | ... | 29.170 | 35.59 | 188.00 | 2615.0 | 0.14010 | 0.26000 | 0.31550 | 0.20090 | 0.2822 | 0.07526 |
| 24 | 16.650 | 21.38 | 110.00 | 904.6 | 0.11210 | 0.14570 | 0.152500 | 0.091700 | 0.1995 | 0.06330 | ... | 26.460 | 31.56 | 177.00 | 2215.0 | 0.18050 | 0.35780 | 0.46950 | 0.20950 | 0.3613 | 0.09564 |
| 25 | 17.140 | 16.40 | 116.00 | 912.7 | 0.11860 | 0.22760 | 0.222900 | 0.140100 | 0.3040 | 0.07413 | ... | 22.250 | 21.40 | 152.40 | 1461.0 | 0.15450 | 0.39490 | 0.38530 | 0.25500 | 0.4066 | 0.10590 |
| 26 | 14.580 | 21.53 | 97.41 | 644.8 | 0.10540 | 0.18680 | 0.142500 | 0.087830 | 0.2252 | 0.06924 | ... | 17.620 | 33.21 | 122.40 | 896.9 | 0.15250 | 0.66430 | 0.55390 | 0.27010 | 0.4264 | 0.12750 |
| 27 | 18.610 | 20.25 | 122.10 | 1094.0 | 0.09440 | 0.10660 | 0.149000 | 0.077310 | 0.1697 | 0.05699 | ... | 21.310 | 27.26 | 139.90 | 1403.0 | 0.13380 | 0.21170 | 0.34460 | 0.14900 | 0.2341 | 0.07421 |
| 28 | 15.300 | 25.27 | 102.40 | 732.4 | 0.10820 | 0.16970 | 0.168300 | 0.087510 | 0.1926 | 0.06540 | ... | 20.270 | 36.71 | 149.30 | 1269.0 | 0.16410 | 0.61100 | 0.63350 | 0.20240 | 0.4027 | 0.09876 |
| 29 | 17.570 | 15.05 | 115.00 | 955.1 | 0.09847 | 0.11570 | 0.098750 | 0.079530 | 0.1739 | 0.06149 | ... | 20.010 | 19.52 | 134.90 | 1227.0 | 0.12550 | 0.28120 | 0.24890 | 0.14560 | 0.2756 | 0.07919 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 539 | 7.691 | 25.44 | 48.34 | 170.4 | 0.08668 | 0.11990 | 0.092520 | 0.013640 | 0.2037 | 0.07751 | ... | 8.678 | 31.89 | 54.49 | 223.6 | 0.15960 | 0.30640 | 0.33930 | 0.05000 | 0.2790 | 0.10660 |
| 540 | 11.540 | 14.44 | 74.65 | 402.9 | 0.09984 | 0.11200 | 0.067370 | 0.025940 | 0.1818 | 0.06782 | ... | 12.260 | 19.68 | 78.78 | 457.8 | 0.13450 | 0.21180 | 0.17970 | 0.06918 | 0.2329 | 0.08134 |
| 541 | 14.470 | 24.99 | 95.81 | 656.4 | 0.08837 | 0.12300 | 0.100900 | 0.038900 | 0.1872 | 0.06341 | ... | 16.220 | 31.73 | 113.50 | 808.9 | 0.13400 | 0.42020 | 0.40400 | 0.12050 | 0.3187 | 0.10230 |
| 542 | 14.740 | 25.42 | 94.70 | 668.6 | 0.08275 | 0.07214 | 0.041050 | 0.030270 | 0.1840 | 0.05680 | ... | 16.510 | 32.29 | 107.40 | 826.4 | 0.10600 | 0.13760 | 0.16110 | 0.10950 | 0.2722 | 0.06956 |
| 543 | 13.210 | 28.06 | 84.88 | 538.4 | 0.08671 | 0.06877 | 0.029870 | 0.032750 | 0.1628 | 0.05781 | ... | 14.370 | 37.17 | 92.48 | 629.6 | 0.10720 | 0.13810 | 0.10620 | 0.07958 | 0.2473 | 0.06443 |
| 544 | 13.870 | 20.70 | 89.77 | 584.8 | 0.09578 | 0.10180 | 0.036880 | 0.023690 | 0.1620 | 0.06688 | ... | 15.050 | 24.75 | 99.17 | 688.6 | 0.12640 | 0.20370 | 0.13770 | 0.06845 | 0.2249 | 0.08492 |
| 545 | 13.620 | 23.23 | 87.19 | 573.2 | 0.09246 | 0.06747 | 0.029740 | 0.024430 | 0.1664 | 0.05801 | ... | 15.350 | 29.09 | 97.58 | 729.8 | 0.12160 | 0.15170 | 0.10490 | 0.07174 | 0.2642 | 0.06953 |
| 546 | 10.320 | 16.35 | 65.31 | 324.9 | 0.09434 | 0.04994 | 0.010120 | 0.005495 | 0.1885 | 0.06201 | ... | 11.250 | 21.77 | 71.12 | 384.9 | 0.12850 | 0.08842 | 0.04384 | 0.02381 | 0.2681 | 0.07399 |
| 547 | 10.260 | 16.58 | 65.85 | 320.8 | 0.08877 | 0.08066 | 0.043580 | 0.024380 | 0.1669 | 0.06714 | ... | 10.830 | 22.04 | 71.08 | 357.4 | 0.14610 | 0.22460 | 0.17830 | 0.08333 | 0.2691 | 0.09479 |
| 548 | 9.683 | 19.34 | 61.05 | 285.7 | 0.08491 | 0.05030 | 0.023370 | 0.009615 | 0.1580 | 0.06235 | ... | 10.930 | 25.59 | 69.10 | 364.2 | 0.11990 | 0.09546 | 0.09350 | 0.03846 | 0.2552 | 0.07920 |
| 549 | 10.820 | 24.21 | 68.89 | 361.6 | 0.08192 | 0.06602 | 0.015480 | 0.008160 | 0.1976 | 0.06328 | ... | 13.030 | 31.45 | 83.90 | 505.6 | 0.12040 | 0.16330 | 0.06194 | 0.03264 | 0.3059 | 0.07626 |
| 550 | 10.860 | 21.48 | 68.51 | 360.5 | 0.07431 | 0.04227 | 0.000000 | 0.000000 | 0.1661 | 0.05948 | ... | 11.660 | 24.77 | 74.08 | 412.3 | 0.10010 | 0.07348 | 0.00000 | 0.00000 | 0.2458 | 0.06592 |
| 551 | 11.130 | 22.44 | 71.49 | 378.4 | 0.09566 | 0.08194 | 0.048240 | 0.022570 | 0.2030 | 0.06552 | ... | 12.020 | 28.26 | 77.80 | 436.6 | 0.10870 | 0.17820 | 0.15640 | 0.06413 | 0.3169 | 0.08032 |
| 552 | 12.770 | 29.43 | 81.35 | 507.9 | 0.08276 | 0.04234 | 0.019970 | 0.014990 | 0.1539 | 0.05637 | ... | 13.870 | 36.00 | 88.10 | 594.7 | 0.12340 | 0.10640 | 0.08653 | 0.06498 | 0.2407 | 0.06484 |
| 553 | 9.333 | 21.94 | 59.01 | 264.0 | 0.09240 | 0.05605 | 0.039960 | 0.012820 | 0.1692 | 0.06576 | ... | 9.845 | 25.05 | 62.86 | 295.8 | 0.11030 | 0.08298 | 0.07993 | 0.02564 | 0.2435 | 0.07393 |
| 554 | 12.880 | 28.92 | 82.50 | 514.3 | 0.08123 | 0.05824 | 0.061950 | 0.023430 | 0.1566 | 0.05708 | ... | 13.890 | 35.74 | 88.84 | 595.7 | 0.12270 | 0.16200 | 0.24390 | 0.06493 | 0.2372 | 0.07242 |
| 555 | 10.290 | 27.61 | 65.67 | 321.4 | 0.09030 | 0.07658 | 0.059990 | 0.027380 | 0.1593 | 0.06127 | ... | 10.840 | 34.91 | 69.57 | 357.6 | 0.13840 | 0.17100 | 0.20000 | 0.09127 | 0.2226 | 0.08283 |
| 556 | 10.160 | 19.59 | 64.73 | 311.7 | 0.10030 | 0.07504 | 0.005025 | 0.011160 | 0.1791 | 0.06331 | ... | 10.650 | 22.88 | 67.88 | 347.3 | 0.12650 | 0.12000 | 0.01005 | 0.02232 | 0.2262 | 0.06742 |
| 557 | 9.423 | 27.88 | 59.26 | 271.3 | 0.08123 | 0.04971 | 0.000000 | 0.000000 | 0.1742 | 0.06059 | ... | 10.490 | 34.24 | 66.50 | 330.6 | 0.10730 | 0.07158 | 0.00000 | 0.00000 | 0.2475 | 0.06969 |
| 558 | 14.590 | 22.68 | 96.39 | 657.1 | 0.08473 | 0.13300 | 0.102900 | 0.037360 | 0.1454 | 0.06147 | ... | 15.480 | 27.27 | 105.90 | 733.5 | 0.10260 | 0.31710 | 0.36620 | 0.11050 | 0.2258 | 0.08004 |
| 559 | 11.510 | 23.93 | 74.52 | 403.5 | 0.09261 | 0.10210 | 0.111200 | 0.041050 | 0.1388 | 0.06570 | ... | 12.480 | 37.16 | 82.28 | 474.2 | 0.12980 | 0.25170 | 0.36300 | 0.09653 | 0.2112 | 0.08732 |
| 560 | 14.050 | 27.15 | 91.38 | 600.4 | 0.09929 | 0.11260 | 0.044620 | 0.043040 | 0.1537 | 0.06171 | ... | 15.300 | 33.17 | 100.20 | 706.7 | 0.12410 | 0.22640 | 0.13260 | 0.10480 | 0.2250 | 0.08321 |
| 561 | 11.200 | 29.37 | 70.67 | 386.0 | 0.07449 | 0.03558 | 0.000000 | 0.000000 | 0.1060 | 0.05502 | ... | 11.920 | 38.30 | 75.19 | 439.6 | 0.09267 | 0.05494 | 0.00000 | 0.00000 | 0.1566 | 0.05905 |
| 562 | 15.220 | 30.62 | 103.40 | 716.9 | 0.10480 | 0.20870 | 0.255000 | 0.094290 | 0.2128 | 0.07152 | ... | 17.520 | 42.79 | 128.70 | 915.0 | 0.14170 | 0.79170 | 1.17000 | 0.23560 | 0.4089 | 0.14090 |
| 563 | 20.920 | 25.09 | 143.00 | 1347.0 | 0.10990 | 0.22360 | 0.317400 | 0.147400 | 0.2149 | 0.06879 | ... | 24.290 | 29.41 | 179.10 | 1819.0 | 0.14070 | 0.41860 | 0.65990 | 0.25420 | 0.2929 | 0.09873 |
| 564 | 21.560 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.243900 | 0.138900 | 0.1726 | 0.05623 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.41070 | 0.22160 | 0.2060 | 0.07115 |
| 565 | 20.130 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.144000 | 0.097910 | 0.1752 | 0.05533 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.32150 | 0.16280 | 0.2572 | 0.06637 |
| 566 | 16.600 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.092510 | 0.053020 | 0.1590 | 0.05648 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.34030 | 0.14180 | 0.2218 | 0.07820 |
| 567 | 20.600 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.351400 | 0.152000 | 0.2397 | 0.07016 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.93870 | 0.26500 | 0.4087 | 0.12400 |
| 568 | 7.760 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.000000 | 0.000000 | 0.1587 | 0.05884 | ... | 9.456 | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.00000 | 0.00000 | 0.2871 | 0.07039 |
569 rows × 30 columns
In [58]:
# Attach the class labels as a column (in scikit-learn's copy, 0 = malignant, 1 = benign)
breast_cancer_df['diagnosis'] = breast_cancer_data.target
In [61]:
breast_cancer_df.corr()
Out[61]:
mean radius
mean texture
mean perimeter
mean area
mean smoothness
mean compactness
mean concavity
mean concave points
mean symmetry
mean fractal dimension
...
worst texture
worst perimeter
worst area
worst smoothness
worst compactness
worst concavity
worst concave points
worst symmetry
worst fractal dimension
diagnosis
mean radius
1.000000
0.323782
0.997855
0.987357
0.170581
0.506124
0.676764
0.822529
0.147741
-0.311631
...
0.297008
0.965137
0.941082
0.119616
0.413463
0.526911
0.744214
0.163953
0.007066
-0.730029
mean texture
0.323782
1.000000
0.329533
0.321086
-0.023389
0.236702
0.302418
0.293464
0.071401
-0.076437
...
0.912045
0.358040
0.343546
0.077503
0.277830
0.301025
0.295316
0.105008
0.119205
-0.415185
mean perimeter
0.997855
0.329533
1.000000
0.986507
0.207278
0.556936
0.716136
0.850977
0.183027
-0.261477
...
0.303038
0.970387
0.941550
0.150549
0.455774
0.563879
0.771241
0.189115
0.051019
-0.742636
mean area
0.987357
0.321086
0.986507
1.000000
0.177028
0.498502
0.685983
0.823269
0.151293
-0.283110
...
0.287489
0.959120
0.959213
0.123523
0.390410
0.512606
0.722017
0.143570
0.003738
-0.708984
mean smoothness
0.170581
-0.023389
0.207278
0.177028
1.000000
0.659123
0.521984
0.553695
0.557775
0.584792
...
0.036072
0.238853
0.206718
0.805324
0.472468
0.434926
0.503053
0.394309
0.499316
-0.358560
mean compactness
0.506124
0.236702
0.556936
0.498502
0.659123
1.000000
0.883121
0.831135
0.602641
0.565369
...
0.248133
0.590210
0.509604
0.565541
0.865809
0.816275
0.815573
0.510223
0.687382
-0.596534
mean concavity
0.676764
0.302418
0.716136
0.685983
0.521984
0.883121
1.000000
0.921391
0.500667
0.336783
...
0.299879
0.729565
0.675987
0.448822
0.754968
0.884103
0.861323
0.409464
0.514930
-0.696360
mean concave points
0.822529
0.293464
0.850977
0.823269
0.553695
0.831135
0.921391
1.000000
0.462497
0.166917
...
0.292752
0.855923
0.809630
0.452753
0.667454
0.752399
0.910155
0.375744
0.368661
-0.776614
mean symmetry
0.147741
0.071401
0.183027
0.151293
0.557775
0.602641
0.500667
0.462497
1.000000
0.479921
...
0.090651
0.219169
0.177193
0.426675
0.473200
0.433721
0.430297
0.699826
0.438413
-0.330499
mean fractal dimension
-0.311631
-0.076437
-0.261477
-0.283110
0.584792
0.565369
0.336783
0.166917
0.479921
1.000000
...
-0.051269
-0.205151
-0.231854
0.504942
0.458798
0.346234
0.175325
0.334019
0.767297
0.012838
radius error
0.679090
0.275869
0.691765
0.732562
0.301467
0.497473
0.631925
0.698050
0.303379
0.000111
...
0.194799
0.719684
0.751548
0.141919
0.287103
0.380585
0.531062
0.094543
0.049559
-0.567134
texture error
-0.097317
0.386358
-0.086761
-0.066280
0.068406
0.046205
0.076218
0.021480
0.128053
0.164174
...
0.409003
-0.102242
-0.083195
-0.073658
-0.092439
-0.068956
-0.119638
-0.128215
-0.045655
0.008303
perimeter error
0.674172
0.281673
0.693135
0.726628
0.296092
0.548905
0.660391
0.710650
0.313893
0.039830
...
0.200371
0.721031
0.730713
0.130054
0.341919
0.418899
0.554897
0.109930
0.085433
-0.556141
area error
0.735864
0.259845
0.744983
0.800086
0.246552
0.455653
0.617427
0.690299
0.223970
-0.090170
...
0.196497
0.761213
0.811408
0.125389
0.283257
0.385100
0.538166
0.074126
0.017539
-0.548236
smoothness error
-0.222600
0.006614
-0.202694
-0.166777
0.332375
0.135299
0.098564
0.027653
0.187321
0.401964
...
-0.074743
-0.217304
-0.182195
0.314457
-0.055558
-0.058298
-0.102007
-0.107342
0.101480
0.067016
compactness error
0.206000
0.191975
0.250744
0.212583
0.318943
0.738722
0.670279
0.490424
0.421659
0.559837
...
0.143003
0.260516
0.199371
0.227394
0.678780
0.639147
0.483208
0.277878
0.590973
-0.292999
concavity error
0.194204
0.143293
0.228082
0.207660
0.248396
0.570517
0.691270
0.439167
0.342627
0.446630
...
0.100241
0.226680
0.188353
0.168481
0.484858
0.662564
0.440472
0.197788
0.439329
-0.253730
concave points error
0.376169
0.163851
0.407217
0.372320
0.380676
0.642262
0.683260
0.615634
0.393298
0.341198
...
0.086741
0.394999
0.342271
0.215351
0.452888
0.549592
0.602450
0.143116
0.310655
-0.408042
symmetry error
-0.104321
0.009127
-0.081629
-0.072497
0.200774
0.229977
0.178009
0.095351
0.449137
0.345007
...
-0.077473
-0.103753
-0.110343
-0.012662
0.060255
0.037119
-0.030413
0.389402
0.078079
0.006522
fractal dimension error
-0.042641
0.054458
-0.005523
-0.019887
0.283607
0.507318
0.449301
0.257584
0.331786
0.688132
...
-0.003195
-0.001000
-0.022736
0.170568
0.390159
0.379975
0.215204
0.111094
0.591328
-0.077972
worst radius
0.969539
0.352573
0.969476
0.962746
0.213120
0.535315
0.688236
0.830318
0.185728
-0.253691
...
0.359921
0.993708
0.984015
0.216574
0.475820
0.573975
0.787424
0.243529
0.093492
-0.776454
worst texture
0.297008
0.912045
0.303038
0.287489
0.036072
0.248133
0.299879
0.292752
0.090651
-0.051269
...
1.000000
0.365098
0.345842
0.225429
0.360832
0.368366
0.359755
0.233027
0.219122
-0.456903
worst perimeter
0.965137
0.358040
0.970387
0.959120
0.238853
0.590210
0.729565
0.855923
0.219169
-0.205151
...
0.365098
1.000000
0.977578
0.236775
0.529408
0.618344
0.816322
0.269493
0.138957
-0.782914
worst area
0.941082
0.343546
0.941550
0.959213
0.206718
0.509604
0.675987
0.809630
0.177193
-0.231854
...
0.345842
0.977578
1.000000
0.209145
0.438296
0.543331
0.747419
0.209146
0.079647
-0.733825
worst smoothness
0.119616
0.077503
0.150549
0.123523
0.805324
0.565541
0.448822
0.452753
0.426675
0.504942
...
0.225429
0.236775
0.209145
1.000000
0.568187
0.518523
0.547691
0.493838
0.617624
-0.421465
worst compactness
0.413463
0.277830
0.455774
0.390410
0.472468
0.865809
0.754968
0.667454
0.473200
0.458798
...
0.360832
0.529408
0.438296
0.568187
1.000000
0.892261
0.801080
0.614441
0.810455
-0.590998
worst concavity
0.526911
0.301025
0.563879
0.512606
0.434926
0.816275
0.884103
0.752399
0.433721
0.346234
...
0.368366
0.618344
0.543331
0.518523
0.892261
1.000000
0.855434
0.532520
0.686511
-0.659610
worst concave points
0.744214
0.295316
0.771241
0.722017
0.503053
0.815573
0.861323
0.910155
0.430297
0.175325
...
0.359755
0.816322
0.747419
0.547691
0.801080
0.855434
1.000000
0.502528
0.511114
-0.793566
worst symmetry
0.163953
0.105008
0.189115
0.143570
0.394309
0.510223
0.409464
0.375744
0.699826
0.334019
...
0.233027
0.269493
0.209146
0.493838
0.614441
0.532520
0.502528
1.000000
0.537848
-0.416294
worst fractal dimension
0.007066
0.119205
0.051019
0.003738
0.499316
0.687382
0.514930
0.368661
0.438413
0.767297
...
0.219122
0.138957
0.079647
0.617624
0.810455
0.686511
0.511114
0.537848
1.000000
-0.323872
diagnosis
-0.730029
-0.415185
-0.742636
-0.708984
-0.358560
-0.596534
-0.696360
-0.776614
-0.330499
0.012838
...
-0.456903
-0.782914
-0.733825
-0.421465
-0.590998
-0.659610
-0.793566
-0.416294
-0.323872
1.000000
31 rows × 31 columns
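The `diagnosis` column of the correlation output above gives a quick, rough way to compare candidate predictors. As a minimal sketch, the snippet below copies the ten `mean ...` values from that column by hand and ranks them by absolute correlation. The dictionary and variable names are illustrative and not part of the notebook, and correlation only captures linear association, so the ranking is a hint rather than a feature-selection rule.

```python
# Correlation of each "mean" feature with diagnosis, copied by hand
# from the diagnosis column of the correlation output above.
diagnosis_corr = {
    "mean radius": -0.730029,
    "mean texture": -0.415185,
    "mean perimeter": -0.742636,
    "mean area": -0.708984,
    "mean smoothness": -0.358560,
    "mean compactness": -0.596534,
    "mean concavity": -0.696360,
    "mean concave points": -0.776614,
    "mean symmetry": -0.330499,
    "mean fractal dimension": 0.012838,
}

# Rank by strength of association, ignoring the sign.
ranked = sorted(diagnosis_corr.items(), key=lambda item: abs(item[1]), reverse=True)

for name, corr in ranked:
    print("%-24s %9.6f" % (name, corr))
```

In this ranking `mean radius` comes third and `mean texture` seventh, so the first two columns selected for `x_seed` in the next cell are not the two most strongly correlated predictors.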
In [65]:
# Keep only the first two feature columns (mean radius, mean texture) as predictors
x_seed = breast_cancer_data.data[:, :2]
y_seed = breast_cancer_data.target
In [66]:
x_seed
Out[66]:
array([[ 17.99, 10.38],
[ 20.57, 17.77],
[ 19.69, 21.25],
...,
[ 16.6 , 28.08],
[ 20.6 , 29.33],
[ 7.76, 24.54]])
In [67]:
y_seed
Out[67]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1,
1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1,
0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1,
1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
In [75]:
# split_test is not defined; the scikit-learn helper imported at the top is train_test_split.
# x and y are assumed to hold the features and labels (for example x_seed and y_seed above).
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)
In [69]:
dt = dt.fit(x_train,y_train)
In [71]:
output(x_train,y_train,dt)
Accuracy:1.000
Classification report
             precision    recall  f1-score   support
          0       1.00      1.00      1.00       114
          1       1.00      1.00      1.00       170
avg / total       1.00      1.00      1.00       284
Confusion matrix
[[114   0]
 [  0 170]]
In [72]:
output(x_test,y_test,dt)
Accuracy:0.818
Classification report
             precision    recall  f1-score   support
          0       0.71      0.80      0.75        98
          1       0.89      0.83      0.86       187
avg / total       0.82      0.82      0.82       285
Confusion matrix
[[ 78  20]
 [ 32 155]]
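Every figure in the classification report above can be recovered from the confusion matrix, which helps when interpreting the results. The sketch below recomputes them in plain Python, assuming scikit-learn's convention that rows are true classes and columns are predicted classes. The helper `per_class_metrics` is a hypothetical name introduced here, not something defined in the notebook.

```python
# Confusion matrix from the test-set output above.
# Rows = true class, columns = predicted class (scikit-learn's convention).
confusion = [[78, 20],
             [32, 155]]

def per_class_metrics(cm, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    true_pos = cm[cls][cls]
    predicted_as_cls = sum(row[cls] for row in cm)  # column sum
    actually_cls = sum(cm[cls])                     # row sum, the "support"
    precision = true_pos / predicted_as_cls
    recall = true_pos / actually_cls
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total

print("Accuracy:%.3f" % accuracy)
for cls in range(len(confusion)):
    precision, recall, f1 = per_class_metrics(confusion, cls)
    print("class %d  precision %.2f  recall %.2f  f1 %.2f" % (cls, precision, recall, f1))
```

For class 0, precision is 78 / (78 + 32) and recall is 78 / (78 + 20), which round to the 0.71 and 0.80 shown in the report.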
In [81]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)
In [82]:
output(x_train,y_train,dt)
Accuracy:0.937
Classification report
             precision    recall  f1-score   support
          0       0.95      0.88      0.91       162
          1       0.93      0.97      0.95       264
avg / total       0.94      0.94      0.94       426
Confusion matrix
[[143  19]
 [  8 256]]
In [83]:
output(x_test,y_test,dt)
Accuracy:0.902
Classification report
             precision    recall  f1-score   support
          0       0.89      0.82      0.85        50
          1       0.91      0.95      0.93        93
avg / total       0.90      0.90      0.90       143
Confusion matrix
[[41  9]
 [ 5 88]]
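No `fit` call appears between the new 75/25 split in In [81] and the two `output` calls that follow, so those scores seem to come from the tree trained on the earlier split, and some of the new test rows were probably seen during training. Below is a minimal sketch of refitting before scoring. It assumes a recent scikit-learn, where `train_test_split` lives in `sklearn.model_selection` rather than `sklearn.cross_validation`. The names `x_two`, `y_all` and `fresh_dt` and the `random_state` values are illustrative additions, not taken from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split  # newer home of train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
x_two = data.data[:, :2]   # first two columns: mean radius, mean texture
y_all = data.target

# Same 75/25 proportions as In [81]; random_state is an assumption added
# here so the sketch gives the same split on every run.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_two, y_all, test_size=0.25, train_size=0.75, random_state=0)

# Fit a new tree on this split only, so the test rows stay unseen.
fresh_dt = DecisionTreeClassifier(random_state=0)
fresh_dt.fit(x_tr, y_tr)

train_acc = accuracy_score(y_tr, fresh_dt.predict(x_tr))
test_acc = accuracy_score(y_te, fresh_dt.predict(x_te))
print("train accuracy: %.3f" % train_acc)
print("test accuracy:  %.3f" % test_acc)
```

Scoring a freshly fitted tree this way keeps the test figure an honest estimate; the exact numbers will depend on the random split.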
Content source: ledeprogram/algorithms