## We covered a lot of information today, and I'd like you to practice developing classification trees on your own. For each exercise, work through the problem, determine the result, and provide the requested interpretation in comments along with the code. The point is to build classifiers, not necessarily good classifiers (that will hopefully come later).

### 1. Load the iris dataset and create a holdout set that is 50% of the data (50% in training and 50% in test). Output the results (don't worry about creating the tree visual unless you'd like to) and discuss them briefly (are they good or not?)

``````

In [1]:

import pandas as pd
%matplotlib inline

``````
``````

In [2]:

from sklearn import datasets
# pandas.tools.plotting was removed in later pandas releases;
# scatter_matrix now lives in pandas.plotting.
from pandas.plotting import scatter_matrix

``````
``````

In [3]:

import matplotlib.pyplot as plt

``````
``````

In [4]:

iris = datasets.load_iris()
``````
``````

In [5]:

iris

``````
``````

Out[5]:

{'DESCR': 'Iris Plants Database\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal length in cm\n        - sepal width in cm\n        - petal length in cm\n        - petal width in cm\n        - class:\n                - Iris-Setosa\n                - Iris-Versicolour\n                - Iris-Virginica\n    :Summary Statistics:\n\n    ============== ==== ==== ======= ===== ====================\n                    Min  Max   Mean    SD   Class Correlation\n    ============== ==== ==== ======= ===== ====================\n    sepal length:   4.3  7.9   5.84   0.83    0.7826\n    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)\n    ============== ==== ==== ======= ===== ====================\n\n    :Missing Attribute Values: None\n    :Class Distribution: 33.3% for each of 3 classes.\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThis is a copy of UCI ML iris datasets.\nhttp://archive.ics.uci.edu/ml/datasets/Iris\n\nThe famous Iris database, first used by Sir R.A Fisher\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\nReferences\n----------\n   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"\n     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n     Mathematical Statistics" (John Wiley, NY, 1950).\n   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n     Structure and Classification Rule for Recognition in Partially Exposed\n     Environments".  IEEE Transactions on Pattern Analysis and Machine\n     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n     on Information Theory, May 1972, 431-433.\n   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n     conceptual clustering system finds 3 classes in the data.\n   - Many, many more ...\n',
'data': array([[ 5.1,  3.5,  1.4,  0.2],
[ 4.9,  3. ,  1.4,  0.2],
[ 4.7,  3.2,  1.3,  0.2],
[ 4.6,  3.1,  1.5,  0.2],
[ 5. ,  3.6,  1.4,  0.2],
[ 5.4,  3.9,  1.7,  0.4],
[ 4.6,  3.4,  1.4,  0.3],
[ 5. ,  3.4,  1.5,  0.2],
[ 4.4,  2.9,  1.4,  0.2],
[ 4.9,  3.1,  1.5,  0.1],
[ 5.4,  3.7,  1.5,  0.2],
[ 4.8,  3.4,  1.6,  0.2],
[ 4.8,  3. ,  1.4,  0.1],
[ 4.3,  3. ,  1.1,  0.1],
[ 5.8,  4. ,  1.2,  0.2],
[ 5.7,  4.4,  1.5,  0.4],
[ 5.4,  3.9,  1.3,  0.4],
[ 5.1,  3.5,  1.4,  0.3],
[ 5.7,  3.8,  1.7,  0.3],
[ 5.1,  3.8,  1.5,  0.3],
[ 5.4,  3.4,  1.7,  0.2],
[ 5.1,  3.7,  1.5,  0.4],
[ 4.6,  3.6,  1. ,  0.2],
[ 5.1,  3.3,  1.7,  0.5],
[ 4.8,  3.4,  1.9,  0.2],
[ 5. ,  3. ,  1.6,  0.2],
[ 5. ,  3.4,  1.6,  0.4],
[ 5.2,  3.5,  1.5,  0.2],
[ 5.2,  3.4,  1.4,  0.2],
[ 4.7,  3.2,  1.6,  0.2],
[ 4.8,  3.1,  1.6,  0.2],
[ 5.4,  3.4,  1.5,  0.4],
[ 5.2,  4.1,  1.5,  0.1],
[ 5.5,  4.2,  1.4,  0.2],
[ 4.9,  3.1,  1.5,  0.1],
[ 5. ,  3.2,  1.2,  0.2],
[ 5.5,  3.5,  1.3,  0.2],
[ 4.9,  3.1,  1.5,  0.1],
[ 4.4,  3. ,  1.3,  0.2],
[ 5.1,  3.4,  1.5,  0.2],
[ 5. ,  3.5,  1.3,  0.3],
[ 4.5,  2.3,  1.3,  0.3],
[ 4.4,  3.2,  1.3,  0.2],
[ 5. ,  3.5,  1.6,  0.6],
[ 5.1,  3.8,  1.9,  0.4],
[ 4.8,  3. ,  1.4,  0.3],
[ 5.1,  3.8,  1.6,  0.2],
[ 4.6,  3.2,  1.4,  0.2],
[ 5.3,  3.7,  1.5,  0.2],
[ 5. ,  3.3,  1.4,  0.2],
[ 7. ,  3.2,  4.7,  1.4],
[ 6.4,  3.2,  4.5,  1.5],
[ 6.9,  3.1,  4.9,  1.5],
[ 5.5,  2.3,  4. ,  1.3],
[ 6.5,  2.8,  4.6,  1.5],
[ 5.7,  2.8,  4.5,  1.3],
[ 6.3,  3.3,  4.7,  1.6],
[ 4.9,  2.4,  3.3,  1. ],
[ 6.6,  2.9,  4.6,  1.3],
[ 5.2,  2.7,  3.9,  1.4],
[ 5. ,  2. ,  3.5,  1. ],
[ 5.9,  3. ,  4.2,  1.5],
[ 6. ,  2.2,  4. ,  1. ],
[ 6.1,  2.9,  4.7,  1.4],
[ 5.6,  2.9,  3.6,  1.3],
[ 6.7,  3.1,  4.4,  1.4],
[ 5.6,  3. ,  4.5,  1.5],
[ 5.8,  2.7,  4.1,  1. ],
[ 6.2,  2.2,  4.5,  1.5],
[ 5.6,  2.5,  3.9,  1.1],
[ 5.9,  3.2,  4.8,  1.8],
[ 6.1,  2.8,  4. ,  1.3],
[ 6.3,  2.5,  4.9,  1.5],
[ 6.1,  2.8,  4.7,  1.2],
[ 6.4,  2.9,  4.3,  1.3],
[ 6.6,  3. ,  4.4,  1.4],
[ 6.8,  2.8,  4.8,  1.4],
[ 6.7,  3. ,  5. ,  1.7],
[ 6. ,  2.9,  4.5,  1.5],
[ 5.7,  2.6,  3.5,  1. ],
[ 5.5,  2.4,  3.8,  1.1],
[ 5.5,  2.4,  3.7,  1. ],
[ 5.8,  2.7,  3.9,  1.2],
[ 6. ,  2.7,  5.1,  1.6],
[ 5.4,  3. ,  4.5,  1.5],
[ 6. ,  3.4,  4.5,  1.6],
[ 6.7,  3.1,  4.7,  1.5],
[ 6.3,  2.3,  4.4,  1.3],
[ 5.6,  3. ,  4.1,  1.3],
[ 5.5,  2.5,  4. ,  1.3],
[ 5.5,  2.6,  4.4,  1.2],
[ 6.1,  3. ,  4.6,  1.4],
[ 5.8,  2.6,  4. ,  1.2],
[ 5. ,  2.3,  3.3,  1. ],
[ 5.6,  2.7,  4.2,  1.3],
[ 5.7,  3. ,  4.2,  1.2],
[ 5.7,  2.9,  4.2,  1.3],
[ 6.2,  2.9,  4.3,  1.3],
[ 5.1,  2.5,  3. ,  1.1],
[ 5.7,  2.8,  4.1,  1.3],
[ 6.3,  3.3,  6. ,  2.5],
[ 5.8,  2.7,  5.1,  1.9],
[ 7.1,  3. ,  5.9,  2.1],
[ 6.3,  2.9,  5.6,  1.8],
[ 6.5,  3. ,  5.8,  2.2],
[ 7.6,  3. ,  6.6,  2.1],
[ 4.9,  2.5,  4.5,  1.7],
[ 7.3,  2.9,  6.3,  1.8],
[ 6.7,  2.5,  5.8,  1.8],
[ 7.2,  3.6,  6.1,  2.5],
[ 6.5,  3.2,  5.1,  2. ],
[ 6.4,  2.7,  5.3,  1.9],
[ 6.8,  3. ,  5.5,  2.1],
[ 5.7,  2.5,  5. ,  2. ],
[ 5.8,  2.8,  5.1,  2.4],
[ 6.4,  3.2,  5.3,  2.3],
[ 6.5,  3. ,  5.5,  1.8],
[ 7.7,  3.8,  6.7,  2.2],
[ 7.7,  2.6,  6.9,  2.3],
[ 6. ,  2.2,  5. ,  1.5],
[ 6.9,  3.2,  5.7,  2.3],
[ 5.6,  2.8,  4.9,  2. ],
[ 7.7,  2.8,  6.7,  2. ],
[ 6.3,  2.7,  4.9,  1.8],
[ 6.7,  3.3,  5.7,  2.1],
[ 7.2,  3.2,  6. ,  1.8],
[ 6.2,  2.8,  4.8,  1.8],
[ 6.1,  3. ,  4.9,  1.8],
[ 6.4,  2.8,  5.6,  2.1],
[ 7.2,  3. ,  5.8,  1.6],
[ 7.4,  2.8,  6.1,  1.9],
[ 7.9,  3.8,  6.4,  2. ],
[ 6.4,  2.8,  5.6,  2.2],
[ 6.3,  2.8,  5.1,  1.5],
[ 6.1,  2.6,  5.6,  1.4],
[ 7.7,  3. ,  6.1,  2.3],
[ 6.3,  3.4,  5.6,  2.4],
[ 6.4,  3.1,  5.5,  1.8],
[ 6. ,  3. ,  4.8,  1.8],
[ 6.9,  3.1,  5.4,  2.1],
[ 6.7,  3.1,  5.6,  2.4],
[ 6.9,  3.1,  5.1,  2.3],
[ 5.8,  2.7,  5.1,  1.9],
[ 6.8,  3.2,  5.9,  2.3],
[ 6.7,  3.3,  5.7,  2.5],
[ 6.7,  3. ,  5.2,  2.3],
[ 6.3,  2.5,  5. ,  1.9],
[ 6.5,  3. ,  5.2,  2. ],
[ 6.2,  3.4,  5.4,  2.3],
[ 5.9,  3. ,  5.1,  1.8]]),
'feature_names': ['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)'],
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
'target_names': array(['setosa', 'versicolor', 'virginica'],
dtype='<U10')}

``````
``````

In [6]:

# Use only the first two features: sepal length and sepal width (cm).
x = iris.data[:, :2]
y = iris.target

``````
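Incidentally, the `scatter_matrix` imported earlier is never used; a quick exploratory look at all four features could work like this sketch (`iris_df` and the `figsize` value are my own choices, not from the original notebook):

``````
# Colour each point by its class (0, 1, 2) to see how separable the species are.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
scatter_matrix(iris_df, c=iris.target, figsize=(8, 8))
plt.show()
``````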
``````

In [7]:

from sklearn import tree

``````
``````

In [8]:

# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split

``````
``````

In [9]:

dt = tree.DecisionTreeClassifier()

``````
``````

In [10]:

# No random_state is set, so each run produces a different split (see below).
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, train_size=0.5)

``````
``````

In [11]:

dt = dt.fit(x_train,y_train)

``````
``````

In [12]:

from sklearn import metrics

``````
``````

In [13]:

import numpy as np

``````
``````

In [14]:

def measure_performance(X, y, clf, show_accuracy=True,
                        show_classification_report=True,
                        show_confusion_matrix=True):
    # Predict on X and print the requested evaluation metrics for clf.
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")

``````
``````

In [15]:

measure_performance(x_train,y_train,dt)

``````
``````

Accuracy:0.947

Classification report
precision    recall  f1-score   support

0       1.00      1.00      1.00        21
1       0.87      1.00      0.93        26
2       1.00      0.86      0.92        28

avg / total       0.95      0.95      0.95        75

Confusion matrix
[[21  0  0]
[ 0 26  0]
[ 0  4 24]]

``````
``````

In [16]:

measure_performance(x_test,y_test,dt)

``````
``````

Accuracy:0.707

Classification report
precision    recall  f1-score   support

0       0.97      0.97      0.97        29
1       0.55      0.46      0.50        24
2       0.54      0.64      0.58        22

avg / total       0.71      0.71      0.70        75

Confusion matrix
[[28  1  0]
[ 1 11 12]
[ 0  8 14]]

``````

Every time I run the code the result is different: the accuracy varies, and sometimes the training set scores better, sometimes the test set does. The reason is that `train_test_split` shuffles the data at random on every call, so each run trains and evaluates the tree on a different 50/50 split. (It also does not help that I kept only sepal length and width; the DESCR above notes that the petal measurements correlate with the class far more strongly.) As a journalist, I am accountable for everything I produce, and I could not rely on a "source" that changes its version every single time I interview him/her. Before trusting these numbers, then, I would at least want to fix the split or average over many of them.
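A minimal sketch of how to pin the split down (the `random_state=42` value is an arbitrary choice of mine, not something from the lesson), plus a quick way to average accuracy over several splits:

``````
from sklearn.model_selection import train_test_split, cross_val_score

# A fixed random_state makes the shuffle, and therefore the scores, repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=42)

# Accuracy of a fresh tree averaged over 5 different train/test splits.
scores = cross_val_score(tree.DecisionTreeClassifier(), x, y, cv=5)
print(scores.mean(), scores.std())
``````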

### 2. Redo the model with a 75% - 25% training/test split and compare the results. Are they better or worse than before? Discuss why this may be.

``````

In [17]:

# NB: this is the reverse of what the exercise asks for; test_size=0.75
# leaves only 25% of the data for training.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.75, train_size=0.25)

``````
``````

In [19]:

measure_performance(x_train,y_train,dt)

``````
``````

Accuracy:0.730

Classification report
precision    recall  f1-score   support

0       1.00      1.00      1.00        12
1       0.62      0.42      0.50        12
2       0.59      0.77      0.67        13

avg / total       0.73      0.73      0.72        37

Confusion matrix
[[12  0  0]
[ 0  5  7]
[ 0  3 10]]

``````
``````

In [20]:

measure_performance(x_test,y_test,dt)

``````
``````

Accuracy:0.858

Classification report
precision    recall  f1-score   support

0       0.97      0.97      0.97        38
1       0.76      0.84      0.80        38
2       0.85      0.76      0.80        37

avg / total       0.86      0.86      0.86       113

Confusion matrix
[[37  1  0]
[ 1 32  5]
[ 0  9 28]]

``````

At first glance the test set does a little better than the training set here, but two things went wrong above. First, the split arguments are reversed: `test_size=0.75` puts only 25% of the data into training, not 75%. Second, the tree was never refit after the new split, so both scores come from the classifier fitted in exercise 1; its "training" accuracy on this new, mostly unseen 25% sample is therefore nothing special. With a genuine 75% training set and a freshly fitted tree, I would expect the test results to improve on the 50/50 case, simply because the model gets to learn from more data.
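For reference, a corrected version of this exercise might look like the sketch below, assuming the intent was 75% training / 25% test and a freshly fitted tree (`random_state=0` is an arbitrary choice of mine):

``````
# 75% of the data for training, 25% for testing, as the exercise asks.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, train_size=0.75, random_state=0)

dt = tree.DecisionTreeClassifier()
dt = dt.fit(x_train, y_train)  # refit on the new training split

measure_performance(x_train, y_train, dt)
measure_performance(x_test, y_test, dt)
``````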

### 3. Load the breast cancer dataset (`datasets.load_breast_cancer()`) and perform basic exploratory analysis. What attributes do we have? What are we trying to predict?

For context of the data, see the documentation here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

``````

In [21]:

breast_cancer = datasets.load_breast_cancer()
``````
``````

In [22]:

breast_cancer

``````
``````

Out[22]:

{'DESCR': 'Breast Cancer Wisconsin (Diagnostic) Database\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 569\n\n    :Number of Attributes: 30 numeric, predictive attributes and the class\n\n    :Attribute Information:\n        - radius (mean of distances from center to points on the perimeter)\n        - texture (standard deviation of gray-scale values)\n        - perimeter\n        - area\n        - smoothness (local variation in radius lengths)\n        - compactness (perimeter^2 / area - 1.0)\n        - concavity (severity of concave portions of the contour)\n        - concave points (number of concave portions of the contour)\n        - symmetry \n        - fractal dimension ("coastline approximation" - 1)\n        \n        The mean, standard error, and "worst" or largest (mean of the three\n        largest values) of these features were computed for each image,\n        resulting in 30 features.  For instance, field 3 is Mean Radius, field\n        13 is Radius SE, field 23 is Worst Radius.\n        \n        - class:\n                - WDBC-Malignant\n                - WDBC-Benign\n\n    :Summary Statistics:\n\n    ===================================== ======= ========\n                                           Min     Max\n    ===================================== ======= ========\n    radius (mean):                         6.981   28.11\n    texture (mean):                        9.71    39.28\n    perimeter (mean):                      43.79   188.5\n    area (mean):                           143.5   2501.0\n    smoothness (mean):                     0.053   0.163\n    compactness (mean):                    0.019   0.345\n    concavity (mean):                      0.0     0.427\n    concave points (mean):                 0.0     0.201\n    symmetry (mean):                       0.106   0.304\n    fractal dimension (mean):              0.05    0.097\n    radius (standard error):               0.112   2.873\n    texture (standard error):              0.36    4.885\n    perimeter (standard error):            0.757   21.98\n    area (standard error):                 6.802   542.2\n    smoothness (standard error):           0.002   0.031\n    compactness (standard error):          0.002   0.135\n    concavity (standard error):            0.0     0.396\n    concave points (standard error):       0.0     0.053\n    symmetry (standard error):             0.008   0.079\n    fractal dimension (standard error):    0.001   0.03\n    radius (worst):                        7.93    36.04\n    texture (worst):                       12.02   49.54\n    perimeter (worst):                     50.41   251.2\n    area (worst):                          185.2   4254.0\n    smoothness (worst):                    0.071   0.223\n    compactness (worst):                   0.027   1.058\n    concavity (worst):                     0.0     1.252\n    concave points (worst):                0.0     0.291\n    symmetry (worst):                      0.156   0.664\n    fractal dimension (worst):             0.055   0.208\n    ===================================== ======= ========\n\n    :Missing Attribute Values: None\n\n    :Class Distribution: 212 - Malignant, 357 - Benign\n\n    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. 
Mangasarian\n\n    :Donor: Nick Street\n\n    :Date: November, 1995\n\nThis is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.\nhttps://goo.gl/U2Uwz2\n\nFeatures are computed from a digitized image of a fine needle\naspirate (FNA) of a breast mass.  They describe\ncharacteristics of the cell nuclei present in the image.\nA few of the images can be found at\nhttp://www.cs.wisc.edu/~street/images/\n\nSeparating plane described above was obtained using\nMultisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree\nConstruction Via Linear Programming." Proceedings of the 4th\nMidwest Artificial Intelligence and Cognitive Science Society,\npp. 97-101, 1992], a classification method which uses linear\nprogramming to construct a decision tree.  Relevant features\nwere selected using an exhaustive search in the space of 1-4\nfeatures and 1-3 separating planes.\n\nThe actual linear program used to obtain the separating plane\nin the 3-dimensional space is that described in:\n[K. P. Bennett and O. L. Mangasarian: "Robust Linear\nProgramming Discrimination of Two Linearly Inseparable Sets",\nOptimization Methods and Software 1, 1992, 23-34].\n\nThis database is also available through the UW CS ftp server:\n\nftp ftp.cs.wisc.edu\ncd math-prog/cpo-dataset/machine-learn/WDBC/\n\nReferences\n----------\n   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction \n     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on \n     Electronic Imaging: Science and Technology, volume 1905, pages 861-870, \n     San Jose, CA, 1993. \n   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and \n     prognosis via linear programming. Operations Research, 43(4), pages 570-577, \n     July-August 1995.\n   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques\n     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) \n     163-171.\n',
'data': array([[  1.79900000e+01,   1.03800000e+01,   1.22800000e+02, ...,
2.65400000e-01,   4.60100000e-01,   1.18900000e-01],
[  2.05700000e+01,   1.77700000e+01,   1.32900000e+02, ...,
1.86000000e-01,   2.75000000e-01,   8.90200000e-02],
[  1.96900000e+01,   2.12500000e+01,   1.30000000e+02, ...,
2.43000000e-01,   3.61300000e-01,   8.75800000e-02],
...,
[  1.66000000e+01,   2.80800000e+01,   1.08300000e+02, ...,
1.41800000e-01,   2.21800000e-01,   7.82000000e-02],
[  2.06000000e+01,   2.93300000e+01,   1.40100000e+02, ...,
2.65000000e-01,   4.08700000e-01,   1.24000000e-01],
[  7.76000000e+00,   2.45400000e+01,   4.79200000e+01, ...,
0.00000000e+00,   2.87100000e-01,   7.03900000e-02]]),
'feature_names': array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension'],
dtype='<U23'),
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1,
1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1,
0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1,
1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]),
'target_names': array(['malignant', 'benign'],
dtype='<U9')}

``````
``````

In [23]:

breast_cancer.keys()

``````
``````

Out[23]:

dict_keys(['feature_names', 'target_names', 'target', 'DESCR', 'data'])

``````
``````

In [24]:

breast_cancer['feature_names']

``````
``````

Out[24]:

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension'],
dtype='<U23')

``````
``````

In [25]:

type(breast_cancer)

``````
``````

Out[25]:

sklearn.datasets.base.Bunch

``````
``````

In [26]:

df = pd.DataFrame(breast_cancer.data, columns=breast_cancer['feature_names'])

``````
``````

In [27]:

df

``````
``````

Out[27]:

     mean radius  mean texture  mean perimeter  ...  worst symmetry  worst fractal dimension
0         17.990         10.38          122.80  ...          0.4601                  0.11890
1         20.570         17.77          132.90  ...          0.2750                  0.08902
2         19.690         21.25          130.00  ...          0.3613                  0.08758
3         11.420         20.38           77.58  ...          0.6638                  0.17300
4         20.290         14.34          135.10  ...          0.2364                  0.07678
..           ...           ...             ...  ...             ...                      ...
564       21.560         22.39          142.00  ...          0.2060                  0.07115
565       20.130         28.25          131.20  ...          0.2572                  0.06637
566       16.600         28.08          108.30  ...          0.2218                  0.07820
567       20.600         29.33          140.10  ...          0.4087                  0.12400
568        7.760         24.54           47.92  ...          0.2871                  0.07039

[569 rows × 30 columns]

``````
``````

In [28]:

df.corr()

``````
``````

Out[28]:

                         mean radius  mean texture  ...  worst symmetry  worst fractal dimension
mean radius                 1.000000      0.323782  ...        0.163953                 0.007066
mean texture                0.323782      1.000000  ...        0.105008                 0.119205
mean perimeter              0.997855      0.329533  ...        0.189115                 0.051019
mean area                   0.987357      0.321086  ...        0.143570                 0.003738
...                              ...           ...  ...             ...                      ...
worst concave points        0.744214      0.295316  ...        0.502528                 0.511114
worst symmetry              0.163953      0.105008  ...        1.000000                 0.537848
worst fractal dimension     0.007066      0.119205  ...        0.537848                 1.000000

[30 rows × 30 columns]

``````

Since I am not an expert in breast cancer, I do not feel able to select the most medically relevant features for a meaningful analysis, so below I simply use the first two columns of the data: mean radius and mean texture. The correlation matrix is still useful context; it shows, for example, that mean radius and mean perimeter are almost perfectly correlated (about 0.998), so carrying both would add little information.
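One exploratory check that needs no domain expertise is the class balance; according to the DESCR above it should come out to 212 malignant vs. 357 benign. A sketch:

``````
import numpy as np

# target uses 0 = malignant, 1 = benign; count how often each occurs.
print(dict(zip(breast_cancer.target_names, np.bincount(breast_cancer.target))))
``````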

### 4. Using the breast cancer data, create a classifier to predict the diagnosis (malignant vs. benign). Perform the above hold-out evaluation (50-50 and 75-25) and discuss the results.

``````

In [30]:

# Use only the first two features: mean radius and mean texture.
x = breast_cancer.data[:, :2]
y = breast_cancer.target

``````
``````

In [31]:

dt = tree.DecisionTreeClassifier()

``````
``````

In [32]:

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5,train_size=0.5)

``````
``````

In [33]:

dt = dt.fit(x_train,y_train)

``````
``````

In [37]:

measure_performance(x_train,y_train,dt)

``````
``````

Accuracy:1.000

Classification report
precision    recall  f1-score   support

0       1.00      1.00      1.00       113
1       1.00      1.00      1.00       171

avg / total       1.00      1.00      1.00       284

Confusion matrix
[[113   0]
[  0 171]]

``````
``````

In [38]:

measure_performance(x_test,y_test,dt)

``````
``````

Accuracy:0.863

Classification report
precision    recall  f1-score   support

0       0.83      0.77      0.80        99
1       0.88      0.91      0.90       186

avg / total       0.86      0.86      0.86       285

Confusion matrix
[[ 76  23]
[ 16 170]]

``````
``````

In [39]:

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,train_size=0.75)

``````
``````

In [40]:

dt = dt.fit(x_train,y_train)

``````
``````

In [42]:

measure_performance(x_train,y_train,dt)

``````
``````

Accuracy:1.000

Classification report
precision    recall  f1-score   support

0       1.00      1.00      1.00       151
1       1.00      1.00      1.00       275

avg / total       1.00      1.00      1.00       426

Confusion matrix
[[151   0]
[  0 275]]

``````
``````

In [43]:

measure_performance(x_test,y_test,dt)

``````
``````

Accuracy:0.825

Classification report
precision    recall  f1-score   support

0       0.80      0.79      0.79        61
1       0.84      0.85      0.85        82

avg / total       0.82      0.83      0.82       143

Confusion matrix
[[48 13]
[12 70]]

``````

In both cases the tree classifies the training data perfectly (100% accuracy) while doing noticeably worse on the test data, which is the classic signature of overfitting: an unpruned decision tree keeps splitting until it has effectively memorized the training set. The test accuracy of roughly 82-86% is still decent, but the train/test gap says the model is more tailored to the training sample than to the underlying pattern.
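One standard remedy is to limit how deep the tree can grow. A minimal sketch (the `max_depth=3` value is an arbitrary, untuned choice of mine):

``````
# A depth-limited tree cannot memorize the training set, which should
# narrow the gap between training and test accuracy.
dt_pruned = tree.DecisionTreeClassifier(max_depth=3)
dt_pruned = dt_pruned.fit(x_train, y_train)

measure_performance(x_train, y_train, dt_pruned)
measure_performance(x_test, y_test, dt_pruned)
``````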
