Using Machine Learning to Predict Breast Cancer

Matt Massie, UC Berkeley Computer Sciences

  • Machine learning (ML) is data-driven: ML algorithms are constructed to learn from and make predictions on data rather than following strictly static instructions.

  • Supervised learning (e.g., classification) vs. unsupervised learning (e.g., anomaly detection)

  • In this short talk, we'll explore the freely available Breast Cancer Wisconsin data set from the University of California, Irvine (UCI) Machine Learning Repository.

  • Data set creators:

    1. Dr. William H. Wolberg, General Surgery Dept. University of Wisconsin, Clinical Sciences Center
    2. W. Nick Street, Computer Sciences Dept. University of Wisconsin
    3. Olvi L. Mangasarian, Computer Sciences Dept. University of Wisconsin

In [1]:
import numpy as np
import pandas as pd

def load_data(filename):
    """Load a comma-separated breast cancer data file into a pandas DataFrame."""
    import csv
    with open(filename, 'rb') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        # Missing values are marked '?' in the raw file; map them to -1.
        df = pd.DataFrame([[-1 if el == '?' else int(el) for el in r] for r in csvreader])
        df.columns = ["patient_id", "radius", "texture", "perimeter", "smoothness",
                      "compactness", "concavity", "concave_points", "symmetry",
                      "fractal_dimension", "malignant"]
        # Raw class labels are 2 (benign) and 4 (malignant); remap them to 0/1.
        df['malignant'] = df['malignant'].map({2: 0, 4: 1})
        return df

Training and Test Data Sets

Each patient record is randomly assigned to a "training" data set (80%) or a "test" data set (20%). Best practice also reserves a cross-validation set for model selection (e.g., 60% training, 20% cross-validation, 20% test).
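
The files used below were split ahead of time. As a minimal sketch, an equivalent split could be produced with scikit-learn's train_test_split (assuming scikit-learn >= 0.18 and a hypothetical combined file "data/breast-cancer.data"):

from sklearn.model_selection import train_test_split

# Hypothetical combined file containing every patient record.
full_set = load_data("data/breast-cancer.data")
# Hold out 20% of the records as the test set.
train_df, test_df = train_test_split(full_set, test_size=0.2, random_state=42)
# Optionally carve a cross-validation set out of the training data
# (0.25 of the remaining 80% = 20% overall, giving a 60/20/20 split).
train_df, cv_df = train_test_split(train_df, test_size=0.25, random_state=42)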


In [2]:
training_set = load_data("data/breast-cancer.train")
test_set     = load_data("data/breast-cancer.test")

print "Training set has %d patients" % (training_set.shape[0]) 
print "Test set has %d patients\n" % (test_set.shape[0])
print training_set.iloc[:, 0:6].head(3)
print
print training_set.iloc[:, 6:11].head(3)


Training set has 419 patients
Test set has 140 patients

   patient_id  radius  texture  perimeter  smoothness  compactness
0     1299994       5        1          1           1            2
1     1099510      10        4          3           1            3
2     1275807       4        2          4           3            2

   concavity  concave_points  symmetry  fractal_dimension  malignant
0          1               1         1                  1          0
1          3               6         5                  2          1
2          2               2         1                  1          0

In [3]:
training_set_malignant = training_set['malignant']
training_set_features = training_set.iloc[:, 1:10]
test_set_malignant = test_set['malignant']
test_set_features = test_set.iloc[:, 1:10]

Linear Support Vector Machine Classification

This image shows how a support vector machine searches for a "Maximum-Margin Hyperplane" in 2-dimensional space.

The breast cancer data set is 9-dimensional.

Image by User:ZackWeinberg, based on PNG version by User:Cyc [CC BY-SA 3.0], via Wikimedia Commons
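
Conceptually, the fitted linear SVM is just a weight vector w and a bias b: a new (scaled) 9-dimensional feature vector x is classified by the sign of the decision function w · x + b. A minimal sketch of that decision rule (the helper name below is made up for illustration):

import numpy as np

def linear_svm_predict(w, b, x):
    # Points on the positive side of the hyperplane are predicted
    # malignant (1); points on the negative side are benign (0).
    return 1 if np.dot(w, x) + b > 0 else 0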

Using scikit-learn to predict malignant tumors


In [4]:
from sklearn.preprocessing import MinMaxScaler
from sklearn import svm

# (1) Scale the 'training set'
scaler = MinMaxScaler()
scaled_training_set_features = scaler.fit_transform(training_set_features)
# (2) Create the model
model = svm.LinearSVC(C=0.1)
# (3) Fit the model using the 'training set'
model.fit(scaled_training_set_features, training_set_malignant)
# (4) Scale the 'test set' using the same scaler as the 'training set'
scaled_test_set_features = scaler.transform(test_set_features)
# (5) Use the model to predict malignancy in the 'test set'
test_set_malignant_predictions = model.predict(scaled_test_set_features)
print test_set_malignant_predictions


[0 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1
 1 0 1 1 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 1]
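
The regularization strength C=0.1 is fixed by hand above. A minimal sketch of choosing C by cross-validation on the training set instead, using scikit-learn's GridSearchCV (the grid of candidate values below is illustrative):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Pick the C that gives the best 5-fold cross-validation score
# on the scaled training set.
search = GridSearchCV(svm.LinearSVC(), {'C': [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(scaled_training_set_features, training_set_malignant)
print(search.best_params_)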

Evaluating performance of the model


In [5]:
from sklearn import metrics

accuracy = metrics.accuracy_score(test_set_malignant, \
                                  test_set_malignant_predictions) * 100
((tn, fp), (fn, tp)) = metrics.confusion_matrix(test_set_malignant, \
                                                test_set_malignant_predictions)

print "Accuracy: %.2f%%" % (accuracy)
print "True Positives: %d, True Negatives: %d" % (tp, tn)
print "False Positives: %d, False Negatives: %d" % (fp, fn)


Accuracy: 98.57%
True Positives: 53, True Negatives: 85
False Positives: 2, False Negatives: 0
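
Accuracy alone can hide the error that matters most in screening: a false negative is a missed malignancy. A minimal sketch of computing sensitivity (recall) and precision from the same confusion-matrix counts:

# Sensitivity (recall): fraction of actual malignancies that were caught.
sensitivity = float(tp) / (tp + fn)
# Precision: fraction of predicted malignancies that were truly malignant.
precision = float(tp) / (tp + fp)
print("Sensitivity: %.1f%%, Precision: %.1f%%" % (sensitivity * 100, precision * 100))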

Other Common Machine Learning Algorithms

  • Linear Regression
  • Logistic Regression
  • Decision Tree
  • Neural Networks
  • Naive Bayes
  • K-Means
  • Random Forest
  • Dimensionality Reduction Algorithms

Thank You

  • Looking forward to working with the Chiu lab to evaluate the mNGS SURPI data and create an optimal machine learning solution
  • Questions or comments, please feel free to contact me at massie@berkeley.edu