Breast cancer prediction: a machine learning tutorial using logistic regression

The data was obtained from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original). The data description can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

Objective

Predict breast cancer based on a set of cytology features (clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitosis).

Data Preprocessing

Missing values marked with "?" were replaced with "NaN" to make it easy to find and delete rows with missing data. Column labels were added in the first row based on the data description. Values in the class field were recoded as follows: 2 = 0 (benign), 4 = 1 (malignant/cancer). No other preprocessing was done. The data file "wisconsin_breast_cancer.csv" is in this same GitHub folder.
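For reference, here is a minimal sketch of how this preprocessing could be reproduced, assuming the raw breast-cancer-wisconsin.data file from the UCI link above (the column names follow the data description; the exact script the author used is not shown):

In [ ]:
import pandas as pd

# column names taken from the UCI data description
cols = ['id', 'thickness', 'size', 'shape', 'adhesion', 'single',
        'nuclei', 'chromatin', 'nucleoli', 'mitosis', 'class']
# the raw file has no header row and marks missing values with "?"
raw = pd.read_csv('breast-cancer-wisconsin.data', names=cols, na_values='?')
# recode the class labels: 2 (benign) -> 0, 4 (malignant/cancer) -> 1
raw['class'] = raw['class'].map({2: 0, 4: 1})
raw.to_csv('wisconsin_breast_cancer.csv', index=False)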

Requirements

Python 3.5.

Pandas 0.18.1.

sklearn 0.17.1.

matplotlib 1.5.1.

If you have installed the latest Anaconda distribution, you will already have all the required packages.
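You can confirm your installed versions by printing them:

In [ ]:
import pandas, sklearn, matplotlib
print(pandas.__version__, sklearn.__version__, matplotlib.__version__)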


Tutorial created by: Johnson Thomas.

Acknowledgements: Kevin Markham, Jason Brownlee, and all others who have created machine learning tutorials.


In [2]:
# Importing pandas to read CSV file

In [3]:
import pandas as pd

In [4]:
# Read the breast cancer CSV file into a pandas DataFrame named data
data = pd.read_csv('wisconsin_breast_cancer.csv')

In [5]:
# Display the first 5 rows of the csv file
data.head()


Out[5]:
id thickness size shape adhesion single nuclei chromatin nucleoli mitosis class
0 1000025 5 1 1 1 2 1.0 3 1 1 0
1 1002945 5 4 4 5 7 10.0 3 2 1 0
2 1015425 3 1 1 1 2 2.0 3 1 1 0
3 1016277 6 8 8 1 3 4.0 3 7 1 0
4 1017023 4 1 1 3 2 1.0 3 1 1 0

In [6]:
data.shape # It is always a good idea to understand your data


Out[6]:
(699, 11)

In [7]:
# There are 699 rows and 11 columns in this CSV file

In [8]:
data.isnull().sum() # Find out how many cells have missing values
# Looks like the field nuclei has 16 missing values


Out[8]:
id           0
thickness     0
size          0
shape         0
adhesion      0
single        0
nuclei       16
chromatin     0
nucleoli      0
mitosis       0
class         0
dtype: int64

In [9]:
data = data.dropna(how='any') # Drop any row that has missing values

In [10]:
# Previously we had 699 rows. The above code deleted the 16 rows with missing values, so now we have 683 rows

In [11]:
# Creating the feature data set
# Here we select the feature fields and assign them to a pandas DataFrame x. We do not need id, so it is dropped
x = data[['thickness','size','shape','adhesion','single','nuclei','chromatin','nucleoli','mitosis']]

In [12]:
x.head() # printing the first 5 rows to check that we got all the features


Out[12]:
thickness size shape adhesion single nuclei chromatin nucleoli mitosis
0 5 1 1 1 2 1.0 3 1 1
1 5 4 4 5 7 10.0 3 2 1
2 3 1 1 1 2 2.0 3 1 1
3 6 8 8 1 3 4.0 3 7 1
4 4 1 1 3 2 1.0 3 1 1

In [13]:
# Now we are going to extract the 'class' field from 'data' and store it in variable y
# This is the variable that we want to predict: 0 = no cancer, 1 = cancer
y = data['class']
y.head()


Out[13]:
0    0
1    0
2    0
3    0
4    0
Name: class, dtype: int64

Getting ready to do classification

Import train_test_split from sklearn's cross_validation module for the train/test split. Then import logistic regression from sklearn.

The goal is to split our data in two: a training set and a testing set. We will use the training set to train our logistic regression model. After that, we will use this model to predict cancer occurrence from the features in the testing set. Since we already know the occurrence of cancer in the testing set, we can compare it with the predicted values to benchmark our model.


In [14]:
from sklearn.cross_validation import train_test_split # renamed sklearn.model_selection in sklearn 0.18+
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0) # default split: 75% train, 25% test

In [15]:
# train logistic regression model 
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train,y_train)


Out[15]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
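As a quick sanity check on what the model learned, we can inspect the fitted weights; there is one coefficient per feature (in the column order of x) plus an intercept, and larger positive weights push a prediction toward class 1 (malignant):

In [ ]:
# one coefficient per feature, in the column order of x, plus the intercept
print(list(zip(x.columns, logreg.coef_[0])))
print(logreg.intercept_)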

In [16]:
y_pred_class = logreg.predict(x_test) # make predictions on x_test and store them in y_pred_class

In [18]:
# Now let us see how our model is performing. 
# We will start with accuracy
from sklearn import metrics
print (metrics.accuracy_score(y_test, y_pred_class))


0.929824561404

In [19]:
## Not bad at all. Our model classified 92.9% of the test cases correctly
# Now let us see how our "intelligent" model compares to a dumb model that predicts "0" ("not cancer")
# all the time, because it is the most prevalent class

In [20]:
y_test.value_counts() # as you see "0" is more prevalent


Out[20]:
0    107
1     64
Name: class, dtype: int64

In [ ]:
# Now let us see the percentage of "0" in y_test. This will be the accuracy of a dumb model
# that predicts 0 all the time
1 - y_test.mean() # 107/171, about 62.6%

In [22]:
# Looks like we did much better than the dumb model (92.9% vs about 62.6%)!

Now let us create a confusion matrix to identify sensitivity, specificity, and all the other good statistical stuff


In [24]:
print (metrics.confusion_matrix(y_test, y_pred_class))


[[103   4]
 [  8  56]]

In [25]:
#  Let us see what this means
#
#                 Predicted 0    Predicted 1    Total
#
# Actual 0            103              4          107
# Actual 1              8             56           64
#
# Total               111             60

In [26]:
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [27]:
# Let us see the sensitivity of our logistic regression model
print (TP / float(TP+FN))


0.875

In [28]:
# Our model's sensitivity is 87.5%

In [29]:
# Let us calculate specificity
print (TN / float(TN+FP))


0.96261682243

In [30]:
# Looks like our model has pretty good specificity: about 96.3%

In [31]:
# Calculate the false positive rate - predicting cancer when the patient does not have cancer
print (FP/ float(TN+FP))


0.0373831775701

In [32]:
# Pretty awesome: only 3.7%

In [33]:
# Precision - when the model predicts cancer, how precise is it?
# Also known as the positive predictive value
print (TP / float(TP+FP))


0.933333333333

In [34]:
# Negative predictive value
print (TN / float(TN+ FN))


0.927927927928
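For the record, sklearn can compute most of these directly, which makes a handy cross-check on the manual TP/TN/FP/FN arithmetic above (the values should match):

In [ ]:
print(metrics.recall_score(y_test, y_pred_class))     # sensitivity
print(metrics.precision_score(y_test, y_pred_class))  # positive predictive value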

In [35]:
# Now let us say that you want to figure out the probability of cancer from a set of features
# We can use the predict_proba function
# Let us first see the predicted answers
logreg.predict(x_test)[0:10] # predicting cancer based on the first 10 rows of x_test


Out[35]:
array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0], dtype=int64)

In [36]:
# Now let us look at the predicted probability behind each prediction
logreg.predict_proba(x_test)[0:10, :]


Out[36]:
array([[ 0.92925367,  0.07074633],
       [ 0.97810202,  0.02189798],
       [ 0.05497953,  0.94502047],
       [ 0.01971146,  0.98028854],
       [ 0.98442876,  0.01557124],
       [ 0.96262046,  0.03737954],
       [ 0.94219509,  0.05780491],
       [ 0.00202679,  0.99797321],
       [ 0.9598277 ,  0.0401723 ],
       [ 0.98330704,  0.01669296]])

In [37]:
# The first column is the probability of the sample being benign; the second column is the probability of it being cancerous
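Since predict uses a default probability cutoff of 0.5, these probabilities also let us trade specificity for sensitivity by choosing a lower cutoff ourselves. Here is a small sketch; the 0.3 threshold is just an illustrative value, not a clinical recommendation:

In [ ]:
# classify as cancer whenever the predicted probability of class 1 exceeds 0.3
y_pred_lower = (logreg.predict_proba(x_test)[:, 1] > 0.3).astype(int)
print(metrics.confusion_matrix(y_test, y_pred_lower))
print(metrics.recall_score(y_test, y_pred_lower))  # sensitivity should rise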

Now let us find out the area under the curve.

"roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve"


In [38]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc 
import matplotlib.pyplot as plt 
%matplotlib inline
# Calculate the probability of predicting "1" (cancer) and store the output in proba_cancer
proba_cancer=logreg.predict_proba(x_test)[:,1]

In [39]:
# We need the actual values in the cancer column and the predicted probabilities of the positive class "1"
roc_auc_score(y_test, proba_cancer)


Out[39]:
0.99269859813084116

In [40]:
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, proba_cancer)
roc_auc = auc(false_positive_rate, true_positive_rate)

In [41]:
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b',
         label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
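A single train/test split can be lucky or unlucky, so as a final sanity check we could also estimate the AUC with cross-validation. Here is a sketch using the same sklearn 0.17-era cross_validation module we imported earlier (it moved to sklearn.model_selection in 0.18):

In [ ]:
from sklearn.cross_validation import cross_val_score
# 10-fold cross-validated AUC over the full cleaned data set
scores = cross_val_score(LogisticRegression(), x, y, cv=10, scoring='roc_auc')
print(scores.mean())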


