Dataset: Breast Cancer Wisconsin (Diagnostic) from the UCI Machine Learning Repository
Attribute Information:
Fields 3-32: ten real-valued features are computed for each cell nucleus.
The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.
All feature values are recorded with four significant digits.
Missing attribute values: none
Class distribution: 357 benign, 212 malignant
In [19]:
import pandas as pd
In [20]:
# Read the CSV data into a DataFrame
df = pd.read_csv('./theAwesome_PredModel.csv')
# Drop the id column; it is not needed for modeling
df.drop('id', axis=1, inplace=True)
# Drop the empty unnamed column at the end
df.drop('Unnamed: 32', axis=1, inplace=True)
df.head()
Out[20]:
In [8]:
# Check the unique values in the diagnosis column
df.diagnosis.unique()
# M: Malignant (cancer)
# B: Benign (no cancer)
# Map M and B to 1 and 0 for a numerical representation
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
Generate the information about your dataset: number of columns and rows, names and data types of the columns, memory usage of the dataset.
Hint: Pandas data frame info() function.
Generate descriptive statistics of all columns (input and output) of your dataset. Descriptive statistics for numerical columns include: count, mean, std, min, 25th percentile (Q1), 50th percentile (Q2, median), 75th percentile (Q3), and max values of the columns. For categorical columns, determine the distinct values and their frequencies.
Hint: Pandas, data frame describe() function.
In [9]:
df.info()
In [10]:
df.describe()
Out[10]:
Split your data into Training and Test datasets by random selection; use 70% for training and 30% for testing. Generate descriptive statistics of all columns (input and output) of the Training and Test datasets. Review the descriptive statistics of the input and output columns in the Train, Test, and original Full (before splitting) datasets and compare them to each other. Are they similar or not? Do you think the Train and Test datasets are representative of the Full dataset? Why?
Hint: scikit-learn train_test_split() with the stratify option.
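The split in the next cells uses a random Boolean mask; an alternative that follows the hint more closely is a stratified train_test_split. A minimal sketch, assuming df is the cleaned DataFrame from above (train_alt and test_alt are illustrative names):
# Stratified 70/30 split (alternative to the random mask used below)
from sklearn.model_selection import train_test_split
train_alt, test_alt = train_test_split(
    df,
    test_size=0.3,              # 30% held out for testing
    stratify=df['diagnosis'],   # keep the benign/malignant ratio in both splits
    random_state=42,            # for reproducibility
)
print(train_alt['diagnosis'].value_counts(normalize=True))
print(test_alt['diagnosis'].value_counts(normalize=True))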
In [11]:
df["diagnosis"].value_counts(df["diagnosis"].unique()[0])
Out[11]:
In [12]:
# Split into training and test data
# (approximately 70% / 30% using a random Boolean mask)
import numpy as np  # numerical operations
msk = np.random.rand(len(df)) < 0.7
# .copy() avoids pandas SettingWithCopyWarning when these frames are scaled later
train_df = df[msk].copy()
test_df = df[~msk].copy()
In [15]:
train_df.describe()
Out[15]:
Analyze the output columns in the Train and Test datasets. If the output column is numerical, calculate the IQR (inter-quartile range, Q3 - Q1) and the Range (difference between max and min values). If your output column is categorical, determine whether the column is nominal or ordinal, and why. Is there a class imbalance problem? (Check whether there is a big difference between the frequencies of the distinct values in your categorical output column.)
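Our output column is categorical, so the IQR does not apply to it; for reference, a minimal IQR/Range computation for a numerical column could look like the sketch below (radius_mean is used purely as an example, assuming the usual column names in this CSV):
# Illustrative IQR and Range for a numerical column
q1 = train_df['radius_mean'].quantile(0.25)
q3 = train_df['radius_mean'].quantile(0.75)
iqr = q3 - q1                                                          # inter-quartile range
rng = train_df['radius_mean'].max() - train_df['radius_mean'].min()   # range
print(iqr, rng)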
In [13]:
print(train_df["diagnosis"].value_counts(train_df["diagnosis"].unique()[0]))
print(len(train_df))
train_df.describe()
Out[13]:
In [14]:
print(test_df["diagnosis"].value_counts(test_df["diagnosis"].unique()[0]))
print(len(test_df))
test_df.describe()
Out[14]:
Our output/classification label is diagnosis (M = 1, B = 0), which is nominal categorical data: the two categories have no inherent order.
The ratios of benign to malignant cases in the train and test sets are very similar to the ratio in the full data (roughly 63% benign to 37% malignant), so there is a moderate but not severe class imbalance and both splits look representative of the full dataset.
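A quick way to back this claim up is to line up the class proportions of the full, train, and test data side by side (a small check using the frames defined above):
# Class proportions in the full, training, and test data
print(pd.DataFrame({
    'full':  df['diagnosis'].value_counts(normalize=True),
    'train': train_df['diagnosis'].value_counts(normalize=True),
    'test':  test_df['diagnosis'].value_counts(normalize=True),
}))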
Using one of the scaling methods (max, min-max, standard, or robust), create a scaler object and scale the numerical input columns of the Training dataset. Using the same scaler object, scale the numerical input columns of the Test set. Generate the descriptive statistics of the scaled input columns of the Training and Test sets.
If some of the input columns are categorical, convert them to binary columns using the OneHotEncoder() class (scikit-learn) or the get_dummies() function (pandas).
Hint: http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
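All input columns in this dataset are numerical, so no encoding is needed here; for completeness, a minimal sketch of one-hot encoding with pandas get_dummies on a made-up categorical column:
# Hypothetical example only: this dataset has no categorical inputs
example = pd.DataFrame({'tumor_site': ['left', 'right', 'left']})
encoded = pd.get_dummies(example, columns=['tumor_site'])
print(encoded)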
In [13]:
# I am going to apply min-max scaling to my data.
from sklearn import preprocessing
# Fit the min-max scaler on the training inputs (all columns except diagnosis)
minmax_scale = preprocessing.MinMaxScaler().fit(train_df.iloc[:, 1:])
# Apply the same scaler to both the train and test inputs
train_df.iloc[:, 1:] = minmax_scale.transform(train_df.iloc[:, 1:])
test_df.iloc[:, 1:] = minmax_scale.transform(test_df.iloc[:, 1:])
In [11]:
train_df.head()
Out[11]:
In [12]:
test_df.head()
Out[12]:
Using one of the methods (K-Nearest Neighbor, Naïve Bayes, Neural Network, Support Vector Machines, Decision Tree), build your predictive model using the scaled input columns of the Training set. You can use any values for the model parameters, or use the default values. In building your model, use k-fold cross-validation.
Hint:
In [15]:
# Input and Output
inp_train = train_df.iloc[:, 1:]
out_train = train_df["diagnosis"]
inp_test = test_df.iloc[:, 1:]
out_test = test_df["diagnosis"]
In [16]:
# Naive Bayes:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
nb_model = GaussianNB()
nb_model.fit(inp_train, out_train)
# 10-fold cross-validation accuracy scores of the model
nb_model_scores = cross_val_score(nb_model, inp_train, out_train, cv=10, scoring='accuracy')
print(nb_model_scores)
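A mean and standard deviation over the folds make the cross-validation result easier to read at a glance:
# Summary of the 10-fold cross-validation accuracy
print("CV accuracy: %0.3f (+/- %0.3f)" % (nb_model_scores.mean(), nb_model_scores.std()))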
Apply your model to the input (scaled) columns of the Training dataset to obtain the predicted output for the Training dataset. If your model is regression, plot the actual output versus the predicted output column of the Training dataset. If your model is classification, generate a confusion matrix on the actual and predicted columns of the Training dataset.
Hint: Matplotlib, Seaborn, Bokeh scatter(), plot() functions
In [17]:
# Import plotting libraries and the confusion matrix utility
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
In [18]:
# Predictions on the training data
out_train_pred = nb_model.predict(inp_train)
# Compute the confusion matrix for the training predictions
cm = confusion_matrix(out_train, out_train_pred)
print(cm)
# Visualize the confusion matrix as a heatmap
sns.heatmap(cm)
plt.title('Confusion matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
Apply your model to the input (scaled) columns of the Test dataset to obtain the predicted output for the Test dataset. If your model is regression, plot the actual output versus the predicted output column of the Test dataset. If your model is classification, generate a confusion matrix on the actual and predicted columns of the Test dataset.
Hint: Matplotlib, Seaborn, Bokeh scatter(), plot() functions
In [19]:
# Predictions on the test data
out_test_pred = nb_model.predict(inp_test)
# Compute the confusion matrix for the test predictions
cm = confusion_matrix(out_test, out_test_pred)
print(cm)
# Visualize the confusion matrix as a heatmap
sns.heatmap(cm)
plt.title('Confusion matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
Using one of the error (evaluation) metrics (classification or regression), calculate the performance of the model on the Training set and the Test set. Compare the performance of the model on the Training and Test sets. Which one (Training or Test performance) is better? Is there an overfitting case, and why? Would you deploy (productionize) this model for actual usage in your business system? Why?
Classification metrics: Accuracy, Precision, Recall, F-score, AUC, ROC, etc. Regression metrics: RMSE, MSE, MAE, R2, etc.
In [20]:
# I would like to use ROC.
# Area under the ROC curve (AUC for short) is a
# performance metric for binary classification problems.
from sklearn.metrics import roc_curve
# ROC curve for the train data
fpr, tpr, thresholds = roc_curve(out_train, out_train_pred)
# plot the curve
plt.plot(fpr, tpr, label="Train Data")
# ROC curve for test data
fpr, tpr, thresholds = roc_curve(out_test, out_test_pred)
# Plotting the curves
plt.plot(fpr, tpr, label="Test Data")
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.title('ROC curve for Cancer classifier')
plt.xlabel('False positive rate (1-specificity)')
plt.ylabel('True positive rate (sensitivity)')
plt.legend(loc=4,)
plt.show()
As the plot shows, the model performs slightly better on the Test data than on the Train data, which is not what we would normally expect. I do not see signs of overfitting, since the model also performs well on the test data.
With a dataset this small, however, the test estimate itself is noisy, so the comparison should not be over-interpreted.
Naive Bayes works very well on this particular dataset. It would be a reasonable choice for fast prototyping and everyday use.
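One caveat on the curves above: they were built from hard 0/1 predictions, so each ROC curve has only a single intermediate point. A small sketch (assuming the fitted nb_model and the scaled splits from above) that uses predicted probabilities instead and summarizes each curve with its AUC:
# ROC/AUC from predicted probabilities rather than hard labels
from sklearn.metrics import roc_auc_score
train_proba = nb_model.predict_proba(inp_train)[:, 1]
test_proba = nb_model.predict_proba(inp_test)[:, 1]
print("Train AUC: %0.3f" % roc_auc_score(out_train, train_proba))
print("Test  AUC: %0.3f" % roc_auc_score(out_test, test_proba))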
Go back to Step 5, choose different values of the model parameters, and re-train the model. Repeat Steps 6 and 7. Using the same error metric, generate the accuracy of the model on the Training and Test datasets. Did you get better performance on the Training or Test set? Explain why the new model performs better or worse than the former model.
Let's try to calibrate GaussianNB(); I will use isotonic and sigmoid calibration for Gaussian Naive Bayes:
In [19]:
# For the training data:
# Recall that nb_model is the uncalibrated GaussianNB model
# and out_train_pred holds its (uncalibrated) training predictions.
from sklearn.calibration import CalibratedClassifierCV
# Gaussian Naive-Bayes with isotonic calibration
nb_model_isotonic = CalibratedClassifierCV(nb_model, cv=2, method='isotonic')
nb_model_isotonic.fit(inp_train, out_train)
out_train_isotonic = nb_model_isotonic.predict_proba(inp_train)[:, 1]
out_test_isotonic = nb_model_isotonic.predict_proba(inp_test)[:, 1]
In [20]:
# Gaussian Naive-Bayes with sigmoid calibration
nb_model_sigmoid = CalibratedClassifierCV(nb_model, cv=2, method='sigmoid')
nb_model_sigmoid.fit(inp_train, out_train)
out_train_sigmoid = nb_model_sigmoid.predict_proba(inp_train)[:, 1]
out_test_sigmoid = nb_model_sigmoid.predict_proba(inp_test)[:, 1]
In [21]:
# Compare ROC curves for the train data with and without calibration
# ROC curve for train data, no calibration
fpr, tpr, thresholds = roc_curve(out_train, out_train_pred)
plt.plot(fpr, tpr, label="No Cal - Train Data")
# ROC curve for train data, isotonic calibration
fpr, tpr, thresholds = roc_curve(out_train, out_train_isotonic)
plt.plot(fpr, tpr, label="Isotonic - Train Data")
# ROC curve for train data, sigmoid calibration
fpr, tpr, thresholds = roc_curve(out_train, out_train_sigmoid)
plt.plot(fpr, tpr, label="Sigmoid - Train Data")
plt.xlim([-0.05,1.05])
plt.ylim([-0.05,1.05])
plt.title('ROC curve of Train Data with Calibrations')
plt.xlabel('False positive rate (1-specificity)')
plt.ylabel('True positive rate (sensitivity)')
plt.legend(loc=4,)
plt.show()
In [22]:
# Compare ROC curves for the test data with and without calibration
# ROC curve for test data, no calibration
fpr, tpr, thresholds = roc_curve(out_test, out_test_pred)
plt.plot(fpr, tpr, label="No Cal - Test Data")
# ROC curve for test data, isotonic calibration
fpr, tpr, thresholds = roc_curve(out_test, out_test_isotonic)
plt.plot(fpr, tpr, label="Isotonic - Test Data")
# ROC curve for test data, sigmoid calibration
fpr, tpr, thresholds = roc_curve(out_test, out_test_sigmoid)
plt.plot(fpr, tpr, label="Sigmoid - Test Data")
plt.xlim([-0.05,1.05])
plt.ylim([-0.05,1.05])
plt.title('ROC curve of Test Data with Calibrations')
plt.xlabel('False positive rate (1-specificity)')
plt.ylabel('True positive rate (sensitivity)')
plt.legend(loc=4,)
plt.show()
Adding a calibration layer on top of GaussianNB() improves the results: both the isotonic and sigmoid calibrated versions perform better than the initial, uncalibrated version.
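To put numbers on this comparison, and to compare like with like (the uncalibrated ROC curve above was drawn from hard 0/1 labels), a small sketch computing the test AUC of all three variants from predicted probabilities:
# Test AUC for the uncalibrated and calibrated models, all from probabilities
from sklearn.metrics import roc_auc_score
print("No calibration:       %0.3f" % roc_auc_score(out_test, nb_model.predict_proba(inp_test)[:, 1]))
print("Isotonic calibration: %0.3f" % roc_auc_score(out_test, out_test_isotonic))
print("Sigmoid calibration:  %0.3f" % roc_auc_score(out_test, out_test_sigmoid))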
Choose an error metric other than the one you used in Step 8 and evaluate the performance of the model on the Training and Test datasets by generating the accuracy of the model based on the new metric. Compare the results and explain which error metric is better for your modeling and why.
In [23]:
# Switching the error metric to the Brier score
from sklearn.metrics import brier_score_loss
# Evaluating on the test data predictions only
print("Brier scores: (the smaller the better)")
mdl_score = brier_score_loss(out_test, out_test_pred)
print("No calibration: %1.3f" % mdl_score)
mdl_isotonic_score = brier_score_loss(out_test, out_test_isotonic)
print("With isotonic calibration: %1.3f" % mdl_isotonic_score)
mdl_sigmoid_score = brier_score_loss(out_test, out_test_sigmoid)
print("With sigmoid calibration: %1.3f" % mdl_sigmoid_score)
In [24]:
# Applying other metrics
from sklearn import metrics
print("Printing the different metric results for Not calibrated test data")
print("-"*60)
print("Precision score: %1.3f" %
metrics.precision_score(out_test, out_test_pred))
print("Recall score on: %1.3f" %
metrics.recall_score(out_test, out_test_pred))
print("F1 score on: %1.3f" %
metrics.f1_score(out_test, out_test_pred) )
print("Fbeta score with b=0.5 on: %1.3f" %
metrics.fbeta_score(out_test, out_test_pred, beta=0.5))
print("Fbeta score with b=1.0 on: %1.3f" %
metrics.fbeta_score(out_test, out_test_pred, beta=1))
print("Fbeta score with b=2.0 on: %1.3f" %
metrics.fbeta_score(out_test, out_test_pred, beta=2))
When reporting how well my models work, I like to look at an error metric and an accuracy-style metric together, and this task was an opportunity to try several of the metrics available in scikit-learn. For this model I would report the recall score, since missing a malignant case is the most costly kind of error, although I usually default to the precision score.
As a closing remark, I would like to emphasize that Naive Bayes works surprisingly well on this particular dataset (Breast Cancer from the UCI ML repository). I suspect some overfitting, because both the test and train data yield roughly 92-98% precision, which seems too good for ~30 features and about 500 data points.
More data and a carefully selected subset of features would give more trustworthy results. For the final project I plan to apply feature selection techniques and work only with the selected features.
-Enes K. Ergin-