Prediction Model

  • Course: Data Mining
  • Name: Enes Kemal Ergin

Dataset: Breast Cancer from the UCI Machine Learning Repository

Attribute Information:

  1. ID number
  2. Diagnosis (M = malignant, B = benign)

3-32: Ten real-valued features are computed for each cell nucleus:

  • radius (mean of distances from center to points on the perimeter)
  • texture (standard deviation of gray-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area - 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

All feature values are recorded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

Step 0: Data Preparation and Cleaning


In [19]:
import pandas as pd

In [20]:
# Read the CSV data into a DataFrame
df = pd.read_csv('./theAwesome_PredModel.csv')
# Drop the id column; it carries no predictive information
df.drop('id', axis=1, inplace=True)
# Drop the empty 'Unnamed: 32' column at the end
df.drop('Unnamed: 32', axis=1, inplace=True)
df.head()


Out[20]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 31 columns
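
The dataset description above claims there are no missing attribute values. A quick sanity check after the cleaning cell (a minimal sketch) would be:

df.isnull().sum().sum()  # expected to be 0 if the 'no missing values' claim holds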


In [8]:
# Check the unique values in the diagnosis column
df.diagnosis.unique()
# M: Malignant (cancer)
# B: Benign (no cancer)

# Map M and B to 1 and 0 so the label is numeric
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

Step 1: Data Information and Descriptive Statistics

Generate the information about your dataset: number of columns and rows, names and data types of the columns, memory usage of the dataset.

Hint: Pandas data frame info() function.

Generate descriptive statistics of all columns (input and output) of your dataset. Descriptive statistics for numerical columns include: count, mean, std, min, 25th percentile (Q1), 50th percentile (Q2, median), 75th percentile (Q3), and max values of the columns. For categorical columns, determine the distinct values and their frequency in each categorical column.

Hint: Pandas, data frame describe() function.


In [9]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
diagnosis                  569 non-null int64
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non-null float64
concave points_se          569 non-null float64
symmetry_se                569 non-null float64
fractal_dimension_se       569 non-null float64
radius_worst               569 non-null float64
texture_worst              569 non-null float64
perimeter_worst            569 non-null float64
area_worst                 569 non-null float64
smoothness_worst           569 non-null float64
compactness_worst          569 non-null float64
concavity_worst            569 non-null float64
concave points_worst       569 non-null float64
symmetry_worst             569 non-null float64
fractal_dimension_worst    569 non-null float64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB

In [10]:
df.describe()


Out[10]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 0.372583 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 0.483918 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 0.000000 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 0.000000 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 0.000000 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 1.000000 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 1.000000 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 31 columns

Step 2: Train Test Split

Split your data into Training and Test datasets by random selection; use 70% for training and 30% for testing. Generate descriptive statistics of all columns (input and output) of the Training and Test datasets. Review the descriptive statistics of the input and output columns in the Train, Test, and original Full (before the splitting operation) datasets and compare them to each other. Are they similar or not? Do you think the Train and Test datasets are representative of the Full dataset? Why?

Hint: scikit-learn train_test_split() function, with the stratify option.


In [11]:
df["diagnosis"].value_counts(df["diagnosis"].unique()[0])


Out[11]:
0    0.627417
1    0.372583
Name: diagnosis, dtype: float64

In [12]:
# Splitting into train and test data, roughly 70% / 30%
import numpy as np  # numerical utilities
# Boolean mask: about 70% of the rows go to the training set
# (note: this split is approximate and not stratified)
msk = np.random.rand(len(df)) < 0.7
train_df = df[msk]
test_df = df[~msk]
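
The mask above gives an approximately 70/30, non-stratified split. The hint suggests scikit-learn's train_test_split(), which would give an exact split and can preserve the class ratio via stratification; a minimal sketch (assuming the same df) would be:

from sklearn.model_selection import train_test_split

# Stratify on the label so both splits keep the benign/malignant ratio
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42,
                                     stratify=df['diagnosis'])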

In [15]:
train_df.describe()


Out[15]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 ... 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000
mean 0.383886 14.184332 19.225924 92.302227 663.239100 0.096341 0.103223 0.086774 0.048537 0.180788 ... 16.405773 25.683531 108.022346 900.204976 0.132793 0.252230 0.268960 0.114119 0.290083 0.083759
std 0.486908 3.650024 4.335877 25.114122 365.713579 0.013840 0.051327 0.076341 0.038415 0.027480 ... 5.027512 6.185842 34.791385 594.827452 0.022765 0.155543 0.205858 0.065442 0.062144 0.017556
min 0.000000 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.120300 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.164800 0.055210
25% 0.000000 11.602500 16.162500 74.262500 412.550000 0.086688 0.062370 0.028973 0.020195 0.161900 ... 12.842500 20.992500 83.535000 507.425000 0.117275 0.144425 0.108950 0.064945 0.250250 0.071572
50% 0.000000 13.415000 18.760000 86.210000 555.900000 0.096530 0.091705 0.061745 0.033285 0.178750 ... 14.975000 25.465000 98.115000 685.550000 0.132650 0.211750 0.229800 0.098330 0.280600 0.079460
75% 1.000000 16.167500 21.575000 106.525000 812.200000 0.105375 0.130200 0.123275 0.073580 0.195225 ... 19.792500 30.100000 129.075000 1216.000000 0.146075 0.341175 0.385300 0.162725 0.316875 0.092082
max 1.000000 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 ... 36.040000 47.160000 251.200000 4254.000000 0.222600 0.937900 1.252000 0.286700 0.663800 0.173000

8 rows × 31 columns

Step 3: Analysis of the Output Column

Analyze the output columns in the Train and Test datasets. If the output column is numerical, calculate the IQR (interquartile range, Q3 - Q1) and the Range (difference between the max and min values). If your output column is categorical, determine whether the column is nominal or ordinal, and why. Is there a class imbalance problem? (Check whether there is a big difference between the frequencies of the distinct values in your categorical output column.)


In [13]:
print(train_df["diagnosis"].value_counts(train_df["diagnosis"].unique()[0]))
print(len(train_df))
train_df.describe()


0    0.616114
1    0.383886
Name: diagnosis, dtype: float64
422
Out[13]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 ... 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000 422.000000
mean 0.383886 14.184332 19.225924 92.302227 663.239100 0.096341 0.103223 0.086774 0.048537 0.180788 ... 16.405773 25.683531 108.022346 900.204976 0.132793 0.252230 0.268960 0.114119 0.290083 0.083759
std 0.486908 3.650024 4.335877 25.114122 365.713579 0.013840 0.051327 0.076341 0.038415 0.027480 ... 5.027512 6.185842 34.791385 594.827452 0.022765 0.155543 0.205858 0.065442 0.062144 0.017556
min 0.000000 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.120300 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.164800 0.055210
25% 0.000000 11.602500 16.162500 74.262500 412.550000 0.086688 0.062370 0.028973 0.020195 0.161900 ... 12.842500 20.992500 83.535000 507.425000 0.117275 0.144425 0.108950 0.064945 0.250250 0.071572
50% 0.000000 13.415000 18.760000 86.210000 555.900000 0.096530 0.091705 0.061745 0.033285 0.178750 ... 14.975000 25.465000 98.115000 685.550000 0.132650 0.211750 0.229800 0.098330 0.280600 0.079460
75% 1.000000 16.167500 21.575000 106.525000 812.200000 0.105375 0.130200 0.123275 0.073580 0.195225 ... 19.792500 30.100000 129.075000 1216.000000 0.146075 0.341175 0.385300 0.162725 0.316875 0.092082
max 1.000000 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 ... 36.040000 47.160000 251.200000 4254.000000 0.222600 0.937900 1.252000 0.286700 0.663800 0.173000

8 rows × 31 columns


In [14]:
print(test_df["diagnosis"].value_counts(test_df["diagnosis"].unique()[0]))
print(len(test_df))
test_df.describe()


0    0.659864
1    0.340136
Name: diagnosis, dtype: float64
147
Out[14]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 ... 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000
mean 0.340136 13.963544 19.472585 91.012517 630.918367 0.096415 0.107551 0.094612 0.050017 0.182235 ... 15.877095 25.659116 105.076190 824.253741 0.131149 0.260109 0.281458 0.116004 0.290055 0.084481
std 0.475374 3.140310 4.208618 21.842541 308.798010 0.014738 0.056924 0.088731 0.040008 0.027291 ... 4.217536 6.052036 29.931954 486.390633 0.023061 0.162776 0.216822 0.066766 0.061277 0.019493
min 0.000000 8.950000 10.380000 56.360000 245.200000 0.064290 0.026750 0.000000 0.000000 0.106000 ... 9.414000 14.100000 60.900000 270.000000 0.085670 0.050360 0.000000 0.000000 0.156500 0.055040
25% 0.000000 11.940000 16.570000 77.080000 439.300000 0.085130 0.067720 0.032380 0.022085 0.162050 ... 13.220000 21.380000 86.160000 532.000000 0.114150 0.152400 0.131700 0.064410 0.251350 0.071370
50% 0.000000 13.170000 19.220000 86.870000 537.300000 0.094620 0.095090 0.060150 0.035280 0.181300 ... 14.960000 25.210000 97.170000 686.500000 0.130100 0.216400 0.224100 0.101500 0.292900 0.081130
75% 1.000000 14.995000 22.215000 98.570000 696.250000 0.105100 0.130900 0.133350 0.074435 0.196850 ... 17.375000 29.125000 115.800000 926.950000 0.144050 0.327600 0.379200 0.153000 0.320400 0.091870
max 1.000000 24.630000 33.560000 165.500000 1841.000000 0.139800 0.311400 0.426400 0.182300 0.255600 ... 31.010000 49.540000 206.800000 2944.000000 0.190900 1.058000 1.105000 0.291000 0.488200 0.207500

8 rows × 31 columns

Our output/classification label is diagnosis (M = 1, B = 0), which is nominal categorical data: the two classes have no inherent order.

The ratios of benign to malignant cases in the train and test sets are close to those in the full dataset (about 63% benign, 37% malignant), so both splits appear representative of the full data.

Step 4: Scale Training and Test Dataset

Using one of the scaling methods (max, min-max, standard, or robust), create a scaler object and scale the numerical input columns of the Training dataset. Using the same scaler object, scale the numerical input columns of the Test set. Generate the descriptive statistics of the scaled input columns of the Training and Test sets.

If some of the input columns are categorical, convert them to binary columns using the OneHotEncoder() class (scikit-learn) or the get_dummies() function (pandas).

Hint: http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing


In [13]:
# I am going to apply min-max scaling to my data.
from sklearn import preprocessing
# Fit the min-max scaler on the training inputs only (all columns except diagnosis)
minmax_scale = preprocessing.MinMaxScaler().fit(train_df.iloc[:, 1:])
# Now actually scale the train and test inputs with the same scaler
train_df.iloc[:, 1:] = minmax_scale.transform(train_df.iloc[:, 1:])
test_df.iloc[:, 1:] = minmax_scale.transform(test_df.iloc[:, 1:])
# Note: train_df and test_df are slices of df, so pandas emits a
#  SettingWithCopyWarning below; creating them with .copy() at the
#  split step would avoid it. The scaled values are assigned correctly.


/Users/eneskemalergin/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:477: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
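
For reference, min-max scaling maps each input column to [0, 1] via x' = (x - min) / (max - min), with min and max taken from the training data only (so test values can fall slightly outside [0, 1]). A quick check of the scaled training inputs, answering the step's request for descriptive statistics (a sketch):

train_df.iloc[:, 1:].describe().loc[['min', 'max']]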

In [11]:
train_df.head()


Out[11]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
1 1 0.643144 0.272574 0.615783 0.501591 0.277910 0.181768 0.203799 0.366806 0.407367 ... 0.606901 0.324132 0.539818 0.435214 0.301026 0.148757 0.192971 0.639175 0.233590 0.222878
2 1 0.601496 0.390260 0.595743 0.449417 0.588699 0.431017 0.462946 0.668583 0.546587 ... 0.556386 0.384462 0.508442 0.374508 0.446763 0.381154 0.359744 0.835052 0.403706 0.213433
3 1 0.210090 0.360839 0.233501 0.102906 1.000000 0.811361 0.566135 0.549922 0.832611 ... 0.248310 0.412066 0.241347 0.094008 0.909445 0.812734 0.548642 0.884880 1.000000 0.773711
5 1 0.258839 0.202570 0.267984 0.141506 0.816227 0.461996 0.370075 0.422844 0.556338 ... 0.268232 0.333808 0.263908 0.136748 0.692253 0.479232 0.427716 0.598282 0.477035 0.454939
8 1 0.284869 0.409537 0.302052 0.159618 0.809976 0.533157 0.435976 0.488918 0.698808 ... 0.268943 0.532442 0.277852 0.136183 0.629996 0.494080 0.430511 0.707904 0.554504 0.342123

5 rows × 31 columns


In [12]:
test_df.head()


Out[12]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 1 0.521037 0.022658 0.545989 0.363733 0.698712 0.792037 0.703799 0.768949 0.736186 ... 0.620776 0.151110 0.668310 0.450698 0.572692 0.616677 0.568610 0.912027 0.598462 0.418864
4 1 0.629893 0.156578 0.630986 0.489290 0.472434 0.347893 0.464353 0.545217 0.405742 ... 0.519744 0.132328 0.506948 0.341575 0.397241 0.166732 0.319489 0.558419 0.157500 0.142595
6 1 0.533343 0.347311 0.523875 0.380276 0.401550 0.274891 0.264306 0.386827 0.397616 ... 0.531839 0.445077 0.511928 0.349194 0.445348 0.218115 0.302236 0.663918 0.295289 0.187853
7 1 0.318472 0.376057 0.320710 0.184263 0.704963 0.445126 0.219653 0.312859 0.615385 ... 0.324795 0.458736 0.299766 0.174941 0.595331 0.326157 0.213898 0.534708 0.321506 0.393939
17 1 0.433007 0.370984 0.444406 0.277964 0.681210 0.560763 0.403846 0.537376 0.598050 ... 0.463536 0.553785 0.430251 0.277674 0.690838 0.379982 0.382109 0.712371 0.422038 0.388036

5 rows × 31 columns

Step 5: Build Predictive Model

Using one of the methods (K-Nearest Neighbors, Naïve Bayes, Neural Network, Support Vector Machines, Decision Tree), build your predictive model using the scaled input columns of the Training set. You can use any values for the model parameters, or use the default values. In building your model, use k-fold cross-validation.

Hint:


In [15]:
# Input and Output
inp_train = train_df.iloc[:, 1:] 
out_train = train_df["diagnosis"]
inp_test = test_df.iloc[:, 1:] 
out_test = test_df["diagnosis"]

In [16]:
# Naive Bayes:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
nb_model = GaussianNB() 
nb_model.fit(inp_train, out_train)

# Cross validation score of my model
nb_model_scores = cross_val_score(nb_model, inp_train, out_train, cv=10, scoring='accuracy')
print(nb_model_scores)


[ 0.95        0.875       0.9         0.9         0.925       0.925       0.95
  0.925       0.94871795  0.94736842]
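
A compact way to summarize the 10-fold results (a minimal sketch using the scores above):

print("CV accuracy: %.3f (+/- %.3f)" % (nb_model_scores.mean(), nb_model_scores.std()))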

Step 6. Model Predictions on Training Dataset

Apply your model to input (scaled) columns of Training dataset to obtain the predicted output for Training dataset. If your model is regression then plot actual output versus predicted output column of Training dataset. If your model is classification then generate confusion matrix on actual and predicted columns of Training dataset.

Hint: Matplotlib, Seaborn, Bokeh scatter(), plot() functions


In [17]:
# importing libraries for plotting
# Importing library for confusion matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')

In [18]:
# Predictions on the training data
out_train_pred = nb_model.predict(inp_train)
# Compute the confusion matrix for the training predictions
cm = confusion_matrix(out_train, out_train_pred)
print(cm)
# Plot the confusion matrix as a heatmap
sns.heatmap(cm)
plt.title('Confusion matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()


[[240   9]
 [ 18 130]]
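
Reading the matrix with rows as true labels and columns as predicted labels, a quick training-accuracy check is (240 + 130) / (240 + 9 + 18 + 130) ≈ 0.932. The same numbers can be pulled out programmatically (a small sketch using cm from above):

tn, fp, fn, tp = cm.ravel()  # scikit-learn orders the 2x2 matrix as [[TN, FP], [FN, TP]]
print("Train accuracy: %.3f" % ((tn + tp) / cm.sum()))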

Step 7. Model Predictions on Test Dataset

Apply your model to input (scaled) columns of Test dataset to obtain the predicted output for Test dataset. If your model is regression then plot actual output versus predicted output column of Test dataset. If your model is classification then generate confusion matrix on actual and predicted columns of Test dataset.

Hint: Matplotlib, Seaborn, Bokeh scatter(), plot() functions


In [19]:
# Predictions on the test data
out_test_pred = nb_model.predict(inp_test)
# Compute the confusion matrix for the test predictions
cm = confusion_matrix(out_test, out_test_pred)
print(cm)
# Plot the confusion matrix as a heatmap
sns.heatmap(cm)
plt.title('Confusion matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()


[[98 10]
 [ 2 62]]
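
By the same computation as for the training split, the test accuracy from this matrix is (98 + 62) / (98 + 10 + 2 + 62) ≈ 0.930, essentially the same as the ≈ 0.932 obtained on the training data.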

Step 8. Model Performance

Using one of the error (evaluation) metrics (classification or regression), calculate the performance of the model on the Training and Test sets. Compare the performance of the model on the Training and Test sets. Which one (Training or Test performance) is better? Is there an overfitting case, and why? Would you deploy (productionize) this model for actual use in your business system? Why?

Classification metrics: Accuracy, Precision, Recall, F-score, AUC, ROC, etc. Regression metrics: RMSE, MSE, MAE, R2, etc.


In [20]:
# I would like to use ROC
# Area under ROC Curve (or AUC for short) is 
#  a performance metric for binary classification problems.
from sklearn.metrics import roc_curve
# ROC curve for train data
fpr,tpr,thresholds = roc_curve(out_train, out_train_pred)
# plot the curve
plt.plot(fpr, tpr, label="Train Data")
# ROC curve for test data
fpr, tpr, thresholds = roc_curve(out_test, out_test_pred)
# Plotting the curves
plt.plot(fpr, tpr, label="Test Data")
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.title('ROC curve for Cancer classifier')
plt.xlabel('False positive rate (1-specificity)')
plt.ylabel('True positive rate (sensitivity)')
plt.legend(loc=4,)
plt.show()
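
One caveat: out_train_pred and out_test_pred are hard 0/1 predictions, so each ROC curve above is built from a single operating point rather than a full probability sweep. A sketch of the same comparison using predicted probabilities, summarized by the AUC (assuming nb_model from Step 5):

from sklearn.metrics import roc_auc_score

# Probability of the positive (malignant) class instead of hard labels
train_proba = nb_model.predict_proba(inp_train)[:, 1]
test_proba = nb_model.predict_proba(inp_test)[:, 1]
print("Train AUC: %.3f" % roc_auc_score(out_train, train_proba))
print("Test AUC:  %.3f" % roc_auc_score(out_test, test_proba))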


As the plot shows, the model does slightly better on the Test data than on the Train data, which is not what one would normally expect. I do not see signs of overfitting, since the test data also performs well.

That said, the strong test numbers could partly be down to this particular random split, so they should be read with some caution.

Naive Bayes works very well on this particular dataset. It is a good candidate for fast prototyping and practical use.

Step 9. Update the Model

Go back to Step 5, choose different values for the model parameters, and re-train the model. Repeat Steps 6 and 7. Using the same error metric, evaluate the accuracy of the model on the Training and Test datasets. Did you get better performance on the Training or the Test set? Explain why the new model performs better or worse than the former model.


Let's try calibrating GaussianNB(): I will use isotonic and sigmoid calibration for Gaussian Naive Bayes:


In [19]:
# For the training data:
# Recall that the uncalibrated GaussianNB predictions are stored in
#  out_train_pred (and out_test_pred for the test set)

from sklearn.calibration import CalibratedClassifierCV
# Gaussian Naive Bayes with isotonic calibration
nb_model_isotonic = CalibratedClassifierCV(nb_model, cv=2, method='isotonic')
nb_model_isotonic.fit(inp_train, out_train)
out_train_isotonic = nb_model_isotonic.predict_proba(inp_train)[:, 1]
out_test_isotonic = nb_model_isotonic.predict_proba(inp_test)[:, 1]

In [20]:
# Gaussian Naive-Bayes with sigmoid calibration
nb_model_sigmoid = CalibratedClassifierCV(nb_model, cv=2, method='sigmoid')
nb_model_sigmoid.fit(inp_train, out_train)
out_train_sigmoid = nb_model_sigmoid.predict_proba(inp_train)[:, 1]
out_test_sigmoid = nb_model_sigmoid.predict_proba(inp_test)[:, 1]

In [21]:
## Plotting the comparison of train Data roc_curves
# ROC curve for train data no calibration
fpr,tpr,thresholds = roc_curve(out_train, out_train_pred)
# plot the curve
plt.plot(fpr, tpr, label="No Cal - Train Data")

# ROC curve for train data isotonic calibration
fpr,tpr,thresholds = roc_curve(out_train, out_train_isotonic)
# plot the curve
plt.plot(fpr, tpr, label="Isotonic - Train Data")

# ROC curve for train data sigmoid calibration
fpr,tpr,thresholds = roc_curve(out_train, out_train_sigmoid)
# plot the curve
plt.plot(fpr, tpr, label="Sigmoid - Train Data")

plt.xlim([-0.05,1.05])
plt.ylim([-0.05,1.05])
plt.title('ROC curve of Train Data with Calibrations')
plt.xlabel('False positive rate (1-specificity)')
plt.ylabel('True positive rate (sensitivity)')
plt.legend(loc=4,)
plt.show()



In [22]:
# ROC curve for test data no calibration
fpr, tpr, thresholds = roc_curve(out_test, out_test_pred)
# Plotting the curves
plt.plot(fpr, tpr, label="No Cal - Test Data")

# ROC curve for test data isotonic calibration
fpr,tpr,thresholds = roc_curve(out_test, out_test_isotonic)
# plot the curve
plt.plot(fpr, tpr, label="Isotonic - Test Data")

# ROC curve for test data sigmoid calibration
fpr,tpr,thresholds = roc_curve(out_test, out_test_sigmoid)
# plot the curve
plt.plot(fpr, tpr, label="Sigmoid - Test Data")

plt.xlim([-0.05,1.05])
plt.ylim([-0.05,1.05])
plt.title('ROC curve of Test Data with Calibrations')
plt.xlabel('False positive rate (1-specificity)')
plt.ylabel('True positive rate (sensitivity)')
plt.legend(loc=4,)
plt.show()
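
To put numbers on this comparison, the area under each test-set curve can be computed directly (a sketch using the predictions above; note that the uncalibrated entry uses hard 0/1 labels, so its AUC corresponds to the single corner point of its curve):

from sklearn.metrics import roc_auc_score

print("Test AUC, no calibration: %.3f" % roc_auc_score(out_test, out_test_pred))
print("Test AUC, isotonic:       %.3f" % roc_auc_score(out_test, out_test_isotonic))
print("Test AUC, sigmoid:        %.3f" % roc_auc_score(out_test, out_test_sigmoid))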


Calibration, which adds one more layer on top of GaussianNB(), works better than no calibration: both the isotonic and sigmoid versions outperform the uncalibrated model in these ROC plots. (Part of the gap comes from the fact that the calibrated curves are drawn from predicted probabilities, while the uncalibrated curve is based on hard 0/1 predictions.)

Step 10. Change the Error Metric

Choose an error metric other than the one you used in Step 8 and evaluate the performance of the model on the Training and Test datasets by generating the accuracy of the model based on the new metric. Compare the results and explain which error metric is better for your modeling and why.


In [23]:
# Switching the error metric to the Brier score
from sklearn.metrics import brier_score_loss

# Checking for only test data predictions

print("Brier scores: (the smaller the better)")
mdl_score = brier_score_loss(out_test, out_test_pred)
print("No calibration: %1.3f" % mdl_score)
mdl_isotonic_score = brier_score_loss(out_test, out_test_isotonic)
print("With isotonic calibration: %1.3f" % mdl_isotonic_score)
mdl_sigmoid_score = brier_score_loss(out_test, out_test_sigmoid)
print("With sigmoid calibration: %1.3f" % mdl_sigmoid_score)


Brier scores: (the smaller the better)
No calibration: 0.058
With isotonic calibration: 0.026
With sigmoid calibration: 0.037

In [24]:
# Applying other metrics
from sklearn import metrics
print("Printing the different metric results for Not calibrated test data")
print("-"*60)
print("Precision score: %1.3f" % 
      metrics.precision_score(out_test, out_test_pred))
print("Recall score on: %1.3f" % 
      metrics.recall_score(out_test, out_test_pred))
print("F1 score on: %1.3f" % 
      metrics.f1_score(out_test, out_test_pred) )
print("Fbeta score with b=0.5 on: %1.3f" %
      metrics.fbeta_score(out_test, out_test_pred, beta=0.5))
print("Fbeta score with b=1.0 on: %1.3f" %
      metrics.fbeta_score(out_test, out_test_pred, beta=1))   
print("Fbeta score with b=2.0 on: %1.3f" %
      metrics.fbeta_score(out_test, out_test_pred, beta=2))


Printing the different metric results for Not calibrated test data
------------------------------------------------------------
Precision score: 0.931
Recall score on: 0.915
F1 score on: 0.923
Fbeta score with b=0.5 on: 0.928
Fbeta score with b=1.0 on: 0.923
Fbeta score with b=2.0 on: 0.918
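
For reference, the F-beta scores above follow F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall); with the precision of 0.931 and recall of 0.915 printed above, beta = 1 gives 2 * 0.931 * 0.915 / (0.931 + 0.915) ≈ 0.923, matching the reported F1 value.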

When it comes to showing how well my models work, I like to report an error metric and an accuracy-style metric together. In this task I had the opportunity to try several of the different metrics available in scikit-learn. For this model, where missing a malignant case is the costly mistake, I would report the recall score, although I usually default to the precision score.


As closing remarks for the project, I would like to emphasize that Naive Bayes works surprisingly well for this particular dataset (Breast Cancer from the UCI ML repository). I suspect some overfitting, because both the test and train data yield roughly 92-98% precision, which seems too good for a model built from about 30 features and roughly 570 data points.

With more data and a deliberate feature selection I could get more realistic results. For the final project I plan to use techniques that select the most informative features and work only with them.

-Enes K. Ergin-