In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

Definition of the problem

We need to develop a model that can classify breast cells as bningn (non harmful) or malignant (cancerous).

The list of attribues are:

  1. Sample code number: id number
  2. Clump Thickness: 1 - 10
  3. Uniformity of Cell Size: 1 - 10
  4. Uniformity of Cell Shape: 1 - 10
  5. Marginal Adhesion: 1 - 10
  6. Single Epithelial Cell Size: 1 - 10
  7. Bare Nuclei: 1 - 10
  8. Bland Chromatin: 1 - 10
  9. Normal Nucleoli: 1 - 10
  10. Mitoses: 1 - 10
  11. Class: (0 for benign, 1 for malignant)

In [2]:
# 1 Read dataset
cols = [
       'clump thickness',
       'uniformity of cell size',
       'uniformity of cell shape',
       'marginal adhesion',
       'single epithelial cell size',
       'bare nuclei',
       'bland chromatin',
       'normal nucleoli',
       'mitoses',
       'class']

df = pd.read_csv('breast-cancer-wisconsin.data',index_col=0,header=None)
df.index.name = 'id number'
df.columns=cols
df.head()


Out[2]:
clump thickness uniformity of cell size uniformity of cell shape marginal adhesion single epithelial cell size bare nuclei bland chromatin normal nucleoli mitoses class
id number
1000025 5 1 1 1 2 1 3 1 1 2
1002945 5 4 4 5 7 10 3 2 1 2
1015425 3 1 1 1 2 2 3 1 1 2
1016277 6 8 8 1 3 4 3 7 1 2
1017023 4 1 1 3 2 1 3 1 1 2

In [3]:
# Change class labels to 0 and 1 for simplicity

df['class']=df['class'].apply(lambda x: 0 if x == 2 else 1 )

Clean data

  1. missing data
  2. NaNs

In [4]:
# Is there missing data or NaNs?
df_size = len(df)
df_null = df.isnull().values.any()
#print("Data frame size {}, missing data in {}".format(df_size,len(df_null)))
df_null
#No null values


Out[4]:
False

There is no missing data in the dataset.

Now check if the actual data makes sense.


In [5]:
# See if there are strange values in the dataset:

def visualize_unique_values(df):
    # Loop over each column
    print('Column','Unique_values')
    for icol in range(len(cols)):
        # Select a column
        df_col = df.ix[:,icol]
        # Check unique values
        unique_values = df_col.unique()
        print(cols[icol],unique_values)

visualize_unique_values(df)


Column Unique_values
clump thickness [ 5  3  6  4  8  1  2  7 10  9]
uniformity of cell size [ 1  4  8 10  2  3  7  5  6  9]
uniformity of cell shape [ 1  4  8 10  2  3  5  6  7  9]
marginal adhesion [ 1  5  3  8 10  4  6  2  9  7]
single epithelial cell size [ 2  7  3  1  6  4  5  8 10  9]
bare nuclei ['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']
bland chromatin [ 3  9  1  2  4  5  7  8  6 10]
normal nucleoli [ 1  2  7  4  5  3 10  6  9  8]
mitoses [ 1  5  4  2  3  7 10  8  6]
class [0 1]

Warning. The column "Bare Nuclei" contains contains strings and '?' values.

Let's fix this.

Question: replace '?' with mode or median?


In [6]:
bare_nuclei = df['bare nuclei']

# 1 get data frame with all non missing data:
df2 = bare_nuclei.loc[bare_nuclei != '?']
print(len(df2),len(df))

# Get the mode value
col_mode=eval(df2.mode().values[0])
print("Mode :", col_mode)

#Verify:
bare_nuclei.value_counts()


683 699
Mode : 1
Out[6]:
1     402
10    132
5      30
2      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: bare nuclei, dtype: int64

Note that 402 rows have the mode value of '1'. This represets about 60% of the data.

Hence, it makes more sence to replace '?' with the mode: '1'


In [7]:
# Convert data to mode:
df2 = bare_nuclei.apply(lambda x: col_mode if x == '?' else int(x) )

#Check it worked:
print(df2.unique())


[ 1 10  2  4  3  9  7  5  8  6]

In [8]:
# Replace dataset column with clean data
df['bare nuclei'] = df2
# Check this actually worked
visualize_unique_values(df)


Column Unique_values
clump thickness [ 5  3  6  4  8  1  2  7 10  9]
uniformity of cell size [ 1  4  8 10  2  3  7  5  6  9]
uniformity of cell shape [ 1  4  8 10  2  3  5  6  7  9]
marginal adhesion [ 1  5  3  8 10  4  6  2  9  7]
single epithelial cell size [ 2  7  3  1  6  4  5  8 10  9]
bare nuclei [ 1 10  2  4  3  9  7  5  8  6]
bland chromatin [ 3  9  1  2  4  5  7  8  6 10]
normal nucleoli [ 1  2  7  4  5  3 10  6  9  8]
mitoses [ 1  5  4  2  3  7 10  8  6]
class [0 1]

Model

Now that we have cleaned the data, we pass to the model.

  1. Check class balance

In [9]:
y = df['class']
X = df.copy()
del X['class']

class1 = y[y == 0]
class2 = y[y == 1]

print("Class balance\n Class 0: {}\n Class 1: {}\n Ratio: {}".format(len(class1),len(class1),len(class1)/len(class2)))


Class balance
 Class 0: 458
 Class 1: 458
 Ratio: 1.900414937759336

Take home message: The classes are slightly umbalanced by about a factor of 2.

To deal with class imbalance, there are several options: Source https://elitedatascience.com/imbalanced-classes

  1. Up-sample minority class
  2. Down-sample majority class
  3. Penalize algorithms
  4. Use random forests

Based on my own experience with unbalanced data, I go for 4. Use random forests.

Visualize features

Let's visualize the feature distributions to get more insights.


In [10]:
%matplotlib inline 
import seaborn as sns

sns.pairplot(df, hue="class")
#df.head(1)


Out[10]:
<seaborn.axisgrid.PairGrid at 0x10ab21240>

Take home messages:

  1. No clear separation is found in between the classes, so need to go beyond a linear model. Random forests is OK too.

  2. Uniformity of cell size and uniformity of cell shape are clearly correlated, to take into account later.

Partition data:

  1. Partition as 60% and 40%.
  2. Stratify, important for unbalanced classes.

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=2, stratify= y )

As said before, I use a random forest classifier since this works best for umbalanced classes


In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

model = RandomForestClassifier()

model.fit(X_train,y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test,y_pred))


             precision    recall  f1-score   support

          0       0.97      0.98      0.98       183
          1       0.97      0.95      0.96        97

avg / total       0.97      0.97      0.97       280

WOW, the model performed great with default parameters.

Question: can we improve f1-score from 0.96 to 1.0? This is probably important since we are dealing with cancer, a serious disease.


In [13]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform


# Max number of features by default is sqrt(n_features), which is good to keep to prevent from overfitting.
#If “auto”, then max_features=sqrt(n_features).


rfc = RandomForestClassifier()
params = {'criterion': ['gini','entropy'],'n_estimators': range(10, 50, 10)}
searcher = RandomizedSearchCV(rfc, params, n_jobs=-1, n_iter=4,scoring='f1')


searcher.fit(X_train, y_train) #assuming X and y are your data


Out[13]:
RandomizedSearchCV(cv=None, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
          fit_params={}, iid=True, n_iter=4, n_jobs=-1,
          param_distributions={'criterion': ['gini', 'entropy'], 'n_estimators': range(10, 50, 10)},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring='f1', verbose=0)

In [14]:
import numpy as np
# Utility function to report best scores
def report(results, n_top=1):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)

        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
report(searcher.cv_results_)


Model with rank: 1
Mean validation score: 0.938 (std: 0.018)
Parameters: {'n_estimators': 30, 'criterion': 'gini'}


In [15]:
model = RandomForestClassifier(n_estimators=30, max_depth = None)

model.fit(X_train,y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test,y_pred))


             precision    recall  f1-score   support

          0       0.98      0.99      0.98       183
          1       0.98      0.96      0.97        97

avg / total       0.98      0.98      0.98       280

The model barely improves its accuracy.

Take home message:
We now have estimated the optimal parameters for our RF model, and accuracy is ~0.97

Question:
Is the algorithm converged with respect to the size of the training set?


In [16]:
from sklearn.metrics import f1_score


n_steps=10
step = int(len(X_train)/n_steps)

results=[]
for ix in range(n_steps):
    size_train = step + ix * step
    model.fit(X_train[:size_train],y_train[:size_train])
    y_pred = model.predict(X_test)
    score = f1_score(y_test,y_pred)
    results.append([size_train,score])
results = np.array(results)

In [17]:
import matplotlib.pyplot as plt

plt.plot(results[:,0],results[:,1])

plt.xlabel('Training set size')
plt.ylabel('F1 score')
plt.title('Learning curve')
plt.grid(True)



In [24]:
from sklearn.metrics import confusion_matrix
import itertools


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

    
    
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

class_names=[0,1]
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()


Confusion matrix, without normalization
[[180   3]
 [  3  94]]
Normalized confusion matrix
[[ 0.98  0.02]
 [ 0.03  0.97]]

The learning curve is oscillating by 0.02
Hence the model is not yet converged with respect to the size of the training set, for that tight precision.

Take home message: To further improve the performance of the model we need more data!!

Summary

  1. I cleaned and explored the data.
  2. Missing '?' values were replaced with the mode for the corresponding column.
  3. By visualizing scatter plots, I found that some features are correlated.
    To get feature importances, I need other model than my RF since some features are correlated.
  4. I found the classes are umbalanced by a factor of two.
  5. To deal with the umbalanced class problem, I use a RF classifier and adopt the F1 metric.
  6. RF classifier with default parameters performs well, with F1 score of 0.97
  7. Tunning of the RF parameters did not improve the F1 score, and hence default parameters close to the optimal.
  8. By plotting the learning curve, I found that the model could be further improved by increasing the training set size.

ROC curve


In [21]:
from itertools import cycle

from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
from scipy import interp


# #############################################################################
# Data IO and generation

# Import some data to play with
X = X_train
y = y_train
X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape

# Add noisy features
#random_state = np.random.RandomState(0)
#X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# #############################################################################
# Classification and ROC analysis

# Run classifier with cross-validation and plot ROC curves
cv = StratifiedKFold(n_splits=6)
classifier = RandomForestClassifier(n_estimators=30, random_state = 0)


#svm.SVC(kernel='linear', probability=True,
#                     random_state=random_state)

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

i = 0
for train, test in cv.split(X, y):
    probas_ = classifier.fit(X.iloc[train], y.iloc[train]).predict_proba(X.iloc[test])
    # Compute ROC curve and area the curve
    fpr, tpr, thresholds = roc_curve(y.iloc[test], probas_[:, 1])
    tprs.append(interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))

    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
         label='Luck', alpha=.8)

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)

std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()



In [ ]:
# Confusion matrix

In [ ]:
from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)