In [62]:
import numpy as np
import seaborn as sns
import pandas as pd
sns.set()
from matplotlib import pyplot as plt
from sklearn import preprocessing
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = 34
In [63]:
cancer_db = pd.read_csv("risk_factors_cervical_cancer.csv")
cancer_db.head(20)
Out[63]:
In [64]:
cancer_db = cancer_db.replace('?', float('nan'))
cancer_db.info()
The last four columns are indicators (or screening methods) used to examine whether the patient has cancer or not [1]. As a first step, let's store all four indicators into y dataframe.
In [65]:
from sklearn.model_selection import train_test_split
x = cancer_db.drop(labels=['Hinselmann', 'Schiller', 'Citology', 'Biopsy'], axis=1)
y = cancer_db[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, stratify = None, random_state = 42)
In [66]:
X_train = X_train.apply(pd.to_numeric)
X_train.info()
Initially, the X_train dataset needs to be analyzed to reduce the number of input features and handle missing values. In the given dataset it is more resonable to impute missing values as the size of dataframe is quite small. In the best scenario the 12% of the initial size might be lost. But as missing values usually staying not in line (otherwise it would question the idea to put this patient info in the table), this number might be bigger. Imputed values are considered to be the most frequent value in a column in order not to disturb the general picture.
In [67]:
plt.figure(figsize = (10,10))
sns.heatmap(X_train.apply(lambda x: x.fillna(x.mode()[0], axis=0)).corr())
Out[67]:
As it is shown the percentage of NaN values varies from 12% to 91%. Values with 91% NaN values are definately out of consideration, so need to be removed from dataset ('STDs: Time since first diagnosis' and 'STDs: Time since last diagnosis'). Part with STDs diseases need to be considered separately because generally consists of two main subsets, the first one is 'STDs: Number of diagnosis' and the second is set of all features starting with STDs. In the heatmap above it is clear that visually all STDs features highly correlates with the 'STDs: Number of diagnosis'. One may notice that the feature 'STDs' has better correlation, but avtually the feature 'STDs: Number of diagnosis' represents better the history of having STDs as was presented for every patient compared to others. Therefore, all features with STDs need to removed except for 'STDs: Number of diagnosis'.
In [68]:
X_train = X_train.drop(['STDs: Time since first diagnosis', 'STDs: Time since last diagnosis',\
'STDs', 'STDs (number)', 'STDs:condylomatosis', 'STDs:cervical condylomatosis',\
'STDs:vaginal condylomatosis', 'STDs:vulvo-perineal condylomatosis', 'STDs:syphilis',\
'STDs:pelvic inflammatory disease', 'STDs:genital herpes',\
'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV', 'STDs:Hepatitis B',\
'STDs:HPV'], axis=1)
X_train.describe()
Out[68]:
In [69]:
X_train = X_train.apply(lambda x: x.fillna(x.mode()[0], axis=0))
X_train.describe()
Out[69]:
In [70]:
plt.figure(figsize = (10,10))
sns.heatmap(X_train.corr())
Out[70]:
'Dx:Cancer', 'Dx:CIN', 'Dx:HPV' and 'Dx' are also highly correlated features, so let's keep only 'Dx' as it correlates the best with three others.
In [71]:
X_train_m = X_train.drop(['Dx:Cancer', 'Dx:CIN', 'Dx:HPV'], axis=1)
As the final task is to predict cervical cancer we need to consider carefully the indicators given. Due to lack of information about indicators in data and related paper [2], we may assume that having '1' in one of four indicator results in having a cancer. Also, having '0' in other indicators may mean either negative result or not passing this examination. So, let's create new variable that will be calculated in following way: can_res = maxincolumns('Hinselmann', 'Schiller', 'Citology', 'Biopsy')
In [72]:
y_train['can_res'] = y_train.loc[:, ('Biopsy', 'Hinselmann', 'Schiller', 'Citology')].max(axis=1).copy()
sns.heatmap(y_train.corr())
Out[72]:
New target value highly correlates with all indicators so preserve only this value.
In [73]:
y_train_new = y_train.drop(['Biopsy', 'Hinselmann', 'Schiller', 'Citology'], axis=1)
db_train = pd.concat([X_train_m, y_train_new], axis=1, sort=False).reset_index(drop=True)
sns.heatmap(db_train.corr())
Out[73]:
In [74]:
db_train_cancer = db_train.loc[db_train['can_res'] == 1]
db_train_no_cancer = db_train.loc[db_train['can_res'] == 0]
db_train_cancer.describe()
Out[74]:
In [75]:
db_train_no_cancer.describe()
Out[75]:
From the two table right above there should be deducted several noticeable points:
In [76]:
X_train_m = X_train_m.drop(['Smokes', 'Smokes (years)', 'Smokes (packs/year)', 'Age',
'Number of sexual partners', 'First sexual intercourse', 'Num of pregnancies'], axis=1,
errors='ignore')
db_train = pd.concat([X_train_m, y_train_new], axis=1, sort=False).reset_index(drop=True)
db_train.sort_values(by='can_res', ascending=False).head(40)
Out[76]:
Besides the way for feature selection proposed we may try to use RandomForestClassifier with initial set of features (except for STDs) to compare two methods.
In [77]:
from sklearn.ensemble import RandomForestClassifier
# Build a forest and compute the feature importances
forest = RandomForestClassifier(n_estimators=250, random_state=0)
forest.fit(X_train, y_train_new)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
axis=0)
indices = np.argsort(importances)[::-1]
# Print the feature ranking
print("Feature ranking:")
for f in range(X_train.shape[1]):
print("%d. %s (%f)" % (f + 1, (X_train.columns[indices[f]]), importances[indices[f]]))
# Plot the feature importances of the forest
plt.figure(figsize=(15,10))
plt.title("Feature importances", size=16)
plt.bar(range(X_train.shape[1]), importances[indices],
color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), [X_train.columns[indices[f]] for f in range(X_train.shape[1])],
rotation='vertical', size=16)
plt.yticks(size=16)
plt.xlim([-1, X_train.shape[1]])
plt.show()
In [78]:
X_train = X_train.loc[:, ['Age', 'First sexual intercourse', 'Hormonal Contraceptives (years)',
'Number of sexual partners', 'Num of pregnancies']]
db_train = pd.concat([X_train, y_train_new], axis=1, sort=False).reset_index(drop=True)
db_train.sort_values(by=['can_res'], ascending=False).head()
Out[78]:
In [79]:
plt.figure(figsize=(15,10))
sns.kdeplot(db_train['Age'].loc[db_train['can_res'] == 1], label="Cancer", shade=True)
sns.kdeplot(db_train['Age'].loc[db_train['can_res'] == 0], label="No cancer", shade=True)
plt.legend()
plt.title("Age distribution for women with and without cancer", size=16)
Out[79]:
In [80]:
plt.figure(figsize=(15,10))
sns.kdeplot(db_train['First sexual intercourse'].loc[db_train['can_res'] == 1], label="Cancer", shade=True)
sns.kdeplot(db_train['First sexual intercourse'].loc[db_train['can_res'] == 0], label="No cancer", shade=True)
plt.legend()
plt.title("First sexual intercourse distribution for women with and without cancer", size=16)
Out[80]:
In [81]:
plt.figure(figsize=(15,10))
sns.kdeplot(db_train['Hormonal Contraceptives (years)'].loc[db_train['can_res'] == 1], label="Cancer", shade=True)
sns.kdeplot(db_train['Hormonal Contraceptives (years)'].loc[db_train['can_res'] == 0], label="No cancer", shade=True)
plt.legend()
plt.title("Hormonal contraceptives distribution for women with and without cancer", size=16)
Out[81]:
In [82]:
plt.figure(figsize=(15,10))
sns.distplot(db_train['Number of sexual partners'].loc[db_train['can_res'] == 1], label="Cancer", kde=False)
sns.distplot(db_train['Number of sexual partners'].loc[db_train['can_res'] == 0], label="No cancer", kde=False)
plt.legend()
plt.title("Number of sexual partners distribution for women with and without cancer", size=16)
Out[82]:
In general, there is no visual difference between two groups therefore no simple rule to distinguish two classes.
In [83]:
X_test = X_test.apply(pd.to_numeric)
X_test = X_test.apply(lambda x: x.fillna(x.mode()[0], axis=0))
X_test = X_test.loc[:, ('Age', 'First sexual intercourse',
'Hormonal Contraceptives (years)', 'Number of sexual partners',
'Num of pregnancies')]
y_test['can_res'] = y_test.loc[:, ('Biopsy', 'Hinselmann', 'Schiller', 'Citology')].max(axis=1).copy()
y_test_new = y_test.drop(['Biopsy', 'Hinselmann', 'Schiller', 'Citology'], axis=1)
db_test = pd.concat([X_test, y_test_new], axis=1, sort=False).reset_index(drop=True)
In [84]:
X_train = X_train.reset_index(drop=True)
y_train_new = y_train_new.reset_index(drop=True)
Prepare parameter grid for GridSearchCV with RandomForestClassifier and KNeighborsClassifier.
In [85]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 200, num = 5)]
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
n_neighbors = [int(x) for x in np.linspace(start = 3, stop = 30)]
clf_rf = {
'clf': [RandomForestClassifier(random_state=7, n_jobs=-1)],
'clf__n_estimators': n_estimators,
'clf__max_depth': max_depth
}
clf_knc = {
'clf' : [KNeighborsClassifier()],
'clf__n_neighbors' : n_neighbors
}
grid_param = [clf_rf, clf_knc]
As we deal with imbalanced dataset in the given case we will SMOTE use over-sampling technique to improve precision and recall scores [4] on test data and to correct for a bias in the original dataset.
In [86]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, make_scorer, classification_report
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from imblearn.pipeline import Pipeline as imbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
kfold = StratifiedKFold(n_splits=5, random_state=7, shuffle=True) # 5-fold CV
pipe = imbPipeline([('sec', StandardScaler()),
('smote', SMOTE(kind="regular", random_state=7)),
('clf', RandomForestClassifier(random_state=7, n_jobs=-1))])
recall_scorer = make_scorer(recall_score, average='macro')
clf_tuned = GridSearchCV(pipe, param_grid = grid_param,
cv = kfold, verbose=2, n_jobs = -1, scoring=recall_scorer)
clf_tuned.fit(X_train, y_train_new)
print(clf_tuned.best_params_)
After cross-validation, KNeighborsClassifier scored the highest result with the following parameters:
Let's look at the results of tuned estimator
In [87]:
print(classification_report(y_train_new, clf_tuned.best_estimator_.predict(X_train)))
In [88]:
print(classification_report(y_test_new, clf_tuned.best_estimator_.predict(X_test)))
In the given dataset estimating of accuracy is considered to be wrong as classes are quite imbalanced. Precision and recall scores are much better as the first represents the ratio of TP/(TP + FP) and the latter TP/(TP + FN). Having TP in numerator is important as it takes into account hits to the smaller class in the dataset.
As a result of work done, we ended up with the numbers above. First of all it is worth mentioning that dataset was highly imbalanced, there were a lot of features and less then a thousand samples. Using Logistic Regression in this case was devoid of sense, as visually points in different classes didn't seem to be separable. In addition, imbalance of two classes made meaningless using any of standard approaches with RF or KNeighorClassifier even with good cross-validation technique. In such cases the simplest and quite effective option is a using of oversampling method to get rid of bias therefore make classes balanced. Specifically in the work SMOTE technique was used to balance data.
The applicability of this model in real life scenario is doubtfull as it is not representative. First of all, one may notice that there were up 13% samples with missing values. It is not on itself, it is problem in combination with a fact that the size of dataset is relatively small.
From biological standpoint the contribution of most-important features are more-less make sense, as it is widely known that the time of first sexual intercourse, the period of using hormonal Contraceptives, the number of sexual partners are in strong dependancy with the risk of cancer developing. Moreover, there are other more important factors, which weren't detected and weren't ranked high such as: smoking, having STDs etc, so we may notice not a strong compliance with a biological facts.
Anyway, keeping all there technicalities and discrepancies in mind, we got relatively satisfactory results even for test samples: precision value is 0.12 and recall value is 0.4.