In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import tensorflow as tf
import os
import skimage.io as imageio
import skimage.color as color
import skimage.transform as trf
import collections
from matplotlib import pyplot as plt
from matplotlib import cm
import sqlalchemy as sql
import sklearn.feature_extraction.image as skfeim
import random
import scipy.io as sio
imageio.use_plugin('matplotlib')
from ipywidgets import FloatProgress
from IPython.display import display
import time
import gc
%load_ext autoreload
%autoreload 2
import acquisition as acq
import cnn
import utils
import models
In [2]:
TYPES_SHORT = ['B','M']
TYPES_LONG = ['Benign','Malignant']
SUBTYPES_SHORT = ['A','F','TA','PT','DC','LC','MC','PC']
SUBTYPES_LONG = ['Adenosis','Fibroadenoma','Tubular Adenoma',
'Phyllodes Tumor', 'Ductal Carcinoma', 'Lobular Carcinoma',
'Mucinous Carcinoma', 'Papillary Carcinoma']
MAGNIFICATIONS = ['40X','100X','200X','400X']
IMAGE_SIZE_1 = (115,175)
PATCH_SIZE_1 = (32,32)
PATCH_NUMBER_1 = 20
IMAGE_SIZE_2 = (230,350)
PATCH_SIZE_2 = (32,32)
PATCH_NUMBER_2 = 100
In [3]:
BUILD_SQL = False #set to True if databases have to be (re)generated
if BUILD_SQL:
    acq.generate_databases(os.path.join('..','Data','mkfold'), IMAGE_SIZE_1, PATCH_SIZE_1, PATCH_NUMBER_1)
    acq.generate_databases(os.path.join('..','Data','mkfold'), IMAGE_SIZE_2, PATCH_SIZE_2, PATCH_NUMBER_2)
REMARK: Although the authors state that the dataset consists only of images of size 460x700 [1], it also contains some images of size 459x699. This is an inconsistency between the dataset description and its actual content. Here, these images were simply rescaled to IMAGE_SIZE_1 and IMAGE_SIZE_2.
Data cleaning is not necessary as such, since the publisher already cleaned the data to make the dataset useful for benchmarking. However, a later step will show that some cleaning is still needed because of an inconsistency within the dataset.
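The rescaling and random patch extraction performed by acq.generate_databases can be illustrated with the scikit-image and scikit-learn helpers imported above. The sketch below is only an illustration of that idea; the actual implementation lives in acquisition.py and may differ in its details.

## Illustrative sketch only, not the actual acq.generate_databases implementation:
## rescale one source image to a target size and sample a fixed number of random patches.
def rescale_and_patch(image, image_size, patch_size, patch_number, seed=0):
    # resize handles both the documented 460x700 images and the 459x699 outliers alike
    resized = trf.resize(image, image_size, mode='reflect')
    # draw patch_number random patches of the requested size from the resized image
    patches = skfeim.extract_patches_2d(resized, patch_size,
                                        max_patches=patch_number, random_state=seed)
    return resized, patches

# example on a dummy image, using the constants defined above
dummy = np.random.rand(460, 700, 3)
resized, patches = rescale_and_patch(dummy, IMAGE_SIZE_1, PATCH_SIZE_1, PATCH_NUMBER_1)
print(resized.shape, patches.shape)  # (115, 175, 3) (20, 32, 32, 3)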
In [4]:
utils.make_statistics_plots(IMAGE_SIZE_1, MAGNIFICATIONS, TYPES_SHORT, TYPES_LONG, SUBTYPES_SHORT, SUBTYPES_LONG)
In [5]:
DO_CLEAN = False #set to True if the databases should be cleaned (again)
if DO_CLEAN:
    for image_size in [IMAGE_SIZE_1, IMAGE_SIZE_2]:
        for m in MAGNIFICATIONS:
            print('Clean '+str(image_size[0])+'_'+str(image_size[1])+'_'+m)
            acq.remove_patient_from_database('13412', image_size, m)
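Under the hood, removing a patient amounts to deleting the corresponding rows from the SQLite database. The following is a hedged sketch of that operation with sqlalchemy; the column name 'patient' is an assumption about the schema, and acq.remove_patient_from_database may be implemented differently.

## Hedged sketch of the kind of deletion acq.remove_patient_from_database performs.
## The 'patient' column name is assumed; the real helper may differ.
def remove_patient_sketch(patient_id, image_size, magnification):
    db = str(image_size[0])+'_'+str(image_size[1])+'_'+magnification+'_clean.sqlite'
    engine = sql.create_engine('sqlite:///'+db, echo=False)
    with engine.begin() as conn:
        # delete all rows belonging to the inconsistent patient
        conn.execute(sql.text('DELETE FROM images WHERE patient = :pid'),
                     {'pid': patient_id})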
For completeness, the corrected distributions are shown below.
In [6]:
utils.make_statistics_plots(IMAGE_SIZE_1, MAGNIFICATIONS, TYPES_SHORT, TYPES_LONG, SUBTYPES_SHORT, SUBTYPES_LONG,clean=1)
Since we are dealing with images, it is also important to look at them directly. The next piece of code shows example images for the different magnifications and tumor subtypes.
In [7]:
NUMBER_PER_CLASS = 5
for m in MAGNIFICATIONS:
    engine = sql.create_engine('sqlite:///'+str(IMAGE_SIZE_1[0])+'_'+str(IMAGE_SIZE_1[1])+'_'+m+'_clean.sqlite', echo=False)
    df = pd.read_sql('images',engine,columns=['subtype','image'])
    utils.make_example_plot(df, IMAGE_SIZE_1, m, NUMBER_PER_CLASS)
    del df
In [8]:
M = '400X'
IMAGE_SIZE = IMAGE_SIZE_2
utils.make_exploration_figures(IMAGE_SIZE, M, TYPES_SHORT, TYPES_LONG, SUBTYPES_SHORT, SUBTYPES_LONG)
In [9]:
M='40X'
engine = sql.create_engine('sqlite:///'+str(IMAGE_SIZE_1[0])+'_'+str(IMAGE_SIZE_1[1])+'_'+M+'_clean.sqlite', echo=False)
df = pd.read_sql('images',engine,columns=['subtype','image','patches'])
utils.make_example_patches_plot(df, IMAGE_SIZE_1, PATCH_SIZE_1, PATCH_NUMBER_1)
del(df)
engine = sql.create_engine('sqlite:///'+str(IMAGE_SIZE_2[0])+'_'+str(IMAGE_SIZE_2[1])+'_'+M+'_clean.sqlite', echo=False)
df = pd.read_sql('images',engine,columns=['subtype','image','patches'])
utils.make_example_patches_plot(df, IMAGE_SIZE_2, PATCH_SIZE_2, PATCH_NUMBER_2)
del(df)
gc.collect()
In [10]:
## define train and test data
MAGNIFICATION = '400X'
IMAGE_SIZE = IMAGE_SIZE_1
PATCH_NUMBER = PATCH_NUMBER_1
PATCH_SIZE = PATCH_SIZE_1
FOLD = 'fold1'
In [11]:
## load data from database
engine = sql.create_engine('sqlite:///'+str(IMAGE_SIZE[0])+'_'+str(IMAGE_SIZE[1])+'_'+MAGNIFICATION+'_clean.sqlite', echo=False)
df = pd.read_sql('images',engine)
df_train = df.loc[df[FOLD]=='train'].copy().reset_index(drop=True)
df_test = df.loc[df[FOLD]=='test'].copy().reset_index(drop=True)
del df
In [12]:
## convert data to useful format
train_data, train_labels_bin, train_labels_all, _ = utils.convertToNDArray(df_train, PATCH_NUMBER, PATCH_SIZE)
test_data, test_labels_bin, test_labels_all, test_patients = utils.convertToNDArray(df_test, PATCH_NUMBER, PATCH_SIZE)
## convert labels to one hot
train_labels_bin = utils.convert_to_one_hot(train_labels_bin, 2)
test_labels_bin = utils.convert_to_one_hot(test_labels_bin, 2)
train_labels_all = utils.convert_to_one_hot(train_labels_all, 8)
test_labels_all = utils.convert_to_one_hot(test_labels_all, 8)
## vectorize input
train_data_vec=train_data.reshape((train_data.shape[0],PATCH_SIZE[0]*PATCH_SIZE[1]*3))
test_data_vec=test_data.reshape((test_data.shape[0],PATCH_SIZE[0]*PATCH_SIZE[1]*3))
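The one-hot conversion simply maps each integer class label to a unit vector of length num_classes, as required by the softmax output of the network. A minimal sketch of such a conversion, which utils.convert_to_one_hot is assumed to resemble, is:

## Minimal sketch of a one-hot conversion, assuming integer labels in [0, num_classes);
## the project's utils.convert_to_one_hot may be implemented differently.
def to_one_hot(labels, num_classes):
    return np.eye(num_classes)[np.asarray(labels, dtype=int)]

print(to_one_hot([0, 1, 1], 2))
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]]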
In [13]:
NUMBER_ITERATIONS = 10001
In [14]:
STRIDES = 'pool'
DROPOUT_KEEP = 1
CORRECTION = 0
(losses, acc_train_patch, sens_train_patch, spec_train_patch, acc_test_patch, sens_test_patch, spec_test_patch, acc_test_image, sens_test_image, spec_test_image, acc_test_patient, sens_test_patient, spec_test_patient, mean_batch_time, mean_test_time, cnf_matrix, W) = models.run_model_1(train_data_vec, train_labels_bin, test_data_vec, test_labels_bin, test_patients,
DROPOUT_KEEP, NUMBER_ITERATIONS, PATCH_NUMBER, PATCH_SIZE,
strides=STRIDES, correction=CORRECTION)
sio.savemat('results_model_1_binary_image_size_'+str(IMAGE_SIZE[0])+'_'
+str(IMAGE_SIZE[1])+'_magnification_'+MAGNIFICATION
+'_fold_'+FOLD+'_strides_'+STRIDES+'_correction_'
+str(CORRECTION)+'_dropoutkeep_'+str(DROPOUT_KEEP)+'.mat',
{'losses':losses, 'acc_train_patch':acc_train_patch, 'sens_train_patch':sens_train_patch,
'spec_train_patch':spec_train_patch, 'acc_test_patch':acc_test_patch, 'sens_test_patch':sens_test_patch,
'spec_test_patch':spec_test_patch, 'acc_test_image':acc_test_image, 'sens_test_image':sens_test_image,
'spec_test_image':spec_test_image, 'acc_test_patient':acc_test_patient, 'sens_test_patient':sens_test_patient,
'spec_test_patient':spec_test_patient, 'mean_batch_time':mean_batch_time, 'mean_test_time':mean_test_time,
'cnf_matrix':cnf_matrix, 'W':W})
Since this first simulation takes a lot of time, more than 36 minutes for training alone (which is small for real applications, but too slow for testing several setups within this project), the subsampling was moved from the pooling layer into the convolution itself, i.e. strided convolutions are used instead of pooling. This drastically lowers the computation time of the convolutions. Of course, it should be checked that the performance of the model is not affected. Therefore, the previous simulation is repeated with the modified network:
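The difference between the two variants can be sketched with plain TensorFlow ops. The filter shape below (5x5, 32 maps) is only an assumed example and not necessarily what models.run_model_1 uses; the point is that both variants halve the spatial resolution, but the strided convolution evaluates the filter at a quarter of the positions.

## Sketch of the two downsampling variants, not the exact layers of models.run_model_1:
x = tf.placeholder(tf.float32, [None, PATCH_SIZE[0], PATCH_SIZE[1], 3])
W1 = tf.Variable(tf.truncated_normal([5, 5, 3, 32], stddev=0.1))

# variant 'pool': convolution with stride 1 followed by 2x2 max pooling
h_pool = tf.nn.conv2d(x, W1, strides=[1, 1, 1, 1], padding='SAME')
h_pool = tf.nn.max_pool(h_pool, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# variant 'conv': the convolution itself uses stride 2, so no pooling layer is needed
h_conv = tf.nn.conv2d(x, W1, strides=[1, 2, 2, 1], padding='SAME')

print(h_pool.get_shape(), h_conv.get_shape())  # both (?, 16, 16, 32)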
In [15]:
STRIDES = 'conv'
DROPOUT_KEEP = 1
CORRECTION = 0
(losses, acc_train_patch, sens_train_patch, spec_train_patch, acc_test_patch, sens_test_patch, spec_test_patch, acc_test_image, sens_test_image, spec_test_image, acc_test_patient, sens_test_patient, spec_test_patient, mean_batch_time, mean_test_time, cnf_matrix, W) = models.run_model_1(train_data_vec, train_labels_bin, test_data_vec, test_labels_bin, test_patients,
DROPOUT_KEEP, NUMBER_ITERATIONS, PATCH_NUMBER, PATCH_SIZE,
strides=STRIDES, correction=CORRECTION)
sio.savemat('results_model_1_binary_image_size_'+str(IMAGE_SIZE[0])+'_'
+str(IMAGE_SIZE[1])+'_magnification_'+MAGNIFICATION
+'_fold_'+FOLD+'_strides_'+STRIDES+'_correction_'
+str(CORRECTION)+'_dropoutkeep_'+str(DROPOUT_KEEP)+'.mat',
{'losses':losses, 'acc_train_patch':acc_train_patch, 'sens_train_patch':sens_train_patch,
'spec_train_patch':spec_train_patch, 'acc_test_patch':acc_test_patch, 'sens_test_patch':sens_test_patch,
'spec_test_patch':spec_test_patch, 'acc_test_image':acc_test_image, 'sens_test_image':sens_test_image,
'spec_test_image':spec_test_image, 'acc_test_patient':acc_test_patient, 'sens_test_patient':sens_test_patient,
'spec_test_patient':spec_test_patient, 'mean_batch_time':mean_batch_time, 'mean_test_time':mean_test_time,
'cnf_matrix':cnf_matrix, 'W':W})
The correction option is enabled in the next simulation:
In [16]:
STRIDES = 'conv'
DROPOUT_KEEP = 1
CORRECTION = 1
(losses, acc_train_patch, sens_train_patch, spec_train_patch, acc_test_patch, sens_test_patch, spec_test_patch, acc_test_image, sens_test_image, spec_test_image, acc_test_patient, sens_test_patient, spec_test_patient, mean_batch_time, mean_test_time, cnf_matrix, W) = models.run_model_1(train_data_vec, train_labels_bin, test_data_vec, test_labels_bin, test_patients,
DROPOUT_KEEP, NUMBER_ITERATIONS, PATCH_NUMBER, PATCH_SIZE,
strides=STRIDES, correction=CORRECTION)
sio.savemat('results_model_1_binary_image_size_'+str(IMAGE_SIZE[0])+'_'
+str(IMAGE_SIZE[1])+'_magnification_'+MAGNIFICATION
+'_fold_'+FOLD+'_strides_'+STRIDES+'_correction_'
+str(CORRECTION)+'_dropoutkeep_'+str(DROPOUT_KEEP)+'.mat',
{'losses':losses, 'acc_train_patch':acc_train_patch, 'sens_train_patch':sens_train_patch,
'spec_train_patch':spec_train_patch, 'acc_test_patch':acc_test_patch, 'sens_test_patch':sens_test_patch,
'spec_test_patch':spec_test_patch, 'acc_test_image':acc_test_image, 'sens_test_image':sens_test_image,
'spec_test_image':spec_test_image, 'acc_test_patient':acc_test_patient, 'sens_test_patient':sens_test_patient,
'spec_test_patient':spec_test_patient, 'mean_batch_time':mean_batch_time, 'mean_test_time':mean_test_time,
'cnf_matrix':cnf_matrix, 'W':W})
In [17]:
STRIDES = 'conv'
DROPOUT_KEEP = .7
CORRECTION = 0
(losses, acc_train_patch, sens_train_patch, spec_train_patch, acc_test_patch, sens_test_patch, spec_test_patch, acc_test_image, sens_test_image, spec_test_image, acc_test_patient, sens_test_patient, spec_test_patient, mean_batch_time, mean_test_time, cnf_matrix, W) = models.run_model_1(train_data_vec, train_labels_bin, test_data_vec, test_labels_bin, test_patients,
DROPOUT_KEEP, NUMBER_ITERATIONS, PATCH_NUMBER, PATCH_SIZE,
strides=STRIDES, correction=CORRECTION)
sio.savemat('results_model_1_binary_image_size_'+str(IMAGE_SIZE[0])+'_'
+str(IMAGE_SIZE[1])+'_magnification_'+MAGNIFICATION
+'_fold_'+FOLD+'_strides_'+STRIDES+'_correction_'
+str(CORRECTION)+'_dropoutkeep_'+str(DROPOUT_KEEP)+'.mat',
{'losses':losses, 'acc_train_patch':acc_train_patch, 'sens_train_patch':sens_train_patch,
'spec_train_patch':spec_train_patch, 'acc_test_patch':acc_test_patch, 'sens_test_patch':sens_test_patch,
'spec_test_patch':spec_test_patch, 'acc_test_image':acc_test_image, 'sens_test_image':sens_test_image,
'spec_test_image':spec_test_image, 'acc_test_patient':acc_test_patient, 'sens_test_patient':sens_test_patient,
'spec_test_patient':spec_test_patient, 'mean_batch_time':mean_batch_time, 'mean_test_time':mean_test_time,
'cnf_matrix':cnf_matrix, 'W':W})
In [18]:
NUMBER_ITERATIONS = 15001
STRIDES = 'conv'
DROPOUT_KEEP = 1
CORRECTION = 0
(losses, acc_train_patch, sens_train_patch, spec_train_patch, acc_test_patch, sens_test_patch, spec_test_patch, acc_test_image, sens_test_image, spec_test_image, acc_test_patient, sens_test_patient, spec_test_patient, mean_batch_time, mean_test_time, cnf_matrix, W) = models.run_model_1(train_data_vec, train_labels_all, test_data_vec, test_labels_all, test_patients,
DROPOUT_KEEP, NUMBER_ITERATIONS, PATCH_NUMBER, PATCH_SIZE,
strides=STRIDES, correction=CORRECTION)
sio.savemat('results_model_1_multi_image_size_'+str(IMAGE_SIZE[0])+'_'
+str(IMAGE_SIZE[1])+'_magnification_'+MAGNIFICATION
+'_fold_'+FOLD+'_strides_'+STRIDES+'_correction_'
+str(CORRECTION)+'_dropoutkeep_'+str(DROPOUT_KEEP)+'.mat',
{'losses':losses, 'acc_train_patch':acc_train_patch, 'sens_train_patch':sens_train_patch,
'spec_train_patch':spec_train_patch, 'acc_test_patch':acc_test_patch, 'sens_test_patch':sens_test_patch,
'spec_test_patch':spec_test_patch, 'acc_test_image':acc_test_image, 'sens_test_image':sens_test_image,
'spec_test_image':spec_test_image, 'acc_test_patient':acc_test_patient, 'sens_test_patient':sens_test_patient,
'spec_test_patient':spec_test_patient, 'mean_batch_time':mean_batch_time, 'mean_test_time':mean_test_time,
'cnf_matrix':cnf_matrix, 'W':W})
In [19]:
STRIDES = 'conv'
DROPOUT_KEEP = .7
CORRECTION = 0
(losses, acc_train_patch, sens_train_patch, spec_train_patch, acc_test_patch, sens_test_patch, spec_test_patch, acc_test_image, sens_test_image, spec_test_image, acc_test_patient, sens_test_patient, spec_test_patient, mean_batch_time, mean_test_time, cnf_matrix, W) = models.run_model_1(train_data_vec, train_labels_all, test_data_vec, test_labels_all, test_patients,
DROPOUT_KEEP, NUMBER_ITERATIONS, PATCH_NUMBER, PATCH_SIZE,
strides=STRIDES, correction=CORRECTION)
sio.savemat('results_model_1_multi_image_size_'+str(IMAGE_SIZE[0])+'_'
+str(IMAGE_SIZE[1])+'_magnification_'+MAGNIFICATION
+'_fold_'+FOLD+'_strides_'+STRIDES+'_correction_'
+str(CORRECTION)+'_dropoutkeep_'+str(DROPOUT_KEEP)+'.mat',
{'losses':losses, 'acc_train_patch':acc_train_patch, 'sens_train_patch':sens_train_patch,
'spec_train_patch':spec_train_patch, 'acc_test_patch':acc_test_patch, 'sens_test_patch':sens_test_patch,
'spec_test_patch':spec_test_patch, 'acc_test_image':acc_test_image, 'sens_test_image':sens_test_image,
'spec_test_image':spec_test_image, 'acc_test_patient':acc_test_patient, 'sens_test_patient':sens_test_patient,
'spec_test_patient':spec_test_patient, 'mean_batch_time':mean_batch_time, 'mean_test_time':mean_test_time,
'cnf_matrix':cnf_matrix, 'W':W})
In [20]:
## define train and test data
MAGNIFICATION = '400X'
IMAGE_SIZE = IMAGE_SIZE_2
PATCH_NUMBER = PATCH_NUMBER_2
PATCH_SIZE = PATCH_SIZE_2
FOLD = 'fold1'
In [21]:
## load data from database
engine = sql.create_engine('sqlite:///'+str(IMAGE_SIZE[0])+'_'+str(IMAGE_SIZE[1])+'_'+MAGNIFICATION+'_clean.sqlite', echo=False)
df = pd.read_sql('images',engine)
df_train = df.loc[df[FOLD]=='train'].copy().reset_index(drop=True)
df_test = df.loc[df[FOLD]=='test'].copy().reset_index(drop=True)
del df
In [22]:
## convert data to useful format
train_data, train_labels_bin, train_labels_all, _ = utils.convertToNDArray(df_train, PATCH_NUMBER, PATCH_SIZE)
test_data, test_labels_bin, test_labels_all, test_patients = utils.convertToNDArray(df_test, PATCH_NUMBER, PATCH_SIZE)
## convert labels to one hot
train_labels_bin = utils.convert_to_one_hot(train_labels_bin, 2)
test_labels_bin = utils.convert_to_one_hot(test_labels_bin, 2)
train_labels_all = utils.convert_to_one_hot(train_labels_all, 8)
test_labels_all = utils.convert_to_one_hot(test_labels_all, 8)
## vectorize input
train_data_vec=train_data.reshape((train_data.shape[0],PATCH_SIZE[0]*PATCH_SIZE[1]*3))
test_data_vec=test_data.reshape((test_data.shape[0],PATCH_SIZE[0]*PATCH_SIZE[1]*3))
In [23]:
NUMBER_ITERATIONS = 15001
STRIDES = 'conv'
DROPOUT_KEEP = 1
CORRECTION = 0
(losses, acc_train_patch, sens_train_patch, spec_train_patch, acc_test_patch, sens_test_patch, spec_test_patch, acc_test_image, sens_test_image, spec_test_image, acc_test_patient, sens_test_patient, spec_test_patient, mean_batch_time, mean_test_time, cnf_matrix, W) = models.run_model_1(train_data_vec, train_labels_all, test_data_vec, test_labels_all, test_patients,
DROPOUT_KEEP, NUMBER_ITERATIONS, PATCH_NUMBER, PATCH_SIZE,
strides=STRIDES, correction=CORRECTION)
sio.savemat('results_model_1_multi_image_size_'+str(IMAGE_SIZE[0])+'_'
+str(IMAGE_SIZE[1])+'_magnification_'+MAGNIFICATION
+'_fold_'+FOLD+'_strides_'+STRIDES+'_correction_'
+str(CORRECTION)+'_dropoutkeep_'+str(DROPOUT_KEEP)+'.mat',
{'losses':losses, 'acc_train_patch':acc_train_patch, 'sens_train_patch':sens_train_patch,
'spec_train_patch':spec_train_patch, 'acc_test_patch':acc_test_patch, 'sens_test_patch':sens_test_patch,
'spec_test_patch':spec_test_patch, 'acc_test_image':acc_test_image, 'sens_test_image':sens_test_image,
'spec_test_image':spec_test_image, 'acc_test_patient':acc_test_patient, 'sens_test_patient':sens_test_patient,
'spec_test_patient':spec_test_patient, 'mean_batch_time':mean_batch_time, 'mean_test_time':mean_test_time,
'cnf_matrix':cnf_matrix, 'W':W})
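The summary table below was compiled from the .mat files saved by the cells above. As a sketch, a single entry can be recovered like this (the arrays are assumed to be stored as 1 x n row vectors, which is how sio.savemat writes Python lists):

## Sketch: reload one result file saved above and print the final image-level test accuracy.
res = sio.loadmat('results_model_1_binary_image_size_115_175_magnification_400X'
                  '_fold_fold1_strides_conv_correction_0_dropoutkeep_1.mat')
print('final test accuracy (image level):', res['acc_test_image'].ravel()[-1])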
case | binary, pool stride | binary, conv stride | binary, conv stride + correction | binary, conv stride + dropout | multi, conv stride | multi, conv stride, IMAGE_SIZE_2
---|---|---|---|---|---|---
Test accuracy patches | 0.80 | 0.80 | 0.80 | 0.79 | 0.47 | 0.45
Test accuracy images | 0.84 | 0.84 | 0.83 | 0.83 | 0.51 | 0.51
Test accuracy patients | 0.85 | 0.85 | 0.89 | 0.85 | 0.59 | 0.55
Test sensitivity patches | 0.83 | 0.83 | 0.83 | 0.87 | 0.80 | 0.82
Test sensitivity images | 0.85 | 0.85 | 0.85 | 0.90 | 0.83 | 0.89
Test sensitivity patients | 0.83 | 0.83 | 0.89 | 0.89 | 0.83 | 0.90
case | pool stride | conv stride |
---|---|---
Training [s/100 iterations] | 22.2 | 6.5 |
Test [s/test set] | 22.2 | 2.5 |
For a more thorough evaluation of the proposed methods, all magnification factors and train/test folds should be considered. Unfortunately, time did not allow this within the scope of this project.
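As an outline, such a full evaluation could reuse the helpers from this notebook roughly as follows; the fold column names 'fold1' to 'fold5' are an assumption based on the five train/test splits shipped with the dataset.

## Hedged outline of an evaluation over all magnifications and folds (binary task shown);
## saving the results to .mat files is omitted and would follow the cells above.
def run_all(image_size, patch_size, patch_number,
            folds=('fold1', 'fold2', 'fold3', 'fold4', 'fold5')):
    for magnification in MAGNIFICATIONS:
        engine = sql.create_engine('sqlite:///'+str(image_size[0])+'_'+str(image_size[1])
                                   +'_'+magnification+'_clean.sqlite', echo=False)
        df = pd.read_sql('images', engine)
        for fold in folds:
            df_train = df.loc[df[fold]=='train'].copy().reset_index(drop=True)
            df_test = df.loc[df[fold]=='test'].copy().reset_index(drop=True)
            train_data, train_bin, _, _ = utils.convertToNDArray(df_train, patch_number, patch_size)
            test_data, test_bin, _, test_patients = utils.convertToNDArray(df_test, patch_number, patch_size)
            train_bin = utils.convert_to_one_hot(train_bin, 2)
            test_bin = utils.convert_to_one_hot(test_bin, 2)
            train_vec = train_data.reshape((train_data.shape[0], patch_size[0]*patch_size[1]*3))
            test_vec = test_data.reshape((test_data.shape[0], patch_size[0]*patch_size[1]*3))
            results = models.run_model_1(train_vec, train_bin, test_vec, test_bin, test_patients,
                                         1, NUMBER_ITERATIONS, patch_number, patch_size,
                                         strides='conv', correction=0)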
[1] Fabio A. Spanhol, Luiz S. Oliveira, Caroline Petitjean, and Laurent Heutte. A dataset for breast cancer histopathological image classification. IEEE Transactions on Biomedical Engineering, 63(7):1455–1462, jul 2016.
[2] Fabio A. Spanhol, Luiz S. Oliveira, Caroline Petitjean, and Laurent Heutte. Breast cancer histopathological image classification using convolutional neural networks. 2016.
[3] Neslihan Bayramoglu, Juho Kannala, and Janne Heikkilä. Deep learning for magnification independent breast cancer histopathology image classification. 2016.