------------- User's settings -------------


In [ ]:
# Define labels of the classes and location of raw data :
data = {
    'Class_1': '/raw/Class_1/',
    'Class_2': '/raw/Class_2/',
    'Class_3': '/raw/Class_3/',
}

# Define the file type of the raw data at this location:
filetype = 'cif'

# Select which channels to include in the digested data:
channels = [3,6]

image_size = 48

# Desired location to save digested data :
directory = '/digested/'

split = {
    'Training' : 0.8,
    'Validation' : 0.1,
    'Testing' : 0.1
}

------------- (semi)-Automatic -------------


In [ ]:
import digest

In [ ]:
digest.parse(filetype, directory, data, channels, image_size)

In [ ]:
digest.split(directory, split)

In [ ]:
digest.class_weights(directory, data)

Note 1:

Split ratio for different validation methods:

  1. Split the whole collection of data into Training / Validation / Testing:

    For example:

         split = {
             "Training" : 0.8,
             "Validation" : 0.1,
             "Testing" : 0.1
         }
  2. Split the collection of data into Training / Validation, select another dataset for Testing:

    For example:

     - First, set raw data location, output directory and split ratio for Training / Validation:
    
         data = {
             "Class_1": "/raw/Class_1/",
             "Class_2": "/raw/Class_2/",
             "Class_3": "/raw/Class_3/",
         }
    
         directory = '/digested_TRAIN/'            
    
         split = {
             "Training" : 0.8,
             "Validation" : 0.2,
             "Testing" : 0
         }
    
     - Perform data digestion with this split:
    
         digest.parse(filetype, directory, data, channels, image_size)
         digest.class_weights(directory, data)
         digest.split(directory, split)
    
     - Then, set NEW raw data location, NEW output directory and NEW split ratio for Testing:
    
         data = {
             "Class_1": "/raw/Class_1/",
             "Class_2": "/raw/Class_2/",
             "Class_3": "/raw/Class_3/",
         }
    
         directory = '/digested_TEST/'            
    
         split = {
             "Training" : 0,
             "Validation" : 0,
             "Testing" : 1
         }
    
     - Repeat data digestion with NEW inputs:
    
         digest.parse(filetype, directory, data, channels, image_size)
         digest.class_weights(directory, data)
         digest.split(directory, split)
  3. k-fold cross validation:

    For example, for 5-fold cross validation:

         split = {
             "Training" : 0.8,
             "Validation" : 0.2,
             "Testing" : 0
         }
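
In every scenario above the three ratios add up to 1, and for k-fold cross validation the Validation share is 1/k. A minimal sketch of a helper that builds such a dictionary for any k is given below; `kfold_split` is a hypothetical name used for illustration and is not part of `digest`.

```python
import math


def kfold_split(k):
    """Return a split dictionary for k-fold cross validation.

    Hypothetical helper, not part of the digest module: 1/k of the data
    is held out for Validation and the rest is used for Training.
    """
    if not isinstance(k, int) or k < 2:
        raise ValueError("k must be an integer >= 2")
    split = {
        "Training": (k - 1) / k,
        "Validation": 1 / k,
        "Testing": 0,
    }
    # Guard against typos: the three ratios should add up to 1.
    if not math.isclose(sum(split.values()), 1.0):
        raise ValueError("split ratios must sum to 1")
    return split


# Reproduces the 5-fold example: 0.8 / 0.2 / 0
split = kfold_split(5)
```

The resulting dictionary has the same keys as the examples above, so it could be passed to `digest.split(directory, split)` in the same way.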

Note 2:

If the user intends to use our built-in CNN, any number of channels is supported.

If the user intends to use pre-trained networks from keras.applications (VGG, ResNet50, Inception), be warned that these networks are built for 3-channel RGB images. Therefore, select at most 3 channels that provide sufficient information for making predictions, and omit channels that may introduce noise.
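
The 3-channel limit can be checked before digestion starts. The sketch below assumes a hypothetical helper, `check_channels`, which is not part of `digest`; it only counts the selected channels and does not judge which of them are informative.

```python
def check_channels(channels, max_channels=3):
    """Raise if more channels are selected than a pre-trained network accepts.

    Hypothetical helper, not part of the digest module. The default of 3
    reflects the RGB input expected by the pre-trained networks.
    """
    if len(set(channels)) != len(channels):
        raise ValueError("channels contains duplicates: %r" % (channels,))
    if len(channels) > max_channels:
        raise ValueError(
            "%d channels selected, but at most %d are supported"
            % (len(channels), max_channels)
        )
    return channels


# The selection from the settings cell passes the check unchanged.
channels = check_channels([3, 6])
```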

Caution:

In scenario 2 of Note 1, if the new data has fewer classes (and folders) than the previously digested (training) dataset, its digested labels will not be correct. It is therefore recommended to include all classes in the training, validation and testing data.
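
A small check on the two `data` dictionaries can catch this before digestion. The sketch below is illustrative only: `missing_classes` is a hypothetical helper, and the numbering of labels by enumerating sorted class names is an assumption about how labels might be assigned, not a description of what `digest` does internally.

```python
def missing_classes(train_data, test_data):
    """Return class names present in train_data but absent from test_data."""
    return sorted(set(train_data) - set(test_data))


train_data = {
    "Class_1": "/raw/Class_1/",
    "Class_2": "/raw/Class_2/",
    "Class_3": "/raw/Class_3/",
}

# Testing data that lacks Class_2
test_data = {
    "Class_1": "/raw/Class_1/",
    "Class_3": "/raw/Class_3/",
}

missing = missing_classes(train_data, test_data)

# Assumption for illustration: labels are integers assigned by position.
train_labels = {name: i for i, name in enumerate(sorted(train_data))}
test_labels = {name: i for i, name in enumerate(sorted(test_data))}

# Under this assumption Class_3 is labelled 2 in training but 1 in testing,
# so predictions and ground truth would no longer line up.
mismatch = train_labels["Class_3"] != test_labels["Class_3"]
```

If `missing` is not empty, adding the absent classes to the testing `data` dictionary keeps both sets aligned.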