Sea Floor Lithology Feature Evaluator

This feature evaluator uses probabilistic Gaussian process classification to estimate the significance of features in predicting the sea floor sediment.

Five real features and one random feature are used in the evaluation.

Feature ID      Feature Name   
    1           bathymetry
    2           silicate
    3           productivity
    4           salinity
    5           temperature
    6           random

The types of sea floor sediment have been grouped into five categories.

Sediment                                             Label
gravel, sand, silt                                     1
clay                                                   2
calcareous ooze, fine grained calcareous sediment      3 
radiolarian ooze                                       4
diatom ooze, sponge spicules                           5

Because this is a probabilistic process, the number of iterations affects the accuracy of the evaluation. At least 30 iterations are required for a reasonably reliable evaluation, but running 30 iterations takes a long time. If you only want to get a sense of how the evaluator works, try 5 iterations first and add more later if you are interested.

  • Create the evaluator

    The constructor of the evaluator takes the sediment label as its input parameter. The labels are listed above. Each evaluator can handle only one label.

    For example:

     import sea_floor_feature as sff
     e = sff.FeatureEvaluator(5)
  • Start the evaluation process (WARNING: THE FUNCTION MAY TAKE A LONG TIME TO RETURN.)

    The doFeatureEvaluation() function takes the number of iterations as its input parameter. Each time the function is called, the evaluation results are appended to the previous results.

    For example:

     e.doFeatureEvaluation(5)
     e.doFeatureEvaluation(5)
    

    The code above is equivalent to e.doFeatureEvaluation(10).

  • Get the evaluation result

    The result is a three-dimensional array. The first dimension is the iteration. The second dimension is the feature combination. The third dimension is a tuple containing two items: the feature combination (itself a tuple) and the score.

    For example:

      [
          [
              ((1,2), 0.80497592295345111),
              ((1,2,3), 0.80497592295345111)
          ],
          [
              ((3,), 0.80497592295345111),
              ((2,3), 0.80497592295345111)
          ]
      ]
    
    

    The example above has two iterations, each with two feature combinations. In practice, each iteration has 62 feature combinations, which is the number of all possible combinations of the 6 features except the empty set and the full set (math.pow(2, 6) - 2 = 62). In other words, the combinations correspond to all binary numbers between 0b000001 and 0b111110.

    Code example:

      e.getResults()
  • getMeanScores()

    Get the mean scores and standard deviation for each feature combination.

    For example:

     [
         ((1, 2, 4, 5), 0.80709291475581169, 0.022290250761077178),
         ((2, 3, 5), 0.75709283934338834, 0.023438694405096404),
             ......
         ((1, 3), 0.7968714536826742, 0.043176718930593445)
     ]
    
    

    Code example:

     e.getMeanScores()
  • getHighestMeanScores()

    Get the highest mean score for each feature combination length.

    For example, the highest mean scores for feature combinations of length 1, 2, 3, 4 and 5:

    [
         ((1,), 0.79627967403009137, 0.030071120117350029),
         ((3, 4), 0.80339827798743924, 0.035320722651528164),
         ((1, 2, 4), 0.83070040655888044, 0.012610268538728072),
         ((1, 2, 4, 5), 0.83746814502501521, 0.025451271286306136),
         ((1, 2, 3, 4, 5), 0.82545392324219369, 0.016482177962087206)
    ]
    
    

    Code example:

    e.getHighestMeanScores()
  • getHighestScores()

    Get the highest score for each feature combination length in each iteration.

    For example, the highest scores for two iterations:

      [
          [
              ((3,), 0.76658640984483684),
              ((1, 3), 0.81270053475935822),
              ((1, 4, 5), 0.85954301075268824),
              ((1, 2, 4, 5), 0.85808094344679708),
              ((1, 2, 3, 4, 5), 0.82132253711201075)
          ],
          [
              ((1,), 0.79973262032085568),
              ((2, 3), 0.82170542635658905),
              ((1, 2, 3), 0.83983957219251337),
              ((1, 2, 3, 5), 0.87148268398268391),
              ((1, 2, 4, 5, 6), 0.84572192513368982)
          ]
      ]
  • printFeatureEvalPlot()

    Plot the highest mean scores and their standard deviations.

    Code example:

      e.printFeatureEvalPlot()
  • printFeatureEvalMatrix()

    Print a matrix that indicates how often each feature appears in the feature combinations with the highest scores. For example, the color at location (1,5) shows how often feature 1 (bathymetry) is in the highest-scoring feature combination when 5 features are used. In general, the warmer the color, the more significant the feature.

    Code example:

      e.printFeatureEvalMatrix()
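
The count of 62 feature combinations per iteration mentioned above can be checked with a short, self-contained sketch. It does not use the evaluator library. The function name `enumerate_feature_combinations` and the mapping of bit i to feature ID i + 1 are assumptions made for illustration; the library may order its combinations differently.

```python
def enumerate_feature_combinations(n_features=6):
    """Return every subset of feature IDs 1..n_features except the
    empty set and the full set, one subset per bitmask."""
    full_mask = (1 << n_features) - 1          # 0b111111 for 6 features
    combinations = []
    # Masks 1 .. full_mask - 1, i.e. 0b000001 .. 0b111110.
    for mask in range(1, full_mask):
        # Assumed mapping: bit i set means feature ID i + 1 is included.
        combo = tuple(i + 1 for i in range(n_features) if mask & (1 << i))
        combinations.append(combo)
    return combinations


combos = enumerate_feature_combinations()
print(len(combos))   # 62
print(combos[0])     # (1,)
print(combos[-1])    # (2, 3, 4, 5, 6)
```

Each tuple has the same shape as the feature combinations in the result examples above, for instance (1, 2) or (1, 2, 4, 5).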

For more information, please read the paper
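
To make the relationship between the raw results and the two mean-score functions concrete, here is a minimal sketch that aggregates data shaped like the output of getResults(). It is not the library's implementation. The helper names `mean_scores` and `highest_mean_scores` are hypothetical, the scores are toy values chosen for easy arithmetic, and the use of the population standard deviation is an assumption.

```python
from collections import defaultdict
import statistics


def mean_scores(results):
    """Group scores by feature combination across iterations and return
    (combination, mean, standard deviation) tuples."""
    scores = defaultdict(list)
    for iteration in results:
        for combo, score in iteration:
            scores[combo].append(score)
    # Assumption: population standard deviation (pstdev).
    return [(combo, statistics.mean(v), statistics.pstdev(v))
            for combo, v in scores.items()]


def highest_mean_scores(mean_score_list):
    """For each combination length, keep the entry with the highest mean."""
    best = {}
    for entry in mean_score_list:
        length = len(entry[0])
        if length not in best or entry[1] > best[length][1]:
            best[length] = entry
    return [best[length] for length in sorted(best)]


# Toy data: two iterations, three feature combinations each.
toy_results = [
    [((1,), 0.70), ((2,), 0.60), ((1, 2), 0.80)],
    [((1,), 0.80), ((2,), 0.70), ((1, 2), 0.90)],
]

means = mean_scores(toy_results)
best_per_length = highest_mean_scores(means)
for combo, mean, std in best_per_length:
    print(combo, round(mean, 4), round(std, 4))
```

With the toy data, the best single feature is (1,) and the best pair is (1, 2), which mirrors the one-entry-per-length layout shown for getHighestMeanScores().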


In [ ]:
import sea_floor_feature as sff

#Labels
gravel_sand_silt = 1
clay = 2
calcareous_ooze_fine_grained_sediment = 3
radiolarian_ooze = 4
diatom_ooze_sponge_spicules = 5

iterationNumber = 5

e = sff.FeatureEvaluator(diatom_ooze_sponge_spicules)
# Run a few iterations now; more are run later so that the results can be compared.
# WARNING: THIS WILL TAKE A WHILE
e.doFeatureEvaluation(iterationNumber)
e.printFeatureEvalPlot()
e.printFeatureEvalMatrix()

In [ ]:
# Run more iterations, then print the plot and matrix again.

# WARNING: THIS WILL TAKE A WHILE.
# IF YOU DO NOT WANT MORE ITERATIONS, YOU CAN SKIP THIS CELL.
# Running 30 iterations may take several hours, depending on how many CPUs the computer has.
# The results are appended to those of the earlier iterations.

iterationNumber = 30

e.doFeatureEvaluation(iterationNumber)
e.printFeatureEvalPlot()
e.printFeatureEvalMatrix()

In [ ]:
#retrieve evaluation data

import numpy as np
import sea_floor_feature as sff

# Each entry f is (feature combination, mean score, standard deviation).
for f in e.getHighestMeanScores():
    if len(f[0]) == 1:
        print("The most informative feature is ('{0}') ({1}) when the model is trained with a single feature."
              .format(sff.featureToString(f[0])[0], f[1]))
    else:
        print('The most informative features are {0} ({2}) when the model is trained with {1} features.'
              .format(sff.featureToString(f[0]), len(f[0]), f[1]))

e.printFeatureEvalPlot()
e.printFeatureEvalMatrix()

#print out the shapes of result data
#print(np.array(e.getHighestScores()).shape)
print(np.array(e.getResults()).shape)
#print(np.array(e.getMeanScores()).shape)
#print(e.getHighestMeanScores())
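
The matrix described for printFeatureEvalMatrix() can be approximated from data shaped like the output of getHighestScores(). The sketch below is an illustration under stated assumptions, not the library's code: the helper name `feature_frequency` is hypothetical, and normalizing the counts to a fraction of iterations is an assumption. The input is the two-iteration getHighestScores() example shown earlier.

```python
def feature_frequency(highest_scores, n_features=6):
    """Return a matrix where entry [feature - 1][length - 1] is the fraction
    of iterations in which the feature appears in the highest-scoring
    combination of that length."""
    max_len = n_features - 1                  # the full set is never evaluated
    counts = [[0.0] * max_len for _ in range(n_features)]
    if not highest_scores:
        return counts
    for iteration in highest_scores:
        for combo, _score in iteration:
            for feature in combo:
                counts[feature - 1][len(combo) - 1] += 1
    n_iterations = len(highest_scores)
    return [[c / n_iterations for c in row] for row in counts]


# The two-iteration example from the getHighestScores() description.
example = [
    [((3,), 0.76658640984483684),
     ((1, 3), 0.81270053475935822),
     ((1, 4, 5), 0.85954301075268824),
     ((1, 2, 4, 5), 0.85808094344679708),
     ((1, 2, 3, 4, 5), 0.82132253711201075)],
    [((1,), 0.79973262032085568),
     ((2, 3), 0.82170542635658905),
     ((1, 2, 3), 0.83983957219251337),
     ((1, 2, 3, 5), 0.87148268398268391),
     ((1, 2, 4, 5, 6), 0.84572192513368982)],
]

freq = feature_frequency(example)
print(freq[0][4])   # feature 1 (bathymetry) with 5 features
print(freq[5][4])   # feature 6 (random) with 5 features
```

In this example, feature 1 is in the best five-feature combination in both iterations, while the random feature appears in only one of them.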
