Example classification analysis using ShinyLearner

By Erica Suh and Stephen Piccolo

This notebook illustrates how to perform a benchmark comparison of classification algorithms using ShinyLearner. We assume that the reader has a moderate understanding of shell and Python scripting and uses a UNIX-based operating system.

Install Python modules


In [ ]:
%%bash

# This step may or may not be necessary on your system:
pip3 install --upgrade pip

# You only need to install these modules once
pip3 install pmlb pandas numpy

Preparing the data

First, let's generate a "null" dataset that contains no signal to ensure that ShinyLearner doesn't find a signal when there is nothing to be found.


In [5]:
import numpy as np
import os
import pandas
import shutil

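def one_hot_encode(file_path, column_names):
    # Read the tab-separated file, using the first column as the index.
    data = pandas.read_csv(file_path, index_col=0, sep="\t")

    # If no columns are specified, encode every column except the class labels.
    if column_names is None:
        column_names = [x for x in data.columns if x != "Class"]

    # Replace each specified column with indicator columns and overwrite the file.
    data = pandas.get_dummies(data, drop_first=True, columns=column_names)
    data.to_csv(file_path, sep="\t", index=True)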
    
directory = "Datasets"
if os.path.exists(directory):
    shutil.rmtree(directory)
os.makedirs(directory)

np.random.seed(0)

num_observations = 500
num_numeric_features = 20
num_discrete_features = 10

data_dict = {}

data_dict[""] = ["Instance{}".format(i+1) for i in range(num_observations)]
data_dict["Class"] = np.random.choice([0, 1], size=num_observations, p=[0.5, 0.5])

for i in range(num_numeric_features):
    data_dict["Numeric{}".format(i+1)] = np.random.normal(0, 1, num_observations)
for i in range(num_discrete_features):
    data_dict["Discrete{}".format(i+1)] = np.random.choice(["A", "B", "C"], size=num_observations, p=[0.4, 0.5, 0.1])

df = pandas.DataFrame(data=data_dict)
df.set_index("", inplace=True)

file_path = '{}/{}.tsv'.format(directory, "null")

df.to_csv(file_path, sep="\t", index=True)
one_hot_encode(file_path, [x for x in data_dict.keys() if x.startswith("Discrete")])
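
As a rough illustration of what "no signal" means, the sketch below uses only the Python standard library to draw random binary labels and compare them with guesses that ignore the features entirely. The function name chance_accuracy and the simulated guesses are assumptions made for this illustration; they are not part of ShinyLearner.

```python
import random

def chance_accuracy(num_observations=500, seed=0):
    """Accuracy of feature-blind random guesses against random binary labels."""
    rng = random.Random(seed)
    labels = [rng.choice([0, 1]) for _ in range(num_observations)]
    guesses = [rng.choice([0, 1]) for _ in range(num_observations)]
    matches = sum(1 for label, guess in zip(labels, guesses) if label == guess)
    return matches / num_observations

accuracy = chance_accuracy()
print("Accuracy of random guessing: {:.3f}".format(accuracy))
```

With balanced classes, the value should fall close to 0.5. A classification algorithm that reports accuracy well above that level on the null dataset would suggest a problem, such as information leaking from the test set.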

The Penn Machine Learning Benchmarks (PMLB) repository contains a large number of datasets that can be used to test machine-learning algorithms. We can access this repository using the Python module named pmlb. For demonstration purposes, we will fetch 10 biology-related datasets from PMLB. First, define a list that indicates the unique identifier for each of these datasets.


In [6]:
datasets = ['analcatdata_aids',
            'ann-thyroid',
            'breast-cancer',
            'dermatology',
            'diabetes',
            'hepatitis',
            'iris',
            'liver-disorder',
            'molecular-biology_promoters',
            'yeast']

ShinyLearner requires that input data files have exactly one feature named 'Class', which contains the class labels, so we must modify the PMLB data to meet this requirement. After modifying the data, we save each DataFrame to a file with a supported extension. (See PMLB's GitHub repository for more information about how to use this module.)


In [7]:
from pmlb import fetch_data

for data in datasets:
    curr_data = fetch_data(data)
    curr_data = curr_data.rename(columns={'target': 'Class'})  # Rename 'target' to 'Class'
    
    if data == "molecular-biology_promoters":
        curr_data = curr_data.drop(columns=["instance"])
    
    curr_data.to_csv('{}/{}.tsv'.format(directory, data), sep='\t', index=True)  # Save to a .tsv file

one_hot_encode('{}/{}.tsv'.format(directory, "analcatdata_aids"), ["Race"])
one_hot_encode('{}/{}.tsv'.format(directory, "breast-cancer"), ["menopause", "breast-quad"])
one_hot_encode('{}/{}.tsv'.format(directory, "molecular-biology_promoters"), None)
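
Before running ShinyLearner, it may help to confirm that each file has exactly one 'Class' column. The sketch below is one way to check this with the standard library; count_class_columns is a hypothetical helper, and the small temporary file exists only for demonstration.

```python
import csv
import os
import tempfile

def count_class_columns(file_path, class_name="Class"):
    """Return how many header fields in a tab-separated file equal class_name."""
    with open(file_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    return sum(1 for name in header if name == class_name)

# Demonstrate on a small temporary file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.tsv")
    with open(example_path, "w", newline="") as handle:
        handle.write("\tFeature1\tClass\nInstance1\t0.5\t1\n")
    example_count = count_class_columns(example_path)

print("Number of 'Class' columns:", example_count)
```

To apply the same check to the real data, you could call count_class_columns on each .tsv file in the Datasets directory and confirm that every result equals 1.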

Performing a benchmark comparison of 10 classification algorithms

For this initial analysis, we will apply 10 different classification algorithms to each dataset using Monte Carlo cross validation, with no hyperparameter optimization. To keep the execution time reasonable, we will perform 5 iterations of Monte Carlo cross validation.
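
To make the idea of Monte Carlo cross validation concrete, here is a minimal standard-library sketch: in each iteration, the instances are shuffled and divided at random into a training set and a test set. The function name monte_carlo_splits and the one-third test fraction are assumptions for illustration; this notebook does not state the proportion that ShinyLearner uses.

```python
import random

def monte_carlo_splits(num_observations, iterations=5, test_fraction=1/3, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    num_test = max(1, int(round(num_observations * test_fraction)))
    for _ in range(iterations):
        indices = list(range(num_observations))
        rng.shuffle(indices)
        # The first num_test shuffled indices form the test set; the rest are for training.
        yield sorted(indices[num_test:]), sorted(indices[:num_test])

for iteration, (train, test) in enumerate(monte_carlo_splits(12, iterations=5), start=1):
    print("Iteration {}: {} training, {} test instances".format(iteration, len(train), len(test)))
```

Unlike k-fold cross validation, the test sets from different iterations may overlap because each split is drawn independently.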

ShinyLearner is executed within a Docker container. The ShinyLearner web application makes it easier to build commands for executing ShinyLearner within Docker at the command line. We used this tool to create a template command. Below we modify that template and execute ShinyLearner for each dataset. We also set three preprocessing options: --ohe false disables one-hot encoding within ShinyLearner (we already one-hot encoded the data in Python above), --scale robust scales the data, and --impute true imputes any missing values.

(This process takes a while to execute. You won't see any output until the analysis has completed. To facilitate this long-running execution, you could run this notebook at the command line. Also, we could use the shinylearner_gpu Docker image to speed up the keras algorithm, but that requires nvidia-docker to be installed, so we are using the regular, non-GPU image.)


In [ ]:
%%bash

function runShinyLearner {
  dataset_file_path="$1"  
  dataset_file_name="$(basename "$dataset_file_path")"
  dataset_name="${dataset_file_name/\.tsv/}"
  dataset_results_dir_path="$(pwd)/Results_Basic/$dataset_name"
  
  mkdir -p "$dataset_results_dir_path"

  docker run --rm \
    -v "$(pwd)/Datasets":/InputData \
    -v "$dataset_results_dir_path":/OutputData \
    --user $(id -u):$(id -g) \
    srp33/shinylearner:version513 \
    /UserScripts/classification_montecarlo \
      --data "$dataset_file_name" \
      --description "$dataset_name" \
      --iterations 5 \
      --classif-algo "/AlgorithmScripts/Classification/tsv/keras/dnn/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/xgboost/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/mlp/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/decision_tree/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/svm/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/HoeffdingTree/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/SimpleLogistic/default*" \
      --output-dir "/OutputData" \
      --ohe false \
      --scale robust \
      --impute true \
      --verbose false
}

rm -rf Results_Basic

for dataset_file_path in ./Datasets/*.tsv
do
  runShinyLearner "$dataset_file_path"
done
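
If you would rather vary the list of algorithms from Python, the arguments passed to the user script in the cell above can be assembled programmatically. The sketch below only builds and prints the argument list; it does not call Docker. The helper build_shinylearner_args is hypothetical, and every flag and path in it is copied from the command above.

```python
import shlex

def build_shinylearner_args(dataset_file_name, dataset_name, algorithm_paths, iterations=5):
    """Assemble the arguments given to classification_montecarlo in the cell above."""
    args = ["/UserScripts/classification_montecarlo",
            "--data", dataset_file_name,
            "--description", dataset_name,
            "--iterations", str(iterations)]
    # Add one --classif-algo argument per algorithm path.
    for path in algorithm_paths:
        args.extend(["--classif-algo", path])
    args.extend(["--output-dir", "/OutputData",
                 "--ohe", "false",
                 "--scale", "robust",
                 "--impute", "true",
                 "--verbose", "false"])
    return args

algorithms = ["/AlgorithmScripts/Classification/tsv/sklearn/svm/default*",
              "/AlgorithmScripts/Classification/tsv/sklearn/decision_tree/default*"]
example_args = build_shinylearner_args("iris.tsv", "iris", algorithms)

# Quote each argument so that the printed text is safe to paste into a shell.
print(" ".join(shlex.quote(arg) for arg in example_args))
```

Quoting matters here because the algorithm paths contain an asterisk, which the shell would otherwise try to expand before ShinyLearner sees it.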

Repeating the benchmark comparison with hyperparameter optimization

ShinyLearner provides an option to optimize a classification algorithm's hyperparameters. To accomplish this, it uses nested cross validation. This process requires more computational time, but it often increases classification accuracy. In the code below, we execute the same 10 classification algorithms on the same datasets. The code below differs from the code above in the following ways:

  1. We store the output in Results_ParamsOptimized rather than Results_Basic.
  2. We use the nestedclassification_montecarlo user script rather than classification_montecarlo.
  3. The path specified for each classification algorithm ends with * rather than default*. This tells ShinyLearner to evaluate all hyperparameter combinations, not just default ones.
  4. We indicate that we want to use 5 "outer" iterations and 3 "inner" iterations (to optimize hyperparameters).
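
The outer and inner iterations described above can be sketched with the standard library as follows. For each outer split, the inner splits use only the outer training set to pick a hyperparameter candidate, and the chosen candidate is then scored once on the outer test set. All names here (split_indices, nested_monte_carlo, toy_evaluate), the one-third test fraction, and the toy scoring function are assumptions for illustration, not ShinyLearner's implementation.

```python
import random

def split_indices(indices, test_fraction, rng):
    """Randomly divide indices into (train, test) lists."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    num_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[num_test:], shuffled[:num_test]

def nested_monte_carlo(indices, candidates, evaluate,
                       outer_iterations=5, inner_iterations=3,
                       test_fraction=1/3, seed=0):
    """Return one (best_candidate, outer_score) pair per outer iteration."""
    rng = random.Random(seed)
    results = []
    for _ in range(outer_iterations):
        outer_train, outer_test = split_indices(indices, test_fraction, rng)
        # Inner loop: score every candidate using only the outer training set.
        inner_scores = {candidate: 0.0 for candidate in candidates}
        for _ in range(inner_iterations):
            inner_train, inner_validation = split_indices(outer_train, test_fraction, rng)
            for candidate in candidates:
                inner_scores[candidate] += evaluate(candidate, inner_train, inner_validation)
        best = max(candidates, key=lambda candidate: inner_scores[candidate])
        # The outer test set is used only once, to score the selected candidate.
        results.append((best, evaluate(best, outer_train, outer_test)))
    return results

def toy_evaluate(candidate, train_indices, test_indices):
    """Stand-in for training and testing a model; peaks when candidate is 0.1."""
    return 1.0 - abs(candidate - 0.1)

nested_results = nested_monte_carlo(range(30), [0.01, 0.1, 1.0], toy_evaluate)
for best, score in nested_results:
    print("Selected candidate {} with outer score {:.2f}".format(best, score))
```

Because the outer test instances never influence the choice of hyperparameters, the outer scores are not inflated by the selection step.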

In [ ]:
%%bash

function runShinyLearner {
  dataset_file_path="$1"
  dataset_file_name="$(basename "$dataset_file_path")"
  dataset_name="${dataset_file_name/\.tsv/}"
  dataset_results_dir_path="$(pwd)/Results_ParamsOptimized/$dataset_name"
  
  mkdir -p "$dataset_results_dir_path"

  docker run --rm \
    -v "$(pwd)/Datasets":/InputData \
    -v "$dataset_results_dir_path":/OutputData \
    --user $(id -u):$(id -g) \
    srp33/shinylearner:version513 \
    /UserScripts/nestedclassification_montecarlo \
      --data "$dataset_file_name" \
      --description "$dataset_name" \
      --outer-iterations 5 \
      --inner-iterations 3 \
      --classif-algo "/AlgorithmScripts/Classification/tsv/keras/dnn/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/xgboost/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/mlp/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/decision_tree/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/svm/*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/HoeffdingTree/*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/SimpleLogistic/*" \
      --ohe false \
      --scale robust \
      --impute true \
      --verbose false
}

rm -rf Results_ParamsOptimized

for dataset_file_path in ./Datasets/*.tsv
do
  runShinyLearner "$dataset_file_path"
done

Repeating the benchmark comparison with feature selection (along with classification)

In this example, we will try 10 feature-selection algorithms in combination with the same 10 classification algorithms that we used previously. Although we could optimize hyperparameters as well, we won't do that, to reduce computational complexity. We have changed the following from the previous example:

  • We store the results in the Results_FeatureSelection directory.
  • We use the nestedboth_montecarlo user script.
  • We use default hyperparameters.
  • We added --fs-algo and --num-features arguments.
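
To illustrate how a feature-selection ranking combines with a list of candidate feature counts, the sketch below ranks features by score and keeps the top k for several values of k. The helper select_top_features and the toy scores are hypothetical. This notebook does not state how ShinyLearner handles a value of k that exceeds the number of available features; in this sketch, such a value simply returns every feature.

```python
def select_top_features(scores, num_features):
    """Return the names of the num_features highest-scoring features."""
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:num_features]

# Toy importance scores, as a feature-selection algorithm might produce.
toy_scores = {"Numeric1": 0.9, "Numeric2": 0.1, "Discrete1_B": 0.5, "Discrete1_C": 0.3}

for k in [1, 3, 5]:
    print(k, select_top_features(toy_scores, k))
```

In a nested design, the inner iterations would compare values of k in the same way that they compare hyperparameter combinations.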

In [ ]:
%%bash

function runShinyLearner {
  dataset_file_path="$1"
  dataset_file_name="$(basename "$dataset_file_path")"
  dataset_name="${dataset_file_name/\.tsv/}"
  dataset_results_dir_path="$(pwd)/Results_FeatureSelection/$dataset_name"
  
  mkdir -p "$dataset_results_dir_path"

  docker run --rm \
    -v "$(pwd)/Datasets":/InputData \
    -v "$dataset_results_dir_path":/OutputData \
    --user $(id -u):$(id -g) \
    srp33/shinylearner:version513 \
    /UserScripts/nestedboth_montecarlo \
      --data "$dataset_file_name" \
      --description "$dataset_name" \
      --outer-iterations 5 \
      --inner-iterations 3 \
      --classif-algo "/AlgorithmScripts/Classification/tsv/keras/dnn/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/xgboost/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/mlp/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/decision_tree/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/svm/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/HoeffdingTree/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/SimpleLogistic/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/tsv/mlr/kruskal.test/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/tsv/mlr/randomForestSRC.rfsrc/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/tsv/sklearn/mutual_info/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/tsv/sklearn/random_forest_rfe/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/tsv/sklearn/svm_rfe/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/arff/weka/Correlation/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/arff/weka/GainRatio/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/arff/weka/OneR/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/arff/weka/ReliefF/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/arff/weka/SymmetricalUncertainty/default*" \
      --num-features "1,3,5,10,15,20,50,200" \
      --ohe false \
      --scale robust \
      --impute true \
      --verbose false
}

rm -rf Results_FeatureSelection

for dataset_file_path in ./Datasets/*.tsv
do
  runShinyLearner "$dataset_file_path"
done

Clean up output files


In [ ]:
%%bash

# These files are relatively large and we won't use them to make graphs, so let's delete them.
rm -fv Results_ParamsOptimized/*/Nested_ElapsedTime.tsv
rm -fv Results_ParamsOptimized/*/Nested_Best.tsv

# Keep the nested predictions for the diabetes dataset only: set that file aside,
# delete the rest, and then restore it.
mv Results_ParamsOptimized/diabetes/Nested_Predictions.tsv Results_ParamsOptimized/diabetes/Nested_Predictions.tsv.tmp
rm -fv Results_ParamsOptimized/*/Nested_Predictions.tsv
mv Results_ParamsOptimized/diabetes/Nested_Predictions.tsv.tmp Results_ParamsOptimized/diabetes/Nested_Predictions.tsv
rm -fv Results_FeatureSelection/*/Nested_Predictions.tsv
rm -fv Results_FeatureSelection/*/Nested_*ElapsedTime.tsv
rm -fv Results_FeatureSelection/*/Nested_Best.tsv

rm -rfv Datasets
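
The same "delete everything except one file" pattern can be written in Python, which avoids the temporary rename. The sketch below is an alternative to the shell commands above, not a replacement for them; remove_matching is a hypothetical helper, and the demonstration runs in a throwaway directory so that no real results are touched.

```python
import tempfile
from pathlib import Path

def remove_matching(results_dir, pattern, keep=()):
    """Delete files under results_dir that match pattern, except paths listed in keep."""
    keep_paths = {Path(path) for path in keep}
    removed = []
    for path in sorted(Path(results_dir).glob(pattern)):
        if path.is_file() and path not in keep_paths:
            path.unlink()
            removed.append(path)
    return removed

# Build a small example directory tree and clean it up.
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    for dataset_name in ["diabetes", "iris"]:
        (root / dataset_name).mkdir()
        (root / dataset_name / "Nested_Predictions.tsv").write_text("example\n")

    kept_file = root / "diabetes" / "Nested_Predictions.tsv"
    removed = remove_matching(root, "*/Nested_Predictions.tsv", keep=[kept_file])
    removed_names = [path.parent.name for path in removed]
    kept_exists = kept_file.exists()

print("Removed from:", removed_names)
print("Kept diabetes predictions:", kept_exists)
```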

Analyzing and visualizing the results

Please see the document called Analyze_Results.Rmd, which contains R code for analyzing and visualizing the results.