By Erica Suh and Stephen Piccolo
This notebook illustrates how to perform a benchmark comparison of classification algorithms using ShinyLearner. We assume the reader has a moderate level of understanding of shell and Python scripting. We also assume that the user's operating system is UNIX-based.
In [ ]:
%%bash
# This step may or may not be necessary on your system:
pip3 install --upgrade pip
# You only need to install these modules once
pip3 install pmlb pandas numpy
First, let's generate a "null" dataset that contains no signal to ensure that ShinyLearner doesn't find a signal when there is nothing to be found.
In [5]:
import numpy as np
import os
import pandas
import shutil
def one_hot_encode(file_path, column_names):
    data = pandas.read_csv(file_path, index_col=0, sep="\t")
    # When no columns are specified, encode every column except the class labels.
    if column_names is None:
        column_names = [x for x in list(data) if x != "Class"]
    data = pandas.get_dummies(data, drop_first=True, columns=column_names)
    data.to_csv(file_path, sep="\t", index=True)
directory = "Datasets"
if os.path.exists(directory):
    shutil.rmtree(directory)
os.makedirs(directory)
np.random.seed(0)
num_observations = 500
num_numeric_features = 20
num_discrete_features = 10
data_dict = {}
data_dict[""] = ["Instance{}".format(i+1) for i in range(num_observations)]
data_dict["Class"] = np.random.choice([0, 1], size=num_observations, p=[0.5, 0.5])
for i in range(num_numeric_features):
    data_dict["Numeric{}".format(i+1)] = np.random.normal(0, 1, num_observations)
for i in range(num_discrete_features):
    data_dict["Discrete{}".format(i+1)] = np.random.choice(["A", "B", "C"], size=num_observations, p=[0.4, 0.5, 0.1])
df = pandas.DataFrame(data=data_dict)
df.set_index("", inplace=True)
file_path = '{}/{}.tsv'.format(directory, "null")
df.to_csv(file_path, sep="\t", index=True)
one_hot_encode('{}/{}.tsv'.format(directory, "null"), [x for x in data_dict.keys() if x.startswith("Discrete")])
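To make the effect of the one-hot encoding step concrete, here is a minimal sketch on a hypothetical three-row table (the `toy` DataFrame is made up for illustration and is not part of the analysis). With `drop_first=True`, a feature with three levels becomes two indicator columns, and the first level acts as the implicit baseline.

```python
import pandas

# Hypothetical three-row table with one categorical feature (levels A, B, C).
toy = pandas.DataFrame(
    {"Class": [0, 1, 0], "Discrete1": ["A", "B", "C"]},
    index=["Instance1", "Instance2", "Instance3"],
)

# drop_first=True keeps k-1 indicator columns for a k-level feature;
# the first level ("A") is dropped and becomes the baseline.
encoded = pandas.get_dummies(toy, drop_first=True, columns=["Discrete1"])

print(list(encoded.columns))  # ['Class', 'Discrete1_B', 'Discrete1_C']
```

Applied to the null dataset, each of the 10 discrete features would therefore contribute two columns to the saved file.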
The Penn Machine Learning Benchmarks (PMLB) repository contains a large number of datasets that can be used to test machine-learning algorithms. We can access this repository using the Python module named pmlb. For demonstration purposes, we will fetch 10 biology-related datasets from PMLB. First, define a list that indicates the unique identifier for each of these datasets.
In [6]:
datasets = ['analcatdata_aids',
'ann-thyroid',
'breast-cancer',
'dermatology',
'diabetes',
'hepatitis',
'iris',
'liver-disorder',
'molecular-biology_promoters',
'yeast']
ShinyLearner requires that input data files have exactly one feature named 'Class', which contains the class labels, so we must modify the PMLB data to meet this requirement. After modifying the data, we save each DataFrame to a file with a supported extension. (See PMLB's GitHub repository for more information about how to use this module.)
In [7]:
from pmlb import fetch_data
for data in datasets:
    curr_data = fetch_data(data)
    curr_data = curr_data.rename(columns={'target': 'Class'})  # Rename 'target' to 'Class'
    if data == "molecular-biology_promoters":
        curr_data = curr_data.drop(columns=["instance"])
    curr_data.to_csv('{}/{}.tsv'.format(directory, data), sep='\t', index=True)  # Save to a .tsv file
one_hot_encode('{}/{}.tsv'.format(directory, "analcatdata_aids"), ["Race"])
one_hot_encode('{}/{}.tsv'.format(directory, "breast-cancer"), ["menopause", "breast-quad"])
one_hot_encode('{}/{}.tsv'.format(directory, "molecular-biology_promoters"), None)
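As a sanity check on the 'Class' requirement described above, one could verify each DataFrame before saving it. The sketch below uses a hypothetical helper, `check_class_column`, and a made-up two-row DataFrame standing in for a PMLB dataset; neither is part of ShinyLearner or pmlb.

```python
import pandas

def check_class_column(df):
    """Return True if exactly one column is named 'Class' (hypothetical helper)."""
    return list(df.columns).count("Class") == 1

# Made-up stand-in for a PMLB DataFrame, which stores its labels in 'target'.
raw = pandas.DataFrame({"feature1": [0.1, 0.2], "target": [0, 1]})
renamed = raw.rename(columns={"target": "Class"})

print(check_class_column(raw))      # False
print(check_class_column(renamed))  # True
```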
For this initial analysis, we will apply 10 different classification algorithms to each dataset using Monte Carlo cross validation (with no hyperparameter optimization). To keep the execution time reasonable, we will perform 5 iterations of Monte Carlo cross validation.
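For readers unfamiliar with Monte Carlo cross validation, the idea is to repeat a random training/test split several times and evaluate on each test set. The following is a minimal sketch of the splitting step only, not ShinyLearner's implementation; the function name and the `test_fraction` value of 0.25 are assumptions chosen for illustration.

```python
import numpy as np

def monte_carlo_splits(num_observations, iterations, test_fraction=0.25, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = np.random.RandomState(seed)
    num_test = int(round(num_observations * test_fraction))
    for _ in range(iterations):
        # Each iteration draws a fresh, independent shuffle of all observations.
        shuffled = rng.permutation(num_observations)
        yield shuffled[num_test:], shuffled[:num_test]

splits = list(monte_carlo_splits(num_observations=500, iterations=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))  # 375 125 on every iteration
```

Unlike k-fold cross validation, the test sets from different iterations may overlap, because each split is drawn independently.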
ShinyLearner is executed within a Docker container. The ShinyLearner web application makes it easier to build commands for executing ShinyLearner within Docker at the command line. We used this tool to create a template command. Below we modify that template and execute ShinyLearner for each dataset. We also set three preprocessing options: --ohe false, because we already one-hot encoded the categorical features above; --scale robust, to scale the data; and --impute true, to impute any missing values.
(This process takes a while to execute. You won't see any output until the analysis has completed. For such a long-running job, you could run this notebook at the command line. We could also use the shinylearner_gpu Docker image to speed up the keras algorithm, but that requires nvidia-docker to be installed, so we use the regular, non-GPU image.)
In [ ]:
%%bash
function runShinyLearner {
  dataset_file_path="$1"
  dataset_file_name="$(basename "$dataset_file_path")"
  dataset_name="${dataset_file_name/\.tsv/}"
  dataset_results_dir_path="$(pwd)/Results_Basic/$dataset_name"
  mkdir -p "$dataset_results_dir_path"
  docker run --rm \
    -v "$(pwd)/Datasets":/InputData \
    -v "$dataset_results_dir_path":/OutputData \
    --user $(id -u):$(id -g) \
    srp33/shinylearner:version513 \
    /UserScripts/classification_montecarlo \
      --data "$dataset_file_name" \
      --description "$dataset_name" \
      --iterations 5 \
      --classif-algo "/AlgorithmScripts/Classification/tsv/keras/dnn/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/xgboost/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/mlp/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/decision_tree/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/svm/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/HoeffdingTree/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/SimpleLogistic/default*" \
      --output-dir "/OutputData" \
      --ohe false \
      --scale robust \
      --impute true \
      --verbose false
}
# Start from a clean results directory, then run ShinyLearner on each dataset.
rm -rf Results_Basic
for dataset_file_path in ./Datasets/*.tsv
do
  runShinyLearner "$dataset_file_path"
done
ShinyLearner provides an option to optimize a classification algorithm's hyperparameters. To accomplish this, it uses nested cross validation. This process requires more computational time, but it often increases classification accuracy. In the code below, we execute the same 10 classification algorithms on the same datasets. The code differs from the code above in the following ways:
- The results are stored in Results_ParamsOptimized rather than Results_Basic.
- We use the nestedclassification_montecarlo user script rather than classification_montecarlo.
- The algorithm paths end in * rather than default*. This tells ShinyLearner to evaluate all hyperparameter combinations, not just the default ones.
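The nesting in nested cross validation can be sketched as follows: each outer training set is split again into inner training and validation sets, which are used to choose hyperparameters, while the outer test set is held back for the final evaluation. This is an illustrative sketch of the index bookkeeping only, not ShinyLearner's implementation; the function names and the `test_fraction` of 0.25 are assumptions.

```python
import numpy as np

def random_split(indices, test_fraction, rng):
    """Randomly partition an index array into (train, test)."""
    shuffled = rng.permutation(indices)
    num_test = int(round(len(indices) * test_fraction))
    return shuffled[num_test:], shuffled[:num_test]

def nested_splits(num_observations, outer_iterations, inner_iterations,
                  test_fraction=0.25, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer iteration."""
    rng = np.random.RandomState(seed)
    all_indices = np.arange(num_observations)
    for _ in range(outer_iterations):
        outer_train, outer_test = random_split(all_indices, test_fraction, rng)
        # Inner splits draw only from the outer training set, so the outer
        # test set never influences the choice of hyperparameters.
        inner = [random_split(outer_train, test_fraction, rng)
                 for _ in range(inner_iterations)]
        yield outer_train, outer_test, inner

results = list(nested_splits(500, outer_iterations=5, inner_iterations=3))
for outer_train, outer_test, inner in results:
    print(len(outer_train), len(outer_test), [len(train) for train, _ in inner])
```

With 5 outer and 3 inner iterations, every hyperparameter combination is trained 15 times in the inner loops alone, which explains the longer execution time.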
In [ ]:
%%bash
function runShinyLearner {
  dataset_file_path="$1"
  dataset_file_name="$(basename "$dataset_file_path")"
  dataset_name="${dataset_file_name/\.tsv/}"
  dataset_results_dir_path="$(pwd)/Results_ParamsOptimized/$dataset_name"
  mkdir -p "$dataset_results_dir_path"
  docker run --rm \
    -v "$(pwd)/Datasets":/InputData \
    -v "$dataset_results_dir_path":/OutputData \
    --user $(id -u):$(id -g) \
    srp33/shinylearner:version513 \
    /UserScripts/nestedclassification_montecarlo \
      --data "$dataset_file_name" \
      --description "$dataset_name" \
      --outer-iterations 5 \
      --inner-iterations 3 \
      --classif-algo "/AlgorithmScripts/Classification/tsv/keras/dnn/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/xgboost/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/mlp/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/decision_tree/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/svm/*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/HoeffdingTree/*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/SimpleLogistic/*" \
      --ohe false \
      --scale robust \
      --impute true \
      --verbose false
}
# Start from a clean results directory, then run ShinyLearner on each dataset.
rm -rf Results_ParamsOptimized
for dataset_file_path in ./Datasets/*.tsv
do
  runShinyLearner "$dataset_file_path"
done
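The only change to the --classif-algo paths in the command above is the trailing wildcard pattern. As a rough illustration of how the two patterns differ, the sketch below applies shell-style matching with Python's `fnmatch` module to a list of hypothetical script names; the actual file names inside the ShinyLearner image, and the way ShinyLearner expands the patterns, may differ.

```python
import fnmatch

# Hypothetical script names for one algorithm; the real names inside the
# ShinyLearner image may differ.
scripts = ["default", "alt_params_1", "alt_params_2"]

# "default*" selects only names that begin with "default".
default_only = fnmatch.filter(scripts, "default*")

# "*" selects every name, i.e. every available parameter combination.
all_scripts = fnmatch.filter(scripts, "*")

print(default_only)  # ['default']
print(all_scripts)   # ['default', 'alt_params_1', 'alt_params_2']
```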
In this example, we will try 10 feature-selection algorithms in combination with the same 10 classification algorithms that we used previously. Although we could optimize hyperparameters as well, we won't do that here, to keep the execution time manageable. We have changed the following from the previous example:
- The results are stored in the Results_FeatureSelection directory.
- We use the nestedboth_montecarlo user script.
- We have added the --fs-algo and --num-features arguments.
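Conceptually, a feature-selection algorithm ranks the features, and the --num-features values indicate how many top-ranked features to try. The sketch below illustrates that idea with a deliberately simple, hypothetical scoring rule (absolute difference in class means) on simulated data; it is not one of the ten algorithms listed in the command and does not reflect ShinyLearner's internals. In a nested design, the ranking would be computed on training data only.

```python
import numpy as np

def rank_features(X, y):
    """Rank feature indices by absolute difference in class means, largest first."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(-diff)

# Simulated data: 100 observations, 6 features, only feature 2 carries signal.
rng = np.random.RandomState(0)
y = np.array([0, 1] * 50)
X = rng.normal(0, 1, size=(100, 6))
X[:, 2] += 3 * y

ranking = rank_features(X, y)

# Try several feature-set sizes, as the --num-features argument does.
for k in [1, 3, 5]:
    selected = ranking[:k]
    print(k, selected)
```

The informative feature (index 2) is ranked first, so it is included at every feature-set size.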
In [ ]:
%%bash
function runShinyLearner {
  dataset_file_path="$1"
  dataset_file_name="$(basename "$dataset_file_path")"
  dataset_name="${dataset_file_name/\.tsv/}"
  dataset_results_dir_path="$(pwd)/Results_FeatureSelection/$dataset_name"
  mkdir -p "$dataset_results_dir_path"
  docker run --rm \
    -v "$(pwd)/Datasets":/InputData \
    -v "$dataset_results_dir_path":/OutputData \
    --user $(id -u):$(id -g) \
    srp33/shinylearner:version513 \
    /UserScripts/nestedboth_montecarlo \
      --data "$dataset_file_name" \
      --description "$dataset_name" \
      --outer-iterations 5 \
      --inner-iterations 3 \
      --classif-algo "/AlgorithmScripts/Classification/tsv/keras/dnn/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/xgboost/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/h2o.randomForest/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/mlr/mlp/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/decision_tree/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/logistic_regression/default*" \
      --classif-algo "/AlgorithmScripts/Classification/tsv/sklearn/svm/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/HoeffdingTree/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/MultilayerPerceptron/default*" \
      --classif-algo "/AlgorithmScripts/Classification/arff/weka/SimpleLogistic/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/tsv/mlr/kruskal.test/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/tsv/mlr/randomForestSRC.rfsrc/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/tsv/sklearn/mutual_info/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/tsv/sklearn/random_forest_rfe/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/tsv/sklearn/svm_rfe/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/arff/weka/Correlation/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/arff/weka/GainRatio/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/arff/weka/OneR/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/arff/weka/ReliefF/default*" \
      --fs-algo "/AlgorithmScripts/FeatureSelection/arff/weka/SymmetricalUncertainty/default*" \
      --num-features "1,3,5,10,15,20,50,200" \
      --ohe false \
      --scale robust \
      --impute true \
      --verbose false
}
# Start from a clean results directory, then run ShinyLearner on each dataset.
rm -rf Results_FeatureSelection
for dataset_file_path in ./Datasets/*.tsv
do
  runShinyLearner "$dataset_file_path"
done
In [ ]:
%%bash
# These files are relatively large and we won't use them to make graphs, so let's delete them.
rm -fv Results_ParamsOptimized/*/Nested_ElapsedTime.tsv
rm -fv Results_ParamsOptimized/*/Nested_Best.tsv
# Keep the predictions for the diabetes dataset: move them aside, delete the rest, then move them back.
mv Results_ParamsOptimized/diabetes/Nested_Predictions.tsv Results_ParamsOptimized/diabetes/Nested_Predictions.tsv.tmp
rm -fv Results_ParamsOptimized/*/Nested_Predictions.tsv
mv Results_ParamsOptimized/diabetes/Nested_Predictions.tsv.tmp Results_ParamsOptimized/diabetes/Nested_Predictions.tsv
rm -fv Results_FeatureSelection/*/Nested_Predictions.tsv
rm -fv Results_FeatureSelection/*/Nested_*ElapsedTime.tsv
rm -fv Results_FeatureSelection/*/Nested_Best.tsv
rm -rfv Datasets
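The cleanup above deletes every Nested_Predictions.tsv file under Results_ParamsOptimized except the one for the diabetes dataset. For readers who prefer Python, the same "delete all but one" pattern might be sketched as follows. The sketch runs in a temporary directory with placeholder files (only two made-up dataset folders), so it does not touch any real results.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a small placeholder directory tree that mimics the results layout.
    root = Path(tmp) / "Results_ParamsOptimized"
    for name in ["diabetes", "iris"]:
        (root / name).mkdir(parents=True)
        (root / name / "Nested_Predictions.tsv").write_text("placeholder\n")

    # Delete every predictions file except the one we want to keep.
    keep = root / "diabetes" / "Nested_Predictions.tsv"
    for path in list(root.glob("*/Nested_Predictions.tsv")):
        if path != keep:
            path.unlink()

    remaining = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.tsv"))

print(remaining)  # ['diabetes/Nested_Predictions.tsv']
```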
Please see the document called Analyze_Results.Rmd, which contains R code for analyzing and visualizing the results.