Benchmarking MLDB

This notebook contains the code to run "The Absolute Minimum Benchmark" for a machine learning tool.

First we load the Python MLDB helper library and connect to the MLDB server running on localhost.


In [1]:
from pymldb import Connection
mldb = Connection("http://localhost/")

Next we import the training and test datasets directly from the remote CSV files, using two import.text procedures that run on creation.


In [2]:
# Import the 1M-row training set from S3
mldb.put('/v1/procedures/import_bench_train_1m', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/train-1m.csv",
        "outputDataset": "bench_train_1m",
        "runOnCreation": True
    }
})

# Import the test set from S3
mldb.put('/v1/procedures/import_bench_test', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/test.csv",
        "outputDataset": "bench_test",
        "runOnCreation": True
    }
})

print("Datasets loaded.")


Datasets loaded.
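
As an optional sanity check (not part of the original benchmark), we can count the rows in each imported dataset. This is a sketch that assumes pymldb's query helper, which runs an SQL query against MLDB and returns the result as a pandas DataFrame.


In [ ]:
# Optional: confirm the imports by counting the rows in each dataset
# (assumes pymldb's query() helper is available)
print(mldb.query("SELECT count(*) AS rows FROM bench_train_1m"))
print(mldb.query("SELECT count(*) AS rows FROM bench_test"))
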

Now we create the experimental setup: a classifier.experiment procedure that trains a bagged ensemble of 100 decision trees (i.e. a random forest) on the training set and evaluates it on the test set.


In [3]:
mldb.put('/v1/procedures/benchmark', {
    "type": "classifier.experiment",
    "params": {
        "experimentName": "benchm_ml",
        "inputData": """
            select
                {* EXCLUDING(dep_delayed_15min)} as features,
                dep_delayed_15min = 'Y' as label
            from bench_train_1m
            """,
        "testingDataOverride":  """
            select
                {* EXCLUDING(dep_delayed_15min)} as features,
                dep_delayed_15min = 'Y' as label
            from bench_test
            """,
        "configuration": {
            "type": "bagging",
            "num_bags": 100,
            "validation_split": 0,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 20,
                "random_feature_propn": 0.3
            }
        },
        "modelFileUrlPattern": "file:///mldb_data/models/benchml_$runid.cls",       
        "mode": "boolean"
    }
})

print "Ready to go!"


Ready to go!
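
Before launching the run, we can optionally double-check that the procedure was stored with the intended configuration. This is a hedged sketch: it assumes pymldb exposes a get method alongside the put and post calls used elsewhere in this notebook.


In [ ]:
# Optional: fetch the stored procedure and print its configuration
print(mldb.get('/v1/procedures/benchmark').json())
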

Finally, we run the experiment inside a timing block. On an otherwise-unloaded AWS EC2 r3.8xlarge instance (32 cores, 240 GB of RAM), it takes around 20 seconds to reach an AUC of more than 0.74.


In [4]:
import time

start_time = time.time()

# Kick off a run of the benchmark procedure and wait for it to complete
result = mldb.post('/v1/procedures/benchmark/runs')

run_time = time.time() - start_time
auc = result.json()["status"]["folds"][0]["resultsTest"]["auc"]

print("\n\nAUC = %0.10f, time = %0.4f\n\n" % (auc, run_time))



AUC = 0.7430276746, time = 21.4008
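
The run's result object also contains the rest of the test-set statistics alongside the AUC. The sketch below relies only on the same resultsTest structure already used above to extract the AUC.


In [ ]:
# Inspect the full test-set results returned by the run
# (same JSON path as was used for the AUC above)
import json
print(json.dumps(result.json()["status"]["folds"][0]["resultsTest"], indent=2))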



In [ ]: