This notebook contains the code to run "The Absolute Minimum Benchmark" for a machine learning tool.
First, we import the Python MLDB helper library and open a connection to the local MLDB server.
In [1]:
from pymldb import Connection
mldb = Connection("http://localhost/")
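As a quick, optional sanity check (not part of the original benchmark), we can ask the server which datasets it already knows about; `/v1/datasets` is a standard MLDB REST route, and the response can be read with `.json()` just like the other calls below.
In [ ]:
# Optional sanity check: list the datasets currently loaded on the server.
# An empty list simply means nothing has been imported yet.
print(mldb.get('/v1/datasets').json())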
Next, we create the training and test datasets by importing the remote CSV files.
In [2]:
# Import the 1M-row training set; runOnCreation triggers the import immediately
mldb.put('/v1/procedures/import_bench_train_1m', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/train-1m.csv",
        "outputDataset": "bench_train_1m",
        "runOnCreation": True
    }
})
# Import the test set the same way
mldb.put('/v1/procedures/import_bench_test', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/test.csv",
        "outputDataset": "bench_test",
        "runOnCreation": True
    }
})
print("Datasets loaded.")
Now we set up the experiment: a bagged ensemble of 100 randomized decision trees, trained to predict whether dep_delayed_15min is 'Y'.
In [3]:
mldb.put('/v1/procedures/benchmark', {
    "type": "classifier.experiment",
    "params": {
        "experimentName": "benchm_ml",
        # Training data: every column except the label as features,
        # and dep_delayed_15min = 'Y' as a boolean label
        "inputData": """
            select
                {* EXCLUDING(dep_delayed_15min)} as features,
                dep_delayed_15min = 'Y' as label
            from bench_train_1m
        """,
        # Evaluate on the held-out test set rather than a split of the training data
        "testingDataOverride": """
            select
                {* EXCLUDING(dep_delayed_15min)} as features,
                dep_delayed_15min = 'Y' as label
            from bench_test
        """,
        # Bagged ensemble of 100 randomized decision trees
        "configuration": {
            "type": "bagging",
            "num_bags": 100,
            "validation_split": 0,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 20,
                "random_feature_propn": 0.3
            }
        },
        "modelFileUrlPattern": "file:///mldb_data/models/benchml_$runid.cls",
        "mode": "boolean"
    }
})
print("Ready to go!")
Finally, we run the experiment inside a timing block. On an otherwise unloaded AWS EC2 r3.8xlarge instance (32 cores, 240 GB of RAM), it takes around 20 seconds to reach an AUC above 0.74.
In [4]:
import time

# Time a single run of the benchmark procedure
start_time = time.time()
result = mldb.post('/v1/procedures/benchmark/runs')
run_time = time.time() - start_time

# Pull the test-set AUC out of the run's results
auc = result.json()["status"]["folds"][0]["resultsTest"]["auc"]
print("\n\nAUC = %0.10f, time = %0.4f\n\n" % (auc, run_time))
In [ ]: