In [1]:
import h2o

In [2]:
h2o.init()


H2O cluster uptime: 2 minutes 47 seconds 451 milliseconds
H2O cluster version: 3.5.0.99999
H2O cluster name: ece
H2O cluster total nodes: 1
H2O cluster total memory: 10.67 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

In [3]:
from h2o.utils.shared_utils import _locate  # private helper: finds files inside the h2o git project directory

# Import the training data file into the H2O cluster
air = h2o.import_file(path=_locate("smalldata/airlines/AirlinesTrain.csv.zip"))


Parse Progress: [##################################################] 100%
Imported /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTrain.csv.zip. Parsed 24,421 rows and 12 cols

In [4]:
# Build the training and validation sets by random sampling (roughly 80/20)
# runif() returns one uniform random value per row, i.e. a column with air.nrow entries
r = air[0].runif()
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]

myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"
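
The split above is approximate rather than exact: each row lands in the training set with probability 0.8, so the realized share varies from run to run. In the run shown further down, the confusion-matrix totals are 19,589 training rows and 4,832 validation rows, which sum to the 24,421 parsed rows. The following is a minimal plain-Python sketch of the same idea, not H2O code; `split_by_uniform` and its `seed` argument are hypothetical names used only for illustration.

```python
import random


def split_by_uniform(n_rows, train_fraction=0.8, seed=42):
    """Split row indices by thresholding one uniform draw per row.

    Hypothetical helper mirroring the r < 0.8 / r >= 0.8 masks above.
    """
    rng = random.Random(seed)
    r = [rng.random() for _ in range(n_rows)]  # one value in [0, 1) per row
    train_idx = [i for i, u in enumerate(r) if u < train_fraction]
    valid_idx = [i for i, u in enumerate(r) if u >= train_fraction]
    return train_idx, valid_idx


n_rows = 24421  # row count reported by the parse step
train_idx, valid_idx = split_by_uniform(n_rows)
train_share = len(train_idx) / n_rows
print(len(train_idx), len(valid_idx), round(train_share, 3))
```

Because both masks are built from the same vector of draws, every row falls into exactly one of the two sets.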

In [5]:
# Train a gradient boosting machine (GBM) and report metrics on the validation set
gbm = h2o.gbm(x=air_train[myX], 
              y=air_train[myY], 
              validation_x=air_valid[myX],
              validation_y=air_valid[myY],
              distribution="bernoulli", 
              ntrees=100, 
              max_depth=3, 
              learn_rate=0.01)
gbm.show()


gbm Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Gradient Boosting Machine
Model Key:  GBM_model_python_1444621872790_19

Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
100.0 21708.0 3.0 3.0 3.0 8.0 8.0 8.0

ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.224850820408
R^2: 0.0924584914905
LogLoss: 0.641731899772
AUC: 0.702768783364
Gini: 0.405537566728

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.448864947415:
NO YES Error Rate
NO 2631.0 6236.0 0.7033 (6236.0/8867.0)
YES 1018.0 9704.0 0.0949 (1018.0/10722.0)
Total 3649.0 15940.0 0.3703 (7254.0/19589.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.448864947415 0.727927387293 329.0
max f2 0.384216073407 0.859214414183 396.0
max f0point5 0.540484202247 0.683872282105 211.0
max accuracy 0.519460013297 0.658583899127 238.0
max precision 0.676757678734 0.874429223744 7.0
max absolute_MCC 0.519460013297 0.305317688997 238.0
max min_per_class_accuracy 0.546452685802 0.645401977243 203.0
ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.228581497538
R^2: 0.0782379231571
LogLoss: 0.649363009278
AUC: 0.680577713137
Gini: 0.361155426274

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.446739697823:
NO YES Error Rate
NO 605.0 1594.0 0.7249 (1594.0/2199.0)
YES 254.0 2379.0 0.0965 (254.0/2633.0)
Total 859.0 3973.0 0.3825 (1848.0/4832.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.446739697823 0.72025431426 330.0
max f2 0.373911932185 0.857795172864 397.0
max f0point5 0.520295345537 0.666759175744 236.0
max accuracy 0.512345552615 0.642177152318 248.0
max precision 0.684564934321 0.869565217391 0.0
max absolute_MCC 0.512345552615 0.27122238954 248.0
max min_per_class_accuracy 0.547116521118 0.625902012913 197.0
Scoring History:
timestamp duration number_of_trees training_MSE training_logloss training_AUC training_classification_error validation_MSE validation_logloss validation_AUC validation_classification_error
2015-10-11 20:54:01 0.070 sec 1.0 0.247331320109 0.687795430241 0.660414156332 0.390627392925 0.247603524684 0.688341976058 0.647685556757 0.395074503311
2015-10-11 20:54:01 0.114 sec 2.0 0.246913667129 0.686952727594 0.660424401202 0.390627392925 0.247224374106 0.6875769826 0.647755678055 0.395074503311
2015-10-11 20:54:01 0.150 sec 3.0 0.246503630683 0.686125470932 0.660424401202 0.390627392925 0.246854759296 0.686831315909 0.647755678055 0.395074503311
2015-10-11 20:54:01 0.177 sec 4.0 0.246101212127 0.685313591467 0.663714440178 0.390627392925 0.246503381528 0.686122482818 0.646327604285 0.39673013245
2015-10-11 20:54:01 0.196 sec 5.0 0.245706838064 0.684518021244 0.663646675728 0.390627392925 0.24615014055 0.685409949548 0.646298070438 0.39673013245
--- --- --- --- --- --- --- --- --- --- --- ---
2015-10-11 20:54:04 3.683 sec 75.0 0.228228148277 0.648850974833 0.698597822319 0.369748328143 0.231195338447 0.654891371682 0.678054296337 0.384105960265
2015-10-11 20:54:05 3.775 sec 76.0 0.228053442305 0.648485174531 0.698807689635 0.369646230027 0.231064058149 0.654614394176 0.678156283101 0.385347682119
2015-10-11 20:54:05 3.857 sec 77.0 0.227887494345 0.648136594459 0.699004214428 0.371688192353 0.230933646803 0.654339144973 0.678235126383 0.385347682119
2015-10-11 20:54:05 3.947 sec 78.0 0.227764971727 0.647879404876 0.699149178285 0.369697279085 0.230831743431 0.654125223389 0.678410256915 0.385347682119
2015-10-11 20:54:05 4.254 sec 100.0 0.224850820408 0.641731899772 0.702768783364 0.370309867783 0.228581497538 0.649363009278 0.680577713137 0.382450331126
Variable Importances:
variable relative_importance scaled_importance percentage
Origin 17652.8085938 1.0 0.692557100813
Dest 4619.26074219 0.261672850394 0.18122339063
UniqueCarrier 1647.6003418 0.0933336093827 0.0646388539225
fDayofMonth 1342.42211914 0.0760458094819 0.0526660653438
fDayOfWeek 139.104721069 0.00788003338566 0.00545737307588
fMonth 88.1220855713 0.0049919583676 0.00345721621445
Distance 0.0 0.0 0.0
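
The `scaled_importance` and `percentage` columns in the table above appear to be simple normalizations of `relative_importance`: the first divides by the largest value, the second by the column sum. The sketch below recomputes both from the printed numbers as a sanity check. It is an illustration based on the table, not a description of H2O internals.

```python
# Relative importances copied from the Variable Importances table above.
relative_importance = {
    "Origin": 17652.8085938,
    "Dest": 4619.26074219,
    "UniqueCarrier": 1647.6003418,
    "fDayofMonth": 1342.42211914,
    "fDayOfWeek": 139.104721069,
    "fMonth": 88.1220855713,
    "Distance": 0.0,
}

largest = max(relative_importance.values())
total = sum(relative_importance.values())

# Divide by the maximum for the scaled column, by the sum for the percentage column.
scaled = {name: rel / largest for name, rel in relative_importance.items()}
percentage = {name: rel / total for name, rel in relative_importance.items()}

for name in relative_importance:
    print("{0:15s} {1:.6f} {2:.6f}".format(name, scaled[name], percentage[name]))
```

Read this way, `Origin` alone accounts for about 69% of the total importance, while `Distance` was never used for a split in this model.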

In [6]:
# Train a binomial generalized linear model (GLM) with the L-BFGS solver
glm = h2o.glm(x=air_train[myX], 
              y=air_train[myY],
              validation_x=air_valid[myX],
              validation_y=air_valid[myY],
              family = "binomial", 
              solver="L_BFGS")
glm.pprint_coef()


glm Model Build Progress: [##################################################] 100%

Coefficients: glm coefficients

names coefficients standardized_coefficients
Intercept 0.056540315409 0.224670161231
Origin.ABE -0.00451467313266 -0.00451467313266
Origin.ABQ -0.0369454795796 -0.0369454795796
Origin.ACY -0.0143826087457 -0.0143826087457
Origin.ALB 0.00857751200054 0.00857751200054
--- --- ---
fDayOfWeek.f6 -0.0868429852704 -0.0868429852704
fDayOfWeek.f7 0.0201706138395 0.0201706138395
fMonth.f1 -0.100726106453 -0.100726106453
fMonth.f10 0.106283308704 0.106283308704
Distance 0.000222614075406 0.140355957215
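
In the coefficient table, the one-hot categorical levels have identical raw and standardized coefficients, while `Distance` and the intercept differ. That pattern is consistent with only the numeric column being standardized as `(x - mean) / sd`. Under that assumption (it is an inference from the table, not something the output states), the scale and center used for `Distance` can be backed out as follows.

```python
# Values copied from the coefficient table above.
intercept = 0.056540315409
std_intercept = 0.224670161231
coef_distance = 0.000222614075406
std_coef_distance = 0.140355957215

# Assumption: Distance was standardized as (x - mean) / sd, so that
#   std_coef      = coef * sd
#   std_intercept = intercept + coef * mean
implied_sd = std_coef_distance / coef_distance
implied_mean = (std_intercept - intercept) / coef_distance
print("implied sd of Distance:   {0:.2f}".format(implied_sd))
print("implied mean of Distance: {0:.2f}".format(implied_mean))

# Both parameterizations give the same contribution to the linear predictor
# for an arbitrary, made-up distance value.
x = 1000.0
raw_part = intercept + coef_distance * x
std_part = std_intercept + std_coef_distance * (x - implied_mean) / implied_sd
print(abs(raw_part - std_part) < 1e-12)
```

The standardized coefficients are the ones to compare across predictors, since the raw `Distance` coefficient is small mainly because distances are measured in large units.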


In [7]:
# Import the held-out test data file into the H2O cluster
air_test = h2o.import_file(path=_locate("smalldata/airlines/AirlinesTest.csv.zip"))


Parse Progress: [##################################################] 100%
Imported /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTest.csv.zip. Parsed 2,691 rows and 12 cols

In [8]:
# Score both models on the test set and compute their performance metrics
gbm_pred = gbm.predict(air_test)
print("GBM predictions: ")
gbm_pred.head()

gbm_perf = gbm.model_performance(air_test)
print("GBM performance: ")
gbm_perf.show()

glm_pred = glm.predict(air_test)
print("GLM predictions: ")
glm_pred.head()

glm_perf = glm.model_performance(air_test)
print("GLM performance: ")
glm_perf.show()


GBM predictions: 
GBM performance: 

ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.226566937921
R^2: 0.0853901612137
LogLoss: 0.64529552984
AUC: 0.691592366843
Gini: 0.383184733686

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.461947098049:
NO YES Error Rate
NO 399.0 818.0 0.6721 (818.0/1217.0)
YES 178.0 1296.0 0.1208 (178.0/1474.0)
Total 577.0 2114.0 0.3701 (996.0/2691.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.461947098049 0.722408026756 317.0
max f2 0.385877574696 0.85970464135 394.0
max f0point5 0.53566616157 0.685536224357 219.0
max accuracy 0.53566616157 0.658491267187 219.0
max precision 0.675162105576 0.848484848485 9.0
max absolute_MCC 0.53566616157 0.307175612177 219.0
max min_per_class_accuracy 0.54621274447 0.641112618725 203.0
GLM predictions: 
GLM performance: 

ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.232278313501
R^2: 0.0623343687585
LogLoss: 0.656952698744
Null degrees of freedom: 2690
Residual degrees of freedom: 2438
Null deviance: 3705.93804119
Residual deviance: 3535.71942464
AIC: 4041.71942464
AUC: 0.654934783021
Gini: 0.309869566041

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.463553658893:
NO YES Error Rate
NO 279.0 938.0 0.7707 (938.0/1217.0)
YES 105.0 1369.0 0.0712 (105.0/1474.0)
Total 384.0 2307.0 0.3876 (1043.0/2691.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.463553658893 0.724147051045 309.0
max f2 0.351797544742 0.858992302309 391.0
max f0point5 0.518666845805 0.65379623621 256.0
max accuracy 0.511685442003 0.629134150873 264.0
max precision 0.769210809139 1.0 0.0
max absolute_MCC 0.511685442003 0.244267852676 264.0
max min_per_class_accuracy 0.557529502351 0.602442333786 198.0
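
The GLM block above reports degrees of freedom, deviance, and AIC together, and the three are linked. Assuming the usual definition AIC = residual deviance + 2k, which applies to a binomial model on 0/1 responses because the saturated log-likelihood is zero, the number of estimated parameters k follows from the degrees of freedom. This sketch checks that the printed values are consistent with that reading.

```python
# Values copied from the GLM test-set metrics above.
null_dof = 2690
residual_dof = 2438
residual_deviance = 3535.71942464
reported_aic = 4041.71942464

# The null model fits only an intercept, so the drop in degrees of freedom
# counts the non-intercept coefficients; add one back for the intercept.
n_params = (null_dof - residual_dof) + 1

# Assumed definition: AIC = residual deviance + 2 * number of parameters.
aic = residual_deviance + 2 * n_params
print(n_params, aic)
```

The 253 parameters reflect the one-hot expansion of `Origin`, `Dest`, and the other categorical predictors, which is why the AIC penalty (506) is large relative to the deviance reduction.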

In [9]:
# Building confusion matrix for test set
gbm_CM = gbm_perf.confusion_matrix()
print(gbm_CM)
print("")

glm_CM = glm_perf.confusion_matrix()
print(glm_CM)


Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.461947098049:
NO YES Error Rate
NO 399.0 818.0 0.6721 (818.0/1217.0)
YES 178.0 1296.0 0.1208 (178.0/1474.0)
Total 577.0 2114.0 0.3701 (996.0/2691.0)


Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.463553658893:
NO YES Error Rate
NO 279.0 938.0 0.7707 (938.0/1217.0)
YES 105.0 1369.0 0.0712 (105.0/1474.0)
Total 384.0 2307.0 0.3876 (1043.0/2691.0)
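
Each confusion matrix lists actual classes in rows and predicted classes in columns, so the error rates and the F1 score can be recomputed from the four cell counts. The sketch below does this for both test-set matrices; `binary_metrics` is a hypothetical helper written for illustration and is not part of the h2o package.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive summary metrics from the four cells of a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "no_error": fp / (tn + fp),       # actual NO predicted YES
        "yes_error": fn / (fn + tp),      # actual YES predicted NO
        "total_error": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Cell counts copied from the two test-set matrices above (row = actual).
gbm_metrics = binary_metrics(tn=399, fp=818, fn=178, tp=1296)
glm_metrics = binary_metrics(tn=279, fp=938, fn=105, tp=1369)

for label, m in (("GBM", gbm_metrics), ("GLM", glm_metrics)):
    print("{0}: error={1:.4f} f1={2:.6f}".format(label, m["total_error"], m["f1"]))
```

Both matrices are taken at the threshold that maximizes F1, which favors recall on the YES class here: each model misclassifies most actual NO rows in exchange for a low YES error rate.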


In [10]:
# Precision, accuracy, and AUC on the test set
print('GBM Precision: {0}'.format(gbm_perf.precision()))
print('GBM Accuracy: {0}'.format(gbm_perf.accuracy()))
print('GBM AUC: {0}'.format(gbm_perf.auc()))
print("")
print('GLM Precision: {0}'.format(glm_perf.precision()))
print('GLM Accuracy: {0}'.format(glm_perf.accuracy()))
print('GLM AUC: {0}'.format(glm_perf.auc()))


GBM Precision: [[0.675162105576199, 0.8484848484848485]]
GBM Accuracy: [[0.5356661615701861, 0.6584912671869194]]
GBM AUC: 0.691592366843

GLM Precision: [[0.7692108091388267, 1.0]]
GLM Accuracy: [[0.5116854420033787, 0.6291341508732813]]
GLM AUC: 0.654934783021
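
The precision and accuracy values printed above are nested lists rather than plain numbers. Comparing them with the earlier Maximum Metrics tables suggests that each inner pair is `[threshold, value]` for the threshold that maximizes the metric; that reading is inferred from the matching numbers, not from the API documentation. The sketch below unpacks the pairs and also checks the relation Gini = 2 * AUC - 1 against the reported test-set figures.

```python
# Values copied from the printed output above.
gbm_precision = [[0.675162105576199, 0.8484848484848485]]
gbm_accuracy = [[0.5356661615701861, 0.6584912671869194]]
gbm_auc, gbm_gini_reported = 0.691592366843, 0.383184733686
glm_auc, glm_gini_reported = 0.654934783021, 0.309869566041

# Assumed layout: each inner list is [threshold, metric value].
precision_threshold, precision_value = gbm_precision[0]
accuracy_threshold, accuracy_value = gbm_accuracy[0]
print("max precision {0:.4f} at threshold {1:.4f}".format(precision_value, precision_threshold))
print("max accuracy  {0:.4f} at threshold {1:.4f}".format(accuracy_value, accuracy_threshold))

# The Gini coefficient is a rescaling of AUC onto [-1, 1].
gbm_gini = 2 * gbm_auc - 1
glm_gini = 2 * glm_auc - 1
print(gbm_gini, glm_gini)
```

Note that the maximum precision is reached at a high threshold where very few rows are predicted YES (index 9 for the GBM, index 0 for the GLM), so it says little about overall model quality. AUC is the more useful single number for comparing the two models here.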