In [1]:
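# This is a demo of H2O's distributed random forest (DRF) estimator
# It imports a data set, parses it, and splits it into training and validation frames
# Then, it trains DRF models with and without class balancing and compares them on a test set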
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [2]:
h2o.init()


Warning: Version mismatch. H2O is version 3.5.0.99999, but the python package is version UNKNOWN.
H2O cluster uptime: 44 minutes 50 seconds 74 milliseconds
H2O cluster version: 3.5.0.99999
H2O cluster name: ludirehak
H2O cluster total nodes: 1
H2O cluster total memory: 3.56 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

In [3]:
from h2o.h2o import _locate  # Private helper, used to find files within the h2o git project directory.

air = h2o.upload_file(path=_locate("smalldata/airlines/AirlinesTrain.csv.zip"))


Parse Progress: [##################################################] 100%
Uploaded pya01a74e5-0aa6-4ef0-ae1a-0d3fe860eee9 into cluster with 24,421 rows and 12 cols

In [4]:
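# Draw one uniform random number per row, then split the frame roughly 80/20
r = air[0].runif()
air_train = air[r < 0.8]   # training frame
air_valid = air[r >= 0.8]  # validation frame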
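
The split above keeps a row for training when its uniform draw is below 0.8 and sends it to validation otherwise. The same idea can be sketched in plain Python without a cluster. The helper name `split_by_uniform` and its `seed` argument are hypothetical and exist only for this illustration; this is not how H2O implements `runif`.

```python
import random

def split_by_uniform(rows, train_fraction=0.8, seed=42):
    """Split rows into (train, valid) using one uniform draw per row."""
    rng = random.Random(seed)
    # One draw per row, playing the role of air[0].runif() in the notebook.
    draws = [rng.random() for _ in rows]
    train = [row for row, u in zip(rows, draws) if u < train_fraction]
    valid = [row for row, u in zip(rows, draws) if u >= train_fraction]
    return train, valid

rows = list(range(24421))  # same row count as the uploaded airlines frame
train, valid = split_by_uniform(rows)
print(len(train), len(valid))
```

Because every row is assigned independently, the split is only approximately 80/20, and the exact sizes change from run to run unless the random draws are seeded.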

In [5]:
myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"

In [6]:
rf_no_bal = H2ORandomForestEstimator(seed=12, ntrees=10, max_depth=20, balance_classes=False)
rf_no_bal.train(x=myX, y=myY, training_frame=air_train, validation_frame=air_valid)
rf_no_bal.show()


drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2ORandomForestEstimator :  Distributed RF
Model Key:  DRF_model_python_1445557087082_2742

Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
10.0 287650.0 20.0 20.0 20.0 1664.0 2418.0 2103.5

ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.269503006052
R^2: -0.0873991649123
LogLoss: 2.43382549553
AUC: 0.646622642412
Gini: 0.293245284825

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.402941766395:
NO YES Error Rate
NO 1948.0 6780.0 0.7768 (6780.0/8728.0)
YES 936.0 9580.0 0.089 (936.0/10516.0)
Total 2884.0 16360.0 0.401 (7716.0/19244.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.4 0.7 299.0
max f2 0.0 0.9 399.0
max f0point5 0.6 0.7 190.0
max accuracy 0.6 0.6 193.0
max precision 0.9 0.7 30.0
max absolute_MCC 0.6 0.2 190.0
max min_per_class_accuracy 0.7 0.6 140.0
ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.245293478794
R^2: 0.00968032826017
LogLoss: 0.758757679035
AUC: 0.685987609758
Gini: 0.371975219515

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.42132409513:
NO YES Error Rate
NO 467.0 1781.0 0.7923 (1781.0/2248.0)
YES 160.0 2566.0 0.0587 (160.0/2726.0)
Total 627.0 4347.0 0.3902 (1941.0/4974.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.4 0.7 315.0
max f2 0.2 0.9 396.0
max f0point5 0.7 0.7 174.0
max accuracy 0.7 0.6 200.0
max precision 1.0 0.9 0.0
max absolute_MCC 0.7 0.3 174.0
max min_per_class_accuracy 0.7 0.6 165.0
Scoring History:
timestamp duration number_of_trees training_MSE training_logloss training_AUC training_classification_error validation_MSE validation_logloss validation_AUC validation_classification_error
2015-10-22 17:22:58 0.074 sec 1.0 0.3 8.4 0.6 0.4 0.3 8.1 0.6 0.5
2015-10-22 17:22:58 0.163 sec 2.0 0.3 7.4 0.6 0.4 0.3 4.0 0.6 0.4
2015-10-22 17:22:58 0.245 sec 3.0 0.3 6.5 0.6 0.4 0.3 2.6 0.6 0.4
2015-10-22 17:22:58 0.311 sec 4.0 0.3 5.6 0.6 0.5 0.3 1.9 0.7 0.4
2015-10-22 17:22:58 0.391 sec 5.0 0.3 4.8 0.6 0.4 0.3 1.4 0.7 0.4
2015-10-22 17:22:58 0.480 sec 6.0 0.3 4.0 0.6 0.4 0.3 1.1 0.7 0.4
2015-10-22 17:22:58 0.565 sec 7.0 0.3 3.6 0.6 0.4 0.2 1.0 0.7 0.4
2015-10-22 17:22:58 0.659 sec 8.0 0.3 3.1 0.6 0.4 0.2 0.9 0.7 0.4
2015-10-22 17:22:58 0.751 sec 9.0 0.3 2.7 0.6 0.4 0.2 0.8 0.7 0.4
2015-10-22 17:22:58 0.851 sec 10.0 0.3 2.4 0.6 0.4 0.2 0.8 0.7 0.4
Variable Importances:
variable relative_importance scaled_importance percentage
Origin 6152.2 1.0 0.3
fDayofMonth 5583.6 0.9 0.3
Dest 4203.4 0.7 0.2
UniqueCarrier 1609.3 0.3 0.1
fDayOfWeek 1556.2 0.3 0.1
Distance 1493.0 0.2 0.1
fMonth 131.7 0.0 0.0
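
The error rates and the Gini value in the output above can be re-derived by hand, which is a quick way to check how the report should be read. The sketch below copies the training confusion matrix and training AUC of this model from the output; the function name `error_rates` is an assumption made for the example. It relies on the standard relation Gini = 2 * AUC - 1, which the reported numbers are consistent with.

```python
def error_rates(tn, fp, fn, tp):
    """Return (NO error, YES error, total error) for a 2x2 confusion matrix."""
    no_error = fp / (tn + fp)    # actual NO rows predicted as YES
    yes_error = fn / (fn + tp)   # actual YES rows predicted as NO
    total_error = (fp + fn) / (tn + fp + fn + tp)
    return no_error, yes_error, total_error

# Training confusion matrix of the model without class balancing.
no_err, yes_err, total_err = error_rates(tn=1948, fp=6780, fn=936, tp=9580)
print(round(no_err, 4), round(yes_err, 4), round(total_err, 4))

# Gini is a rescaling of AUC.
train_auc = 0.646622642412
gini = 2 * train_auc - 1
print(gini)
```

The NO error rate (about 0.78) is much higher than the YES error rate (about 0.09): at the max-F1 threshold the model predicts YES for most rows.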

In [7]:
rf_bal = H2ORandomForestEstimator(seed=12, ntrees=10, max_depth=20, balance_classes=True)
rf_bal.train(x=myX, y=myY, training_frame=air_train, validation_frame=air_valid)
rf_bal.show()


drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2ORandomForestEstimator :  Distributed RF
Model Key:  DRF_model_python_1445557087082_2744

Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
10.0 299144.0 20.0 20.0 20.0 1750.0 2460.0 2168.2

ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.268874582249
R^2: -0.0754992978501
LogLoss: 2.09200342169
AUC: 0.685292136376
Gini: 0.370584272753

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.538182890839:
NO YES Error Rate
NO 3925.0 6621.0 0.6278 (6621.0/10546.0)
YES 1574.0 8952.0 0.1495 (1574.0/10526.0)
Total 5499.0 15573.0 0.3889 (8195.0/21072.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.5 0.7 226.0
max f2 0.0 0.8 399.0
max f0point5 0.8 0.6 124.0
max accuracy 0.7 0.6 140.0
max precision 0.9 0.7 28.0
max absolute_MCC 0.7 0.3 151.0
max min_per_class_accuracy 0.7 0.6 140.0
ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.249809873778
R^2: -0.00855364526058
LogLoss: 0.770654128805
AUC: 0.682375448104
Gini: 0.364750896207

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.56328826827:
NO YES Error Rate
NO 822.0 1426.0 0.6343 (1426.0/2248.0)
YES 367.0 2359.0 0.1346 (367.0/2726.0)
Total 1189.0 3785.0 0.3605 (1793.0/4974.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.6 0.7 261.0
max f2 0.1 0.9 399.0
max f0point5 0.7 0.7 179.0
max accuracy 0.6 0.6 235.0
max precision 1.0 0.8 6.0
max absolute_MCC 0.7 0.3 194.0
max min_per_class_accuracy 0.7 0.6 167.0
Scoring History:
timestamp duration number_of_trees training_MSE training_logloss training_AUC training_classification_error validation_MSE validation_logloss validation_AUC validation_classification_error
2015-10-22 17:22:59 0.093 sec 1.0 0.3 7.3 0.6 0.4 0.3 7.9 0.6 0.5
2015-10-22 17:22:59 0.152 sec 2.0 0.3 6.8 0.6 0.4 0.3 3.7 0.6 0.4
2015-10-22 17:22:59 0.210 sec 3.0 0.3 5.9 0.6 0.4 0.3 2.2 0.6 0.4
2015-10-22 17:22:59 0.287 sec 4.0 0.3 5.2 0.6 0.4 0.3 1.6 0.7 0.4
2015-10-22 17:22:59 0.377 sec 5.0 0.3 4.3 0.7 0.4 0.3 1.3 0.7 0.4
2015-10-22 17:22:59 0.469 sec 6.0 0.3 3.7 0.7 0.4 0.3 1.0 0.7 0.4
2015-10-22 17:22:59 0.571 sec 7.0 0.3 3.2 0.7 0.4 0.3 0.9 0.7 0.4
2015-10-22 17:22:59 0.678 sec 8.0 0.3 2.8 0.7 0.4 0.3 0.9 0.7 0.4
2015-10-22 17:22:59 0.784 sec 9.0 0.3 2.4 0.7 0.4 0.2 0.8 0.7 0.4
2015-10-22 17:22:59 0.894 sec 10.0 0.3 2.1 0.7 0.4 0.2 0.8 0.7 0.4
Variable Importances:
variable relative_importance scaled_importance percentage
Origin 6811.1 1.0 0.3
fDayofMonth 6129.0 0.9 0.3
Dest 4860.0 0.7 0.2
UniqueCarrier 1824.5 0.3 0.1
fDayOfWeek 1634.1 0.2 0.1
Distance 1591.5 0.2 0.1
fMonth 129.6 0.0 0.0
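
With `balance_classes=True`, the row totals in the training confusion matrix change from 8728 NO / 10516 YES to 10546 NO / 10526 YES, which suggests the minority class was oversampled until the classes were roughly even. The sketch below shows one simple way to do that, random oversampling with replacement. It is an illustration only, not H2O's implementation: the function name `oversample_minority` is hypothetical, and H2O's actual sampling may differ (it is influenced by parameters such as `class_sampling_factors` and `max_after_balance_size`).

```python
import random
from collections import Counter

def oversample_minority(labels, seed=12):
    """Return row indices in which every class appears as often as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        # Draw extra rows with replacement until the class reaches the target size.
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced

# Class counts taken from the unbalanced training confusion matrix.
labels = ["NO"] * 8728 + ["YES"] * 10516
balanced = oversample_minority(labels)
counts = Counter(labels[i] for i in balanced)
print(counts["NO"], counts["YES"])
```

Oversampling repeats existing minority rows rather than creating new information, so it shifts the predicted probabilities and the chosen threshold more than it improves ranking quality. That matches the output above, where validation AUC is almost unchanged (0.686 without balancing, 0.682 with it).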

In [8]:
air_test = h2o.import_file(path=_locate("smalldata/airlines/AirlinesTest.csv.zip"))


Parse Progress: [##################################################] 100%
Imported /Users/ludirehak/h2o-3/smalldata/airlines/AirlinesTest.csv.zip. Parsed 2,691 rows and 12 cols

In [9]:
def model(model_object, test):
    """Predict on the test frame and print performance metrics."""
    # Predict on the test frame
    pred = model_object.predict(test)
    pred.head()
    # Build the confusion matrix and other metrics for the test set
    perf = model_object.model_performance(test)
    perf.show()
    print(perf.confusion_matrix())
    print(perf.precision())
    print(perf.accuracy())
    print(perf.auc())

In [10]:
print("\n\nWITHOUT CLASS BALANCING\n")
model(rf_no_bal, air_test)



WITHOUT CLASS BALANCING

H2OFrame with 2691 rows and 3 columns: 
predict YES YES YES YES YES YES NO YES YES YES
NO 0.1 0.0 0.225 0.175 0.5 0.4 0.6 0.3 0.3 0.4
YES 0.9 1.0 0.775 0.825 0.5 0.6 0.4 0.7 0.7 0.6
ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.242134967995
R^2: 0.0225448334417
LogLoss: 0.818660036508
AUC: 0.705312795104
Gini: 0.410625590208

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.51742125228:
NO YES Error Rate
NO 377.0 840.0 0.6902 (840.0/1217.0)
YES 143.0 1331.0 0.097 (143.0/1474.0)
Total 520.0 2171.0 0.3653 (983.0/2691.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.5 0.7 276.0
max f2 0.2 0.9 381.0
max f0point5 0.7 0.7 174.0
max accuracy 0.7 0.7 186.0
max precision 1.0 0.9 7.0
max absolute_MCC 0.7 0.3 174.0
max min_per_class_accuracy 0.7 0.7 162.0
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.51742125228:
NO YES Error Rate
NO 377.0 840.0 0.6902 (840.0/1217.0)
YES 143.0 1331.0 0.097 (143.0/1474.0)
Total 520.0 2171.0 0.3653 (983.0/2691.0)
[[0.985450211376883, 0.8556701030927835]]
[[0.6939187561627477, 0.6651802303976218]]
0.705312795104

In [11]:
print("\n\nWITH CLASS BALANCING\n")
model(rf_bal, air_test)



WITH CLASS BALANCING

H2OFrame with 2691 rows and 3 columns: 
predict YES YES YES YES NO NO NO YES YES NO
NO 0.0 0.3 0.1 0.0 0.4 0.5 0.7 0.1 0.3 0.5
YES 1.0 0.7 0.9 1.0 0.6 0.5 0.3 0.9 0.7 0.5
ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.24831550935
R^2: -0.00240489657592
LogLoss: 0.758488823047
AUC: 0.693547371085
Gini: 0.38709474217

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.475092852495:
NO YES Error Rate
NO 269.0 948.0 0.779 (948.0/1217.0)
YES 85.0 1389.0 0.0577 (85.0/1474.0)
Total 354.0 2337.0 0.3839 (1033.0/2691.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.5 0.7 307.0
max f2 0.3 0.9 379.0
max f0point5 0.7 0.7 184.0
max accuracy 0.7 0.7 210.0
max precision 1.0 0.85 1.0
max absolute_MCC 0.7 0.3 210.0
max min_per_class_accuracy 0.7 0.6 164.0
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.475092852495:
NO YES Error Rate
NO 269.0 948.0 0.779 (948.0/1217.0)
YES 85.0 1389.0 0.0577 (85.0/1474.0)
Total 354.0 2337.0 0.3839 (1033.0/2691.0)
[[0.9962384300103982, 0.85]]
[[0.6673053431289202, 0.6540319583797845]]
0.693547371085
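
To compare the two models at the thresholds H2O chose, precision, recall, and F1 can be recomputed from the two test confusion matrices above. The sketch below copies the cell counts from the output; the function name `precision_recall_f1` is an assumption made for the example. Note that the bracketed pairs printed by `perf.precision()` and `perf.accuracy()` appear to be [threshold, value] for the maximum of each metric (they line up with the Maximum Metrics tables), so they are not the values at the max-F1 threshold computed here.

```python
def precision_recall_f1(tn, fp, fn, tp):
    """Return (precision, recall, F1) for the YES class of a 2x2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Test-set confusion matrices at each model's max-F1 threshold.
test_matrices = {
    "without balancing": dict(tn=377, fp=840, fn=143, tp=1331),
    "with balancing": dict(tn=269, fp=948, fn=85, tp=1389),
}
results = {name: precision_recall_f1(**cells) for name, cells in test_matrices.items()}
for name, (p, r, f1) in results.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

On this test set the two models end up close: the balanced model trades a little precision for recall, and its test AUC (0.694) is slightly below that of the unbalanced model (0.705).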