In [1]:
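# This is a demo of H2O's Distributed Random Forest (DRF) with and without class balancing
# It imports the airlines data set, parses it, and splits it into training and validation frames
# Then, it trains two binomial DRF models (balance_classes=False and True) and scores both on a test set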
import h2o

In [2]:
h2o.init()


H2O cluster uptime: 16 minutes 29 seconds 988 milliseconds
H2O cluster version: 3.5.0.99999
H2O cluster name: ece
H2O cluster total nodes: 1
H2O cluster total memory: 10.67 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

In [3]:
from h2o.utils.shared_utils import _locate  # Private helper used to find files within the h2o git project directory.

air = h2o.upload_file(path=_locate("smalldata/airlines/AirlinesTrain.csv.zip"))


Parse Progress: [##################################################] 100%
Uploaded py94b053cc-68c5-4746-aecc-2a9cea269d7f into cluster with 24,421 rows and 12 cols

In [4]:
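# Draw one uniform random number per row, then split roughly 80/20 into training and validation frames.
# No seed is passed to runif(), so the split differs between runs.
r = air[0].runif()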
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]
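
The split above can be sketched without a cluster. This is a minimal plain-Python sketch, assuming `runif()` yields one uniform draw in [0, 1) per row; the helper name `split_rows` and the seed value are illustrative and not part of H2O.

```python
import random

def split_rows(n_rows, train_fraction=0.8, seed=42):
    """Hypothetical helper: partition row indices using one uniform draw per row."""
    rng = random.Random(seed)
    draws = [rng.random() for _ in range(n_rows)]
    # Rows with a draw below the cutoff go to training, the rest to validation
    train = [i for i, r in enumerate(draws) if r < train_fraction]
    valid = [i for i, r in enumerate(draws) if r >= train_fraction]
    return train, valid

# 24,421 rows, matching the uploaded airlines frame
train_idx, valid_idx = split_rows(24421)
print(len(train_idx), len(valid_idx))
```

Because each row is assigned independently, the two parts only approximate an 80/20 split; every row lands in exactly one of them.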

In [5]:
myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"

In [6]:
rf_no_bal = h2o.random_forest(x=air_train[myX], y=air_train[myY], validation_x=air_valid[myX],
                              validation_y=air_valid[myY], seed=12, ntrees=10, max_depth=20, balance_classes=False)
rf_no_bal.show()


drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Distributed RF
Model Key:  DRF_model_python_1444621872790_39

Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
10.0 308107.0 20.0 20.0 20.0 1838.0 2497.0 2279.7

ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.265275494969
R^2: -0.0711389804007
LogLoss: 2.29024516759
AUC: 0.661677793181
Gini: 0.323355586361

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.44272284619:
NO YES Error Rate
NO 2295.0 6457.0 0.7378 (6457.0/8752.0)
YES 1071.0 9557.0 0.1008 (1071.0/10628.0)
Total 3366.0 16014.0 0.3884 (7528.0/19380.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.44272284619 0.717438630733 272.0
max f2 0.0 0.858592386738 399.0
max f0point5 0.698268730884 0.659881812213 158.0
max accuracy 0.650252095381 0.628947368421 181.0
max precision 0.960446300909 0.721248630887 21.0
max absolute_MCC 0.698268730884 0.246344466653 158.0
max min_per_class_accuracy 0.749250828256 0.621229433272 134.0
ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.250504652341
R^2: -0.00747370018869
LogLoss: 0.768558939899
AUC: 0.689886229554
Gini: 0.379772459109

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.486017723481:
NO YES Error Rate
NO 645.0 1596.0 0.7122 (1596.0/2241.0)
YES 231.0 2366.0 0.0889 (231.0/2597.0)
Total 876.0 3962.0 0.3776 (1827.0/4838.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.486017723481 0.721451440768 284.0
max f2 0.0664381176233 0.852929584866 398.0
max f0point5 0.690310050891 0.662157351371 187.0
max accuracy 0.690310050891 0.640553947912 187.0
max precision 0.996994908052 0.901639344262 1.0
max absolute_MCC 0.690310050891 0.272976256064 187.0
max min_per_class_accuracy 0.73961018417 0.63226800154 159.0
Scoring History:
timestamp duration number_of_trees training_MSE training_logloss training_AUC training_classification_error validation_MSE validation_logloss validation_AUC validation_classification_error
2015-10-11 21:07:43 0.095 sec 1.0 0.32235484164 7.662491719 0.608674793325 0.455644594405 0.335928752684 7.97500838576 0.594303281667 0.463207937164
2015-10-11 21:07:43 0.171 sec 2.0 0.315088056893 6.7142159325 0.611138413767 0.452977518449 0.286013782525 3.64715325721 0.635786804429 0.413187267466
2015-10-11 21:07:43 0.244 sec 3.0 0.302349898977 5.80199845388 0.621654880365 0.406267142074 0.272317328196 2.23120408384 0.651894790904 0.415047540306
2015-10-11 21:07:44 0.337 sec 4.0 0.298117091735 5.30789756972 0.624075773238 0.40820668693 0.263460295802 1.63304931546 0.666612627724 0.403679206284
2015-10-11 21:07:44 0.427 sec 5.0 0.288127760368 4.52047928228 0.634770920373 0.407384230288 0.259250391801 1.30688458768 0.671739797937 0.397064902852
2015-10-11 21:07:44 0.515 sec 6.0 0.282972510681 3.99632739574 0.639444594454 0.405887502736 0.255525177893 1.07447770858 0.679644002786 0.396031417941
2015-10-11 21:07:44 0.609 sec 7.0 0.278459832224 3.43743009325 0.642955984647 0.408359100117 0.252968181208 0.926846750591 0.684167208345 0.391070690368
2015-10-11 21:07:44 0.710 sec 8.0 0.273494367166 3.01547825092 0.650657012799 0.395480522204 0.252163227511 0.886088862974 0.686932988446 0.391897478297
2015-10-11 21:07:44 0.822 sec 9.0 0.267895919558 2.59374086495 0.657471165348 0.393676669089 0.250662445622 0.807979165782 0.688616099619 0.377842083506
2015-10-11 21:07:44 0.937 sec 10.0 0.265275494969 2.29024516759 0.661677793181 0.388441692466 0.250504652341 0.768558939899 0.689886229554 0.377635386523
Variable Importances:
variable relative_importance scaled_importance percentage
fDayofMonth 6308.44042969 1.0 0.295695234969
Origin 6097.22363281 0.966518381329 0.285794879869
Dest 3880.18457031 0.61507826119 0.181875710967
fDayOfWeek 1679.60095215 0.266246621628 0.078727857342
Distance 1669.28149414 0.264610804009 0.0782441538666
UniqueCarrier 1566.25708008 0.248279602151 0.0734150952961
fMonth 133.276596069 0.0211267107227 0.00624706769126
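
The three columns of the variable importance table appear to be related by simple normalizations: `scaled_importance` matches each value divided by the largest, and `percentage` matches each value divided by the column sum. The sketch below recomputes both from the `relative_importance` values printed above; it is a check on the printed numbers, not a description of how H2O computes them internally.

```python
# relative_importance values copied from the table above (model without class balancing)
relative = {
    "fDayofMonth": 6308.44042969,
    "Origin": 6097.22363281,
    "Dest": 3880.18457031,
    "fDayOfWeek": 1679.60095215,
    "Distance": 1669.28149414,
    "UniqueCarrier": 1566.25708008,
    "fMonth": 133.276596069,
}

largest = max(relative.values())
total = sum(relative.values())

# Normalize by the maximum and by the sum, respectively
scaled = {name: value / largest for name, value in relative.items()}
percentage = {name: value / total for name, value in relative.items()}

for name in relative:
    print(name, round(scaled[name], 6), round(percentage[name], 6))
```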

In [7]:
rf_bal = h2o.random_forest(x=air_train[myX], y=air_train[myY], validation_x=air_valid[myX],
                           validation_y=air_valid[myY], seed=12, ntrees=10, max_depth=20, balance_classes=True)
rf_bal.show()


drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Distributed RF
Model Key:  DRF_model_python_1444621872790_41

Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
10.0 309235.0 20.0 20.0 20.0 2000.0 2500.0 2283.0

ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.264451685762
R^2: -0.0578084464369
LogLoss: 1.97582874031
AUC: 0.691222009675
Gini: 0.38244401935

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.535749234613:
NO YES Error Rate
NO 4017.0 6608.0 0.6219 (6608.0/10625.0)
YES 1545.0 9107.0 0.145 (1545.0/10652.0)
Total 5562.0 15715.0 0.3832 (8153.0/21277.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.535749234613 0.690787727083 225.0
max f2 0.0 0.833685528684 399.0
max f0point5 0.739763138725 0.650608441158 126.0
max accuracy 0.739763138725 0.649527658974 126.0
max precision 0.932305521477 0.720121028744 32.0
max absolute_MCC 0.739763138725 0.299226719466 126.0
max min_per_class_accuracy 0.728312954319 0.647296282388 132.0
ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.253085984236
R^2: -0.0178552398974
LogLoss: 0.790489553072
AUC: 0.690520950872
Gini: 0.381041901745

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.521000889499:
NO YES Error Rate
NO 661.0 1580.0 0.705 (1580.0/2241.0)
YES 262.0 2335.0 0.1009 (262.0/2597.0)
Total 923.0 3915.0 0.3807 (1842.0/4838.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.521000889499 0.717137592138 270.0
max f2 0.0472040587339 0.852985613874 396.0
max f0point5 0.762349250056 0.663602173005 149.0
max accuracy 0.669848414479 0.641794129806 199.0
max precision 0.99052248973 0.853333333333 4.0
max absolute_MCC 0.805816136548 0.277919599732 121.0
max min_per_class_accuracy 0.748005549067 0.634578359646 156.0
Scoring History:
timestamp duration number_of_trees training_MSE training_logloss training_AUC training_classification_error validation_MSE validation_logloss validation_AUC validation_classification_error
2015-10-11 21:07:45 0.115 sec 1.0 0.318919473641 7.64901795535 0.631786919112 0.416984975809 0.331630644875 8.24803173698 0.603882607828 0.463207937164
2015-10-11 21:07:45 0.182 sec 2.0 0.307790515286 6.46362700555 0.64185871205 0.418909431313 0.284096965633 3.71616915378 0.645607544627 0.427656056222
2015-10-11 21:07:45 0.256 sec 3.0 0.295672506433 5.15391598199 0.65219504425 0.401374141162 0.268619916852 1.90373785978 0.659125098348 0.416907813146
2015-10-11 21:07:45 0.352 sec 4.0 0.288414207742 4.4630843256 0.661142174101 0.405508756373 0.261902251277 1.36153885604 0.670341142949 0.373914840843
2015-10-11 21:07:45 0.457 sec 5.0 0.286070143609 3.98934827706 0.662768323553 0.416429126717 0.25816659639 1.06915287788 0.677400226843 0.384663083919
2015-10-11 21:07:45 0.562 sec 6.0 0.281893895669 3.49067198853 0.665554796673 0.417763157895 0.256098307475 0.945158071812 0.681186561159 0.381149235221
2015-10-11 21:07:45 0.680 sec 7.0 0.275626604403 3.03955067987 0.67408386543 0.395511512679 0.253889238677 0.867898233797 0.685682618378 0.385283174866
2015-10-11 21:07:45 0.808 sec 8.0 0.269722216644 2.55500697462 0.681868908562 0.395127776451 0.25303344922 0.83064625676 0.688120986062 0.396651508888
2015-10-11 21:07:45 0.950 sec 9.0 0.267002344542 2.23913856475 0.686387318577 0.402808643435 0.253456533697 0.80493621649 0.689018341797 0.379082265399
2015-10-11 21:07:46 1.097 sec 10.0 0.264451685762 1.97582874031 0.691222009675 0.383183719509 0.253085984236 0.790489553072 0.690520950872 0.380735841257
Variable Importances:
variable relative_importance scaled_importance percentage
Origin 6781.87548828 1.0 0.286853852122
fDayofMonth 6666.95556641 0.983054846396 0.281993069535
Dest 4654.37060547 0.686295496506 0.196866506866
Distance 1848.23608398 0.272525806051 0.0781750772684
fDayOfWeek 1794.95092773 0.26466881187 0.0759212682214
UniqueCarrier 1773.37683105 0.26148767168 0.0750087459037
fMonth 122.501937866 0.0180631357916 0.0051814800832
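
The training confusion matrices of the two models report different row totals (19,380 without balancing, 21,277 with it). The sketch below compares the per-class counts from those matrices. The counts suggest that `balance_classes=True` resamples the training data until the classes are roughly equal; that is a reading of the printed output, not a statement of H2O's exact sampling procedure.

```python
# Rows per actual class, copied from the two training confusion matrices above
no_balancing = {"NO": 8752, "YES": 10628}
with_balancing = {"NO": 10625, "YES": 10652}

def class_share(counts):
    """Return each class's fraction of the total row count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

share_before = class_share(no_balancing)
share_after = class_share(with_balancing)

# Growth of the minority class (NO) in the reported training rows
growth_no = with_balancing["NO"] / no_balancing["NO"]

print(share_before)
print(share_after)
print(round(growth_no, 3))
```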

In [8]:
air_test = h2o.import_file(path=_locate("smalldata/airlines/AirlinesTest.csv.zip"))


Parse Progress: [##################################################] 100%
Imported /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTest.csv.zip. Parsed 2,691 rows and 12 cols

In [9]:
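def model(model_object, test):
    """Score a trained model on `test` and print its main binomial metrics."""
    # Predict on the test frame and preview the first rows
    pred = model_object.predict(test)
    pred.head()
    # Compute performance metrics on the test set
    perf = model_object.model_performance(test)
    perf.show()
    # Confusion matrix at the max-F1 threshold
    print(perf.confusion_matrix())
    # Maximum precision and maximum accuracy, each printed as [[threshold, value]], then AUC
    print(perf.precision())
    print(perf.accuracy())
    print(perf.auc())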

In [10]:
print("\n\nWITHOUT CLASS BALANCING\n")
model(rf_no_bal, air_test)



WITHOUT CLASS BALANCING


ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.24177915311
R^2: 0.0239811939184
LogLoss: 0.776986945965
AUC: 0.709914051168
Gini: 0.419828102336

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.579077571258:
NO YES Error Rate
NO 488.0 729.0 0.599 (729.0/1217.0)
YES 196.0 1278.0 0.133 (196.0/1474.0)
Total 684.0 2007.0 0.3437 (925.0/2691.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.579077571258 0.734271760988 238.0
max f2 0.230467463533 0.858751759737 370.0
max f0point5 0.765756022717 0.695414515639 140.0
max accuracy 0.73900813212 0.662950575994 157.0
max precision 0.988231959264 0.866666666667 6.0
max absolute_MCC 0.744853460556 0.323654975185 154.0
max min_per_class_accuracy 0.744853460556 0.661462612983 154.0
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.579077571258:
NO YES Error Rate
NO 488.0 729.0 0.599 (729.0/1217.0)
YES 196.0 1278.0 0.133 (196.0/1474.0)
Total 684.0 2007.0 0.3437 (925.0/2691.0)
[[0.988231959263794, 0.8666666666666667]]
[[0.7390081321199734, 0.6629505759940543]]
0.709914051168
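
The figures in the report above can be recomputed from the four cells of the confusion matrix. The sketch below uses the standard definitions of error rate, precision, recall, and F1, with the counts copied from the test-set matrix of the model without class balancing.

```python
# Counts copied from the test-set confusion matrix (rows = actual, columns = predicted)
tn, fp = 488, 729    # actual NO
fn, tp = 196, 1278   # actual YES

total = tn + fp + fn + tp
error_rate = (fp + fn) / total          # share of misclassified rows
precision = tp / (tp + fp)              # of rows predicted YES, how many are YES
recall = tp / (tp + fn)                 # of rows that are YES, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(total, round(error_rate, 4), round(precision, 4), round(recall, 4), round(f1, 6))
```

The precision computed here is the precision at the max-F1 threshold. The `[[threshold, value]]` pair printed by `perf.precision()` is the maximum precision, which is reached at a different threshold.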

In [11]:
print("\n\nWITH CLASS BALANCING\n")
model(rf_bal, air_test)



WITH CLASS BALANCING


ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.245387801445
R^2: 0.00941373185811
LogLoss: 0.760182429595
AUC: 0.696770034194
Gini: 0.393540068389

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.497376537988:
NO YES Error Rate
NO 321.0 896.0 0.7362 (896.0/1217.0)
YES 106.0 1368.0 0.0719 (106.0/1474.0)
Total 427.0 2264.0 0.3724 (1002.0/2691.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.497376537988 0.731942215088 278.0
max f2 0.18838964109 0.860617399439 381.0
max f0point5 0.647380822133 0.681608665591 209.0
max accuracy 0.647380822133 0.66220735786 209.0
max precision 0.997039367558 0.846153846154 1.0
max absolute_MCC 0.647380822133 0.311837477405 209.0
max min_per_class_accuracy 0.747237667953 0.635990139688 153.0
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.497376537988:
NO YES Error Rate
NO 321.0 896.0 0.7362 (896.0/1217.0)
YES 106.0 1368.0 0.0719 (106.0/1474.0)
Total 427.0 2264.0 0.3724 (1002.0/2691.0)
[[0.9970393675582909, 0.8461538461538461]]
[[0.6473808221327922, 0.6622073578595318]]
0.696770034194
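
The Gini values in both test reports are consistent with the usual relation Gini = 2 * AUC - 1. The sketch below checks this against the printed numbers and puts the two test-set AUCs side by side; on this run, the model without class balancing has the slightly higher test AUC.

```python
# AUC and Gini values copied from the two test-set reports above
reports = {
    "without balancing": {"auc": 0.709914051168, "gini": 0.419828102336},
    "with balancing": {"auc": 0.696770034194, "gini": 0.393540068389},
}

for name, metrics in reports.items():
    gini_from_auc = 2 * metrics["auc"] - 1
    print(name, round(gini_from_auc, 6), round(metrics["gini"], 6))

# Difference in test AUC between the two models
auc_gap = reports["without balancing"]["auc"] - reports["with balancing"]["auc"]
print(round(auc_gap, 4))
```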