In [1]:
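# This is a demo of H2O's random forest with and without class balancing
# It imports a data set, parses it, and splits it into training and validation frames
# Then, it trains two random forests (balance_classes off and on) and scores both on a test set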
import h2o

In [2]:
h2o.init()


H2O cluster uptime: 5 minutes 14 seconds 128 milliseconds
H2O cluster version: 3.1.0.99999
H2O cluster name: ece
H2O cluster total nodes: 1
H2O cluster total memory: 4.44 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

In [3]:
air = h2o.upload_file(path=h2o.locate("smalldata/airlines/AirlinesTrain.csv.zip"))


Parse Progress: [##################################################] 100%
Uploaded py6f514d4e-23da-4051-9994-ddb299009665 into cluster with 24421 rows and 12 cols

In [4]:
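# Draw one uniform random number per row, then split roughly 80/20 into training and validation frames
r = air[0].runif()
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]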
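
The split above assigns each row by comparing a uniform random draw with 0.8. A minimal plain-Python sketch of the same idea follows; it is not H2O's implementation, and the helper name `split_rows` and its seed are illustrative assumptions.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=42):
    """Split rows by comparing one uniform draw per row to train_fraction.

    Hypothetical helper for illustration; it mimics the r < 0.8 / r >= 0.8
    masks used above, not H2O's runif() itself.
    """
    rng = random.Random(seed)
    draws = [rng.random() for _ in rows]  # one value in [0, 1) per row
    train = [row for row, r in zip(rows, draws) if r < train_fraction]
    valid = [row for row, r in zip(rows, draws) if r >= train_fraction]
    return train, valid

train, valid = split_rows(list(range(1000)))
print(len(train), len(valid))  # sizes are close to 800/200, not exact
```

Because every row is assigned independently, the proportions are only approximate. In this notebook the validation frame ends up with 4885 of the 24421 uploaded rows, about 20%.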

In [5]:
myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"

In [6]:
rf_no_bal = h2o.random_forest(x=air_train[myX], y=air_train[myY], validation_x=air_valid[myX],
                              validation_y=air_valid[myY], seed=12, ntrees=10, max_depth=20, balance_classes=False)
rf_no_bal.show()


drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Distributed RF
Model Key:  DRFModel__81f49ff2c23a04a37e910bcc58fb4215

Model Summary:

number_of_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
10.0             149878.0             20.0       20.0       20.0        879.0       1140.0      1023.3

ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.228111120262
R^2: 0.080357095838
LogLoss: 0.841536443761
AUC: 0.686450906072
Gini: 0.372901812144

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.286623619698:

       NO      YES      Error   Rate
NO     1569.0  7236.0   0.8218  (7236.0/8805.0)
YES    651.0   9877.0   0.0618  (651.0/10528.0)
Total  2220.0  17113.0  0.8836  (0.8836/19333.0)
Maximum Metrics:

metric threshold value idx
f1 0.286623619698 0.714663000615 325.0
f2 8.00260088661e-05 0.856701114818 399.0
f0point5 0.618131936925 0.67159241288 173.0
accuracy 0.510547936844 0.641028293591 224.0
precision 0.932283611338 0.819430814524 23.0
absolute_MCC 0.620255349514 0.282106612071 172.0
min_per_class_accuracy 0.574981335833 0.636493161094 194.0
tns 1.0 8708.0 0.0
fns 1.0 10192.0 0.0
fps 8.00260088661e-05 8805.0 399.0
tps 8.00260088661e-05 10528.0 399.0
tnr 1.0 0.988983532084 0.0
fnr 1.0 0.968085106383 0.0
fpr 8.00260088661e-05 1.0 399.0
tpr 8.00260088661e-05 1.0 399.0
ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.215203503359
R^2: 0.128107873875
LogLoss: 0.626040757587
AUC: 0.710222541095
Gini: 0.42044508219

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.351204755018:

       NO     YES     Error   Rate
NO     557.0  1608.0  0.7427  (1608.0/2165.0)
YES    216.0  2504.0  0.0794  (216.0/2720.0)
Total  773.0  4112.0  0.8221  (0.8221/4885.0)
Maximum Metrics:

metric threshold value idx
f1 0.351204755018 0.733021077283 322.0
f2 0.1770355165 0.863265826009 385.0
f0point5 0.642643034239 0.696284032377 169.0
accuracy 0.515855323573 0.659160696008 238.0
precision 0.957237901508 0.95 10.0
absolute_MCC 0.642643034239 0.316924653321 169.0
min_per_class_accuracy 0.574910362384 0.655889145497 205.0
tns 0.998260494554 2164.0 0.0
fns 0.998260494554 2717.0 0.0
fps 0.104244194428 2165.0 399.0
tps 0.104244194428 2720.0 399.0
tnr 0.998260494554 0.999538106236 0.0
fnr 0.998260494554 0.998897058824 0.0
fpr 0.104244194428 1.0 399.0
tpr 0.104244194428 1.0 399.0
Scoring History:

timestamp duration number_of_trees training_MSE training_logloss training_AUC training_classification_error validation_MSE validation_logloss validation_AUC validation_classification_error
2015-05-22 13:26:25 0.334 sec 1.0 0.257064819084 1.92745589155 0.645868499693 0.428229328074 0.400629273877 2.39647969232 0.655043642168 0.411463664278
2015-05-22 13:26:25 0.440 sec 2.0 0.253175703449 1.99065654625 0.653143456319 0.420079146593 0.363333128699 1.40768739382 0.67876443418 0.386489252815
2015-05-22 13:26:25 0.552 sec 3.0 0.250745679307 1.78042186773 0.653824627008 0.416884366836 0.329983010029 1.05156077229 0.692085654123 0.38792221085
2015-05-22 13:26:25 0.636 sec 4.0 0.244652116515 1.55450013636 0.663639012346 0.417550274223 0.300502917089 0.87597034314 0.700898315446 0.373797338792
2015-05-22 13:26:26 0.727 sec 5.0 0.240277797376 1.37589402537 0.669836126819 0.403455748175 0.276923048874 0.785013264077 0.698065904768 0.37011258956
2015-05-22 13:26:26 0.823 sec 6.0 0.23645204618 1.17572966227 0.675103136318 0.404289161913 0.255862667171 0.726255343312 0.701924840375 0.378915046059
2015-05-22 13:26:26 0.923 sec 7.0 0.23381891121 1.05318282497 0.677629857572 0.407652843095 0.239395686254 0.684070706264 0.704263771906 0.362128966223
2015-05-22 13:26:26 1.023 sec 8.0 0.231583272153 0.975679020094 0.680780516942 0.391148954063 0.226867426377 0.653676735705 0.707648332428 0.367656090072
2015-05-22 13:26:26 1.129 sec 9.0 0.22954871378 0.892928560467 0.684517249362 0.407442102524 0.219018616188 0.635320181722 0.708369107458 0.381166837257
2015-05-22 13:26:26 1.242 sec 10.0 0.228111120262 0.841536443761 0.686450906072 0.407955309574 0.215203503359 0.626040757587 0.710222541095 0.373387922211
Variable Importances:

variable relative_importance scaled_importance percentage
Origin 5074.82666016 1.0 0.332755556342
fDayofMonth 3977.50952148 0.783772488766 0.260804650545
Dest 2911.67407227 0.573748477978 0.19091799399
UniqueCarrier 1222.79492188 0.240953042096 0.080178463575
fDayOfWeek 1001.97784424 0.197440801694 0.0656995238122
Distance 992.55279541 0.195583585781 0.0650815248979
fMonth 69.5790481567 0.0137106255674 0.00456228683847
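
The last two columns of the Variable Importances table can be recovered from the first: scaled_importance matches each relative importance divided by the largest one, and percentage matches each value divided by the column total. The sketch below checks this on the numbers printed above; it is a consistency check on this output, not a description of H2O's internal code.

```python
# relative_importance values copied from the table above
relative = {
    "Origin": 5074.82666016,
    "fDayofMonth": 3977.50952148,
    "Dest": 2911.67407227,
    "UniqueCarrier": 1222.79492188,
    "fDayOfWeek": 1001.97784424,
    "Distance": 992.55279541,
    "fMonth": 69.5790481567,
}

top = max(relative.values())    # importance of the strongest variable
total = sum(relative.values())  # sum over all variables

scaled = {name: value / top for name, value in relative.items()}
percentage = {name: value / total for name, value in relative.items()}

for name in relative:
    print(f"{name:14s} {scaled[name]:.6f} {percentage[name]:.6f}")
```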

In [7]:
rf_bal = h2o.random_forest(x=air_train[myX], y=air_train[myY], validation_x=air_valid[myX],
                           validation_y=air_valid[myY], seed=12, ntrees=10, max_depth=20, balance_classes=True)
rf_bal.show()


drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Distributed RF
Model Key:  DRFModel__924d4015f4c523d250c26449b16945f4

Model Summary:

number_of_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
10.0             161279.0             20.0       20.0       20.0        1027.0      1201.0      1095.7

ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.227050947395
R^2: 0.084631242813
LogLoss: 0.78142590824
AUC: 0.704638202454
Gini: 0.409276404907

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.401939962958:

       NO      YES      Error   Rate
NO     3873.0  6648.0   0.6319  (6648.0/10521.0)
YES    1460.0  9070.0   0.1387  (1460.0/10530.0)
Total  5333.0  15718.0  0.7706  (0.7706/21051.0)
Maximum Metrics:

metric threshold value idx
f1 0.401939962958 0.691100274307 264.0
f2 0.0 0.833452058698 399.0
f0point5 0.605764139792 0.653672952435 170.0
accuracy 0.586745397134 0.652368058525 180.0
precision 0.978060912989 0.836501901141 5.0
absolute_MCC 0.586745397134 0.304762140786 180.0
min_per_class_accuracy 0.582231845092 0.650603554795 182.0
tns 1.0 10452.0 0.0
fns 1.0 10199.0 0.0
fps 0.0 10521.0 399.0
tps 0.0 10530.0 399.0
tnr 1.0 0.993441688052 0.0
fnr 1.0 0.968566001899 0.0
fpr 0.0 1.0 399.0
tpr 0.0 1.0 399.0
ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.216562495458
R^2: 0.122601948124
LogLoss: 0.623880060288
AUC: 0.708433551827
Gini: 0.416867103654

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.405045186133:

       NO      YES     Error   Rate
NO     731.0   1434.0  0.6624  (1434.0/2165.0)
YES    295.0   2425.0  0.1085  (295.0/2720.0)
Total  1026.0  3859.0  0.7709  (0.7709/4885.0)
Maximum Metrics:

metric threshold value idx
f1 0.405045186133 0.737194102447 289.0
f2 0.135270716497 0.863416804373 386.0
f0point5 0.600421656558 0.690548294549 192.0
accuracy 0.505934494591 0.657932446264 243.0
precision 0.996011784004 1.0 0.0
absolute_MCC 0.620419824493 0.301529439421 180.0
min_per_class_accuracy 0.591067527012 0.648161764706 197.0
tns 0.996011784004 2165.0 0.0
fns 0.996011784004 2719.0 0.0
fps 0.0505929587793 2165.0 399.0
tps 0.0684239245551 2720.0 398.0
tnr 0.996011784004 1.0 0.0
fnr 0.996011784004 0.999632352941 0.0
fpr 0.0505929587793 1.0 399.0
tpr 0.0684239245551 1.0 398.0
Scoring History:

timestamp duration number_of_trees training_MSE training_logloss training_AUC training_classification_error validation_MSE validation_logloss validation_AUC validation_classification_error
2015-05-22 13:26:27 0.104 sec 1.0 0.255954985197 2.08023432225 0.662981914014 0.461290738117 0.405418554538 2.54355159204 0.645635358647 0.412282497441
2015-05-22 13:26:27 0.155 sec 2.0 0.25468472297 1.86352558991 0.661525255155 0.423780968913 0.370926635584 1.420635165 0.676132573699 0.414738996929
2015-05-22 13:26:27 0.233 sec 3.0 0.249598393724 1.57169101175 0.668207041164 0.432992295061 0.338066318176 1.03915856891 0.691084601277 0.367860798362
2015-05-22 13:26:27 0.328 sec 4.0 0.244778275953 1.39647444431 0.674642688382 0.430043687689 0.309120172949 0.890395753219 0.697113758321 0.380757420676
2015-05-22 13:26:27 0.424 sec 5.0 0.240958193613 1.29019435107 0.681163088793 0.407940914567 0.284701124531 0.812070685409 0.696648128651 0.371136131013
2015-05-22 13:26:27 0.525 sec 6.0 0.236190410816 1.10655931453 0.688713356476 0.403375314861 0.262138591924 0.736384904024 0.6989946169 0.367656090072
2015-05-22 13:26:27 0.632 sec 7.0 0.233016124122 0.970814521363 0.693609936492 0.394177426481 0.244223116357 0.689816810281 0.702110956392 0.368884339816
2015-05-22 13:26:27 0.745 sec 8.0 0.230513359481 0.901192589552 0.698361749513 0.385131547188 0.230494394698 0.657428244182 0.705417827062 0.362128966223
2015-05-22 13:26:27 0.866 sec 9.0 0.22875793945 0.84306688264 0.701440488461 0.382718409482 0.221628858232 0.636504280496 0.705721114658 0.372569089048
2015-05-22 13:26:27 0.989 sec 10.0 0.227050947395 0.78142590824 0.704638202454 0.385159849888 0.216562495458 0.623880060288 0.708433551827 0.353940634596
Variable Importances:

variable relative_importance scaled_importance percentage
Origin 5626.28417969 1.0 0.327272142551
fDayofMonth 4230.62353516 0.751939184023 0.246088747823
Dest 3676.58178711 0.653465354698 0.213861006715
UniqueCarrier 1316.92211914 0.234066050893 0.0766032979742
fDayOfWeek 1163.80688477 0.206851777763 0.0676968244989
Distance 1089.97692871 0.193729448051 0.063402251539
fMonth 87.2591629028 0.0155091993429 0.00507572889822
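
With balance_classes=True the training confusion matrix counts 10521 NO and 10530 YES rows, against 8805 and 10528 without it, which suggests the minority class was oversampled to near parity. The following is a minimal sketch of random oversampling in plain Python. It is an assumption about the general technique, not H2O's internal code, and `oversample_minority` is a hypothetical helper.

```python
import random
from collections import Counter

def oversample_minority(labels, seed=12):
    """Return row indices in which every class appears as often as the largest one.

    Hypothetical illustration of random oversampling with replacement.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        # Draw extra rows with replacement until the class reaches the target size
        balanced.extend(rng.choice(indices) for _ in range(target - len(indices)))
    return balanced

# Class counts taken from the training confusion matrix of the unbalanced run
labels = ["NO"] * 8805 + ["YES"] * 10528
balanced = oversample_minority(labels)
print(Counter(labels[i] for i in balanced))
```

Oversampling changes only the training distribution. The validation and test frames keep their original class proportions, so metrics on them remain comparable between the two models.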

In [8]:
air_test = h2o.import_frame(path=h2o.locate("smalldata/airlines/AirlinesTest.csv.zip"))


Parse Progress: [##################################################] 100%
Imported  /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTest.csv.zip . Parsed 2,691 rows and 12 cols

In [9]:
def model(model_object, test):
    """Score a trained model on a test frame and print its key metrics."""
    # Predict on the test frame and display the first rows
    pred = model_object.predict(test)
    pred.head()
    # Build performance metrics (including the confusion matrix) for the test set
    perf = model_object.model_performance(test)
    perf.show()
    print(perf.confusion_matrix())
    print(perf.precision())
    print(perf.accuracy())
    print(perf.auc())
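
The threshold-based metrics reported by this helper all derive from the four cells of a confusion matrix. The sketch below recomputes several of them in plain Python from the validation confusion matrix reported earlier for rf_no_bal (at its max-F1 threshold). The function name `confusion_metrics` is an illustrative assumption, not part of the H2O API.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Compute common binary metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "f1": 2 * precision * recall / (precision + recall),
        "error_NO": fp / (tn + fp),   # actual NO predicted as YES
        "error_YES": fn / (fn + tp),  # actual YES predicted as NO
    }

# Validation counts for the model trained without class balancing
metrics = confusion_metrics(tn=557, fp=1608, fn=216, tp=2504)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The f1 value agrees with the 0.733021077283 in the validation Maximum Metrics table, and the per-class errors agree with the 0.7427 and 0.0794 in the matrix. Note that the value in the Total row of the printed matrices (0.8221 here) equals the sum of the two per-class error rates; the overall error at this threshold is (1608 + 216) / 4885, about 0.3734, which matches the final validation_classification_error in the scoring history.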

In [10]:
print("\n\nWITHOUT CLASS BALANCING\n")
model(rf_no_bal, air_test)



WITHOUT CLASS BALANCING

First 10 rows and first 3 columns: 
Row ID predict NO YES
1 YES 0.2999211110174656 0.7000788889825345
2 YES 0.3735275126993656 0.6264724873006344
3 YES 0.22238414585590363 0.7776158541440964
4 YES 0.3962472975254059 0.6037527024745941
5 YES 0.6098413661122322 0.39015863388776784
6 YES 0.4950307622551918 0.5049692377448082
7 NO 0.6746981769800187 0.32530182301998134
8 YES 0.48598509430885317 0.5140149056911468
9 NO 0.6735334724187851 0.32646652758121486
10 NO 0.7184682190418243 0.28153178095817566
ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.209662417539
R^2: 0.153945177679
LogLoss: 0.618837191629
AUC: 0.731046158615
Gini: 0.462092317229

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.403470018009:

       NO     YES     Error   Rate
NO     408.0  809.0   0.6647  (809.0/1217.0)
YES    129.0  1345.0  0.0875  (129.0/1474.0)
Total  537.0  2154.0  0.7522  (0.7522/2691.0)
Maximum Metrics:

metric threshold value idx
f1 0.403470018009 0.741455347299 293.0
f2 0.131483560801 0.858774178513 397.0
f0point5 0.577017590124 0.706999149901 203.0
accuracy 0.545709063964 0.678558156819 219.0
precision 0.949203286087 0.970588235294 13.0
absolute_MCC 0.545709063964 0.348789541022 219.0
min_per_class_accuracy 0.579830584209 0.672998643148 201.0
tns 1.0 1216.0 0.0
fns 1.0 1474.0 0.0
fps 0.119230582317 1217.0 399.0
tps 0.131483560801 1474.0 397.0
tnr 1.0 0.999178307313 0.0
fnr 1.0 1.0 0.0
fpr 0.119230582317 1.0 399.0
tpr 0.131483560801 1.0 397.0
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.403470018009:

       NO     YES     Error   Rate
NO     408.0  809.0   0.6647  (809.0/1217.0)
YES    129.0  1345.0  0.0875  (129.0/1474.0)
Total  537.0  2154.0  0.7522  (0.7522/2691.0)
[[0.9492032860871406, 0.9705882352941176]]
[[0.5457090639642307, 0.6785581568190264]]
0.731046158615

In [11]:
print("\n\nWITH CLASS BALANCING\n")
model(rf_bal, air_test)



WITH CLASS BALANCING

First 10 rows and first 3 columns: 
Row ID predict NO YES
1 YES 0.25423263730284795 0.7457673626971522
2 YES 0.3061814057479045 0.6938185942520956
3 YES 0.29582113197078996 0.7041788680292099
4 YES 0.24460687396796132 0.7553931260320388
5 YES 0.5550336349109918 0.44496636508900816
6 YES 0.5633564660627113 0.4366435339372887
7 NO 0.6514019680463551 0.3485980319536449
8 YES 0.41344391039693884 0.5865560896030612
9 NO 0.7010735205237005 0.2989264794762995
10 NO 0.5986760058318191 0.40132399416818093
ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.215031073082
R^2: 0.13228093778
LogLoss: 0.623177900509
AUC: 0.716619152687
Gini: 0.433238305373

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.431652753871:

       NO     YES     Error   Rate
NO     456.0  761.0   0.6253  (761.0/1217.0)
YES    172.0  1302.0  0.1167  (172.0/1474.0)
Total  628.0  2063.0  0.742   (0.742/2691.0)
Maximum Metrics:

metric threshold value idx
f1 0.431652753871 0.736217133164 279.0
f2 0.168528514213 0.859950859951 380.0
f0point5 0.604570530568 0.697193500739 189.0
accuracy 0.566351616261 0.669267930137 209.0
precision 0.997537422127 1.0 0.0
absolute_MCC 0.566351616261 0.330678623344 209.0
min_per_class_accuracy 0.593757553889 0.661462612983 195.0
tns 0.997537422127 1217.0 0.0
fns 0.997537422127 1473.0 0.0
fps 0.0562118133373 1217.0 399.0
tps 0.115491940237 1474.0 394.0
tnr 0.997537422127 1.0 0.0
fnr 0.997537422127 0.999321573948 0.0
fpr 0.0562118133373 1.0 399.0
tpr 0.115491940237 1.0 394.0
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.431652753871:

       NO     YES     Error   Rate
NO     456.0  761.0   0.6253  (761.0/1217.0)
YES    172.0  1302.0  0.1167  (172.0/1474.0)
Total  628.0  2063.0  0.742   (0.742/2691.0)
[[0.997537422127463, 1.0]]
[[0.5663516162614709, 0.6692679301374953]]
0.716619152687
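
On this test set the model trained without class balancing has the slightly higher AUC (0.731 against 0.717). The Gini values in the two test reports follow from those AUCs through the standard relation Gini = 2 * AUC - 1, which the short check below reproduces; the function name `gini_from_auc` is an illustrative assumption.

```python
def gini_from_auc(auc):
    """Gini coefficient implied by an AUC value (Gini = 2 * AUC - 1)."""
    return 2.0 * auc - 1.0

# Test-set AUC values printed above
auc_no_bal = 0.731046158615
auc_bal = 0.716619152687

print(gini_from_auc(auc_no_bal))
print(gini_from_auc(auc_bal))
```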