In [1]:

    
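# This is a demo of H2O's random forest (DRF) with and without class balancing
# It uploads the airlines data set and splits it into training and validation frames
# Then, it builds one model with balance_classes=False and one with balance_classes=True and compares them on a test set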
import h2o



In [2]:

    
h2o.init()









    




H2O cluster uptime:         16 minutes 29 seconds 988 milliseconds
H2O cluster version:        3.5.0.99999
H2O cluster name:           ece
H2O cluster total nodes:    1
H2O cluster total memory:   10.67 GB
H2O cluster total cores:    8
H2O cluster allowed cores:  8
H2O cluster healthy:        True
H2O Connection ip:          127.0.0.1
H2O Connection port:        54321



In [3]:

    
from h2o.utils.shared_utils import _locate  # private helper; finds files within the h2o git project directory

air = h2o.upload_file(path=_locate("smalldata/airlines/AirlinesTrain.csv.zip"))









    



Parse Progress: [##################################################] 100%
Uploaded py94b053cc-68c5-4746-aecc-2a9cea269d7f into cluster with 24,421 rows and 12 cols



In [4]:

    
r = air[0].runif()
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]
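The split above draws one uniform random number per row and sends rows with a draw below 0.8 to the training frame and the rest to the validation frame. A minimal plain-Python sketch of the same idea, runnable without an H2O cluster; the `split_rows` helper, its seed, and the list-of-rows stand-in are illustrative assumptions, not part of the H2O API:

```python
import random

def split_rows(rows, train_fraction=0.8, seed=42):
    """Split rows into (train, valid) using one uniform draw per row."""
    rng = random.Random(seed)
    train, valid = [], []
    for row in rows:
        # Each row gets its own draw in [0, 1), mirroring r = air[0].runif()
        if rng.random() < train_fraction:
            train.append(row)
        else:
            valid.append(row)
    return train, valid

rows = list(range(24421))  # stand-in for the 24,421 uploaded rows
train, valid = split_rows(rows)
print(len(train), len(valid))
```

Because membership depends on random draws, the two parts are only approximately 80% and 20% of the rows, and the exact sizes change with the seed.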



In [5]:

    
myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"



In [6]:

    
rf_no_bal = h2o.random_forest(x=air_train[myX], y=air_train[myY], validation_x=air_valid[myX],
                              validation_y=air_valid[myY], seed=12, ntrees=10, max_depth=20, balance_classes=False)
rf_no_bal.show()









    



drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Distributed RF
Model Key:  DRF_model_python_1444621872790_39

Model Summary:






    





number_of_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
10.0             308107.0             20.0       20.0       20.0        1838.0      2497.0      2279.7






    




ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.265275494969
R^2: -0.0711389804007
LogLoss: 2.29024516759
AUC: 0.661677793181
Gini: 0.323355586361
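The Gini value reported here is tied to the AUC by the standard identity Gini = 2 * AUC - 1. A small sketch that checks this against the training numbers above; `gini_from_auc` is a hypothetical helper name used only for illustration:

```python
def gini_from_auc(auc):
    """Gini coefficient implied by an AUC value: Gini = 2 * AUC - 1."""
    return 2.0 * auc - 1.0

train_auc = 0.661677793181  # training AUC reported above
train_gini = gini_from_auc(train_auc)
print(round(train_gini, 6))
```

An AUC of 0.5 (random ranking) therefore maps to a Gini of 0, and an AUC of 1.0 maps to a Gini of 1.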

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.44272284619:






    





       NO      YES      Error   Rate
NO     2295.0  6457.0   0.7378  (6457.0/8752.0)
YES    1071.0  9557.0   0.1008  (1071.0/10628.0)
Total  3366.0  16014.0  0.3884  (7528.0/19380.0)
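The Error column and the max F1 value can be recomputed from the four counts in this confusion matrix. The sketch below assumes rows are actual classes, columns are predicted classes, and YES is the positive class; under those assumptions it reproduces the reported figures:

```python
# Counts copied from the training confusion matrix (rows = actual, columns = predicted)
tn, fp = 2295, 6457  # actual NO: predicted NO, predicted YES
fn, tp = 1071, 9557  # actual YES: predicted NO, predicted YES

no_error = fp / (tn + fp)                      # share of actual NO rows predicted YES
yes_error = fn / (fn + tp)                     # share of actual YES rows predicted NO
total_error = (fp + fn) / (tn + fp + fn + tp)  # overall misclassification rate
f1 = 2 * tp / (2 * tp + fp + fn)               # harmonic mean of precision and recall

print(round(no_error, 4), round(yes_error, 4), round(total_error, 4), round(f1, 4))
```

At this threshold most of the error comes from NO rows predicted as YES, which is why the per-class error rates differ so sharply even though the overall error is about 39%.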






    



Maximum Metrics: Maximum metrics at their respective thresholds







    




metric                      threshold       value           idx
max f1                      0.44272284619   0.717438630733  272.0
max f2                      0.0             0.858592386738  399.0
max f0point5                0.698268730884  0.659881812213  158.0
max accuracy                0.650252095381  0.628947368421  181.0
max precision               0.960446300909  0.721248630887  21.0
max absolute_MCC            0.698268730884  0.246344466653  158.0
max min_per_class_accuracy  0.749250828256  0.621229433272  134.0






    



ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.250504652341
R^2: -0.00747370018869
LogLoss: 0.768558939899
AUC: 0.689886229554
Gini: 0.379772459109

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.486017723481:






    





       NO     YES     Error   Rate
NO     645.0  1596.0  0.7122  (1596.0/2241.0)
YES    231.0  2366.0  0.0889  (231.0/2597.0)
Total  876.0  3962.0  0.3776  (1827.0/4838.0)
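The slightly negative R^2 on the validation data can be understood from the MSE and the class counts. The sketch below assumes the common definition R^2 = 1 - MSE / Var(y), with Var(y) = p * (1 - p) for a 0/1 response; that definition is an assumption about how the metric is computed, although it does reproduce the reported value:

```python
# Class counts from the validation confusion matrix and the reported validation MSE
n_no, n_yes = 2241, 2597
mse = 0.250504652341

p_yes = n_yes / (n_no + n_yes)  # share of YES rows in the validation frame
variance = p_yes * (1 - p_yes)  # variance of a 0/1 response with that share
r2 = 1 - mse / variance

print(round(r2, 6))
```

An R^2 below zero means the squared error of the predicted probabilities is slightly worse than always predicting the base rate `p_yes`.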






    



Maximum Metrics: Maximum metrics at their respective thresholds







    




metric                      threshold        value           idx
max f1                      0.486017723481   0.721451440768  284.0
max f2                      0.0664381176233  0.852929584866  398.0
max f0point5                0.690310050891   0.662157351371  187.0
max accuracy                0.690310050891   0.640553947912  187.0
max precision               0.996994908052   0.901639344262  1.0
max absolute_MCC            0.690310050891   0.272976256064  187.0
max min_per_class_accuracy  0.73961018417    0.63226800154   159.0






    



Scoring History:






    





timestamp            duration   number_of_trees  training_MSE    training_logloss  training_AUC    training_classification_error  validation_MSE  validation_logloss  validation_AUC  validation_classification_error
2015-10-11 21:07:43  0.095 sec  1.0              0.32235484164   7.662491719       0.608674793325  0.455644594405                 0.335928752684  7.97500838576       0.594303281667  0.463207937164
2015-10-11 21:07:43  0.171 sec  2.0              0.315088056893  6.7142159325      0.611138413767  0.452977518449                 0.286013782525  3.64715325721       0.635786804429  0.413187267466
2015-10-11 21:07:43  0.244 sec  3.0              0.302349898977  5.80199845388     0.621654880365  0.406267142074                 0.272317328196  2.23120408384       0.651894790904  0.415047540306
2015-10-11 21:07:44  0.337 sec  4.0              0.298117091735  5.30789756972     0.624075773238  0.40820668693                  0.263460295802  1.63304931546       0.666612627724  0.403679206284
2015-10-11 21:07:44  0.427 sec  5.0              0.288127760368  4.52047928228     0.634770920373  0.407384230288                 0.259250391801  1.30688458768       0.671739797937  0.397064902852
2015-10-11 21:07:44  0.515 sec  6.0              0.282972510681  3.99632739574     0.639444594454  0.405887502736                 0.255525177893  1.07447770858       0.679644002786  0.396031417941
2015-10-11 21:07:44  0.609 sec  7.0              0.278459832224  3.43743009325     0.642955984647  0.408359100117                 0.252968181208  0.926846750591      0.684167208345  0.391070690368
2015-10-11 21:07:44  0.710 sec  8.0              0.273494367166  3.01547825092     0.650657012799  0.395480522204                 0.252163227511  0.886088862974      0.686932988446  0.391897478297
2015-10-11 21:07:44  0.822 sec  9.0              0.267895919558  2.59374086495     0.657471165348  0.393676669089                 0.250662445622  0.807979165782      0.688616099619  0.377842083506
2015-10-11 21:07:44  0.937 sec  10.0             0.265275494969  2.29024516759     0.661677793181  0.388441692466                 0.250504652341  0.768558939899      0.689886229554  0.377635386523






    



Variable Importances:






    




variable       relative_importance  scaled_importance  percentage
fDayofMonth    6308.44042969        1.0                0.295695234969
Origin         6097.22363281        0.966518381329     0.285794879869
Dest           3880.18457031        0.61507826119      0.181875710967
fDayOfWeek     1679.60095215        0.266246621628     0.078727857342
Distance       1669.28149414        0.264610804009     0.0782441538666
UniqueCarrier  1566.25708008        0.248279602151     0.0734150952961
fMonth         133.276596069        0.0211267107227    0.00624706769126
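The three importance columns carry the same information at different scales: the figures are consistent with `scaled_importance` being the relative importance divided by the largest value, and `percentage` being the relative importance divided by the column sum. A short sketch that recomputes both from the reported `relative_importance` values:

```python
# relative_importance values copied from the table for the model without class balancing
relative = {
    "fDayofMonth": 6308.44042969,
    "Origin": 6097.22363281,
    "Dest": 3880.18457031,
    "fDayOfWeek": 1679.60095215,
    "Distance": 1669.28149414,
    "UniqueCarrier": 1566.25708008,
    "fMonth": 133.276596069,
}

top = max(relative.values())
total = sum(relative.values())
scaled = {name: value / top for name, value in relative.items()}
percentage = {name: value / total for name, value in relative.items()}

for name in relative:
    print(name, round(scaled[name], 4), round(percentage[name], 4))
```

Read this way, `fDayofMonth`, `Origin`, and `Dest` together account for roughly three quarters of the total importance, while `fMonth` contributes well under 1%.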



In [7]:

    
rf_bal = h2o.random_forest(x=air_train[myX], y=air_train[myY], validation_x=air_valid[myX],
                           validation_y=air_valid[myY], seed=12, ntrees=10, max_depth=20, balance_classes=True)
rf_bal.show()









    



drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Distributed RF
Model Key:  DRF_model_python_1444621872790_41

Model Summary:






    





number_of_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
10.0             309235.0             20.0       20.0       20.0        2000.0      2500.0      2283.0






    




ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.264451685762
R^2: -0.0578084464369
LogLoss: 1.97582874031
AUC: 0.691222009675
Gini: 0.38244401935

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.535749234613:






    





       NO      YES      Error   Rate
NO     4017.0  6608.0   0.6219  (6608.0/10625.0)
YES    1545.0  9107.0   0.145   (1545.0/10652.0)
Total  5562.0  15715.0  0.3832  (8153.0/21277.0)
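The row totals show the effect of `balance_classes=True` on the training data: the unbalanced model was scored on 8,752 NO and 10,628 YES rows, while this one reports 10,625 NO and 10,652 YES rows, which is consistent with the smaller class being resampled until the classes are roughly even. A plain-Python sketch of random oversampling follows; `oversample_minority` is a hypothetical helper, and H2O's internal sampling procedure may differ from it:

```python
import random
from collections import Counter

def oversample_minority(labels, seed=12):
    """Return row indices in which every class appears as often as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        # Draw extra rows with replacement until this class reaches the target size
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced

labels = ["NO"] * 8752 + ["YES"] * 10628  # class counts from the unbalanced training matrix
balanced = oversample_minority(labels)
counts = Counter(labels[i] for i in balanced)
print(counts["NO"], counts["YES"])
```

In this sketch the oversampled rows are exact duplicates, so balancing changes how much weight each class receives during training without adding any new information.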






    



Maximum Metrics: Maximum metrics at their respective thresholds







    




metric                      threshold       value           idx
max f1                      0.535749234613  0.690787727083  225.0
max f2                      0.0             0.833685528684  399.0
max f0point5                0.739763138725  0.650608441158  126.0
max accuracy                0.739763138725  0.649527658974  126.0
max precision               0.932305521477  0.720121028744  32.0
max absolute_MCC            0.739763138725  0.299226719466  126.0
max min_per_class_accuracy  0.728312954319  0.647296282388  132.0






    



ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.253085984236
R^2: -0.0178552398974
LogLoss: 0.790489553072
AUC: 0.690520950872
Gini: 0.381041901745

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.521000889499:






    





       NO     YES     Error   Rate
NO     661.0  1580.0  0.705   (1580.0/2241.0)
YES    262.0  2335.0  0.1009  (262.0/2597.0)
Total  923.0  3915.0  0.3807  (1842.0/4838.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds







    




metric                      threshold        value           idx
max f1                      0.521000889499   0.717137592138  270.0
max f2                      0.0472040587339  0.852985613874  396.0
max f0point5                0.762349250056   0.663602173005  149.0
max accuracy                0.669848414479   0.641794129806  199.0
max precision               0.99052248973    0.853333333333  4.0
max absolute_MCC            0.805816136548   0.277919599732  121.0
max min_per_class_accuracy  0.748005549067   0.634578359646  156.0






    



Scoring History:






    





timestamp            duration   number_of_trees  training_MSE    training_logloss  training_AUC    training_classification_error  validation_MSE  validation_logloss  validation_AUC  validation_classification_error
2015-10-11 21:07:45  0.115 sec  1.0              0.318919473641  7.64901795535     0.631786919112  0.416984975809                 0.331630644875  8.24803173698       0.603882607828  0.463207937164
2015-10-11 21:07:45  0.182 sec  2.0              0.307790515286  6.46362700555     0.64185871205   0.418909431313                 0.284096965633  3.71616915378       0.645607544627  0.427656056222
2015-10-11 21:07:45  0.256 sec  3.0              0.295672506433  5.15391598199     0.65219504425   0.401374141162                 0.268619916852  1.90373785978       0.659125098348  0.416907813146
2015-10-11 21:07:45  0.352 sec  4.0              0.288414207742  4.4630843256      0.661142174101  0.405508756373                 0.261902251277  1.36153885604       0.670341142949  0.373914840843
2015-10-11 21:07:45  0.457 sec  5.0              0.286070143609  3.98934827706     0.662768323553  0.416429126717                 0.25816659639   1.06915287788       0.677400226843  0.384663083919
2015-10-11 21:07:45  0.562 sec  6.0              0.281893895669  3.49067198853     0.665554796673  0.417763157895                 0.256098307475  0.945158071812      0.681186561159  0.381149235221
2015-10-11 21:07:45  0.680 sec  7.0              0.275626604403  3.03955067987     0.67408386543   0.395511512679                 0.253889238677  0.867898233797      0.685682618378  0.385283174866
2015-10-11 21:07:45  0.808 sec  8.0              0.269722216644  2.55500697462     0.681868908562  0.395127776451                 0.25303344922   0.83064625676       0.688120986062  0.396651508888
2015-10-11 21:07:45  0.950 sec  9.0              0.267002344542  2.23913856475     0.686387318577  0.402808643435                 0.253456533697  0.80493621649       0.689018341797  0.379082265399
2015-10-11 21:07:46  1.097 sec  10.0             0.264451685762  1.97582874031     0.691222009675  0.383183719509                 0.253085984236  0.790489553072      0.690520950872  0.380735841257






    



Variable Importances:






    




variable       relative_importance  scaled_importance  percentage
Origin         6781.87548828        1.0                0.286853852122
fDayofMonth    6666.95556641        0.983054846396     0.281993069535
Dest           4654.37060547        0.686295496506     0.196866506866
Distance       1848.23608398        0.272525806051     0.0781750772684
fDayOfWeek     1794.95092773        0.26466881187      0.0759212682214
UniqueCarrier  1773.37683105        0.26148767168      0.0750087459037
fMonth         122.501937866        0.0180631357916    0.0051814800832



In [8]:

    
air_test = h2o.import_file(path=_locate("smalldata/airlines/AirlinesTest.csv.zip"))









    



Parse Progress: [##################################################] 100%
Imported /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTest.csv.zip. Parsed 2,691 rows and 12 cols



In [9]:

    
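def model(model_object, test):
    """Score model_object on the test frame and print its key binomial metrics."""
    # Predict on the test frame and preview the first rows of the predictions
    pred = model_object.predict(test)
    pred.head()
    # Compute performance metrics (including the confusion matrix) on the test frame
    perf = model_object.model_performance(test)
    perf.show()
    print(perf.confusion_matrix())
    # precision() and accuracy() print a [[threshold, value]] pair taken at the threshold that maximizes the metric
    print(perf.precision())
    print(perf.accuracy())
    print(perf.auc())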



In [10]:

    
print("\n\nWITHOUT CLASS BALANCING\n")
model(rf_no_bal, air_test)









    




WITHOUT CLASS BALANCING


ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.24177915311
R^2: 0.0239811939184
LogLoss: 0.776986945965
AUC: 0.709914051168
Gini: 0.419828102336

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.579077571258:






    





       NO     YES     Error   Rate
NO     488.0  729.0   0.599   (729.0/1217.0)
YES    196.0  1278.0  0.133   (196.0/1474.0)
Total  684.0  2007.0  0.3437  (925.0/2691.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds







    




metric                      threshold       value           idx
max f1                      0.579077571258  0.734271760988  238.0
max f2                      0.230467463533  0.858751759737  370.0
max f0point5                0.765756022717  0.695414515639  140.0
max accuracy                0.73900813212   0.662950575994  157.0
max precision               0.988231959264  0.866666666667  6.0
max absolute_MCC            0.744853460556  0.323654975185  154.0
max min_per_class_accuracy  0.744853460556  0.661462612983  154.0






    



Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.579077571258:






    





       NO     YES     Error   Rate
NO     488.0  729.0   0.599   (729.0/1217.0)
YES    196.0  1278.0  0.133   (196.0/1474.0)
Total  684.0  2007.0  0.3437  (925.0/2691.0)






    



[[0.988231959263794, 0.8666666666666667]]
[[0.7390081321199734, 0.6629505759940543]]
0.709914051168
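The last three printed lines are the raw return values of `perf.precision()`, `perf.accuracy()`, and `perf.auc()`. The first two are nested lists holding one `[threshold, value]` pair, and both pairs match the "max precision" and "max accuracy" rows of the Maximum Metrics table. A small sketch of unpacking them, using the printed values as literals; `unpack_pair` is a hypothetical helper, not an H2O function:

```python
# Values copied from the printed output for the model without class balancing
precision_pairs = [[0.988231959263794, 0.8666666666666667]]
accuracy_pairs = [[0.7390081321199734, 0.6629505759940543]]

def unpack_pair(pairs):
    """Return (threshold, value) from a single-entry [[threshold, value]] list."""
    threshold, value = pairs[0]
    return threshold, value

precision_threshold, best_precision = unpack_pair(precision_pairs)
accuracy_threshold, best_accuracy = unpack_pair(accuracy_pairs)
print(f"precision {best_precision:.4f} at threshold {precision_threshold:.4f}")
print(f"accuracy  {best_accuracy:.4f} at threshold {accuracy_threshold:.4f}")
```

The two maxima occur at different thresholds, so they describe two different operating points of the same model and cannot both be achieved at once.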



In [11]:

    
print("\n\nWITH CLASS BALANCING\n")
model(rf_bal, air_test)









    




WITH CLASS BALANCING


ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.245387801445
R^2: 0.00941373185811
LogLoss: 0.760182429595
AUC: 0.696770034194
Gini: 0.393540068389

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.497376537988:






    





       NO     YES     Error   Rate
NO     321.0  896.0   0.7362  (896.0/1217.0)
YES    106.0  1368.0  0.0719  (106.0/1474.0)
Total  427.0  2264.0  0.3724  (1002.0/2691.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds







    




metric                      threshold       value           idx
max f1                      0.497376537988  0.731942215088  278.0
max f2                      0.18838964109   0.860617399439  381.0
max f0point5                0.647380822133  0.681608665591  209.0
max accuracy                0.647380822133  0.66220735786   209.0
max precision               0.997039367558  0.846153846154  1.0
max absolute_MCC            0.647380822133  0.311837477405  209.0
max min_per_class_accuracy  0.747237667953  0.635990139688  153.0






    



Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.497376537988:






    





       NO     YES     Error   Rate
NO     321.0  896.0   0.7362  (896.0/1217.0)
YES    106.0  1368.0  0.0719  (106.0/1474.0)
Total  427.0  2264.0  0.3724  (1002.0/2691.0)






    



[[0.9970393675582909, 0.8461538461538461]]
[[0.6473808221327922, 0.6622073578595318]]
0.696770034194
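With both models scored on the same test frame, their headline metrics can be compared side by side. The sketch below uses only the values printed above; the `higher_is_better` mapping encodes the usual convention that lower MSE and LogLoss and higher AUC are preferable:

```python
# Test-set metrics copied from the two outputs above
test_metrics = {
    "without balancing": {"MSE": 0.24177915311, "LogLoss": 0.776986945965, "AUC": 0.709914051168},
    "with balancing": {"MSE": 0.245387801445, "LogLoss": 0.760182429595, "AUC": 0.696770034194},
}
higher_is_better = {"MSE": False, "LogLoss": False, "AUC": True}

winners = {}
for metric, higher in higher_is_better.items():
    pick = max if higher else min
    # Choose the model name whose value is best for this metric
    winners[metric] = pick(test_metrics, key=lambda name: test_metrics[name][metric])

for metric, name in winners.items():
    print(f"{metric}: best {name}")
```

On this run the unbalanced model has the better MSE and AUC while the balanced model has the better LogLoss; the gaps are small, and with only 10 trees and a single seed they should not be read as a general verdict on class balancing.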
