In [1]:
import h2o

In [2]:
h2o.init()


H2O cluster uptime: 1 minutes 50 seconds 618 milliseconds
H2O cluster version: 3.1.0.99999
H2O cluster name: ece
H2O cluster total nodes: 1
H2O cluster total memory: 4.44 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321
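
h2o.init() with no arguments starts or attaches to a local H2O cluster on the default port. If a cluster is already running elsewhere, the client can be pointed at it explicitly; a minimal sketch, assuming the ip/port keyword arguments of h2o.init() (they mirror the connection info printed above):

# Attach to an existing H2O cluster instead of starting a new local one
h2o.init(ip="127.0.0.1", port=54321)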

In [3]:
# Upload the training data file to H2O
air = h2o.import_frame(path=h2o.locate("smalldata/airlines/AirlinesTrain.csv.zip"))


Parse Progress: [##################################################] 100%
Imported  /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTrain.csv.zip . Parsed 24,421 rows and 12 cols
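
Before modeling, it can help to eyeball the parsed frame. A quick sketch using the head() call that also appears later in this notebook, plus describe(), which is assumed to be available in this H2O build:

air.head()       # first rows of the parsed frame
air.describe()   # per-column summary: types, ranges, missing counts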

In [4]:
# Construct train and validation sets by random sampling (80% train / 20% validation)
# r is a uniform random column as tall as air.nrow()
r = air[0].runif()
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]

myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"
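
The runif() column gives an approximate 80/20 split rather than an exact one. A quick sanity check (sketch, using the nrow() accessor referenced in the comment above):

# Roughly 80% of the rows should land in air_train and 20% in air_valid
print("rows: total {0}, train {1}, valid {2}".format(
    air.nrow(), air_train.nrow(), air_valid.nrow()))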

In [5]:
# GBM: gradient boosting classifier trained with a validation set
gbm = h2o.gbm(x=air_train[myX], 
              y=air_train[myY], 
              validation_x=air_valid[myX],
              validation_y=air_valid[myY],
              distribution="bernoulli", 
              ntrees=100, 
              max_depth=3, 
              learn_rate=0.01)
gbm.show()


gbm Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Gradient Boosting Machine
Model Key:  GBMModel__83569002bd127b1b24610fe4ac52444c

Model Summary:

number_of_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
100.0            21889.0              3.0        3.0        3.0         8.0         8.0         8.0

ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.224935884507
R^2: 0.0917523735414
LogLoss: 0.641870843139
AUC: 0.700860264576
Gini: 0.401720529152

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.45100329685:

         NO       YES      Error   Rate
NO       2703.0   6143.0   0.6944  (6143.0/8846.0)
YES      1067.0   9680.0   0.0993  (1067.0/10747.0)
Total    3770.0   15823.0  0.7937  (0.7937/19593.0)
Maximum Metrics:

metric threshold value idx
f1 0.45100329685 0.728641324802 331.0
f2 0.376803747622 0.859382506486 396.0
f0point5 0.538983613241 0.683115048095 218.0
accuracy 0.521859623661 0.654366355331 240.0
precision 0.681933134563 0.901162790698 8.0
absolute_MCC 0.538983613241 0.299292001087 218.0
min_per_class_accuracy 0.54865448394 0.644551967991 204.0
tns 0.690888629343 8833.0 0.0
fns 0.690888629343 10648.0 0.0
fps 0.371575110378 8846.0 399.0
tps 0.371575110378 10747.0 399.0
tnr 0.690888629343 0.998530409225 0.0
fnr 0.690888629343 0.990788126919 0.0
fpr 0.371575110378 1.0 399.0
tpr 0.371575110378 1.0 399.0
ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.2275183899
R^2: 0.0842002842717
LogLoss: 0.647224224791
AUC: 0.68803214641
Gini: 0.37606429282

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.429662357774:

         NO      YES     Error   Rate
NO       435.0   1785.0  0.8041  (1785.0/2220.0)
YES      137.0   2471.0  0.0525  (137.0/2608.0)
Total    572.0   4256.0  0.8566  (0.8566/4828.0)
Maximum Metrics:

metric threshold value idx
f1 0.429662357774 0.719988344988 356.0
f2 0.376803773922 0.854684009986 396.0
f0point5 0.539014213255 0.674244668246 217.0
accuracy 0.526662150196 0.65057995029 232.0
precision 0.67636982654 0.835443037975 18.0
absolute_MCC 0.539014213255 0.292962334179 217.0
min_per_class_accuracy 0.548567487854 0.631901840491 202.0
tns 0.690888600455 2213.0 0.0
fns 0.690888600455 2589.0 0.0
fps 0.371575143654 2220.0 399.0
tps 0.371575143654 2608.0 399.0
tnr 0.690888600455 0.996846846847 0.0
fnr 0.690888600455 0.992714723926 0.0
fpr 0.371575143654 1.0 399.0
tpr 0.371575143654 1.0 399.0
Scoring History:

timestamp duration number_of_trees training_MSE training_logloss training_AUC training_classification_error validation_MSE validation_logloss validation_AUC validation_classification_error
2015-05-22 13:19:39 0.073 sec 1.0 0.247227169696 0.687586187163 0.662122035392 0.385698974123 0.248060763596 0.689258980589 0.650669457801 0.386909693455
2015-05-22 13:19:39 0.111 sec 2.0 0.246816106849 0.686756385519 0.66222330505 0.385698974123 0.247675119161 0.688480563888 0.650790619991 0.386909693455
2015-05-22 13:19:39 0.142 sec 3.0 0.246413615521 0.685943950168 0.66257594751 0.385698974123 0.247291514047 0.687706372792 0.65123718427 0.386909693455
2015-05-22 13:19:39 0.158 sec 4.0 0.246019285467 0.685148026987 0.662749723193 0.386464553667 0.246920994436 0.686958679422 0.651638409882 0.387116818558
2015-05-22 13:19:39 0.178 sec 5.0 0.245631966235 0.684366278301 0.662702378116 0.386464553667 0.246552236947 0.68621460685 0.651539096612 0.387116818558
--- --- --- --- --- --- --- --- --- --- --- ---
2015-05-22 13:19:42 3.694 sec 80.0 0.227535224089 0.647373511398 0.697257952158 0.371101924157 0.229781026205 0.651996607902 0.68473434132 0.381731565866
2015-05-22 13:19:43 3.777 sec 81.0 0.227384102324 0.647055090045 0.697614870507 0.371101924157 0.229651399848 0.651724668276 0.685020968054 0.381731565866
2015-05-22 13:19:43 3.861 sec 82.0 0.227239988942 0.646750400551 0.697702497293 0.370846730975 0.229522678951 0.651453477146 0.685221321782 0.381731565866
2015-05-22 13:19:43 3.947 sec 83.0 0.227073325763 0.646400341159 0.697905978041 0.370846730975 0.229386828575 0.651167496306 0.685343433925 0.381731565866
2015-05-22 13:19:43 4.183 sec 100.0 0.224935884507 0.641870843139 0.700860264576 0.367988567345 0.2275183899 0.647224224791 0.68803214641 0.398094449047
Variable Importances:

variable       relative_importance  scaled_importance  percentage
Origin         17213.3203125        1.0                0.685965839068
Dest           4465.96972656        0.259448476266     0.177972791717
UniqueCarrier  1887.43884277        0.109649899526     0.075216085332
fDayofMonth    1266.3125            0.0735658476698    0.0504636584235
fMonth         203.423248291        0.0118177809161    0.008106594002
fDayOfWeek     57.0886230469        0.00331653754246   0.00227503145812
Distance       0.0                  0.0                0.0
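
As a side note, the Gini coefficient reported above is just a rescaling of the AUC (Gini = 2*AUC - 1), which is easy to verify from the training metrics:

auc = 0.700860264576   # training AUC reported above
gini = 2 * auc - 1     # 0.401720529152, matching the reported Gini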

In [6]:
# GLM: binomial (logistic) regression with the L-BFGS solver
glm = h2o.glm(x=air_train[myX], 
              y=air_train[myY],
              validation_x=air_valid[myX],
              validation_y=air_valid[myY],
              family = "binomial", 
              solver="L_BFGS")
glm.pprint_coef()


glm Model Build Progress: [##################################################] 100%

Coefficients:

names          coefficients      standardized_coefficients
Intercept      0.0373707847069   0.195063579531
Origin.ABE     -0.0401578633536  -0.0401578633536
Origin.ABQ     -0.0938267138619  -0.0938267138619
Origin.ACY     -0.135339354063   -0.135339354063
Origin.ALB     0.0711798459683   0.0711798459683
---            ---               ---
fDayOfWeek.f6  -0.156236716144   -0.156236716144
fDayOfWeek.f7  0.0472831537707   0.0472831537707
fMonth.f1      -0.221575958907   -0.221575958907
fMonth.f10     0.208857303935    0.208857303935
Distance       0.00020866333889  0.131663819411
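
For a binomial GLM, the predicted probability of IsDepDelayed = YES comes from the logistic link applied to the linear predictor (the intercept plus the sum of coefficient times feature value, including the categorical level coefficients listed above). A toy sketch of the link function only; the 500-mile distance and the two terms used here are illustrative, not a full prediction:

import math

# Hypothetical partial linear predictor: intercept + Distance coefficient * 500 miles.
# A real prediction would also add the Origin/Dest/carrier/day/month level coefficients.
eta = 0.0373707847069 + 0.00020866333889 * 500
p_yes = 1.0 / (1.0 + math.exp(-eta))   # logistic link maps eta to a probability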


In [7]:
# Upload the test data file to H2O
air_test = h2o.import_frame(path=h2o.locate("smalldata/airlines/AirlinesTest.csv.zip"))


Parse Progress: [##################################################] 100%
Imported  /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTest.csv.zip . Parsed 2,691 rows and 12 cols

In [8]:
# Predictions and performance on the test set for both models
gbm_pred = gbm.predict(air_test)
print("GBM predictions: ")
gbm_pred.head()

gbm_perf = gbm.model_performance(air_test)
print("GBM performance: ")
gbm_perf.show()

glm_pred = glm.predict(air_test)
print("GLM predictions: ")
glm_pred.head()

glm_perf = glm.model_performance(air_test)
print("GLM performance: ")
glm_perf.show()


GBM predictions: 
First 10 rows and first 3 columns: 
Row ID  predict  NO                   YES
1       YES      0.47525141674393157  0.5247485832560684
2       YES      0.48024938136117845  0.5197506186388215
3       YES      0.48024938136117845  0.5197506186388215
4       YES      0.402168737810524    0.597831262189476
5       YES      0.5136446294303063   0.48635537056969363
6       YES      0.5136446294303063   0.48635537056969363
7       YES      0.5478525167901855   0.45214748320981446
8       YES      0.5580925509767907   0.4419074490232094
9       YES      0.5580925509767907   0.4419074490232094
10      YES      0.5580925509767907   0.4419074490232094
GBM performance: 

ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.226368400011
R^2: 0.0865312024067
LogLoss: 0.644861693711
AUC: 0.692409878597
Gini: 0.384819757194

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.441901440341:

         NO      YES     Error   Rate
NO       293.0   924.0   0.7592  (924.0/1217.0)
YES      112.0   1362.0  0.076   (112.0/1474.0)
Total    405.0   2286.0  0.8352  (0.8352/2691.0)
Maximum Metrics:

metric threshold value idx
f1 0.441901440341 0.724468085106 339.0
f2 0.383773415183 0.859786810355 391.0
f0point5 0.543468874795 0.6851506265 213.0
accuracy 0.522296596314 0.657748049052 242.0
precision 0.678096525394 0.847222222222 14.0
absolute_MCC 0.543468874795 0.30464489118 213.0
min_per_class_accuracy 0.549319523433 0.642469470828 206.0
tns 0.690888600455 1213.0 0.0
fns 0.690888600455 1461.0 0.0
fps 0.371575143654 1217.0 399.0
tps 0.371575143654 1474.0 399.0
tnr 0.690888600455 0.996713229252 0.0
fnr 0.690888600455 0.99118046133 0.0
fpr 0.371575143654 1.0 399.0
tpr 0.371575143654 1.0 399.0
GLM predictions: 
First 10 rows and first 3 columns: 
Row ID  predict  p0                   p1
1       YES      0.33138044246038023  0.6686195575396198
2       YES      0.3914744148501228   0.6085255851498772
3       YES      0.36039204225753896  0.639607957742461
4       YES      0.4304740051645429   0.5695259948354571
5       YES      0.5256165167500713   0.4743834832499287
6       YES      0.5562418812088273   0.44375811879117266
7       YES      0.48440139277691874  0.5155986072230813
8       YES      0.44487802611756044  0.5551219738824396
9       YES      0.5819723452658147   0.41802765473418535
10      YES      0.5685108555327485   0.4314891444672515
GLM performance: 

ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.220260505275
R^2: 0.111178508566
LogLoss: 0.630774448994
Null degrees of freedom: 2690
Residual degrees of freedom: 2438
Null deviance: 3705.94255374
Residual deviance: 3394.82808448
AIC: 3900.82808448
AUC: 0.69739355066
Gini: 0.39478710132

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.443988379059:

         NO      YES     Error   Rate
NO       391.0   826.0   0.6787  (826.0/1217.0)
YES      161.0   1313.0  0.1092  (161.0/1474.0)
Total    552.0   2139.0  0.7879  (0.7879/2691.0)
Maximum Metrics:

metric threshold value idx
f1 0.443988379059 0.726819817326 284.0
f2 0.247001441468 0.860535860536 382.0
f0point5 0.569158065903 0.685638454733 183.0
accuracy 0.540614921318 0.655890003716 211.0
precision 0.887237238744 1.0 0.0
absolute_MCC 0.569158065903 0.303360041004 183.0
min_per_class_accuracy 0.563183947037 0.644504748982 189.0
tns 0.887237238744 1217.0 0.0
fns 0.887237238744 1472.0 0.0
fps 0.186084076673 1217.0 399.0
tps 0.215917647428 1474.0 393.0
tnr 0.887237238744 1.0 0.0
fnr 0.887237238744 0.998643147897 0.0
fpr 0.186084076673 1.0 399.0
tpr 0.215917647428 1.0 393.0
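
The YES/NO labels in the predict column come from thresholding the class probability at a model-chosen cutoff. To apply a custom cutoff instead, threshold the probability column directly (sketch, using the same by-name column selection and comparison operators as the cells above):

# 1/0 indicator column: 1 where the predicted P(YES) exceeds a custom 0.5 cutoff
custom_yes = gbm_pred["YES"] > 0.5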

In [9]:
# Confusion matrices for both models on the test set
gbm_CM = gbm_perf.confusion_matrix()
print(gbm_CM)
print("")

glm_CM = glm_perf.confusion_matrix()
print(glm_CM)


Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.441901440341:

         NO      YES     Error   Rate
NO       293.0   924.0   0.7592  (924.0/1217.0)
YES      112.0   1362.0  0.076   (112.0/1474.0)
Total    405.0   2286.0  0.8352  (0.8352/2691.0)


Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.443988379059:

         NO      YES     Error   Rate
NO       391.0   826.0   0.6787  (826.0/1217.0)
YES      161.0   1313.0  0.1092  (161.0/1474.0)
Total    552.0   2139.0  0.7879  (0.7879/2691.0)
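
The per-class error rates in these matrices are simply the off-diagonal count divided by the row total, which is easy to verify from the GBM matrix above:

no_error  = 924.0 / 1217.0    # ~0.7592: actual NO flights predicted YES
yes_error = 112.0 / 1474.0    # ~0.0760: actual YES flights predicted NO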


In [10]:
# Precision, accuracy, and AUC on the test set
print('GBM Precision: {0}'.format(gbm_perf.precision()))
print('GBM Accuracy: {0}'.format(gbm_perf.accuracy()))
print('GBM AUC: {0}'.format(gbm_perf.auc()))
print("")
print('GLM Precision: {0}'.format(glm_perf.precision()))
print('GLM Accuracy: {0}'.format(glm_perf.accuracy()))
print('GLM AUC: {0}'.format(glm_perf.auc()))


GBM Precision: [[0.6780965253938488, 0.8472222222222222]]
GBM Accuracy: [[0.5222965963143628, 0.6577480490523968]]
GBM AUC: 0.692409878597

GLM Precision: [[0.8872372387438643, 1.0]]
GLM Accuracy: [[0.5406149213176982, 0.6558900037160906]]
GLM AUC: 0.69739355066
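
precision() and accuracy() return [threshold, value] pairs: the threshold that maximizes the metric, followed by the metric value at that threshold. Pulling out just the values for a side-by-side comparison looks like this (sketch, indexing the nested lists shown above):

gbm_best_precision = gbm_perf.precision()[0][1]   # 0.847...
glm_best_precision = glm_perf.precision()[0][1]   # 1.0
print("Best precision - GBM: {0}, GLM: {1}".format(gbm_best_precision, glm_best_precision))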