In [1]:

    
import h2o



In [2]:

    
h2o.init()









    




H2O cluster uptime:	2 minutes 47 seconds 451 milliseconds
H2O cluster version:	3.5.0.99999
H2O cluster name:	ece
H2O cluster total nodes:	1
H2O cluster total memory:	10.67 GB
H2O cluster total cores:	8
H2O cluster allowed cores:	8
H2O cluster healthy:	True
H2O Connection ip:	127.0.0.1
H2O Connection port:	54321



In [3]:

    
from h2o.utils.shared_utils import _locate  # private helper; finds files within the h2o git project directory

# Upload the data file to H2O
air = h2o.import_file(path=_locate("smalldata/airlines/AirlinesTrain.csv.zip"))









    



Parse Progress: [##################################################] 100%
Imported /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTrain.csv.zip. Parsed 24,421 rows and 12 cols



In [4]:

    
# Build train and validation sets by random sampling (80/20)
# runif() creates a column of uniform random numbers as tall as air.nrow
r = air[0].runif()
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]

myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"
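The split above draws one uniform random number per row and routes the row by comparing that number to 0.8, so the 80/20 proportions are approximate rather than exact (the run shown here ends up with 19,589 training rows and 4,832 validation rows). A minimal plain-Python sketch of the same idea follows; the `split_by_uniform` helper, the fixed seed, and the use of a list in place of an H2O frame are all assumptions made for illustration.

```python
import random


def split_by_uniform(rows, train_fraction=0.8, seed=42):
    """Split rows into (train, valid) using one uniform draw per row."""
    rng = random.Random(seed)
    # One draw per row, playing the role of the runif() column.
    draws = [rng.random() for _ in rows]
    train = [row for row, u in zip(rows, draws) if u < train_fraction]
    valid = [row for row, u in zip(rows, draws) if u >= train_fraction]
    return train, valid


# Stand-in for the 24,421 parsed rows of the airlines training file.
rows = list(range(24421))
train, valid = split_by_uniform(rows)
print("train rows: {0}, valid rows: {1}".format(len(train), len(valid)))
```

Because every row gets exactly one draw, each row lands in exactly one of the two sets, and only the set sizes vary from run to run.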



In [5]:

    
#gbm
gbm = h2o.gbm(x=air_train[myX], 
              y=air_train[myY], 
              validation_x=air_valid[myX],
              validation_y=air_valid[myY],
              distribution="bernoulli", 
              ntrees=100, 
              max_depth=3, 
              learn_rate=0.01)
gbm.show()









    



gbm Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Gradient Boosting Machine
Model Key:  GBM_model_python_1444621872790_19

Model Summary:






    





	number_of_trees	model_size_in_bytes	min_depth	max_depth	mean_depth	min_leaves	max_leaves	mean_leaves
	100.0	21708.0	3.0	3.0	3.0	8.0	8.0	8.0






    




ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.224850820408
R^2: 0.0924584914905
LogLoss: 0.641731899772
AUC: 0.702768783364
Gini: 0.405537566728

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.448864947415:






    





	NO	YES	Error	Rate
NO	2631.0	6236.0	0.7033	(6236.0/8867.0)
YES	1018.0	9704.0	0.0949	(1018.0/10722.0)
Total	3649.0	15940.0	0.3703	(7254.0/19589.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds







    




metric	threshold	value	idx
max f1	0.448864947415	0.727927387293	329.0
max f2	0.384216073407	0.859214414183	396.0
max f0point5	0.540484202247	0.683872282105	211.0
max accuracy	0.519460013297	0.658583899127	238.0
max precision	0.676757678734	0.874429223744	7.0
max absolute_MCC	0.519460013297	0.305317688997	238.0
max min_per_class_accuracy	0.546452685802	0.645401977243	203.0
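The error rates in the training confusion matrix and the "max f1" value in the table above can both be reproduced from the four cell counts. The sketch below does that arithmetic with the standard definitions of precision, recall, and F1; treating "YES" as the positive class is an assumption, though it is the reading that reproduces the reported numbers.

```python
# Counts copied from the training confusion matrix (rows = actual, columns = predicted).
tn, fp = 2631.0, 6236.0   # actual NO:  predicted NO, predicted YES
fn, tp = 1018.0, 9704.0   # actual YES: predicted NO, predicted YES

# Per-class and overall error rates, as shown in the "Error" column.
no_error = fp / (tn + fp)
yes_error = fn / (fn + tp)
total_error = (fp + fn) / (tn + fp + fn + tp)

# Standard precision / recall / F1 with YES as the positive class (assumption).
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2.0 * precision * recall / (precision + recall)

print("NO error:    {0:.4f}".format(no_error))
print("YES error:   {0:.4f}".format(yes_error))
print("Total error: {0:.4f}".format(total_error))
print("F1:          {0:.6f}".format(f1))
```

The high error on the NO class is the price of the F1-maximizing threshold: it favors recall on YES, so most NO rows are predicted as YES.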






    



ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.228581497538
R^2: 0.0782379231571
LogLoss: 0.649363009278
AUC: 0.680577713137
Gini: 0.361155426274

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.446739697823:






    





	NO	YES	Error	Rate
NO	605.0	1594.0	0.7249	(1594.0/2199.0)
YES	254.0	2379.0	0.0965	(254.0/2633.0)
Total	859.0	3973.0	0.3825	(1848.0/4832.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds







    




metric	threshold	value	idx
max f1	0.446739697823	0.72025431426	330.0
max f2	0.373911932185	0.857795172864	397.0
max f0point5	0.520295345537	0.666759175744	236.0
max accuracy	0.512345552615	0.642177152318	248.0
max precision	0.684564934321	0.869565217391	0.0
max absolute_MCC	0.512345552615	0.27122238954	248.0
max min_per_class_accuracy	0.547116521118	0.625902012913	197.0
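The Gini values reported alongside AUC for the training and validation data are consistent with the usual relationship Gini = 2 * AUC - 1. The short check below applies it to both pairs of reported numbers; the `gini_from_auc` helper is a hypothetical name used only for this illustration.

```python
def gini_from_auc(auc):
    """Gini coefficient implied by an AUC value (Gini = 2 * AUC - 1)."""
    return 2.0 * auc - 1.0


# (AUC, Gini) pairs copied from the GBM metrics reported above.
reported = {
    "train": (0.702768783364, 0.405537566728),
    "validation": (0.680577713137, 0.361155426274),
}

for name in sorted(reported):
    auc, gini = reported[name]
    print("{0}: computed {1:.12f}, reported {2:.12f}".format(
        name, gini_from_auc(auc), gini))
```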






    



Scoring History:






    





	timestamp	duration	number_of_trees	training_MSE	training_logloss	training_AUC	training_classification_error	validation_MSE	validation_logloss	validation_AUC	validation_classification_error
	2015-10-11 20:54:01	0.070 sec	1.0	0.247331320109	0.687795430241	0.660414156332	0.390627392925	0.247603524684	0.688341976058	0.647685556757	0.395074503311
	2015-10-11 20:54:01	0.114 sec	2.0	0.246913667129	0.686952727594	0.660424401202	0.390627392925	0.247224374106	0.6875769826	0.647755678055	0.395074503311
	2015-10-11 20:54:01	0.150 sec	3.0	0.246503630683	0.686125470932	0.660424401202	0.390627392925	0.246854759296	0.686831315909	0.647755678055	0.395074503311
	2015-10-11 20:54:01	0.177 sec	4.0	0.246101212127	0.685313591467	0.663714440178	0.390627392925	0.246503381528	0.686122482818	0.646327604285	0.39673013245
	2015-10-11 20:54:01	0.196 sec	5.0	0.245706838064	0.684518021244	0.663646675728	0.390627392925	0.24615014055	0.685409949548	0.646298070438	0.39673013245
---	---	---	---	---	---	---	---	---	---	---	---
	2015-10-11 20:54:04	3.683 sec	75.0	0.228228148277	0.648850974833	0.698597822319	0.369748328143	0.231195338447	0.654891371682	0.678054296337	0.384105960265
	2015-10-11 20:54:05	3.775 sec	76.0	0.228053442305	0.648485174531	0.698807689635	0.369646230027	0.231064058149	0.654614394176	0.678156283101	0.385347682119
	2015-10-11 20:54:05	3.857 sec	77.0	0.227887494345	0.648136594459	0.699004214428	0.371688192353	0.230933646803	0.654339144973	0.678235126383	0.385347682119
	2015-10-11 20:54:05	3.947 sec	78.0	0.227764971727	0.647879404876	0.699149178285	0.369697279085	0.230831743431	0.654125223389	0.678410256915	0.385347682119
	2015-10-11 20:54:05	4.254 sec	100.0	0.224850820408	0.641731899772	0.702768783364	0.370309867783	0.228581497538	0.649363009278	0.680577713137	0.382450331126






    



Variable Importances:






    




variable	relative_importance	scaled_importance	percentage
Origin	17652.8085938	1.0	0.692557100813
Dest	4619.26074219	0.261672850394	0.18122339063
UniqueCarrier	1647.6003418	0.0933336093827	0.0646388539225
fDayofMonth	1342.42211914	0.0760458094819	0.0526660653438
fDayOfWeek	139.104721069	0.00788003338566	0.00545737307588
fMonth	88.1220855713	0.0049919583676	0.00345721621445
Distance	0.0	0.0	0.0
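The three importance columns carry the same information at different scales. Judging from the numbers above, `scaled_importance` is the relative importance divided by the largest value, and `percentage` is the relative importance divided by the column total. The sketch below recomputes both from the `relative_importance` column; it is a consistency check on the printed table, not a description of H2O's internal code.

```python
# (variable, relative_importance) pairs copied from the table above.
relative = [
    ("Origin", 17652.8085938),
    ("Dest", 4619.26074219),
    ("UniqueCarrier", 1647.6003418),
    ("fDayofMonth", 1342.42211914),
    ("fDayOfWeek", 139.104721069),
    ("fMonth", 88.1220855713),
    ("Distance", 0.0),
]

largest = max(value for _, value in relative)
total = sum(value for _, value in relative)

# Scale by the maximum (top variable becomes 1.0) and by the total (shares sum to 1.0).
scaled = dict((name, value / largest) for name, value in relative)
percentage = dict((name, value / total) for name, value in relative)

for name, _ in relative:
    print("{0:<14} scaled={1:.6f} percentage={2:.6f}".format(
        name, scaled[name], percentage[name]))
```

Read this way, Origin alone accounts for roughly 69% of the total importance, and Distance was never used in a split.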



In [6]:

    
#glm
glm = h2o.glm(x=air_train[myX], 
              y=air_train[myY],
              validation_x=air_valid[myX],
              validation_y=air_valid[myY],
              family = "binomial", 
              solver="L_BFGS")
glm.pprint_coef()









    



glm Model Build Progress: [##################################################] 100%

Coefficients: glm coefficients







    




names	coefficients	standardized_coefficients
Intercept	0.056540315409	0.224670161231
Origin.ABE	-0.00451467313266	-0.00451467313266
Origin.ABQ	-0.0369454795796	-0.0369454795796
Origin.ACY	-0.0143826087457	-0.0143826087457
Origin.ALB	0.00857751200054	0.00857751200054
---	---	---
fDayOfWeek.f6	-0.0868429852704	-0.0868429852704
fDayOfWeek.f7	0.0201706138395	0.0201706138395
fMonth.f1	-0.100726106453	-0.100726106453
fMonth.f10	0.106283308704	0.106283308704
Distance	0.000222614075406	0.140355957215



In [7]:

    
# Upload the test file to H2O
air_test = h2o.import_file(path=_locate("smalldata/airlines/AirlinesTest.csv.zip"))









    



Parse Progress: [##################################################] 100%
Imported /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTest.csv.zip. Parsed 2,691 rows and 12 cols



In [8]:

    
# predicting & performance on test file
gbm_pred = gbm.predict(air_test)
print("GBM predictions: ")
gbm_pred.head()

gbm_perf = gbm.model_performance(air_test)
print("GBM performance: ")
gbm_perf.show()

glm_pred = glm.predict(air_test)
print("GLM predictions: ")
glm_pred.head()

glm_perf = glm.model_performance(air_test)
print("GLM performance: ")
glm_perf.show()









    



GBM predictions: 
GBM performance: 

ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.226566937921
R^2: 0.0853901612137
LogLoss: 0.64529552984
AUC: 0.691592366843
Gini: 0.383184733686

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.461947098049:






    





	NO	YES	Error	Rate
NO	399.0	818.0	0.6721	(818.0/1217.0)
YES	178.0	1296.0	0.1208	(178.0/1474.0)
Total	577.0	2114.0	0.3701	(996.0/2691.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds







    




metric	threshold	value	idx
max f1	0.461947098049	0.722408026756	317.0
max f2	0.385877574696	0.85970464135	394.0
max f0point5	0.53566616157	0.685536224357	219.0
max accuracy	0.53566616157	0.658491267187	219.0
max precision	0.675162105576	0.848484848485	9.0
max absolute_MCC	0.53566616157	0.307175612177	219.0
max min_per_class_accuracy	0.54621274447	0.641112618725	203.0






    



GLM predictions: 
GLM performance: 

ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.232278313501
R^2: 0.0623343687585
LogLoss: 0.656952698744
Null degrees of freedom: 2690
Residual degrees of freedom: 2438
Null deviance: 3705.93804119
Residual deviance: 3535.71942464
AIC: 4041.71942464
AUC: 0.654934783021
Gini: 0.309869566041

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.463553658893:






    





	NO	YES	Error	Rate
NO	279.0	938.0	0.7707	(938.0/1217.0)
YES	105.0	1369.0	0.0712	(105.0/1474.0)
Total	384.0	2307.0	0.3876	(1043.0/2691.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds







    




metric	threshold	value	idx
max f1	0.463553658893	0.724147051045	309.0
max f2	0.351797544742	0.858992302309	391.0
max f0point5	0.518666845805	0.65379623621	256.0
max accuracy	0.511685442003	0.629134150873	264.0
max precision	0.769210809139	1.0	0.0
max absolute_MCC	0.511685442003	0.244267852676	264.0
max min_per_class_accuracy	0.557529502351	0.602442333786	198.0
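Several of the GLM test-set metrics above are tied together by simple identities. With n = 2,691 test rows, the reported numbers are consistent with parameter count = n - residual degrees of freedom, AIC = residual deviance + 2 * parameter count, and LogLoss = residual deviance / (2 * n). The sketch below checks those relationships against the printed values. They are the textbook identities for a binomial model, and reading them as how H2O derives these fields is an assumption.

```python
# Values copied from the GLM test-set metrics above.
n_rows = 2691
residual_dof = 2438
residual_deviance = 3535.71942464
reported_aic = 4041.71942464
reported_logloss = 0.656952698744

# Number of fitted parameters implied by the degrees of freedom (intercept included).
n_params = n_rows - residual_dof

# AIC adds a penalty of 2 per parameter to the residual deviance.
aic = residual_deviance + 2.0 * n_params

# For a binomial model, deviance = 2 * n * mean log loss.
logloss = residual_deviance / (2.0 * n_rows)

print("parameters: {0}".format(n_params))
print("AIC:     computed {0:.8f}, reported {1:.8f}".format(aic, reported_aic))
print("LogLoss: computed {0:.12f}, reported {1:.12f}".format(logloss, reported_logloss))
```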



In [9]:

    
# Building confusion matrix for test set
gbm_CM = gbm_perf.confusion_matrix()
print(gbm_CM)
print('')

glm_CM = glm_perf.confusion_matrix()
print(glm_CM)









    



Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.461947098049:






    





	NO	YES	Error	Rate
NO	399.0	818.0	0.6721	(818.0/1217.0)
YES	178.0	1296.0	0.1208	(178.0/1474.0)
Total	577.0	2114.0	0.3701	(996.0/2691.0)






    





Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.463553658893:






    





	NO	YES	Error	Rate
NO	279.0	938.0	0.7707	(938.0/1217.0)
YES	105.0	1369.0	0.0712	(105.0/1474.0)
Total	384.0	2307.0	0.3876	(1043.0/2691.0)



In [10]:

    
# Precision, accuracy, and AUC on the test set
# precision() and accuracy() return [[threshold, value]] at the threshold that maximizes each metric
print('GBM Precision: {0}'.format(gbm_perf.precision()))
print('GBM Accuracy: {0}'.format(gbm_perf.accuracy()))
print('GBM AUC: {0}'.format(gbm_perf.auc()))
print('')
print('GLM Precision: {0}'.format(glm_perf.precision()))
print('GLM Accuracy: {0}'.format(glm_perf.accuracy()))
print('GLM AUC: {0}'.format(glm_perf.auc()))









    



GBM Precision: [[0.675162105576199, 0.8484848484848485]]
GBM Accuracy: [[0.5356661615701861, 0.6584912671869194]]
GBM AUC: 0.691592366843

GLM Precision: [[0.7692108091388267, 1.0]]
GLM Accuracy: [[0.5116854420033787, 0.6291341508732813]]
GLM AUC: 0.654934783021
