In [1]:

    
# This is a demo of H2O's Distributed Random Forest (DRF) estimator
# It uploads a data set, splits it into training and validation frames,
# then trains DRF with and without class balancing and compares the results
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator



In [2]:

    
h2o.init()









    



Warning: Version mismatch. H2O is version 3.5.0.99999, but the python package is version UNKNOWN.






    




H2O cluster uptime:         44 minutes 50 seconds 74 milliseconds
H2O cluster version:        3.5.0.99999
H2O cluster name:           ludirehak
H2O cluster total nodes:    1
H2O cluster total memory:   3.56 GB
H2O cluster total cores:    8
H2O cluster allowed cores:  8
H2O cluster healthy:        True
H2O Connection ip:          127.0.0.1
H2O Connection port:        54321



In [3]:

    
from h2o.h2o import _locate # private function. used to find files within h2o git project directory.

air = h2o.upload_file(path=_locate("smalldata/airlines/AirlinesTrain.csv.zip"))









    



Parse Progress: [##################################################] 100%
Uploaded pya01a74e5-0aa6-4ef0-ae1a-0d3fe860eee9 into cluster with 24,421 rows and 12 cols



In [4]:

    
r = air[0].runif()
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]
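The split above draws one uniform random number per row and sends rows with a draw below 0.8 to training and the rest to validation, so the 80/20 proportion holds only approximately. A minimal pure-Python sketch of the same idea follows; it uses the standard `random` module as a stand-in for `runif()`, and the helper name `uniform_split` and its seed are illustrative assumptions rather than part of H2O.

```python
import random

def uniform_split(rows, train_fraction=0.8, seed=42):
    """Split rows by comparing one uniform draw per row with train_fraction."""
    rng = random.Random(seed)
    train, valid = [], []
    for row in rows:
        # Mirrors air[r < 0.8] and air[r >= 0.8]: each row lands in exactly one split.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            valid.append(row)
    return train, valid

rows = list(range(24421))  # same row count as the uploaded frame
train, valid = uniform_split(rows)
print(len(train), len(valid))
```

Because the draw is made per row, the two frames never overlap, but their sizes vary from run to run unless the random seed is fixed.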



In [5]:

    
myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"



In [6]:

    
rf_no_bal = H2ORandomForestEstimator(seed=12, ntrees=10, max_depth=20, balance_classes=False)
rf_no_bal.train(x=myX, y=myY, training_frame=air_train, validation_frame=air_valid)
rf_no_bal.show()









    



drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2ORandomForestEstimator :  Distributed RF
Model Key:  DRF_model_python_1445557087082_2742

Model Summary:

number_of_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
10.0             287650.0             20.0       20.0       20.0        1664.0      2418.0      2103.5






    




ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.269503006052
R^2: -0.0873991649123
LogLoss: 2.43382549553
AUC: 0.646622642412
Gini: 0.293245284825

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.402941766395:

       NO      YES      Error   Rate
NO     1948.0  6780.0   0.7768  (6780.0/8728.0)
YES    936.0   9580.0   0.089   (936.0/10516.0)
Total  2884.0  16360.0  0.401   (7716.0/19244.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds

metric                      threshold  value  idx
max f1                      0.4        0.7    299.0
max f2                      0.0        0.9    399.0
max f0point5                0.6        0.7    190.0
max accuracy                0.6        0.6    193.0
max precision               0.9        0.7    30.0
max absolute_MCC            0.6        0.2    190.0
max min_per_class_accuracy  0.7        0.6    140.0






    



ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.245293478794
R^2: 0.00968032826017
LogLoss: 0.758757679035
AUC: 0.685987609758
Gini: 0.371975219515

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.42132409513:

       NO     YES     Error   Rate
NO     467.0  1781.0  0.7923  (1781.0/2248.0)
YES    160.0  2566.0  0.0587  (160.0/2726.0)
Total  627.0  4347.0  0.3902  (1941.0/4974.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds

metric                      threshold  value  idx
max f1                      0.4        0.7    315.0
max f2                      0.2        0.9    396.0
max f0point5                0.7        0.7    174.0
max accuracy                0.7        0.6    200.0
max precision               1.0        0.9    0.0
max absolute_MCC            0.7        0.3    174.0
max min_per_class_accuracy  0.7        0.6    165.0






    



Scoring History:

timestamp            duration   number_of_trees  training_MSE  training_logloss  training_AUC  training_classification_error  validation_MSE  validation_logloss  validation_AUC  validation_classification_error
2015-10-22 17:22:58  0.074 sec  1.0              0.3           8.4               0.6           0.4                            0.3             8.1                 0.6             0.5
2015-10-22 17:22:58  0.163 sec  2.0              0.3           7.4               0.6           0.4                            0.3             4.0                 0.6             0.4
2015-10-22 17:22:58  0.245 sec  3.0              0.3           6.5               0.6           0.4                            0.3             2.6                 0.6             0.4
2015-10-22 17:22:58  0.311 sec  4.0              0.3           5.6               0.6           0.5                            0.3             1.9                 0.7             0.4
2015-10-22 17:22:58  0.391 sec  5.0              0.3           4.8               0.6           0.4                            0.3             1.4                 0.7             0.4
2015-10-22 17:22:58  0.480 sec  6.0              0.3           4.0               0.6           0.4                            0.3             1.1                 0.7             0.4
2015-10-22 17:22:58  0.565 sec  7.0              0.3           3.6               0.6           0.4                            0.2             1.0                 0.7             0.4
2015-10-22 17:22:58  0.659 sec  8.0              0.3           3.1               0.6           0.4                            0.2             0.9                 0.7             0.4
2015-10-22 17:22:58  0.751 sec  9.0              0.3           2.7               0.6           0.4                            0.2             0.8                 0.7             0.4
2015-10-22 17:22:58  0.851 sec  10.0             0.3           2.4               0.6           0.4                            0.2             0.8                 0.7             0.4






    



Variable Importances:

variable       relative_importance  scaled_importance  percentage
Origin         6152.2               1.0                0.3
fDayofMonth    5583.6               0.9                0.3
Dest           4203.4               0.7                0.2
UniqueCarrier  1609.3               0.3                0.1
fDayOfWeek     1556.2               0.3                0.1
Distance       1493.0               0.2                0.1
fMonth         131.7                0.0                0.0
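The two derived columns of the variable importance table appear to follow directly from `relative_importance`: `scaled_importance` divides each value by the largest one, and `percentage` divides it by the column total. The rounded values printed above are consistent with that reading, which the sketch below checks using only the numbers from the output (this is an interpretation of the table, not a statement about H2O's internals).

```python
# relative_importance values copied from the first model's output
relative = {
    "Origin": 6152.2,
    "fDayofMonth": 5583.6,
    "Dest": 4203.4,
    "UniqueCarrier": 1609.3,
    "fDayOfWeek": 1556.2,
    "Distance": 1493.0,
    "fMonth": 131.7,
}

largest = max(relative.values())
total = sum(relative.values())

# Assumed definitions: scale by the top variable, and by the column sum
scaled = {name: value / largest for name, value in relative.items()}
percentage = {name: value / total for name, value in relative.items()}

for name in relative:
    print(f"{name:<14}{scaled[name]:>8.3f}{percentage[name]:>8.3f}")
```

Printing three decimals instead of one makes the gap between the top three variables and the rest easier to see than in the rounded table.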



In [7]:

    
rf_bal = H2ORandomForestEstimator(seed=12, ntrees=10, max_depth=20, balance_classes=True)
rf_bal.train(x=myX, y=myY, training_frame=air_train, validation_frame=air_valid)
rf_bal.show()









    



drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2ORandomForestEstimator :  Distributed RF
Model Key:  DRF_model_python_1445557087082_2744

Model Summary:

number_of_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
10.0             299144.0             20.0       20.0       20.0        1750.0      2460.0      2168.2






    




ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.268874582249
R^2: -0.0754992978501
LogLoss: 2.09200342169
AUC: 0.685292136376
Gini: 0.370584272753

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.538182890839:

       NO      YES      Error   Rate
NO     3925.0  6621.0   0.6278  (6621.0/10546.0)
YES    1574.0  8952.0   0.1495  (1574.0/10526.0)
Total  5499.0  15573.0  0.3889  (8195.0/21072.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds

metric                      threshold  value  idx
max f1                      0.5        0.7    226.0
max f2                      0.0        0.8    399.0
max f0point5                0.8        0.6    124.0
max accuracy                0.7        0.6    140.0
max precision               0.9        0.7    28.0
max absolute_MCC            0.7        0.3    151.0
max min_per_class_accuracy  0.7        0.6    140.0






    



ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.249809873778
R^2: -0.00855364526058
LogLoss: 0.770654128805
AUC: 0.682375448104
Gini: 0.364750896207

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.56328826827:

       NO      YES     Error   Rate
NO     822.0   1426.0  0.6343  (1426.0/2248.0)
YES    367.0   2359.0  0.1346  (367.0/2726.0)
Total  1189.0  3785.0  0.3605  (1793.0/4974.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds

metric                      threshold  value  idx
max f1                      0.6        0.7    261.0
max f2                      0.1        0.9    399.0
max f0point5                0.7        0.7    179.0
max accuracy                0.6        0.6    235.0
max precision               1.0        0.8    6.0
max absolute_MCC            0.7        0.3    194.0
max min_per_class_accuracy  0.7        0.6    167.0






    



Scoring History:

timestamp            duration   number_of_trees  training_MSE  training_logloss  training_AUC  training_classification_error  validation_MSE  validation_logloss  validation_AUC  validation_classification_error
2015-10-22 17:22:59  0.093 sec  1.0              0.3           7.3               0.6           0.4                            0.3             7.9                 0.6             0.5
2015-10-22 17:22:59  0.152 sec  2.0              0.3           6.8               0.6           0.4                            0.3             3.7                 0.6             0.4
2015-10-22 17:22:59  0.210 sec  3.0              0.3           5.9               0.6           0.4                            0.3             2.2                 0.6             0.4
2015-10-22 17:22:59  0.287 sec  4.0              0.3           5.2               0.6           0.4                            0.3             1.6                 0.7             0.4
2015-10-22 17:22:59  0.377 sec  5.0              0.3           4.3               0.7           0.4                            0.3             1.3                 0.7             0.4
2015-10-22 17:22:59  0.469 sec  6.0              0.3           3.7               0.7           0.4                            0.3             1.0                 0.7             0.4
2015-10-22 17:22:59  0.571 sec  7.0              0.3           3.2               0.7           0.4                            0.3             0.9                 0.7             0.4
2015-10-22 17:22:59  0.678 sec  8.0              0.3           2.8               0.7           0.4                            0.3             0.9                 0.7             0.4
2015-10-22 17:22:59  0.784 sec  9.0              0.3           2.4               0.7           0.4                            0.2             0.8                 0.7             0.4
2015-10-22 17:22:59  0.894 sec  10.0             0.3           2.1               0.7           0.4                            0.2             0.8                 0.7             0.4






    



Variable Importances:

variable       relative_importance  scaled_importance  percentage
Origin         6811.1               1.0                0.3
fDayofMonth    6129.0               0.9                0.3
Dest           4860.0               0.7                0.2
UniqueCarrier  1824.5               0.3                0.1
fDayOfWeek     1634.1               0.2                0.1
Distance       1591.5               0.2                0.1
fMonth         129.6                0.0                0.0
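With `balance_classes=True` the training confusion matrix reports 10,546 NO rows and 10,526 YES rows, against 8,728 and 10,516 without it. That is consistent with the minority class being oversampled until the two classes are roughly equal in size. The sketch below shows oversampling by random duplication in plain Python; it is an illustration of the general technique only, the helper name `oversample_minority` is hypothetical, and H2O's internal sampling may differ in detail.

```python
import random
from collections import Counter

def oversample_minority(labels, seed=12):
    """Return row indices in which every class appears as often as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        # Draw extra rows with replacement until this class reaches the target size.
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced

# Class counts taken from the unbalanced training confusion matrix
labels = ["NO"] * 8728 + ["YES"] * 10516
balanced = oversample_minority(labels)
print(Counter(labels[i] for i in balanced))
```

Duplicating rows changes the class prior the trees see during training, which is one plausible reason the max-F1 threshold moves between the two models.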



In [8]:

    
air_test = h2o.import_file(path=_locate("smalldata/airlines/AirlinesTest.csv.zip"))









    



Parse Progress: [##################################################] 100%
Imported /Users/ludirehak/h2o-3/smalldata/airlines/AirlinesTest.csv.zip. Parsed 2,691 rows and 12 cols



In [9]:

    
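def model(model_object, test):
    """Score model_object on the test frame and print its main metrics."""
    # Predict on the test frame and show the first rows of the predictions
    pred = model_object.predict(test)
    pred.head()
    # Compute performance metrics on the test set, including the confusion matrix
    perf = model_object.model_performance(test)
    perf.show()
    print(perf.confusion_matrix())
    # Precision and accuracy are printed as [threshold, value] pairs
    print(perf.precision())
    print(perf.accuracy())
    print(perf.auc())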



In [10]:

    
print("\n\nWITHOUT CLASS BALANCING\n")
model(rf_no_bal, air_test)









    




WITHOUT CLASS BALANCING

H2OFrame with 2691 rows and 3 columns:

predict  YES  YES  YES    YES    YES  YES  NO   YES  YES  YES
NO       0.1  0.0  0.225  0.175  0.5  0.4  0.6  0.3  0.3  0.4
YES      0.9  1.0  0.775  0.825  0.5  0.6  0.4  0.7  0.7  0.6






    



ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.242134967995
R^2: 0.0225448334417
LogLoss: 0.818660036508
AUC: 0.705312795104
Gini: 0.410625590208

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.51742125228:

       NO     YES     Error   Rate
NO     377.0  840.0   0.6902  (840.0/1217.0)
YES    143.0  1331.0  0.097   (143.0/1474.0)
Total  520.0  2171.0  0.3653  (983.0/2691.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds

metric                      threshold  value  idx
max f1                      0.5        0.7    276.0
max f2                      0.2        0.9    381.0
max f0point5                0.7        0.7    174.0
max accuracy                0.7        0.7    186.0
max precision               1.0        0.9    7.0
max absolute_MCC            0.7        0.3    174.0
max min_per_class_accuracy  0.7        0.7    162.0






    



Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.51742125228:

       NO     YES     Error   Rate
NO     377.0  840.0   0.6902  (840.0/1217.0)
YES    143.0  1331.0  0.097   (143.0/1474.0)
Total  520.0  2171.0  0.3653  (983.0/2691.0)






    



[[0.985450211376883, 0.8556701030927835]]
[[0.6939187561627477, 0.6651802303976218]]
0.705312795104
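The test-set confusion matrix is enough to recompute the per-class and overall error rates by hand, along with accuracy, precision, recall and F1 at the max-F1 threshold. The sketch below does that with the four counts printed above. The two bracketed pairs printed after the matrix appear to be [threshold, value] pairs for the thresholds that maximize precision and accuracy (they line up with the `max precision` and `max accuracy` rows), so they are not expected to match the values computed here at the max-F1 threshold.

```python
# Counts from the test-set confusion matrix of the model without class balancing
tn, fp = 377, 840     # actual NO:  predicted NO, predicted YES
fn, tp = 143, 1331    # actual YES: predicted NO, predicted YES

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Error rates as laid out in the Error column of the matrix
error_no = fp / (tn + fp)
error_yes = fn / (fn + tp)
error_total = (fp + fn) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"error NO={error_no:.4f} YES={error_yes:.4f} total={error_total:.4f}")
```

The high recall and low precision at this threshold reflect how strongly the max-F1 cut-off favors predicting YES.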



In [11]:

    
print("\n\nWITH CLASS BALANCING\n")
model(rf_bal, air_test)









    




WITH CLASS BALANCING

H2OFrame with 2691 rows and 3 columns:

predict  YES  YES  YES  YES  NO   NO   NO   YES  YES  NO
NO       0.0  0.3  0.1  0.0  0.4  0.5  0.7  0.1  0.3  0.5
YES      1.0  0.7  0.9  1.0  0.6  0.5  0.3  0.9  0.7  0.5






    



ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.24831550935
R^2: -0.00240489657592
LogLoss: 0.758488823047
AUC: 0.693547371085
Gini: 0.38709474217

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.475092852495:

       NO     YES     Error   Rate
NO     269.0  948.0   0.779   (948.0/1217.0)
YES    85.0   1389.0  0.0577  (85.0/1474.0)
Total  354.0  2337.0  0.3839  (1033.0/2691.0)






    



Maximum Metrics: Maximum metrics at their respective thresholds

metric                      threshold  value  idx
max f1                      0.5        0.7    307.0
max f2                      0.3        0.9    379.0
max f0point5                0.7        0.7    184.0
max accuracy                0.7        0.7    210.0
max precision               1.0        0.85   1.0
max absolute_MCC            0.7        0.3    210.0
max min_per_class_accuracy  0.7        0.6    164.0






    



Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.475092852495:

       NO     YES     Error   Rate
NO     269.0  948.0   0.779   (948.0/1217.0)
YES    85.0   1389.0  0.0577  (85.0/1474.0)
Total  354.0  2337.0  0.3839  (1033.0/2691.0)






    



[[0.9962384300103982, 0.85]]
[[0.6673053431289202, 0.6540319583797845]]
0.693547371085
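The two test runs can be compared on their threshold-independent metrics. The Gini value in each report matches 2 * AUC - 1, which the sketch below verifies from the printed numbers before picking the model with the higher test AUC. The dictionary keys are illustrative labels, not H2O identifiers.

```python
# Test-set AUC and Gini as printed for each model
reported = {
    "without balancing": {"auc": 0.705312795104, "gini": 0.410625590208},
    "with balancing": {"auc": 0.693547371085, "gini": 0.38709474217},
}

for name, metrics in reported.items():
    # Gini coefficient recomputed from AUC
    gini_from_auc = 2 * metrics["auc"] - 1
    print(f"{name:<18} AUC={metrics['auc']:.4f} Gini={gini_from_auc:.4f}")

best = max(reported, key=lambda name: reported[name]["auc"])
print("higher test AUC:", best)
```

On this data set and seed, the model trained without class balancing has the slightly higher test AUC, while the balanced model has the lower test LogLoss; a single run with 10 trees is too small a sample to generalize from.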
