In [1]:
#----------------------------------------------------------------------
# Purpose:  Condition an airline dataset by filtering out rows where
#           the departure delay (DepDelay) is NA.
#
#           Then treat any departure delay longer than
#           minutesOfDelayWeTolerate minutes as delayed.
#----------------------------------------------------------------------

In [2]:
import h2o

In [3]:
h2o.init()


H2O cluster uptime: 8 seconds 719 milliseconds
H2O cluster version: 3.1.0.99999
H2O cluster name: spencer
H2O cluster total nodes: 1
H2O cluster total memory: 14.22 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

In [4]:
air = h2o.import_frame(h2o.locate("smalldata/airlines/allyears2k_headers.zip"))


Parse Progress: [##################################################] 100%
Imported /Users/spencer/0xdata/h2o-dev/smalldata/airlines/allyears2k_headers.zip. Parsed 43,978 rows and 31 cols

In [5]:
# Record the original dimensions; numCols is reused later as a column index.
numRows, numCols = air.dim()
print("Original dataset rows: {0}, columns: {1}".format(numRows, numCols))

# Predictor columns and the name given to the response column.
x_cols = ["Month", "DayofMonth", "DayOfWeek", "CRSDepTime", "CRSArrTime",
          "UniqueCarrier", "CRSElapsedTime", "Origin", "Dest", "Distance"]
y_col = "SynthDepDelayed"

# Keep only the rows where DepDelay is not NA.
noDepDelayedNAs = air[air["DepDelay"].isna() == 0]
rows, cols = noDepDelayedNAs.dim()
print("New dataset rows: {0}, columns: {1}".format(rows, cols))


Original dataset rows: 43978, columns: 31
New dataset rows: 42892, columns: 31
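
The NA filter above keeps a row only when its DepDelay value is present. A minimal plain-Python sketch of the same idea follows, using a few invented rows with None standing in for NA. The rows and the drop_missing helper are illustrative assumptions, not part of H2O.

```python
# Toy rows standing in for the airline frame; None plays the role of NA.
# These values are invented for illustration only.
rows = [
    {"Origin": "SFO", "DepDelay": 5},
    {"Origin": "JFK", "DepDelay": None},
    {"Origin": "ORD", "DepDelay": 42},
    {"Origin": "LAX", "DepDelay": None},
]


def drop_missing(records, column):
    """Return only the records whose value in `column` is not None."""
    return [r for r in records if r[column] is not None]


kept = drop_missing(rows, "DepDelay")
print("Original rows: {0}, rows kept: {1}".format(len(rows), len(kept)))
```

In the run above, the same kind of filter removed 43978 - 42892 = 1086 rows and left the column count unchanged.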

In [6]:
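minutesOfDelayWeTolerate = 15
# Append a column (DepDelay > threshold) after the last original column,
# i.e. at index numCols.
noDepDelayedNAs.cbind(noDepDelayedNAs["DepDelay"] > minutesOfDelayWeTolerate)
# Caution: numCols-1 indexes the last ORIGINAL column, not the column
# appended above, so this line appears to overwrite the appended column
# with a factor copy of that original column. The NO/YES labels in the
# confusion matrix below are consistent with that reading. Using
# noDepDelayedNAs[numCols] on the right-hand side would keep the
# thresholded column instead.
noDepDelayedNAs[numCols] = noDepDelayedNAs[numCols-1].asfactor()
noDepDelayedNAs.setName(numCols, y_col)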


Out[6]:
<h2o.frame.H2OFrame instance at 0x110ab5dd0>
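
The labeling rule described in the Purpose comment turns a numeric delay into a two-level class. A small plain-Python sketch of that rule follows, with invented delays and a hypothetical label_delays helper. The comparison is strict, so a delay of exactly 15 minutes is not counted as delayed.

```python
minutes_tolerated = 15  # same value as minutesOfDelayWeTolerate above


def label_delays(delays, threshold):
    """Map each delay in minutes to 1 (delayed) or 0 (not delayed)."""
    return [1 if d > threshold else 0 for d in delays]


sample_delays = [-3, 0, 15, 16, 120]  # invented values for illustration
labels = label_delays(sample_delays, minutes_tolerated)
print(labels)
```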

In [7]:
# Fit a GBM for a two-class response; the metrics shown below are
# reported on the training data.
gbm = h2o.gbm(x=noDepDelayedNAs[x_cols], y=noDepDelayedNAs[y_col], distribution="bernoulli")
gbm.show()


gbm Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Gradient Boosting Machine
Model Key:  GBMModel__a483db33cfbb1f796edd4eebd222436a

Model Summary:

number_of_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
50.0             34327.0              5.0        5.0        5.0         18.0        32.0        28.62

ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.191672317646
R^2: 0.232789480024
LogLoss: 0.565709063956
AUC: 0.785424985184
Gini: 0.570849970367

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.403241555699:

       NO       YES      Error   Rate
NO     10782.0  10105.0  0.4838  (10105.0/20887.0)
YES    3166.0   18839.0  0.1439  (3166.0/22005.0)
Total  13948.0  28944.0  0.6277  (0.6277/42892.0)
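
The per-class error rates and the max-F1 value in this output can be recomputed from the raw counts in the confusion matrix above. The sketch below does that in plain Python, and also checks that the reported Gini equals 2 * AUC - 1 for these numbers. The variable names are illustrative. The counts and the AUC are copied from the output.

```python
# Counts copied from the confusion matrix (rows = actual, columns = predicted).
tn, fp = 10782.0, 10105.0   # actual NO:  predicted NO, predicted YES
fn, tp = 3166.0, 18839.0    # actual YES: predicted NO, predicted YES

# Per-class and overall misclassification rates.
error_no = fp / (tn + fp)
error_yes = fn / (fn + tp)
error_total = (fp + fn) / (tn + fp + fn + tp)

# Precision, recall, and F1 for the YES class at this threshold.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# For the values printed above, Gini is a rescaling of AUC.
auc = 0.785424985184
gini = 2 * auc - 1

print("error NO:    {0:.4f}".format(error_no))
print("error YES:   {0:.4f}".format(error_yes))
print("error total: {0:.4f}".format(error_total))
print("F1:          {0:.6f}".format(f1))
print("Gini:        {0:.6f}".format(gini))
```

The overall error computed this way is about 0.3094, which matches the final training_classification_error in the scoring history. The 0.6277 printed in the Total row does not follow from these counts.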

Maximum Metrics:

metric                      threshold       value           idx
max f1                      0.403241555699  0.739523837563  262.0
max f2                      0.23618642176   0.847667866646  347.0
max f0point5                0.555177550947  0.727363079981  182.0
max accuracy                0.500367989264  0.711531287886  213.0
max precision               0.956628787713  1.0             0.0
max absolute_MCC            0.500367989264  0.422517440941  213.0
max min_per_class_accuracy  0.508279003471  0.710106764973  208.0

Scoring History:

timestamp            duration   number_of_trees  training_MSE    training_logloss  training_AUC    training_classification_error
2015-06-27 16:04:33  0.530 sec  1.0              0.244362718818  0.681856365041    0.692503203228  0.413433740558
2015-06-27 16:04:33  0.638 sec  2.0              0.239916651394  0.672915590718    0.700446640048  0.410845845379
2015-06-27 16:04:33  0.830 sec  3.0              0.235500532554  0.663968419806    0.712157594375  0.391051944419
2015-06-27 16:04:33  1.110 sec  4.0              0.231804609045  0.656396477835    0.717626212056  0.387158444465
2015-06-27 16:04:33  1.171 sec  5.0              0.228800636054  0.6502322442      0.72532125588   0.377040007461
---                  ---        ---              ---             ---               ---             ---
2015-06-27 16:04:36  3.707 sec  46.0             0.192702949749  0.568262274753    0.783297917543  0.305744661009
2015-06-27 16:04:36  3.764 sec  47.0             0.192356272166  0.567420552198    0.783956222296  0.304695514315
2015-06-27 16:04:36  3.827 sec  48.0             0.192133014116  0.566843152503    0.784388534154  0.302387391588
2015-06-27 16:04:36  3.889 sec  49.0             0.191914492469  0.566305526738    0.78487237245   0.310151077124
2015-06-27 16:04:36  3.946 sec  50.0             0.191672317646  0.565709063956    0.785424985184  0.309405017253

Variable Importances:

variable        relative_importance  scaled_importance  percentage
Origin          6861.91552734        1.0                0.410239582441
Dest            4551.00048828        0.663225956389     0.272081539413
DayofMonth      2025.62207031        0.295197756696     0.121101804445
UniqueCarrier   1279.63720703        0.186483963834     0.076503103455
CRSArrTime      714.227416992        0.104085719818     0.0427000822361
CRSDepTime      647.433837891        0.0943517645052    0.0387068284732
DayOfWeek       408.238586426        0.0594933856004    0.0244065416667
CRSElapsedTime  134.11907959         0.0195454285404    0.00801830844303
Month           73.2622070312        0.010676640763     0.00437998064847
Distance        31.148765564         0.00453936884531   0.00186222877964
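
The scaled_importance and percentage columns above can be reproduced from relative_importance alone. For these numbers, scaled_importance is each value divided by the largest value, and percentage is each value divided by the column total. The sketch below checks that in plain Python. The dictionary names are illustrative, and the values are copied from the table.

```python
# relative_importance values copied from the table above.
relative_importance = [
    ("Origin", 6861.91552734),
    ("Dest", 4551.00048828),
    ("DayofMonth", 2025.62207031),
    ("UniqueCarrier", 1279.63720703),
    ("CRSArrTime", 714.227416992),
    ("CRSDepTime", 647.433837891),
    ("DayOfWeek", 408.238586426),
    ("CRSElapsedTime", 134.11907959),
    ("Month", 73.2622070312),
    ("Distance", 31.148765564),
]

largest = max(value for _, value in relative_importance)
total = sum(value for _, value in relative_importance)

# Scale by the largest value, and express each value as a share of the total.
scaled = {name: value / largest for name, value in relative_importance}
share = {name: value / total for name, value in relative_importance}

for name, _ in relative_importance:
    print("{0:<16}{1:>10.6f}{2:>10.6f}".format(name, scaled[name], share[name]))
```

Read this way, Origin and Dest together account for roughly 68% of the total relative importance in this model.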