Machine Learning with H2O - Tutorial 3a: Regression Models (Basics)


Objective:

  • This tutorial explains how to build regression models with four different H2O algorithms.

Wine Quality Dataset:

  • White wine subset of the UCI Wine Quality dataset (winequality-white.csv): 11 physicochemical features plus a quality score.

Algorithms:

  1. GLM
  2. DRF
  3. GBM
  4. DNN

Full Technical Reference:



In [1]:
# Start and connect to a local H2O cluster
suppressPackageStartupMessages(library(h2o))
h2o.init(nthreads = -1)


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/Rtmp96TV0q/h2o_joe_started_from_r.out
    /tmp/Rtmp96TV0q/h2o_joe_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 795 milliseconds 
    H2O cluster version:        3.10.3.5 
    H2O cluster version age:    10 days  
    H2O cluster name:           H2O_started_from_R_joe_hwk127 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   5.21 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.3.2 (2016-10-31) 



In [2]:
# Import wine quality data from a local CSV file
wine = h2o.importFile("winequality-white.csv")
head(wine, 5)


  |======================================================================| 100%
fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
7.0            0.27              0.36         20.7            0.045      45                   170                   1.0010   3.00  0.45       8.8      6
6.3            0.30              0.34          1.6            0.049      14                   132                   0.9940   3.30  0.49       9.5      6
8.1            0.28              0.40          6.9            0.050      30                    97                   0.9951   3.26  0.44       10.1     6
7.2            0.23              0.32          8.5            0.058      47                   186                   0.9956   3.19  0.40       9.9      6
7.2            0.23              0.32          8.5            0.058      47                   186                   0.9956   3.19  0.40       9.9      6

In [3]:
# Define features (or predictors)
features = colnames(wine)               # start with all columns
features = setdiff(features, 'quality') # exclude the target 'quality'
features


  1. 'fixed acidity'
  2. 'volatile acidity'
  3. 'citric acid'
  4. 'residual sugar'
  5. 'chlorides'
  6. 'free sulfur dioxide'
  7. 'total sulfur dioxide'
  8. 'density'
  9. 'pH'
  10. 'sulphates'
  11. 'alcohol'

In [4]:
# Split the H2O data frame into training/test sets
# so we can evaluate held-out (out-of-sample) performance
wine_split = h2o.splitFrame(wine, ratios = 0.8, seed = 1234)

wine_train = wine_split[[1]] # using 80% for training
wine_test = wine_split[[2]]  # using the remaining 20% for held-out evaluation

In [5]:
dim(wine_train)


  1. 3932
  2. 12

In [6]:
dim(wine_test)


  1. 966
  2. 12


Generalized Linear Model


In [7]:
# Build a Generalized Linear Model (GLM) with default settings
glm_default = h2o.glm(x = features,
                      y = 'quality',
                      training_frame = wine_train,
                      family = 'gaussian', 
                      model_id = 'glm_default')


  |======================================================================| 100%

In [8]:
# Check the model performance on training dataset
glm_default


Model Details:
==============

H2ORegressionModel: glm
Model ID:  glm_default 
GLM Model: summary
    family     link                                regularization
1 gaussian identity Elastic Net (alpha = 0.5, lambda = 7.744E-4 )
  number_of_predictors_total number_of_active_predictors number_of_iterations
1                         11                          11                    0
   training_frame
1 RTMP_sid_a9d0_2

Coefficients: glm coefficients
                  names coefficients standardized_coefficients
1             Intercept   136.516733                  5.878688
2         fixed acidity     0.040540                  0.034256
3      volatile acidity    -1.957825                 -0.198150
4           citric acid    -0.064298                 -0.007777
5        residual sugar     0.078084                  0.397523
6             chlorides    -0.723135                 -0.015638
7   free sulfur dioxide     0.002588                  0.044374
8  total sulfur dioxide    -0.000352                 -0.015076
9               density  -136.026688                 -0.409518
10                   pH     0.584229                  0.088671
11            sulphates     0.654807                  0.074764
12              alcohol     0.206873                  0.254962

H2ORegressionMetrics: glm
** Reported on training data. **

MSE:  0.5663261
RMSE:  0.7525464
MAE:  0.5855739
RMSLE:  0.111358
Mean Residual Deviance :  0.5663261
R^2 :  0.2851691
Null Deviance :3115.134
Null D.o.F. :3931
Residual Deviance :2226.794
Residual D.o.F. :3920
AIC :8948.855




In [9]:
# Check the model performance on test dataset
h2o.performance(glm_default, wine_test)


H2ORegressionMetrics: glm

MSE:  0.5546398
RMSE:  0.7447414
MAE:  0.5795791
RMSLE:  0.1107966
Mean Residual Deviance :  0.5546398
R^2 :  0.2618493
Null Deviance :725.8587
Null D.o.F. :965
Residual Deviance :535.782
Residual D.o.F. :954
AIC :2197.994


Distributed Random Forest


In [10]:
# Build a Distributed Random Forest (DRF) model with default settings
drf_default = h2o.randomForest(x = features,
                               y = 'quality',
                               training_frame = wine_train,
                               seed = 1234,
                               model_id = 'drf_default')


  |======================================================================| 100%

In [11]:
# Check the DRF model summary
drf_default


Model Details:
==============

H2ORegressionModel: drf
Model ID:  drf_default 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              50                       50              609178        20
  max_depth mean_depth min_leaves max_leaves mean_leaves
1        20   20.00000        913       1012   964.12000


H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  0.3934958
RMSE:  0.6272924
MAE:  0.4495529
RMSLE:  0.09432345
Mean Residual Deviance :  0.3934958




In [12]:
# Check the model performance on test dataset
h2o.performance(drf_default, wine_test)


H2ORegressionMetrics: drf

MSE:  0.3711312
RMSE:  0.6092054
MAE:  0.4351009
RMSLE:  0.09161313
Mean Residual Deviance :  0.3711312


Gradient Boosting Machines


In [13]:
# Build a Gradient Boosting Machine (GBM) model with default settings
gbm_default = h2o.gbm(x = features,
                      y = 'quality',
                      training_frame = wine_train,
                      seed = 1234,
                      model_id = 'gbm_default')


  |======================================================================| 100%

In [14]:
# Check the GBM model summary
gbm_default


Model Details:
==============

H2ORegressionModel: gbm
Model ID:  gbm_default 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              50                       50               17580         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          9         30    22.80000


H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.335015
RMSE:  0.5788048
MAE:  0.4542062
RMSLE:  0.0856436
Mean Residual Deviance :  0.335015




In [15]:
# Check the model performance on test dataset
h2o.performance(gbm_default, wine_test)


H2ORegressionMetrics: gbm

MSE:  0.4551121
RMSE:  0.67462
MAE:  0.5219768
RMSLE:  0.1001376
Mean Residual Deviance :  0.4551121


H2O Deep Learning


In [16]:
# Build a Deep Learning (Deep Neural Networks, DNN) model with default settings
# (note: H2O Deep Learning is not reproducible by default due to
#  multi-threaded stochastic training, so your numbers may differ)
dnn_default = h2o.deeplearning(x = features,
                               y = 'quality',
                               training_frame = wine_train,
                               model_id = 'dnn_default')


  |======================================================================| 100%

In [17]:
# Check the DNN model summary
dnn_default


Model Details:
==============

H2ORegressionModel: deeplearning
Model ID:  dnn_default 
Status of Neuron Layers: predicting quality, regression, gaussian distribution, Quadratic loss, 42,801 weights/biases, 511.8 KB, 39,320 training samples, mini-batch size 1
  layer units      type dropout       l1       l2 mean_rate rate_rms momentum
1     1    11     Input  0.00 %                                              
2     2   200 Rectifier  0.00 % 0.000000 0.000000  0.006359 0.002181 0.000000
3     3   200 Rectifier  0.00 % 0.000000 0.000000  0.071259 0.086189 0.000000
4     4     1    Linear         0.000000 0.000000  0.000775 0.000349 0.000000
  mean_weight weight_rms mean_bias bias_rms
1                                          
2   -0.002792   0.117704  0.393787 0.052961
3   -0.019490   0.074954  0.952865 0.024638
4   -0.008303   0.064995  0.092555 0.000000


H2ORegressionMetrics: deeplearning
** Reported on training data. **
** Metrics reported on full training frame **

MSE:  0.4670376
RMSE:  0.6834015
MAE:  0.5313248
RMSLE:  0.1005671
Mean Residual Deviance :  0.4670376




In [18]:
# Check the model performance on test dataset
h2o.performance(dnn_default, wine_test)


H2ORegressionMetrics: deeplearning

MSE:  0.5025884
RMSE:  0.7089347
MAE:  0.5419651
RMSLE:  0.1047979
Mean Residual Deviance :  0.5025884


Making Predictions


In [19]:
# Use GLM model to make predictions
yhat_test_glm = h2o.predict(glm_default, wine_test)
head(yhat_test_glm)


  |======================================================================| 100%
predict
5.761094
5.767213
5.643247
5.857642
5.779668
5.518598

In [20]:
# Use DRF model to make predictions
yhat_test_drf = h2o.predict(drf_default, wine_test)
head(yhat_test_drf)


  |======================================================================| 100%
predict
5.824067
5.662857
5.380000
6.540000
5.880000
5.344501

In [21]:
# Use GBM model to make predictions
yhat_test_gbm = h2o.predict(gbm_default, wine_test)
head(yhat_test_gbm)


  |======================================================================| 100%
predict
5.846412
6.027371
5.289532
6.272658
5.630780
5.374139

In [22]:
# Use DNN model to make predictions
yhat_test_dnn = h2o.predict(dnn_default, wine_test)
head(yhat_test_dnn)


  |======================================================================| 100%
predict
5.585814
5.595579
5.220713
6.678417
5.635660
5.157082