Machine Learning with H2O - Tutorial 3a: Regression Models (Basics)


Objective:

  • This tutorial explains how to build regression models with four different H2O algorithms.

Wine Quality Dataset:

  • The white-wine subset of the UCI Wine Quality dataset (winequality-white.csv): eleven physicochemical measurements as features and a sensory quality score as the target.
Algorithms:

  1. GLM (Generalized Linear Model)
  2. DRF (Distributed Random Forest)
  3. GBM (Gradient Boosting Machine)
  4. DNN (Deep Neural Network, H2O Deep Learning)

Full Technical Reference:

  • H2O documentation: http://docs.h2o.ai/

In [1]:
# Start and connect to a local H2O cluster
import h2o
h2o.init(nthreads = -1)  # -1 means use all available CPU cores


Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from /home/joe/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp653fsgix
  JVM stdout: /tmp/tmp653fsgix/h2o_joe_started_from_python.out
  JVM stderr: /tmp/tmp653fsgix/h2o_joe_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: 01 secs
H2O cluster version: 3.10.5.2
H2O cluster version age: 10 days
H2O cluster name: H2O_from_python_joe_4o70lg
H2O cluster total nodes: 1
H2O cluster free memory: 5.210 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
Python version: 3.6.1 final


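If you need more control over the cluster, h2o.init() takes additional arguments. A minimal sketch, assuming a standard local setup (these calls are illustrative and not run in this tutorial):

# (not run) Cap the JVM memory instead of using the default
# h2o.init(nthreads = -1, max_mem_size = "4G")

# (not run) Connect to an already-running cluster at a known address
# h2o.init(ip = "127.0.0.1", port = 54321)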

In [2]:
# Import wine quality data from a local CSV file
wine = h2o.import_file("winequality-white.csv")
wine.head(5)


Parse progress: |█████████████████████████████████████████████████████████| 100%
fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
7              0.27              0.36         20.7            0.045      45                   170                   1.001    3     0.45       8.8      6
6.3            0.3               0.34         1.6             0.049      14                   132                   0.994    3.3   0.49       9.5      6
8.1            0.28              0.4          6.9             0.05       30                   97                    0.9951   3.26  0.44       10.1     6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.4        9.9      6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.4        9.9      6
Out[2]:

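Before defining the model inputs, it can be worth sanity-checking the parsed frame. A short sketch using standard H2OFrame methods (not run here):

# (not run) Column types, min/max/mean and missing counts per column
# wine.describe()

# (not run) Check the frame dimensions (rows, columns)
# wine.shape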

In [3]:
# Define features (or predictors)
features = list(wine.columns) # we want to use all the information
features.remove('quality')    # we need to exclude the target 'quality' (otherwise there is nothing to predict)
features


Out[3]:
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

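Equivalently, the same feature list can be built in a single line; a small sketch of the list-comprehension form:

# Keep every column except the target 'quality'
features = [col for col in wine.columns if col != 'quality']
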
In [4]:
# Split the H2O data frame into training/test sets
# so we can evaluate out-of-sample (holdout) performance
wine_split = wine.split_frame(ratios = [0.8], seed = 1234)

wine_train = wine_split[0] # using 80% for training
wine_test = wine_split[1]  # using the remaining 20% for holdout evaluation

In [5]:
wine_train.shape


Out[5]:
(3932, 12)

In [6]:
wine_test.shape


Out[6]:
(966, 12)

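split_frame can also carve out a validation set in the same call. A sketch of a hypothetical 70/15/15 train/validation/test split (the variable names here are illustrative and, apart from one later sketch, not used below):

# (not run) Three-way split: 70% train, 15% validation, 15% test
# wine_train3, wine_valid3, wine_test3 = wine.split_frame(ratios = [0.7, 0.15], seed = 1234)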

Generalized Linear Model


In [7]:
# Build a Generalized Linear Model (GLM) with default settings

# Import the estimator class for GLM
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Set up GLM for regression
glm_default = H2OGeneralizedLinearEstimator(family = 'gaussian', model_id = 'glm_default')

# Use .train() to build the model
glm_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)


glm Model Build progress: |███████████████████████████████████████████████| 100%

In [8]:
# Check the model performance on training dataset
glm_default


Model Details
=============
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  glm_default


ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 0.5663260600439028
RMSE: 0.7525463839816805
MAE: 0.5855739117180464
RMSLE: 0.11135798916908822
R^2: 0.28516909778801736
Mean Residual Deviance: 0.5663260600439028
Null degrees of freedom: 3931
Residual degrees of freedom: 3920
Null deviance: 3115.1340284842345
Residual deviance: 2226.794068092626
AIC: 8948.855269434132
Scoring History: 
timestamp            duration   iteration  negative_log_likelihood  objective
2017-06-29 23:24:36  0.000 sec  0          3115.1340285             0.7922518
Out[8]:


In [9]:
# Check the model performance on test dataset
glm_default.model_performance(wine_test)


ModelMetricsRegressionGLM: glm
** Reported on test data. **

MSE: 0.5546397919709444
RMSE: 0.7447414262486977
MAE: 0.5795791157106437
RMSLE: 0.11079661971451717
R^2: 0.26184927981796147
Mean Residual Deviance: 0.5546397919709444
Null degrees of freedom: 965
Residual degrees of freedom: 954
Null deviance: 725.858730540242
Residual deviance: 535.7820390439323
AIC: 2197.9936843132646
Out[9]:

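The model above relies on H2O's default regularization. If you want to experiment with elastic-net penalties yourself, alpha and lambda_search are the relevant arguments; a sketch with illustrative (untuned) settings:

# (not run) Elastic-net GLM with an automatic search over penalty strengths
# glm_enet = H2OGeneralizedLinearEstimator(family = 'gaussian',
#                                          model_id = 'glm_enet',
#                                          alpha = 0.5,          # mix of L1 and L2 penalties
#                                          lambda_search = True) # search a grid of lambda values
# glm_enet.train(x = features, y = 'quality', training_frame = wine_train)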

Distributed Random Forest


In [10]:
# Build a Distributed Random Forest (DRF) model with default settings

# Import the estimator class for DRF
from h2o.estimators.random_forest import H2ORandomForestEstimator

# Set up DRF for regression
# Add a seed for reproducibility
drf_default = H2ORandomForestEstimator(model_id = 'drf_default', seed = 1234)

# Use .train() to build the model
drf_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)


drf Model Build progress: |███████████████████████████████████████████████| 100%

In [11]:
# Check the DRF model summary
drf_default


Model Details
=============
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  drf_default


ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.3934957618265588
RMSE: 0.6272924053633671
MAE: 0.449552888371567
RMSLE: 0.09432344617917691
Mean Residual Deviance: 0.3934957618265588
Scoring History: 
timestamp            duration   number_of_trees  training_rmse  training_mae  training_deviance
2017-06-29 23:24:36  0.013 sec  0.0              nan            nan           nan
2017-06-29 23:24:36  0.197 sec  1.0              0.8897127      0.5615172     0.7915887
2017-06-29 23:24:37  0.254 sec  2.0              0.8676163      0.5541223     0.7527581
2017-06-29 23:24:37  0.297 sec  3.0              0.8480593      0.5495825     0.7192046
2017-06-29 23:24:37  0.353 sec  4.0              0.8330926      0.5457844     0.6940433
---                  ---        ---              ---            ---           ---
2017-06-29 23:24:38  1.350 sec  46.0             0.6296173      0.4502195     0.3964179
2017-06-29 23:24:38  1.368 sec  47.0             0.6287729      0.4501413     0.3953554
2017-06-29 23:24:38  1.383 sec  48.0             0.6285979      0.4499853     0.3951354
2017-06-29 23:24:38  1.397 sec  49.0             0.6282803      0.4500892     0.3947361
2017-06-29 23:24:38  1.412 sec  50.0             0.6272924      0.4495529     0.3934958
See the whole table with table.as_data_frame()
Variable Importances: 
variable              relative_importance  scaled_importance  percentage
alcohol               21593.0800781        1.0                0.2072105
density               11848.2773438        0.5487071          0.1136979
volatile acidity      11009.4531250        0.5098602          0.1056484
free sulfur dioxide   9610.8203125         0.4450880          0.0922269
total sulfur dioxide  8106.3881836         0.3754160          0.0777901
chlorides             7689.6831055         0.3561179          0.0737914
pH                    7537.3427734         0.3490629          0.0723295
fixed acidity         6907.3925781         0.3198892          0.0662844
citric acid           6864.6484375         0.3179096          0.0658742
sulphates             6596.7392578         0.3055025          0.0633033
residual sugar        6444.6044922         0.2984569          0.0618434
Out[11]:
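
The importances above can also be plotted rather than read as a table; a one-line sketch (requires matplotlib):

# (not run) Bar chart of the variable importances listed above
# drf_default.varimp_plot()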


In [12]:
# Check the model performance on test dataset
drf_default.model_performance(wine_test)


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 0.3711311779836362
RMSE: 0.6092053660167778
MAE: 0.43510094840087254
RMSLE: 0.09161313127783179
Mean Residual Deviance: 0.3711311779836362
Out[12]:

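The default forest stops at 50 trees (see the scoring history above). ntrees and max_depth are the usual first parameters to adjust; a sketch with illustrative values:

# (not run) A larger forest with deeper trees
# drf_tuned = H2ORandomForestEstimator(model_id = 'drf_tuned',
#                                      ntrees = 100,    # default is 50
#                                      max_depth = 30,  # default is 20
#                                      seed = 1234)
# drf_tuned.train(x = features, y = 'quality', training_frame = wine_train)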

Gradient Boosting Machine


In [13]:
# Build a Gradient Boosting Machine (GBM) model with default settings

# Import the estimator class for GBM
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Set up GBM for regression
# Add a seed for reproducibility
gbm_default = H2OGradientBoostingEstimator(model_id = 'gbm_default', seed = 1234)

# Use .train() to build the model
gbm_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)


gbm Model Build progress: |███████████████████████████████████████████████| 100%

In [14]:
# Check the GBM model summary
gbm_default


Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_default


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.33501502713240405
RMSE: 0.5788048264591477
MAE: 0.4542062463255889
RMSLE: 0.08564359662538763
Mean Residual Deviance: 0.33501502713240405
Scoring History: 
timestamp            duration   number_of_trees  training_rmse  training_mae  training_deviance
2017-06-29 23:24:38  0.005 sec  0.0              0.8900853      0.6768335     0.7922518
2017-06-29 23:24:38  0.064 sec  1.0              0.8576751      0.6503625     0.7356067
2017-06-29 23:24:38  0.079 sec  2.0              0.8295027      0.6291754     0.6880747
2017-06-29 23:24:38  0.091 sec  3.0              0.8058486      0.6149556     0.6493920
2017-06-29 23:24:38  0.103 sec  4.0              0.7850495      0.6037286     0.6163028
---                  ---        ---              ---            ---           ---
2017-06-29 23:24:39  0.516 sec  46.0             0.5851965      0.4599527     0.3424549
2017-06-29 23:24:39  0.531 sec  47.0             0.5843875      0.4590260     0.3415087
2017-06-29 23:24:39  0.540 sec  48.0             0.5824507      0.4573919     0.3392489
2017-06-29 23:24:39  0.553 sec  49.0             0.5811694      0.4563671     0.3377579
2017-06-29 23:24:39  0.562 sec  50.0             0.5788048      0.4542062     0.3350150
See the whole table with table.as_data_frame()
Variable Importances: 
variable              relative_importance  scaled_importance  percentage
alcohol               3482.7805176         1.0                0.3679474
volatile acidity      1540.7117920         0.4423798          0.1627725
free sulfur dioxide   1112.0415039         0.3192970          0.1174845
residual sugar        471.1806030          0.1352886          0.0497791
pH                    464.7217407          0.1334341          0.0490967
total sulfur dioxide  442.2509766          0.1269822          0.0467227
fixed acidity         441.9790955          0.1269041          0.0466940
chlorides             416.6728516          0.1196380          0.0440205
density               377.4411011          0.1083735          0.0398757
citric acid           372.0686646          0.1068309          0.0393082
sulphates             343.5827942          0.0986519          0.0362987
Out[14]:


In [15]:
# Check the model performance on test dataset
gbm_default.model_performance(wine_test)


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.45511211588709155
RMSE: 0.6746199788674299
MAE: 0.5219768028633305
RMSLE: 0.10013755931021842
Mean Residual Deviance: 0.45511211588709155
Out[15]:

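Note the gap between the GBM's training RMSE (0.579) and test RMSE (0.675): the default settings overfit this dataset. A common remedy is a lower learning rate combined with early stopping on a separate validation frame; a sketch with illustrative settings (wine_valid3 refers to the hypothetical three-way split sketched after In [6]):

# (not run) Slower learning with early stopping
# gbm_es = H2OGradientBoostingEstimator(model_id = 'gbm_es',
#                                       ntrees = 500,             # upper bound; early stopping decides
#                                       learn_rate = 0.05,        # default is 0.1
#                                       stopping_rounds = 5,      # stop after 5 rounds without improvement
#                                       stopping_metric = 'RMSE',
#                                       seed = 1234)
# gbm_es.train(x = features, y = 'quality',
#              training_frame = wine_train,
#              validation_frame = wine_valid3)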

H2O Deep Learning


In [16]:
# Build a Deep Learning (deep neural network, DNN) model with default settings

# Import the estimator class for DNN
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Set up DNN for regression
dnn_default = H2ODeepLearningEstimator(model_id = 'dnn_default')

# (not run) Set 'reproducible' to True if you want to reproduce the results
# The model will then be built using a single thread (this can be very slow)
# dnn_default = H2ODeepLearningEstimator(model_id = 'dnn_default', reproducible = True)

# Use .train() to build the model
dnn_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)


deeplearning Model Build progress: |██████████████████████████████████████| 100%

In [17]:
# Check the DNN model summary
dnn_default


Model Details
=============
H2ODeepLearningEstimator :  Deep Learning
Model Key:  dnn_default


ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 0.5097094126644901
RMSE: 0.7139393620360837
MAE: 0.567596403224839
RMSLE: 0.10660752490224334
Mean Residual Deviance: 0.5097094126644901
Scoring History: 
timestamp            duration   training_speed  epochs  iterations  samples  training_rmse  training_deviance  training_mae
2017-06-29 23:24:39  0.000 sec  None            0.0     0           0.0      nan            nan                nan
2017-06-29 23:24:40  1.741 sec  5859 obs/sec    1.0     1           3932.0   0.7740244      0.5991138          0.6107331
2017-06-29 23:24:42  3.580 sec  15964 obs/sec   10.0    10          39320.0  0.7139394      0.5097094          0.5675964
Variable Importances: 
variable              relative_importance  scaled_importance  percentage
volatile acidity      1.0                  1.0                0.1003347
residual sugar        0.9891271            0.9891271          0.0992437
free sulfur dioxide   0.9823452            0.9823452          0.0985633
citric acid           0.9510297            0.9510297          0.0954212
alcohol               0.9213510            0.9213510          0.0924434
total sulfur dioxide  0.9175226            0.9175226          0.0920593
chlorides             0.9009342            0.9009342          0.0903949
density               0.8977060            0.8977060          0.0900710
pH                    0.8423673            0.8423673          0.0845186
fixed acidity         0.8117161            0.8117161          0.0814433
sulphates             0.7525467            0.7525467          0.0755065
Out[17]:


In [18]:
# Check the model performance on test dataset
dnn_default.model_performance(wine_test)


ModelMetricsRegression: deeplearning
** Reported on test data. **

MSE: 0.5578467028625466
RMSE: 0.7468913594777667
MAE: 0.5960455253073035
RMSLE: 0.11143449624119137
Mean Residual Deviance: 0.5578467028625466
Out[18]:

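The default network uses two hidden layers of 200 units each and trains for 10 epochs (see the scoring history above). Both are easy to change; a sketch with illustrative values:

# (not run) A smaller network trained for more epochs
# dnn_custom = H2ODeepLearningEstimator(model_id = 'dnn_custom',
#                                       hidden = [50, 50],  # default is [200, 200]
#                                       epochs = 50)        # default is 10
# dnn_custom.train(x = features, y = 'quality', training_frame = wine_train)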

Making Predictions


In [19]:
# Use GLM model to make predictions
yhat_test_glm = glm_default.predict(wine_test)
yhat_test_glm.head(5)


glm prediction progress: |████████████████████████████████████████████████| 100%
predict
5.76109
5.76721
5.64325
5.85764
5.77967
Out[19]:


In [20]:
# Use DRF model to make predictions
yhat_test_drf = drf_default.predict(wine_test)
yhat_test_drf.head(5)


drf prediction progress: |████████████████████████████████████████████████| 100%
predict
5.82407
5.66286
5.38
6.54
5.88
Out[20]:


In [21]:
# Use GBM model to make predictions
yhat_test_gbm = gbm_default.predict(wine_test)
yhat_test_gbm.head(5)


gbm prediction progress: |████████████████████████████████████████████████| 100%
predict
5.84641
6.02737
5.28953
6.27266
5.63078
Out[21]:


In [22]:
# Use DNN model to make predictions
yhat_test_dnn = dnn_default.predict(wine_test)
yhat_test_dnn.head(5)


deeplearning prediction progress: |███████████████████████████████████████| 100%
predict
5.96835
6.07601
5.86781
6.72332
5.99686
Out[22]:
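
To line any of these predictions up against the true quality scores, the prediction frame can be column-bound to the test target, or exported to pandas; a short sketch using standard H2OFrame methods:

# (not run) Put the GLM predictions next to the actual quality scores
# comparison = wine_test['quality'].cbind(yhat_test_glm)
# comparison.head(5)

# (not run) Pull the comparison into pandas for further analysis
# comparison_df = comparison.as_data_frame()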