Machine Learning with H2O - Tutorial 3a: Regression Models (Basics)


Objective:

  • This tutorial explains how to build regression models with four different H2O algorithms.

Wine Quality Dataset:

  • The white-wine subset of the UCI Wine Quality dataset (winequality-white.csv): eleven physicochemical measurements as features and a sensory quality score as the target.
Algorithms:

  1. GLM (Generalized Linear Model)
  2. DRF (Distributed Random Forest)
  3. GBM (Gradient Boosting Machine)
  4. DNN (Deep Neural Network, H2O Deep Learning)

Full Technical Reference:

  • H2O documentation: http://docs.h2o.ai/

In [1]:
# Start and connect to a local H2O cluster
import h2o
h2o.init(nthreads = -1)  # -1 means use all available CPU cores


Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from /home/joe/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp653fsgix
  JVM stdout: /tmp/tmp653fsgix/h2o_joe_started_from_python.out
  JVM stderr: /tmp/tmp653fsgix/h2o_joe_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: 01 secs
H2O cluster version: 3.10.5.2
H2O cluster version age: 10 days
H2O cluster name: H2O_from_python_joe_4o70lg
H2O cluster total nodes: 1
H2O cluster free memory: 5.210 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
Python version: 3.6.1 final


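If you need more control over the cluster, h2o.init() takes additional arguments. A minimal sketch, assuming a standard local setup (these calls are illustrative and not run in this tutorial):

# (not run) Cap the JVM memory instead of using the default
# h2o.init(nthreads = -1, max_mem_size = "4G")

# (not run) Connect to an already-running cluster at a known address
# h2o.init(ip = "127.0.0.1", port = 54321)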

In [2]:
# Import wine quality data from a local CSV file
wine = h2o.import_file("winequality-white.csv")
wine.head(5)


Parse progress: |█████████████████████████████████████████████████████████| 100%
fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
7              0.27              0.36         20.7            0.045      45                   170                   1.001    3     0.45       8.8      6
6.3            0.3               0.34         1.6             0.049      14                   132                   0.994    3.3   0.49       9.5      6
8.1            0.28              0.4          6.9             0.05       30                   97                    0.9951   3.26  0.44       10.1     6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.4        9.9      6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.4        9.9      6
Out[2]:

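Before defining the model inputs, it can be worth sanity-checking the parsed frame. A short sketch using standard H2OFrame methods (not run here):

# (not run) Column types, min/max/mean and missing counts per column
# wine.describe()

# (not run) Check the frame dimensions (rows, columns)
# wine.shape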

In [3]:
# Define features (or predictors)
features = list(wine.columns) # we want to use all the information
features.remove('quality')    # we need to exclude the target 'quality' (otherwise there is nothing to predict)
features


Out[3]:
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

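Equivalently, the same feature list can be built in a single line; a small sketch of the list-comprehension form:

# Keep every column except the target 'quality'
features = [col for col in wine.columns if col != 'quality']
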
In [4]:
# Split the H2O data frame into training/test sets
# so we can evaluate out-of-sample (holdout) performance
wine_split = wine.split_frame(ratios = [0.8], seed = 1234)

wine_train = wine_split[0] # using 80% for training
wine_test = wine_split[1]  # using the remaining 20% for holdout evaluation

In [5]:
wine_train.shape


Out[5]:
(3932, 12)

In [6]:
wine_test.shape


Out[6]:
(966, 12)

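split_frame can also carve out a validation set in the same call. A sketch of a hypothetical 70/15/15 train/validation/test split (the variable names here are illustrative and, apart from one later sketch, not used below):

# (not run) Three-way split: 70% train, 15% validation, 15% test
# wine_train3, wine_valid3, wine_test3 = wine.split_frame(ratios = [0.7, 0.15], seed = 1234)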

Generalized Linear Model


In [7]:
# Build a Generalized Linear Model (GLM) with default settings

# Import the estimator class for GLM
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Set up GLM for regression
glm_default = H2OGeneralizedLinearEstimator(family = 'gaussian', model_id = 'glm_default')

# Use .train() to build the model
glm_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)


glm Model Build progress: |███████████████████████████████████████████████| 100%

In [8]:
# Check the model performance on training dataset
glm_default


Model Details
=============
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  glm_default


ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 0.5663260600439028
RMSE: 0.7525463839816805
MAE: 0.5855739117180464
RMSLE: 0.11135798916908822
R^2: 0.28516909778801736
Mean Residual Deviance: 0.5663260600439028
Null degrees of freedom: 3931
Residual degrees of freedom: 3920
Null deviance: 3115.1340284842345
Residual deviance: 2226.794068092626
AIC: 8948.855269434132
Scoring History: 
timestamp            duration   iteration  negative_log_likelihood  objective
2017-06-29 23:24:36  0.000 sec  0          3115.1340285             0.7922518
Out[8]:


In [9]:
# Check the model performance on test dataset
glm_default.model_performance(wine_test)


ModelMetricsRegressionGLM: glm
** Reported on test data. **

MSE: 0.5546397919709444
RMSE: 0.7447414262486977
MAE: 0.5795791157106437
RMSLE: 0.11079661971451717
R^2: 0.26184927981796147
Mean Residual Deviance: 0.5546397919709444
Null degrees of freedom: 965
Residual degrees of freedom: 954
Null deviance: 725.858730540242
Residual deviance: 535.7820390439323
AIC: 2197.9936843132646
Out[9]:

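The model above relies on H2O's default regularization. If you want to experiment with elastic-net penalties yourself, alpha and lambda_search are the relevant arguments; a sketch with illustrative (untuned) settings:

# (not run) Elastic-net GLM with an automatic search over penalty strengths
# glm_enet = H2OGeneralizedLinearEstimator(family = 'gaussian',
#                                          model_id = 'glm_enet',
#                                          alpha = 0.5,          # mix of L1 and L2 penalties
#                                          lambda_search = True) # search a grid of lambda values
# glm_enet.train(x = features, y = 'quality', training_frame = wine_train)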

Distributed Random Forest


In [10]:
# Build a Distributed Random Forest (DRF) model with default settings

# Import the estimator class for DRF
from h2o.estimators.random_forest import H2ORandomForestEstimator

# Set up DRF for regression
# Add a seed for reproducibility
drf_default = H2ORandomForestEstimator(model_id = 'drf_default', seed = 1234)

# Use .train() to build the model
drf_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)


drf Model Build progress: |███████████████████████████████████████████████| 100%

In [11]:
# Check the DRF model summary
drf_default


Model Details
=============
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  drf_default


ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.3934957618265588
RMSE: 0.6272924053633671
MAE: 0.449552888371567
RMSLE: 0.09432344617917691
Mean Residual Deviance: 0.3934957618265588
Scoring History: 
timestamp            duration   number_of_trees  training_rmse  training_mae  training_deviance
2017-06-29 23:24:36  0.013 sec  0.0              nan            nan           nan
2017-06-29 23:24:36  0.197 sec  1.0              0.8897127      0.5615172     0.7915887
2017-06-29 23:24:37  0.254 sec  2.0              0.8676163      0.5541223     0.7527581
2017-06-29 23:24:37  0.297 sec  3.0              0.8480593      0.5495825     0.7192046
2017-06-29 23:24:37  0.353 sec  4.0              0.8330926      0.5457844     0.6940433
---                  ---        ---              ---            ---           ---
2017-06-29 23:24:38  1.350 sec  46.0             0.6296173      0.4502195     0.3964179
2017-06-29 23:24:38  1.368 sec  47.0             0.6287729      0.4501413     0.3953554
2017-06-29 23:24:38  1.383 sec  48.0             0.6285979      0.4499853     0.3951354
2017-06-29 23:24:38  1.397 sec  49.0             0.6282803      0.4500892     0.3947361
2017-06-29 23:24:38  1.412 sec  50.0             0.6272924      0.4495529     0.3934958
See the whole table with table.as_data_frame()
Variable Importances: 
variable              relative_importance  scaled_importance  percentage
alcohol               21593.0800781        1.0                0.2072105
density               11848.2773438        0.5487071          0.1136979
volatile acidity      11009.4531250        0.5098602          0.1056484
free sulfur dioxide   9610.8203125         0.4450880          0.0922269
total sulfur dioxide  8106.3881836         0.3754160          0.0777901
chlorides             7689.6831055         0.3561179          0.0737914
pH                    7537.3427734         0.3490629          0.0723295
fixed acidity         6907.3925781         0.3198892          0.0662844
citric acid           6864.6484375         0.3179096          0.0658742
sulphates             6596.7392578         0.3055025          0.0633033
residual sugar        6444.6044922         0.2984569          0.0618434
Out[11]:
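
The importances above can also be plotted rather than read as a table; a one-line sketch (requires matplotlib):

# (not run) Bar chart of the variable importances listed above
# drf_default.varimp_plot()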


In [12]:
# Check the model performance on test dataset
drf_default.model_performance(wine_test)


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 0.3711311779836362
RMSE: 0.6092053660167778
MAE: 0.43510094840087254
RMSLE: 0.09161313127783179
Mean Residual Deviance: 0.3711311779836362
Out[12]:

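The default forest stops at 50 trees (see the scoring history above). ntrees and max_depth are the usual first parameters to adjust; a sketch with illustrative values:

# (not run) A larger forest with deeper trees
# drf_tuned = H2ORandomForestEstimator(model_id = 'drf_tuned',
#                                      ntrees = 100,    # default is 50
#                                      max_depth = 30,  # default is 20
#                                      seed = 1234)
# drf_tuned.train(x = features, y = 'quality', training_frame = wine_train)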

Gradient Boosting Machine


In [13]:
# Build a Gradient Boosting Machine (GBM) model with default settings

# Import the estimator class for GBM
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Set up GBM for regression
# Add a seed for reproducibility
gbm_default = H2OGradientBoostingEstimator(model_id = 'gbm_default', seed = 1234)

# Use .train() to build the model
gbm_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)


gbm Model Build progress: |███████████████████████████████████████████████| 100%

In [14]:
# Check the GBM model summary
gbm_default


Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_default


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.33501502713240405
RMSE: 0.5788048264591477
MAE: 0.4542062463255889
RMSLE: 0.08564359662538763
Mean Residual Deviance: 0.33501502713240405
Scoring History: 
timestamp            duration   number_of_trees  training_rmse  training_mae  training_deviance
2017-06-29 23:24:38  0.005 sec  0.0              0.8900853      0.6768335     0.7922518
2017-06-29 23:24:38  0.064 sec  1.0              0.8576751      0.6503625     0.7356067
2017-06-29 23:24:38  0.079 sec  2.0              0.8295027      0.6291754     0.6880747
2017-06-29 23:24:38  0.091 sec  3.0              0.8058486      0.6149556     0.6493920
2017-06-29 23:24:38  0.103 sec  4.0              0.7850495      0.6037286     0.6163028
---                  ---        ---              ---            ---           ---
2017-06-29 23:24:39  0.516 sec  46.0             0.5851965      0.4599527     0.3424549
2017-06-29 23:24:39  0.531 sec  47.0             0.5843875      0.4590260     0.3415087
2017-06-29 23:24:39  0.540 sec  48.0             0.5824507      0.4573919     0.3392489
2017-06-29 23:24:39  0.553 sec  49.0             0.5811694      0.4563671     0.3377579
2017-06-29 23:24:39  0.562 sec  50.0             0.5788048      0.4542062     0.3350150
See the whole table with table.as_data_frame()
Variable Importances: 
variable              relative_importance  scaled_importance  percentage
alcohol               3482.7805176         1.0                0.3679474
volatile acidity      1540.7117920         0.4423798          0.1627725
free sulfur dioxide   1112.0415039         0.3192970          0.1174845
residual sugar        471.1806030          0.1352886          0.0497791
pH                    464.7217407          0.1334341          0.0490967
total sulfur dioxide  442.2509766          0.1269822          0.0467227
fixed acidity         441.9790955          0.1269041          0.0466940
chlorides             416.6728516          0.1196380          0.0440205
density               377.4411011          0.1083735          0.0398757
citric acid           372.0686646          0.1068309          0.0393082
sulphates             343.5827942          0.0986519          0.0362987
Out[14]:


In [15]:
# Check the model performance on test dataset
gbm_default.model_performance(wine_test)


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.45511211588709155
RMSE: 0.6746199788674299
MAE: 0.5219768028633305
RMSLE: 0.10013755931021842
Mean Residual Deviance: 0.45511211588709155
Out[15]:

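Note the gap between the GBM's training RMSE (0.579) and test RMSE (0.675): the default settings overfit this dataset. A common remedy is a lower learning rate combined with early stopping on a separate validation frame; a sketch with illustrative settings (wine_valid3 refers to the hypothetical three-way split sketched after In [6]):

# (not run) Slower learning with early stopping
# gbm_es = H2OGradientBoostingEstimator(model_id = 'gbm_es',
#                                       ntrees = 500,             # upper bound; early stopping decides
#                                       learn_rate = 0.05,        # default is 0.1
#                                       stopping_rounds = 5,      # stop after 5 rounds without improvement
#                                       stopping_metric = 'RMSE',
#                                       seed = 1234)
# gbm_es.train(x = features, y = 'quality',
#              training_frame = wine_train,
#              validation_frame = wine_valid3)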

H2O Deep Learning


In [16]:
# Build a Deep Learning (deep neural network, DNN) model with default settings

# Import the estimator class for DNN
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Set up DNN for regression
dnn_default = H2ODeepLearningEstimator(model_id = 'dnn_default')

# (not run) Set 'reproducible' to True if you want to reproduce the results
# The model will then be built using a single thread (this can be very slow)
# dnn_default = H2ODeepLearningEstimator(model_id = 'dnn_default', reproducible = True)

# Use .train() to build the model
dnn_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)


deeplearning Model Build progress: |██████████████████████████████████████| 100%

In [17]:
# Check the DNN model summary
dnn_default


Model Details
=============
H2ODeepLearningEstimator :  Deep Learning
Model Key:  dnn_default


ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 0.5097094126644901
RMSE: 0.7139393620360837
MAE: 0.567596403224839
RMSLE: 0.10660752490224334
Mean Residual Deviance: 0.5097094126644901
Scoring History: 
timestamp            duration   training_speed  epochs  iterations  samples  training_rmse  training_deviance  training_mae
2017-06-29 23:24:39  0.000 sec  None            0.0     0           0.0      nan            nan                nan
2017-06-29 23:24:40  1.741 sec  5859 obs/sec    1.0     1           3932.0   0.7740244      0.5991138          0.6107331
2017-06-29 23:24:42  3.580 sec  15964 obs/sec   10.0    10          39320.0  0.7139394      0.5097094          0.5675964
Variable Importances: 
variable              relative_importance  scaled_importance  percentage
volatile acidity      1.0                  1.0                0.1003347
residual sugar        0.9891271            0.9891271          0.0992437
free sulfur dioxide   0.9823452            0.9823452          0.0985633
citric acid           0.9510297            0.9510297          0.0954212
alcohol               0.9213510            0.9213510          0.0924434
total sulfur dioxide  0.9175226            0.9175226          0.0920593
chlorides             0.9009342            0.9009342          0.0903949
density               0.8977060            0.8977060          0.0900710
pH                    0.8423673            0.8423673          0.0845186
fixed acidity         0.8117161            0.8117161          0.0814433
sulphates             0.7525467            0.7525467          0.0755065
Out[17]:


In [18]:
# Check the model performance on test dataset
dnn_default.model_performance(wine_test)


ModelMetricsRegression: deeplearning
** Reported on test data. **

MSE: 0.5578467028625466
RMSE: 0.7468913594777667
MAE: 0.5960455253073035
RMSLE: 0.11143449624119137
Mean Residual Deviance: 0.5578467028625466
Out[18]:

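The default network uses two hidden layers of 200 units each and trains for 10 epochs (see the scoring history above). Both are easy to change; a sketch with illustrative values:

# (not run) A smaller network trained for more epochs
# dnn_custom = H2ODeepLearningEstimator(model_id = 'dnn_custom',
#                                       hidden = [50, 50],  # default is [200, 200]
#                                       epochs = 50)        # default is 10
# dnn_custom.train(x = features, y = 'quality', training_frame = wine_train)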

Making Predictions


In [19]:
# Use GLM model to make predictions
yhat_test_glm = glm_default.predict(wine_test)
yhat_test_glm.head(5)


glm prediction progress: |████████████████████████████████████████████████| 100%
predict
5.76109
5.76721
5.64325
5.85764
5.77967
Out[19]:


In [20]:
# Use DRF model to make predictions
yhat_test_drf = drf_default.predict(wine_test)
yhat_test_drf.head(5)


drf prediction progress: |████████████████████████████████████████████████| 100%
predict
5.82407
5.66286
5.38
6.54
5.88
Out[20]:


In [21]:
# Use GBM model to make predictions
yhat_test_gbm = gbm_default.predict(wine_test)
yhat_test_gbm.head(5)


gbm prediction progress: |████████████████████████████████████████████████| 100%
predict
5.84641
6.02737
5.28953
6.27266
5.63078
Out[21]:


In [22]:
# Use DNN model to make predictions
yhat_test_dnn = dnn_default.predict(wine_test)
yhat_test_dnn.head(5)


deeplearning prediction progress: |███████████████████████████████████████| 100%
predict
5.96835
6.07601
5.86781
6.72332
5.99686
Out[22]:
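
To line any of these predictions up against the true quality scores, the prediction frame can be column-bound to the test target, or exported to pandas; a short sketch using standard H2OFrame methods:

# (not run) Put the GLM predictions next to the actual quality scores
# comparison = wine_test['quality'].cbind(yhat_test_glm)
# comparison.head(5)

# (not run) Pull the comparison into pandas for further analysis
# comparison_df = comparison.as_data_frame()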