In this notebook, we introduce H2O Deep Learning via fully-connected artificial neural networks. We also show many useful features of H2O such as hyper-parameter optimization, Flow, and checkpointing. There are other notebooks that use more complex convolutional neural networks ranging from LeNet all the way to Inception Resnet V2.
The MNIST database is a well-known academic dataset used to benchmark
classification performance. The data consists of 60,000 training images and
10,000 test images. Each image is a standardized $28^2$ pixel greyscale image of
a single handwritten digit.
[Figure: a sample of the scanned handwritten digits]
In [1]:
import h2o
h2o.init(nthreads=-1)
In [2]:
import os.path
PATH = os.path.expanduser("~/h2o-3/")
In [3]:
test_df = h2o.import_file(PATH + "bigdata/laptop/mnist/test.csv.gz")
In [4]:
train_df = h2o.import_file(PATH + "bigdata/laptop/mnist/train.csv.gz")
Specify the response and predictor columns
In [5]:
y = "C785"
x = train_df.names[0:784]
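The column choice follows from the image size: each 28 x 28 image is flattened into 784 pixel columns, and the label sits in the last column. The sketch below rebuilds that layout in plain Python. It assumes the CSV files have no header row, so that H2O assigns its default names C1, C2, and so on; the list column_names is only a stand-in for train_df.names.

```python
# Stand-in for train_df.names, assuming headerless CSVs with default names C1..C785.
n_pixels = 28 * 28                      # 784 pixel intensities per image
column_names = ["C%d" % i for i in range(1, n_pixels + 2)]

predictors = column_names[0:n_pixels]   # C1 .. C784, the same slice used for x
response = column_names[n_pixels]       # C785, the digit label

print(len(predictors), predictors[0], predictors[-1], response)
```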
In [6]:
train_df[y] = train_df[y].asfactor()
test_df[y] = test_df[y].asfactor()
Train Deep Learning model and validate on test set
In [7]:
from h2o.estimators.deepwater import H2ODeepWaterEstimator
In [8]:
model = H2ODeepWaterEstimator(
distribution="multinomial",
activation="rectifier",
mini_batch_size=128,
hidden=[1024,1024],
hidden_dropout_ratios=[0.5,0.5], ## for better generalization
input_dropout_ratio=0.1,
sparse=True, ## can result in speedup for sparse data
epochs=10) ## need more epochs for a better model
In [9]:
model.train(
x=x,
y=y,
training_frame=train_df,
validation_frame=test_df
)
In [10]:
model.scoring_history()
Out[10]:
In [11]:
model.model_performance(train=True) # training metrics
Out[11]:
In [12]:
model.model_performance(valid=True) # validation metrics
Out[12]:
It is highly recommended to use Flow to visualize the model training process and to inspect the model before using it for further steps.
Advanced users can also specify a fold column that defines the holdout fold associated with each row. By default, the holdout fold assignment is random. H2O supports other schemes such as round-robin assignment using the modulo operator.
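To make the two assignment schemes concrete, here is a minimal plain-Python sketch of how a fold column could be built. The helper names modulo_folds and random_folds are hypothetical and are not part of the H2O API; they only illustrate the idea of mapping each row to a holdout fold.

```python
import random

def modulo_folds(n_rows, nfolds):
    """Round-robin assignment: row i goes to fold i % nfolds."""
    return [i % nfolds for i in range(n_rows)]

def random_folds(n_rows, nfolds, seed=42):
    """Random assignment: each row draws a fold uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]

fold_column = modulo_folds(n_rows=9, nfolds=3)
print(fold_column)  # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```

Round-robin assignment yields folds of near-equal size and is reproducible without a seed, while random assignment avoids any dependence on the row order of the file.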
Perform 3-fold cross-validation on training_frame
In [13]:
model_crossvalidated = H2ODeepWaterEstimator(
distribution="multinomial",
activation="rectifier",
mini_batch_size=128,
hidden=[1024,1024],
hidden_dropout_ratios=[0.5,0.5],
input_dropout_ratio=0.1,
sparse=True,
epochs=10,
nfolds=3
)
In [14]:
model_crossvalidated.train(
x=x,
y=y,
training_frame=train_df
)
In [15]:
# View specified parameters of the Deep Learning model
model_crossvalidated.params;
In [16]:
# Examine the trained model
model_crossvalidated
Out[16]:
Note: The validation error is based on the parameter score_validation_samples, which can be used to sample the validation set (by default, the entire validation set is used).
In [17]:
## Validation error of the original model (using a train/valid split)
model.mean_per_class_error(valid=True)
Out[17]:
In [18]:
## Training error of the model trained on 100% of the data
model_crossvalidated.mean_per_class_error(train=True)
Out[18]:
In [19]:
## Estimated generalization error of the cross-validated model
model_crossvalidated.mean_per_class_error(xval=True)
Out[19]:
Clearly, the model parameters aren't tuned perfectly yet, as 4-5% test set error is rather large.
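The metric compared in the cells above is the mean per-class error: the error rate is computed separately for each digit and then averaged, so every class counts equally regardless of how many rows it has. The sketch below computes it from a small confusion matrix; the counts are made up and the function is a hypothetical helper, not H2O's implementation.

```python
def mean_per_class_error(confusion):
    """confusion[i][j] = number of rows of class i predicted as class j."""
    per_class = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            continue  # skip classes that have no rows
        per_class.append((total - row[i]) / float(total))
    return sum(per_class) / len(per_class)

# Made-up 3-class example: rows are actual classes, columns are predictions.
confusion = [
    [90, 10, 0],   # class 0: 10 of 100 wrong
    [5, 45, 0],    # class 1: 5 of 50 wrong
    [0, 5, 15],    # class 2: 5 of 20 wrong
]
print(round(mean_per_class_error(confusion), 4))
```

In this example the overall misclassification rate is 20 of 170 rows, which is lower than the mean per-class error because the smallest class has the highest error rate.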
In [20]:
#ls ../../h2o-docs/src/booklets/v2_2015/source/images/
In [21]:
predictions = model_crossvalidated.predict(test_df)
In [22]:
predictions.describe()
Variable importance allows us to view the absolute and relative predictive strength of each feature in the prediction task. Each H2O algorithm class has its own methodology for computing variable importance.
You can enable variable importances by setting the variable_importances parameter to True.
H2O’s Deep Learning uses the Gedeon method (Gedeon, 1997), which is disabled by default since it can be slow for large networks.
If variable importance is a top priority in your analysis, consider training a Distributed Random Forest (DRF) model and comparing the generated variable importances.
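The table returned by varimp() lists, for each feature, a relative importance together with a scaled value and a percentage. The sketch below shows how such columns are commonly derived from raw scores: scaled by the largest score, and expressed as a share of the total. The raw numbers are made up and normalize_importances is a hypothetical helper, so treat this as an illustration of the normalization rather than of H2O's internals.

```python
def normalize_importances(relative):
    """Turn {feature: relative_importance} into rows of
    (feature, relative, scaled, percentage), most important first."""
    top = max(relative.values())
    total = sum(relative.values())
    rows = [(name, value, value / top, value / total)
            for name, value in relative.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Made-up raw scores for three pixel columns.
raw = {"C300": 4.0, "C410": 2.0, "C155": 2.0}
for row in normalize_importances(raw):
    print(row)
```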
In [23]:
# Train Deep Learning model and validate on test set and save the variable importances
from h2o.estimators.deeplearning import H2ODeepLearningEstimator ## H2ODeepWaterEstimator doesn't yet have variable importances
model_variable_importances = H2ODeepLearningEstimator(
distribution="multinomial",
activation="RectifierWithDropout", ## shortcut for hidden_dropout_ratios=[0.5,0.5,0.5]
hidden=[32,32,32], ## smaller number of neurons to be fast enough on the CPU
input_dropout_ratio=0.1,
sparse=True,
epochs=1, ## not interested in a good model here
variable_importances=True) ## this is not yet implemented for DeepWaterEstimator
In [24]:
model_variable_importances.train(
x=x,
y=y,
training_frame=train_df,
validation_frame=test_df)
In [25]:
# Retrieve the variable importance
import pandas as pd
pd.DataFrame(model_variable_importances.varimp())
Out[25]:
In [26]:
model_variable_importances.varimp_plot(num_of_features=20)
Grid search provides more subtle insights into the model tuning and selection process by inspecting and comparing our trained models after the grid search process is complete.
To learn when and how to select different parameter configurations in a grid search, refer to Parameters for parameter descriptions and configurable values.
There are different strategies to explore the hyperparameter combinatorial space. This notebook uses two of them: Cartesian search, which trains every combination, and random search, which samples combinations until a limit on the number of models or the runtime is reached.
In this example, two different network topologies and two different learning rates are specified. This grid search trains all 4 models (every combination of these parameters); other parameter combinations can be specified for a larger space of models. Note that the models will most likely stop before reaching the specified number of epochs, since early stopping is enabled.
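As a quick check of the count, the following sketch enumerates the combinations of a two-by-two grid with itertools.product, using the same values as the grid defined in this section. It only illustrates how an exhaustive search expands the parameter lists and does not call H2O.

```python
from itertools import product

hyper_parameters = {
    "hidden": [[200, 200, 200], [300, 300]],
    "learning_rate": [1e-3, 5e-3],
}

# Fix an order for the parameter names, then take the Cartesian product
# of their value lists: one dictionary per model to train.
names = sorted(hyper_parameters)
combinations = [dict(zip(names, values))
                for values in product(*(hyper_parameters[n] for n in names))]

for combo in combinations:
    print(combo)
```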
In [27]:
from h2o.grid.grid_search import H2OGridSearch
In [28]:
hyper_parameters = {
"hidden":[[200,200,200],[300,300]],
"learning_rate":[1e-3,5e-3],
}
model_grid = H2OGridSearch(H2ODeepWaterEstimator, hyper_params=hyper_parameters)
In [29]:
model_grid.train(
x=x,
y=y,
distribution="multinomial",
epochs=50, ## might stop earlier since we enable early stopping below
training_frame=train_df,
validation_frame=test_df,
score_interval=2, ## score no more than every 2 seconds
score_duty_cycle=0.5, ## score up to 50% of the time - to enable early stopping
score_training_samples=1000, ## use a subset of the training frame for faster scoring
score_validation_samples=1000, ## use a subset of the validation frame for faster scoring
stopping_rounds=3,
stopping_tolerance=0.05,
stopping_metric="misclassification",
sparse = True,
mini_batch_size=256
)
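The stopping_rounds and stopping_tolerance settings above describe a convergence test on the scoring history. The sketch below is a simplified moving-average version of such a rule for a metric where lower is better. It is an assumption-laden illustration, not H2O's exact implementation, and should_stop is a hypothetical helper.

```python
def should_stop(history, stopping_rounds=3, stopping_tolerance=0.05):
    """Return True when a lower-is-better metric has stopped improving.

    Simplified rule: compare the best of the last `stopping_rounds` moving
    averages against the moving average just before them.
    """
    k = stopping_rounds
    if k <= 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    averages = [sum(history[i:i + k]) / float(k)
                for i in range(len(history) - k + 1)]
    reference = averages[-k - 1]
    best_recent = min(averages[-k:])
    # Stop unless the metric improved by at least the relative tolerance.
    return best_recent > reference * (1.0 - stopping_tolerance)

improving = [0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
plateaued = [0.50, 0.12, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]
print(should_stop(improving), should_stop(plateaued))
```

Averaging over several scoring events makes the rule less sensitive to noise, which matters here because each score is computed on a 1000-row sample.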
In [30]:
# print model grid search results
model_grid
Out[30]:
In [31]:
for gmodel in model_grid:
    print(gmodel.model_id + " mean per class error: " + str(gmodel.mean_per_class_error()))
In [32]:
import pandas as pd
In [33]:
grid_results = pd.DataFrame([[m.model_id, m.mean_per_class_error(valid=True)] for m in model_grid])
grid_results
Out[33]:
In [34]:
hyper_parameters = {
"hidden":[[1000,1000],[2000]],
"learning_rate":[s*1e-3 for s in range(30,100)],
"momentum_start":[s*1e-3 for s in range(0,900)],
"momentum_stable":[s*1e-3 for s in range(900,1000)],
}
In [35]:
search_criteria = {"strategy":"RandomDiscrete", "max_models":10, "max_runtime_secs":100, "seed":123456}
model_grid_random_search = H2OGridSearch(H2ODeepWaterEstimator,
hyper_params=hyper_parameters,
search_criteria=search_criteria)
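The hyperparameter lists in this random search span far too many combinations for an exhaustive search, which is why a limit such as max_models is used. The sketch below counts the combinations and draws ten of them with a seeded generator. It illustrates the idea only: it samples with replacement, which is an assumption of this sketch and may differ from how H2O's RandomDiscrete strategy picks combinations.

```python
import random

hyper_parameters = {
    "hidden": [[1000, 1000], [2000]],
    "learning_rate": [s * 1e-3 for s in range(30, 100)],
    "momentum_start": [s * 1e-3 for s in range(0, 900)],
    "momentum_stable": [s * 1e-3 for s in range(900, 1000)],
}

# Size of the full Cartesian space: 2 * 70 * 900 * 100.
space_size = 1
for values in hyper_parameters.values():
    space_size *= len(values)
print(space_size)  # 12600000

# Draw max_models combinations; the seed makes the draw repeatable.
rng = random.Random(123456)
max_models = 10
sampled = [{name: rng.choice(values)
            for name, values in sorted(hyper_parameters.items())}
           for _ in range(max_models)]
print(len(sampled))
```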
In [36]:
model_grid_random_search.train(
x=x, y=y,
distribution="multinomial",
epochs=50, ## might stop earlier since we enable early stopping below
training_frame=train_df,
validation_frame=test_df,
score_interval=2, ## score no more than every 2 seconds
score_duty_cycle=0.5, ## score up to 50% of the wall clock time - scoring is needed for early stopping
score_training_samples=1000, ## use a subset of the training frame for faster scoring
score_validation_samples=1000, ## use a subset of the validation frame for faster scoring
stopping_rounds=3,
stopping_tolerance=0.05,
stopping_metric="misclassification",
sparse = True,
mini_batch_size=256)
In [37]:
grid_results = pd.DataFrame([[m.model_id, m.mean_per_class_error(valid=True)] for m in model_grid_random_search])
In [38]:
grid_results
Out[38]:
H2O supports model checkpoints. You can store the state of training and resume it later.
Checkpointing can be used to reload existing models that were saved to disk in a previous session.
To resume model training, use checkpoint model keys (model id) to incrementally train a specific model using more iterations, more data, different data, and so forth. To further train the initial model, use it (or its key) as a checkpoint argument for a new model.
To improve this initial model, start from the previous model and add iterations by building another model, specifying checkpoint=previous_model_id, and changing train_samples_per_iteration, target_ratio_comm_to_comp, or other parameters. Many parameters can be changed between checkpoints, especially those that affect regularization or performance tuning.
You can use GridSearch with checkpoint restarts to scan a broader range of hyperparameter combinations.
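The toy trainer below sketches the checkpoint idea in plain Python: the saved state records how many epochs have been completed, and a restart continues from there until the new total is reached. The train function and its state dictionary are hypothetical and only mimic the bookkeeping; the assumption that epochs is a total target rather than an increment follows the comment in the checkpoint cell of this notebook.

```python
def train(epochs, checkpoint=None):
    """Toy trainer: each 'epoch' just halves a fake loss value."""
    # Copy the checkpoint so the earlier model state is left untouched.
    state = dict(checkpoint) if checkpoint else {"epochs_done": 0, "loss": 1.0}
    if epochs <= state["epochs_done"]:
        raise ValueError("epochs must exceed the epochs already trained")
    for _ in range(state["epochs_done"], epochs):
        state["loss"] *= 0.5
        state["epochs_done"] += 1
    return state

first = train(epochs=10)                       # initial model
resumed = train(epochs=20, checkpoint=first)   # trains only 10 more epochs
print(first["epochs_done"], resumed["epochs_done"])
```

In this toy setting, resuming from the checkpoint gives the same final state as training for 20 epochs in one run, while reusing the work already done.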
In [39]:
# Restart the training process on a saved DL model using the 'checkpoint' argument
model_checkpoint = H2ODeepWaterEstimator(
checkpoint=model.model_id,
activation="rectifier",
distribution="multinomial",
mini_batch_size=128,
hidden=[1024,1024],
hidden_dropout_ratios=[0.5,0.5],
input_dropout_ratio=0.1,
sparse=True,
epochs=20) ## previous model had 10 epochs, so we need to only train for 10 more to get to 20 epochs
In [40]:
model_checkpoint.train(
x=x,
y=y,
training_frame=train_df,
validation_frame=test_df)
In [41]:
model_checkpoint.scoring_history()
Out[41]:
Specify a model and a file path. The default path is the current working directory.
In [42]:
model_path = h2o.save_model(
model = model,
#path = "/tmp/mymodel",
force = True)
print(model_path)
In [43]:
!ls -lah $model_path
After restarting H2O, you can load the saved model by specifying the host and model file path.
Note: The H2O version used to load the model must be the same as the version used to save it.
In [44]:
# Load model from disk
saved_model = h2o.load_model(model_path)
You can also use the following commands to retrieve a model from its H2O key. This is useful if you have created an H2O model using the web interface and want to continue the modeling process in another language, for example R.
In [45]:
# Retrieve model by H2O key
model = h2o.get_model(model_id=model_checkpoint.model_id)
model
Out[45]: