In this notebook, we will train a number of popular machine learning algorithms on a regression problem to estimate house prices.
The data set was provided by Kaggle and is one of the long-standing data sets in the Playground section.
startml currently supports hyperparameter tuning for three algorithms: random forest, gradient boosted machines, and deep learning. In this notebook, we will:
Automatically train a number of different machine learning algorithms through hyperparameter grid searches
Apply a performance threshold to filter the resulting models
Visualize the model performance and training information
In [1]:
#==================================================================
# load the startml library
#==================================================================
library(startml)
In [2]:
#=============================================================================
# Launch an h2o instance
# Load in the data downloaded from Kaggle Playground
#=============================================================================
# launch an h2o instance; maximum RAM and number of threads can be set
# here, I just use the defaults
h2o.init()
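If you do want to cap what h2o grabs, the same call accepts thread and memory arguments. A minimal sketch (nthreads and max_mem_size are standard h2o.init() arguments; the values here are just illustrative):
# for example, limit the local cluster to 2 threads and 4 GB of RAM
# instead of calling h2o.init() with the defaults as above
# h2o.init(nthreads = 2, max_mem_size = "4g")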
In [3]:
#=============================================================================
#
# Load in the data downloaded from Kaggle Playground
#
#=============================================================================
# use the h2o import to create an h2o object. This is important,
# as there is a lot going on behind the scenes in h2o, and R is only
# the interface.
train <- h2o.importFile("../../../data/train.csv")
test <- h2o.importFile("../../../data/test.csv")
# print out the dimensions of train and test
tn_dim <- dim(train)
tt_dim <- dim(test)
# build message with extras
message <- c('train: a 2D h2oFrame with shape: [', tn_dim, ']\n', 'test: a 2D h2oFrame with shape: [', tt_dim, ']')
# print to console
cat(message)
H2O provides a few ways to visualize data through the R API, which can be helpful. Since this data set has been "pre-cooked", you could begin modeling right away if you wanted; generally, that is not the case. Here, I used a few of the built-in visualization functions in the h2o library to look at the data. These functions are designed to work with pretty big data sets, which is why they rely so heavily on binning. We don't really need that here, as there are only about 1,500 rows, but I'll do it anyway to show the style of figure you can get out of really big data sets with these functions.
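As an aside, a couple of other built-in summaries can be run directly on the h2oFrame before any plotting. A minimal sketch (h2o.describe and h2o.hist are standard functions in the h2o R package, not part of startml):
# per-column summary statistics, computed on the h2o cluster rather than in R
h2o.describe(train)
# binned histogram of the target; the binning happens in h2o and only the summary comes back to R
h2o.hist(train$SalePrice, breaks = 50)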
In [5]:
#=============================================================================
#
# Using h2o functions to visualize the data
#
#=============================================================================
# let's print out the column names of train; you can do it the same way as with a dataframe
colnames(train)
Now, I'll cut out a few columns that are difficult to split well into train, validation, and test sets. Again, stratified splits or some other better approach will be incorporated into a later version of startml. I know this from running this notebook a few times; in real applications this information may come from a little trial and error.
In [9]:
#=============================================================================
#
# Cut out a few columns of the training data
#
#=============================================================================
# you can interact with h2oFrames mostly the same way you interact with dataframes
labeled_data <- train[,!names(train) %in% c('Exterior1st', 'Exterior2nd', 'KitchenQual','Functional', 'SaleType', 'MSZoning')]
newdata <- test[,!names(test) %in% c('Exterior1st', 'Exterior2nd', 'KitchenQual','Functional', 'SaleType', 'MSZoning')]
# check number of columns
message <- c('Now we are down to', ncol(labeled_data), 'columns in train and\n', ncol(newdata), 'columns in test')
cat(message)
colnames(labeled_data)
In [7]:
# since SalePrice is the target variable, let's plot a couple of input variables
# with respect to SalePrice
plot(h2o.tabulate(train, "YrSold", "SalePrice"))
It looks like most houses are sold in the warmer months: May, June, and July. It's possible that there is a positive correlation where houses sold later in the high-volume months sell for more, but it's hard to say with confidence from this overview plot. Also, it looks like no houses over half a million were sold between August and December, while the month with the highest-priced sales is July.
In [8]:
# Try the month sold
plot(h2o.tabulate(train, "MoSold", "SalePrice" ))
Looks like most houses are in condition 5 out of 10. I guess that's not surprising. Houses in this condition have a wide range of sale prices. I think you really can see a positive correlation overall, with sale price rising as condition increases, although there is a lot of variance.
In [9]:
# overall condition is probably important
plot(h2o.tabulate(train, "OverallCond", "SalePrice"))
This one is interesting. There seems to be a strong positive correlation between garage car capacity and sale price, compared to the other relationships we've seen so far. If you read the metadata about the training set, then you know these houses are located in and around Ames, Iowa, USA. Iowa is a very rural state, and Ames is not a big city. Public transportation is likely not a realistic option for most people, so those looking to buy a house in this area put a lot of value on the ability to store more vehicles in a covered garage.
In [10]:
# Now the garage size
plot(h2o.tabulate(train, "GarageCars", "SalePrice" ))
A few things happen when you run startml: hyperparameter grid searches are run for each selected algorithm, the resulting models are kept on the h2o cluster, and predictions for newdata are stored in the returned mlblob object. The inputs used for this example are shown in the call below.
In [5]:
#===============================================================
# Run startml
#
#===============================================================
# here we only train 3 types of algorithms, with a 20-second grid search each,
# so roughly one minute of grid searching in total
output <- startml(labeled_data = labeled_data,
                  newdata = newdata,
                  label_id = 'Id',
                  y = 'SalePrice',
                  algorithms = c("deeplearning", "randomForest", "gbm"),
                  split_seed = 1234,
                  runtime_secs = 20, # run it a very short time for this example
                  eval_metric = "RMSLE")
startml has a built-in overview plot to see the results of training the models and the test performance. Just use plot on the mlblob output from the startml function. It's worth noting that this plot function doesn't yet work with very big data sets, and so is limited by your workstation's capabilities. Although the startml function itself is capable of running on big data sets thanks to the h2o backend, the plot function here still needs a lot of RAM. I am working on this for future versions so that the plot function can also be used on very big (out-of-memory) data sets.
In [12]:
#===============================================================
# Plot the results from training.
#
#===============================================================
plot(output)
There's a lot going on even after only one minute of model building. To fix this, we will trim poorly performing models with the trim function. If you don't know where to start, take a look at the median performance line and decide from there. For this example, I'm going to pick only models that did a lot better than the median, so I'll trim down to models with an RMSLE of 0.3 or better on the test data.
Also, I don't really care about keeping very similar models, so I set a correlation threshold of 0.95. This will keep only one model when two models' test predictions have a Pearson correlation greater than 0.95.
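To make the correlation rule concrete, here is a tiny base-R sketch of the idea (the prediction vectors are made up, and this only illustrates the criterion, not startml's internal code):
# hypothetical test-set predictions from two candidate models
pred_a <- c(112000, 180500, 251000, 305000, 98000)
pred_b <- c(113500, 179000, 254000, 301000, 99500)
# if the predictions are nearly identical, the two models are redundant
if (cor(pred_a, pred_b, method = "pearson") > 0.95) {
  message("predictions are highly correlated; keep only the better-scoring model")
}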
In [13]:
#==============================================================================================
# Trimming models based on eval metric and correlation.
#
#===============================================================================================
# trim a few models
trim_out <- trim(output,
                 eval_metric = 'RMSLE',
                 eval_threshold = 0.3,
                 correlation_threshold = 0.95)
In [14]:
#==============================================================================================
# plotting the trimmed result
#
#===============================================================================================
plot(trim_out)
Based on the top-right subplot, we can see that one of the random forest models (rf_model_1) that made it through our threshold controls is a degenerate model. Even though it scored much better than many others, it simply found a sub-optimum where producing one of two values based on the inputs got an RMSE of slightly below 60000. This was hard to see before with all the clutter, but now that it is obvious, we will send this object back through the trimmer. This time, we won't use the correlation threshold.
In [15]:
#==============================================================================================
# Trimming models again.
#
#===============================================================================================
# trim a few models
trim_out <- trim(trim_out,
                 eval_metric = 'RMSLE',
                 eval_threshold = 0.15)
In [16]:
#==============================================================================================
# plotting the trimmed result
#
#===============================================================================================
plot(trim_out)
From the roughly 40 models we started with, we are down to two. The random forest model has performed much better than the GBM model, but it can be worth taking a look at both. From the top-right plot, you can see the random forest model seems to have learned something about the structure of the data and is providing reasonable estimates that begin to approximate the actual shape of the test data. The GBM model, however, does not yet seem to have learned enough to make reasonable predictions. Its predictions capture some of the processes dictating housing prices, but aren't good enough to really use.
So, I've automatically tuned over 40 models spanning deep learning, random forest, and gradient boosted machine algorithms. The majority of these models were not very good. Through iterative trimming, we selected the random forest model as the best. We can now get the new-data predictions from this model, or grab the model out of the mlblob object for further analysis. At this point, the model can be treated just as if it had been trained through the h2o interface directly.
H2O offers a lot of options for looking into models; they are pretty complex objects all by themselves. What the model object looks like depends on which algorithm produced it and what kind of problem it was trained on. Since the model I'm looking into is a random forest, the object includes a few extras, like variable importances, out of the box.
In [7]:
#=====================================================================================
#
# Save the best model predictions and go further with the selected model
#
#
#=====================================================================================
# The rf_model_0 is still on the h2o cluster before and after this command.
# here, we just grab it out into a new object to work with it easily.
# You get "rf_model_0" right off the summary graph of the mblob object
best_model <- h2o.getModel('rf_model_0')
# take a look at the model summary to see final parameters and more info on performance
summary(best_model)
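As a quick aside, you can also check directly which algorithm produced the selected model and which components H2O stored for it. A small sketch using standard H2OModel slots (the exact contents vary by h2o version and algorithm):
# the algorithm name, e.g. "drf" for a distributed random forest
best_model@algorithm
# the pieces H2O stored for this particular model (importances, scoring history, etc.)
names(best_model@model)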
Although the plot function in startml shows the validation training for as many models as fit on the figure, it is a good idea to take a closer look at the training process for any model that may be selected. For the random forest model, I'm taking a look at the training and validation loss scores over the number of weak learners, or trees, built.
Looks like I could have used a harder early stop. Interestingly, the MAE is actually lower for the validation set, which is not ideal. Usually I'm looking for training to be a little bit better than validation, indicating a small loss of goodness of fit between training and validation. It doesn't necessarily mean the model is bad, but it is generally a bit suspect in my opinion. In this case, I bet it comes from having such a small data set. With small data sets, the validation scores can sometimes be better simply because the small validation split may not represent all the difficulties encountered in the larger training set. Looking back at the training and validation density plots in the bottom left of the plot output, you can see the validation distribution is more likely to contain lower-to-mid range prices in a random sample than the training data. In the upper-right plot of observed versus predicted values, I can see that the selected random forest model might be generally better at making predictions in the lower-to-mid range.
When comparing two different fit metrics, it's also a good idea to remember how they are related to each other. For example, on the same data set, MAE is less than or equal to RMSE, and RMSE penalizes large errors more heavily than MAE; that is food for thought while considering the training history. In this example, small data and not having enough examples across the range of sale prices in each data split is probably the cause. In the future, I'm planning to build better data separations into startml, but for now I'll just move on with this model for the example.
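As a quick illustration of that relationship, here is a tiny base-R example with made-up error vectors, showing that MAE is never larger than RMSE and that the gap widens when the error is concentrated in a few points:
# helper metrics on a vector of residuals
mae  <- function(e) mean(abs(e))
rmse <- function(e) sqrt(mean(e^2))
errors_uniform <- c(1, 1, 1, 1)   # four equal errors
errors_spiky   <- c(0, 0, 0, 2)   # one larger error among zeros
mae(errors_uniform); rmse(errors_uniform)   # both equal 1 when all errors have the same magnitude
mae(errors_spiky);   rmse(errors_spiky)     # 0.5 vs 1: the single large error is penalized more by RMSE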
In [8]:
#========================================================================
#
# Plot the train and validation history for the selected model
#
#========================================================================
# these plots use reshape2, ggplot2, and gridExtra
library(reshape2)
library(ggplot2)
library(gridExtra)
# make a dataframe out of the training history frame from the h2o model object
# H2O only saves RMSE and MAE for the training history in this model
hist <- best_model@model$scoring_history
# melt the training and validation RMSE and MAE into long format,
# so the lines can be colored by variable (training vs. validation)
hist_melt_1 <- melt(hist[, c(3, 4, 7)], id = 1)
hist_melt_2 <- melt(hist[, c(3, 5, 8)], id = 1)
p1 <- ggplot(hist_melt_1) +
  geom_line(aes(x = number_of_trees, y = value, color = variable)) +
  ggtitle('Training and Validation RMSE')
p2 <- ggplot(hist_melt_2) +
  geom_line(aes(x = number_of_trees, y = value, color = variable)) +
  ggtitle('Training and Validation MAE')
# plot top - bottom
grid.arrange(p1, p2, ncol = 1, nrow = 2)
H2O includes a function to look at variable importances for different algorithms. Here I can see importance information for these variables in a nice table. Looking at the far-right column, the "OverallQual" (overall quality) of a house was a pretty important variable when predicting its price according to the random forest model, while "YrSold" (the year the house was sold) was not very important. Garage car capacity was also important in determining sale price. It's good news that these findings echo some of what I thought while looking at the training data before the modeling process.
In [55]:
#========================================================================
#
# Explore variable importances
#
#========================================================================
# take a look at the variable importances.
h2o.varimp(best_model)
I fed in newdata when launching startml in this notebook, so the predictions for that set are already in the mlblob object "output".
There's a function in startml to make an h2oFrame or a regular R dataframe with the predictions I want. For this small data set, we can load the final predictions right into the R workspace. For bigger data, you should not do this; leave the object as an h2oFrame instead.
In [26]:
#==============================================================================
#
# Get predictions from best model in a dataframe
#
#
#==============================================================================
# get the predictions from the best model as an h2o frame
newdata_predictions <- get_prediction('rf_model_0', output)
# make a regular data frame of the predictions with the ids from the original file, if small enough
predict_out <- data.frame(Id = as.data.frame(newdata$Id), Predictions = as.data.frame(newdata_predictions[[1]]$predict))
# print out to csv
write.csv(predict_out, 'rf_model_0_newdata_predictions.csv', quote = FALSE, row.names = FALSE)
# view it
predict_out
In [8]:
# experiments with dask and datashader
h2o.exportFile(train, 'vis_data/test.csv', force = TRUE, parts = -1)
In [ ]: