In this notebook, we will train a number of popular machine learning algorithms on a regression problem to estimate house prices.
The data set was provided by Kaggle and is one of the long-standing data sets in the Playground section.
startml currently supports hyperparameter tuning for three algorithms: random forest, gradient boosted machines, and deep learning. In this notebook, we will:
Automatically train a number of different machine learning algorithms through hyperparameter grid searches
Apply a performance threshold to filter the resulting models
Visualize the model performance and training information
In [1]:
#==================================================================
# load the startml library
#==================================================================
library(startml)
In [2]:
#=============================================================================
# Launch an h2o instance
# Load in the data downloaded from Kaggle Playground
#=============================================================================
# launch an h2o instance; maximum RAM and number of threads can be set
# here, I just use the defaults
h2o.init()
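If you do want to cap what h2o grabs, the same call accepts thread and memory arguments. A minimal sketch (nthreads and max_mem_size are standard h2o.init() arguments; the values here are just illustrative):
# for example, limit the local cluster to 2 threads and 4 GB of RAM
# instead of calling h2o.init() with the defaults as above
# h2o.init(nthreads = 2, max_mem_size = "4g")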
In [3]:
#=============================================================================
#
# Load in the data downloaded from Kaggle Playground
#
#=============================================================================
# use the h2o import to create an h2o object. This is important,
# as there is a lot going on behind the scenes in h2o, and R is only
# the interface.
train <- h2o.importFile("../../../data/train.csv")
test <- h2o.importFile("../../../data/test.csv")
# print out the dimensions of train and test
tn_dim <- dim(train)
tt_dim <- dim(test)
# build message with extras
message <- c('train: a 2D h2oFrame with shape: [', tn_dim, ']\n', 'test: a 2D h2oFrame with shape: [', tt_dim, ']')
# print to console
cat(message)
H2O provides a few ways to visualize data through the R API, which can be helpful. Since this data set has been "pre-cooked", you could begin modeling right away if you wanted; generally, that is not the case. Here, I used a few of the built-in visualization functions in the h2o library to look at the data. These functions are designed to work with pretty big data sets, which is why they rely so heavily on binning. We don't really need that here, as there are only about 1,500 rows, but I'll do it anyway to show the style of figure you can get out of really big data sets with these functions.
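As an aside, a couple of other built-in summaries can be run directly on the h2oFrame before any plotting. A minimal sketch (h2o.describe and h2o.hist are standard functions in the h2o R package, not part of startml):
# per-column summary statistics, computed on the h2o cluster rather than in R
h2o.describe(train)
# binned histogram of the target; the binning happens in h2o and only the summary comes back to R
h2o.hist(train$SalePrice, breaks = 50)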
In [5]:
#=============================================================================
#
# Using h2o functions to visualize the data
#
#=============================================================================
# let's print out the column names of train; you can do it the same way as with a dataframe
colnames(train)
Now, I'll cut out a few columns that are difficult to split well into train, validation, and test sets. Again, stratified splits or some other better approach will be incorporated into a later version of startml. I know this from running this notebook a few times; in real applications this information may come from a little trial and error.
In [9]:
#=============================================================================
#
# Cut out a few columns of the training data
#
#=============================================================================
# you can interact with h2oFrames mostly the same way you interact with dataframes
labeled_data <- train[,!names(train) %in% c('Exterior1st', 'Exterior2nd', 'KitchenQual','Functional', 'SaleType', 'MSZoning')]
newdata <- test[,!names(test) %in% c('Exterior1st', 'Exterior2nd', 'KitchenQual','Functional', 'SaleType', 'MSZoning')]
# check number of columns
message <- c('Now we are down to', ncol(labeled_data), 'columns in train and\n', ncol(newdata), 'columns in test')
cat(message)
colnames(labeled_data)
In [7]:
# since SalePrice is the target variable, let's plot a couple of input variables
# with respect to SalePrice
plot(h2o.tabulate(train, "YrSold", "SalePrice"))
It looks like most houses are sold in the warmer months: May, June, and July. It's possible that there is a positive correlation where houses sold later in the high-volume months sell for more, but it's hard to say with confidence from this overview plot. Also, it looks like no houses over half a million were sold between August and December, while the month with the highest-priced sales is July.
In [8]:
# Try the month sold
plot(h2o.tabulate(train, "MoSold", "SalePrice" ))
Looks like most houses are in condition 5 out of 10. I guess that's not surprising. Houses in this condition have a wide range of sale prices. I think you really can see a positive correlation overall, with sale price rising as condition increases, although there is a lot of variance.
In [9]:
# overall condition is probably important
plot(h2o.tabulate(train, "OverallCond", "SalePrice"))
This one is interesting. There seems to be a strong positive correlation between garage car capacity and sale price, compared to the other relationships we've seen so far. If you read the metadata about the training set, then you know these houses are located in and around Ames, Iowa, USA. Iowa is a very rural state, and Ames is not a big city. Public transportation is likely not a realistic option for most people, so those looking to buy a house in this area put a lot of value on the ability to store more vehicles in a covered garage.
In [10]:
# Now the garage size
plot(h2o.tabulate(train, "GarageCars", "SalePrice" ))
A few things happen when you run startml: hyperparameter grid searches are run for each selected algorithm, the resulting models are kept on the h2o cluster, and predictions for newdata are stored in the returned mlblob object. The inputs used for this example are shown in the call below.
In [5]:
#===============================================================
# Run startml
#
#===============================================================
# here we only train 3 types of algorithms, with a 20-second grid search each,
# so roughly one minute of grid searching in total
output <- startml(labeled_data = labeled_data,
                  newdata = newdata,
                  label_id = 'Id',
                  y = 'SalePrice',
                  algorithms = c("deeplearning", "randomForest", "gbm"),
                  split_seed = 1234,
                  runtime_secs = 20, # run it a very short time for this example
                  eval_metric = "RMSLE")
startml has a built-in overview plot to see the results of training the models and the test performance. Just use plot on the mlblob output from the startml function. It's worth noting that this plot function doesn't yet work with very big data sets, and so is limited by your workstation's capabilities. Although the startml function itself is capable of running on big data sets thanks to the h2o backend, the plot function here still needs a lot of RAM. I am working on this for future versions so that the plot function can also be used on very big (out-of-memory) data sets.
In [12]:
#===============================================================
# Plot the results from training.
#
#===============================================================
plot(output)
There's a lot going on even after only one minute of model building. To fix this, we will trim poorly performing models with the trim function. If you don't know where to start, take a look at the median performance line and decide from there. For this example, I'm going to pick only models that did a lot better than the median, so I'll trim down to models with an RMSLE of 0.3 or better on the test data.
Also, I don't really care about keeping very similar models, so I set a correlation threshold of 0.95. This will keep only one model when two models' test predictions have a Pearson correlation greater than 0.95.
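To make the correlation rule concrete, here is a tiny base-R sketch of the idea (the prediction vectors are made up, and this only illustrates the criterion, not startml's internal code):
# hypothetical test-set predictions from two candidate models
pred_a <- c(112000, 180500, 251000, 305000, 98000)
pred_b <- c(113500, 179000, 254000, 301000, 99500)
# if the predictions are nearly identical, the two models are redundant
if (cor(pred_a, pred_b, method = "pearson") > 0.95) {
  message("predictions are highly correlated; keep only the better-scoring model")
}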
In [13]:
#==============================================================================================
# Trimming models based on eval metric and correlation.
#
#===============================================================================================
# trim a few models
trim_out <- trim(output,
                 eval_metric = 'RMSLE',
                 eval_threshold = 0.3,
                 correlation_threshold = 0.95)
In [14]:
#==============================================================================================
# plotting the trimmed result
#
#===============================================================================================
plot(trim_out)
Based on the top-right subplot, we can see that one of the random forest models (rf_model_1) that made it through our threshold controls is a degenerate model. Even though it scored much better than many others, it simply found a sub-optimum where producing one of two values based on the inputs got an RMSE of slightly below 60000. This was hard to see before with all the clutter, but now that it is obvious, we will send this object back through the trimmer. This time, we won't use the correlation threshold.
In [15]:
#==============================================================================================
# Trimming models again.
#
#===============================================================================================
# trim a few models
trim_out <- trim(trim_out,
                 eval_metric = 'RMSLE',
                 eval_threshold = 0.15)
In [16]:
#==============================================================================================
# plotting the trimmed result
#
#===============================================================================================
plot(trim_out)
From the roughly 40 models we started with, we are down to two. The random forest model has performed much better than the GBM model, but it can be worth taking a look at both. From the top-right plot, you can see the random forest model seems to have learned something about the structure of the data and is providing reasonable estimates that begin to approximate the actual shape of the test data. The GBM model, however, does not yet seem to have learned enough to make reasonable predictions. Its predictions capture some of the processes dictating housing prices, but aren't good enough to really use.
So, I've automatically tuned over 40 models spanning deep learning, random forest, and gradient boosted machine algorithms. The majority of these models were not very good. Through iterative trimming, we selected the random forest model as the best. We can now get the new-data predictions from this model, or grab the model out of the mlblob object for further analysis. At this point, the model can be treated just as if it had been trained through the h2o interface directly.
H2O offers a lot of options for looking into models; they are pretty complex objects all by themselves. What the model object looks like depends on which algorithm produced it and what kind of problem it was trained on. Since the model I'm looking into is a random forest, the object includes a few extras, like variable importances, out of the box.
In [7]:
#=====================================================================================
#
# Save the best model predictions and go further with the selected model
#
#
#=====================================================================================
# The rf_model_0 is still on the h2o cluster before and after this command.
# here, we just grab it out into a new object to work with it easily.
# You get "rf_model_0" right off the summary graph of the mblob object
best_model <- h2o.getModel('rf_model_0')
# take a look at the model summary to see final parameters and more info on performance
summary(best_model)
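As a quick aside, you can also check directly which algorithm produced the selected model and which components H2O stored for it. A small sketch using standard H2OModel slots (the exact contents vary by h2o version and algorithm):
# the algorithm name, e.g. "drf" for a distributed random forest
best_model@algorithm
# the pieces H2O stored for this particular model (importances, scoring history, etc.)
names(best_model@model)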
Although the plot function in startml shows the validation training for as many models as fit on the figure, it is a good idea to take a closer look at the training process for any model that may be selected. For the random forest model, I'm taking a look at the training and validation loss scores over the number of weak learners, or trees, built.
Looks like I could have used a harder early stop. Interestingly, the MAE is actually lower for the validation set, which is not ideal. Usually I'm looking for training to be a little bit better than validation, indicating a small loss of goodness of fit between training and validation. It doesn't necessarily mean the model is bad, but it is generally a bit suspect in my opinion. In this case, I bet it comes from having such a small data set. With small data sets, the validation scores can sometimes be better simply because the small validation split may not represent all the difficulties encountered in the larger training set. Looking back at the training and validation density plots in the bottom left of the plot output, you can see the validation distribution is more likely to contain lower-to-mid range prices in a random sample than the training data. In the upper-right plot of observed versus predicted values, I can see that the selected random forest model might be generally better at making predictions in the lower-to-mid range.
When comparing two different fit metrics, it's also a good idea to remember how they are related to each other. For example, on the same data set, MAE is less than or equal to RMSE, and RMSE penalizes large errors more heavily than MAE; that is food for thought while considering the training history. In this example, small data and not having enough examples across the range of sale prices in each data split is probably the cause. In the future, I'm planning to build better data separations into startml, but for now I'll just move on with this model for the example.
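As a quick illustration of that relationship, here is a tiny base-R example with made-up error vectors, showing that MAE is never larger than RMSE and that the gap widens when the error is concentrated in a few points:
# helper metrics on a vector of residuals
mae  <- function(e) mean(abs(e))
rmse <- function(e) sqrt(mean(e^2))
errors_uniform <- c(1, 1, 1, 1)   # four equal errors
errors_spiky   <- c(0, 0, 0, 2)   # one larger error among zeros
mae(errors_uniform); rmse(errors_uniform)   # both equal 1 when all errors have the same magnitude
mae(errors_spiky);   rmse(errors_spiky)     # 0.5 vs 1: the single large error is penalized more by RMSE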
In [8]:
#========================================================================
#
# Plot the train and validation history for the selected model
#
#========================================================================
# these plots use reshape2, ggplot2, and gridExtra
library(reshape2)
library(ggplot2)
library(gridExtra)
# make a dataframe out of the training history frame from the h2o model object
# H2O only saves RMSE and MAE for the training history in this model
hist <- best_model@model$scoring_history
# melt the training and validation RMSE and MAE into long format,
# so the lines can be colored by variable (training vs. validation)
hist_melt_1 <- melt(hist[, c(3, 4, 7)], id = 1)
hist_melt_2 <- melt(hist[, c(3, 5, 8)], id = 1)
p1 <- ggplot(hist_melt_1) +
  geom_line(aes(x = number_of_trees, y = value, color = variable)) +
  ggtitle('Training and Validation RMSE')
p2 <- ggplot(hist_melt_2) +
  geom_line(aes(x = number_of_trees, y = value, color = variable)) +
  ggtitle('Training and Validation MAE')
# plot top - bottom
grid.arrange(p1, p2, ncol = 1, nrow = 2)
H2O includes a function to look at variable importances for different algorithms. Here I can see importance information for these variables in a nice table. Looking at the far-right column, the "OverallQual" (overall quality) of a house was a pretty important variable when predicting its price according to the random forest model, while "YrSold" (the year the house was sold) was not very important. Garage car capacity was also important in determining sale price. It's good news that these findings echo some of what I thought while looking at the training data before the modeling process.
In [55]:
#========================================================================
#
# Explore variable importances
#
#========================================================================
# take a look at the variable importances.
h2o.varimp(best_model)
I fed in newdata when launching startml in this notebook, so the predictions for that set are already in the mlblob object "output".
There's a function in startml to make an h2oFrame or a regular R dataframe with the predictions I want. For this small data set, we can load the final predictions right into the R workspace. For bigger data, you should not do this; leave the object as an h2oFrame instead.
In [26]:
#==============================================================================
#
# Get predictions from best model in a dataframe
#
#
#==============================================================================
# get the predictions from the best model as an h2o frame
newdata_predictions <- get_prediction('rf_model_0', output)
# make a regular data frame of the predictions with the ids from the original file, if small enough
predict_out <- data.frame(Id = as.data.frame(newdata$Id), Predictions = as.data.frame(newdata_predictions[[1]]$predict))
# print out to csv
write.csv(predict_out, 'rf_model_0_newdata_predictions.csv', quote = FALSE, row.names = FALSE)
# view it
predict_out
In [8]:
# experiments with dask and datashader
h2o.exportFile(train, 'vis_data/test.csv', force = TRUE, parts = -1)
In [ ]: