First, let us discuss what exactly autoencoders are.


In [ ]:
# Autoencoder using H2O

#CSCI6360 H2O WORKSHOP                                         

from IPython.display import Image,display
from IPython.core.display import HTML 
import matplotlib.pyplot as plot
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch



img = Image(url="images/autoencoder_structure.png")

img_1 = Image(url="images/autoencoder_equation.png")


display(img)

display(img_1)
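Before moving to H2O, here is a toy NumPy sketch of the idea the figures above illustrate: an encoder compresses the input, a decoder reconstructs it, and training means minimizing the reconstruction error. The weights below are random, untrained values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 5 samples with 8 features each
X = rng.normal(size=(5, 8))

# Encoder compresses 8 features down to 3; decoder maps back to 8
W_enc = rng.normal(size=(8, 3))
W_dec = rng.normal(size=(3, 8))

def relu(z):
    return np.maximum(z, 0.0)

code = relu(X @ W_enc)   # latent representation (the "bottleneck")
X_hat = code @ W_dec     # reconstruction of the input

# Training an autoencoder means minimizing this reconstruction error
mse = np.mean((X - X_hat) ** 2)
print(mse)
```

With random weights the error is large; gradient descent on the two weight matrices is what shrinks it.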

This is a workshop on H2O, a library used extensively in production environments, notably in healthcare and finance.

Here is its OFFICIAL WEBSITE

https://www.h2o.ai

First, let's create a separate environment for H2O in Anaconda and switch to it (once created, you can switch with conda activate h2o-py).

conda create --name h2o-py python=3.5 h2o h2o-py

Since I am currently on a Mac, I like to use a UI as I am a bit more comfortable with it; reducing complexity is nice!

What is h2o?

A library for building machine learning models with ease on huge datasets. It supports mxnet, tensorflow, and caffe; it is not an alternative to any of those, it just extends them as backends (h2o.ai!).

Keras is another example of a library that works like h2o.

We are now going to import h2o inside Python.

Advantages of having H2O:

1. A notable advantage is its variation on the stochastic gradient descent implementation: H2O's SGD algorithm is executed in parallel across all cores, and the training set is distributed across all nodes. At the end, an average is taken of the parameter values from all nodes.

For more details, see page 16 of:

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/DeepLearningBooklet.pdf
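The parallel scheme above can be sketched in plain NumPy: each simulated node runs SGD on its own shard of the data, and the per-node parameters are averaged at the end. This is a toy model-averaging illustration of the idea described in the booklet, not H2O's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic linear-regression data: y = X @ w_true + noise
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ w_true + 0.01 * rng.normal(size=200)

def sgd_on_shard(X_shard, y_shard, lr=0.05, epochs=20):
    # Plain per-sample SGD on the squared error, starting from zero weights
    w = np.zeros(2)
    for _ in range(epochs):
        for xi, yi in zip(X_shard, y_shard):
            grad = 2 * (xi @ w - yi) * xi
            w -= lr * grad
    return w

# Simulate 4 nodes, each training on its own shard of the data
shards = zip(np.array_split(X, 4), np.array_split(y, 4))
local_weights = [sgd_on_shard(Xs, ys) for Xs, ys in shards]

# Final model: elementwise average of the per-node parameters
w_avg = np.mean(local_weights, axis=0)
print(w_avg)
```

Because every shard sees data drawn from the same distribution, the averaged weights land close to the true parameters.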

Let's start programming with H2O.


In [ ]:
import h2o

h2o.init() #initialize h2o cluster


Once h2o is initialized, it automatically sets up the Spark cluster if Spark is configured as a backend; the same applies for the mxnet, tensorflow, and theano backends. We will discuss shortly how to use Spark.

Now let's look at our cluster's status info.


In [ ]:
h2o.cluster().show_status()

In [ ]:
h2o.ls()
#This line displays the list of objects present in the h2o cluster. Since there are
#no files in the cluster yet, the key column is empty

In [ ]:
#Now let's import a file into the H2O cluster

h2o.import_file("LICENSE")

In [ ]:
h2o.ls() #as you can see, the file now appears here, but we also have a duplicate
#because h2o.import_file was called on the same file twice

In [ ]:
h2o.remove("LICENSE1.hex") #remove the duplicate LICENSE file

In [ ]:
h2o.ls()

In [ ]:
help(h2o.import_file)

Let's load the ECG training dataset.


In [ ]:
train = h2o.import_file("data/ecg_discord_train.csv")

Let's load the ECG test dataset.


In [ ]:
test = h2o.import_file("data/ecg_discord_test.csv")

In [ ]:
model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
                                 hidden=[32, 32, 32],
                                 autoencoder=True,
                                 input_dropout_ratio=0.2,
                                 sparse=True,
                                 l1=1e-5,
                                 epochs=10)

In [ ]:
model.train(x=train.names, training_frame=train, validation_frame=test)

In [ ]:
model.anomaly(test) #per-row reconstruction error (MSE) on the test frame
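model.anomaly returns one reconstruction MSE per row; rows the autoencoder reconstructs poorly are the anomaly candidates. A common follow-up is to flag rows whose error exceeds some quantile of the training errors. Here is a self-contained sketch of that thresholding step; the error values are made up purely for illustration.

```python
import numpy as np

# Stand-ins for the per-row reconstruction errors model.anomaly would return
train_errors = np.array([0.01, 0.02, 0.015, 0.03, 0.012, 0.018])
test_errors  = np.array([0.02, 0.5, 0.016, 0.8])

# Flag any test row whose error exceeds the 95th percentile of training errors
threshold = np.quantile(train_errors, 0.95)
anomalies = np.where(test_errors > threshold)[0]
print(threshold, anomalies)
```

The quantile choice is a tuning knob: a higher quantile flags fewer, more extreme rows.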

In [ ]:
model_path = h2o.save_model(model=model, force=True)

In [ ]:
print(model_path)

In [ ]:
saved_model = h2o.load_model(model_path)

In [ ]:
print(saved_model)

In [ ]:
hyper_parameters = {'input_dropout_ratio': [0.1, 0.2, 0.5, 0.7]}

h2o_gridSearch = H2OGridSearch(H2ODeepLearningEstimator(activation="RectifierWithDropout",
                                                        hidden=[32, 32, 32],
                                                        autoencoder=True,
                                                        sparse=True,
                                                        l1=1e-5,
                                                        epochs=10),
                               hyper_parameters)

h2o_gridSearch.train(x=train.names, training_frame=train, validation_frame=test)

In [ ]:
print(h2o_gridSearch.get_grid(sort_by="mse"))

Now, which configuration do you think will win: the one with more epochs, the one with lower dropout, or something else?


In [ ]:
hyper_parameters = {'input_dropout_ratio': [0.1, 0.2, 0.5, 0.7], 'epochs': [10, 20, 30, 40]}

#epochs is now varied by the grid, so it is no longer fixed in the estimator
h2o_gridSearch = H2OGridSearch(H2ODeepLearningEstimator(activation="RectifierWithDropout",
                                                        hidden=[32, 32, 32],
                                                        autoencoder=True,
                                                        sparse=True,
                                                        l1=1e-5),
                               hyper_parameters)

h2o_gridSearch.train(x=train.names, training_frame=train, validation_frame=test)
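Under the hood, a Cartesian grid search simply trains one model per combination of the supplied hyperparameter values, so four dropout ratios times four epoch counts means 16 models. A plain-Python sketch of that expansion:

```python
from itertools import product

# Same grid as above; a Cartesian search expands it into every combination
hyper_parameters = {'input_dropout_ratio': [0.1, 0.2, 0.5, 0.7],
                    'epochs': [10, 20, 30, 40]}

combos = [dict(zip(hyper_parameters, values))
          for values in product(*hyper_parameters.values())]

print(len(combos))   # 16 combinations, hence 16 models to train
print(combos[0])
```

This is why grids grow quickly: each extra hyperparameter multiplies the number of models to train.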

In [ ]:
#Let's do cross validation

model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
                                 hidden=[32, 32, 32],
                                 autoencoder=True,
                                 input_dropout_ratio=0.2,
                                 sparse=True,
                                 l1=1e-5,
                                 epochs=10,
                                 nfolds=10)
model.train(x=train.names, training_frame=train, validation_frame=test)

#An autoencoder is unsupervised, so AUC does not apply; report the
#cross-validated reconstruction MSE instead
print(model.mse(xval=True))
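With nfolds=10, H2O partitions the training frame into ten folds and builds ten auxiliary models, each scored on the one fold it did not train on. A minimal sketch of random fold assignment (a simplified stand-in for H2O's default behavior) in plain NumPy:

```python
import numpy as np

n_rows, nfolds = 23, 10
rng = np.random.default_rng(0)

# Assign each row to one of the 10 folds at random
fold_assignment = rng.integers(0, nfolds, size=n_rows)

for k in range(nfolds):
    holdout = np.where(fold_assignment == k)[0]      # validation rows for model k
    train_rows = np.where(fold_assignment != k)[0]   # training rows for model k
    # model k would be trained on train_rows and scored on holdout
```

Averaging the ten holdout scores gives the cross-validated metric reported above.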