Model Parameter Search - Automated and Distributed

Back when I was building a movie recommender, the most time-consuming part was finding good training parameters; the right parameters can make all the difference between a lame model and a great one. Finding good training parameters is a very common problem in machine learning.

Fortunately, GraphLab Create makes it easy to tune training parameters. By calling model_parameter_search we can create a job that automatically searches for good parameters. With just one more line of code we can make it a distributed search, training and evaluating models in parallel.

Setup

The first step is to import graphlab and read in our data:


In [1]:
import graphlab as gl

data_url = 'https://static.turi.com/datasets/movie_ratings/sample.small'
movie_data = gl.SFrame.read_csv(data_url, delimiter='\t')


[INFO] Start server at: ipc:///tmp/graphlab_server-18448 - Server binary: /usr/local/lib/python2.7/dist-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1425080074.log
[INFO] GraphLab Server Version: 1.5.0
PROGRESS: Downloading https://static.turi.com/datasets/movie_ratings/sample.small to /var/tmp/graphlab-toby/18448/000000.small
PROGRESS: Finished parsing file https://static.turi.com/datasets/movie_ratings/sample.small
PROGRESS: Parsing completed. Parsed 100 lines in 2.06759 secs.
------------------------------------------------------
PROGRESS: Read 1549015 lines. Lines per second: 2.15258e+06
PROGRESS: Finished parsing file https://static.turi.com/datasets/movie_ratings/sample.small
PROGRESS: Parsing completed. Parsed 4000000 lines in 1.24665 secs.
Inferred types from first line of file as 
column_type_hints=[str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

Each row in our data represents a movie rating from a user. There are only three columns: user, movie and rating.


In [2]:
movie_data


Out[2]:
user          movie                                     rating
Jacob Smith   Flirting with Disaster                    4
Jacob Smith   Indecent Proposal                         3
Jacob Smith   Runaway Bride                             2
Jacob Smith   My Best Friend's Wedding                  3
Jacob Smith   Swiss Family Robinson                     1
Jacob Smith   The Mexican                               2
Jacob Smith   Maid in Manhattan                         4
Jacob Smith   A Charlie Brown Thanksgiving / The ...    3
Jacob Smith   Brazil                                    1
Jacob Smith   Forrest Gump                              3
...           ...                                       ...
[4000000 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

So What Exactly is Model Parameter Search?

To quickly create a recommender we can simply call the create method of factorization_recommender. We only need to pass it our data and tell it which columns hold the user id, the item id, and the prediction target. That looks like:


In [ ]:
model = gl.factorization_recommender.create(movie_data, user_id='user',
                                            item_id='movie', target='rating')

If we did it this way, the default values would be used for all of the training parameters. All of the models in GraphLab Create come with good default values. However, no single value will ever be optimal for all data. With just a little work we can find better parameter values and create a more effective model.

To tell which parameter values are best, we have to be able to measure a model's performance. It's important not to use the same data both to train the model and to evaluate its effectiveness. So we'll create a random split of our data, using 80% for training the models and the other 20% for evaluating them.


In [3]:
train_set, validation_set = movie_data.random_split(0.8)

Once we have a trained model, we can evaluate it with our validation_set:


In [ ]:
evaluation = model.evaluate(validation_set)
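
The evaluation is a dictionary of metrics. The exact keys can vary across GraphLab Create versions, so as a rough sketch (assuming the overall validation RMSE lives under a key named 'rmse_overall', which is an assumption, not something this notebook confirms), printing the headline score would look like:


In [ ]:
# 'rmse_overall' is an assumed key name; inspect evaluation.keys()
# in your version to find the overall RMSE entry.
print(evaluation['rmse_overall'])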

In a nutshell, model parameter search trains several different models, each with different values for the training parameters, then evaluates each of the models.
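
To make that concrete, here is a rough sketch of the manual loop that model_parameter_search automates, trying each candidate value of num_factors by hand:


In [ ]:
# Manual equivalent of a parameter search over num_factors:
# train one model per candidate value, then evaluate each one.
manual_results = {}
for k in [4, 5, 6, 7]:
    m = gl.factorization_recommender.create(train_set, user_id='user',
                                            item_id='movie', target='rating',
                                            num_factors=k)
    manual_results[k] = m.evaluate(validation_set)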

There are a lot of different parameters we could tweak when creating a factorization_recommender. Probably the most important is the number of latent factors. With one call to model_parameter_search we can easily search over several different values for the number of latent factors.

The first argument to model_parameter_search is a tuple containing the training set and the validation set. The second argument is the function that creates the model; in our case that's gl.factorization_recommender.create. In addition, we need to specify the parameters that will be used to create the models. There are two types of parameters: fixed parameters and free parameters. Fixed parameters are the same for all of the models that get created; for us that's user_id, item_id, and target. Free parameters are the parameters you want to search over; for us that's num_factors.

Putting it all together we get:


In [4]:
job = gl.model_parameter_search.create(
    (train_set, validation_set), 
    gl.factorization_recommender.create, 
    model_parameters = {'user_id': 'user', 'item_id': 'movie', 'target': 'rating', 'num_factors': [4, 5, 6, 7]} 
)


[INFO] Validating job.
[INFO] Creating a LocalAsync environment called 'async'.
[INFO] Validation complete. Job: 'Model-Parameter-Search-Feb-27-2015-15-35-51' ready for execution
[INFO] Job: 'Model-Parameter-Search-Feb-27-2015-15-35-51' scheduled.

By default, the job will run asynchronously in a background process. We can check whether the job has completed by calling job.get_status():


In [5]:
job.get_status()


Out[5]:
{'Canceled': 0, 'Completed': 0, 'Failed': 0, 'Pending': 0, 'Running': 1}

It will take a few minutes to train and evaluate all four models.


In [6]:
job.get_status()


Out[6]:
{'Canceled': 0, 'Completed': 1, 'Failed': 0, 'Pending': 0, 'Running': 0}

Getting the Best Model

Once the job has completed, we can get the results by calling job.get_results(). The results contain two things: all of the models that were created and summary information about each model, including its RMSE on the validation set. With a little work we can determine the best RMSE score and get the corresponding model:


In [7]:
search_summary = job.get_results()
best_RMSE = search_summary['validation_rmse'].min()
best_model_id = search_summary[search_summary['validation_rmse'] == best_RMSE]['model_id'][0]

best_model_id identifies the best of the four models we searched over.
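
Equivalently, since search_summary behaves like an SFrame, we can sort it by validation RMSE and take the top row; a small convenience sketch:


In [ ]:
# Sort ascending by validation RMSE; the first row then
# corresponds to the best-performing model.
best_row = search_summary.sort('validation_rmse')[0]
print(best_row['model_id'])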

The more parameter combinations we try, the more likely we are to find an even better model. We might want to try a larger range for the number of latent factors, and there are other parameters we can tweak too. For example, regularization is another important parameter to tune.

As we increase the number of parameters and the range of values for each, the number of combinations grows quickly. Doing the entire search on just your own computer could take a long time.
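
For instance, four values of num_factors crossed with three values of regularization already means twelve models to train and evaluate (the values below are purely illustrative):


In [ ]:
import itertools

# Every combination in the grid becomes one training run.
num_factors_vals = [4, 5, 6, 7]
regularization_vals = [1e-4, 1e-6, 1e-8]
grid = list(itertools.product(num_factors_vals, regularization_vals))
print(len(grid))  # 4 * 3 = 12 models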

Making it Distributed

With only a couple more lines of code we can make our search distributed, training and evaluating models in parallel. GraphLab Create makes it easy to use either Amazon Web Services or a Hadoop cluster. All we need to do is create a deployment environment and pass it to model_parameter_search.

To use a Hadoop cluster, create an environment object like this:


In [ ]:
hadoop_cluster = gl.deploy.hadoop_cluster.create(name = '<name of hadoop cluster>',
                                                 turi_dist_path = '<distributed path>',
                                                 hadoop_conf_dir = '<path to hadoop config dir>')

To use an EC2 environment with three hosts, create an environment like this:


In [ ]:
ec2_config = gl.deploy.Ec2Config(aws_access_key_id = '<my access key>',
                                 aws_secret_access_key = '<my secret key>')
my_env = gl.deploy.ec2_cluster.create('<name for my environment>',
                                       s3_path = 's3://<my bucket name>',
                                       num_hosts = 3, 
                                       ec2_config = ec2_config)

Searching over several values for num_factors and regularization (the regularization values below are just illustrative), and using our distributed environment, the model_parameter_search call will look like:


In [ ]:
job = gl.model_parameter_search.create(
    (train_set, validation_set),
    gl.factorization_recommender.create,
    environment = my_env,
    model_parameters = {'user_id': 'user', 'item_id': 'movie', 'target': 'rating',
                        'num_factors': [4, 5, 6, 7],
                        'regularization': [1e-4, 1e-6, 1e-8]}  # illustrative values
)

Once the job has completed, we can get the best model in exactly the same way we did before.