Back when I was building a movie recommender, the most time-consuming part was finding good training parameters: the right parameters can make all the difference between a lame model and a great one. Finding good training parameters is a very common problem in machine learning.
Fortunately, GraphLab Create makes it easy to tune training parameters. By calling model_parameter_search we can create a job that automatically searches for good parameter values. With just one more line of code we can make the search distributed, training and evaluating models in parallel.
The first step is to import graphlab and read in our data:
In [1]:
import graphlab as gl
data_url = 'https://static.turi.com/datasets/movie_ratings/sample.small'
movie_data = gl.SFrame.read_csv(data_url, delimiter='\t')
Each row in our data represents a movie rating from a user. There are only three columns: user, movie and rating.
In [2]:
movie_data
Out[2]:
To quickly create a recommender we can simply call the create method of factorization_recommender. We only need to pass it our data and tell it which columns hold the user id, the item id, and the prediction target. That looks like:
In [ ]:
model = gl.factorization_recommender.create(movie_data, user_id='user',
item_id='movie', target='rating')
If we did it this way, the default values would be used for all of the training parameters. All of the models in GraphLab Create come with good defaults, but no single set of values is optimal for every dataset. With a little work we can find better parameter values and create a more effective model.
To identify the best parameter values, we need a way to measure a model's performance, and it's important not to use the same data both to train a model and to evaluate it. So we'll randomly split our data, using 80% for training models and the remaining 20% for evaluating them.
In [3]:
train_set, validation_set = movie_data.random_split(0.8)
Once we have a trained model, we can evaluate it against our validation_set:
In [ ]:
evaluation = model.evaluate(validation_set)
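For a numeric target like rating, the key evaluation metric is RMSE (root mean squared error). As a plain-Python illustration of the metric itself, independent of GraphLab, with made-up ratings and predictions:

```python
import math

# Hypothetical held-out ratings and the model's predictions for them
actual    = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 2.5]

# RMSE: square each error, average the squares, then take the square root
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(rmse)  # 0.6123724356957945
```

Lower RMSE means the model's predicted ratings are closer to the ratings users actually gave.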
In a nutshell, model parameter search trains several different models, each with different values for the training parameters, then evaluates each of those models.
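Conceptually, the search is just a loop: train one model per candidate value, score each model on the validation set, and keep the best. A minimal sketch with stand-in train and evaluate functions (both hypothetical, not part of GraphLab):

```python
def train(num_factors):
    # Stand-in for gl.factorization_recommender.create(...)
    return {'num_factors': num_factors}

def validation_rmse(model):
    # Stand-in for model.evaluate(validation_set); here we pretend
    # 6 latent factors gives the lowest error
    return abs(model['num_factors'] - 6) + 1.0

best_model, best_rmse = None, float('inf')
for num_factors in [4, 5, 6, 7]:
    model = train(num_factors)
    rmse = validation_rmse(model)
    if rmse < best_rmse:
        best_model, best_rmse = model, rmse

print(best_model)  # {'num_factors': 6}
```

model_parameter_search automates exactly this loop, and can also run the iterations in parallel.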
There are a lot of different parameters we could tweak when creating a factorization_recommender. Probably the most important is the number of latent factors. With one call to model_parameter_search we can easily search over several different values for the number of latent factors.
The first argument to model_parameter_search is a tuple of the training set and the validation set. The second argument is the function that creates the model; in our case that's gl.factorization_recommender.create. In addition, we need to specify the parameters that will be used to create the models. There are two kinds: fixed parameters and free parameters. Fixed parameters are the same for every model that gets created; for us those are user_id, item_id, and target. Free parameters are the ones you want to search over; for us that's num_factors.
Putting it all together we get:
In [4]:
job = gl.model_parameter_search.create(
(train_set, validation_set),
gl.factorization_recommender.create,
model_parameters = {'user_id': 'user', 'item_id': 'movie', 'target': 'rating', 'num_factors': [4, 5, 6, 7]}
)
By default, the job runs asynchronously in a background process. We can check whether the job has completed by calling job.get_status():
In [5]:
job.get_status()
Out[5]:
It will take a few minutes to train and evaluate the four models.
In [6]:
job.get_status()
Out[6]:
Once the job is completed, we can get the results by calling job.get_results(). The results contain two things: all of the models that were created and summary information about each model. The summary information includes the RMSE on the validation set. With a little work we can find the best RMSE score and the id of the corresponding model:
In [7]:
search_summary = job.get_results()
best_RMSE = search_summary['validation_rmse'].min()
best_model_id = search_summary[search_summary['validation_rmse'] == best_RMSE]['model_id'][0]
best_model_id identifies the best of the four models we searched over.
The more parameter combinations we try, the more likely we are to find an even better model. We might want to try a larger range for the number of latent factors, and there are other parameters we can tweak too. For example, regularization is another important parameter to tune.
As we increase the number of parameters and the range of values for each, the number of combinations grows quickly. Running the entire search on a single machine could take a long time.
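To see how quickly the grid grows, count the combinations: each additional free parameter multiplies the total. A quick illustration (the parameter values below are made up for the example):

```python
from itertools import product

# A hypothetical search grid: each key is a free parameter
grid = {
    'num_factors':           [4, 8, 16, 32],      # 4 values
    'regularization':        [1e-9, 1e-7, 1e-5],  # 3 values
    'linear_regularization': [1e-9, 1e-7],        # 2 values
}

# Every combination means one model to train and evaluate
combinations = list(product(*grid.values()))
print(len(combinations))  # 4 * 3 * 2 = 24
```

Just three parameters with a handful of values each already means training 24 models, which is why distributing the search pays off.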
With only a couple more lines of code we can make our search distributed, training and evaluating models in parallel. GraphLab Create makes it easy to use either Amazon Web Services or a Hadoop cluster. All we need to do is create a deployment environment and pass it to model_parameter_search.
To use a Hadoop cluster, create an environment object like this:
In [ ]:
hadoop_cluster = gl.deploy.hadoop_cluster.create(name = '<name of hadoop cluster>',
                                                 turi_dist_path = '<distributed path>',
                                                 hadoop_conf_dir = '<path to hadoop config dir>')
To use an EC2 environment with three hosts, create an environment like this:
In [ ]:
ec2_config = gl.deploy.Ec2Config(aws_access_key_id = '<my access key>',
aws_secret_access_key = '<my secret key>')
my_env = gl.deploy.ec2_cluster.create('<name for my environment>',
s3_path = 's3://<my bucket name>',
num_hosts = 3,
ec2_config = ec2_config)
Searching over several values for num_factors and regularization, and using our distributed environment, the model_parameter_search call will look like:
In [ ]:
job = gl.model_parameter_search.create(
(train_set, validation_set),
gl.factorization_recommender.create,
environment = my_env,
model_parameters = {'user_id': 'user', 'item_id': 'movie', 'target': 'rating',
                    'num_factors': [4, 8, 12, 16], 'regularization': [1e-9, 1e-7, 1e-5]}
)
Once the job has completed we can get the best model in exactly the same way we did before.