Building a Recommender with Implicit Data - Million Song

In this notebook we will import GraphLab Create and use it to download data from Amazon S3 containing information about songs that users are listening to, train two models that can be used for recommending new songs to users, and compare the performance of the two models.

Note: This notebook uses GraphLab Create 1.0.



In [1]:

    
import graphlab as gl
# set canvas to show sframes and sgraphs in ipython notebook
gl.canvas.set_target('ipynb')
import matplotlib.pyplot as plt
%matplotlib inline









    



[INFO] This commercial license of GraphLab Create is assigned to engr@turi.com.

[INFO] Start server at: ipc:///tmp/graphlab_server-43425 - Server binary: /Users/zach/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1439506159.log
[INFO] GraphLab Server Version: 1.5.2

After importing GraphLab Create, we can download data directly from S3. We have placed a preprocessed version of the Million Song Dataset on S3. This data set was used for a Kaggle challenge and includes data from The Echo Nest, SecondHandSongs, musiXmatch, and Last.fm. This file includes data for a subset of 10000 songs.



In [2]:

    
train_file = 'https://static.turi.com/datasets/millionsong/10000.txt'

# The below will download a 118 MB file.
sf = gl.SFrame.read_csv(train_file, header=False, delimiter='\t', verbose=False)
sf.rename({'X1':'user_id', 'X2':'song_id', 'X3':'listen_count'}).show()









    




PROGRESS: Read 844838 lines. Lines per second: 512345
PROGRESS: Finished parsing file https://static.turi.com/datasets/millionsong/10000.txt
PROGRESS: Parsing completed. Parsed 2000000 lines in 2.37174 secs.






    



------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In order to evaluate the performance of our model, we randomly split the observations in our data set into two partitions: train_set, with roughly 80% of the observations, is used to create our model, and test_set, with the remainder, is used to evaluate its performance.



In [3]:

    
(train_set, test_set) = sf.random_split(0.8, seed=1)
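To make the split concrete, here is a minimal plain-Python sketch of a seeded, row-wise random split. It is an illustration only: the helper random_split and the toy observations list are hypothetical names, and we are assuming (not asserting) that each row is assigned independently, which would make the split approximately rather than exactly 80/20.

```python
import random

def random_split(rows, fraction, seed=None):
    """Send each row to the first partition with probability `fraction`."""
    rng = random.Random(seed)  # a seeded generator makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

# Toy (user_id, song_id, listen_count) rows standing in for the real data.
observations = [("user%d" % (i % 7), "song%d" % (i % 13), 1 + i % 3)
                for i in range(1000)]

train_rows, test_rows = random_split(observations, 0.8, seed=1)
print(len(train_rows), len(test_rows))
```

Fixing the seed means that rerunning the split gives the same partitions, so model comparisons stay reproducible.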

It is usually a good idea to start with a simple recommendation system that can serve as a baseline and verify that the rest of the pipeline works as expected. The recommender package has several models available for this purpose. For example, we can create a model that predicts songs based on their overall popularity across all users.



In [4]:

    
popularity_model = gl.popularity_recommender.create(train_set, 'user_id', 'song_id')









    




PROGRESS: Recsys training: model = popularity
PROGRESS: Warning: Column 'listen_count' ignored.
PROGRESS:     To use this column as the target, set target = "listen_count" and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 1599753 observations with 76085 users and 10000 items.
PROGRESS:     Data prepared in: 2.53525s
PROGRESS: 1599753 observations to process; with 10000 unique items.
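The idea behind a popularity baseline can be sketched in a few lines of plain Python. This is a simplified illustration, not GraphLab's implementation: the helpers train_popularity and recommend_popular are hypothetical, and we assume a song's popularity is simply the number of distinct users who listened to it (the listen_count column is ignored, as in the warning above).

```python
from collections import Counter

def train_popularity(observations):
    """Count how many distinct users listened to each song."""
    listeners = Counter()
    for user_id, song_id in set(observations):  # drop repeated pairs
        listeners[song_id] += 1
    return listeners

def recommend_popular(listeners, seen, k):
    """Return the k most popular songs the user has not listened to yet."""
    ranked = sorted(listeners, key=lambda song: (-listeners[song], song))
    return [song for song in ranked if song not in seen][:k]

# Toy (user_id, song_id) pairs.
toy_observations = [("u1", "a"), ("u2", "a"), ("u3", "a"),
                    ("u1", "b"), ("u2", "b"), ("u3", "c")]
toy_listeners = train_popularity(toy_observations)

print(recommend_popular(toy_listeners, seen={"a", "c"}, k=2))
```

Every user gets the same ranking, minus the songs they already know, which is what makes this model a useful sanity check for the rest of the pipeline.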

Collaborative filtering methods make predictions for a given user based on the patterns of other users' activities. One common technique is to compare items based on their Jaccard similarity. For two songs, this measurement is a ratio: the number of users who listened to both songs, over the number of distinct users who listened to at least one of them. We could also have used a slightly more complicated similarity measurement, cosine similarity. In the following code block, we compute all the item-item similarities and create an object that can be used for recommendations.



In [5]:

    
item_sim_model = gl.item_similarity_recommender.create(train_set, 'user_id', 'song_id')









    




PROGRESS: Recsys training: model = item_similarity
PROGRESS: Warning: Column 'listen_count' ignored.
PROGRESS:     To use this column as the target, set target = "listen_count" and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 1599753 observations with 76085 users and 10000 items.
PROGRESS:     Data prepared in: 2.15243s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 10000 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 2.39126         |
PROGRESS: | 2000            | 2.49888         |
PROGRESS: | 3000            | 2.60216         |
PROGRESS: | 4000            | 2.70135         |
PROGRESS: | 5000            | 2.79927         |
PROGRESS: | 6000            | 2.89675         |
PROGRESS: | 7000            | 2.99166         |
PROGRESS: | 8000            | 3.09126         |
PROGRESS: | 9000            | 3.20585         |
PROGRESS: | 10000           | 3.41084         |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 3.83026s
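As a small worked example of the Jaccard similarity between two songs, the sketch below computes it directly from sets of listeners. The function jaccard and the toy listens data are hypothetical names used only for illustration; the model above computes its similarities internally.

```python
from collections import defaultdict

def jaccard(users_a, users_b):
    """Users who listened to both songs, over users who listened to either."""
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / float(len(union))

# Toy (user_id, song_id) pairs.
listens = [("u1", "song_x"), ("u2", "song_x"), ("u3", "song_x"),
           ("u2", "song_y"), ("u3", "song_y"), ("u4", "song_y"),
           ("u5", "song_z")]

# Group the listeners of each song into a set.
song_listeners = defaultdict(set)
for user_id, song_id in listens:
    song_listeners[song_id].add(user_id)

# song_x and song_y share 2 of 4 distinct listeners.
print(jaccard(song_listeners["song_x"], song_listeners["song_y"]))
# song_x and song_z share no listeners.
print(jaccard(song_listeners["song_x"], song_listeners["song_z"]))
```

Because only the presence of a listen matters, this measure suits implicit data where we do not trust the listen counts themselves.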

It's straightforward to use GraphLab to compare models on a small subset of users in the test_set. The precision-recall plot that is computed shows the benefits of using the similarity-based model instead of the baseline popularity_model: better curves tend toward the upper-right hand corner of the plot.

The following command finds the top-ranked items for a random sample of 10% of the users in test_set (user_sample=.1). The observations in train_set are passed as skip_set, so they are not included in the predicted items.



In [6]:

    
result = gl.recommender.util.compare_models(test_set, [popularity_model, item_sim_model],
                                            user_sample=.1, skip_set=train_set)









    



compare_models: using 6871 users to estimate model performance
PROGRESS: Evaluate model M0
PROGRESS: recommendations finished on 1000/6871 queries. users per second: 2611.92
PROGRESS: recommendations finished on 2000/6871 queries. users per second: 2672.28
PROGRESS: recommendations finished on 3000/6871 queries. users per second: 2725.87
PROGRESS: recommendations finished on 4000/6871 queries. users per second: 2792.1
PROGRESS: recommendations finished on 5000/6871 queries. users per second: 2784.28
PROGRESS: recommendations finished on 6000/6871 queries. users per second: 2745.44






    




Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   2    | 0.0336195604715 | 0.0137218945029 |
|   4    | 0.0298719254839 | 0.0253942926557 |
|   6    | 0.0265851647019 | 0.0335540040839 |
|   8    | 0.0242504730025 |  0.039799426244 |
|   10   | 0.0221510697133 | 0.0451777641091 |
|   12   | 0.0204725173434 | 0.0494218080674 |
|   14   | 0.0190344512132 |  0.052807397536 |
|   16   | 0.0180195750255 | 0.0564872089885 |
|   18   | 0.0171170297062 | 0.0600883057356 |
|   20   | 0.0164677630621 | 0.0635676545243 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]






    



[WARNING] Model trained without a target. Skipping RMSE computation.






    




PROGRESS: recommendations finished on 1000/6871 queries. users per second: 784.808
PROGRESS: recommendations finished on 2000/6871 queries. users per second: 791.153
PROGRESS: recommendations finished on 3000/6871 queries. users per second: 803.888
PROGRESS: recommendations finished on 4000/6871 queries. users per second: 787.441
PROGRESS: recommendations finished on 5000/6871 queries. users per second: 776.934
PROGRESS: recommendations finished on 6000/6871 queries. users per second: 772.965
PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   2    |  0.181924028526 | 0.0872725885118 |
|   4    |  0.144265754621 |  0.128513006725 |
|   6    |  0.122956386746 |  0.158391259054 |
|   8    |  0.107171445204 |  0.180931388766 |
|   10   |  0.096128656673 |  0.199777422339 |
|   12   | 0.0882088972978 |  0.217193022964 |
|   14   | 0.0804623989022 |  0.228707844047 |
|   16   |  0.075089142774 |  0.241476274474 |
|   18   |  0.070416727308 |  0.252395161545 |
|   20   | 0.0663513316839 |  0.26287160669  |
+--------+-----------------+-----------------+
[10 rows x 3 columns]






    



[WARNING] Model trained without a target. Skipping RMSE computation.
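The tables above report precision and recall averaged over the sampled users at each cutoff. For a single user the two quantities can be sketched as follows. The helper precision_recall_at_k and the toy lists are hypothetical, and how GraphLab treats edge cases (for example users with no held-out songs) is an assumption here.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    recommended: ranked list of song ids, best first.
    relevant: set of song ids the user actually listened to in the test set.
    """
    hits = len(set(recommended[:k]) & set(relevant))
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

recommended = ["a", "b", "c", "d"]   # model output for one user
relevant = {"b", "d", "e"}           # that user's held-out songs

for cutoff in (2, 4):
    print(cutoff, precision_recall_at_k(recommended, relevant, cutoff))
```

Precision falls and recall rises as the cutoff grows, which is the pattern visible in both tables.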






    









    




Now let's ask the item similarity model for song recommendations for several users. We first create a list of 100 users; the observations that pertain to these users are collected later, in user_data, when we merge in the song metadata.



In [7]:

    
K = 10
users = gl.SArray(sf['user_id'].unique().head(100))

Next we use the recommend() function to query the model we created for recommendations. The returned object has four columns: user_id, song_id, the score that the algorithm gave this user for this song, and the song's rank (an integer from 1 to K, as the output below shows). To see this we can grab the top few rows of recs:



In [8]:

    
recs = item_sim_model.recommend(users=users, k=K)
recs.head()









Out[8]:

+----------------------------------------------+--------------------+-----------------+------+
|                   user_id                    |      song_id       |      score      | rank |
+----------------------------------------------+--------------------+-----------------+------+
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | SOLIXJW12A58A79D02 | 0.0241561375236 |  1   |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | SONYKOW12AB01849C9 | 0.0241061375695 |  2   |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | SOFGTOC12A8C13B2A8 | 0.0238511726039 |  3   |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | SOLFXKT12AB017E3E0 | 0.0232836258676 |  4   |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | SOXKFRI12A8C137A5F | 0.0220054967621 |  5   |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | SOAXGDH12A8C13F8A1 | 0.0218985339593 |  6   |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | SOTWNDJ12A8C143984 | 0.0216148245634 |  7   |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | SODWUBY12A6D4F8E8A | 0.0207981940013 |  8   |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | SOYQQAC12A6D4FD59E | 0.020768326762  |  9   |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | SOUSMXX12AB0185C24 | 0.0206590470683 |  10  |
+----------------------------------------------+--------------------+-----------------+------+
[10 rows x 4 columns]
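To see where a score and a rank could come from, here is a rough sketch of one common scheme: score each unseen song by its average similarity to the songs the user has listened to, then number the top K from 1. This is an assumption made for illustration, not a description of what GraphLab computes internally; the function recommend_from_similarity and the toy similarity table are hypothetical.

```python
def recommend_from_similarity(similarity, listened, k):
    """Rank unseen songs by average similarity to the user's songs.

    similarity: dict mapping song -> {other_song: similarity}.
    listened: non-empty set of songs the user has listened to.
    Returns a list of (song, score, rank) with rank running from 1 to k.
    """
    scores = {}
    for candidate, neighbors in similarity.items():
        if candidate in listened:
            continue  # never recommend something the user already has
        total = sum(neighbors.get(song, 0.0) for song in listened)
        scores[candidate] = total / float(len(listened))
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:k]
    return [(song, score, rank)
            for rank, (song, score) in enumerate(ranked, start=1)]

# Toy item-item similarities.
toy_similarity = {
    "a": {"b": 0.5, "c": 0.2},
    "b": {"a": 0.5, "c": 0.4},
    "c": {"a": 0.2, "b": 0.4},
    "d": {"a": 0.1},
}

for row in recommend_from_similarity(toy_similarity, listened={"a"}, k=2):
    print(row)
```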

To learn what songs these ids pertain to, we can merge in metadata about each song.



In [9]:

    
# Get the meta data of the songs

# The below will download a 75 MB file.
songs = gl.SFrame.read_csv('https://static.turi.com/datasets/millionsong/song_data.csv', verbose=False)
songs = songs[['song_id', 'title', 'artist_name']]
results = recs.join(songs, on='song_id', how='inner')

# Populate observed user-song data with song info
userset = frozenset(users)
ix = sf['user_id'].apply(lambda x: x in userset, int)  
user_data = sf[ix]
user_data = user_data.join(songs, on='song_id')[['user_id', 'title', 'artist_name']]









    




PROGRESS: Read 637410 lines. Lines per second: 336663
PROGRESS: Finished parsing file https://static.turi.com/datasets/millionsong/song_data.csv
PROGRESS: Parsing completed. Parsed 1000000 lines in 2.39243 secs.






    



------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[str,str,str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
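The inner join used to merge in the metadata can be sketched with plain dictionaries. The helper inner_join and the toy rows are hypothetical. One property worth noticing is that if the metadata holds more than one row for a song_id, the join emits one output row per match, which is a possible (unverified) reason for a recommendation appearing twice in printed results.

```python
def inner_join(left_rows, right_rows, key):
    """Join two lists of dicts on `key`, keeping only keys present in both."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):  # no match: row is dropped
            merged = dict(right)
            merged.update(left)
            joined.append(merged)
    return joined

toy_recs = [{"song_id": "S1", "rank": 1},
            {"song_id": "S2", "rank": 2},
            {"song_id": "S9", "rank": 3}]      # S9 has no metadata
toy_songs = [{"song_id": "S1", "title": "First"},
             {"song_id": "S2", "title": "Second"},
             {"song_id": "S2", "title": "Second"}]  # duplicated metadata row

toy_results = inner_join(toy_recs, toy_songs, "song_id")
for row in toy_results:
    print(row["rank"], row["title"])
```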



In [10]:

    
# Print the observed songs and the recommendations for the first five users.
for i, user in enumerate(users.head(5)):
    print("User: " + str(i + 1))
    # Songs this user was observed listening to (title and artist only).
    user_obs = user_data[user_data['user_id'] == user][['title', 'artist_name']]
    # Recommendations for this user, with the song metadata merged in.
    user_recs = results[results['user_id'] == user][['title', 'artist_name']]

    print("We were told that the user liked these songs: ")
    print(user_obs.head(K))

    print("We recommend these other songs:")
    print(user_recs.head(K))

    print("")









    



User: 1
We were told that the user liked these songs: 
+---------------------------+-------------------------+
|           title           |       artist_name       |
+---------------------------+-------------------------+
|      No Me Ensenaste      |          Thalia         |
|   Love Is A Losing Game   |      Amy Winehouse      |
|    You Know I'm No Good   |      Amy Winehouse      |
|         Las flores        |       Café Tacvba       |
|        No One Knows       | Queens Of The Stone Age |
|     Burden In My Hand     |       Soundgarden       |
|    Rain On Your Parade    |          Duffy          |
|   Declaration of Purpose  |   Brilliant Red Lights  |
| Uninvited (Album Version) |    Alanis Morissette    |
|       Peligroso Pop       |     Plastilina Mosh     |
+---------------------------+-------------------------+
[10 rows x 2 columns]

We recommend these other songs:
+--------------------------------+---------------------------+
|             title              |        artist_name        |
+--------------------------------+---------------------------+
|        Some Unholy War         |       Amy Winehouse       |
|     Love Is A Losing Game      |       Amy Winehouse       |
| Dog Days Are Over (Radio Edit) |   Florence + The Machine  |
|         Wake Up Alone          |       Amy Winehouse       |
|          Just Friends          |       Amy Winehouse       |
|              OMG               | Usher featuring will.i.am |
|           Fireflies            |     Charttraxx Karaoke    |
|            Secrets             |        OneRepublic        |
|            Marry Me            |           Train           |
|           Esa noche            |        Café Tacvba        |
+--------------------------------+---------------------------+
[10 rows x 2 columns]


User: 2
We were told that the user liked these songs: 
+--------------------------+--------------------+
|          title           |    artist_name     |
+--------------------------+--------------------+
|     With Pen In Hand     |   Johnny Darrell   |
|         Patience         |   Guns N' Roses    |
|      Lemme Get That      |      Rihanna       |
|   Move It Up (Muévela)   |    Jody Bernal     |
|    Brave The Elements    |      Colossal      |
|    Question Existing     |      Rihanna       |
| Un-thinkable (I'm Ready) |    Alicia Keys     |
|         Timeless         | Ron Van Den Beuken |
|     Bringing Us Down     |    Les Savy Fav    |
|        Wind Farm         |       Lange        |
+--------------------------+--------------------+
[10 rows x 2 columns]

We recommend these other songs:
+------------------------+------------------------------+
|         title          |         artist_name          |
+------------------------+------------------------------+
|         Thump          |       Simon Patterson        |
|         Flash          |         Johan Gielen         |
|      Solid State       | Menno De Jong feat. Relocate |
|     Keep Our Ring      |          Sunlounger          |
|         Apple          |       Sander Van Doorn       |
| Love All The Pain Away |         Ronski Speed         |
| Love All The Pain Away |         Ronski Speed         |
|         Kiksu          |        Kyau & Albert         |
|  More Than Everything  |         Gareth Emery         |
|       First Time       |         Offer Nissim         |
+------------------------+------------------------------+
[10 rows x 2 columns]


User: 3
We were told that the user liked these songs: 
+-------------------------------+-------------------------------+
|             title             |          artist_name          |
+-------------------------------+-------------------------------+
|           Rock Star           |            N.E.R.D.           |
|              Lump             | The Presidents of the Unit... |
|           Man To Man          |           Gary Allan          |
|         Personal Jesus        |         Marilyn Manson        |
|           Too Close           |              Next             |
| Been Caught Stealing ( LP ... |        Jane's Addiction       |
|             Pepper            |        Butthole Surfers       |
|              Undo             |             Björk             |
|          Crying Shame         |          Jack Johnson         |
|         Make You Smile        |              +44              |
+-------------------------------+-------------------------------+
[10 rows x 2 columns]

We recommend these other songs:
+-------------------------------+---------------------------+
|             title             |        artist_name        |
+-------------------------------+---------------------------+
| The Only Exception (Album ... |          Paramore         |
|          Bulletproof          |          La Roux          |
|         The Scientist         |          Coldplay         |
| Bleed It Out [Live At Milt... |        Linkin Park        |
|          Use Somebody         |       Kings Of Leon       |
|          Use Somebody         |       Kings Of Leon       |
|              OMG              | Usher featuring will.i.am |
|        Hey_ Soul Sister       |           Train           |
|           Fireflies           |     Charttraxx Karaoke    |
|            Secrets            |        OneRepublic        |
+-------------------------------+---------------------------+
[10 rows x 2 columns]


User: 4
We were told that the user liked these songs: 
+-----------------------------+----------------+
|            title            |  artist_name   |
+-----------------------------+----------------+
|          Blind Date         | Bouncing Souls |
| Knocking On Forbidden Doors |     Enigma     |
|       Absence of Fear       |  War Of Ages   |
|    Victoria (LP Version)    |    Old 97's    |
+-----------------------------+----------------+
[4 rows x 2 columns]

We recommend these other songs:
+-------------------------------+------------------------+
|             title             |      artist_name       |
+-------------------------------+------------------------+
| Val's Blues (Digitally Rem... |      Louis Smith       |
|        The Big Gundown        |      The Prodigy       |
|              Rain             |       Subhumans        |
| Ain't No Rest For The Wick... |   Cage The Elephant    |
|        Sei Lá Mangueira       |    Elizeth Cardoso     |
|     West One (Shine On Me)    |        The Ruts        |
|  A Beggar On A Beach Of Gold  | Mike And The Mechanics |
|        Who Can Compare        |     Foolish Things     |
|           In One Ear          |   Cage The Elephant    |
|     Back Against The Wall     |   Cage The Elephant    |
+-------------------------------+------------------------+
[10 rows x 2 columns]


User: 5
We were told that the user liked these songs: 
+-------------------------------+----------------------------+
|             title             |        artist_name         |
+-------------------------------+----------------------------+
|    Hey Daddy (Daddy's Home)   |           Usher            |
|        Southside Remix        | Lloyd / Ashanti / Scarface |
| So Confused (feat. Butta C... |        Pretty Ricky        |
|    Hey Daddy (Daddy's Home)   |           Usher            |
| Up And Down (explicit albu... |        Pretty Ricky        |
|           Gunn Clapp          |           O.G.C.           |
| Love Like Honey (amended a... |        Pretty Ricky        |
+-------------------------------+----------------------------+
[7 rows x 2 columns]

We recommend these other songs:
+-------------------------------+-------------------------------+
|             title             |          artist_name          |
+-------------------------------+-------------------------------+
|          Fallin' Out          |          Keyshia Cole         |
| We're Not Making Love No More |            Dru Hill           |
|       There Goes My Baby      |             Usher             |
|           StreetLove          |             Lloyd             |
|              Love             |             Musiq             |
|          Put It Down          |           The-Dream           |
| Up And Down (explicit albu... |          Pretty Ricky         |
|           Womanopoly          |             Musiq             |
|            Soulstar           | Musiq / DJ Aktive / Carol ... |
| Grind With Me (Explicit Ve... |          Pretty Ricky         |
+-------------------------------+-------------------------------+
[10 rows x 2 columns]

(Looking for more details about the modules and functions? Check out the API docs.)
