Building a Recommender


In [70]:
# A dictionary of movie critics and their ratings of a small
# set of movies
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
      'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
      'The Night Listener': 3.0},
     'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
      'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
      'You, Me and Dupree': 3.5},
     'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
      'Superman Returns': 3.5, 'The Night Listener': 4.0},
     'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
      'The Night Listener': 4.5, 'Superman Returns': 4.0,
      'You, Me and Dupree': 2.5},
     'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
      'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
      'You, Me and Dupree': 2.0},
     'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
      'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
     'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}

In [39]:
critics['Lisa Rose']['Lady in the Water']


Out[39]:
2.5

In [40]:
critics['Toby']['Snakes on a Plane']=4.5

In [41]:
critics['Toby']


Out[41]:
{'Snakes on a Plane': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 1.0}

Finding Similar Users


In [48]:
import numpy as np

np.sqrt(np.power(5-4, 2) + np.power(4-1, 2))


Out[48]:
3.1622776601683795

This formula calculates the Euclidean distance, which will be smaller for people who are more similar. However, you need a function that gives higher values for people who are similar. This can be done by adding 1 to the distance (so you don't get a division-by-zero error) and inverting it:


In [49]:
1.0 /(1 + np.sqrt(np.power(5-4, 2) + np.power(4-1, 2)) )


Out[49]:
0.2402530733520421

In [53]:
# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
    # Get the list of shared_items
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item]=1
    # If they have no ratings in common, return 0
    if len(si)==0: return 0
    # Add up the squares of all the differences.
    # Note: unlike the example above, the square root is omitted here;
    # sqrt is monotonic, so the ranking of scores is unchanged.
    sum_of_squares=np.sum([np.power(prefs[person1][item]-prefs[person2][item],2)
                      for item in prefs[person1] if item in prefs[person2]])
    return 1/(1+sum_of_squares)

In [54]:
sim_distance(critics, 'Lisa Rose','Gene Seymour')


Out[54]:
0.14814814814814814

In [59]:
# Returns the Pearson correlation coefficient for p1 and p2
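# The score is computed in the single-pass (computational) form:
#   r = (sum(xy) - sum(x)*sum(y)/n)
#       / sqrt((sum(x^2) - sum(x)^2/n) * (sum(y^2) - sum(y)^2/n))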
def sim_pearson(prefs,p1,p2):
    # Get the list of mutually rated items
    si={}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item]=1
    # Find the number of elements
    n=len(si)
    # If they have no ratings in common, return 0
    if n==0: return 0
    # Add up all the preferences
    sum1=np.sum([prefs[p1][it] for it in si])
    sum2=np.sum([prefs[p2][it] for it in si])
    # Sum up the squares
    sum1Sq=np.sum([np.power(prefs[p1][it],2) for it in si])
    sum2Sq=np.sum([np.power(prefs[p2][it],2) for it in si])
    # Sum up the products
    pSum=np.sum([prefs[p1][it]*prefs[p2][it] for it in si])
    # Calculate Pearson score
    num=pSum-(sum1*sum2/n)
    den=np.sqrt((sum1Sq-np.power(sum1,2)/n)*(sum2Sq-np.power(sum2,2)/n))
    if den==0: return 0
    return num/den

In [60]:
sim_pearson(critics, 'Lisa Rose','Gene Seymour')


Out[60]:
0.39605901719066977

In [75]:
# Returns the best matches for person from the prefs dictionary.
# Number of results and similarity function are optional params.
def topMatches(prefs,person,n=5,similarity=sim_pearson):
    scores=[(similarity(prefs,person,other),other)
        for other in prefs if other!=person]
    # Sort the list so the highest scores appear at the top
    scores.sort()
    scores.reverse()
    return scores[0:n]

In [77]:
topMatches(critics,'Toby',n=3)


Out[77]:
[(0.99124070716192991, 'Lisa Rose'),
 (0.92447345164190486, 'Mick LaSalle'),
 (0.89340514744156474, 'Claudia Puig')]

Recommending Items


In [66]:
# Gets recommendations for a person by using a weighted average
# of every other user's rankings
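#   score(item) = sum(similarity*rating) / sum(similarity), over all other users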
def getRecommendations(prefs,person,similarity=sim_pearson):
    totals={}
    simSums={}
    for other in prefs:
        # don't compare me to myself
        if other==person: continue
        sim=similarity(prefs,person,other)

        # ignore scores of zero or lower
        if sim<=0: continue
        for item in prefs[other]:   
            # only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item]==0:
                # Similarity * Score
                totals.setdefault(item,0)
                totals[item]+=prefs[other][item]*sim
                # Sum of similarities
                simSums.setdefault(item,0)
                simSums[item]+=sim

    # Create the normalized list
    rankings=[(total/simSums[item],item) for item,total in totals.items()]

    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings

In [78]:
#Now you can find out what movies I should watch next:
getRecommendations(critics,'Toby')


Out[78]:
[(3.3477895267131013, 'The Night Listener'),
 (2.8325499182641614, 'Lady in the Water'),
 (2.5309807037655645, 'Just My Luck')]

In [79]:
# You’ll find that the results are only affected very slightly by the choice of similarity metric.
getRecommendations(critics,'Toby',similarity=sim_distance)


Out[79]:
[(3.5002478401415877, 'The Night Listener'),
 (2.7561242939959363, 'Lady in the Water'),
 (2.4619884860743739, 'Just My Luck')]

Matching Products

Now you know how to find similar people and recommend products for a given person, but what if you want to see which products are similar to each other? This is actually the same method we used earlier to determine similarity between people—

Flip the keys and values of the user-item dictionary


In [85]:
# you just need to swap the people and the items. 
def transformPrefs(prefs):
    result={}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item,{})
            # Flip item and person
            result[item][person]=prefs[person][item]
    return result

movies = transformPrefs(critics)
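
# The flipped dictionary now maps each movie to its critics' ratings, e.g.
# (values taken from the critics dictionary above):
# movies['Lady in the Water']
# => {'Lisa Rose': 2.5, 'Gene Seymour': 3.0, 'Michael Phillips': 2.5,
#     'Mick LaSalle': 3.0, 'Jack Matthews': 3.0}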

Computing item similarity


In [83]:
topMatches(movies,'Superman Returns')


Out[83]:
[(0.65795169495976946, 'You, Me and Dupree'),
 (0.48795003647426888, 'Lady in the Water'),
 (0.11180339887498941, 'Snakes on a Plane'),
 (-0.17984719479905439, 'The Night Listener'),
 (-0.42289003161103106, 'Just My Luck')]

Recommending users for an item


In [81]:
getRecommendations(movies,'Just My Luck')


Out[81]:
[(4.0, 'Michael Phillips'), (3.0, 'Jack Matthews')]

In [84]:
getRecommendations(movies, 'You, Me and Dupree')


Out[84]:
[(3.1637361366111816, 'Michael Phillips')]

Item-Based Filtering


In [86]:
def calculateSimilarItems(prefs,n=10):
    # Create a dictionary of items showing which other items they
    # are most similar to.
    result={}
    # Invert the preference matrix to be item-centric
    itemPrefs=transformPrefs(prefs)
    c=0
    for item in itemPrefs:
        # Status updates for large datasets
        c+=1
        if c%100==0: print "%d / %d" % (c,len(itemPrefs))
        # Find the most similar items to this one
        scores=topMatches(itemPrefs,item,n=n,similarity=sim_distance)
        result[item]=scores
    return result

itemsim=calculateSimilarItems(critics) 
itemsim


Out[86]:
{'Just My Luck': [(0.22222222222222221, 'Lady in the Water'),
  (0.18181818181818182, 'You, Me and Dupree'),
  (0.15384615384615385, 'The Night Listener'),
  (0.10526315789473684, 'Snakes on a Plane'),
  (0.064516129032258063, 'Superman Returns')],
 'Lady in the Water': [(0.40000000000000002, 'You, Me and Dupree'),
  (0.2857142857142857, 'The Night Listener'),
  (0.22222222222222221, 'Snakes on a Plane'),
  (0.22222222222222221, 'Just My Luck'),
  (0.090909090909090912, 'Superman Returns')],
 'Snakes on a Plane': [(0.22222222222222221, 'Lady in the Water'),
  (0.18181818181818182, 'The Night Listener'),
  (0.16666666666666666, 'Superman Returns'),
  (0.10526315789473684, 'Just My Luck'),
  (0.05128205128205128, 'You, Me and Dupree')],
 'Superman Returns': [(0.16666666666666666, 'Snakes on a Plane'),
  (0.10256410256410256, 'The Night Listener'),
  (0.090909090909090912, 'Lady in the Water'),
  (0.064516129032258063, 'Just My Luck'),
  (0.053333333333333337, 'You, Me and Dupree')],
 'The Night Listener': [(0.2857142857142857, 'Lady in the Water'),
  (0.18181818181818182, 'Snakes on a Plane'),
  (0.15384615384615385, 'Just My Luck'),
  (0.14814814814814814, 'You, Me and Dupree'),
  (0.10256410256410256, 'Superman Returns')],
 'You, Me and Dupree': [(0.40000000000000002, 'Lady in the Water'),
  (0.18181818181818182, 'Just My Luck'),
  (0.14814814814814814, 'The Night Listener'),
  (0.053333333333333337, 'Superman Returns'),
  (0.05128205128205128, 'Snakes on a Plane')]}

In [88]:
def getRecommendedItems(prefs,itemMatch,user):
    userRatings=prefs[user]
    scores={}
    totalSim={}
    # Loop over items rated by this user
    for (item,rating) in userRatings.items():
        # Loop over items similar to this one
        for (similarity,item2) in itemMatch[item]:
            # Ignore if this user has already rated this item
            if item2 in userRatings: continue
            # Weighted sum of rating times similarity
            scores.setdefault(item2,0)
            scores[item2]+=similarity*rating
            # Sum of all the similarities
            totalSim.setdefault(item2,0)
            totalSim[item2]+=similarity
    # Divide each total score by total weighting to get an average
    rankings=[(score/totalSim[item],item) for item,score in scores.items()]
    # Return the rankings from highest to lowest
    rankings.sort()
    rankings.reverse()
    return rankings

getRecommendedItems(critics,itemsim,'Toby')


Out[88]:
[(3.182634730538922, 'The Night Listener'),
 (2.5983318700614575, 'Just My Luck'),
 (2.4730878186968837, 'Lady in the Water')]

MovieLens Recommender

Each line of the MovieLens ratings.dat file has the format UserID::MovieID::Rating::Timestamp, for example:

1::1193::5::978300760

1::661::3::978302109

1::914::3::978301968


In [91]:
def loadMovieLens(path='/Users/chengjun/bigdata/ml-1m/'):
    # Get movie titles
    movies={}
    for line in open(path+'movies.dat'):
        (id,title)=line.split('::')[0:2]
        movies[id]=title

    # Load data
    prefs={}
    for line in open(path+'ratings.dat'):
        (user,movieid,rating,ts)=line.split('::')
        prefs.setdefault(user,{})
        prefs[user][movies[movieid]]=float(rating)
    return prefs

In [92]:
prefs=loadMovieLens()
prefs['87']


Out[92]:
{'Alice in Wonderland (1951)': 1.0,
 'Army of Darkness (1993)': 3.0,
 'Bad Boys (1995)': 5.0,
 'Benji (1974)': 1.0,
 'Brady Bunch Movie, The (1995)': 1.0,
 'Braveheart (1995)': 5.0,
 'Buffalo 66 (1998)': 1.0,
 'Chambermaid on the Titanic, The (1998)': 1.0,
 'Cowboy Way, The (1994)': 1.0,
 'Cyrano de Bergerac (1990)': 4.0,
 'Dear Diary (Caro Diario) (1994)': 1.0,
 'Die Hard (1988)': 3.0,
 'Diebinnen (1995)': 1.0,
 'Dr. No (1962)': 1.0,
 'Escape from the Planet of the Apes (1971)': 1.0,
 'Fast, Cheap & Out of Control (1997)': 1.0,
 'Faster Pussycat! Kill! Kill! (1965)': 1.0,
 'From Russia with Love (1963)': 1.0,
 'Fugitive, The (1993)': 5.0,
 'Get Shorty (1995)': 1.0,
 'Gladiator (2000)': 5.0,
 'Goldfinger (1964)': 5.0,
 'Good, The Bad and The Ugly, The (1966)': 4.0,
 'Hunt for Red October, The (1990)': 5.0,
 'Hurricane, The (1999)': 5.0,
 'Indiana Jones and the Last Crusade (1989)': 4.0,
 'Jaws (1975)': 5.0,
 'Jurassic Park (1993)': 5.0,
 'King Kong (1933)': 1.0,
 'King of New York (1990)': 1.0,
 'Last of the Mohicans, The (1992)': 1.0,
 'Lethal Weapon (1987)': 5.0,
 'Longest Day, The (1962)': 1.0,
 'Man with the Golden Gun, The (1974)': 5.0,
 'Mask of Zorro, The (1998)': 5.0,
 'Matrix, The (1999)': 5.0,
 "On Her Majesty's Secret Service (1969)": 1.0,
 'Out of Sight (1998)': 1.0,
 'Palookaville (1996)': 1.0,
 'Planet of the Apes (1968)': 1.0,
 'Pope of Greenwich Village, The (1984)': 1.0,
 'Princess Bride, The (1987)': 3.0,
 'Raiders of the Lost Ark (1981)': 4.0,
 'Rock, The (1996)': 5.0,
 'Rocky (1976)': 5.0,
 'Saving Private Ryan (1998)': 4.0,
 'Shanghai Noon (2000)': 1.0,
 'Speed (1994)': 1.0,
 'Star Wars: Episode IV - A New Hope (1977)': 5.0,
 'Star Wars: Episode V - The Empire Strikes Back (1980)': 5.0,
 'Taking of Pelham One Two Three, The (1974)': 1.0,
 'Terminator 2: Judgment Day (1991)': 5.0,
 'Terminator, The (1984)': 4.0,
 'Thelma & Louise (1991)': 1.0,
 'True Romance (1993)': 1.0,
 'U-571 (2000)': 5.0,
 'Untouchables, The (1987)': 5.0,
 'Westworld (1973)': 1.0,
 'X-Men (2000)': 4.0}

User-Based Filtering


In [93]:
getRecommendations(prefs,'87')[0:30]


Out[93]:
[(5.0, 'Time of the Gypsies (Dom za vesanje) (1989)'),
 (5.0, 'Tigrero: A Film That Was Never Made (1994)'),
 (5.0, 'Schlafes Bruder (Brother of Sleep) (1995)'),
 (5.0, 'Return with Honor (1998)'),
 (5.0, 'Lured (1947)'),
 (5.0, 'Identification of a Woman (Identificazione di una donna) (1982)'),
 (5.0, 'I Am Cuba (Soy Cuba/Ya Kuba) (1964)'),
 (5.0, 'Hour of the Pig, The (1993)'),
 (5.0, 'Gay Deceivers, The (1969)'),
 (5.0, 'Gate of Heavenly Peace, The (1995)'),
 (5.0, 'Foreign Student (1994)'),
 (5.0, 'Dingo (1992)'),
 (5.0, 'Dangerous Game (1993)'),
 (5.0, 'Callej\xf3n de los milagros, El (1995)'),
 (5.0, 'Bittersweet Motel (2000)'),
 (4.8204601017229889, 'Apple, The (Sib) (1998)'),
 (4.7389561849363862, 'Lamerica (1994)'),
 (4.6818165414673958, 'Bells, The (1926)'),
 (4.6649580725222339, 'Hurricane Streets (1998)'),
 (4.6507418408045593, 'Sanjuro (1962)'),
 (4.6499741726003458, 'On the Ropes (1999)'),
 (4.6368254087395071, 'Shawshank Redemption, The (1994)'),
 (4.627888709544556, 'For All Mankind (1989)'),
 (4.5820483492805089, 'Midaq Alley (Callej\xf3n de los milagros, El) (1995)'),
 (4.5797786468711532, "Schindler's List (1993)"),
 (4.5751999410373871,
  'Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)'),
 (4.5749049884034561, 'Godfather, The (1972)'),
 (4.5746840191882345, "Ed's Next Move (1996)"),
 (4.5585190371478284, 'Hanging Garden, The (1997)'),
 (4.5277600427755909, 'Close Shave, A (1995)')]

Item-Based Filtering


In [94]:
itemsim=calculateSimilarItems(prefs,n=50)


100 / 3706
200 / 3706
300 / 3706
...
3600 / 3706
3700 / 3706

In [95]:
getRecommendedItems(prefs,itemsim,'87')[0:30]


Out[95]:
[(5.0, 'Uninvited Guest, An (2000)'),
 (5.0, 'Two Much (1996)'),
 (5.0, 'Two Family House (2000)'),
 (5.0, 'Trial by Jury (1994)'),
 (5.0, 'Tom & Viv (1994)'),
 (5.0, 'This Is My Father (1998)'),
 (5.0, 'Something to Sing About (1937)'),
 (5.0, 'Slappy and the Stinkers (1998)'),
 (5.0, 'Running Free (2000)'),
 (5.0, 'Roula (1995)'),
 (5.0, 'Prom Night IV: Deliver Us From Evil (1992)'),
 (5.0, 'Project Moon Base (1953)'),
 (5.0, 'Price Above Rubies, A (1998)'),
 (5.0, 'Open Season (1996)'),
 (5.0, 'Only Angels Have Wings (1939)'),
 (5.0, 'Onegin (1999)'),
 (5.0, 'Once Upon a Time... When We Were Colored (1995)'),
 (5.0, 'Office Killer (1997)'),
 (5.0, 'N\xe9nette et Boni (1996)'),
 (5.0, 'No Looking Back (1998)'),
 (5.0, 'Never Met Picasso (1996)'),
 (5.0, 'Music From Another Room (1998)'),
 (5.0, "Mummy's Tomb, The (1942)"),
 (5.0, 'Modern Affair, A (1995)'),
 (5.0, 'Machine, The (1994)'),
 (5.0, 'Lured (1947)'),
 (5.0, 'Low Life, The (1994)'),
 (5.0, 'Lodger, The (1926)'),
 (5.0, 'Loaded (1994)'),
 (5.0, 'Line King: Al Hirschfeld, The (1996)')]

Building a Recommendation System with GraphLab

In this notebook we will import GraphLab Create and use it to

  • train several models that can be used for recommending new items (here, courses) to users
  • compare the performance of these models

Note: This notebook uses GraphLab Create 1.0.


In [2]:
# set product key using GraphLab Create API
#import graphlab
#graphlab.product_key.set_product_key('4972-65DF-8E02-816C-AB15-021C-EC1B-0367')

In [3]:
import graphlab as gl
# set canvas to show sframes and sgraphs in ipython notebook
gl.canvas.set_target('ipynb')
import matplotlib.pyplot as plt
%matplotlib inline

In [28]:
sf = gl.SFrame({'user_id': ["0", "0", "0", "1", "1", "2", "2", "2"],
                'item_id': ["a", "b", "c", "a", "b", "b", "c", "d"],
                'rating': [1, 3, 2, 5, 4, 1, 4, 3]})
sf


Out[28]:
+---------+--------+---------+
| item_id | rating | user_id |
+---------+--------+---------+
|    a    |   1    |    0    |
|    b    |   3    |    0    |
|    c    |   2    |    0    |
|    a    |   5    |    1    |
|    b    |   4    |    1    |
|    b    |   1    |    2    |
|    c    |   4    |    2    |
|    d    |   3    |    2    |
+---------+--------+---------+
[8 rows x 3 columns]

In [29]:
m = gl.recommender.create(sf, target='rating')
recs = m.recommend()
print recs


PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 8 observations with 3 users and 4 items.
PROGRESS:     Data prepared in: 0.00353s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 32       |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-09    |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
PROGRESS: | ranking_regularization         | Rank-based Regularization Weight                 | 0.25     |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 25       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 8 / 8 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 25                | Not Viable                               |
PROGRESS: | 1       | 6.25              | Not Viable                               |
PROGRESS: | 2       | 1.5625            | Not Viable                               |
PROGRESS: | 3       | 0.390625          | 2.87179                                  |
PROGRESS: | 4       | 0.195312          | 2.74949                                  |
PROGRESS: | 5       | 0.0976562         | 2.82075                                  |
PROGRESS: | 6       | 0.0488281         | 3.00581                                  |
PROGRESS: | 7       | 0.0244141         | 3.28448                                  |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.195312          | 2.74949                                  |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Initial | 378us        | 3.89999           | 1.3637                |             |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | 1       | 1.316ms      | 4.29936           | 1.70755               | 0.195312    |
PROGRESS: | 2       | 2.258ms      | 3.07576           | 1.35708               | 0.116134    |
PROGRESS: | 3       | 3.1ms        | 2.68032           | 1.2128                | 0.0856819   |
PROGRESS: | 4       | 3.765ms      | 2.48103           | 1.17181               | 0.0580668   |
PROGRESS: | 5       | 4.4ms        | 2.41861           | 1.14629               | 0.0491185   |
PROGRESS: | 6       | 5.106ms      | 2.38115           | 1.14416               | 0.042841    |
PROGRESS: | 11      | 7.858ms      | 2.27575           | 1.07819               | 0.0271912   |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: Optimization Complete: Maximum number of passes through the data reached.
PROGRESS: Computing final objective value and training RMSE.
PROGRESS:        Final objective value: 2.79518
PROGRESS:        Final training RMSE: 1.04252
+---------+---------+---------------+------+
| user_id | item_id |     score     | rank |
+---------+---------+---------------+------+
|    0    |    d    | 1.26970750093 |  1   |
|    1    |    c    | 3.99517074227 |  1   |
|    1    |    d    | 3.12188112736 |  2   |
|    2    |    a    | 2.48946130276 |  1   |
+---------+---------+---------------+------+
[4 rows x 4 columns]


In [26]:
m['coefficients']


Out[26]:
{'intercept': 2.875, 'item_id': Columns:
 	item_id	str
 	linear_terms	float
 	factors	array
 
 Rows: 4
 
 Data:
 +---------+------------------+-------------------------------+
 | item_id |   linear_terms   |            factors            |
 +---------+------------------+-------------------------------+
 |    a    | -0.118845671415  | [0.00146939896513, -0.0018... |
 |    b    | -0.0642884969711 | [0.000750669511035, -0.003... |
 |    c    |  0.296330481768  | [-0.00124295474961, 0.0029... |
 |    d    | -0.571658432484  | [-0.000990785774775, 0.002... |
 +---------+------------------+-------------------------------+
 [4 rows x 3 columns], 'user_id': Columns:
 	user_id	str
 	linear_terms	float
 	factors	array
 
 Rows: 3
 
 Data:
 +---------+-----------------+-------------------------------+
 | user_id |   linear_terms  |            factors            |
 +---------+-----------------+-------------------------------+
 |    0    |  -1.00742137432 | [-0.000118086099974, -0.00... |
 |    1    |  0.854533851147 | [0.00172331754584, -0.0029... |
 |    2    | -0.305394947529 | [-0.00148682587314, 0.0042... |
 +---------+-----------------+-------------------------------+
 [3 rows x 3 columns]}

The original GraphLab tutorial downloads its data directly from S3: a preprocessed version of the Million Song Dataset that was used for a Kaggle challenge, with data from The Echo Nest, SecondHandSongs, musiXmatch, and Last.fm covering a subset of 10,000 songs. Here we load a local copy of the CourseTalk dataset instead.

The CourseTalk Dataset: Loading and First Look

Load the CourseTalk ratings and take a first look at the data.


In [7]:
#train_file = 'http://s3.amazonaws.com/dato-datasets/millionsong/10000.txt'
train_file = '/Users/chengjun/GitHub/cjc2016/data/ratings.dat'
sf = gl.SFrame.read_csv(train_file, header=False, delimiter='|', verbose=False)
sf.rename({'X1':'user_id', 'X2':'course_id', 'X3':'rating'}).show()


------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/chengjun/GitHub/cjc2016/data/ratings.dat
PROGRESS: Parsing completed. Parsed 2773 lines in 0.0105 secs.

In order to evaluate the performance of our model, we randomly split the observations in our data set into two partitions: we will use train_set when creating our model and test_set for evaluating its performance.


In [8]:
(train_set, test_set) = sf.random_split(0.8, seed=1)

Popularity Model

Create a model that makes recommendations using item popularity. When no target column is provided, the popularity is determined by the number of observations involving each item. When a target is provided, popularity is computed using the item’s mean target value. When the target column contains ratings, for example, the model computes the mean rating for each item and uses this to rank items for recommendations.

One typically starts by creating a simple recommender that can serve as a baseline and verify that the rest of the pipeline works as expected. The recommender package has several models available for this purpose. For example, we can create a model that recommends items based on their overall popularity across all users.
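
To see what this baseline does, here is a minimal pure-Python sketch of the rating-based variant (illustrative only, not GraphLab code): rank the items a user hasn't rated by their mean rating.

# A minimal sketch of a rating-based popularity baseline (not GraphLab code).
def popularity_recommend(ratings, user, n=5):
    # ratings: list of (user_id, item_id, rating) tuples
    totals, counts = {}, {}
    for u, item, r in ratings:
        totals[item] = totals.get(item, 0.0) + r
        counts[item] = counts.get(item, 0) + 1
    seen = set(item for u, item, r in ratings if u == user)
    # Mean rating per item, restricted to items the user hasn't rated yet
    means = [(totals[i] / counts[i], i) for i in totals if i not in seen]
    means.sort(reverse=True)
    return means[:n]

ratings = [('0', 'a', 1), ('0', 'b', 3), ('1', 'a', 5), ('1', 'b', 4), ('2', 'c', 4)]
print(popularity_recommend(ratings, '2'))  # [(3.5, 'b'), (3.0, 'a')]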


In [10]:
popularity_model = gl.popularity_recommender.create(train_set, 'user_id', 'course_id', target = 'rating')


PROGRESS: Recsys training: model = popularity
PROGRESS: Preparing data set.
PROGRESS:     Data has 2202 observations with 1651 users and 201 items.
PROGRESS:     Data prepared in: 0.007957s
PROGRESS: 2202 observations to process; with 201 unique items.

Item Similarity Model

  • Collaborative filtering methods make predictions for a given user based on the patterns of other users' activities. One common technique is to compare items by their Jaccard similarity. This measure is a ratio: the number of users two items have in common, divided by the total number of distinct users across both (see the sketch after this list).
  • We could also use a slightly more complicated similarity measure called cosine similarity.
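
As a quick illustration of these two measures (a minimal sketch, not GraphLab's implementation; the toy sets and vectors are invented):

import numpy as np

# Jaccard similarity between two items, each represented by the set of
# users who interacted with it.
def jaccard(users_a, users_b):
    union = len(users_a | users_b)
    return float(len(users_a & users_b)) / union if union else 0.0

# Cosine similarity between two items' rating vectors over the same users.
def cosine(ratings_a, ratings_b):
    a, b = np.array(ratings_a, float), np.array(ratings_b, float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b)) / denom if denom else 0.0

print(jaccard(set(['0', '1']), set(['0', '1', '2'])))  # 2 shared / 3 distinct = 0.667
print(cosine([1, 3, 0], [5, 4, 1]))                    # ~0.83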

If your data is implicit, i.e., you only observe interactions between users and items, without a rating, then use ItemSimilarityModel with Jaccard similarity.

If your data is explicit, i.e., the observations include an actual rating given by the user, then you have a wide array of options. ItemSimilarityModel with cosine or Pearson similarity can incorporate ratings. In addition, MatrixFactorizationModel, FactorizationModel, as well as LinearRegressionModel all support rating prediction.

Now data contains three columns: 'user_id', 'item_id', and 'rating'.

itemsim_cosine_model = graphlab.recommender.create(data, target='rating', method='item_similarity', similarity_type='cosine')

factorization_machine_model = graphlab.recommender.create(data, target='rating', method='factorization_model')

In the following code block, we compute all the item-item similarities and create an object that can be used for recommendations.


In [30]:
item_sim_model = gl.item_similarity_recommender.create(train_set, 'user_id', 'course_id', target = 'rating', 
                                                       similarity_type='cosine')


PROGRESS: Recsys training: model = item_similarity
PROGRESS: Preparing data set.
PROGRESS:     Data has 2202 observations with 1651 users and 201 items.
PROGRESS:     Data prepared in: 0.008781s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 201 items:
PROGRESS: Finished training in 0.003179s
PROGRESS: Finished prediction in 0.003227s

Factorization Recommender Model

Create a FactorizationRecommender that learns latent factors for each user and item and uses them to make rating predictions. This includes both standard matrix factorization and factorization machines (for the situation where side data is available for users and/or items).
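
As a rough sketch of how such a model predicts a rating (illustrative numpy code with invented toy values, not GraphLab's implementation), the estimate combines the intercept, the per-user and per-item linear terms, and the dot product of the latent factor vectors, mirroring the 'coefficients' output shown earlier:

import numpy as np

# A minimal sketch of a factorization-style rating prediction (not GraphLab code).
def predict_rating(mu, b_u, b_i, p_u, q_i):
    # mu: global intercept; b_u, b_i: linear terms; p_u, q_i: latent factors
    return mu + b_u + b_i + np.dot(p_u, q_i)

mu = 2.875                          # intercept, as in m['coefficients'] above
b_u, b_i = 0.85, 0.30               # hypothetical linear terms
p_u = np.array([0.1, -0.2, 0.05])   # hypothetical user factors
q_i = np.array([0.3, 0.1, -0.4])    # hypothetical item factors
print(predict_rating(mu, b_u, b_i, p_u, q_i))  # 4.015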


In [35]:
factorization_machine_model = gl.recommender.factorization_recommender.create(train_set, 'user_id', 'course_id',
                                                                              target='rating')


PROGRESS: Recsys training: model = factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 2202 observations with 1651 users and 201 items.
PROGRESS:     Data prepared in: 0.007704s
PROGRESS: Training factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 8        |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-08    |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-10    |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 50       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 2202 / 2202 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 25                | Not Viable                               |
PROGRESS: | 1       | 6.25              | Not Viable                               |
PROGRESS: | 2       | 1.5625            | Not Viable                               |
PROGRESS: | 3       | 0.390625          | 0.133755                                 |
PROGRESS: | 4       | 0.195312          | 0.171583                                 |
PROGRESS: | 5       | 0.0976562         | 0.236008                                 |
PROGRESS: | 6       | 0.0488281         | 0.338778                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.390625          | 0.133755                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Initial | 324us        | 0.891401          | 0.94414               |             |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | 1       | 38.389ms     | 0.878127          | 0.937082              | 0.390625    |
PROGRESS: | 2       | 69.95ms      | 0.502405          | 0.708804              | 0.232267    |
PROGRESS: | 3       | 101.597ms    | 0.302088          | 0.549625              | 0.171364    |
PROGRESS: | 4       | 133.423ms    | 0.219703          | 0.468724              | 0.138107    |
PROGRESS: | 5       | 164.767ms    | 0.162832          | 0.403524              | 0.116824    |
PROGRESS: | 6       | 196.399ms    | 0.119286          | 0.345378              | 0.101894    |
PROGRESS: | 11      | 355.55ms     | 0.0272915         | 0.165197              | 0.0646719   |
PROGRESS: | 50      | 1.59s        | 0.000759299       | 0.0275144             | 0.0207746   |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: Optimization Complete: Maximum number of passes through the data reached.
PROGRESS: Computing final objective value and training RMSE.
PROGRESS:        Final objective value: 0.000664005
PROGRESS:        Final training RMSE: 0.0257244

Model Evaluation

It's straightforward to use GraphLab to compare models on a small subset of users in test_set. The precision-recall plot that is computed shows the benefit of the similarity-based model over the baseline popularity_model: better curves tend toward the upper-right corner of the plot.

The following command evaluates each model on a random sample of the users in test_set (user_sample=.1); the observations in train_set are not included in the predicted items.
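
For reference, precision@k is the fraction of the top-k recommendations that are relevant, and recall@k is the fraction of all relevant items that appear in the top k. A minimal single-user sketch (illustrative, not GraphLab's implementation):

# A minimal sketch of precision@k and recall@k for one user (not GraphLab code).
def precision_recall_at_k(recommended, relevant, k):
    top_k = recommended[:k]
    hits = len([item for item in top_k if item in relevant])
    precision = float(hits) / k
    recall = float(hits) / len(relevant) if relevant else 0.0
    return precision, recall

recommended = ['a', 'b', 'c', 'd']  # ranked recommendations for one user
relevant = set(['b', 'd'])          # items the user actually interacted with
print(precision_recall_at_k(recommended, relevant, 2))  # (0.5, 0.5)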


In [36]:
result = gl.recommender.util.compare_models(test_set, [popularity_model, item_sim_model, factorization_machine_model],
                                            user_sample=.1, skip_set=train_set)


compare_models: using 49 users to estimate model performance
PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+------------------+-----------------+
| cutoff |  mean_precision  |   mean_recall   |
+--------+------------------+-----------------+
|   2    |       0.0        |       0.0       |
|   4    |       0.0        |       0.0       |
|   6    |       0.0        |       0.0       |
|   8    |       0.0        |       0.0       |
|   10   |       0.0        |       0.0       |
|   12   | 0.00170068027211 | 0.0204081632653 |
|   14   | 0.00145772594752 | 0.0204081632653 |
|   16   | 0.00127551020408 | 0.0204081632653 |
|   18   | 0.00113378684807 | 0.0204081632653 |
|   20   | 0.00102040816327 | 0.0204081632653 |
+--------+------------------+-----------------+
[10 rows x 3 columns]


Overall RMSE:  1.07244677675

Per User RMSE (best)
+---------+-------+-----------------+
| user_id | count |       rmse      |
+---------+-------+-----------------+
|   1642  |   1   | 0.0263157894737 |
+---------+-------+-----------------+
[1 rows x 3 columns]


Per User RMSE (worst)
+---------+-------+---------------+
| user_id | count |      rmse     |
+---------+-------+---------------+
|   1615  |   1   | 4.16666666667 |
+---------+-------+---------------+
[1 rows x 3 columns]


Per Item RMSE (best)
+-----------+-------+------+
| course_id | count | rmse |
+-----------+-------+------+
|    100    |   1   | 0.0  |
+-----------+-------+------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+-----------+-------+---------------+
| course_id | count |      rmse     |
+-----------+-------+---------------+
|     36    |   1   | 4.16666666667 |
+-----------+-------+---------------+
[1 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+------------------+-----------------+
| cutoff |  mean_precision  |   mean_recall   |
+--------+------------------+-----------------+
|   2    |       0.0        |       0.0       |
|   4    | 0.0102040816327  | 0.0408163265306 |
|   6    | 0.0102040816327  | 0.0612244897959 |
|   8    | 0.00765306122449 | 0.0612244897959 |
|   10   | 0.0102040816327  |  0.102040816327 |
|   12   | 0.00850340136054 |  0.102040816327 |
|   14   | 0.00728862973761 |  0.102040816327 |
|   16   | 0.00637755102041 |  0.102040816327 |
|   18   | 0.00566893424036 |  0.102040816327 |
|   20   | 0.00612244897959 |  0.122448979592 |
+--------+------------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Finished prediction in 0.001036s

Overall RMSE:  1.19396827432

Per User RMSE (best)
+---------+-------+------+
| user_id | count | rmse |
+---------+-------+------+
|   1600  |   1   | 0.0  |
+---------+-------+------+
[1 rows x 3 columns]


Per User RMSE (worst)
+---------+-------+------+
| user_id | count | rmse |
+---------+-------+------+
|   1615  |   1   | 4.5  |
+---------+-------+------+
[1 rows x 3 columns]


Per Item RMSE (best)
+-----------+-------+------+
| course_id | count | rmse |
+-----------+-------+------+
|    113    |   1   | 0.0  |
+-----------+-------+------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+-----------+-------+------+
| course_id | count | rmse |
+-----------+-------+------+
|     36    |   1   | 4.5  |
+-----------+-------+------+
[1 rows x 3 columns]

PROGRESS: Evaluate model M2

Precision and recall summary statistics by cutoff
+--------+------------------+------------------+
| cutoff |  mean_precision  |   mean_recall    |
+--------+------------------+------------------+
|   2    |       0.0        |       0.0        |
|   4    |       0.0        |       0.0        |
|   6    | 0.00340136054422 | 0.00680272108844 |
|   8    | 0.00510204081633 | 0.0170068027211  |
|   10   | 0.00408163265306 | 0.0170068027211  |
|   12   | 0.00340136054422 | 0.0170068027211  |
|   14   | 0.00437317784257 | 0.0374149659864  |
|   16   | 0.00382653061224 | 0.0374149659864  |
|   18   | 0.00340136054422 | 0.0374149659864  |
|   20   | 0.0030612244898  | 0.0374149659864  |
+--------+------------------+------------------+
[10 rows x 3 columns]


Overall RMSE:  1.18039409899

Per User RMSE (best)
+---------+-------+-----------------+
| user_id | count |       rmse      |
+---------+-------+-----------------+
|   1642  |   1   | 0.0397204443792 |
+---------+-------+-----------------+
[1 rows x 3 columns]


Per User RMSE (worst)
+---------+-------+--------------+
| user_id | count |     rmse     |
+---------+-------+--------------+
|   1615  |   1   | 4.4589629598 |
+---------+-------+--------------+
[1 rows x 3 columns]


Per Item RMSE (best)
+-----------+-------+-----------------+
| course_id | count |       rmse      |
+-----------+-------+-----------------+
|    137    |   1   | 0.0120499303006 |
+-----------+-------+-----------------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+-----------+-------+--------------+
| course_id | count |     rmse     |
+-----------+-------+--------------+
|     36    |   1   | 4.4589629598 |
+-----------+-------+--------------+
[1 rows x 3 columns]

Now let's ask the item similarity model for course recommendations for several users. We first create a list of users; below we also build a subset of observations, user_data, that pertains to these users.


In [37]:
K = 10
users = gl.SArray(sf['user_id'].unique().head(100))

Next we use the recommend() function to query the model we created for recommendations. The returned object has four columns: user_id, course_id, the score that the algorithm gave this user for this course, and the course's rank (an integer from 1 to K). To see this we can grab the top few rows of recs:


In [15]:
recs = item_sim_model.recommend(users=users, k=K)
recs.head()


Out[15]:
+---------+-----------+-------+------+
| user_id | course_id | score | rank |
+---------+-----------+-------+------+
|   232   |     15    |  5.0  |  1   |
|   232   |     14    |  5.0  |  2   |
|   232   |     13    |  5.0  |  3   |
|   232   |     12    |  5.0  |  4   |
|   232   |     11    |  5.0  |  5   |
|   232   |     10    |  5.0  |  6   |
|   232   |     9     |  5.0  |  7   |
|   232   |     8     |  5.0  |  8   |
|   232   |     7     |  5.0  |  9   |
|   232   |     4     |  5.0  |  10  |
+---------+-----------+-------+------+
[10 rows x 4 columns]

To learn which courses these ids refer to, we can merge in metadata about each course.


In [22]:
# Get the meta data of the courses
courses = gl.SFrame.read_csv('/Users/chengjun/GitHub/cjc2016/data/cursos.dat', header=False, delimiter='|', verbose=False)
courses.rename({'X1':'course_id', 'X2':'title', 'X3':'avg_rating', 
              'X4':'workload', 'X5':'university', 'X6':'difficulty', 'X7':'provider'}).show()

courses = courses[['course_id', 'title', 'provider']]
results = recs.join(courses, on='course_id', how='inner')

# Populate observed user-course data with course info
userset = frozenset(users)
ix = sf['user_id'].apply(lambda x: x in userset, int)  
user_data = sf[ix]
user_data = user_data.join(courses, on='course_id')[['user_id', 'title', 'provider']]


------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,str,float,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/chengjun/GitHub/cjc2016/data/cursos.dat
PROGRESS: Parsing completed. Parsed 5597 lines in 0.016009 secs.

In [23]:
# Print out some recommendations 
for i in range(5):
    user = list(users)[i]
    print "User: " + str(i + 1)
    user_obs = user_data[user_data['user_id'] == user].head(K)
    del user_obs['user_id']
    user_recs = results[results['user_id'] == str(user)][['title', 'provider']]

    print "We were told that the user liked these courses: "
    print user_obs.head(K)

    print "We recommend these other courses:"
    print user_recs.head(K)

    print ""


User: 1
We were told that the user liked these courses: 
+-------------------------------+----------+
|             title             | provider |
+-------------------------------+----------+
| An Introduction to Interac... | coursera |
+-------------------------------+----------+
[1 rows x 2 columns]

We recommend these other courses:
+-------+----------+
| title | provider |
+-------+----------+
+-------+----------+
[0 rows x 2 columns]


User: 2
We were told that the user liked these courses: 
+-------------------------------+----------+
|             title             | provider |
+-------------------------------+----------+
| An Introduction to Interac... | coursera |
+-------------------------------+----------+
[1 rows x 2 columns]

We recommend these other courses:
+-------+----------+
| title | provider |
+-------+----------+
+-------+----------+
[0 rows x 2 columns]


User: 3
We were told that the user liked these courses: 
+-------------------------------+----------+
|             title             | provider |
+-------------------------------+----------+
| An Introduction to Interac... | coursera |
+-------------------------------+----------+
[1 rows x 2 columns]

We recommend these other courses:
+-------+----------+
| title | provider |
+-------+----------+
+-------+----------+
[0 rows x 2 columns]


User: 4
We were told that the user liked these courses: 
+-------------------------------+----------+
|             title             | provider |
+-------------------------------+----------+
| A Beginner's Guide to ...     | coursera |
|          Gamification         | coursera |
+-------------------------------+----------+
[2 rows x 2 columns]

We recommend these other courses:
+-------+----------+
| title | provider |
+-------+----------+
+-------+----------+
[0 rows x 2 columns]


User: 5
We were told that the user liked these courses: 
+-------------------------------+----------+
|             title             | provider |
+-------------------------------+----------+
| Web Intelligence and Big Data | coursera |
+-------------------------------+----------+
[1 rows x 2 columns]

We recommend these other courses:
+-------+----------+
| title | provider |
+-------+----------+
+-------+----------+
[0 rows x 2 columns]


(Looking for more details about the modules and functions? Check out the API docs.)