In [2]:
import graphlab as gl

In [3]:
song_data = gl.SFrame('song_data.gl/')


[INFO] This non-commercial license of GraphLab Create is assigned to iliassweb@gmail.com and will expire on September 22, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-6590 - Server binary: /home/zax/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1445494515.log
[INFO] GraphLab Server Version: 1.6.1

Explore data


In [5]:
song_data.head()


Out[5]:
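+----------------------------------------------+--------------------+--------------+
| user_id                                      | song_id            | listen_count |
+----------------------------------------------+--------------------+--------------+
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOAKIMP12A8C130995 | 1            |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBBMDR12A8C13253B | 2            |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBXHDL12A81C204C0 | 1            |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBYHAJ12A6701BF1D | 1            |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODACBL12A8C13C273 | 1            |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODDNQT12A6D4F5F7E | 5            |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODXRTY12AB0180F3B | 1            |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOFGUAY12AB017B0A8 | 1            |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOFRQTD12A81C233C0 | 1            |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOHQWYZ12A6D4FA701 | 1            |
+----------------------------------------------+--------------------+--------------+
+-----------------------------------+-------------------------------------------------+
| title                             | artist                                          |
+-----------------------------------+-------------------------------------------------+
| The Cove                          | Jack Johnson                                    |
| Entre Dos Aguas                   | Paco De Lucia                                   |
| Stronger                          | Kanye West                                      |
| Constellations                    | Jack Johnson                                    |
| Learn To Fly                      | Foo Fighters                                    |
| Apuesta Por El Rock 'N' Roll ...  | Héroes del Silencio                             |
| Paper Gangsta                     | Lady GaGa                                       |
| Stacked Actors                    | Foo Fighters                                    |
| Sehr kosmisch                     | Harmonia                                        |
| Heaven's gonna burn your eyes ... | Thievery Corporation feat. Emiliana Torrini ... |
+-----------------------------------+-------------------------------------------------+
+-----------------------------------------------+
| song                                          |
+-----------------------------------------------+
| The Cove - Jack Johnson                       |
| Entre Dos Aguas - Paco De Lucia ...           |
| Stronger - Kanye West                         |
| Constellations - Jack Johnson ...             |
| Learn To Fly - Foo Fighters ...               |
| Apuesta Por El Rock 'N' Roll - Héroes del ... |
| Paper Gangsta - Lady GaGa                     |
| Stacked Actors - Foo Fighters ...             |
| Sehr kosmisch - Harmonia                      |
| Heaven's gonna burn your eyes - Thievery ...  |
+-----------------------------------------------+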
[10 rows x 6 columns]


In [4]:
gl.canvas.set_target('ipynb')

In [6]:
song_data['song'].show()



In [7]:
len(song_data)


Out[7]:
1116609

Count number of users


In [34]:
users = song_data['user_id'].unique()

In [9]:
len(users)


Out[9]:
66346

Build a recommender system

Split the data into train and test data


In [6]:
train_data, test_data = song_data.random_split(.8, seed=0)
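To make the 80/20 split with a fixed seed more concrete, here is a minimal pure-Python sketch. It is not GraphLab's implementation; it assumes each row is assigned to the first part independently with probability 0.8, so the resulting sizes are only approximately 80/20. The function name and the toy rows are made up for illustration.

```python
import random


def random_split(rows, fraction, seed=0):
    """Split rows in two; each row goes to the first part with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))  # stand-in for the rows of song_data
train_rows, test_rows = random_split(rows, 0.8, seed=0)
# Sizes are only approximately 800/200 because each row is drawn independently.
print(len(train_rows), len(test_rows))
```

Calling the function again with the same seed returns the same two lists, which is what makes later comparisons between models reproducible.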

Simple popularity-based recommender


In [11]:
popularity_model = gl.popularity_recommender.create(train_data,
                                                    user_id='user_id',
                                                   item_id='song'
                                                   )


PROGRESS: Recsys training: model = popularity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 1.34965s
PROGRESS: 893580 observations to process; with 9952 unique items.

Use the popularity model to make some predictions


In [14]:
popularity_model.recommend(users=[users[0]])


Out[14]:
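+----------------------------------------------+-----------------------------------------------------+--------+------+
| user_id                                      | song                                                | score  | rank |
+----------------------------------------------+-----------------------------------------------------+--------+------+
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Undo - Björk                                        | 4227.0 | 2    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Revelry - Kings Of Leon                             | 3527.0 | 5    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Secrets - OneRepublic                               | 3148.0 | 7    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 8    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Tive Sim - Cartola                                  | 2521.0 | 9    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Drop The World - Lil Wayne / Eminem ...             | 2053.0 | 10   |
+----------------------------------------------+-----------------------------------------------------+--------+------+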
[10 rows x 4 columns]

User 1 recommendation


In [16]:
popularity_model.recommend(users=[users[1]])


Out[16]:
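+----------------------------------------------+-----------------------------------------------------+--------+------+
| user_id                                      | song                                                | score  | rank |
+----------------------------------------------+-----------------------------------------------------+--------+------+
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Undo - Björk                                        | 4227.0 | 2    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Revelry - Kings Of Leon                             | 3527.0 | 5    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Secrets - OneRepublic                               | 3148.0 | 7    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Hey_ Soul Sister - Train                            | 2538.0 | 8    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 9    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Tive Sim - Cartola                                  | 2521.0 | 10   |
+----------------------------------------------+-----------------------------------------------------+--------+------+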
[10 rows x 4 columns]
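The two popularity lists are almost identical, which is what a non-personalized model would be expected to produce. The sketch below shows the idea in pure Python, under two assumptions that are not confirmed by the output above: the score is the number of observations for a song, and songs the user already has in the training data are left out. The function name and the toy data are hypothetical.

```python
from collections import Counter


def popularity_recommend(observations, user, k=10):
    """Rank items by observation count, skipping items the user already has."""
    counts = Counter(item for _, item in observations)
    seen = set(item for u, item in observations if u == user)
    # Sort by count (descending), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [(item, count) for item, count in ranked if item not in seen][:k]


# Made-up (user, song) pairs: A has 3 listeners, B has 2, C has 1.
toy = [("u1", "A"), ("u2", "A"), ("u3", "A"),
       ("u1", "B"), ("u2", "B"), ("u3", "C")]

print(popularity_recommend(toy, "u9", k=2))  # unknown user: global top 2
print(popularity_recommend(toy, "u3", k=2))  # u3 already has A and C
```

Under these assumptions, two users only get different lists where one of them has already listened to a popular song.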

Recommender with personalization


In [18]:
personalized_model = gl.item_similarity_recommender.create(train_data,
                                                          user_id = 'user_id',
                                                           item_id = 'song'
                                                          )


PROGRESS: Recsys training: model = item_similarity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 1.39029s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 9952 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 1.85622         |
PROGRESS: | 2000            | 1.93614         |
PROGRESS: | 3000            | 2.0068          |
PROGRESS: | 4000            | 2.08326         |
PROGRESS: | 5000            | 2.16111         |
PROGRESS: | 6000            | 2.23223         |
PROGRESS: | 7000            | 2.30307         |
PROGRESS: | 8000            | 2.37842         |
PROGRESS: | 9000            | 2.46742         |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 2.76698s

Apply the personalized model to make recommendations


In [19]:
personalized_model.recommend(users= [users[0]])


Out[19]:
+----------------------------------------------+----------------------------------------------+-----------------+------+
| user_id                                      | song                                         | score           | rank |
+----------------------------------------------+----------------------------------------------+-----------------+------+
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Cuando Pase El Temblor - Soda Stereo ...     | 0.0194504525792 | 1    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Fireflies - Charttraxx Karaoke ...           | 0.0145048191769 | 2    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Love Is A Losing Game - Amy Winehouse ...    | 0.0142992063828 | 3    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Marry Me - Train                             | 0.0141649731998 | 4    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Secrets - OneRepublic                        | 0.0136169436052 | 5    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Sehr kosmisch - Harmonia                     | 0.0134355710515 | 6    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | No Dejes Que... - Caifanes ...               | 0.0134191754754 | 7    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Y solo se me ocurre amarte (Unplugged) - ... | 0.0133210385369 | 8    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Te Hacen Falta Vitaminas - Soda Stereo ...   | 0.0129302853556 | 9    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | OMG - Usher featuring will.i.am ...          | 0.0128012717244 | 10   |
+----------------------------------------------+----------------------------------------------+-----------------+------+
[10 rows x 4 columns]


In [20]:
personalized_model.recommend(users=[users[1]])


Out[20]:
+----------------------------------------------+--------------------------------------------------+-----------------+------+
| user_id                                      | song                                             | score           | rank |
+----------------------------------------------+--------------------------------------------------+-----------------+------+
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Riot In Cell Block Number Nine - Dr Feelgood ... | 0.0375          | 1    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Sei Lá Mangueira - Elizeth Cardoso ...           | 0.0331632653061 | 2    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | The Stallion - Ween                              | 0.0322580645161 | 3    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Rain - Subhumans                                 | 0.0314716312057 | 4    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | West One (Shine On Me) - The Ruts ...            | 0.0307080895662 | 5    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Back Against The Wall - Cage The Elephant ...    | 0.0301204819277 | 6    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Life Less Frightening - Rise Against ...         | 0.0284431137725 | 7    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | A Beggar On A Beach Of Gold - Mike And The ...   | 0.0230024907156 | 8    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Audience Of One - Rise Against ...               | 0.0193938442211 | 9    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Blame It On The Boogie - The Jacksons ...        | 0.0189873417722 | 10   |
+----------------------------------------------+--------------------------------------------------+-----------------+------+
[10 rows x 4 columns]


In [21]:
personalized_model.get_similar_items(['The Stallion - Ween'])


PROGRESS: Getting similar items completed in 0.001846
Out[21]:
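+---------------------+--------------------------------------------------+-----------------+------+
| song                | similar                                          | score           | rank |
+---------------------+--------------------------------------------------+-----------------+------+
| The Stallion - Ween | Blame It On The Boogie - The Jacksons ...        | 0.179104477612  | 1    |
| The Stallion - Ween | Absence of Fear - War Of Ages ...                | 0.129032258065  | 2    |
| The Stallion - Ween | Faint Resemblance - Rise Against ...             | 0.121739130435  | 3    |
| The Stallion - Ween | Entertainment - Rise Against ...                 | 0.118055555556  | 4    |
| The Stallion - Ween | Halfway There - Rise Against ...                 | 0.115384615385  | 5    |
| The Stallion - Ween | To The Core - Rise Against ...                   | 0.115044247788  | 6    |
| The Stallion - Ween | Long Forgotten Sons - Rise Against ...           | 0.112426035503  | 7    |
| The Stallion - Ween | Riot In Cell Block Number Nine - Dr Feelgood ... | 0.111764705882  | 8    |
| The Stallion - Ween | Great Awakening - Rise Against ...               | 0.0887096774194 | 9    |
| The Stallion - Ween | Hairline Fracture - Rise Against ...             | 0.0866141732283 | 10   |
+---------------------+--------------------------------------------------+-----------------+------+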
[10 rows x 4 columns]
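A rough way to read these similarity scores: two songs are similar when many of the same users listened to both. The sketch below uses Jaccard similarity (shared listeners divided by listeners of either song). Whether the model above was trained with Jaccard is an assumption here; it is, as far as I know, the library default when no target column is given. The helper names and the toy listener sets are invented.

```python
def jaccard(users_a, users_b):
    """Shared listeners divided by listeners of either song."""
    a, b = set(users_a), set(users_b)
    union = a | b
    return float(len(a & b)) / len(union) if union else 0.0


def similar_items(listeners, item, k=10):
    """Rank all other items by Jaccard similarity to `item`."""
    scores = [(other, jaccard(listeners[item], users))
              for other, users in listeners.items() if other != item]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Made-up data: song -> set of users who listened to it.
listeners = {
    "X": {"u1", "u2", "u3"},
    "Y": {"u2", "u3", "u4"},
    "Z": {"u3"},
    "W": {"u9"},
}
top = similar_items(listeners, "X", k=2)
print(top)
```

In the toy data, X and Y share two of four distinct listeners, so Y ranks first with a score of 0.5.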


In [22]:
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])


PROGRESS: Getting similar items completed in 0.003067
Out[22]:
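+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| song                                           | similar                                             | score           | rank |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| Chan Chan (Live) - Buena Vista Social Club ... | Murmullo - Buena Vista Social Club ...              | 0.188118811881  | 1    |
| Chan Chan (Live) - Buena Vista Social Club ... | La Bayamesa - Buena Vista Social Club ...           | 0.187192118227  | 2    |
| Chan Chan (Live) - Buena Vista Social Club ... | Amor de Loca Juventud - Buena Vista Social Club ... | 0.184834123223  | 3    |
| Chan Chan (Live) - Buena Vista Social Club ... | Diferente - Gotan Project                           | 0.0214592274678 | 4    |
| Chan Chan (Live) - Buena Vista Social Club ... | Mistica - Orishas                                   | 0.0205761316872 | 5    |
| Chan Chan (Live) - Buena Vista Social Club ... | Hotel California - Gipsy Kings ...                  | 0.019305019305  | 6    |
| Chan Chan (Live) - Buena Vista Social Club ... | Nací Orishas - Orishas                              | 0.0191570881226 | 7    |
| Chan Chan (Live) - Buena Vista Social Club ... | Le Moulin - Yann Tiersen                            | 0.0187969924812 | 8    |
| Chan Chan (Live) - Buena Vista Social Club ... | Gitana - Willie Colon                               | 0.0187969924812 | 9    |
| Chan Chan (Live) - Buena Vista Social Club ... | Criminal - Gotan Project                            | 0.018779342723  | 10   |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+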
[10 rows x 4 columns]

Quantitative comparison between the models

Precision-Recall


In [27]:
import matplotlib
%matplotlib inline
model_performance = gl.recommender.util.compare_models(test_data,
                                                      [popularity_model, personalized_model],
                                                      user_sample=0.05)


compare_models: using 2931 users to estimate model performance
PROGRESS: Evaluate model M0
PROGRESS: recommendations finished on 1000/2931 queries. users per second: 10665.1
PROGRESS: recommendations finished on 2000/2931 queries. users per second: 12174.8

Precision and recall summary statistics by cutoff
[WARNING] Model trained without a target. Skipping RMSE computation.
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    | 0.0296827021494 | 0.00775176338124 |
|   2    | 0.0279767997271 | 0.0148200410227  |
|   3    | 0.0260434436484 | 0.0203007756488  |
|   4    | 0.0242238143978 | 0.0248677324419  |
|   5    | 0.0225861480723 | 0.0295442066936  |
|   6    | 0.0216080973502 | 0.0346141254023  |
|   7    | 0.0204220889994 | 0.0379537678668  |
|   8    | 0.0198311156602 | 0.0422943455953  |
|   9    | 0.0187649266462 | 0.0448458380393  |
|   10   |  0.017911975435 | 0.0478489032481  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1
PROGRESS: recommendations finished on 1000/2931 queries. users per second: 1549.2
PROGRESS: recommendations finished on 2000/2931 queries. users per second: 1538.02

Precision and recall summary statistics by cutoff
[WARNING] Model trained without a target. Skipping RMSE computation.
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.195496417605 | 0.0650005691075 |
|   2    |  0.159501876493 | 0.0996978063636 |
|   3    |  0.141248720573 |  0.126881945626 |
|   4    |  0.12572500853  |  0.146405349361 |
|   5    |  0.114227226203 |  0.162845861349 |
|   6    |  0.104912998976 |  0.177262898309 |
|   7    | 0.0971876980065 |  0.189555715144 |
|   8    | 0.0907540088707 |  0.201852444534 |
|   9    | 0.0846127601501 |  0.21178175335  |
|   10   | 0.0805527123849 |  0.222249525371 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]
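For reference, the cutoff tables above report precision and recall at k for each model. A minimal sketch of the per-user computation follows, assuming the usual definitions: precision is the share of the top k recommendations that appear in the user's held-out songs, and recall is the share of held-out songs that appear in the top k. The function name and the toy lists are made up; the reported means would then be averages of these values over the sampled users.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = float(hits) / k
    recall = float(hits) / len(relevant) if relevant else 0.0
    return precision, recall


recommended = ["A", "B", "C", "D"]  # ranked recommendations for one user
relevant = {"B", "D", "E"}          # songs the user has in the test set

for k in (1, 2, 4):
    print(k, precision_recall_at_k(recommended, relevant, k))
```

As k grows, recall can only stay flat or rise while precision tends to fall, which matches the trend in both tables.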


In [28]:
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots()

pr_curves_by_model = [res['precision_recall_overall'] for res in model_performance]

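# Results follow the order passed to compare_models: popularity model first (M0)
pr_curve = pr_curves_by_model[0].sort('recall')
ax.plot(list(pr_curve['recall']), list(pr_curve['precision']),
        'blue', label='M0 (popularity)')

# Item-similarity model second (M1)
pr_curve = pr_curves_by_model[1].sort('recall')
ax.plot(list(pr_curve['recall']), list(pr_curve['precision']),
        'green', label='M1 (item similarity)')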

ax.set_title('Precision-Recall Averaged Over Users')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.legend()

fig.show()


/home/zax/anaconda/lib/python2.7/site-packages/matplotlib/figure.py:387: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure
  "matplotlib is currently using a non-GUI backend, "

Counting unique users for some artists

Kanye West


In [35]:
Kanye_West = song_data[song_data['artist']=='Kanye West']

In [38]:
Kanye_West['user_id'].unique().show()


Kanye West unique users


In [41]:
Kanye_West_users=Kanye_West['user_id'].unique()

In [42]:
len(Kanye_West_users)


Out[42]:
2522

Foo Fighters unique users


In [45]:
Foo_Fighters_users = song_data[song_data['artist']=='Foo Fighters']['user_id'].unique()

In [47]:
len(Foo_Fighters_users)


Out[47]:
2055

Taylor Swift unique users


In [53]:
Taylor_Swift_users = song_data[song_data['artist']=='Taylor Swift']['user_id'].unique()

In [55]:
len(Taylor_Swift_users)


Out[55]:
3246

Lady GaGa unique users


In [56]:
Lady_GaGa_users = song_data[song_data['artist']=='Lady GaGa']['user_id'].unique()

In [58]:
len(Lady_GaGa_users)


Out[58]:
2928
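The four artist counts above repeat the same filter, unique, and length steps. As an illustration only, the same counts could be collected in a single pass over the rows; the sketch below does this in pure Python on made-up rows, and the function name is hypothetical.

```python
def unique_users_per_artist(rows, artists):
    """Count distinct user_ids for each artist of interest in one pass."""
    users = dict((artist, set()) for artist in artists)
    for row in rows:
        if row["artist"] in users:
            users[row["artist"]].add(row["user_id"])
    return dict((artist, len(ids)) for artist, ids in users.items())


# Made-up rows with the same columns as song_data.
toy_rows = [
    {"user_id": "u1", "artist": "Kanye West"},
    {"user_id": "u1", "artist": "Kanye West"},   # repeat listener counts once
    {"user_id": "u2", "artist": "Kanye West"},
    {"user_id": "u2", "artist": "Foo Fighters"},
    {"user_id": "u3", "artist": "Harmonia"},     # not in the list, ignored
]
counts = unique_users_per_artist(toy_rows, ["Kanye West", "Foo Fighters", "Lady GaGa"])
print(counts)
```

An artist with no matching rows, such as Lady GaGa in the toy data, gets a count of zero rather than being dropped.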

Groupby-aggregate


In [8]:
groupby_artist = song_data.groupby(key_columns='artist', operations={'total_count': gl.aggregate.SUM('listen_count')})
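What the groupby with SUM computes can be sketched in plain Python: one running total of listen_count per artist. The rows below are invented for illustration and the function name is hypothetical.

```python
from collections import defaultdict


def total_listens_by_artist(rows):
    """Sum listen_count per artist, like a groupby with a SUM aggregator."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["artist"]] += row["listen_count"]
    return dict(totals)


# Made-up rows with the two columns the aggregation needs.
toy_listens = [
    {"artist": "Foo Fighters", "listen_count": 3},
    {"artist": "Jack Johnson", "listen_count": 1},
    {"artist": "Foo Fighters", "listen_count": 1},
    {"artist": "Paco De Lucia", "listen_count": 2},
]
totals = total_listens_by_artist(toy_listens)
# Most listened artist first, the same idea as sorting with ascending=False.
ranked = sorted(totals.items(), key=lambda pair: -pair[1])
print(ranked)
```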

Sorting groupby_artist


In [10]:
groupby_artist


Out[10]:
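+--------------------------------------+-------------+
| artist                               | total_count |
+--------------------------------------+-------------+
| The Dells                            | 274         |
| Lil Jon / The East Side Boyz ...     | 197         |
| Tom Petty And The Heartbreakers ...  | 2867        |
| Blackstreet                          | 747         |
| Ratatat                              | 3727        |
| Shotta                               | 82          |
| Airscape                             | 130         |
| Mecano                               | 172         |
| Moimir Papalescu & The Nihilists ... | 177         |
| Brad Paisley                         | 2731        |
+--------------------------------------+-------------+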
[3375 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [19]:
groupby_artist.sort('total_count')


Out[19]:
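+-------------------------------------------------+-------------+
| artist                                          | total_count |
+-------------------------------------------------+-------------+
| William Tabbert                                 | 14          |
| Reel Feelings                                   | 24          |
| Beyoncé feat. Bun B and Slim Thug ...           | 26          |
| Diplo                                           | 30          |
| Boggle Karaoke                                  | 30          |
| harvey summers                                  | 31          |
| Nâdiya                                          | 36          |
| Kanye West / Talib Kweli / Q-Tip / Common / ... | 38          |
| Aneta Langerova                                 | 38          |
| Jody Bernal                                     | 38          |
+-------------------------------------------------+-------------+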
[3375 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [20]:
groupby_artist.sort('total_count', ascending=False)


Out[20]:
artist total_count
Kings Of Leon 43218
Dwight Yoakam 40619
Björk 38889
Coldplay 35362
Florence + The Machine 33387
Justin Bieber 29715
Alliance Ethnik 26689
OneRepublic 25754
Train 25402
The Black Keys 22184
[3375 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Split the data into train and test data


In [22]:
train_data,test_data = song_data.random_split(.8, seed=0)

In [23]:
subset_test_data = test_data['user_id'].unique()[0:10000]

In [25]:
personalized_model = gl.item_similarity_recommender.create(train_data,
                                                   user_id = 'user_id',
                                                   item_id = 'song')


PROGRESS: Recsys training: model = item_similarity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 1.23434s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 9952 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 1.68233         |
PROGRESS: | 2000            | 1.7462          |
PROGRESS: | 3000            | 1.80827         |
PROGRESS: | 4000            | 1.87111         |
PROGRESS: | 5000            | 1.95594         |
PROGRESS: | 6000            | 2.02921         |
PROGRESS: | 7000            | 2.10021         |
PROGRESS: | 8000            | 2.17565         |
PROGRESS: | 9000            | 2.2453          |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 2.54794s

Recommend one song to each of 10000 test users


In [27]:
# 1 recommendation for each of these users
recommneded_song = personalized_model.recommend(subset_test_data, k=1)


PROGRESS: recommendations finished on 1000/10000 queries. users per second: 1714.06
PROGRESS: recommendations finished on 2000/10000 queries. users per second: 1800.46
PROGRESS: recommendations finished on 3000/10000 queries. users per second: 1833.78
PROGRESS: recommendations finished on 4000/10000 queries. users per second: 1833.45
PROGRESS: recommendations finished on 5000/10000 queries. users per second: 1828.53
PROGRESS: recommendations finished on 6000/10000 queries. users per second: 1833.59
PROGRESS: recommendations finished on 7000/10000 queries. users per second: 1828.71
PROGRESS: recommendations finished on 8000/10000 queries. users per second: 1835.12
PROGRESS: recommendations finished on 9000/10000 queries. users per second: 1838.31
PROGRESS: recommendations finished on 10000/10000 queries. users per second: 1838.18

In [ ]:
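# Count how often each song appears in the recommendations
# (groupby takes key_columns, not key_column, and needs an operations dict)
song_counts = recommneded_song.groupby(key_columns='song', operations={'count': gl.aggregate.COUNT()})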