In [1]:
import graphlab

Load music data


In [2]:
song_data = graphlab.SFrame('./song_data.gl/')


[INFO] This non-commercial license of GraphLab Create is assigned to eroicaleo@yahoo.com and will expire on September 28, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-1191 - Server binary: /Users/yang/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1446881725.log
[INFO] GraphLab Server Version: 1.6.1

Explore data


In [3]:
song_data.head()


Out[3]:
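+----------------------------------------------+--------------------+--------------+-----------------------------------+-------------------------------------------------+
| user_id                                      | song_id            | listen_count | title                             | artist                                          |
+----------------------------------------------+--------------------+--------------+-----------------------------------+-------------------------------------------------+
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOAKIMP12A8C130995 | 1            | The Cove                          | Jack Johnson                                    |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBBMDR12A8C13253B | 2            | Entre Dos Aguas                   | Paco De Lucia                                   |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBXHDL12A81C204C0 | 1            | Stronger                          | Kanye West                                      |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBYHAJ12A6701BF1D | 1            | Constellations                    | Jack Johnson                                    |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODACBL12A8C13C273 | 1            | Learn To Fly                      | Foo Fighters                                    |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODDNQT12A6D4F5F7E | 5            | Apuesta Por El Rock 'N' Roll ...  | Héroes del Silencio                             |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODXRTY12AB0180F3B | 1            | Paper Gangsta                     | Lady GaGa                                       |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOFGUAY12AB017B0A8 | 1            | Stacked Actors                    | Foo Fighters                                    |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOFRQTD12A81C233C0 | 1            | Sehr kosmisch                     | Harmonia                                        |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOHQWYZ12A6D4FA701 | 1            | Heaven's gonna burn your eyes ... | Thievery Corporation feat. Emiliana Torrini ... |
+----------------------------------------------+--------------------+--------------+-----------------------------------+-------------------------------------------------+
+-----------------------------------------------+
| song                                          |
+-----------------------------------------------+
| The Cove - Jack Johnson                       |
| Entre Dos Aguas - Paco De Lucia ...           |
| Stronger - Kanye West                         |
| Constellations - Jack Johnson ...             |
| Learn To Fly - Foo Fighters ...               |
| Apuesta Por El Rock 'N' Roll - Héroes del ... |
| Paper Gangsta - Lady GaGa                     |
| Stacked Actors - Foo Fighters ...             |
| Sehr kosmisch - Harmonia                      |
| Heaven's gonna burn your eyes - Thievery ...  |
+-----------------------------------------------+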
[10 rows x 6 columns]


In [4]:
graphlab.canvas.set_target('ipynb')

In [5]:
song_data['song'].show()



In [6]:
len(song_data)


Out[6]:
1116609

Count number of users


In [7]:
users = song_data['user_id'].unique()

In [8]:
len(users)


Out[8]:
66346

Create a song recommender


In [9]:
train_data, test_data = song_data.random_split(.8, seed=0)
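The split is random, so the 80/20 proportion holds only approximately: the training log later in this notebook reports 893580 of the 1116609 rows, roughly 80.03%. The snippet below is a minimal sketch of a seeded row-level split in plain Python, assuming each row is assigned independently. It is not GraphLab's internal sampling, and `random_split_rows` is a hypothetical helper name.

```python
import random


def random_split_rows(rows, fraction, seed=0):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))
train_rows, test_rows = random_split_rows(rows, 0.8, seed=0)
print(len(train_rows) + len(test_rows))
```

Calling the helper twice with the same seed returns the same split, which is why fixing `seed=0` keeps the rest of the notebook repeatable.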

Simple popularity-based recommender


In [10]:
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                         user_id='user_id',
                                                         item_id='song')


PROGRESS: Recsys training: model = popularity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 1.39828s
PROGRESS: 893580 observations to process; with 9952 unique items.

Use the popularity model to make some predictions


In [11]:
popularity_model.recommend(users=[users[0]])


Out[11]:
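+----------------------------------------------+-----------------------------------------------------+--------+------+
| user_id                                      | song                                                | score  | rank |
+----------------------------------------------+-----------------------------------------------------+--------+------+
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Undo - Björk                                        | 4227.0 | 2    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Revelry - Kings Of Leon                             | 3527.0 | 5    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Secrets - OneRepublic                               | 3148.0 | 7    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 8    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Tive Sim - Cartola                                  | 2521.0 | 9    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Drop The World - Lil Wayne / Eminem ...             | 2053.0 | 10   |
+----------------------------------------------+-----------------------------------------------------+--------+------+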
[10 rows x 4 columns]


In [12]:
popularity_model.recommend(users=[users[1]])


Out[12]:
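+----------------------------------------------+-----------------------------------------------------+--------+------+
| user_id                                      | song                                                | score  | rank |
+----------------------------------------------+-----------------------------------------------------+--------+------+
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Undo - Björk                                        | 4227.0 | 2    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Revelry - Kings Of Leon                             | 3527.0 | 5    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Secrets - OneRepublic                               | 3148.0 | 7    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Hey_ Soul Sister - Train                            | 2538.0 | 8    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 9    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Tive Sim - Cartola                                  | 2521.0 | 10   |
+----------------------------------------------+-----------------------------------------------------+--------+------+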
[10 rows x 4 columns]
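
The two users get nearly the same list with identical scores, and the scores look like plain counts. Below is a minimal sketch of a popularity recommender in plain Python. It assumes the score is the number of distinct users who played a song and that songs a user has already played are left out of that user's list. Both are assumptions about the model's behavior, not taken from GraphLab's source. `popularity_recommend` and the toy data are hypothetical.

```python
def popularity_recommend(interactions, user, k=10):
    """Rank songs by distinct listener count, skipping songs `user` already played."""
    listeners = {}
    for u, song in interactions:
        listeners.setdefault(song, set()).add(u)
    already_played = set(song for u, song in interactions if u == user)
    scores = [(song, len(us)) for song, us in listeners.items()
              if song not in already_played]
    # Highest count first; ties broken alphabetically for a stable order.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


toy = [("u1", "A"), ("u2", "A"), ("u3", "A"),
       ("u1", "B"), ("u2", "B"), ("u3", "C")]
print(popularity_recommend(toy, "u3", k=2))  # u3 has played A and C
print(popularity_recommend(toy, "u9", k=2))  # u9 is a new user
```

Under these assumptions, a song missing from one user's list but present in another's (such as "Hey_ Soul Sister - Train" above) would be one the first user has already played.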

Build a song recommender with personalization


In [13]:
personalized_mode = graphlab.item_similarity_recommender.create(train_data,
                                                               user_id='user_id',
                                                               item_id='song')


PROGRESS: Recsys training: model = item_similarity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 1.32383s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 9952 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 1.78274         |
PROGRESS: | 2000            | 1.86017         |
PROGRESS: | 3000            | 1.93299         |
PROGRESS: | 4000            | 2.00467         |
PROGRESS: | 5000            | 2.07504         |
PROGRESS: | 6000            | 2.13892         |
PROGRESS: | 7000            | 2.20079         |
PROGRESS: | 8000            | 2.27583         |
PROGRESS: | 9000            | 2.35564         |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 2.84023s
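
The notebook does not say which similarity measure the model uses. A common choice for data without a target column is the Jaccard index: the number of users who played both songs divided by the number who played either. The sketch below assumes Jaccard similarity and uses hypothetical names (`jaccard`, `similar_items`) and toy data; it illustrates the idea rather than GraphLab's implementation.

```python
def jaccard(users_a, users_b):
    """Size of the intersection divided by the size of the union."""
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / float(len(union))


def similar_items(interactions, item, k=10):
    """Return the k songs whose listener sets overlap most with `item`."""
    listeners = {}
    for user, song in interactions:
        listeners.setdefault(song, set()).add(user)
    target = listeners[item]
    scored = [(other, jaccard(target, users))
              for other, users in listeners.items() if other != item]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


toy = [("u1", "A"), ("u2", "A"), ("u3", "A"),
       ("u1", "B"), ("u2", "B"), ("u3", "C"), ("u4", "D")]
for song, score in similar_items(toy, "A"):
    print(song, round(score, 3))
```

A personalized recommendation can then be built by combining these similarities over the songs a user has already played, so two users with different histories get different lists.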

Apply the personalized model to make song recommendations


In [14]:
personalized_mode.recommend(users=[users[0]])


Out[14]:
+----------------------------------------------+----------------------------------------------+-----------------+------+
| user_id                                      | song                                         | score           | rank |
+----------------------------------------------+----------------------------------------------+-----------------+------+
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Cuando Pase El Temblor - Soda Stereo ...     | 0.0194504525792 | 1    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Fireflies - Charttraxx Karaoke ...           | 0.0144917381235 | 2    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Love Is A Losing Game - Amy Winehouse ...    | 0.0142865986808 | 3    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Marry Me - Train                             | 0.0141539719954 | 4    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Secrets - OneRepublic                        | 0.0136062112507 | 5    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | No Dejes Que... - Caifanes ...               | 0.0134191754754 | 6    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Sehr kosmisch - Harmonia                     | 0.0134166034035 | 7    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Y solo se me ocurre amarte (Unplugged) - ... | 0.0133210385369 | 8    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Te Hacen Falta Vitaminas - Soda Stereo ...   | 0.0129302853556 | 9    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | OMG - Usher featuring will.i.am ...          | 0.0127952460209 | 10   |
+----------------------------------------------+----------------------------------------------+-----------------+------+
[10 rows x 4 columns]


In [15]:
personalized_mode.recommend(users=[users[1]])


Out[15]:
+----------------------------------------------+--------------------------------------------------+-----------------+------+
| user_id                                      | song                                             | score           | rank |
+----------------------------------------------+--------------------------------------------------+-----------------+------+
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Riot In Cell Block Number Nine - Dr Feelgood ... | 0.0375          | 1    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Sei Lá Mangueira - Elizeth Cardoso ...           | 0.0331632653061 | 2    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | The Stallion - Ween                              | 0.0322580645161 | 3    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Rain - Subhumans                                 | 0.0314159292035 | 4    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | West One (Shine On Me) - The Ruts ...            | 0.0306772028826 | 5    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Back Against The Wall - Cage The Elephant ...    | 0.0301204819277 | 6    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Life Less Frightening - Rise Against ...         | 0.0284431137725 | 7    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | A Beggar On A Beach Of Gold - Mike And The ...   | 0.0230024907156 | 8    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Audience Of One - Rise Against ...               | 0.0193938442211 | 9    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | Blame It On The Boogie - The Jacksons ...        | 0.0189873417722 | 10   |
+----------------------------------------------+--------------------------------------------------+-----------------+------+
[10 rows x 4 columns]


In [16]:
personalized_mode.get_similar_items(['With Or Without You - U2'])


PROGRESS: Getting similar items completed in 0.034809
Out[16]:
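+--------------------------+------------------------------------------------+-----------------+------+
| song                     | similar                                        | score           | rank |
+--------------------------+------------------------------------------------+-----------------+------+
| With Or Without You - U2 | I Still Haven't Found What I'm Looking For ... | 0.0428571428571 | 1    |
| With Or Without You - U2 | Hold Me_ Thrill Me_ Kiss Me_ Kill Me - U2 ...  | 0.033734939759  | 2    |
| With Or Without You - U2 | Window In The Skies - U2                       | 0.0328358208955 | 3    |
| With Or Without You - U2 | Vertigo - U2                                   | 0.0300751879699 | 4    |
| With Or Without You - U2 | Sunday Bloody Sunday - U2                      | 0.0271317829457 | 5    |
| With Or Without You - U2 | Bad - U2                                       | 0.0251798561151 | 6    |
| With Or Without You - U2 | A Day Without Me - U2                          | 0.0237154150198 | 7    |
| With Or Without You - U2 | Another Time Another Place - U2 ...            | 0.020325203252  | 8    |
| With Or Without You - U2 | Walk On - U2                                   | 0.020202020202  | 9    |
| With Or Without You - U2 | Get On Your Boots - U2                         | 0.0196850393701 | 10   |
+--------------------------+------------------------------------------------+-----------------+------+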
[10 rows x 4 columns]


In [17]:
personalized_mode.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])


PROGRESS: Getting similar items completed in 0.003535
Out[17]:
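+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| song                                           | similar                                             | score           | rank |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| Chan Chan (Live) - Buena Vista Social Club ... | Murmullo - Buena Vista Social Club ...              | 0.188118811881  | 1    |
| Chan Chan (Live) - Buena Vista Social Club ... | La Bayamesa - Buena Vista Social Club ...           | 0.187192118227  | 2    |
| Chan Chan (Live) - Buena Vista Social Club ... | Amor de Loca Juventud - Buena Vista Social Club ... | 0.184834123223  | 3    |
| Chan Chan (Live) - Buena Vista Social Club ... | Diferente - Gotan Project                           | 0.0214592274678 | 4    |
| Chan Chan (Live) - Buena Vista Social Club ... | Mistica - Orishas                                   | 0.0205761316872 | 5    |
| Chan Chan (Live) - Buena Vista Social Club ... | Hotel California - Gipsy Kings ...                  | 0.019305019305  | 6    |
| Chan Chan (Live) - Buena Vista Social Club ... | Nací Orishas - Orishas                              | 0.0191570881226 | 7    |
| Chan Chan (Live) - Buena Vista Social Club ... | Le Moulin - Yann Tiersen                            | 0.0187969924812 | 8    |
| Chan Chan (Live) - Buena Vista Social Club ... | Gitana - Willie Colon                               | 0.0187969924812 | 9    |
| Chan Chan (Live) - Buena Vista Social Club ... | Criminal - Gotan Project                            | 0.018779342723  | 10   |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+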
[10 rows x 4 columns]

Quantitative comparison between the models


In [19]:
import matplotlib.pyplot as plt
%matplotlib inline
model_performance = graphlab.recommender.util.compare_models(test_data,
                                                            [popularity_model, personalized_mode],
                                                            user_sample=0.05)


compare_models: using 2931 users to estimate model performance
PROGRESS: Evaluate model M0
PROGRESS: recommendations finished on 1000/2931 queries. users per second: 9016.64
PROGRESS: recommendations finished on 2000/2931 queries. users per second: 11510

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    |  0.028659160696 | 0.00730925411069 |
|   2    | 0.0272944387581 | 0.0143253785526  |
|   3    | 0.0257022631639 | 0.0195734276338  |
|   4    | 0.0240532241556 | 0.0244085362304  |
|   5    | 0.0217673149096 | 0.0280024516974  |
|   6    | 0.0213806436938 | 0.0335147594359  |
|   7    | 0.0204220889994 | 0.0369194946984  |
|   8    | 0.0192340498124 | 0.0405064844932  |
|   9    | 0.0184995640472 | 0.0431062848028  |
|   10   | 0.0177072671443 | 0.0459812842489  |
+--------+-----------------+------------------+
[10 rows x 3 columns]
[WARNING] Model trained without a target. Skipping RMSE computation.
PROGRESS: Evaluate model M1
PROGRESS: recommendations finished on 1000/2931 queries. users per second: 1100.66
PROGRESS: recommendations finished on 2000/2931 queries. users per second: 1144.77

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.199931763903 | 0.0608084715429 |
|   2    |  0.163254861822 | 0.0948712760381 |
|   3    |  0.144887979074 |  0.122499880138 |
|   4    |  0.130160354828 |  0.145611991595 |
|   5    |  0.117570794951 |  0.162565310557 |
|   6    |  0.108097350165 |  0.178556940046 |
|   7    |  0.101184383682 |  0.195134835974 |
|   8    | 0.0942937563971 |  0.207001547866 |
|   9    | 0.0872663861405 |  0.214203758446 |
|   10   | 0.0831115660184 |  0.225417095888 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]
[WARNING] Model trained without a target. Skipping RMSE computation.
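
In the tables above, M0 is the popularity model and M1 the personalized model, following the order of the list passed to `compare_models`. At cutoff 1 the personalized model reaches a mean precision of about 0.200 against 0.029, roughly seven times higher. The following is a minimal sketch of how precision and recall at a cutoff k can be computed for a single user, assuming precision divides the hits by k and recall divides them by the number of held-out songs. `precision_recall_at_k` is a hypothetical helper, not a GraphLab function.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    recommended: ranked list of songs, best first
    relevant:    set of songs the user actually played in the test data
    """
    top_k = recommended[:k]
    hits = sum(1 for song in top_k if song in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall


recommended = ["A", "B", "C", "D"]
relevant = {"B", "D", "E"}
for k in (1, 2, 4):
    print(k, precision_recall_at_k(recommended, relevant, k))
```

Averaging these per-user values over the sampled users gives columns comparable to `mean_precision` and `mean_recall`. Recall can only grow as k increases, while precision tends to fall, which matches the trend in both tables.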

Problem 1


In [23]:
len(song_data[song_data['artist'] == 'Kanye West']['user_id'].unique())


Out[23]:
2522

In [24]:
len(song_data[song_data['artist'] == 'Foo Fighters']['user_id'].unique())


Out[24]:
2055

In [25]:
len(song_data[song_data['artist'] == 'Taylor Swift']['user_id'].unique())


Out[25]:
3246

In [26]:
len(song_data[song_data['artist'] == 'Lady GaGa']['user_id'].unique())


Out[26]:
2928
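
Each of the four cells above filters the rows for one artist and counts the distinct `user_id` values. The same computation in plain Python, as a sketch over a list of dictionaries (`unique_listeners` and the toy rows are hypothetical, not part of the data set):

```python
def unique_listeners(rows, artist):
    """Count distinct users with at least one row for `artist`."""
    return len(set(row["user_id"] for row in rows if row["artist"] == artist))


toy_rows = [
    {"user_id": "u1", "artist": "Kanye West"},
    {"user_id": "u1", "artist": "Kanye West"},   # same user, second song
    {"user_id": "u2", "artist": "Kanye West"},
    {"user_id": "u2", "artist": "Foo Fighters"},
]
for name in ("Kanye West", "Foo Fighters", "Taylor Swift"):
    print(name, unique_listeners(toy_rows, name))
```

Using a set is what makes repeat listens by one user count only once; counting rows instead would overstate the number of listeners.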

Problem 2


In [27]:
# Total listen count per artist, most-played artists first
song_data.groupby(key_columns='artist',
                  operations={'total_count': graphlab.aggregate.SUM('listen_count')}
                 ).sort('total_count', ascending=False)


Out[27]:
artist total_count
Kings Of Leon 43218
Dwight Yoakam 40619
Björk 38889
Coldplay 35362
Florence + The Machine 33387
Justin Bieber 29715
Alliance Ethnik 26689
OneRepublic 25754
Train 25402
The Black Keys 22184
[3375 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [28]:
# Same aggregation, least-played artists first
song_data.groupby(key_columns='artist',
                  operations={'total_count': graphlab.aggregate.SUM('listen_count')}
                 ).sort('total_count', ascending=True)


Out[28]:
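+-------------------------------------------------+-------------+
| artist                                          | total_count |
+-------------------------------------------------+-------------+
| William Tabbert                                 | 14          |
| Reel Feelings                                   | 24          |
| Beyoncé feat. Bun B and Slim Thug ...           | 26          |
| Diplo                                           | 30          |
| Boggle Karaoke                                  | 30          |
| harvey summers                                  | 31          |
| Nâdiya                                          | 36          |
| Kanye West / Talib Kweli / Q-Tip / Common / ... | 38          |
| Aneta Langerova                                 | 38          |
| Jody Bernal                                     | 38          |
+-------------------------------------------------+-------------+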
[3375 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
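
The two cells above group rows by artist, sum `listen_count` within each group, and sort the totals. A minimal plain-Python sketch of that group-and-sum step follows; `total_listens_by_artist` is a hypothetical helper and the rows are toy values, not the real data set.

```python
from collections import defaultdict


def total_listens_by_artist(rows, ascending=False):
    """Sum listen_count per artist and return (artist, total) pairs, sorted."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["artist"]] += row["listen_count"]
    sign = 1 if ascending else -1
    # Sort by total, breaking ties alphabetically for a stable order.
    return sorted(totals.items(), key=lambda pair: (sign * pair[1], pair[0]))


toy_rows = [
    {"artist": "Jack Johnson", "listen_count": 1},
    {"artist": "Paco De Lucia", "listen_count": 2},
    {"artist": "Kanye West", "listen_count": 1},
    {"artist": "Jack Johnson", "listen_count": 3},
]
print(total_listens_by_artist(toy_rows))
print(total_listens_by_artist(toy_rows, ascending=True))
```

Note that this ranks artists by total plays, which is a different quantity from the number of distinct listeners counted in Problem 1.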

Problem 3


In [29]:
subset_test_users = test_data['user_id'].unique()[0:10000]

In [31]:
# k=1 keeps only the top-ranked song for each user
recommendations = personalized_mode.recommend(subset_test_users, k=1)


PROGRESS: recommendations finished on 1000/10000 queries. users per second: 1213.21
PROGRESS: recommendations finished on 2000/10000 queries. users per second: 1209.07
PROGRESS: recommendations finished on 3000/10000 queries. users per second: 1197.18
PROGRESS: recommendations finished on 4000/10000 queries. users per second: 1216.29
PROGRESS: recommendations finished on 5000/10000 queries. users per second: 1234.38
PROGRESS: recommendations finished on 6000/10000 queries. users per second: 1254.33
PROGRESS: recommendations finished on 7000/10000 queries. users per second: 1263.86
PROGRESS: recommendations finished on 8000/10000 queries. users per second: 1278.05
PROGRESS: recommendations finished on 9000/10000 queries. users per second: 1269.33
PROGRESS: recommendations finished on 10000/10000 queries. users per second: 1243.02

In [32]:
recommendations.head()


Out[32]:
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
| user_id                                      | song                                                | score           | rank |
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Cuando Pase El Temblor - Soda Stereo ...            | 0.0194504525792 | 1    |
| c067c22072a17d33310d7223d7b79f819e48cf42 ... | Grind With Me (Explicit Version) - Pretty Ricky ... | 0.0459424433009 | 1    |
| f6c596a519698c97f1591ad89f540d76f6a04f1a ... | Hey_ Soul Sister - Train                            | 0.0249315949672 | 1    |
| 696787172dd3f5169dc94deef97e427cee86147d ... | Senza Una Donna (Without A Woman) - Zucchero / ...  | 0.0170265780731 | 1    |
| 3a7111f4cdf3c5a85fd4053e3cc2333562e1e0cb ... | Heartbreak Warfare - John Mayer ...                 | 0.0320586842427 | 1    |
| 532e98155cbfd1e1a474a28ed96e59e50f7c5baf ... | Jive Talkin' (Album Version) - Bee Gees ...         | 0.0118288659232 | 1    |
| ee43b175ed753b2e2bce806c903d4661ad351a91 ... | Ricordati Di Noi - Valerio Scanu ...                | 0.0305171277997 | 1    |
| e372c27f6cb071518ae500589ae02c126954c148 ... | Fall Out - The Police                               | 0.0819672131148 | 1    |
| 83b1428917b47a6b130ed471b09033820be78a8c ... | Clocks - Coldplay                                   | 0.0440380823291 | 1    |
| 39487deef9345b1e22881245cabf4e7c53b6cf6e ... | Black Mirror - Arcade Fire ...                      | 0.0417737699321 | 1    |
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
[10 rows x 4 columns]


In [34]:
# Count how many users received each song as their top recommendation
recommendations.groupby(key_columns='song',
                        operations={'count': graphlab.aggregate.COUNT()}
                       ).sort('count', ascending=False)


Out[34]:
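+-----------------------------------------------------+-------+
| song                                                | count |
+-----------------------------------------------------+-------+
| Undo - Björk                                        | 431   |
| Secrets - OneRepublic                               | 381   |
| Revelry - Kings Of Leon                             | 232   |
| You're The One - Dwight Yoakam ...                  | 170   |
| Fireflies - Charttraxx Karaoke ...                  | 122   |
| Hey_ Soul Sister - Train                            | 107   |
| Horn Concerto No. 4 in E flat K495: II. Romance ... | 98    |
| Sehr kosmisch - Harmonia                            | 72    |
| OMG - Usher featuring will.i.am ...                 | 58    |
| Dog Days Are Over (Radio Edit) - Florence + The ... | 53    |
+-----------------------------------------------------+-------+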
[3135 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
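
The result shows that "Undo - Björk" was the top recommendation for 431 of the 10000 sampled users, about 4.3%, and that 3135 distinct songs appeared as someone's top pick. The counting step can be sketched in plain Python with `collections.Counter`; `most_recommended` and the toy pairs below are hypothetical.

```python
from collections import Counter


def most_recommended(top_recommendations):
    """Count how often each song appears as a user's rank-1 recommendation."""
    counts = Counter(song for _user, song in top_recommendations)
    return counts.most_common()  # (song, count) pairs, most frequent first


toy = [("u1", "X"), ("u2", "Y"), ("u3", "X"),
       ("u4", "Z"), ("u5", "X"), ("u6", "Y")]
print(most_recommended(toy))
```

Because each user contributes exactly one row when `k=1`, the counts add up to the number of users queried.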