Recommendation Engine

In this part we are going to build a simple recommender system using collaborative filtering.

1. Imports


In [ ]:
import numpy as np
import pandas as pd
import sklearn.metrics.pairwise

2. The data

We will use the German subset of the Last.fm dataset. To read and display the data we will use the pandas library.


In [ ]:
data = pd.read_csv('data/lastfm-matrix-germany.csv')
data

As you can see, the data records for each user which bands they listened to on Last.fm. Note that the number of times a user listened to a specific band is not recorded.

To make things easier, we also make a copy of the data without the user column and store it in a NumPy array.


In [ ]:
# Drop the user id column so that only the band columns remain
data_matrix = np.array(data.drop(columns='user'))
data_matrix.shape

3. Band similarity

We want to figure out which band to recommend to which user. Since we know which user listened to which band, we can look for bands or users that are similar. People can have very complex listening preferences and are hard to group. Bands, on the other hand, are usually much easier to group, so we look for similarities between bands rather than between users.

To determine whether two bands are similar, you can use many different similarity metrics. Finding the best metric is a research topic of its own. In many cases, though, cosine similarity is used. The implementation we will use here is sklearn.metrics.pairwise.cosine_similarity.
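
To get a feel for what cosine similarity measures, here is a minimal sketch that computes it by hand with NumPy for a few made-up listening vectors. The helper `cosine_sim` and the vectors `band_a` to `band_d` are illustrative assumptions, not part of the dataset or of scikit-learn.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors (assumes neither is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical listening vectors: one entry per user, 1 = listened, 0 = did not.
band_a = [1, 1, 0, 0]
band_b = [1, 1, 0, 0]  # same listeners as band_a
band_c = [0, 0, 1, 1]  # no listeners in common with band_a
band_d = [1, 0, 0, 0]  # shares one listener with band_a

sim_ab = cosine_sim(band_a, band_b)
sim_ac = cosine_sim(band_a, band_c)
sim_ad = cosine_sim(band_a, band_d)
print(sim_ab, sim_ac, sim_ad)
```

Identical listener sets give a similarity of about 1, disjoint listener sets give 0, and partial overlap lands in between (about 0.71 here).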


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# similarity_matrix = sklearn.metrics.pairwise.cosine_similarity( ? )

assert similarity_matrix.shape == (285, 285)

To display the matrix with the band names as row and column labels, we wrap it in a pandas DataFrame as follows.


In [ ]:
band_similarities = pd.DataFrame(similarity_matrix, index=data.columns[1:], columns=data.columns[1:])
band_similarities

As you can see above, bands are 100% similar to themselves and The White Stripes are nothing like Abba.

4. Picking the best matches

Even though many of the bands above have a similarity close to 0, some bands may seem slightly similar only because somebody with a very broad taste happened to listen to both. To remove this noise, we are going to keep only the 10 best matches for each band.
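
As a small illustration of keeping only the strongest matches, the sketch below works on a made-up Series and uses `Series.nlargest`, which is a different route than the `sort_values` approach the exercises ask for. The band names and values are hypothetical.

```python
import pandas as pd

# Made-up similarities of one band to five others (not taken from the dataset).
toy = pd.Series(
    [0.05, 0.80, 0.00, 0.35, 0.10],
    index=["band_a", "band_b", "band_c", "band_d", "band_e"],
)

# Keep the 2 largest values; the result is ordered from high to low.
top_2 = toy.nlargest(2)
top_2_names = list(top_2.index)
print(top_2_names)
```

The weak similarities (0.05, 0.00 and 0.10) are discarded, which is the noise removal described above.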

Let's first try this with the first band in the list.


In [ ]:
n_best = 10
##### Implement this part of the code #####
raise NotImplementedError()
# top_n = band_similarities.iloc[:,0].sort_values(ascending= ? )[:?]
print(top_n)

assert len(top_n) == 10

If we only want the names, we can get them through the .index.


In [ ]:
n_best = 10
##### Implement this part of the code #####
raise NotImplementedError()
# top_n = band_similarities.iloc[:,0].sort_values(ascending= ? ) ?
print(top_n)

assert len(top_n) == 10 and top_n.__class__ == pd.Index

Now let's do this for all bands.


In [ ]:
n_best = 10

# First create a placeholder for all the most similar bands.
top_n_similar_bands = pd.DataFrame(index=band_similarities.columns,columns=range(1, n_best + 1))

# Now loop over all the bands and select the top bands
for i in range(0, len(band_similarities.columns)):
    ##### Implement this part of the code #####
    raise NotImplementedError()
    # top_n_similar_bands.iloc[i,:] = band_similarities.iloc[:,i].sort_values(ascending= ? ) ?

top_n_similar_bands

5. Find which bands to advise

Now that we know which bands are similar, we have to figure out which bands to advise to whom. To do this, we need to determine how well a user's listening history matches the bands they have not listened to yet. For this we will use the following similarity score.


In [ ]:
# Function to compute the similarity scores
def similarity_score(listening_history, similarities):
    return sum(listening_history * similarities) / sum(similarities)

For each band we sum the similarities of bands the user also listened to. In the end we divide by the total sum of similarities to normalise the score.
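
Written as a formula, the function above computes the following. The symbols are notation introduced here for illustration: $h_i$ is the entry of `listening_history` for the $i$-th similar band (1 if the user listened to it, 0 otherwise) and $s_i$ is the matching entry of `similarities`.

```latex
\text{score} = \frac{\sum_{i} h_i \, s_i}{\sum_{i} s_i}
```

Assuming every $h_i$ is 0 or 1, the similarities are non-negative, and their sum is not zero, the score lies between 0 and 1.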

Say a user listened to 1 of 3 similar bands, for example [0, 1, 0], and their respective similarity scores are [0.3, 0.2, 0.1]. You then get the following score:


In [ ]:
listening_history = np.array([0, 1, 0]) 
similarities = np.array([0.3, 0.2, 0.1])
similarity_score(listening_history, similarities)

Now let's compute the score for each band for user 1 (with index 0).


In [ ]:
user_index = 0

# a list of all the scores
scores = []

for band_index in range(len(band_similarities.columns)):
    user = data.index[user_index]
    band = band_similarities.columns[band_index]
    
    # For bands the user already listened to we set the score to 0
    if data_matrix[user_index][band_index] == 1:
        scores.append(0)
    else:
        # Most similar bands to this one
        ##### Implement this part of the code #####
        raise NotImplementedError()
        # most_similar_band_names = band_similarities.loc[band].sort_values(ascending= ? ) ?
        # Get the similarity score of these bands
        ##### Implement this part of the code #####
        raise NotImplementedError()
        # most_similar_band_scores = band_similarities.loc[band].sort_values(ascending= ? ) ?
        # Get the listening history for these bands
        user_listening_history = data.loc[user, most_similar_band_names]

        scores.append(similarity_score(user_listening_history, most_similar_band_scores))

Now let's print the top 5 bands to advise to this user:


In [ ]:
print('For user with id', data.iloc[user_index, 0], 'we advise:')
pd.DataFrame(scores, index=band_similarities.columns).sort_values(0, ascending=False).iloc[:5]

Now try this for other users as well.


In [ ]: