Extracting similar artists for the artists we have in our dataset.


In [1]:
import pandas as pd
import requests
from urllib.parse import quote
from artist_api import *

The artist dataset contains, for each artist, an id, a name, a url, and a pictureURL.


In [2]:
artists_df = pd.read_csv('artists.dat', sep='\t', header=0, index_col=0, skipinitialspace=True)
artists_df.head()


Out[2]:
id  name               url                                         pictureURL
1   MALICE MIZER       http://www.last.fm/music/MALICE+MIZER       http://userserve-ak.last.fm/serve/252/10808.jpg
2   Diary of Dreams    http://www.last.fm/music/Diary+of+Dreams    http://userserve-ak.last.fm/serve/252/3052066.jpg
3   Carpathian Forest  http://www.last.fm/music/Carpathian+Forest  http://userserve-ak.last.fm/serve/252/40222717...
4   Moi dix Mois       http://www.last.fm/music/Moi+dix+Mois       http://userserve-ak.last.fm/serve/252/54697835...
5   Bella Morte        http://www.last.fm/music/Bella+Morte        http://userserve-ak.last.fm/serve/252/14789013...

In this notebook we use the API to extract the unique mbid (MusicBrainz identifier) of each artist. The mbid is used later on to fetch the artists similar to that artist.

The API is queried by artist name, and several artists can share the same name. To match an API result to the artist in our dataset, we compare the url of each returned artist against the url in our dataset and keep only the one that matches.

Three cases can occur:

  • There is an mbid code available.
  • There is no mbid code available.
  • We didn't find an artist match, denoted as "notfound".
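
The matching step and the three cases above can be sketched as follows. This is a minimal sketch, not the code in artist_api: the helper names normalize_url and pick_mbid are hypothetical, and the sketch assumes each API candidate is a dictionary with "url" and "mbid" keys. It compares host and path only, on the assumption that the API may return https links while the dataset stores http links.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to host + path so that the scheme (http/https) is ignored."""
    parts = urlsplit(url.strip())
    return parts.netloc.lower() + parts.path.rstrip("/")


def pick_mbid(dataset_url, candidates):
    """Return the mbid of the candidate whose url matches the dataset url.

    Returns None when the matching artist has no mbid, and the string
    "notfound" when no candidate url matches (hypothetical conventions
    mirroring the three cases listed above).
    """
    target = normalize_url(dataset_url)
    for candidate in candidates:
        if normalize_url(candidate.get("url", "")) == target:
            # An empty mbid string is treated as "no mbid code available".
            return candidate.get("mbid") or None
    return "notfound"
```

For example, pick_mbid("http://www.last.fm/music/MALICE+MIZER", candidates) returns the mbid of the candidate with the same last.fm path, whichever scheme the API used.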

In [ ]:
artists_df['mbid'] = artists_df.apply(parse_artists, axis=1)

In [3]:
artists_df_mbid = pd.read_csv('artist_mbid_codes.csv', sep='\t', header=0, index_col=0, skipinitialspace=True)

In [4]:
artists_df_mbid.head()


Out[4]:
id  name               url                                         pictureURL                                         mbid
1   MALICE MIZER       http://www.last.fm/music/MALICE+MIZER       http://userserve-ak.last.fm/serve/252/10808.jpg    3897cf7f-9aac-4eef-aacb-ca0accdee9a2
2   Diary of Dreams    http://www.last.fm/music/Diary+of+Dreams    http://userserve-ak.last.fm/serve/252/3052066.jpg  22fa6038-d14c-4aab-a057-d397132e9191
3   Carpathian Forest  http://www.last.fm/music/Carpathian+Forest  http://userserve-ak.last.fm/serve/252/40222717...  69fa5c49-12ec-4c86-a238-4e07cc6d1a7d
4   Moi dix Mois       http://www.last.fm/music/Moi+dix+Mois       http://userserve-ak.last.fm/serve/252/54697835...  935d146a-ab8a-4745-b4c4-6043d4cc15c4
5   Bella Morte        http://www.last.fm/music/Bella+Morte        http://userserve-ak.last.fm/serve/252/14789013...  686dffc5-e13c-4b14-b2b7-097cbef040f0

As mentioned above, the API returns no mbid code for some artists, so we filter the dataframe to keep only the artists for which an mbid code was fetched.


In [5]:
# an artist has no usable code if the mbid is missing or marked 'notfound'
missing_mbid = artists_df_mbid.mbid.isnull() | (artists_df_mbid.mbid == 'notfound')
print("The number of artists without an mbid code :", missing_mbid.sum())


The number of artists without an mbid code : 3295

In [6]:
print("The total number of artists we have :",len(artists_df))


The total number of artists we have : 17632

In [7]:
artists_df_mbid = artists_df_mbid[artists_df_mbid.mbid != 'notfound']
artists_df_mbid = artists_df_mbid[artists_df_mbid.mbid.notnull()]

In [8]:
len(artists_df_mbid)


Out[8]:
14337
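
The counts are consistent: 17632 artists in total minus 3295 without a code leaves 14337. The same two-condition filter can be checked on a toy frame. The rows and the placeholder codes below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): two usable codes, one 'notfound', one missing.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "mbid": ["code-a", "notfound", None, "code-d"],
})

# Same two conditions as in the notebook, combined into one boolean mask.
missing = toy["mbid"].isnull() | (toy["mbid"] == "notfound")
kept = toy[~missing]
n_missing = int(missing.sum())

# Rows kept plus rows dropped must add up to the total.
assert len(kept) + n_missing == len(toy)
```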

Now that we have the mbid codes, we can extract the similar artists for each artist. Keep in mind that this function returns a large amount of information about each similar artist. The field of particular interest to us is "match", a value between 0 and 1 describing how similar the two artists are, with 1 being the most similar.
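
As a rough illustration of what such a lookup involves, the sketch below builds a request URL and pulls (name, match) pairs out of a response. It assumes similar_artists wraps the Last.fm artist.getSimilar method and that the JSON response nests the list under "similarartists" and "artist"; the notebook does not show this, so the endpoint, the parameters, and the helper names build_similar_url and extract_matches are all assumptions. The sample payload is hand-written from two rows of the sample output shown further down, and no request is sent.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Last.fm web service (not confirmed by the notebook).
API_ROOT = "http://ws.audioscrobbler.com/2.0/"


def build_similar_url(mbid, api_key, limit=100):
    """Build the query URL for the similar artists of one mbid."""
    params = {
        "method": "artist.getsimilar",
        "mbid": mbid,
        "api_key": api_key,
        "limit": limit,
        "format": "json",
    }
    return API_ROOT + "?" + urlencode(params)


def extract_matches(payload):
    """Return (name, match) pairs, with match converted from string to float."""
    artists = payload.get("similarartists", {}).get("artist", [])
    return [(artist["name"], float(artist["match"])) for artist in artists]


# Hand-written payload in the assumed response shape.
sample = {"similarartists": {"artist": [
    {"name": "Moi dix Mois", "match": "1.000000"},
    {"name": "Közi", "match": "0.851889"},
]}}
matches = extract_matches(sample)
```

Converting "match" to float at this point keeps the weights numeric in whatever dataframe is built from them afterwards.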


In [ ]:
artists_df_mbid['similar_artists'] = artists_df_mbid.apply(similar_artists, axis=1)

In [ ]:
columns=['ArtistID','Artist','Similar_artists','Weight']
artist_artist = pd.DataFrame(columns=columns)

For each artist, up to 100 similar artists are returned. To reduce the complexity of the problem and the computation time, we keep only the top 10 similar artists and save them into a dataframe that contains the artist id, the artist name, each similar artist, and its corresponding weight.
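
A minimal sketch of the top-N selection on a single artist's list is given here. The helper name top_similar is hypothetical, and the explicit sort is a precaution: it assumes nothing about the order in which the API returns the entries. The entries use names and weights from the sample output in this notebook.

```python
def top_similar(similar, n=10):
    """Return the n entries with the highest match as (name, weight) pairs."""
    ranked = sorted(similar, key=lambda entry: float(entry["match"]), reverse=True)
    # Slicing returns fewer than n pairs when the list is shorter than n.
    return [(entry["name"], float(entry["match"])) for entry in ranked[:n]]


entries = [
    {"name": "LAREINE", "match": "0.807094"},
    {"name": "Moi dix Mois", "match": "1.000000"},
    {"name": "Közi", "match": "0.851889"},
]
top_two = top_similar(entries, n=2)
```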


In [ ]:
# the number of similar artists kept per artist
top_n = 10
# collect rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
for artist_id, row in artists_df_mbid.iterrows():
    # list of dictionaries of similar artists information
    similar_list = row['similar_artists']
    # skip artists for which no list of similar artists was fetched
    if not isinstance(similar_list, (list, tuple)):
        continue
    # slicing also covers lists with fewer than top_n entries
    for similar in similar_list[:top_n]:
        rows.append({'ArtistID': artist_id,
                     'Artist': row['name'],
                     'Similar_artists': similar['name'],
                     'Weight': similar['match']})

# build the dataframe once from the collected rows
artist_artist = pd.DataFrame(rows, columns=columns)

In [9]:
artist_artist_connections = pd.read_csv('artist_artist_connections.csv', sep='\t', header=0, index_col=0, skipinitialspace=True)

Sample output


In [10]:
artist_artist_connections.head()


Out[10]:
   ArtistID  Artist        Similar_artists  Weight
0  1         MALICE MIZER  Moi dix Mois     1.000000
1  1         MALICE MIZER  Közi             0.851889
2  1         MALICE MIZER  LAREINE          0.807094
3  1         MALICE MIZER  BAISER           0.595696
4  1         MALICE MIZER  Kaya             0.585429

Some of the cells in this notebook are not run; instead, the data is read from dumps previously generated by the functions in those cells. This is done because the information retrieval process itself is computationally expensive and time-consuming.