In [1]:
import pandas as pd
import requests
from urllib.parse import quote
from artist_api import *
The artist dataset contains the artist ids, names, urls, and picture URLs.
In [2]:
artists_df = pd.read_csv('artists.dat', sep='\t', header=0, index_col=0, skipinitialspace=True)
artists_df.head()
Out[2]:
In this notebook we extract the unique mbid code identifier of each artist using the api. This mbid code is used later on to extract the artists similar to each artist.
To match an artist extracted from the api by name to the artist in our dataset, we search through the url of each artist with the same name and cross-compare both urls, picking only the one that matches.
There are three cases that can happen here: the urls match and the mbid is returned, no matching artist is found and the mbid is set to 'notfound', or the request fails and the mbid is left null.
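As a minimal sketch of what parse_artists might do (the author's artist_api module is not shown; the sketch assumes the Last.fm artist.search endpoint, and API_KEY is a placeholder):
import requests
from urllib.parse import quote

API_KEY = 'YOUR_LASTFM_API_KEY'  # placeholder, not a real key

def parse_artists_sketch(row):
    """Search for the artist by name and return the mbid whose url matches ours."""
    url = ('http://ws.audioscrobbler.com/2.0/?method=artist.search'
           '&artist=' + quote(row['name']) + '&api_key=' + API_KEY + '&format=json')
    try:
        candidates = requests.get(url).json()['results']['artistmatches']['artist']
    except (requests.RequestException, KeyError, ValueError):
        return None  # request failed -> leave the mbid null
    for candidate in candidates:
        # cross-compare the candidate's url with the one in our dataset
        if candidate.get('url') == row['url']:
            return candidate.get('mbid') or 'notfound'
    return 'notfound'  # no candidate with a matching url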
In [ ]:
artists_df['mbid'] = artists_df.apply(parse_artists, axis=1)
In [3]:
artists_df_mbid = pd.read_csv('artist_mbid_codes.csv', sep='\t', header=0, index_col=0, skipinitialspace=True)
In [4]:
artists_df_mbid.head()
Out[4]:
As mentioned above, there are cases where no mbid code is returned for an artist, so we filter the dataframe to keep only the artists whose mbid code was fetched.
In [5]:
print("The number of artists without an mbid code :",\
len(artists_df_mbid[artists_df_mbid.mbid=='notfound']) + len(artists_df_mbid[artists_df_mbid.mbid.isnull()]))
In [6]:
print("The total number of artists we have :",len(artists_df))
In [7]:
artists_df_mbid = artists_df_mbid[(artists_df_mbid.mbid != 'notfound') & artists_df_mbid.mbid.notnull()]
In [8]:
len(artists_df_mbid)
Out[8]:
Now that we have the codes for the artists, we can extract all the artists similar to them. Keep in mind that this function returns a massive amount of information about those similar artists, one piece of which is of particular interest to us: the "match" value, a number between 0 and 1 describing how similar the two artists are, 1 being the most similar.
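For reference, the call for a single artist could look like this sketch, assuming the Last.fm artist.getSimilar endpoint (the author's similar_artists helper may differ; API_KEY is again a placeholder):
import requests

API_KEY = 'YOUR_LASTFM_API_KEY'  # placeholder

def similar_artists_sketch(row):
    """Return the list of similar-artist dicts for one artist's mbid."""
    url = ('http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar'
           '&mbid=' + row['mbid'] + '&api_key=' + API_KEY + '&format=json')
    try:
        # each returned dict carries 'name', 'mbid', 'url' and the 'match' score
        return requests.get(url).json()['similarartists']['artist']
    except (requests.RequestException, KeyError, ValueError):
        return None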
In [ ]:
artists_df_mbid['similar_artists'] = artists_df_mbid.apply(similar_artists, axis=1)
In [ ]:
columns=['ArtistID','Artist','Similar_artists','Weight']
artist_artist = pd.DataFrame(columns=columns)
For each artist, 100 similar artists are returned. To reduce the complexity of the problem and the computational time, we only keep the top 10 similar artists and save them into a dataframe that contains the artist id, the artist, the top 10 similar artists, and their corresponding weights.
In [ ]:
for index, row in artists_df_mbid.iterrows():
    # main artist id
    artist_id = index
    # main artist name
    artist_name = row['name']
    # list of dictionaries of similar artists information
    similar_dict = row['similar_artists']
    # the number of similar artists needed
    range_used = 10
    try:
        # in case the list contains fewer than 10 artists, set the range to that number
        if len(similar_dict) < 10:
            range_used = len(similar_dict)
    except TypeError:
        # the request failed for this artist, so there is no list to iterate over
        continue
    for i in range(range_used):
        # append the information obtained to the new dataframe
        similar_artist_name = similar_dict[i]['name']
        similar_artist_match = similar_dict[i]['match']
        artist_artist = artist_artist.append({'ArtistID': artist_id, 'Artist': artist_name,
                                              'Similar_artists': similar_artist_name,
                                              'Weight': similar_artist_match}, ignore_index=True)
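As a side note, appending to a dataframe inside a loop copies the frame on every iteration, and DataFrame.append was removed in pandas 2.0. An equivalent sketch that collects plain dicts first and builds the frame once:
rows = []
for index, row in artists_df_mbid.iterrows():
    similar_dict = row['similar_artists']
    if not isinstance(similar_dict, list):
        continue  # skip artists whose request failed
    for entry in similar_dict[:10]:  # keep only the top 10 similar artists
        rows.append({'ArtistID': index, 'Artist': row['name'],
                     'Similar_artists': entry['name'], 'Weight': entry['match']})
artist_artist = pd.DataFrame(rows, columns=columns)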
In [9]:
artist_artist_connections = pd.read_csv('artist_artist_connections.csv', sep='\t', header=0, index_col=0, skipinitialspace=True)
Sample output
In [10]:
artist_artist_connections.head()
Out[10]:
Some of the cells in this notebook are not run; instead, the data is read from dumps previously generated by the functions in those cells. This is done because the information retrieval process itself is computationally expensive and time consuming.
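The cache-or-fetch pattern this implies could be sketched as follows; the file name mirrors the dump read earlier, but the recompute branch is an assumption about how the dumps were produced:
import os
import pandas as pd

DUMP = 'artist_mbid_codes.csv'

if os.path.exists(DUMP):
    # reuse the previously generated dump
    artists_df_mbid = pd.read_csv(DUMP, sep='\t', header=0, index_col=0)
else:
    # expensive path: query the api, then save the result for next time
    artists_df['mbid'] = artists_df.apply(parse_artists, axis=1)
    artists_df.to_csv(DUMP, sep='\t')
    artists_df_mbid = artists_df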