In [1]:
import pandas as pd
import requests
from urllib.parse import quote
from artist_api import *
The artist dataset contains the artist ids, names, urls, and picture URLs.
In [2]:
artists_df = pd.read_csv('artists.dat', sep='\t', header=0, index_col=0, skipinitialspace=True)
artists_df.head()
Out[2]:
In this notebook we extract the unique mbid code identifier of each artist using the api. This mbid code is used later on to extract the artists similar to each artist.
To match an artist extracted from the api by name to the artist in our dataset, we search through the url of each artist with the same name and cross-compare both urls, picking only the one that matches.
There are three cases that can happen here: the urls match and the mbid is returned, no matching artist is found and the mbid is set to 'notfound', or the request fails and the mbid is left null.
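As a minimal sketch of what parse_artists might do (the author's artist_api module is not shown; the sketch assumes the Last.fm artist.search endpoint, and API_KEY is a placeholder):
import requests
from urllib.parse import quote

API_KEY = 'YOUR_LASTFM_API_KEY'  # placeholder, not a real key

def parse_artists_sketch(row):
    """Search for the artist by name and return the mbid whose url matches ours."""
    url = ('http://ws.audioscrobbler.com/2.0/?method=artist.search'
           '&artist=' + quote(row['name']) + '&api_key=' + API_KEY + '&format=json')
    try:
        candidates = requests.get(url).json()['results']['artistmatches']['artist']
    except (requests.RequestException, KeyError, ValueError):
        return None  # request failed -> leave the mbid null
    for candidate in candidates:
        # cross-compare the candidate's url with the one in our dataset
        if candidate.get('url') == row['url']:
            return candidate.get('mbid') or 'notfound'
    return 'notfound'  # no candidate with a matching url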
In [ ]:
artists_df['mbid'] = artists_df.apply(parse_artists, axis=1)
In [3]:
artists_df_mbid = pd.read_csv('artist_mbid_codes.csv', sep='\t', header=0, index_col=0, skipinitialspace=True)
In [4]:
artists_df_mbid.head()
Out[4]:
As mentioned above, there are cases where no mbid code is returned for an artist, so we filter the dataframe to keep only the artists whose mbid code was fetched.
In [5]:
print("The number of artists without an mbid code :",\
len(artists_df_mbid[artists_df_mbid.mbid=='notfound']) + len(artists_df_mbid[artists_df_mbid.mbid.isnull()]))
In [6]:
print("The total number of artists we have :",len(artists_df))
In [7]:
artists_df_mbid = artists_df_mbid[(artists_df_mbid.mbid != 'notfound') & artists_df_mbid.mbid.notnull()]
In [8]:
len(artists_df_mbid)
Out[8]:
Now that we have the codes for the artists, we can extract all the artists similar to them. Keep in mind that this function returns a massive amount of information about those similar artists, one piece of which is of particular interest to us: the "match" value, a number between 0 and 1 describing how similar the two artists are, 1 being the most similar.
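For reference, the call for a single artist could look like this sketch, assuming the Last.fm artist.getSimilar endpoint (the author's similar_artists helper may differ; API_KEY is again a placeholder):
import requests

API_KEY = 'YOUR_LASTFM_API_KEY'  # placeholder

def similar_artists_sketch(row):
    """Return the list of similar-artist dicts for one artist's mbid."""
    url = ('http://ws.audioscrobbler.com/2.0/?method=artist.getsimilar'
           '&mbid=' + row['mbid'] + '&api_key=' + API_KEY + '&format=json')
    try:
        # each returned dict carries 'name', 'mbid', 'url' and the 'match' score
        return requests.get(url).json()['similarartists']['artist']
    except (requests.RequestException, KeyError, ValueError):
        return None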
In [ ]:
artists_df_mbid['similar_artists'] = artists_df_mbid.apply(similar_artists, axis=1)
In [ ]:
columns=['ArtistID','Artist','Similar_artists','Weight']
artist_artist = pd.DataFrame(columns=columns)
For each artist, 100 similar artists are returned. To reduce the complexity of the problem and the computational time, we only keep the top 10 similar artists and save them into a dataframe that contains the artist id, the artist, the top 10 similar artists, and their corresponding weights.
In [ ]:
for index, row in artists_df_mbid.iterrows():
    # main artist id
    artist_id = index
    # main artist name
    artist_name = row['name']
    # list of dictionaries of similar artists information
    similar_dict = row['similar_artists']
    # the number of similar artists needed
    range_used = 10
    try:
        # in case the list contains fewer than 10 artists, set the range to that number
        if len(similar_dict) < 10:
            range_used = len(similar_dict)
    except TypeError:
        # the request failed for this artist, so there is no list to iterate over
        continue
    for i in range(range_used):
        # append the information obtained to the new dataframe
        similar_artist_name = similar_dict[i]['name']
        similar_artist_match = similar_dict[i]['match']
        artist_artist = artist_artist.append({'ArtistID': artist_id, 'Artist': artist_name,
                                              'Similar_artists': similar_artist_name,
                                              'Weight': similar_artist_match}, ignore_index=True)
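As a side note, appending to a dataframe inside a loop copies the frame on every iteration, and DataFrame.append was removed in pandas 2.0. An equivalent sketch that collects plain dicts first and builds the frame once:
rows = []
for index, row in artists_df_mbid.iterrows():
    similar_dict = row['similar_artists']
    if not isinstance(similar_dict, list):
        continue  # skip artists whose request failed
    for entry in similar_dict[:10]:  # keep only the top 10 similar artists
        rows.append({'ArtistID': index, 'Artist': row['name'],
                     'Similar_artists': entry['name'], 'Weight': entry['match']})
artist_artist = pd.DataFrame(rows, columns=columns)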
In [9]:
artist_artist_connections = pd.read_csv('artist_artist_connections.csv', sep='\t', header=0, index_col=0, skipinitialspace=True)
Sample output
In [10]:
artist_artist_connections.head()
Out[10]:
Some of the cells in this notebook are not run; instead, the data is read from dumps previously generated by the functions in those cells. This is done because the information retrieval process itself is computationally expensive and time consuming.
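The cache-or-fetch pattern this implies could be sketched as follows; the file name mirrors the dump read earlier, but the recompute branch is an assumption about how the dumps were produced:
import os
import pandas as pd

DUMP = 'artist_mbid_codes.csv'

if os.path.exists(DUMP):
    # reuse the previously generated dump
    artists_df_mbid = pd.read_csv(DUMP, sep='\t', header=0, index_col=0)
else:
    # expensive path: query the api, then save the result for next time
    artists_df['mbid'] = artists_df.apply(parse_artists, axis=1)
    artists_df.to_csv(DUMP, sep='\t')
    artists_df_mbid = artists_df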