In this notebook we look at the problem of classifying songs into three genres (rap, rock, and country) based on a simple binary bag-of-words representation. First we load the data and take a look at it. Using our implementation of discrete random variables we generate new random songs. Finally we show how classification can be performed using Bayes' rule. The data comes from the lyrics of songs in the Million Song Dataset and was created for an assignment in my course on MIR.
The data layout and the way the classifier is implemented are neither general nor optimal, but are done this way for pedagogical purposes.
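As a quick illustration of the representation (a minimal sketch with a made-up three-word vocabulary, not the actual dictionary loaded below), a song's word counts are reduced to a binary vector that only records whether each word occurs:
import numpy as np
# hypothetical counts for the words ['love', 'road', 'money'] in one song
counts = np.array([4, 0, 1])
bow = (counts > 0).astype(int)   # 1 if the word occurs at all, 0 otherwise
print(bow)                       # -> [1 0 1]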
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pickle
import numpy as np
# load some lyrics bag of words data, binarize, separate matrix rows by genre
data = np.load('data.npz')
a = data['arr_0']
a[a > 0] = 1
labels = np.load('labels.npz')
labels = labels['arr_0']
dictionary = pickle.load(open('dictionary.pck','rb'), encoding='latin1')
word_indices = [ 41, 1465, 169, 217, 1036, 188, 260, 454, 173, 728, 163,
151, 107, 142, 90, 141, 161, 131, 86, 73, 165, 133,
84, 244, 153, 126, 137, 119, 80, 224]
words = [dictionary[r] for r in word_indices]
# binary row vectors separated by genre (1000 songs each)
ra_rows = a[0:1000,:]
ro_rows = a[1000:2000,:]
co_rows = a[2000:3000,:]
print(ra_rows.shape, ro_rows.shape, co_rows.shape)
plt.imshow(a, aspect='auto', cmap='gray')
Out[2]:
Let's calculate the word probability vector for each genre and then look at the most probable words for each genre in our data, as well as how particular songs are represented as bags of words.
In [5]:
# estimate smoothed word probabilities for each genre: (count + 1) / 1001 keeps every probability nonzero
word_probs_ra = (ra_rows.sum(axis=0).astype(float) + 1.0) / 1001.
word_probs_ro = (ro_rows.sum(axis=0).astype(float) + 1.0) / 1001.
word_probs_co = (co_rows.sum(axis=0).astype(float) + 1.0) / 1001.
word_probs_ro
words
Out[5]:
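To make the smoothing in the cell above concrete, here is a quick check with a hypothetical word that occurs in 600 of the 1000 rap songs; a word that never occurs still gets probability 1/1001 rather than zero, which keeps the likelihood products used later from collapsing to zero for unseen words.
count = 600                      # hypothetical number of rap songs containing the word
print((count + 1.0) / 1001.)     # -> 0.6004..., matching the formula in the cell above
print((0 + 1.0) / 1001.)         # -> 0.000999..., never-seen words keep a small nonzero probability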
In [6]:
#let's look at the bag of words for three particular songs
track_id = 500
print("RAP 342:",[words[i] for i,r in enumerate(ra_rows[track_id]) if r==1])
print("ROCK 342:",[words[i] for i,r in enumerate(ro_rows[track_id]) if r==1])
print("COUNTRY 342:",[words[i] for i,r in enumerate(co_rows[track_id]) if r==1])
In [812]:
# let's look at the k most probable words for each genre
k = 10
[[words[x] for x in np.argpartition(word_probs_ra, -k)[-k:]],
[words[x] for x in np.argpartition(word_probs_ro, -k)[-k:]],
[words[x] for x in np.argpartition(word_probs_co, -k)[-k:]]]
#word_probs_ra
Out[812]:
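Note that np.argpartition only guarantees that the last k indices correspond to the k largest probabilities; it does not sort them. If a ranked list is wanted, a minimal alternative sketch is:
# indices of the k most probable rap words, highest probability first
top_k = np.argsort(word_probs_ra)[::-1][:k]
print([words[x] for x in top_k])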
In [13]:
# quick aside: enumerate pairs each index with its value (used in the comprehensions and functions below)
funny_list = ['a','b','c']
print(funny_list)
for (i,w) in enumerate(funny_list):
    print(i,w)
Now let's generate some random songs, represented as bags of words, using the calculated word probabilities.
In [8]:
print('Random rap', [w for (i,w) in enumerate(words) if np.greater(word_probs_ra, np.random.rand(30))[i]])
print('Random rock', [w for (i,w) in enumerate(words) if np.greater(word_probs_ro, np.random.rand(30))[i]])
print('Random country', [w for (i,w) in enumerate(words) if np.greater(word_probs_co, np.random.rand(30))[i]])
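Each line above makes an independent Bernoulli draw per word: a word is included when a uniform random number falls below its genre probability (the comprehensions redraw a fresh length-30 random vector for every word but only use its i-th element, which is statistically equivalent, just wasteful). An equivalent, more explicit sketch, assuming as above that the probability vectors and the words list have the same length:
draws = np.random.rand(len(word_probs_ra))          # one uniform draw per word
random_song = (draws < word_probs_ra).astype(int)   # Bernoulli sample of a binary bag of words
print('Random rap (alt):', [w for (w, bit) in zip(words, random_song) if bit == 1])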
Now let's look at classifying songs using a naive Bayes classifier with binary (Bernoulli) word features. We calculate the likelihood of a song under each genre independently by taking the product of the genre-dependent word probabilities: the probability of each word that is present and one minus that probability for each word that is absent. The genre with the highest likelihood is selected as the predicted class.
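In symbols, with x_i the binary indicator for word i in the song and p_i the genre's estimated probability for that word (notation introduced here only to restate the cell below):
P(song | genre) = ∏_i p_i^x_i · (1 − p_i)^(1 − x_i)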
In [15]:
# calculate the likelihood separately for each word
# using the naive Bayes assumption and multiply the per-word terms
# (typically a sum of log-likelihoods is used
# rather than a multiplication)
def likelihood(test_song, word_probs_for_genre):
    probability_product = 1.0
    for (i,w) in enumerate(test_song):
        if (w==1):
            probability = word_probs_for_genre[i]
        else:
            probability = 1.0 - word_probs_for_genre[i]
        probability_product *= probability
    return probability_product

def predict(test_song):
    scores = [likelihood(test_song, word_probs_ra),
              likelihood(test_song, word_probs_ro),
              likelihood(test_song, word_probs_co)]
    labels = ['rap', 'rock', 'country']
    return labels[np.argmax(scores)]

def predict_set(test_set, ground_truth_label):
    score = 0
    for r in test_set:
        if predict(r) == ground_truth_label:
            score += 1
    # convert the count to a percentage (each genre subset has 1000 songs)
    return score / 10.0
# predict a random country track
track_id = np.random.randint(1000)
print("Random track id", track_id)
test_song = co_rows[track_id]
print(predict(test_song))
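As the comment in the cell above notes, a sum of log-probabilities is usually preferred over a raw product to avoid numerical underflow when the vocabulary is large. A minimal sketch of that variant (not used elsewhere in this notebook; the name log_likelihood is chosen here for illustration):
def log_likelihood(test_song, word_probs_for_genre):
    # same Bernoulli model as likelihood(), but summing logs instead of multiplying probabilities
    log_prob_sum = 0.0
    for (i, w) in enumerate(test_song):
        p = word_probs_for_genre[i] if w == 1 else 1.0 - word_probs_for_genre[i]
        log_prob_sum += np.log(p)
    return log_prob_sum

# the predicted genre is unchanged because log is monotonic
print(np.argmax([log_likelihood(test_song, word_probs_ra),
                 log_likelihood(test_song, word_probs_ro),
                 log_likelihood(test_song, word_probs_co)]))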
In [853]:
# Let's evaluate how well our classifier does on the training set
# A more rigorous evaluation would use held-out data or cross-validation
print("Rap accuracy% = ", predict_set(ra_rows, 'rap'))
print("Rock accuracy% = ", predict_set(ro_rows, 'rock'))
print("Country accuracy% = ", predict_set(co_rows, 'country'))
In [16]:
test_song = co_rows[track_id]
In [17]:
print(test_song)
In [18]:
scores = [likelihood(test_song, word_probs_ra),
likelihood(test_song, word_probs_ro),
likelihood(test_song, word_probs_co)]
In [19]:
print(scores)