In [1]:
%load_ext watermark

In [2]:
%watermark -a 'Sebastian Raschka' -d -v


Sebastian Raschka 07/12/2014 

CPython 3.4.2
IPython 2.3.0

[More information](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/ipython_magic/watermark.ipynb) about the `watermark` magic command extension.



Data Collection




Sections




Downloading the Dataset

A subset of 10,000 songs in HDF5 format was downloaded from the Million Song Dataset. A feature list of the file contents can be found here.

The following snippet flattens the directory tree that the Million Song subset comes in:


In [3]:
import os

dir_tree = '/Users/sebastian/Desktop/MillionSongSubset/'

for dir_path, dir_names, file_names in os.walk(dir_tree):
    for file_name in file_names:
        try:
            os.rename(os.path.join(dir_path, file_name), os.path.join(dir_tree, file_name))
        except OSError:
            print("Could not move %s" % os.path.join(dir_path, file_name))



Compiling a Title-Artist Table

Now, we create a pandas DataFrame with the three feature columns file, artist, and title, where the artist and title are our input for the lyrics search, and the file name is merely kept for identification purposes.


In [16]:
import os
import pandas as pd

def make_artist_table(base):

    # Get file names
    files = [os.path.join(base, fn) for fn in os.listdir(base) if fn.endswith('.h5')]
    data = {'file':[], 'artist':[], 'title':[]}

    # Add artist and title data to dictionary
    for f in files:
        store = pd.HDFStore(f)
        title = store.root.metadata.songs.cols.title[0]
        artist = store.root.metadata.songs.cols.artist_name[0]
        data['file'].append(os.path.basename(f))
        data['title'].append(title.decode("utf-8"))
        data['artist'].append(artist.decode("utf-8"))
        store.close()
    
    # Convert dictionary to pandas DataFrame
    df = pd.DataFrame.from_dict(data, orient='columns')
    df = df[['file', 'artist', 'title']]
    return df

In [17]:
base = '/Users/sebastian/Desktop/MillionSongSubset/'
df = make_artist_table(base)

df.tail()


Out[17]:
file artist title
9996 TRBIJMU12903CF892B.h5 Moonspell The Hanged Man
9997 TRBIJNF128F14815A7.h5 Danny Williams The Wonderful World Of The Young
9998 TRBIJNK128F93093EC.h5 Winston Reedy Sentimental Man
9999 TRBIJRN128F425F3DD.h5 Myrick "Freeze" Guillory Zydeco In D-Minor
10000 TRBIJYB128F14AE326.h5 Seventh Day Slumber Shattered Life



Downloading Lyrics

First, we add a new column for the lyrics to our DataFrame.


In [20]:
df['lyrics'] = pd.Series('', index=df.index)
df.tail()


Out[20]:
file artist title lyrics
9996 TRBIJMU12903CF892B.h5 Moonspell The Hanged Man
9997 TRBIJNF128F14815A7.h5 Danny Williams The Wonderful World Of The Young
9998 TRBIJNK128F93093EC.h5 Winston Reedy Sentimental Man
9999 TRBIJRN128F425F3DD.h5 Myrick "Freeze" Guillory Zydeco In D-Minor
10000 TRBIJYB128F14AE326.h5 Seventh Day Slumber Shattered Life

Then, we use the following code to download the song lyrics from LyricWikia based on the artist and title names in the pandas DataFrame.


In [24]:
# Sebastian Raschka, 2014
# 
# Script to download lyrics from http://lyrics.wikia.com/

import unicodedata
import urllib.parse
import lxml.html

class Song(object):
    def __init__(self, artist, title):
        self.artist = self.__format_str(artist)
        self.title = self.__format_str(title)
        self.url = None
        self.lyric = None
        
    def __format_str(self, s):
        s = s.strip()
        try:
            # strip accents
            s = ''.join(c for c in unicodedata.normalize('NFD', s)
                         if unicodedata.category(c) != 'Mn')
        except:
            pass
        s = s.title()
        return s
        
    def __quote(self, s):
        return urllib.parse.quote(s.replace(' ', '_'))

    def __make_url(self):
        artist = self.__quote(self.artist)
        title = self.__quote(self.title)
        artist_title = '%s:%s' %(artist, title)
        url = 'http://lyrics.wikia.com/' + artist_title
        self.url = url
        
    def update(self, artist=None, title=None):
        if artist:
            self.artist = self.__format_str(artist)
        if title:
            self.title = self.__format_str(title)
        
    def lyricwikia(self):
        self.__make_url()
        try:
            doc = lxml.html.parse(self.url)
            lyricbox = doc.getroot().cssselect('.lyricbox')[0]
        except (IOError, IndexError) as e:
            self.lyric = ''
            return self.lyric
        lyrics = []

        for node in lyricbox:
            if node.tag == 'br':
                lyrics.append('\n')
            if node.tail is not None:
                lyrics.append(node.tail)
        self.lyric = ''.join(lyrics).strip()
        return self.lyric

If this script doesn't work for you, you can find some alternatives to download lyrics from other websites in my datacollect repository.

Example:


In [25]:
song = Song(artist='John Mellencamp', title='Jack and Diane')
lyr = song.lyricwikia()
print(lyr)


A little ditty about Jack and Diane
Two American kids growin' up in the heartland
Jackie gonna be a football star
Diane's a debutante, backseat of Jackie's car

Suckin' on a chili dog outside the Tastee-Freez
Diane's sittin' on Jackie's lap
He's got his hands between her knees
Jackie say, "Hey Diane, let's run off behind the shady trees
Dribble off those Bobbie Brooks, let me do what I please."
And say uh

Oh yeah, life goes on
Long after the thrill of livin' is gone, they say uh
Oh yeah, life goes on
Long after the thrill of livin' is gone, they walk on

Jackie sits back, collects his thoughts for the moment
Scratches his head and does his best James Dean
"Well then there Diane, we oughta run off to the city."
Diane says, "Baby, you ain't missin' nothing."
And Jackie say uh

Oh yeah, life goes on
Long after the thrill of livin' is gone
Oh yeah, they say life goes on
Long after the thrill of livin' is gone

Gonna let it rock
Let it roll
Let the Bible Belt come and save my soul
Hold on to sixteen as long as you can
Changes come around real soon
Make us women and men

Oh yeah, life goes on
Long after the thrill of livin' is gone
Oh yeah, they say life goes on
Long after the thrill of livin' is gone

A little ditty about Jack and Diane
Two American kids doin' the best they can



Adding lyrics to the DataFrame


In [26]:
import pyprind

In [27]:
pbar = pyprind.ProgBar(df.shape[0])
for row_id in df.index:
    song = Song(artist=df.loc[row_id]['artist'], title=df.loc[row_id]['title'])
    lyr = song.lyricwikia()
    df.loc[row_id,'lyrics'] = lyr
    pbar.update()


0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 5709.172 sec

In [28]:
print('Downloaded lyrics for %s songs' % sum(df.lyrics != ''))
df.head()


Downloaded lyrics for 3142 songs
Out[28]:
file artist title lyrics
0 subset_msd_summary_file.h5 Mastodon Deep Sea Creature Knowing right, learning wrong\nWhat you're fee...
1 TRAAAAW128F429D538.h5 Casual I Didn't Mean To Verse One:\n\nAlright I might\nHave had a litt...
2 TRAAABD128F429CF47.h5 The Box Tops Soul Deep Darling, I don't know much\nBut I know I love ...
3 TRAAADZ128F9348C2E.h5 Sonora Santanera Amor De Cabaret
4 TRAAAEF128F4273421.h5 Adam Ant Something Girls Adam Ant/Marco Pirroni\nEvery girl is a someth...

In [29]:
df.to_csv('/Users/sebastian/Desktop/df_lyr_backup.csv')

Remove Rows where Lyrics are not available

If lyrics were not available, this can be due to one of the following reasons:

  • the URL was not parsed correctly
  • the song does not exist in the LyricWikia database
  • the song is an instrumental song
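Before dropping those rows, it can be useful to count how many lyrics are actually missing. The snippet below is a minimal sketch of that bookkeeping, using a small toy DataFrame with hypothetical values in place of the real 10,000-song table:

```python
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical values)
df = pd.DataFrame({'file': ['a.h5', 'b.h5', 'c.h5'],
                   'artist': ['X', 'Y', 'Z'],
                   'title': ['T1', 'T2', 'T3'],
                   'lyrics': ['some lyrics', '', 'more lyrics']})

# Count rows with an empty lyrics field, then keep only the non-empty ones
n_missing = (df.lyrics == '').sum()
df = df[df.lyrics != '']

print('%s songs without lyrics' % n_missing)  # 1
print('%s songs remain' % df.shape[0])        # 2
```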

In [30]:
df = df[df.lyrics!='']



Language Filter

Now, we remove all lyrics that are not in English. Basically, we say that if a song contains more English than non-English words (> 50%), then it is an English song. We use this relatively generous cutoff ratio of 0.5 since song lyrics likely also contain names and other special words that are not part of a common English dictionary.

Example:


In [32]:
import nltk

def eng_ratio(text):
    ''' Returns the ratio of unique non-English words to all unique words in a text '''

    english_vocab = set(w.lower() for w in nltk.corpus.words.words()) 
    text_vocab = set(w.lower() for w in text.split() if w.lower().isalpha()) 
    unusual = text_vocab.difference(english_vocab)
    diff = len(unusual)/len(text_vocab)
    return diff
    
text = 'This is a test fahrrad'

print(eng_ratio(text))


0.2



Remove all non-English lyrics


In [33]:
before = df.shape[0]
for row_id in df.index:
    text = df.loc[row_id]['lyrics']
    diff = eng_ratio(text)
    if diff >= 0.5:
        df = df[df.index != row_id]
after = df.shape[0]
rem = before - after
print('%s songs have been removed.' % rem)
print('%s songs remain in the dataset.' % after)


372 songs have been removed.
2770 songs remain in the dataset.

In [34]:
df.to_csv('/Users/sebastian/Desktop/df_lyr_backup2.csv', index=False)



Create a filtered dataset

Now, we copy all songs for which the lyrics exist to a new directory.


In [35]:
import os
import shutil

new_dir = '/Users/sebastian/Desktop/h5_filtered/'
if not os.path.exists(new_dir):
    os.mkdir(new_dir)

h1 = '/Users/sebastian/Desktop/MillionSongSubset/'
filepaths = [os.path.join(h1, f) for f in os.listdir(h1) if f.endswith('.h5')]

for f in filepaths:
    base = os.path.basename(f)
    if base in df.file.values:
        target = os.path.join(new_dir, base)
        shutil.copyfile(f, target)



Randomly partition the dataset into separate training and validation sets

In this step, the dataset is reduced to a "reasonable" amount for the manual labeling step: 1000 songs for the training dataset and 200 songs for the validation dataset.


In [38]:
import random

h2 = '/Users/sebastian/Desktop/h5_filtered/'
filepaths2 = [os.path.join(h2, f) for f in os.listdir(h2) if f.endswith('.h5')]
random.shuffle(filepaths2)

train_dir = '../../dataset/training/h5_train/'
valid_dir = '../../dataset/validation/h5_valid/'
aux_dir = '../../dataset/auxiliary/h5_aux/'

for d in (train_dir, valid_dir, aux_dir):
    if not os.path.exists(d):
        os.mkdir(d)

for f in filepaths2[:1000]:
    base = os.path.basename(f)
    target = os.path.join(train_dir, base)
    shutil.copyfile(f, target)
  
for f in filepaths2[1000:1200]:
    base = os.path.basename(f)
    target = os.path.join(valid_dir, base)
    shutil.copyfile(f, target)

for f in filepaths2[1200:]:
    base = os.path.basename(f)
    target = os.path.join(aux_dir, base)
    shutil.copyfile(f, target)



Make new CSV tables for the Training and Validation dataset


In [88]:
df_train = make_artist_table('../../dataset/training/h5_train')
df_train['lyrics'] = pd.Series('', index=df_train.index)

pbar = pyprind.ProgBar(df_train.shape[0])
for row_id in df_train.index:
    song = Song(artist=df_train.loc[row_id]['artist'], title=df_train.loc[row_id]['title'])
    lyr = song.lyricwikia()
    df_train.loc[row_id, 'lyrics'] = lyr
    pbar.update()


0%                          100%
[##############################]
Total time elapsed: 60.924 sec

In [89]:
df_valid = make_artist_table('../../dataset/validation/h5_valid')
df_valid['lyrics'] = pd.Series('', index=df_valid.index)

pbar = pyprind.ProgBar(df_valid.shape[0])
for row_id in df_valid.index:
    song = Song(artist=df_valid.loc[row_id]['artist'], title=df_valid.loc[row_id]['title'])
    lyr = song.lyricwikia()
    df_valid.loc[row_id, 'lyrics'] = lyr
    pbar.update()


0%                          100%
[##############################]
Total time elapsed: 5.997 sec

In [90]:
df_train.to_csv('../../dataset/training/train_lyrics_1000.csv')
df_valid.to_csv('../../dataset/validation/valid_lyrics_200.csv')



Adding year information


In [39]:
import pandas as pd

In [46]:
df = pd.read_csv('../../dataset/training/train_lyrics_1000.csv')
df.head()


Out[46]:
file artist title lyrics mood year
0 TRAAAAW128F429D538.h5 Casual I Didn't Mean To Verse One:\n\nAlright I might\nHave had a litt... sad 1994
1 TRAAAEF128F4273421.h5 Adam Ant Something Girls Adam Ant/Marco Pirroni\nEvery girl is a someth... happy 1982
2 TRAAAFD128F92F423A.h5 Gob Face the Ashes I've just erased it's been a while, I've got a... sad 2007
3 TRAABJV128F1460C49.h5 Lionel Richie Tonight Will Be Alright Little darling \nWhere you've been so long \nI... happy 1986
4 TRAABLR128F423B7E3.h5 Blue Rodeo Floating Lead Vocal by Greg\n\nWell, these late night c... sad 1987

In [48]:
import os

df['year'] = pd.Series('', index=df.index)

base = '../../dataset/training/h5_train/'
for row_id in df.index:
    filename = df.loc[row_id]['file']
    filepath = os.path.join(base, filename)
    store = pd.HDFStore(filepath)
    year = store.root.musicbrainz.songs.cols.year[0]
    store.close()
    df.loc[row_id, 'year'] = year

In [49]:
df[['file', 'artist', 'title','lyrics','year']].tail()


Out[49]:
file artist title lyrics year
995 TRBIGRY128F42597B3.h5 Sade All About Our Love Its all about our love\nSo shall it be forever... 2000
996 TRBIIEU128F9307C88.h5 New Found Glory Don't Let Her Pull You Down It's time that I rain on your parade\nWatch as... 2009
997 TRBIIJY12903CE4755.h5 Mindy McCready Ten Thousand Angels Speakin of the devil\nLook who just walked in\... 1996
998 TRBIIOT128F423C594.h5 Joy Division Leaders Of Men Born from some mother's womb\nJust like any ot... 1978
999 TRBIJYB128F14AE326.h5 Seventh Day Slumber Shattered Life This wanting more from me is tearing me, it's ... 2005

In [22]:
df.to_csv('../../dataset/training/train_lyrics_1000.csv', index=False)

Missing year labels were manually added based on information from http://www.allmusic.com.
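In the Million Song Dataset, a year value of 0 marks an unknown release year, so those are the rows that require a manual label. A minimal sketch for listing them, again using a toy DataFrame with hypothetical values in place of the real training table:

```python
import pandas as pd

# Toy stand-in for the training table; year == 0 marks an unknown release year
df = pd.DataFrame({'file': ['a.h5', 'b.h5', 'c.h5'],
                   'title': ['T1', 'T2', 'T3'],
                   'year': [1994, 0, 2007]})

# Rows that still need a manual year label
missing_year = df[df.year == 0]
print(missing_year[['file', 'title']])
```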