In [1]:
%load_ext watermark
In [2]:
%watermark -a 'Sebastian Raschka' -d -v
[More information](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/ipython_magic/watermark.ipynb) about the `watermark` magic command extension.
A subset of 10,000 songs in HDF5 format was downloaded from the Million Song Dataset. A feature list of the file contents can be found here.
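Each `.h5` file bundles several groups of song data (e.g., `metadata`, `analysis`, and `musicbrainz`). As a quick sanity check, here is a minimal sketch for peeking into a single file via pandas' `HDFStore`, mirroring the access pattern used below; the file name is just a hypothetical example:

```python
import pandas as pd

# Hypothetical example file; any .h5 file from the subset works
store = pd.HDFStore('TRAXLZU12903D05F94.h5', mode='r')

# The song metadata lives in the /metadata/songs table
print(store.root.metadata.songs.cols.title[0])
print(store.root.metadata.songs.cols.artist_name[0])

store.close()
```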
The following snippet flattens the directory tree that the Million Song subset comes in:
In [3]:
import os

dir_tree = '/Users/sebastian/Desktop/MillionSongSubset/'

# Move every file out of the nested directory tree into the top-level directory
for dir_path, dir_names, file_names in os.walk(dir_tree):
    for file_name in file_names:
        try:
            os.rename(os.path.join(dir_path, file_name),
                      os.path.join(dir_tree, file_name))
        except OSError:
            print("Could not move %s" % os.path.join(dir_path, file_name))
Now, we create a pandas DataFrame with the three feature columns `file`, `artist`, and `title`, where `artist` and `title` are our input for the lyrics search, and the `file` name merely serves identification purposes.
In [16]:
import os
import pandas as pd

def make_artist_table(base):
    # Collect the paths of all HDF5 files in the base directory
    files = [os.path.join(base, fn) for fn in os.listdir(base) if fn.endswith('.h5')]
    data = {'file': [], 'artist': [], 'title': []}
    # Read the artist and title metadata from each HDF5 file
    for f in files:
        store = pd.HDFStore(f)
        title = store.root.metadata.songs.cols.title[0]
        artist = store.root.metadata.songs.cols.artist_name[0]
        data['file'].append(os.path.basename(f))
        data['title'].append(title.decode("utf-8"))
        data['artist'].append(artist.decode("utf-8"))
        store.close()
    # Convert the dictionary to a pandas DataFrame with a fixed column order
    df = pd.DataFrame.from_dict(data, orient='columns')
    df = df[['file', 'artist', 'title']]
    return df
In [17]:
base = '/Users/sebastian/Desktop/MillionSongSubset/'
df = make_artist_table(base)
df.tail()
Out[17]:
First, we add a new column for the lyrics to our DataFrame.
In [20]:
df['lyrics'] = pd.Series('', index=df.index)
df.tail()
Out[20]:
Then, we use the following code to download the song lyrics from LyricWikia based on the artist and title names in the pandas DataFrame.
In [24]:
# Sebastian Raschka, 2014
#
# Script to download lyrics from http://lyrics.wikia.com/

import unicodedata
import urllib.parse

import lxml.html


class Song(object):
    def __init__(self, artist, title):
        self.artist = self.__format_str(artist)
        self.title = self.__format_str(title)
        self.url = None
        self.lyric = None

    def __format_str(self, s):
        # Strip surrounding whitespace, remove accents, and convert to title case
        s = s.strip()
        try:
            s = ''.join(c for c in unicodedata.normalize('NFD', s)
                        if unicodedata.category(c) != 'Mn')
        except TypeError:
            pass
        s = s.title()
        return s

    def __quote(self, s):
        return urllib.parse.quote(s.replace(' ', '_'))

    def __make_url(self):
        artist = self.__quote(self.artist)
        title = self.__quote(self.title)
        artist_title = '%s:%s' % (artist, title)
        self.url = 'http://lyrics.wikia.com/' + artist_title

    def update(self, artist=None, title=None):
        if artist:
            self.artist = self.__format_str(artist)
        if title:
            self.title = self.__format_str(title)

    def lyricwikia(self):
        self.__make_url()
        try:
            doc = lxml.html.parse(self.url)
            lyricbox = doc.getroot().cssselect('.lyricbox')[0]
        except (IOError, IndexError):
            self.lyric = ''
            return self.lyric
        lyrics = []
        for node in lyricbox:
            # <br> tags separate lines; the lyric text follows in the tail
            if node.tag == 'br':
                lyrics.append('\n')
            if node.tail is not None:
                lyrics.append(node.tail)
        self.lyric = ''.join(lyrics).strip()
        return self.lyric
If this script doesn't work for you, you can find some alternative scripts for downloading lyrics from other websites in my datacollect repository.
In [25]:
song = Song(artist='John Mellencamp', title='Jack and Diane')
lyr = song.lyricwikia()
print(lyr)
In [26]:
import pyprind
In [27]:
pbar = pyprind.ProgBar(df.shape[0])

for row_id in df.index:
    song = Song(artist=df.loc[row_id]['artist'], title=df.loc[row_id]['title'])
    lyr = song.lyricwikia()
    df.loc[row_id, 'lyrics'] = lyr
    pbar.update()
In [28]:
print('Downloaded lyrics for %s songs' % sum(df.lyrics != ''))
df.head()
Out[28]:
In [29]:
df.to_csv('/Users/sebastian/Desktop/df_lyr_backup.csv')
If lyrics were not available, this can be due to one of several reasons: for example, the song may not be listed on LyricWikia, the artist or title in the metadata may be spelled differently from the LyricWikia page, or the song may be an instrumental.
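In such cases, it can be worth retrying a song with a corrected spelling before discarding it; the `Song.update` method rewrites the lookup URL accordingly. A minimal sketch (the names below are hypothetical examples):

```python
song = Song(artist='Sinead OConnor', title='Nothing Compares 2 U')
if not song.lyricwikia():
    # Retry with a hypothetical corrected artist spelling
    song.update(artist="Sinead O'Connor")
    lyr = song.lyricwikia()
```

Here, we simply remove all rows with empty lyrics: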
In [30]:
df = df[df.lyrics!='']
Now, we remove all lyrics that are not in English. As a simple heuristic, we consider a song English if more than 50% of its words appear in an English dictionary. This relatively lenient cutoff ratio of 0.5 accounts for the fact that song lyrics typically also contain names and other special words that are not part of a common English dictionary.
In [32]:
import nltk

# Build the English vocabulary set once (requires the NLTK 'words' corpus,
# available via nltk.download('words')); rebuilding it inside the function
# would repeat this expensive step for every song
english_vocab = set(w.lower() for w in nltk.corpus.words.words())

def eng_ratio(text):
    '''Returns the fraction of words in a text that are not found in an English dictionary.'''
    text_vocab = set(w.lower() for w in text.split() if w.lower().isalpha())
    unusual = text_vocab.difference(english_vocab)
    diff = len(unusual) / len(text_vocab)
    return diff

text = 'This is a test fahrrad'
print(eng_ratio(text))
In [33]:
before = df.shape[0]

for row_id in df.index:
    text = df.loc[row_id]['lyrics']
    diff = eng_ratio(text)
    if diff >= 0.5:
        df = df[df.index != row_id]

after = df.shape[0]
rem = before - after

print('%s songs have been removed.' % rem)
print('%s songs remain in the dataset.' % after)
In [34]:
df.to_csv('/Users/sebastian/Desktop/df_lyr_backup2.csv', index=False)
Now, we copy all songs for which the lyrics exist to a new directory.
In [35]:
import os
import shutil

new_dir = '/Users/sebastian/Desktop/h5_filtered/'
if not os.path.exists(new_dir):
    os.mkdir(new_dir)

h1 = '/Users/sebastian/Desktop/MillionSongSubset/'
filepaths = [os.path.join(h1, f) for f in os.listdir(h1) if f.endswith('.h5')]

# Copy only the files whose lyrics survived the filtering steps
for f in filepaths:
    base = os.path.basename(f)
    if base in df.file.values:
        target = os.path.join(new_dir, base)
        shutil.copyfile(f, target)
In this step, the dataset is reduced to a "reasonable" size for manual labeling: 1,000 songs for the training dataset and 200 songs for the validation dataset; the remaining songs go into an auxiliary directory.
In [38]:
import random

h2 = '/Users/sebastian/Desktop/h5_filtered/'
filepaths2 = [os.path.join(h2, f) for f in os.listdir(h2) if f.endswith('.h5')]
random.shuffle(filepaths2)

train_dir = '../../dataset/training/h5_train/'
valid_dir = '../../dataset/validation/h5_valid/'
aux_dir = '../../dataset/auxiliary/h5_aux/'

# makedirs also creates missing intermediate directories
for d in (train_dir, valid_dir, aux_dir):
    if not os.path.exists(d):
        os.makedirs(d)

for f in filepaths2[:1000]:
    base = os.path.basename(f)
    target = os.path.join(train_dir, base)
    shutil.copyfile(f, target)

for f in filepaths2[1000:1200]:
    base = os.path.basename(f)
    target = os.path.join(valid_dir, base)
    shutil.copyfile(f, target)

for f in filepaths2[1200:]:
    base = os.path.basename(f)
    target = os.path.join(aux_dir, base)
    shutil.copyfile(f, target)
In [88]:
df_train = make_artist_table('../../dataset/training/h5_train')
df_train['lyrics'] = pd.Series('', index=df_train.index)

pbar = pyprind.ProgBar(df_train.shape[0])
for row_id in df_train.index:
    song = Song(artist=df_train.loc[row_id]['artist'], title=df_train.loc[row_id]['title'])
    lyr = song.lyricwikia()
    df_train.loc[row_id, 'lyrics'] = lyr
    pbar.update()
In [89]:
df_valid = make_artist_table('../../dataset/validation/h5_valid')
df_valid['lyrics'] = pd.Series('', index=df_valid.index)

pbar = pyprind.ProgBar(df_valid.shape[0])
for row_id in df_valid.index:
    song = Song(artist=df_valid.loc[row_id]['artist'], title=df_valid.loc[row_id]['title'])
    lyr = song.lyricwikia()
    df_valid.loc[row_id, 'lyrics'] = lyr
    pbar.update()
In [90]:
df_train.to_csv('../../dataset/training/train_lyrics_1000.csv')
df_valid.to_csv('../../dataset/validation/valid_lyrics_200.csv')
In [39]:
import pandas as pd
In [46]:
df = pd.read_csv('../../dataset/training/train_lyrics_1000.csv')
df.head()
Out[46]:
In [48]:
import os

df['year'] = pd.Series('', index=df.index)

base = '../../dataset/training/h5_train/'

# Look up the release year for each song in its HDF5 file
for row_id in df.index:
    filename = df.loc[row_id]['file']
    filepath = os.path.join(base, filename)
    store = pd.HDFStore(filepath)
    year = store.root.musicbrainz.songs.cols.year[0]
    store.close()
    df.loc[row_id, 'year'] = year
In [49]:
df[['file', 'artist', 'title','lyrics','year']].tail()
Out[49]:
In [22]:
df.to_csv('../../dataset/training/train_lyrics_1000.csv', index=False)
Missing year labels were manually added based on information from http://www.allmusic.com.
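In the Million Song Dataset, a release year of 0 indicates that the year is unknown, so the rows that required manual labeling can be listed with a simple filter; a minimal sketch, assuming the `year` column still holds the raw integer values:

```python
missing = df[df['year'] == 0]
print('%s songs without a year label' % missing.shape[0])
print(missing[['file', 'artist', 'title']])
```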