In [1]:
%load_ext watermark

In [2]:
%watermark -a 'Sebastian Raschka' -d -v


Sebastian Raschka 07/12/2014 

CPython 3.4.2
IPython 2.3.0



Redownloading the lyrics

For copyright reasons, the lyrics have been removed from the datasets in the public GitHub repository. You can re-download them by following the steps in this IPython notebook.



Load the CSV file


In [3]:
import pandas as pd

df = pd.read_csv('../../dataset/training/train_lyrics_rem_1000.csv')
df.tail()


Out[3]:
     file                   artist               title                        lyrics  mood   year
995  TRBIGRY128F42597B3.h5  Sade                 All About Our Love           NaN     sad    2000
996  TRBIIEU128F9307C88.h5  New Found Glory      Don't Let Her Pull You Down  NaN     happy  2009
997  TRBIIJY12903CE4755.h5  Mindy McCready       Ten Thousand Angels          NaN     happy  1996
998  TRBIIOT128F423C594.h5  Joy Division         Leaders Of Men               NaN     sad    1978
999  TRBIJYB128F14AE326.h5  Seventh Day Slumber  Shattered Life               NaN     sad    2005



Script to download the lyrics


In [4]:
# Sebastian Raschka, 2014
# 
# Script to download lyrics from http://lyrics.wikia.com/

import unicodedata
import urllib.parse

import lxml.html

class Song(object):
    def __init__(self, artist, title):
        self.artist = self.__format_str(artist)
        self.title = self.__format_str(title)
        self.url = None
        self.lyric = None
        
    def __format_str(self, s):
        # trim surrounding whitespace
        s = s.strip()
        try:
            # strip accents: decompose characters, then drop combining marks
            s = ''.join(c for c in unicodedata.normalize('NFD', s)
                         if unicodedata.category(c) != 'Mn')
        except TypeError:
            pass
        s = s.title()
        return s
        
    def __quote(self, s):
        # spaces become underscores; remaining special characters are percent-encoded
        return urllib.parse.quote(s.replace(' ', '_'))

    def __make_url(self):
        artist = self.__quote(self.artist)
        title = self.__quote(self.title)
        artist_title = '%s:%s' %(artist, title)
        url = 'http://lyrics.wikia.com/' + artist_title
        self.url = url
        
    def update(self, artist=None, title=None):
        if artist:
            self.artist = self.__format_str(artist)
        if title:
            self.title = self.__format_str(title)
        
    def lyricwikia(self):
        self.__make_url()
        try:
            doc = lxml.html.parse(self.url)
            lyricbox = doc.getroot().cssselect('.lyricbox')[0]
        except (IOError, IndexError) as e:
            self.lyric = ''
            return self.lyric
        lyrics = []

        for node in lyricbox:
            if node.tag == 'br':
                lyrics.append('\n')
            if node.tail is not None:
                lyrics.append(node.tail)
        self.lyric = "".join(lyrics).strip()
        return self.lyric
    
song = Song(artist='John Mellencamp', title='Jack and Diane')
lyr = song.lyricwikia()
print(lyr)


A little ditty about Jack and Diane
Two American kids growin' up in the heartland
Jackie gonna be a football star
Diane's a debutante, backseat of Jackie's car

Suckin' on a chili dog outside the Tastee-Freez
Diane's sittin' on Jackie's lap
He's got his hands between her knees
Jackie say, "Hey Diane, let's run off behind the shady trees
Dribble off those Bobbie Brooks, let me do what I please."
And say uh

Oh yeah, life goes on
Long after the thrill of livin' is gone, they say uh
Oh yeah, life goes on
Long after the thrill of livin' is gone, they walk on

Jackie sits back, collects his thoughts for the moment
Scratches his head and does his best James Dean
"Well then there Diane, we oughta run off to the city."
Diane says, "Baby, you ain't missin' nothing."
And Jackie say uh

Oh yeah, life goes on
Long after the thrill of livin' is gone
Oh yeah, they say life goes on
Long after the thrill of livin' is gone

Gonna let it rock
Let it roll
Let the Bible Belt come and save my soul
Hold on to sixteen as long as you can
Changes come around real soon
Make us women and men

Oh yeah, life goes on
Long after the thrill of livin' is gone
Oh yeah, they say life goes on
Long after the thrill of livin' is gone

A little ditty about Jack and Diane
Two American kids doin' the best they can
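
The URL that the script requests can be reproduced without any network access. The following is a minimal sketch of the formatting and quoting steps from the Song class, pulled out into standalone functions; the names `format_str` and `build_url` and the artist "Beyoncé" are illustrative assumptions and are not part of the original script.

```python
import unicodedata
import urllib.parse

BASE_URL = 'http://lyrics.wikia.com/'  # same base URL as in the script above


def format_str(s):
    """Trim whitespace, strip accents, and title-case a string."""
    s = s.strip()
    # Decompose accented characters and drop the combining marks ('Mn').
    s = ''.join(c for c in unicodedata.normalize('NFD', s)
                if unicodedata.category(c) != 'Mn')
    return s.title()


def build_url(artist, title):
    """Hypothetical helper: build the 'Artist:Title' page URL."""
    def quote(s):
        return urllib.parse.quote(s.replace(' ', '_'))
    return BASE_URL + '%s:%s' % (quote(format_str(artist)),
                                 quote(format_str(title)))


print(build_url('John Mellencamp', 'Jack and Diane'))
# http://lyrics.wikia.com/John_Mellencamp:Jack_And_Diane
print(build_url('  Beyonc\u00e9 ', 'Halo'))
# http://lyrics.wikia.com/Beyonce:Halo
```

Note that `str.title()` also capitalizes the letter that follows an apostrophe, so a title such as "Don't" is formatted as "Don'T" before quoting.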



Download lyrics (training dataset)


In [6]:
import pyprind

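# the all-NaN 'lyrics' column is read as float; cast it so it can hold strings
df['lyrics'] = df['lyrics'].astype(object)

pbar = pyprind.ProgBar(df.shape[0])
for row_id in df.index:
    # select row and column in a single .loc call instead of chained indexing
    song = Song(artist=df.loc[row_id, 'artist'], title=df.loc[row_id, 'title'])
    lyr = song.lyricwikia()
    df.loc[row_id, 'lyrics'] = lyr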
    pbar.update()
    
df.tail()


0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 760.945 sec
Out[6]:
     file                   artist               title                        lyrics                                             mood   year
995  TRBIGRY128F42597B3.h5  Sade                 All About Our Love           Its all about our love\nSo shall it be forever...  sad    2000
996  TRBIIEU128F9307C88.h5  New Found Glory      Don't Let Her Pull You Down  It's time that I rain on your parade\nWatch as...  happy  2009
997  TRBIIJY12903CE4755.h5  Mindy McCready       Ten Thousand Angels          Speakin of the devil\nLook who just walked in\...  happy  1996
998  TRBIIOT128F423C594.h5  Joy Division         Leaders Of Men               Born from some mother's womb\nJust like any ot...  sad    1978
999  TRBIJYB128F14AE326.h5  Seventh Day Slumber  Shattered Life               This wanting more from me is tearing me, it's ...  sad    2005

In [ ]:
df.to_csv('../../dataset/training/train_lyrics_1000.csv', index=False)



Download lyrics (validation dataset)


In [7]:
df = pd.read_csv('../../dataset/validation/valid_lyrics_rem_200.csv')
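# the all-NaN 'lyrics' column is read as float; cast it so it can hold strings
df['lyrics'] = df['lyrics'].astype(object)
pbar = pyprind.ProgBar(df.shape[0])
for row_id in df.index:
    # select row and column in a single .loc call instead of chained indexing
    song = Song(artist=df.loc[row_id, 'artist'], title=df.loc[row_id, 'title'])
    lyr = song.lyricwikia()
    df.loc[row_id, 'lyrics'] = lyr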
    pbar.update()
    
df.tail()


0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 186.137 sec
Out[7]:
     file                   artist          title                            genre  lyrics                                             mood
195  TRAKQEA128F1495E21.h5  Prince          Escape ( LP Version)             Rock   {B-side of Glam Slam}\nSnare drum pounds on th...  happy
196  TRAKQLN128F932AC25.h5  Cavo            Over Again (Album Version)       Rock   Well I will rise\nThe morning comes\nNothing e...  sad
197  TRAKQXJ128F147A028.h5  AFI             Summer Shudder                   Rock   Listen when I say, when I say it's real\nReal ...  happy
198  TRAKRQW128F427D6E3.h5  Vitamin C       Girls Against Boys (LP Version)  Pop    Imagine a world where the girls, girls rule th...  happy
199  TRAKSRQ128F4269AE8.h5  Richard Burton  Camelot                          Jazz   Each evening, from December to December\nBefor...  happy

In [ ]:
df.to_csv('../../dataset/validation/valid_lyrics_200.csv', index=False)



Download lyrics (auxiliary dataset)


In [ ]:
df = pd.read_csv('../../dataset/auxiliary/aux_lyrics_rem.csv')
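# the all-NaN 'lyrics' column is read as float; cast it so it can hold strings
df['lyrics'] = df['lyrics'].astype(object)
pbar = pyprind.ProgBar(df.shape[0])
for row_id in df.index:
    # select row and column in a single .loc call instead of chained indexing
    song = Song(artist=df.loc[row_id, 'artist'], title=df.loc[row_id, 'title'])
    lyr = song.lyricwikia()
    df.loc[row_id, 'lyrics'] = lyr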
    pbar.update()
    
df.tail()

In [ ]:
df.to_csv('../../dataset/auxiliary/aux_lyrics.csv', index=False)
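
The same download loop is repeated for the training, validation, and auxiliary datasets. One possible way to factor it out is sketched below. The helper names `add_lyrics` and `count_missing` are assumptions made for illustration, and `fake_fetch` is an offline stand-in for a call such as `Song(artist, title).lyricwikia()`, which returns an empty string when a lookup fails.

```python
import pandas as pd


def add_lyrics(df, fetch):
    """Return a copy of df whose 'lyrics' column is filled by fetch(artist, title)."""
    out = df.copy()
    # Build the whole column at once, so no string is written into a float (NaN) column.
    out['lyrics'] = [fetch(artist, title)
                     for artist, title in zip(out['artist'], out['title'])]
    return out


def count_missing(df):
    """Number of rows whose lyrics are empty or NaN (failed lookups return '')."""
    return int((df['lyrics'].fillna('') == '').sum())


# Offline stand-in for the real download call (hypothetical, for illustration only).
def fake_fetch(artist, title):
    return '' if artist == 'Unknown' else 'la la la'


demo = pd.DataFrame({'artist': ['Sade', 'Unknown'],
                     'title': ['All About Our Love', 'No Such Song'],
                     'lyrics': [float('nan'), float('nan')]})
filled = add_lyrics(demo, fake_fetch)
print(count_missing(filled))  # 1
```

Counting the empty entries before calling `to_csv` gives a quick check of how many songs could not be found.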