In [1]:

    
%load_ext watermark



In [2]:

    
%watermark -a 'Sebastian Raschka' -d -v









    



Sebastian Raschka 07/12/2014 

CPython 3.4.2
IPython 2.3.0

Re-downloading the lyrics

The lyrics were removed from the datasets in the public GitHub repository for copyright reasons. You can re-download them by following the steps in this IPython notebook.

Load the CSV file



In [3]:

    
import pandas as pd

df = pd.read_csv('../../dataset/training/train_lyrics_rem_1000.csv')
df.tail()









    Out[3]:

     file                   artist               title                        lyrics  mood   year
995  TRBIGRY128F42597B3.h5  Sade                 All About Our Love           NaN     sad    2000
996  TRBIIEU128F9307C88.h5  New Found Glory      Don't Let Her Pull You Down  NaN     happy  2009
997  TRBIIJY12903CE4755.h5  Mindy McCready       Ten Thousand Angels          NaN     happy  1996
998  TRBIIOT128F423C594.h5  Joy Division         Leaders Of Men               NaN     sad    1978
999  TRBIJYB128F14AE326.h5  Seventh Day Slumber  Shattered Life               NaN     sad    2005

Script to download the lyrics



In [4]:

    
# Sebastian Raschka, 2014
# 
# Script to download lyrics from http://lyrics.wikia.com/

import unicodedata   # needed for the accent stripping in __format_str
import urllib.parse  # `import urllib` alone does not guarantee urllib.parse is loaded

import lxml.html

class Song(object):
    def __init__(self, artist, title):
        self.artist = self.__format_str(artist)
        self.title = self.__format_str(title)
        self.url = None
        self.lyric = None
        
    def __format_str(self, s):
        # strip surrounding whitespace
        s = s.strip()
        try:
            # strip accents: decompose (NFD), then drop combining marks ('Mn')
            s = ''.join(c for c in unicodedata.normalize('NFD', s)
                         if unicodedata.category(c) != 'Mn')
        except TypeError:
            pass
        s = s.title()
        return s
        
    def __quote(self, s):
        return urllib.parse.quote(s.replace(' ', '_'))

    def __make_url(self):
        artist = self.__quote(self.artist)
        title = self.__quote(self.title)
        artist_title = '%s:%s' %(artist, title)
        url = 'http://lyrics.wikia.com/' + artist_title
        self.url = url
        
    def update(self, artist=None, title=None):
        if artist:
            self.artist = self.__format_str(artist)
        if title:
            self.title = self.__format_str(title)
        
    def lyricwikia(self):
        self.__make_url()
        try:
            doc = lxml.html.parse(self.url)
            lyricbox = doc.getroot().cssselect('.lyricbox')[0]
        except (IOError, IndexError):
            self.lyric = ''
            return self.lyric
        lyrics = []

        for node in lyricbox:
            if node.tag == 'br':
                lyrics.append('\n')
            if node.tail is not None:
                lyrics.append(node.tail)
        self.lyric = "".join(lyrics).strip()
        return self.lyric
    
song = Song(artist='John Mellencamp', title='Jack and Diane')
lyr = song.lyricwikia()
print(lyr)









    



A little ditty about Jack and Diane
Two American kids growin' up in the heartland
Jackie gonna be a football star
Diane's a debutante, backseat of Jackie's car

Suckin' on a chili dog outside the Tastee-Freez
Diane's sittin' on Jackie's lap
He's got his hands between her knees
Jackie say, "Hey Diane, let's run off behind the shady trees
Dribble off those Bobbie Brooks, let me do what I please."
And say uh

Oh yeah, life goes on
Long after the thrill of livin' is gone, they say uh
Oh yeah, life goes on
Long after the thrill of livin' is gone, they walk on

Jackie sits back, collects his thoughts for the moment
Scratches his head and does his best James Dean
"Well then there Diane, we oughta run off to the city."
Diane says, "Baby, you ain't missin' nothing."
And Jackie say uh

Oh yeah, life goes on
Long after the thrill of livin' is gone
Oh yeah, they say life goes on
Long after the thrill of livin' is gone

Gonna let it rock
Let it roll
Let the Bible Belt come and save my soul
Hold on to sixteen as long as you can
Changes come around real soon
Make us women and men

Oh yeah, life goes on
Long after the thrill of livin' is gone
Oh yeah, they say life goes on
Long after the thrill of livin' is gone

A little ditty about Jack and Diane
Two American kids doin' the best they can
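
The URL that `Song` requests can be previewed without any network access. The sketch below mirrors the formatting and quoting steps of the class as standalone functions. The names `format_str` and `make_url` are hypothetical and introduced only for illustration; they are not part of the script above.

```python
import unicodedata
import urllib.parse


def format_str(s):
    """Strip whitespace and accents, then title-case the string."""
    s = s.strip()
    # NFD splits an accented letter into base letter + combining mark ('Mn')
    s = ''.join(c for c in unicodedata.normalize('NFD', s)
                if unicodedata.category(c) != 'Mn')
    return s.title()


def _quote(s):
    """Replace spaces with underscores and percent-encode the rest."""
    return urllib.parse.quote(s.replace(' ', '_'))


def make_url(artist, title, base='http://lyrics.wikia.com/'):
    """Build the 'Artist:Title' page URL from an artist and a song title."""
    return base + '%s:%s' % (_quote(format_str(artist)),
                             _quote(format_str(title)))


print(make_url('John Mellencamp', 'Jack and Diane'))
# http://lyrics.wikia.com/John_Mellencamp:Jack_And_Diane
print(make_url('Sade', "Don't"))
```

Note that `str.title()` capitalizes the letter after an apostrophe, so the second call yields `Don%27T` in the URL. Whether the site resolves such page names is not something this sketch can check.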

Download lyrics (training dataset)



In [6]:

    
import pyprind

pbar = pyprind.ProgBar(df.shape[0])
for row_id in df.index:
    song = Song(artist=df.loc[row_id, 'artist'], title=df.loc[row_id, 'title'])
    lyr = song.lyricwikia()
    df.loc[row_id, 'lyrics'] = lyr
    pbar.update()
    
df.tail()









    



0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 760.945 sec






    Out[6]:

     file                   artist               title                        lyrics                                             mood   year
995  TRBIGRY128F42597B3.h5  Sade                 All About Our Love           Its all about our love\nSo shall it be forever...  sad    2000
996  TRBIIEU128F9307C88.h5  New Found Glory      Don't Let Her Pull You Down  It's time that I rain on your parade\nWatch as...  happy  2009
997  TRBIIJY12903CE4755.h5  Mindy McCready       Ten Thousand Angels          Speakin of the devil\nLook who just walked in\...  happy  1996
998  TRBIIOT128F423C594.h5  Joy Division         Leaders Of Men               Born from some mother's womb\nJust like any ot...  sad    1978
999  TRBIJYB128F14AE326.h5  Seventh Day Slumber  Shattered Life               This wanting more from me is tearing me, it's ...  sad    2005



In [ ]:

    
df.to_csv('../../dataset/training/train_lyrics_1000.csv', index=False)

Download lyrics (validation dataset)



In [7]:

    
df = pd.read_csv('../../dataset/validation/valid_lyrics_rem_200.csv')
pbar = pyprind.ProgBar(df.shape[0])
for row_id in df.index:
    song = Song(artist=df.loc[row_id, 'artist'], title=df.loc[row_id, 'title'])
    lyr = song.lyricwikia()
    df.loc[row_id, 'lyrics'] = lyr
    pbar.update()
    
df.tail()









    



0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 186.137 sec






    Out[7]:

     file                   artist          title                            genre  lyrics                                             mood
195  TRAKQEA128F1495E21.h5  Prince          Escape ( LP Version)             Rock   {B-side of Glam Slam}\nSnare drum pounds on th...  happy
196  TRAKQLN128F932AC25.h5  Cavo            Over Again (Album Version)       Rock   Well I will rise\nThe morning comes\nNothing e...  sad
197  TRAKQXJ128F147A028.h5  AFI             Summer Shudder                   Rock   Listen when I say, when I say it's real\nReal ...  happy
198  TRAKRQW128F427D6E3.h5  Vitamin C       Girls Against Boys (LP Version)  Pop    Imagine a world where the girls, girls rule th...  happy
199  TRAKSRQ128F4269AE8.h5  Richard Burton  Camelot                          Jazz   Each evening, from December to December\nBefor...  happy



In [ ]:

    
df.to_csv('../../dataset/validation/valid_lyrics_200.csv', index=False)

Download lyrics (auxiliary dataset)



In [ ]:

    
df = pd.read_csv('../../dataset/auxiliary/aux_lyrics_rem.csv')
pbar = pyprind.ProgBar(df.shape[0])
for row_id in df.index:
    song = Song(artist=df.loc[row_id, 'artist'], title=df.loc[row_id, 'title'])
    lyr = song.lyricwikia()
    df.loc[row_id, 'lyrics'] = lyr
    pbar.update()
    
df.tail()



In [ ]:

    
df.to_csv('../../dataset/auxiliary/aux_lyrics.csv', index=False)
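
The three download cells repeat the same loop, so it could be factored into one helper. The following is a minimal sketch under stated assumptions, not the notebook's actual code: `download_lyrics`, its `fetch` callable, and the `delay` pause between requests are all hypothetical names introduced here. It works on plain `(artist, title)` pairs so that it can be tried with a stand-in fetcher before touching the network, and it also reports which songs came back empty.

```python
import time


def download_lyrics(songs, fetch, delay=0.5):
    """Fetch lyrics for (artist, title) pairs.

    `fetch` is any callable taking (artist, title) and returning a string;
    an empty string is treated as "not found".
    Returns (lyrics, misses) with lyrics in the same order as `songs`.
    """
    lyrics, misses = [], []
    for artist, title in songs:
        text = fetch(artist, title)
        if not text:
            misses.append((artist, title))
        lyrics.append(text)
        time.sleep(delay)  # assumed courtesy pause, not in the original loop
    return lyrics, misses


# Offline demonstration with a stand-in fetcher (placeholder data)
fake = {('Sade', 'All About Our Love'): 'some lyrics'}
lyrics, misses = download_lyrics(
    [('Sade', 'All About Our Love'), ('Unknown', 'Nothing')],
    fetch=lambda a, t: fake.get((a, t), ''),
    delay=0)
print(lyrics)
print(misses)
```

With the `Song` class from above, a call might look like `download_lyrics(zip(df['artist'], df['title']), lambda a, t: Song(a, t).lyricwikia())`, followed by assigning the returned list to `df['lyrics']`.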
