Data gathering

We manually sampled books from Project Gutenberg's collection.

First try



In [1]:

    
import book_classification as bc
import shelve



In [2]:

    
myShelf = shelve.open("storage_old.db")
oldBookCollection = myShelf['aBookCollection']
myShelf.close()
print("{} books, {} authors".format(len(oldBookCollection), len(oldBookCollection.authors)))









    



506 books, 320 authors

Although this first set contained many books, we later noticed that it had too few books per author: 506 books spread over 320 authors is fewer than two books each on average. That would make training difficult.



In [3]:

    
import pandas



In [8]:

    
oldBookFrame = pandas.DataFrame([[b.title, b.author, b.path, b.contents] for b in oldBookCollection], columns=["Title", "Author", "Path", "Contents"])
oldBookFrame["Author"].value_counts().plot(figsize=(8, 5))  # books per author, most prolific first









    Out[8]:





<matplotlib.axes.AxesSubplot at 0x7f91110c57d0>

In particular, notice that most authors have only one book. An author with a single book cannot appear in both the training set and the test set, so for those authors classification is not just impractical but impossible.
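To put a number on that observation, here is a minimal sketch that counts single-book authors using only the standard library. The `authors` list is a made-up stand-in for the "Author" column of `oldBookFrame`, not the real data, and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical stand-in for the "Author" column; not the real collection.
authors = ["Dickens", "Dickens", "Hawthorne", "Harris", "Ade", "Ade", "Ade"]

# Number of books for each author.
books_per_author = Counter(authors)

# Authors with a single book cannot be split into training and test data.
single_book_authors = sorted(a for a, n in books_per_author.items() if n == 1)
share = len(single_book_authors) / len(books_per_author)

print(single_book_authors)
print("{:.0%} of authors have only one book".format(share))
```

On the real frame, the same count could presumably be taken from the result of `value_counts()` on the author column.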

Second try

So we took another sample, this time aiming for more books per author.



In [5]:

    
myShelf = shelve.open("storage_new.db")
newBookCollection = myShelf['aBookCollection']
myShelf.close()
print("{} books, {} authors".format(len(newBookCollection), len(newBookCollection.authors)))









    



597 books, 47 authors



In [9]:

    
newBookFrame = pandas.DataFrame([[b.title, b.author, b.path, b.contents] for b in newBookCollection], columns=["Title", "Author", "Path", "Contents"])
newBookFrame["Author"].value_counts().plot(figsize=(8, 5))  # books per author, most prolific first









    Out[9]:





<matplotlib.axes.AxesSubplot at 0x7f912a24e950>

Duplicates

However, because the selection was manual, some duplicate books remained (even after making corrections, adding missing metadata, etc.).



In [7]:

    
print(newBookFrame["Title"].value_counts().head(10))  # most repeated titles first









    



A Christmas Carol                           5
Nights With Uncle Remus                     2
The Cricket on the Hearth                   2
A Select Collection of Old English Plays    2
Twice Told Tales                            2
The Scarlet Letter                          2
The Queen of the Pirate Isle                2
Ade's Fables                                2
Dickens in Camp                             1
A New Name for the Mexican Red Bat          1
dtype: int64

To be thorough, we could have tried fuzzy matching or comparing edit distances to catch near-duplicate titles. For now, though, we decided to keep it simple.
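As a rough idea of what fuzzy matching on titles could look like, the sketch below compares every pair of distinct titles with `difflib.SequenceMatcher` from the standard library. It is not what was done here: the helper names `normalize` and `similar_titles`, the 0.9 threshold, and the sample titles are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(title):
    # Lowercase, turn punctuation into spaces, and collapse repeated spaces.
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return " ".join(cleaned.split())


def similar_titles(titles, threshold=0.9):
    """Return pairs of distinct titles whose similarity ratio is >= threshold."""
    unique = sorted(set(titles))
    pairs = []
    for a, b in combinations(unique, 2):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


# Hypothetical sample; the hyphenated variant is invented for the example.
titles = ["Twice Told Tales", "Twice-Told Tales", "The Scarlet Letter", "Ade's Fables"]
print(similar_titles(titles))
```

Comparing all pairs is quadratic in the number of titles, which should be tolerable for a few hundred books but would need a smarter approach for a much larger collection.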

In case it's required, here are some alternatives: