Data gathering

We manually sampled books from Project Gutenberg's collection.

First try


In [1]:
import book_classification as bc
import shelve

In [2]:
myShelf = shelve.open("storage_old.db")
oldBookCollection = myShelf['aBookCollection']
myShelf.close()  # close the shelf explicitly instead of relying on garbage collection
print("{} books, {} authors".format(len(oldBookCollection), len(oldBookCollection.authors)))


506 books, 320 authors

Although this first set contained many books, we later noticed that it had too few books per author (506 books spread over 320 authors). That would make training difficult.


In [3]:
import pandas

In [8]:
oldBookFrame = pandas.DataFrame(
    [[b.title, b.author, b.path, b.contents] for b in oldBookCollection],
    columns=["Title", "Author", "Path", "Contents"],
)
# Number of books per author, in decreasing order
oldBookFrame["Author"].value_counts().plot(figsize=(8, 5))


Out[8]:
<matplotlib.axes.AxesSubplot at 0x7f91110c57d0>

In particular, notice that most authors have only one book. With a single book per author, nothing is left to test on once that book has been used for training, so classifying by author is not just impractical but impossible.
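The same check can be made without a plot. The following is only a minimal sketch, not part of the original notebook: it assumes each book exposes an `author` attribute, as in the cells above, and it uses a hypothetical `Book` tuple and a hypothetical `min_books` threshold to list the authors that have enough books to be usable.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the objects stored in the book collection.
Book = namedtuple("Book", ["title", "author"])

def authors_with_enough_books(books, min_books=2):
    """Return {author: count} for authors with at least min_books books."""
    counts = Counter(book.author for book in books)
    return {author: n for author, n in counts.items() if n >= min_books}

# Tiny illustrative sample, not the real collection.
sample = [
    Book("A Christmas Carol", "Charles Dickens"),
    Book("The Cricket on the Hearth", "Charles Dickens"),
    Book("The Scarlet Letter", "Nathaniel Hawthorne"),
]
print(authors_with_enough_books(sample))
```

On the real data, the call would presumably be `authors_with_enough_books(oldBookCollection)`, since the collection is iterable in the cells above.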

Second try

So we took a second sample, this time aiming for more books per author.


In [5]:
myShelf = shelve.open("storage_new.db")
newBookCollection = myShelf['aBookCollection']
myShelf.close()  # close the shelf explicitly instead of relying on garbage collection
print("{} books, {} authors".format(len(newBookCollection), len(newBookCollection.authors)))


597 books, 47 authors

In [9]:
newBookFrame = pandas.DataFrame(
    [[b.title, b.author, b.path, b.contents] for b in newBookCollection],
    columns=["Title", "Author", "Path", "Contents"],
)
# Number of books per author, in decreasing order
newBookFrame["Author"].value_counts().plot(figsize=(8, 5))


Out[9]:
<matplotlib.axes.AxesSubplot at 0x7f912a24e950>

Duplicates

However, because the books were selected manually, the sample still contained some duplicates, even after applying corrections, adding missing metadata, and so on.


In [7]:
print(newBookFrame["Title"].value_counts().head(10))


A Christmas Carol                           5
Nights With Uncle Remus                     2
The Cricket on the Hearth                   2
A Select Collection of Old English Plays    2
Twice Told Tales                            2
The Scarlet Letter                          2
The Queen of the Pirate Isle                2
Ade's Fables                                2
Dickens in Camp                             1
A New Name for the Mexican Red Bat          1
dtype: int64
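One simple way to drop such exact repeats is to keep the first book seen for each title and author pair. The sketch below is an illustration under assumptions, not the procedure used in this notebook: `Book` is a hypothetical stand-in for the collection's objects, and matching on the author as well as the title is a design choice, since two different works can share a title.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects stored in the book collection.
Book = namedtuple("Book", ["title", "author"])

def drop_exact_duplicates(books):
    """Keep the first book seen for each (title, author) pair."""
    seen = set()
    unique = []
    for book in books:
        # Ignore case and surrounding whitespace when comparing.
        key = (book.title.strip().lower(), book.author.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(book)
    return unique

# Tiny illustrative sample, not the real collection.
sample = [
    Book("A Christmas Carol", "Charles Dickens"),
    Book("A Christmas Carol ", "Charles Dickens"),
    Book("The Scarlet Letter", "Nathaniel Hawthorne"),
]
print([book.title for book in drop_exact_duplicates(sample)])
```

Note that this only catches repeats whose metadata is identical after normalization. Different editions with slightly different titles would still slip through.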

Admittedly, we could have tried fuzzy matching or an edit-distance comparison. For now, though, we decided to keep things simple.
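For reference, fuzzy matching on titles could look roughly like the sketch below. It was not run on this collection. It uses `difflib.SequenceMatcher` from the standard library, whose ratio is a similarity score rather than a true edit distance, and both the helper name `similar_titles` and the `threshold` value are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_titles(titles, threshold=0.9):
    """Return (title_a, title_b, ratio) for distinct titles that look alike."""
    def normalize(title):
        # Lowercase and collapse runs of whitespace before comparing.
        return " ".join(title.lower().split())

    pairs = []
    for a, b in combinations(sorted(set(titles)), 2):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs

# Made-up spelling variants for illustration, not taken from the collection.
titles = ["Twice Told Tales", "Twice-Told Tales", "The Scarlet Letter"]
for a, b, ratio in similar_titles(titles):
    print("{!r} ~ {!r} ({})".format(a, b, ratio))
```

Comparing every pair of titles is quadratic in the number of titles, which is acceptable for a few hundred books but would need a smarter strategy for a much larger sample.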

In case it's required, here are some alternatives: