In [1]:
import book_classification as bc
import shelve
In [2]:
myShelf = shelve.open("storage_old.db")
oldBookCollection = myShelf['aBookCollection']
# Close the shelf explicitly instead of relying on garbage collection.
myShelf.close()
print("{} books, {} authors".format(len(oldBookCollection), len(oldBookCollection.authors)))
Although this first set contained many books, we later noticed that it had too few books per author, which would make training difficult.
In [3]:
import pandas
In [8]:
oldBookFrame = pandas.DataFrame(
    [[b.title, b.author, b.path, b.contents] for b in oldBookCollection],
    columns=["Title", "Author", "Path", "Contents"])
# Number of books per author, sorted from most to fewest.
oldBookFrame["Author"].value_counts().plot(figsize=(8, 5))
Out[8]:
In particular, notice that most authors have only one book. An author with a single book cannot appear in both the training set and the test set, so for those authors classification is not merely impractical but impossible.
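Counting the authors with exactly one book makes this concrete. The following is a minimal sketch on made-up labels: `single_book_authors` and `example_authors` are hypothetical names, and with the real data the input would be the list of `b.author` values from the collection.

```python
from collections import Counter

def single_book_authors(authors):
    """Return, in sorted order, the authors that appear exactly once."""
    counts = Counter(authors)
    return sorted(a for a, n in counts.items() if n == 1)

# Hypothetical author labels, one per book; not taken from the collection.
example_authors = ["A", "B", "B", "C", "D", "D", "D"]
singles = single_book_authors(example_authors)
print("{} of {} authors have a single book".format(
    len(singles), len(set(example_authors))))
```

Any author returned by this function would have to be dropped, or the collection extended, before splitting into training and test sets.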
In [5]:
myShelf = shelve.open("storage_new.db")
newBookCollection = myShelf['aBookCollection']
# Close the shelf explicitly instead of relying on garbage collection.
myShelf.close()
print("{} books, {} authors".format(len(newBookCollection), len(newBookCollection.authors)))
In [9]:
newBookFrame = pandas.DataFrame(
    [[b.title, b.author, b.path, b.contents] for b in newBookCollection],
    columns=["Title", "Author", "Path", "Contents"])
# Number of books per author, sorted from most to fewest.
newBookFrame["Author"].value_counts().plot(figsize=(8, 5))
Out[9]:
In [7]:
# Ten most frequent titles; a count above 1 means the exact title is repeated.
print(newBookFrame["Title"].value_counts().head(10))
To be thorough, we could have tried fuzzy matching or an edit-distance comparison on the titles, but for now we decided to keep it simple.
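For reference, fuzzy title matching can be sketched with the standard library's `difflib`. This is only an illustration under stated assumptions, not what the notebook does: `normalize`, `similar_titles`, the 0.85 threshold, and the sample titles are all hypothetical. `SequenceMatcher.ratio()` is a similarity score between 0 and 1, not a true edit distance.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(title):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())

def similar_titles(titles, threshold=0.85):
    """Return (a, b, ratio) for distinct titles with similarity >= threshold."""
    unique = sorted(set(titles))
    pairs = []
    # Compare every pair of distinct titles; quadratic, so only for small sets.
    for a, b in combinations(unique, 2):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if ratio >= threshold:
            pairs.append((a, b, ratio))
    return pairs

# Hypothetical titles for illustration only; not taken from the collection.
example_titles = ["Moby Dick", "Moby-Dick", "Moby Dick", "Walden"]
matches = similar_titles(example_titles)
for a, b, ratio in matches:
    print("{!r} ~ {!r}: {:.2f}".format(a, b, ratio))
```

The pairwise loop grows quadratically with the number of distinct titles, so a large collection would need some form of blocking (for example, comparing only titles by the same author) before this becomes practical.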
If a more thorough check is ever required, here are some alternatives: