Corpus Analysis, cut

Overview

In this section we'll look at the characteristics of the collected corpus.


In [1]:
import book_classification as bc
import math
import pandas
import shelve

In [2]:
myShelf = shelve.open("storage_new.db")
aBookCollection = myShelf['aBookCollection']
aDataFrame = aBookCollection.as_dataframe()
del myShelf

In [51]:
aDataFrame.icol([0, 1]).describe()


Out[51]:
        Title              Author
count   597                597
unique  586                47
top     A Christmas Carol  Nathaniel Hawthorne
freq    5                  94

Books and Authors

Some authors have more books than others, something that might impact the classification.


In [21]:
aDataFrame.groupby('Author').size().plot(kind='bar', figsize=(12, 5))


Out[21]:
<matplotlib.axes.AxesSubplot at 0x7f84a7a25250>

The histogram below shows the same distribution of book counts per author.


In [19]:
#aDataFrame.groupby('Author').size().plot(kind='kde', figsize=(6, 5))
aDataFrame.groupby('Author').size().hist()


Out[19]:
<matplotlib.axes.AxesSubplot at 0x7f96a7988ad0>

Cutting the distribution

If we want to ignore rare words (those that appear in only a few books and authors), we can make the distribution look more "Gaussian".


In [4]:
tokenizer = bc.BasicTokenizer()
frequency_extractor = bc.FrequenciesExtractor(tokenizer)

frequencies = bc.CollectionHierarchialFeatures.from_book_collection(aBookCollection, frequency_extractor)

In [5]:
vocabulary = pandas.Series(data=list(frequencies.total().values()), index=list(frequencies.total().keys()))
vocabulary.sort()
vocabulary.apply(math.log).plot()


Out[5]:
<matplotlib.axes.AxesSubplot at 0x7faaf5eac6d0>

In [6]:
start = round(0.5 * len(vocabulary))
end = round(0.98 * len(vocabulary))
vocabulary_small = vocabulary.iloc[start:end]
vocabulary_small.apply(math.log).plot()


Out[6]:
<matplotlib.axes.AxesSubplot at 0x7faaf5e21dd0>

In [8]:
tokenizer = bc.CollapsingTokenizer(bc.BasicTokenizer(), vocabulary_small.index)

Vocabularies and Authors

When talking about a "vocabulary", an implicit tokenization scheme is always involved. In this case we chose tokens consisting only of alphabetic characters and longer than 2 characters; the BasicTokenizer is built on top of NLTK.
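As a rough illustration of that rule (this is only a sketch, not the actual bc.BasicTokenizer code; it assumes NLTK's word_tokenize is used for the initial split):


In [ ]:
# Sketch only, not part of book_classification: keep alphabetic tokens
# longer than 2 characters after tokenizing with NLTK.
import nltk

def basic_tokenize_sketch(text):
    tokens = nltk.word_tokenize(text.lower())
    return [token for token in tokens if token.isalpha() and len(token) > 2]

basic_tokenize_sketch("The quick, brown fox -- 42 jumps!")
# ['the', 'quick', 'brown', 'fox', 'jumps']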


In [9]:
aBookAnalysis = bc.BookCollectionAnalysis(aBookCollection, tokenizer)

The vocabulary size (unique words) for each book.


In [8]:
aBookAnalysis.vocabulary_size_by_book().set_index('Book').sort(['Unique words']).plot()


Out[8]:
<matplotlib.axes.AxesSubplot at 0x7fa9610a7150>

The vocabulary size (unique words) for each author.


In [9]:
dataframe = aBookAnalysis.vocabulary_size_by_author().set_index('Author').sort(['Unique words'])
dataframe.plot(kind='bar', figsize=(15, 6))


Out[9]:
<matplotlib.axes.AxesSubplot at 0x7fa960f79b90>

Shared words by authors

Here we'll explore vocabulary overlap among authors, measured in number of words.

These are totals: each observation is the number of words that appear in the vocabularies of exactly that many authors (and, in the second plot, exactly that many books).
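To make that quantity concrete, here is a minimal sketch (a hypothetical helper, not the package's actual implementation) of how such counts could be computed, assuming each author's vocabulary is available as a set of words:


In [ ]:
# Sketch only, not part of book_classification.
from collections import Counter

def shared_words_by_authors_sketch(vocabulary_by_author):
    # vocabulary_by_author: dict mapping author name -> set of words that author uses
    authors_per_word = Counter()
    for words in vocabulary_by_author.values():
        authors_per_word.update(set(words))
    # maps "number of authors" -> "number of words shared by exactly that many authors"
    return Counter(authors_per_word.values())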


In [10]:
pandas.Series(aBookAnalysis.shared_words_by_authors()).apply(math.log).plot()


Out[10]:
<matplotlib.axes.AxesSubplot at 0x7fa960fcc910>

In [11]:
pandas.Series(aBookAnalysis.shared_words_by_books()).apply(math.log).plot()


Out[11]:
<matplotlib.axes.AxesSubplot at 0x7fa960f04bd0>

These are the cumulative totals (each observation is the number of words that appear in at most N authors, or at most N books in the second plot).


In [12]:
pandas.Series(aBookAnalysis.shared_words_by_authors()).cumsum().apply(math.log).plot()


Out[12]:
<matplotlib.axes.AxesSubplot at 0x7fa960d96b10>

In [13]:
pandas.Series(aBookAnalysis.shared_words_by_books()).cumsum().apply(math.log).plot()


Out[13]:
<matplotlib.axes.AxesSubplot at 0x7fa960d6e750>

Sparsity of the feature matrix


In [10]:
vocabularySizes = aBookAnalysis.vocabulary_size_by_book()['Unique words'] / len(aBookAnalysis.vocabulary().total())
vocabularySizes.hist(bins=100)
#vocabularySizes.plot(kind='kde')


Out[10]:
<matplotlib.axes.AxesSubplot at 0x7faae93a9c90>

In [11]:
print(vocabularySizes.sum() / len(vocabularySizes))


0.0468772770631
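That is, on average a book contains a bit under 5% of the words in the corpus vocabulary, so a book-by-word feature matrix built over this vocabulary is more than 95% sparse.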

Entropies vs Frequencies

Let's look at the differences between the two. Note that the logarithm is applied to the frequencies, so they are on the same scale as the entropies.
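As a sketch of the idea (the actual computation is done by bc.EntropiesExtractor below; here we assume a word's entropy measures how its occurrences spread over fixed-size groups of 500 tokens):


In [ ]:
# Sketch only, not the real EntropiesExtractor implementation.
# Assumption: a word's entropy is the Shannon entropy of the distribution of
# its occurrences over consecutive groups of `group_size` tokens.
import math

def word_entropy_sketch(tokens, word, group_size=500):
    groups = [tokens[i:i + group_size] for i in range(0, len(tokens), group_size)]
    counts = [group.count(word) for group in groups]
    total = sum(counts)
    if total == 0:
        return 0.0
    probabilities = [count / total for count in counts if count > 0]
    return -sum(p * math.log(p) for p in probabilities)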


In [15]:
frequenciesExtractor = bc.FrequenciesExtractor(tokenizer)
entropiesExtractor = bc.EntropiesExtractor(tokenizer, bc.FixedGrouper(500))
frequencies = bc.HierarchialFeatures.from_book_collection(aBookCollection, frequenciesExtractor)
entropies = bc.HierarchialFeatures.from_book_collection(aBookCollection, entropiesExtractor)

In [16]:
df_input = []
for word in aBookAnalysis._vocabulary.total().keys():
    df_input.append([math.log(frequencies.total()[word]), entropies.total()[word]])
df_input.sort()
entropies_vs_frequencies = pandas.DataFrame(df_input, columns=["Frequencies", "Entropies"])

In [17]:
entropies_vs_frequencies.plot(kind='kde', figsize=(8, 8), subplots=True, sharex=False)


Out[17]:
array([<matplotlib.axes.AxesSubplot object at 0x7fa97a0f8510>,
       <matplotlib.axes.AxesSubplot object at 0x7fa9802d4e10>], dtype=object)

If we plot the two distributions individually, we can hardly see a difference (apart from the scales). But by sorting the pairs according to one of them (in this case, the frequencies), it becomes clear that the entropies are not the same.

Moreover, the maximum entropy grows in a similar fashion to frequency, but frequency alone cannot explain the additional variation. So entropy seems to carry more information about a word than frequency does.


In [18]:
entropies_vs_frequencies["Entropies"].plot(figsize=(12, 4))


Out[18]:
<matplotlib.axes.AxesSubplot at 0x7fa97f8199d0>

By zooming in on the tail of the series (the cell below plots everything from index 60000 onwards) we see that the pattern continues.


In [19]:
entropies_vs_frequencies["Entropies"].iloc[60000:].plot(figsize=(12, 4))


Out[19]:
<matplotlib.axes.AxesSubplot at 0x7fa96dde6710>

In [20]:
# TODO: get a decent density plot of x=freq,y=entr with log color map

#figure(figsize(10, 10))
#scatter(entropies_vs_frequencies["Frequencies"], entropies_vs_frequencies["Entropies"])
#figure(figsize(5, 5))

The two values seem to be almost equal much of the time, but informative words appear increasingly often towards higher frequencies.

The following distribution of differences in the entropy series (sorted by increasing frequency) shows how much entropy varies between words with essentially the same frequency.


In [21]:
entropies_vs_frequencies["Entropies"].diff().dropna().apply(abs).hist(log=True)


Out[21]:
<matplotlib.axes.AxesSubplot at 0x7fa96dfc37d0>
