In [1]:
import keras
import pandas as pd
import numpy as np


Using TensorFlow backend.

Import Data for Train and Test

Use text drawn from a variety of topics and sources. Strip everything except letters and the spaces between words.


In [13]:
import nltk
from nltk import corpus
# nltk.download()


# print(dir(corpus))
# corp = corpus.gutenberg
files = corpus.gutenberg.fileids()
print(files)


['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

In [18]:
# NOTE: This is only needed to open NLTK's downloads manager!
# nltk.download()


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Out[18]:
True

In [46]:
# Get our source corpora from gutenberg in nltk.
emma_sents = corpus.gutenberg.sents('austen-emma.txt')

# Assign all of our samples that we'll be using.
corpora = emma_sents[:20]

# Iterate across the sentences, keeping only alphabetic tokens.
alpha_sentences = pd.DataFrame()
for sentence in corpora:
    sent = ' '.join(filter(str.isalpha, sentence))
    alpha_sentences = alpha_sentences.append(pd.Series(sent), ignore_index=True)

print(alpha_sentences.head(10))


                                                   0
0                                Emma by Jane Austen
1                                           VOLUME I
2                                          CHAPTER I
3  Emma Woodhouse handsome clever and rich with a...
4  She was the youngest of the two daughters of a...
5  Her mother had died too long ago for her to ha...
6  Sixteen years had Miss Taylor been in Mr Woodh...
7        Between it was more the intimacy of sisters
8  Even before Miss Taylor had ceased to hold the...
9  The real evils indeed of Emma s situation were...

In [47]:
print(emma_sents[3],alpha_sentences.iloc[3][0])


['Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty', '-', 'one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.'] Emma Woodhouse handsome clever and rich with a comfortable home and happy disposition seemed to unite some of the best blessings of existence and had lived nearly twenty one years in the world with very little to distress or vex her

In [ ]:
# We'll need to vectorize the words, put that into a dataframe, 
# and then generate another dataframe that's a vector of letter values.

Label Samples

Apply topic labels to each sample using a conventional technique such as LDA or SVD. Later, this might be expanded to include selection among assignment alternatives based on information criteria.


In [44]:
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer

# LDA expects a non-negative document-term matrix, not raw sentence
# strings, so count-vectorize the cleaned text first.
doc_term = CountVectorizer().fit_transform(alpha_sentences[0])

lda = LDA()
lda.fit(doc_term)

Preprocess Text for CNN

Each sample text needs to be transformed into a numeric encoding. Ideally, we want an unrestricted vocabulary and a minimal representation, so we encode individual letters and spaces rather than whole words.
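A minimal sketch of that letter-level encoding, assuming a hypothetical 27-symbol alphabet (space plus lowercase a-z) and a fixed sample length:

```python
import numpy as np

# Assumed alphabet: space plus lowercase letters, 27 symbols total.
alphabet = ' abcdefghijklmnopqrstuvwxyz'
char_to_idx = {c: i for i, c in enumerate(alphabet)}

def encode(text, max_len=40):
    """One-hot encode a cleaned, lowercased sample, padded/truncated to max_len."""
    mat = np.zeros((max_len, len(alphabet)), dtype='float32')
    for i, c in enumerate(text.lower()[:max_len]):
        mat[i, char_to_idx[c]] = 1.0
    return mat

sample = encode('Emma by Jane Austen')
print(sample.shape)       # (40, 27)
print(sample[:19].sum())  # 19.0 -- one active symbol per character
```

Rows beyond the text's length stay all-zero, which serves as padding for the CNN.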


In [ ]:

Test Our CNN

We need to see how well our CNN can match the labels applied through a more conventional method.
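One possible shape for that test model, sketched in Keras; the alphabet size (27), sample length (200), and topic count (5) are placeholder assumptions, not values fixed anywhere above:

```python
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

# Assumed sizes: 27-symbol alphabet, 200-character samples, 5 topic labels.
n_symbols, max_len, n_topics = 27, 200, 5

model = Sequential([
    keras.Input(shape=(max_len, n_symbols)),
    Conv1D(32, 7, activation='relu'),   # slide over character windows
    GlobalMaxPooling1D(),               # strongest response per filter
    Dense(n_topics, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Dummy one-hot batch just to confirm the shapes line up.
x = np.zeros((4, max_len, n_symbols), dtype='float32')
preds = model.predict(x)
print(preds.shape)  # (4, 5)
```

Accuracy against the LDA-assigned labels would then come from `model.fit` / `model.evaluate` on the encoded samples.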


In [ ]: