In [ ]:
import numpy as np
import os
from htrc_features import FeatureReader
In [ ]:
poetry_output = !htid2rsync --f data/poetry.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ data/poetry/
scifi_output = !htid2rsync --f data/scifi.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ data/scifi/
outputs = [poetry_output, scifi_output]
subjects = ['poetry', 'scifi']
paths = {}
suffix = '.json.bz2'
# For each corpus, keep only the .json.bz2 feature files from the rsync output
# and record their paths in a text file for later use.
for subject, output in zip(subjects, outputs):
    folder = subject
    filePaths = [path for path in output if path.endswith(suffix)]
    paths[subject] = [os.path.join(folder, path) for path in filePaths]
    fn = 'data/' + subject + '_paths.txt'
    with open(fn, 'w') as f:
        for path in paths[subject]:
            p = str(path) + '\n'
            f.write(p)
As in the previous notebooks, we'll construct FeatureReader
objects for each corpus. The cell below reads in the path files we created, which point to the downloaded data:
In [ ]:
paths = {}
subjects = ['poetry', 'scifi']
for subject in subjects:
    with open('data/' + subject + '_paths.txt', 'r') as f:
        paths[subject] = ['data/' + line.rstrip('\n') for line in f.readlines()]
poetry = FeatureReader(paths['poetry'])
scifi = FeatureReader(paths['scifi'])
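As a quick sanity check (an extra step we add here, not part of the original workflow), we can confirm that each corpus lists the expected number of downloaded feature files:
In [ ]:
# Optional check: count the feature-file paths loaded for each corpus.
# The rest of the notebook assumes 100 volumes per corpus (200 total).
for subject in subjects:
    print(subject, len(paths[subject]))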
To create our bag of words matrix, we need a global dictionary of all the words seen across our texts. The function below builds "wordDict", which maps each word to its column index in the bag of words matrix. It also collects a list of the volumes so that we can iterate over them again later.
In [ ]:
def createWordDict(HTRC_FeatureReader_List):
    """Build a global token-to-column-index dictionary and collect all volumes."""
    wordDict = {}
    i = 0
    volumes = []
    for f in HTRC_FeatureReader_List:
        for vol in f.volumes():
            volumes.append(vol)
            # Aggregate token counts over the whole volume rather than per page.
            tok_list = vol.tokenlist(pages=False)
            tokens = tok_list.index.get_level_values('token')
            for token in tokens:
                if token not in wordDict:
                    wordDict[token] = i
                    i += 1
    return wordDict, volumes
In [ ]:
wordDict, volumes = createWordDict([scifi, poetry])
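Before moving on, it can help to peek at what createWordDict returned (another small optional check we add here):
In [ ]:
# Optional check: vocabulary size and a few sample (token, column index) pairs.
print(len(wordDict), 'unique tokens across', len(volumes), 'volumes')
print(list(wordDict.items())[:5])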
Once we have constructed the global dictionary, we can fill the bag of words matrix with the word counts for each volume. We will then use this matrix as the training data for our model.
In [ ]:
# 200 volumes total (100 per corpus): the first 100 are scifi, the last 100 poetry,
# matching the order passed to createWordDict above.
dtm = np.zeros((200, len(wordDict)))
for i, vol in enumerate(volumes):
    tok_list = vol.tokenlist(pages=False)
    counts = list(tok_list['count'])
    tokens = tok_list.index.get_level_values('token')
    for token, count in zip(tokens, counts):
        try:
            index = wordDict[token]
            dtm[i, index] = count
        except KeyError:
            pass
X = dtm
# Label scifi volumes 0 and poetry volumes 1.
y = np.zeros(200)
y[100:200] = 1
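Before training, it's worth a quick check (a small addition to the original workflow) that the matrix has one row per volume and one column per token, and that the labels split evenly between the two classes:
In [ ]:
# Optional check: rows = volumes, columns = vocabulary size; labels should be 100 per class.
print(X.shape)
print(np.bincount(y.astype(int)))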
We can then use the TfidfTransformer
to reweight the bag of words matrix before fitting our LinearSVC model. Let's see how the model does under 10-fold cross-validation.
In [ ]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
tfidf = TfidfTransformer()
out = tfidf.fit_transform(X, y)
model = LinearSVC()
# Cross-validate on the tf-idf weighted matrix with 10 folds.
score = cross_val_score(model, out, y, cv=10)
print(np.mean(score))
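Cross-validation gives us a single averaged accuracy. If you also want per-class precision and recall, a minimal sketch like the one below works, using scikit-learn's train_test_split and classification_report (an optional extra, not part of the original pipeline):
In [ ]:
# Optional extra: hold out 20% of the volumes and report per-class precision and recall.
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(out, y, test_size=0.2, random_state=0)
clf = LinearSVC().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), target_names=['scifi', 'poetry']))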
We can also look at the most informative features, or words, for each class. First we'll fit
the model on the full dataset:
In [ ]:
# Fit on the tf-idf weighted matrix so we can inspect the learned coefficients.
model.fit(out, y)
In [ ]:
# The most negative coefficients push predictions toward class 0 (scifi).
feats = np.argsort(model.coef_[0])[:50]
top_scifi = [(list(feats).index(wordDict[w]) + 1, w) for w in wordDict.keys() if wordDict[w] in feats]
sorted(top_scifi)
In [ ]:
# The most positive coefficients push predictions toward class 1 (poetry).
feats = np.argsort(model.coef_[0])[-50:]
top_poetry = [(list(feats).index(wordDict[w]) + 1, w) for w in wordDict.keys() if wordDict[w] in feats]
sorted(top_poetry, key=lambda tup: tup[0])
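If the list comprehensions above feel opaque, an alternative (optional) way to read off the same top words is to invert wordDict once so column indices map directly back to words; index_to_word below is an illustrative name we introduce here:
In [ ]:
# Optional alternative: invert wordDict so column indices map directly back to words.
index_to_word = {index: word for word, index in wordDict.items()}
top_poetry_words = [index_to_word[i] for i in np.argsort(model.coef_[0])[-50:]]
print(top_poetry_words)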