In this question, you'll be doing some basic processing on four books: Moby Dick, War and Peace, The King James Bible, and The Complete Works of William Shakespeare. Each of these texts is available for free on the Project Gutenberg website.
Write a function, read_book, which takes the name of the text file containing the book's content and returns a single string containing that content.
Your function should handle file-related errors gracefully. If an error occurs, just return None.
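For reference, here is a minimal sketch of one way this could be structured (not the only acceptable solution; the exact exception handling is up to you):
def read_book(filename):
    # Read the entire file into one string; return None on any file-related error.
    try:
        with open(filename, "r") as infile:
            return infile.read()
    except OSError:
        return None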
In [1]:
In [ ]:
assert read_book("queen_jean_bible.txt") is None
In [ ]:
assert read_book("complete_shakspeare.txt") is None
In [ ]:
book1 = read_book("moby_dick.txt")
assert len(book1) == 1238567
In [ ]:
book2 = read_book("war_and_peace.txt")
assert len(book2) == 3224780
Write a function, word_counts, which takes a single string as input (containing an entire book) and returns a dictionary of word counts.
Don't worry about handling punctuation, but definitely handle whitespace (spaces, tabs, and newlines). Also make sure to handle capitalization, and throw out any words with a length of 2 or less. There are no other preprocessing requirements beyond these.
You are welcome to use a collections.defaultdict for tracking word counts, but no other built-in Python packages or functions for counting.
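One possible sketch that follows these rules (lowercase everything, split on any whitespace, drop words of length 2 or less):
from collections import defaultdict

def word_counts(book):
    # str.split() with no arguments splits on runs of spaces, tabs, and newlines.
    counts = defaultdict(int)
    for word in book.lower().split():
        if len(word) > 2:
            counts[word] += 1
    return counts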
In [ ]:
In [ ]:
assert 0 == len(word_counts("").keys())
In [ ]:
assert 1 == word_counts("hi there")["there"]
In [ ]:
kj = word_counts(open("king_james_bible.txt", "r").read())
assert 23 == kj["devil"]
assert 4 == kj["leviathan"]
In [ ]:
wp = word_counts(open("war_and_peace.txt", "r").read())
assert 30 == wp["devil"]
assert 86 == wp["soul"]
Write a function, total_words, which takes as input a string containing the contents of a book and returns the integer count of the total number of words (NOT unique words, but total words).
The same rules apply as in Part B with respect to what constitutes a "word" (capitalization, punctuation, splitting, etc.), and you are welcome to use your Part B solution in answering this question!
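For instance, one short sketch that reuses the word_counts function from Part B:
def total_words(book):
    # Total words = sum of every word's count.
    return sum(word_counts(book).values())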
In [ ]:
In [ ]:
try:
    words = total_words("")
except:
    assert False
else:
    assert words == 0
In [ ]:
assert 11 == total_words("The brown fox jumped over the lazy cat.\nTwice.\nMMyep. Twice.")
In [ ]:
assert 681216 == total_words(open("king_james_bible.txt", "r").read())
In [ ]:
assert 729531 == total_words(open("complete_shakespeare.txt", "r").read())
Write a function, unique_words, which takes as input a string containing the full contents of a book and returns an integer count of the number of unique words in the book.
The same rules apply as in Part B with respect to what constitutes a "word" (capitalization, punctuation, splitting, etc.), and you are welcome to use your Part B solution in answering this question!
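For instance, a short sketch that reuses the word_counts function from Part B:
def unique_words(book):
    # Unique words = number of distinct keys in the counts dictionary.
    return len(word_counts(book))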
In [ ]:
In [ ]:
try:
    words = unique_words("")
except:
    assert False
else:
    assert words == 0
In [ ]:
assert 9 == unique_words("The brown fox jumped over the lazy cat.\nTwice.\nMMyep. Twice.")
In [ ]:
assert 31586 == unique_words(open("moby_dick.txt", "r").read())
In [ ]:
assert 40021 == unique_words(open("war_and_peace.txt", "r").read())
Write a function, global_vocabulary, which takes a variable number of arguments: each argument is a string containing the contents of a book. The function should return a list or set of the unique words that comprise the full vocabulary of terms present across all the books passed to it.
For example, if I have the following code:
book1 = "This is the entire content of a book."
book2 = "Here's another book."
book3 = "What is this?"
vocabulary = global_vocabulary(book1, book2, book3)
this should return a list or set containing the words:
{'another',
 'book.',
 'content',
 'entire',
 "here's",
 'the',
 'this',
 'this?',
 'what'}
The words should be in increasing lexicographic order (aka, standard alphabetical order), and all the preprocessing steps required in previous sections should be used. As such, you are welcome to use your word_counts function from Part B.
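One possible sketch, built on the word_counts function from Part B:
def global_vocabulary(*books):
    # Take the union of each book's vocabulary, then sort lexicographically.
    vocab = set()
    for book in books:
        vocab.update(word_counts(book))
    return sorted(vocab)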
In [ ]:
In [ ]:
doc1 = "This is a sentence."
doc2 = "This is another sentence."
doc3 = "What is this?"
assert set(["another", "sentence.", "this", "this?", "what"]) == set(global_vocabulary(doc1, doc2, doc3))
In [ ]:
assert 31586 == len(global_vocabulary(open("moby_dick.txt", "r").read()))
In [ ]:
assert 40021 == len(global_vocabulary(open("war_and_peace.txt", "r").read()))
In [ ]:
kj = open("king_james_bible.txt", "r").read()
wp = open("war_and_peace.txt", "r").read()
md = open("moby_dick.txt", "r").read()
cs = open("complete_shakespeare.txt", "r").read()
assert 118503 == len(global_vocabulary(kj, wp, md, cs))
Write a function, featurize, which takes a variable number of arguments: each argument is a string with the contents of an entire book. The output of this function is a 2D NumPy array of counts, where the rows are the documents/books (i.e., one row per argument!) and the columns are the counts for all the words in the global vocabulary.
For instance, if I pass two input strings to featurize that collectively have 50 unique words between them, the output matrix should have shape (2, 50): each row contains that document's counts for each of the 50 vocabulary words.
The rows (documents) should be in the same order as they're given in the function's argument list, and the columns (words) should be in increasing lexicographic order (aka alphabetical order). You are welcome to use your functions from Parts B and E.
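A sketch of one possible implementation, assuming the word_counts (Part B) and global_vocabulary (Part E) functions sketched above (or your own versions) are already defined:
import numpy as np

def featurize(*books):
    # Rows follow the argument order; columns follow the sorted global vocabulary.
    vocab = global_vocabulary(*books)
    column = {word: j for j, word in enumerate(vocab)}
    matrix = np.zeros((len(books), len(vocab)))
    for i, book in enumerate(books):
        for word, count in word_counts(book).items():
            matrix[i, column[word]] = count
    return matrix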
In [ ]:
In [ ]:
kj = open("king_james_bible.txt", "r").read()
wp = open("war_and_peace.txt", "r").read()
matrix = featurize(kj, wp)
assert 2 == matrix.shape[0]
assert 63889 == matrix.shape[1]
assert 2 == int(matrix[:, 836].sum())
assert 16 == int(matrix[:, 62655].sum())
In [ ]:
kj = open("king_james_bible.txt", "r").read()
wp = open("war_and_peace.txt", "r").read()
md = open("moby_dick.txt", "r").read()
cs = open("complete_shakespeare.txt", "r").read()
matrix = featurize(kj, wp, md, cs)
assert 4 == matrix.shape[0]
assert 118503 == matrix.shape[1]
assert 3 == int(matrix[:, 103817].sum())
assert 1 == int(matrix[:, 71100].sum())
Write a function, probability, which takes three arguments: an integer word index, a 2D NumPy count matrix (such as the one produced by featurize), and an optional integer document index.
This function is the implementation of $P(w)$ for some word $w$. By default, this is the probability of word $w$ over our entire dataset. However, by specifying the optional document index, we can instead compute a conditional probability $P(w | d)$: the probability of word $w$ given some specific document $d$.
Your function should return the probability as a floating-point value between 0 and 1. It should handle the case where the specified word index is out of bounds (resulting in a probability of 0), as well as the case where the document index is out of bounds (also a probability of 0).
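A sketch of one natural way to define these probabilities from the count matrix (assuming the document index defaults to None when no specific document is requested):
def probability(word_index, matrix, doc_index = None):
    # Any out-of-bounds word or document index yields a probability of 0.
    if word_index < 0 or word_index >= matrix.shape[1]:
        return 0.0
    if doc_index is None:
        # P(w): the word's count across all documents, over the total word count.
        return matrix[:, word_index].sum() / matrix.sum()
    if doc_index < 0 or doc_index >= matrix.shape[0]:
        return 0.0
    # P(w | d): the word's count in document d, over document d's total word count.
    return matrix[doc_index, word_index] / matrix[doc_index].sum()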
In [ ]:
In [ ]:
import numpy as np
matrix = np.load("lut.npy")
np.testing.assert_allclose(0.068569417725812987, probability(104088, matrix))
np.testing.assert_allclose(0.012485067486917144, probability(54096, matrix))
np.testing.assert_allclose(0.0073786475907416712, probability(21668, matrix))
np.testing.assert_allclose(0.0, probability(66535, matrix), rtol = 1e-5)
In [ ]:
matrix = np.load("lut.npy")
np.testing.assert_allclose(0.012404288801202555, probability(54096, matrix, 0))
np.testing.assert_allclose(0.0077914081371666744, probability(21668, matrix, 1))
np.testing.assert_allclose(0.0094279749592546449, probability(117297, matrix, 3))
Let's assume the four books you've analyzed are now going to constitute your "background" data. "Background" data is a concept that, in theory, allows you to identify important words: if you analyze a new book, you can compare its word counts to those in your "background" dataset. Any words in the new book that occur a lot more or a lot less frequently than in the background data could be considered "important" in some sense.
Let's say you receive a new book: Guns of the South. You want to compare its word counts to those in your "background". However, you quickly run into a problem: there are words in Guns of the South that do not exist at all in your background dataset--words like "Abraham" and "Lincoln" only show up in the new book, but never in your "background". This is extremely problematic, since now you'd potentially be dividing by 0 to gauge the relative importance of the words in Guns of the South.
Can you suggest a preprocessing step that might help?