In [ ]:
test_string = "Do you know the way to San Jose?"

Extend your functions from class

1. Add code to your tokenizer to filter out punctuation before tokenizing

This might be helpful: http://stackoverflow.com/a/266162/1808021


In [ ]:
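
A minimal sketch of one approach, using `str.translate` as in the linked Stack Overflow answer (assumes Python 3; the name `tokenize` is mine, so rename to match your function from class):

In [ ]:
import string

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    # str.maketrans with two empty strings and a third argument builds a
    # translation table that deletes every punctuation character
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    return cleaned.lower().split()

tokenize(test_string)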

2. Add code to your tokenizer to filter out stopwords

Your function should use a list of stopwords to filter the tokens, so that no word in the stopword list is returned

You can use the stopword list in NLTK or create your own


In [ ]:
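
A sketch using NLTK's English stopword list (assumes the `tokenize` function from the previous cell and that the stopwords corpus has been fetched with `nltk.download('stopwords')`):

In [ ]:
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

def tokenize_no_stopwords(text):
    """Tokenize, then drop any token that appears in the stopword list."""
    return [tok for tok in tokenize(text) if tok not in STOPWORDS]

tokenize_no_stopwords(test_string)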

3. Add code that calls your tokenizer to create word tokens (if it doesn't already) and then generates the counts for each token


In [ ]:
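
A sketch using `collections.Counter` (assumes the `tokenize_no_stopwords` function from the previous cell):

In [ ]:
from collections import Counter

def token_counts(text):
    """Tokenize the text and count how often each token appears."""
    return Counter(tokenize_no_stopwords(text))

token_counts(test_string)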

Bonus

Write a simple function to calculate the tf-idf

Remember the following, where $t$ is the term, $D$ is the document, $N$ is the total number of documents, $n_w$ is the number of documents containing the term $t$, $n_D$ is the total number of terms in document $D$, and $i_w$ is the number of times the term $t$ appears in document $D$

$tf(t,D)=\frac{i_w}{n_D}$

$idf(t,D)=\log(\frac{N}{1+n_w})$

$tfidf=tf\times idf$


In [ ]:
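
A sketch that follows the formulas above directly (assumes the `tokenize_no_stopwords` function from earlier and non-empty documents; the function names are my own choices):

In [ ]:
import math
from collections import Counter

def tf(term, document):
    """i_w / n_D: occurrences of the term divided by total terms in the document."""
    tokens = tokenize_no_stopwords(document)
    return Counter(tokens)[term] / len(tokens)

def idf(term, documents):
    """log(N / (1 + n_w)), where n_w counts the documents containing the term."""
    n_w = sum(1 for doc in documents if term in tokenize_no_stopwords(doc))
    return math.log(len(documents) / (1 + n_w))

def tfidf(term, document, documents):
    return tf(term, document) * idf(term, documents)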

k-NN on Iris

4. Using the Iris dataset, test k-NN for various values of k to see if you can build a better classifier than our decision tree in 3_2


In [ ]:
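
A sketch using scikit-learn (assumes a recent version with `sklearn.model_selection`; the 70/30 split and the range of k are arbitrary choices, so match whatever evaluation you used for the decision tree in 3_2 before comparing accuracies):

In [ ]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Fit k-NN for a range of k and report held-out accuracy for each
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))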

k-Means with Congressional Bills

5. Explore the clusters of Congressional Records. Write code that selects a cluster other than the one we looked at in class and investigates its contents.


In [ ]:
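
A hypothetical sketch (the names `km` for the fitted KMeans model and `bills` for the list of raw documents are placeholders; substitute whatever your class notebook used):

In [ ]:
import numpy as np

cluster_id = 2  # pick a cluster other than the one explored in class
members = np.where(km.labels_ == cluster_id)[0]

print("Cluster", cluster_id, "has", len(members), "documents")
for i in members[:5]:
    # peek at the first 200 characters of a few bills in this cluster
    print(bills[i][:200])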

6. On the class Tumblr, provide a response to the lesson on k-Means, specifically whether you think this is a useful technique for working journalists (data or otherwise)