test_string = "Do you know the way to San Jose?"



## Extend your functions from class

### 1. Add code to your tokenizer to filter for punctuation before tokenizing

#### This might be helpful: http://stackoverflow.com/a/266162/1808021
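One way to approach this is with `str.translate`, which is roughly what the linked answer suggests. The sketch below is only an illustration: the function name `tokenize` is hypothetical, and lowercasing and splitting on whitespace are assumptions, since your tokenizer from class may do this differently.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Strip punctuation, lowercase, and split on whitespace.

    Lowercasing and whitespace splitting are assumptions made for this
    sketch; adapt them to match the tokenizer you wrote in class.
    """
    cleaned = text.translate(_PUNCT_TABLE)
    return cleaned.lower().split()


tokens = tokenize("Do you know the way to San Jose?")
print(tokens)
# ['do', 'you', 'know', 'the', 'way', 'to', 'san', 'jose']
```

Note that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters would pass through unchanged.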



### 2. Add code to your tokenizer to filter out stopwords. You can use the stopword list in NLTK or create your own
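Assuming the list in question is a stopword list, here is a minimal sketch of the "create your own" option. The set `STOPWORDS` and the function `remove_stopwords` are hypothetical names, and the words in the set are an arbitrary illustrative selection, not NLTK's list.

```python
# A tiny hand-built stopword set; purely illustrative, not NLTK's list.
STOPWORDS = {"a", "an", "the", "to", "do", "you", "of", "and", "in", "is"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


filtered = remove_stopwords(["do", "you", "know", "the", "way", "to", "san", "jose"])
print(filtered)
# ['know', 'way', 'san', 'jose']
```

If you prefer the NLTK option, `nltk.corpus.stopwords.words("english")` returns a list you could pass in as `stopwords` after running `nltk.download("stopwords")` once.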



### 3. Add code so that your tokenizer creates word tokens (if it doesn't already) and then generates the count for each token
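A sketch of one possible shape for this step, using `collections.Counter` from the standard library. The function name `count_tokens` is hypothetical, and the tokenizing rules (strip ASCII punctuation, lowercase, split on whitespace) are assumptions rather than the tokenizer from class.

```python
import string
from collections import Counter


def count_tokens(text):
    """Tokenize the text and return a Counter mapping token -> count."""
    # Assumed tokenizing rules: drop ASCII punctuation, lowercase, split.
    table = str.maketrans("", "", string.punctuation)
    tokens = text.translate(table).lower().split()
    return Counter(tokens)


counts = count_tokens("The way to San Jose is the way.")
print(counts["the"], counts["jose"])
# 2 1
```

A `Counter` returns 0 for tokens it has never seen, which is convenient when you later look up terms that are missing from a document.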



## Bonus

### Write a simple function to calculate the tf-idf

#### Remember the following, where $t$ is the term, $D$ is the document, $N$ is the total number of documents, $n_w$ is the number of documents containing the term $t$, $i_w$ is the number of times the term $t$ appears in document $D$, and $n_D$ is the total number of terms in $D$

$tf(t,D)=\frac{i_w}{n_D}$

$idf(t,D)=\log(\frac{N}{1+n_w})$

$tfidf=tf\times idf$
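The three formulas above could be sketched as follows. This is not a reference implementation: the function names are hypothetical, each document is assumed to be a list of tokens, $n_D$ is taken to be the number of tokens in the document, and the logarithm is assumed to be the natural log.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: i_w / n_D, with n_D assumed to be the token count."""
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: log(N / (1 + n_w)), natural log assumed."""
    n_w = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + n_w))


def tfidf(term, doc_tokens, corpus):
    """tf-idf as the product of the two quantities above."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Toy corpus of already-tokenized documents, made up for illustration.
corpus = [
    ["san", "jose", "way"],
    ["way", "home"],
    ["san", "francisco"],
    ["new", "york"],
]
print(round(tfidf("jose", corpus[0], corpus), 3))
# 0.231
```

Because of the $1+n_w$ term, a word that appears in every document gets a slightly negative idf under this formula, so very common words are pushed below zero rather than to exactly zero.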



## k-NN on Iris

### 4. Using the Iris dataset, test k-NN for various values of k to see if you can build a better classifier than our decision tree in 3_2
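One possible starting point, assuming scikit-learn is available. The helper name `knn_accuracy_by_k`, the range of k values, and the choice of 5-fold cross-validation are all assumptions made for this sketch; the decision tree from 3_2 is not reproduced here, so the comparison itself is left to you.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)


def knn_accuracy_by_k(X, y, k_values, folds=5):
    """Return {k: mean cross-validated accuracy} for each k in k_values."""
    results = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(model, X, y, cv=folds)
        results[k] = scores.mean()
    return results


# Odd values of k from 1 to 15, chosen arbitrarily for illustration.
accuracy = knn_accuracy_by_k(X, y, k_values=range(1, 16, 2))
for k, acc in accuracy.items():
    print(f"k={k:2d}  mean accuracy={acc:.3f}")
```

Scoring the decision tree with the same `cross_val_score` call would put both classifiers on the same footing before deciding which one is better.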



## k-Means with Congressional Bills

### 5. Explore the clusters of Congressional Records. Select another subset and investigate the contents. Write code that investigates a different cluster.
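A small helper along these lines may be useful for looking inside a single cluster. It assumes you already have a list of document strings and a parallel list of cluster labels from your fitted k-means model; the name `describe_cluster` and the toy inputs below are hypothetical placeholders, not real Congressional Record text or real clustering output.

```python
from collections import Counter


def describe_cluster(documents, labels, cluster_id, top_n=5):
    """Return the documents in one cluster and their most frequent words."""
    members = [doc for doc, lab in zip(documents, labels) if lab == cluster_id]
    words = Counter(word for doc in members for word in doc.lower().split())
    return members, words.most_common(top_n)


# Placeholder inputs for illustration only; substitute your own documents
# and the labels assigned by your k-means model.
documents = ["tax reform bill", "tax credit bill", "highway funding act"]
labels = [0, 0, 1]

members, top_words = describe_cluster(documents, labels, cluster_id=0)
print(len(members), "documents in cluster 0")
print(top_words)
```

Changing `cluster_id` lets you repeat the same inspection for a different cluster, and filtering stopwords first would likely make the top words more informative.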



