In [ ]:

    
# You will probably need to !pip install some of these
# !pip install scipy
# !pip install sklearn
# !pip install nltk



In [4]:

    
import pandas as pd
!pip install sklearn
!pip install scipy
!pip install nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re #re: module for regular expressions
from nltk.stem.porter import PorterStemmer

pd.options.display.max_columns = 30
%matplotlib inline









    



Requirement already satisfied (use --upgrade to upgrade): sklearn in /Users/sz2472/.virtualenvs/pandas/lib/python3.5/site-packages
Requirement already satisfied (use --upgrade to upgrade): scikit-learn in /Users/sz2472/.virtualenvs/pandas/lib/python3.5/site-packages (from sklearn)
Requirement already satisfied (use --upgrade to upgrade): scipy in /Users/sz2472/.virtualenvs/pandas/lib/python3.5/site-packages
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.6.2 in /Users/sz2472/.virtualenvs/pandas/lib/python3.5/site-packages (from scipy)
Collecting nltk
  Downloading nltk-3.2.1.tar.gz (1.1MB)
    100% |████████████████████████████████| 1.1MB 875kB/s 
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... - \ | / done
  Stored in directory: /Users/sz2472/Library/Caches/pip/wheels/55/0b/ce/960dcdaec7c9af5b1f81d471a90c8dae88374386efe6e54a50
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.2.1

Analyzing text!

Text analysis has a few parts. We are going to use bag of words analysis, which just treats a sentence like a bag of words - no particular order or anything. It's simple but it usually gets the job done adequately.

Here is our text.



In [5]:

    
texts = [
    "Penny bought bright blue fishes.",
    "Penny bought bright blue and orange fish.",
    "The cat ate a fish at the store.",
    "Penny went to the store. Penny ate a bug. Penny saw a fish.",
    "It meowed once at the bug, it is still meowing at the bug and the fish",
    "The cat is at the store. The cat is orange. The cat is meowing at the fish.",
    "Penny is a fish"
]

When you process text, you have a nice long series of steps, but let's say you're interested in three things:

Tokenizing converts all of the sentences/phrases/etc into a series of words, and then it might also include converting it into a series of numbers - math stuff only works with numbers, not words. So maybe 'cat' is 2 and 'rug' is 4 and stuff like that.
Counting takes those words and sees how many there are (obviously) - how many times does meow appear?
Normalizing takes the count and makes new numbers - maybe it's how many times meow appears vs. how many total words there are, or maybe you're seeing how often meow comes up to see whether it's important.



In [ ]:

    
Penny bought bright blue fishes

tokenized - penny bought bright blue fishes



In [13]:

    
"Penny bought bright blue fishes".split()









    Out[13]:





['Penny', 'bought', 'bright', 'blue', 'fishes']

The scikit-learn package does a ton of stuff, some of which includes the above. We're going to start by playing with the CountVectorizer.



In [14]:

    
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()



In [15]:

    
# .fit_transfer TOKENIZES and COUNTS
X = count_vectorizer.fit_transform(texts)

Let's take a look at what it found out!



In [17]:

    
X









    Out[17]:





<7x23 sparse matrix of type '<class 'numpy.int64'>'
	with 49 stored elements in Compressed Sparse Row format>

Okay, that looks like trash and garbage. What's a "sparse array"??????



In [18]:

    
X.toarray()









    Out[18]:





array([[0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0],
       [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 3, 1, 0, 1, 1, 1, 1],
       [1, 2, 0, 0, 0, 0, 2, 0, 1, 0, 1, 2, 1, 1, 1, 0, 0, 0, 1, 0, 3, 0, 0],
       [0, 2, 0, 0, 0, 0, 0, 3, 1, 0, 3, 0, 0, 1, 0, 1, 0, 0, 0, 1, 5, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]], dtype=int64)

If we put on our Computer Goggles we see that the first sentence has the first word 3 times, the second word 1 time, the third word 1 time, etc... But we can't read it, really. It would look nicer as a dataframe.



In [19]:

    
pd.DataFrame(X.toarray())

What do all of those numbers mean????



In [20]:

    
# A fish is Penny
count_vectorizer.get_feature_names()









    Out[20]:





['and',
 'at',
 'ate',
 'blue',
 'bought',
 'bright',
 'bug',
 'cat',
 'fish',
 'fishes',
 'is',
 'it',
 'meowed',
 'meowing',
 'once',
 'orange',
 'penny',
 'saw',
 'still',
 'store',
 'the',
 'to',
 'went']



In [21]:

    
pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())

So sentence #4 has "at" once, and the first sentence has "bought" once, and the last sentence has "the" three times. But hey, those are garbage words! They're cluttering up our dataframe! We need to add stopwords!



In [22]:

    
# We'll make a new vectorizer
count_vectorizer = CountVectorizer(stop_words='english') 

# .fit_transfer TOKENIZES and COUNTS
X = count_vectorizer.fit_transform(texts)
print(count_vectorizer.get_feature_names())









    



['ate', 'blue', 'bought', 'bright', 'bug', 'cat', 'fish', 'fishes', 'meowed', 'meowing', 'orange', 'penny', 'saw', 'store', 'went']



In [23]:

    
pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())

I still see meowed and meowing and fish and fishes - they seem the same, so let's lemmatize/stem them.

You can specify a preprocessor or a tokenizer when you're creating your CountVectorizer to do custom stuff on your words. Maybe we want to get rid of punctuation, lowercase things and split them on spaces (this is basically the default). preprocessor is supposed to return a string, so it's a little easier to work with.



In [24]:

    
# This is what our normal tokenizer looks like
def boring_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    return words

count_vectorizer = CountVectorizer(stop_words='english', tokenizer=boring_tokenizer)
X = count_vectorizer.fit_transform(texts)
print(count_vectorizer.get_feature_names())









    



['ate', 'blue', 'bought', 'bright', 'bug', 'cat', 'fish', 'fishes', 'meowed', 'meowing', 'orange', 'penny', 'saw', 'store', 'went']

We're going to use one that features a STEMMER - something that strips the endings off of words (or tries to, at least). This one is from nltk.



In [25]:

    
from nltk.stem.porter import PorterStemmer #it doesn't know what words are, it just chop off the ends from the words
porter_stemmer = PorterStemmer()
# 
print(porter_stemmer.stem('fishes'))
print(porter_stemmer.stem('meowed'))
print(porter_stemmer.stem('oranges'))
print(porter_stemmer.stem('meowing'))
print(porter_stemmer.stem('organge'))









    



fish
meow
orang
meow
organg



In [26]:

    
porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

count_vectorizer = CountVectorizer(stop_words='english', tokenizer=boring_tokenizer)
X = count_vectorizer.fit_transform(texts)
print(count_vectorizer.get_feature_names())









    



['ate', 'blue', 'bought', 'bright', 'bug', 'cat', 'fish', 'fishes', 'meowed', 'meowing', 'orange', 'penny', 'saw', 'store', 'went']

Now lets look at the new version of that dataframe.



In [27]:

    
pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())

TF-IDF

Part One: Term Frequency

TF-IDF? What? It means term frequency inverse document frequency! It's the most important thing. Let's look at our list of phrases

Penny bought bright blue fishes.
Penny bought bright blue and orange fish.
The cat ate a fish at the store.
Penny went to the store. Penny ate a bug. Penny saw a fish.
It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish.
The cat is fat. The cat is orange. The cat is meowing at the fish.
Penny is a fish

If we're searching for the word fish, which is the most helpful phrase?



In [ ]:

    
pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())

Probably the one where fish appears three times.

It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish.

But are all the others the same?

Penny is a fish.

Penny went to the store. Penny ate a bug. Penny saw a fish.

In the second one we spend less time talking about the fish. Think about a huge long document where they say your name once, versus a tweet where they say your name once. Which one are you more important in? Probably the tweet, since you take up a larger percentage of the text.

This is term frequency - taking into account how often a term shows up. We're going to take this into account by using the TfidfVectorizer in the same way we used the CountVectorizer.



In [28]:

    
from sklearn.feature_extraction.text import TfidfVectorizer



In [29]:

    
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=False, norm='l1')
X = tfidf_vectorizer.fit_transform(texts)
pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names())

Now our numbers have shifted a little bit. Instead of just being a count, it's the percentage of the words.

value = (number of times word appears in sentence) / (number of words in sentence)

After we remove the stopwords, the term fish is 50% of the words in Penny is a fish vs. 37.5% in It meowed once at the fish, it is still meowing at the fish. It meowed at the bug and the fish..

Note: We made it be the percentage of the words by passing in norm="l1" - by default it's normally an L2 (Euclidean) norm, which is actually better, but I thought it would make more sense using the L1 - a.k.a. terms divided by words -norm.

So now when we search we'll get more relevant results because it takes into account whether half of our words are fish or 1% of millions upon millions of words is fish. But we aren't done yet!

Part Two: Inverse document frequency

Let's say we're searching for "fish meow"



In [30]:

    
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=False, norm='l1')
X = tfidf_vectorizer.fit_transform(texts)
df = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names())
df

What's the highest combined? for 'fish' and 'meow'?



In [31]:

    
# Just add the columns together
pd.DataFrame([df['fish'], df['meow'], df['fish'] + df['meow']], index=["fish", "meow", "fish + meow"]).T









    Out[31]:






  
    
      
      fish
      meow
      fish + meow
    
  
  
    
      0
      0.200000
      0.000000
      0.200000
    
    
      1
      0.166667
      0.000000
      0.166667
    
    
      2
      0.250000
      0.000000
      0.250000
    
    
      3
      0.111111
      0.000000
      0.111111
    
    
      4
      0.166667
      0.333333
      0.500000
    
    
      5
      0.142857
      0.142857
      0.285714
    
    
      6
      0.500000
      0.000000
      0.500000

Indices 4 and 6 (numbers 5 and 7) are tied - but meow never even appears in one of them!

It meowed once at the bug, it is still meowing at the bug and the fish
Penny is a fish

It seems like since fish shows up again and again it should be weighted a little less - not like it's a stopword, but just... it's kind of cliche to have it show up in the text, so we want to make it less important.

This is inverse term frequency - the more often a term shows up across all documents, the less important it is in our matrix.



In [32]:

    
# use_idf=True is default, but I'll leave it in,idf inverse document frequency
idf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=True, norm='l1')
X = idf_vectorizer.fit_transform(texts)
idf_df = pd.DataFrame(X.toarray(), columns=idf_vectorizer.get_feature_names())
idf_df

Let's take a look at our OLD values, then our NEW values, just for meow and fish.



In [34]:

    
# OLD dataframe
pd.DataFrame([df['fish'], df['meow'], df['fish'] + df['meow']], index=["fish", "meow", "fish + meow"]).T









    Out[34]:






  
    
      
      fish
      meow
      fish + meow
    
  
  
    
      0
      0.200000
      0.000000
      0.200000
    
    
      1
      0.166667
      0.000000
      0.166667
    
    
      2
      0.250000
      0.000000
      0.250000
    
    
      3
      0.111111
      0.000000
      0.111111
    
    
      4
      0.166667
      0.333333
      0.500000
    
    
      5
      0.142857
      0.142857
      0.285714
    
    
      6
      0.500000
      0.000000
      0.500000



In [33]:

    
# NEW dataframe
pd.DataFrame([idf_df['fish'], idf_df['meow'], idf_df['fish'] + idf_df['meow']], index=["fish", "meow", "fish + meow"]).T









    Out[33]:






  
    
      
      fish
      meow
      fish + meow
    
  
  
    
      0
      0.118871
      0.000000
      0.118871
    
    
      1
      0.096216
      0.000000
      0.096216
    
    
      2
      0.150267
      0.000000
      0.150267
    
    
      3
      0.063142
      0.000000
      0.063142
    
    
      4
      0.088420
      0.350291
      0.438712
    
    
      5
      0.079382
      0.157242
      0.236625
    
    
      6
      0.404858
      0.000000
      0.404858

Notice how 'meow' increased in value because it's an infrequent term, and fish dropped in value because it's so frequent.

That meowing one (index 4) has gone from 0.50 to 0.43, while Penny is a fish (index 6) has dropped to 0.40. Now hooray, the meowing one is going to show up earlier when searching for "fish meow" because fish shows up all of the time, so we want to ignore it a lil' bit.

But honestly I wasn't very impressed by that drop.

And this is why defaults are important: let's try changing it to norm='l2' (or just removing norm completely).



In [46]:

    
# use_idf=True is default, but I'll leave it in
l2_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=True)
X = l2_vectorizer.fit_transform(texts)
l2_df = pd.DataFrame(X.toarray(), columns=l2_vectorizer.get_feature_names())
l2_df



In [47]:

    
# normal TF-IDF dataframe
pd.DataFrame([idf_df['fish'], idf_df['meow'], idf_df['fish'] + idf_df['meow']], index=["fish", "meow", "fish + meow"]).T









    Out[47]:






  
    
      
      fish
      meow
      fish + meow
    
  
  
    
      0
      0.118871
      0.000000
      0.118871
    
    
      1
      0.096216
      0.000000
      0.096216
    
    
      2
      0.150267
      0.000000
      0.150267
    
    
      3
      0.063142
      0.000000
      0.063142
    
    
      4
      0.088420
      0.350291
      0.438712
    
    
      5
      0.079382
      0.157242
      0.236625
    
    
      6
      0.404858
      0.000000
      0.404858



In [48]:

    
# L2 norm TF-IDF dataframe
pd.DataFrame([l2_df['fish'], l2_df['meow'], l2_df['fish'] + l2_df['meow']], index=["fish", "meow", "fish + meow"]).T









    Out[48]:






  
    
      
      fish
      meow
      fish + meow
    
  
  
    
      0
      0.258786
      0.000000
      0.258786
    
    
      1
      0.230292
      0.000000
      0.230292
    
    
      2
      0.292176
      0.000000
      0.292176
    
    
      3
      0.153301
      0.000000
      0.153301
    
    
      4
      0.162043
      0.641958
      0.804000
    
    
      5
      0.145823
      0.288850
      0.434673
    
    
      6
      0.562463
      0.000000
      0.562463

LOOK AT HOW IMPORTANT MEOW IS. Meowing is out of this world important, because no one ever meows.

Who cares? Why do we need to know this?

When someone dumps 100,000 documents on your desk in response to FOIA, you'll start to care! One of the reasons understanding TF-IDF is important is because of document similarity. By knowing what documents are similar you're able to find related documents and automatically group documents into clusters.

For example! Let's cluster these documents using K-Means clustering (check out this gif)

2 categories of documents



In [39]:

    
# Initialize a vectorizer
vectorizer = TfidfVectorizer(use_idf=True, tokenizer=boring_tokenizer, stop_words='english')
X = vectorizer.fit_transform(texts) #fit_transform



In [40]:

    
# KMeans clustering is a method of clustering.
from sklearn.cluster import KMeans

number_of_clusters = 2
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)









    Out[40]:





KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)



In [41]:

    
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_ten_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))









    



Top terms per cluster:
Cluster 0: cat store ate fish orange
Cluster 1: penny fish bug bright bought



In [42]:

    
results = pd.DataFrame()
results['text'] = texts
results['category'] = km.labels_
results









    Out[42]:






  
    
      
      text
      category
    
  
  
    
      0
      Penny bought bright blue fishes.
      1
    
    
      1
      Penny bought bright blue and orange fish.
      1
    
    
      2
      The cat ate a fish at the store.
      0
    
    
      3
      Penny went to the store. Penny ate a bug. Penn...
      1
    
    
      4
      It meowed once at the bug, it is still meowing...
      1
    
    
      5
      The cat is at the store. The cat is orange. Th...
      0
    
    
      6
      Penny is a fish
      1

4 categories of documents



In [43]:

    
from sklearn.cluster import KMeans

number_of_clusters = 4
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)









    Out[43]:





KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=4, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)



In [44]:

    
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_ten_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))









    



Top terms per cluster:
Cluster 0: cat store ate fish orange
Cluster 1: bright bought blue penny fishes
Cluster 2: bug meowed meowing fish went
Cluster 3: penny fish went saw bug



In [45]:

    
results = pd.DataFrame()
results['text'] = texts
results['category'] = km.labels_
results









    Out[45]:






  
    
      
      text
      category
    
  
  
    
      0
      Penny bought bright blue fishes.
      1
    
    
      1
      Penny bought bright blue and orange fish.
      1
    
    
      2
      The cat ate a fish at the store.
      0
    
    
      3
      Penny went to the store. Penny ate a bug. Penn...
      3
    
    
      4
      It meowed once at the bug, it is still meowing...
      2
    
    
      5
      The cat is at the store. The cat is orange. Th...
      0
    
    
      6
      Penny is a fish
      3



In [ ]:

    
max_features: number of tokens



In [49]:

    
ax=df.plot(kind='scatter',x='fish',y='penni',alpha=0.25)
ax.set_xlabel("Fish")
ax.set_ylabel("Penny")









    Out[49]:





<matplotlib.text.Text at 0x109064c18>



In [ ]:

    
import matplotlib.pyplot as plt

color_list=['r','b','g','y']
colors = [color_list[i]] for i in df['category']
ax.scatter(df['fish'])
ax.set

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22
0	0	0	0	1	1	1	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0
1	1	0	0	1	1	1	0	0	1	0	0	0	0	0	0	1	1	0	0	0	0	0	0
2	0	1	1	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	1	2	0	0
3	0	0	1	0	0	0	1	0	1	0	0	0	0	0	0	0	3	1	0	1	1	1	1
4	1	2	0	0	0	0	2	0	1	0	1	2	1	1	1	0	0	0	1	0	3	0	0
5	0	2	0	0	0	0	0	3	1	0	3	0	0	1	0	1	0	0	0	1	5	0	0
6	0	0	0	0	0	0	0	0	1	0	1	0	0	0	0	0	1	0	0	0	0	0	0

	and	at	ate	blue	bought	bright	bug	cat	fish	fishes	is	it	meowed	meowing	once	orange	penny	saw	still	store	the	to	went
0	0	0	0	1	1	1	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0
1	1	0	0	1	1	1	0	0	1	0	0	0	0	0	0	1	1	0	0	0	0	0	0
2	0	1	1	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	1	2	0	0
3	0	0	1	0	0	0	1	0	1	0	0	0	0	0	0	0	3	1	0	1	1	1	1
4	1	2	0	0	0	0	2	0	1	0	1	2	1	1	1	0	0	0	1	0	3	0	0
5	0	2	0	0	0	0	0	3	1	0	3	0	0	1	0	1	0	0	0	1	5	0	0
6	0	0	0	0	0	0	0	0	1	0	1	0	0	0	0	0	1	0	0	0	0	0	0

	ate	blue	bought	bright	bug	cat	fish	fishes	meowed	meowing	orange	penny	saw	store	went
0	0	1	1	1	0	0	0	1	0	0	0	1	0	0	0
1	0	1	1	1	0	0	1	0	0	0	1	1	0	0	0
2	1	0	0	0	0	1	1	0	0	0	0	0	0	1	0
3	1	0	0	0	1	0	1	0	0	0	0	3	1	1	1
4	0	0	0	0	2	0	1	0	1	1	0	0	0	0	0
5	0	0	0	0	0	3	1	0	0	1	1	0	0	1	0
6	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0

	ate	blue	bought	bright	bug	cat	fish	fishes	meowed	meowing	orange	penny	saw	store	went
0	0	1	1	1	0	0	0	1	0	0	0	1	0	0	0
1	0	1	1	1	0	0	1	0	0	0	1	1	0	0	0
2	1	0	0	0	0	1	1	0	0	0	0	0	0	1	0
3	1	0	0	0	1	0	1	0	0	0	0	3	1	1	1
4	0	0	0	0	2	0	1	0	1	1	0	0	0	0	0
5	0	0	0	0	0	3	1	0	0	1	1	0	0	1	0
6	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0

	ate	blue	bought	bright	bug	cat	fish	meow	onc	orang	penni	saw	store	went
0	0.000000	0.200000	0.200000	0.200000	0.000000	0.000000	0.200000	0.000000	0.000000	0.000000	0.200000	0.000000	0.000000	0.000000
1	0.000000	0.166667	0.166667	0.166667	0.000000	0.000000	0.166667	0.000000	0.000000	0.166667	0.166667	0.000000	0.000000	0.000000
2	0.250000	0.000000	0.000000	0.000000	0.000000	0.250000	0.250000	0.000000	0.000000	0.000000	0.000000	0.000000	0.250000	0.000000
3	0.111111	0.000000	0.000000	0.000000	0.111111	0.000000	0.111111	0.000000	0.000000	0.000000	0.333333	0.111111	0.111111	0.111111
4	0.000000	0.000000	0.000000	0.000000	0.333333	0.000000	0.166667	0.333333	0.166667	0.000000	0.000000	0.000000	0.000000	0.000000
5	0.000000	0.000000	0.000000	0.000000	0.000000	0.428571	0.142857	0.142857	0.000000	0.142857	0.000000	0.000000	0.142857	0.000000
6	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	0.000000	0.000000	0.000000	0.500000	0.000000	0.000000	0.000000

	ate	blue	bought	bright	bug	cat	fish	meow	onc	orang	penni	saw	store	went
0	0.000000	0.235463	0.235463	0.235463	0.000000	0.000000	0.118871	0.000000	0.000000	0.000000	0.174741	0.000000	0.000000	0.000000
1	0.000000	0.190587	0.190587	0.190587	0.000000	0.000000	0.096216	0.000000	0.000000	0.190587	0.141437	0.000000	0.000000	0.000000
2	0.297654	0.000000	0.000000	0.000000	0.000000	0.297654	0.150267	0.000000	0.000000	0.000000	0.000000	0.000000	0.254425	0.000000
3	0.125073	0.000000	0.000000	0.000000	0.125073	0.000000	0.063142	0.000000	0.000000	0.000000	0.278455	0.150675	0.106908	0.150675
4	0.000000	0.000000	0.000000	0.000000	0.350291	0.000000	0.088420	0.350291	0.210997	0.000000	0.000000	0.000000	0.000000	0.000000
5	0.000000	0.000000	0.000000	0.000000	0.000000	0.471727	0.079382	0.157242	0.000000	0.157242	0.000000	0.000000	0.134406	0.000000
6	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.404858	0.000000	0.000000	0.000000	0.595142	0.000000	0.000000	0.000000

	ate	blue	bought	bright	bug	cat	fish	meow	onc	orang	penni	saw	store	went
0	0.000000	0.512612	0.512612	0.512612	0.000000	0.000000	0.258786	0.000000	0.000000	0.00000	0.380417	0.000000	0.000000	0.000000
1	0.000000	0.456170	0.456170	0.456170	0.000000	0.000000	0.230292	0.000000	0.000000	0.45617	0.338530	0.000000	0.000000	0.000000
2	0.578752	0.000000	0.000000	0.000000	0.000000	0.578752	0.292176	0.000000	0.000000	0.00000	0.000000	0.000000	0.494698	0.000000
3	0.303663	0.000000	0.000000	0.000000	0.303663	0.000000	0.153301	0.000000	0.000000	0.00000	0.676058	0.365821	0.259561	0.365821
4	0.000000	0.000000	0.000000	0.000000	0.641958	0.000000	0.162043	0.641958	0.386682	0.00000	0.000000	0.000000	0.000000	0.000000
5	0.000000	0.000000	0.000000	0.000000	0.000000	0.866550	0.145823	0.288850	0.000000	0.28885	0.000000	0.000000	0.246899	0.000000
6	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.562463	0.000000	0.000000	0.00000	0.826823	0.000000	0.000000	0.000000

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22
0	0	0	0	1	1	1	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0
1	1	0	0	1	1	1	0	0	1	0	0	0	0	0	0	1	1	0	0	0	0	0	0
2	0	1	1	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	1	2	0	0
3	0	0	1	0	0	0	1	0	1	0	0	0	0	0	0	0	3	1	0	1	1	1	1
4	1	2	0	0	0	0	2	0	1	0	1	2	1	1	1	0	0	0	1	0	3	0	0
5	0	2	0	0	0	0	0	3	1	0	3	0	0	1	0	1	0	0	0	1	5	0	0
6	0	0	0	0	0	0	0	0	1	0	1	0	0	0	0	0	1	0	0	0	0	0	0

	and	at	ate	blue	bought	bright	bug	cat	fish	fishes	is	it	meowed	meowing	once	orange	penny	saw	still	store	the	to	went
0	0	0	0	1	1	1	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0
1	1	0	0	1	1	1	0	0	1	0	0	0	0	0	0	1	1	0	0	0	0	0	0
2	0	1	1	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	1	2	0	0
3	0	0	1	0	0	0	1	0	1	0	0	0	0	0	0	0	3	1	0	1	1	1	1
4	1	2	0	0	0	0	2	0	1	0	1	2	1	1	1	0	0	0	1	0	3	0	0
5	0	2	0	0	0	0	0	3	1	0	3	0	0	1	0	1	0	0	0	1	5	0	0
6	0	0	0	0	0	0	0	0	1	0	1	0	0	0	0	0	1	0	0	0	0	0	0

	ate	blue	bought	bright	bug	cat	fish	fishes	meowed	meowing	orange	penny	saw	store	went
0	0	1	1	1	0	0	0	1	0	0	0	1	0	0	0
1	0	1	1	1	0	0	1	0	0	0	1	1	0	0	0
2	1	0	0	0	0	1	1	0	0	0	0	0	0	1	0
3	1	0	0	0	1	0	1	0	0	0	0	3	1	1	1
4	0	0	0	0	2	0	1	0	1	1	0	0	0	0	0
5	0	0	0	0	0	3	1	0	0	1	1	0	0	1	0
6	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0

	ate	blue	bought	bright	bug	cat	fish	fishes	meowed	meowing	orange	penny	saw	store	went
0	0	1	1	1	0	0	0	1	0	0	0	1	0	0	0
1	0	1	1	1	0	0	1	0	0	0	1	1	0	0	0
2	1	0	0	0	0	1	1	0	0	0	0	0	0	1	0
3	1	0	0	0	1	0	1	0	0	0	0	3	1	1	1
4	0	0	0	0	2	0	1	0	1	1	0	0	0	0	0
5	0	0	0	0	0	3	1	0	0	1	1	0	0	1	0
6	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22
0	0	0	0	1	1	1	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0
1	1	0	0	1	1	1	0	0	1	0	0	0	0	0	0	1	1	0	0	0	0	0	0
2	0	1	1	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	1	2	0	0
3	0	0	1	0	0	0	1	0	1	0	0	0	0	0	0	0	3	1	0	1	1	1	1
4	1	2	0	0	0	0	2	0	1	0	1	2	1	1	1	0	0	0	1	0	3	0	0
5	0	2	0	0	0	0	0	3	1	0	3	0	0	1	0	1	0	0	0	1	5	0	0
6	0	0	0	0	0	0	0	0	1	0	1	0	0	0	0	0	1	0	0	0	0	0	0

	and	at	ate	blue	bought	bright	bug	cat	fish	fishes	is	it	meowed	meowing	once	orange	penny	saw	still	store	the	to	went
0	0	0	0	1	1	1	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0
1	1	0	0	1	1	1	0	0	1	0	0	0	0	0	0	1	1	0	0	0	0	0	0
2	0	1	1	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	1	2	0	0
3	0	0	1	0	0	0	1	0	1	0	0	0	0	0	0	0	3	1	0	1	1	1	1
4	1	2	0	0	0	0	2	0	1	0	1	2	1	1	1	0	0	0	1	0	3	0	0
5	0	2	0	0	0	0	0	3	1	0	3	0	0	1	0	1	0	0	0	1	5	0	0
6	0	0	0	0	0	0	0	0	1	0	1	0	0	0	0	0	1	0	0	0	0	0	0

	ate	blue	bought	bright	bug	cat	fish	fishes	meowed	meowing	orange	penny	saw	store	went
0	0	1	1	1	0	0	0	1	0	0	0	1	0	0	0
1	0	1	1	1	0	0	1	0	0	0	1	1	0	0	0
2	1	0	0	0	0	1	1	0	0	0	0	0	0	1	0
3	1	0	0	0	1	0	1	0	0	0	0	3	1	1	1
4	0	0	0	0	2	0	1	0	1	1	0	0	0	0	0
5	0	0	0	0	0	3	1	0	0	1	1	0	0	1	0
6	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0

	ate	blue	bought	bright	bug	cat	fish	fishes	meowed	meowing	orange	penny	saw	store	went
0	0	1	1	1	0	0	0	1	0	0	0	1	0	0	0
1	0	1	1	1	0	0	1	0	0	0	1	1	0	0	0
2	1	0	0	0	0	1	1	0	0	0	0	0	0	1	0
3	1	0	0	0	1	0	1	0	0	0	0	3	1	1	1
4	0	0	0	0	2	0	1	0	1	1	0	0	0	0	0
5	0	0	0	0	0	3	1	0	0	1	1	0	0	1	0
6	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0