Nearest Neighbors

When exploring a large set of documents -- such as Wikipedia, news articles, StackOverflow, etc. -- it can be useful to get a list of related material. To find relevant documents you typically

Decide on a notion of similarity
Find the documents that are most similar

In the assignment you will

Gain intuition for different notions of similarity and practice finding similar documents.
Explore the tradeoffs with representing documents using raw word counts and TF-IDF
Explore the behavior of different distance metrics by looking at the Wikipedia pages most similar to President Obama’s page.

Note to Amazon EC2 users: To conserve memory, make sure to stop all the other notebooks before running this notebook.

Import necessary packages

As usual we need to first import the Python packages that we will need.

The following code block will check if you have the correct version of GraphLab Create. Any version later than 1.8.5 will do. To upgrade, read this page.



In [1]:

    
import graphlab
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

'''Check GraphLab Create version'''
from distutils.version import StrictVersion
assert (StrictVersion(graphlab.version) >= StrictVersion('1.8.5')), 'GraphLab Create must be version 1.8.5 or later.'









    



[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1475372233.log






    



This non-commercial license of GraphLab Create for academic use is assigned to sudhanshu.shekhar.iitd@gmail.com and will expire on September 18, 2017.

Load Wikipedia dataset

We will be using the same dataset of Wikipedia pages that we used in the Machine Learning Foundations course (Course 1). Each element of the dataset consists of a link to the wikipedia article, the name of the person, and the text of the article (in lowercase).



In [2]:

    
wiki = graphlab.SFrame('people_wiki.gl')



In [3]:

    
wiki









    Out[3]:





    
        URI
        name
        text
    
    
        <http://dbpedia.org/resou
rce/Digby_Morrell> ...
        Digby Morrell
        digby morrell born 10
october 1979 is a former ...
    
    
        <http://dbpedia.org/resou
rce/Alfred_J._Lewy> ...
        Alfred J. Lewy
        alfred j lewy aka sandy
lewy graduated from ...
    
    
        <http://dbpedia.org/resou
rce/Harpdog_Brown> ...
        Harpdog Brown
        harpdog brown is a singer
and harmonica player who ...
    
    
        <http://dbpedia.org/resou
rce/Franz_Rottensteiner> ...
        Franz Rottensteiner
        franz rottensteiner born
in waidmannsfeld lower ...
    
    
        <http://dbpedia.org/resou
rce/G-Enka> ...
        G-Enka
        henry krvits born 30
december 1974 in tallinn ...
    
    
        <http://dbpedia.org/resou
rce/Sam_Henderson> ...
        Sam Henderson
        sam henderson born
october 18 1969 is an ...
    
    
        <http://dbpedia.org/resou
rce/Aaron_LaCrate> ...
        Aaron LaCrate
        aaron lacrate is an
american music producer ...
    
    
        <http://dbpedia.org/resou
rce/Trevor_Ferguson> ...
        Trevor Ferguson
        trevor ferguson aka john
farrow born 11 november ...
    
    
        <http://dbpedia.org/resou
rce/Grant_Nelson> ...
        Grant Nelson
        grant nelson born 27
april 1971 in london  ...
    
    
        <http://dbpedia.org/resou
rce/Cathy_Caruth> ...
        Cathy Caruth
        cathy caruth born 1955 is
frank h t rhodes ...
    

[59071 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Extract word count vectors

As we have seen in Course 1, we can extract word count vectors using a GraphLab utility function. We add this as a column in wiki.



In [4]:

    
wiki['word_count'] = graphlab.text_analytics.count_words(wiki['text'])



In [5]:

    
wiki









    Out[5]:





    
        URI
        name
        text
        word_count
    
    
        <http://dbpedia.org/resou
rce/Digby_Morrell> ...
        Digby Morrell
        digby morrell born 10
october 1979 is a former ...
        {'selection': 1,
'carltons': 1, 'being': ...
    
    
        <http://dbpedia.org/resou
rce/Alfred_J._Lewy> ...
        Alfred J. Lewy
        alfred j lewy aka sandy
lewy graduated from ...
        {'precise': 1, 'thomas':
1, 'closely': 1, ...
    
    
        <http://dbpedia.org/resou
rce/Harpdog_Brown> ...
        Harpdog Brown
        harpdog brown is a singer
and harmonica player who ...
        {'just': 1, 'issued': 1,
'mainly': 1, 'nominat ...
    
    
        <http://dbpedia.org/resou
rce/Franz_Rottensteiner> ...
        Franz Rottensteiner
        franz rottensteiner born
in waidmannsfeld lower ...
        {'all': 1,
'bauforschung': 1, ...
    
    
        <http://dbpedia.org/resou
rce/G-Enka> ...
        G-Enka
        henry krvits born 30
december 1974 in tallinn ...
        {'they': 1,
'gangstergenka': 1, ...
    
    
        <http://dbpedia.org/resou
rce/Sam_Henderson> ...
        Sam Henderson
        sam henderson born
october 18 1969 is an ...
        {'currently': 1, 'less':
1, 'being': 1, ...
    
    
        <http://dbpedia.org/resou
rce/Aaron_LaCrate> ...
        Aaron LaCrate
        aaron lacrate is an
american music producer ...
        {'exclusive': 2,
'producer': 1, 'show' ...
    
    
        <http://dbpedia.org/resou
rce/Trevor_Ferguson> ...
        Trevor Ferguson
        trevor ferguson aka john
farrow born 11 november ...
        {'taxi': 1, 'salon': 1,
'gangs': 1, 'being': 1, ...
    
    
        <http://dbpedia.org/resou
rce/Grant_Nelson> ...
        Grant Nelson
        grant nelson born 27
april 1971 in london  ...
        {'houston': 1, 'frankie':
1, 'labels': 1, ...
    
    
        <http://dbpedia.org/resou
rce/Cathy_Caruth> ...
        Cathy Caruth
        cathy caruth born 1955 is
frank h t rhodes ...
        {'phenomenon': 1,
'deborash': 1, 'both' ...
    

[59071 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Find nearest neighbors

Let's start by finding the nearest neighbors of the Barack Obama page using the word count vectors to represent the articles and Euclidean distance to measure distance. For this, again will we use a GraphLab Create implementation of nearest neighbor search.



In [6]:

    
model = graphlab.nearest_neighbors.create(wiki, label='name', features=['word_count'],
                                          method='brute_force', distance='euclidean')









    




Starting brute force nearest neighbors model training.

Let's look at the top 10 nearest neighbors by performing the following query:



In [7]:

    
model.query(wiki[wiki['name']=='Barack Obama'], label='name', k=10)









    




Starting pairwise querying.






    




+--------------+---------+-------------+--------------+






    




| Query points | # Pairs | % Complete. | Elapsed Time |






    




+--------------+---------+-------------+--------------+






    




| 0            | 1       | 0.00169288  | 10.868ms     |






    




| Done         |         | 100         | 210.75ms     |






    




+--------------+---------+-------------+--------------+






    Out[7]:





    
        query_label
        reference_label
        distance
        rank
    
    
        Barack Obama
        Barack Obama
        0.0
        1
    
    
        Barack Obama
        Joe Biden
        33.0756708171
        2
    
    
        Barack Obama
        George W. Bush
        34.3947670438
        3
    
    
        Barack Obama
        Lawrence Summers
        36.1524549651
        4
    
    
        Barack Obama
        Mitt Romney
        36.1662826401
        5
    
    
        Barack Obama
        Francisco Barrio
        36.3318042492
        6
    
    
        Barack Obama
        Walter Mondale
        36.4005494464
        7
    
    
        Barack Obama
        Wynn Normington Hugh-
Jones ...
        36.4965751818
        8
    
    
        Barack Obama
        Don Bonker
        36.633318168
        9
    
    
        Barack Obama
        Andy Anstett
        36.9594372252
        10
    

[10 rows x 4 columns]

All of the 10 people are politicians, but about half of them have rather tenuous connections with Obama, other than the fact that they are politicians.

Francisco Barrio is a Mexican politician, and a former governor of Chihuahua.
Walter Mondale and Don Bonker are Democrats who made their career in late 1970s.
Wynn Normington Hugh-Jones is a former British diplomat and Liberal Party official.
Andy Anstett is a former politician in Manitoba, Canada.

Nearest neighbors with raw word counts got some things right, showing all politicians in the query result, but missed finer and important details.

For instance, let's find out why Francisco Barrio was considered a close neighbor of Obama. To do this, let's look at the most frequently used words in each of Barack Obama and Francisco Barrio's pages:



In [8]:

    
def top_words(name):
    """
    Get a table of the most frequent words in the given person's wikipedia page.
    """
    row = wiki[wiki['name'] == name]
    word_count_table = row[['word_count']].stack('word_count', new_column_name=['word','count'])
    return word_count_table.sort('count', ascending=False)



In [9]:

    
obama_words = top_words('Barack Obama')
obama_words









    Out[9]:





    
        word
        count
    
    
        the
        40
    
    
        in
        30
    
    
        and
        21
    
    
        of
        18
    
    
        to
        14
    
    
        his
        11
    
    
        obama
        9
    
    
        act
        8
    
    
        he
        7
    
    
        a
        7
    

[273 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [10]:

    
barrio_words = top_words('Francisco Barrio')
barrio_words









    Out[10]:





    
        word
        count
    
    
        the
        36
    
    
        of
        24
    
    
        and
        18
    
    
        in
        17
    
    
        he
        10
    
    
        to
        9
    
    
        chihuahua
        7
    
    
        governor
        6
    
    
        a
        6
    
    
        his
        5
    

[225 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Let's extract the list of most frequent words that appear in both Obama's and Barrio's documents. We've so far sorted all words from Obama and Barrio's articles by their word frequencies. We will now use a dataframe operation known as join. The join operation is very useful when it comes to playing around with data: it lets you combine the content of two tables using a shared column (in this case, the word column). See the documentation for more details.

For instance, running

obama_words.join(barrio_words, on='word')

will extract the rows from both tables that correspond to the common words.



In [11]:

    
combined_words = obama_words.join(barrio_words, on='word')
combined_words









    Out[11]:





    
        word
        count
        count.1
    
    
        the
        40
        36
    
    
        in
        30
        17
    
    
        and
        21
        18
    
    
        of
        18
        24
    
    
        to
        14
        9
    
    
        his
        11
        5
    
    
        he
        7
        10
    
    
        a
        7
        6
    
    
        as
        6
        5
    
    
        was
        5
        4
    

[56 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Since both tables contained the column named count, SFrame automatically renamed one of them to prevent confusion. Let's rename the columns to tell which one is for which. By inspection, we see that the first column (count) is for Obama and the second (count.1) for Barrio.



In [12]:

    
combined_words = combined_words.rename({'count':'Obama', 'count.1':'Barrio'})
combined_words









    Out[12]:





    
        word
        Obama
        Barrio
    
    
        the
        40
        36
    
    
        in
        30
        17
    
    
        and
        21
        18
    
    
        of
        18
        24
    
    
        to
        14
        9
    
    
        his
        11
        5
    
    
        he
        7
        10
    
    
        a
        7
        6
    
    
        as
        6
        5
    
    
        was
        5
        4
    

[56 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Note. The join operation does not enforce any particular ordering on the shared column. So to obtain, say, the five common words that appear most often in Obama's article, sort the combined table by the Obama column. Don't forget ascending=False to display largest counts first.



In [13]:

    
combined_words.sort('Obama', ascending=False)









    Out[13]:





    
        word
        Obama
        Barrio
    
    
        the
        40
        36
    
    
        in
        30
        17
    
    
        and
        21
        18
    
    
        of
        18
        24
    
    
        to
        14
        9
    
    
        his
        11
        5
    
    
        he
        7
        10
    
    
        a
        7
        6
    
    
        as
        6
        5
    
    
        was
        5
        4
    

[56 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Quiz Question. Among the words that appear in both Barack Obama and Francisco Barrio, take the 5 that appear most frequently in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?

Hint:

Refer to the previous paragraph for finding the words that appear in both articles. Sort the common words by their frequencies in Obama's article and take the largest five.
Each word count vector is a Python dictionary. For each word count vector in SFrame, you'd have to check if the set of the 5 common words is a subset of the keys of the word count vector. Complete the function has_top_words to accomplish the task.
- Convert the list of top 5 words into set using the syntax
```
set(common_words)
```
  where common_words is a Python list. See this link if you're curious about Python sets.
- Extract the list of keys of the word count dictionary by calling the keys() method.
- Convert the list of keys into a set as well.
- Use issubset() method to check if all 5 words are among the keys.
Now apply the has_top_words function on every row of the SFrame.
Compute the sum of the result column to obtain the number of articles containing all the 5 top words.



In [22]:

    
common_words = combined_words['word'][:5]
common_words = set(common_words)



In [26]:

    
def has_top_words(word_count_vector):
    # extract the keys of word_count_vector and convert it to a set
    unique_words = set(word_count_vector.keys())
    print "length of unique words = " + str(len(unique_words))
    
    # return True if common_words is a subset of unique_words
    # return False otherwise
    return 1 if common_words.issubset(unique_words) else 0

wiki['has_top_words'] = wiki['word_count'].apply(has_top_words)

# use has_top_words column to answer the quiz question
print "#articles in the Wikipedia dataset contain all of those 5 words = " + str(wiki['has_top_words'].sum())









    



length of unique words = 139
length of unique words = 147
length of unique words = 144
length of unique words = 227
length of unique words = 114
length of unique words = 173
length of unique words = 265
length of unique words = 306
length of unique words = 173
length of unique words = 130
length of unique words = 144
length of unique words = 192
length of unique words = 133
length of unique words = 170
length of unique words = 270
length of unique words = 134
length of unique words = 128
length of unique words = 123
length of unique words = 138
length of unique words = 172
length of unique words = 119
length of unique words = 169
length of unique words = 294
length of unique words = 249
length of unique words = 198
length of unique words = 203
length of unique words = 140
length of unique words = 163
length of unique words = 190
length of unique words = 155
length of unique words = 172
length of unique words = 126
length of unique words = 167
length of unique words = 188
length of unique words = 157
length of unique words = 150
length of unique words = 158
length of unique words = 216
length of unique words = 138
length of unique words = 128
length of unique words = 148
length of unique words = 179
length of unique words = 143
length of unique words = 146
length of unique words = 187
length of unique words = 156
length of unique words = 133
length of unique words = 180
length of unique words = 139
length of unique words = 138
length of unique words = 151
length of unique words = 162
length of unique words = 465
length of unique words = 151
length of unique words = 164
length of unique words = 145
length of unique words = 126
length of unique words = 166
length of unique words = 244
length of unique words = 154
length of unique words = 175
length of unique words = 142
length of unique words = 132
length of unique words = 142
length of unique words = 216
length of unique words = 168
length of unique words = 137
length of unique words = 152
length of unique words = 150
length of unique words = 133
length of unique words = 171
length of unique words = 187
length of unique words = 186
length of unique words = 130
length of unique words = 184
length of unique words = 132
length of unique words = 137
length of unique words = 126
length of unique words = 135
length of unique words = 164
length of unique words = 110
length of unique words = 162
length of unique words = 319
length of unique words = 222
length of unique words = 142
length of unique words = 137
length of unique words = 155
length of unique words = 187
length of unique words = 213
length of unique words = 129
length of unique words = 135
length of unique words = 136
length of unique words = 138
length of unique words = 300
length of unique words = 181
length of unique words = 274
length of unique words = 170
length of unique words = 134
length of unique words = 177
length of unique words = 151
#articles in the Wikipedia dataset contain all of those 5 words = 56066

Checkpoint. Check your has_top_words function on two random articles:



In [27]:

    
print 'Output from your function:', has_top_words(wiki[32]['word_count'])
print 'Correct output: True'
print 'Also check the length of unique_words. It should be 167'









    



Output from your function: length of unique words = 167
1
Correct output: True
Also check the length of unique_words. It should be 167



In [28]:

    
print 'Output from your function:', has_top_words(wiki[33]['word_count'])
print 'Correct output: False'
print 'Also check the length of unique_words. It should be 188'









    



Output from your function: length of unique words = 188
0
Correct output: False
Also check the length of unique_words. It should be 188

Quiz Question. Measure the pairwise distance between the Wikipedia pages of Barack Obama, George W. Bush, and Joe Biden. Which of the three pairs has the smallest distance?

Hint: To compute the Euclidean distance between two dictionaries, use graphlab.toolkits.distances.euclidean. Refer to this link for usage.



In [37]:

    
obama = wiki[wiki['name'] == 'Barack Obama']
bush = wiki[wiki['name'] == 'George W. Bush']
biden = wiki[wiki['name'] == 'Joe Biden']

isinstance(obama['word_count'][0], dict)









    Out[37]:





True



In [39]:

    
# pair-wise distances
obama_bush = graphlab.toolkits.distances.euclidean(obama['word_count'][0], bush['word_count'][0])
print "distance b/w obama and bush = " + str(obama_bush)

obama_biden = graphlab.toolkits.distances.euclidean(obama['word_count'][0], biden['word_count'][0])
print "distance b/w obama and biden = " + str(obama_biden)

bush_biden = graphlab.toolkits.distances.euclidean(biden['word_count'][0], bush['word_count'][0])
print "distance b/w biden and bush = " + str(bush_biden)









    



distance b/w obama and bush = 34.3947670438
distance b/w obama and biden = 33.0756708171
distance b/w biden and bush = 32.7566787083

Quiz Question. Collect all words that appear both in Barack Obama and George W. Bush pages. Out of those words, find the 10 words that show up most often in Obama's page.



In [43]:

    
bush_words = top_words('Francisco Barrio')
bush_words

new_combined_words = obama_words.join(bush_words, on='word')
new_combined_words

new_combined_words = new_combined_words.rename({'count':'Obama', 'count.1':'Bush'})
new_combined_words

new_combined_words.sort('Obama', ascending=False)
new_combined_words.print_rows(10)









    



+------+-------+------+
| word | Obama | Bush |
+------+-------+------+
| the  |   40  |  36  |
|  in  |   30  |  17  |
| and  |   21  |  18  |
|  of  |   18  |  24  |
|  to  |   14  |  9   |
| his  |   11  |  5   |
|  he  |   7   |  10  |
|  a   |   7   |  6   |
|  as  |   6   |  5   |
| was  |   5   |  4   |
+------+-------+------+
[56 rows x 3 columns]

Note. Even though common words are swamping out important subtle differences, commonalities in rarer political words still matter on the margin. This is why politicians are being listed in the query result instead of musicians, for example. In the next subsection, we will introduce a different metric that will place greater emphasis on those rarer words.

TF-IDF to the rescue

Much of the perceived commonalities between Obama and Barrio were due to occurrences of extremely frequent words, such as "the", "and", and "his". So nearest neighbors is recommending plausible results sometimes for the wrong reasons.

To retrieve articles that are more relevant, we should focus more on rare words that don't happen in every article. TF-IDF (term frequency–inverse document frequency) is a feature representation that penalizes words that are too common. Let's use GraphLab Create's implementation of TF-IDF and repeat the search for the 10 nearest neighbors of Barack Obama:



In [44]:

    
wiki['tf_idf'] = graphlab.text_analytics.tf_idf(wiki['word_count'])



In [45]:

    
model_tf_idf = graphlab.nearest_neighbors.create(wiki, label='name', features=['tf_idf'],
                                                 method='brute_force', distance='euclidean')









    




Starting brute force nearest neighbors model training.



In [46]:

    
model_tf_idf.query(wiki[wiki['name'] == 'Barack Obama'], label='name', k=10)









    




Starting pairwise querying.






    




+--------------+---------+-------------+--------------+






    




| Query points | # Pairs | % Complete. | Elapsed Time |






    




+--------------+---------+-------------+--------------+






    




| 0            | 1       | 0.00169288  | 15.143ms     |






    




| Done         |         | 100         | 265.154ms    |






    




+--------------+---------+-------------+--------------+






    Out[46]:





    
        query_label
        reference_label
        distance
        rank
    
    
        Barack Obama
        Barack Obama
        0.0
        1
    
    
        Barack Obama
        Phil Schiliro
        106.861013691
        2
    
    
        Barack Obama
        Jeff Sessions
        108.871674216
        3
    
    
        Barack Obama
        Jesse Lee (politician)
        109.045697909
        4
    
    
        Barack Obama
        Samantha Power
        109.108106165
        5
    
    
        Barack Obama
        Bob Menendez
        109.781867105
        6
    
    
        Barack Obama
        Eric Stern (politician)
        109.95778808
        7
    
    
        Barack Obama
        James A. Guest
        110.413888718
        8
    
    
        Barack Obama
        Roland Grossenbacher
        110.4706087
        9
    
    
        Barack Obama
        Tulsi Gabbard
        110.696997999
        10
    

[10 rows x 4 columns]

Let's determine whether this list makes sense.

With a notable exception of Roland Grossenbacher, the other 8 are all American politicians who are contemporaries of Barack Obama.
Phil Schiliro, Jesse Lee, Samantha Power, and Eric Stern worked for Obama.

Clearly, the results are more plausible with the use of TF-IDF. Let's take a look at the word vector for Obama and Schilirio's pages. Notice that TF-IDF representation assigns a weight to each word. This weight captures relative importance of that word in the document. Let us sort the words in Obama's article by their TF-IDF weights; we do the same for Schiliro's article as well.



In [47]:

    
def top_words_tf_idf(name):
    row = wiki[wiki['name'] == name]
    word_count_table = row[['tf_idf']].stack('tf_idf', new_column_name=['word','weight'])
    return word_count_table.sort('weight', ascending=False)



In [48]:

    
obama_tf_idf = top_words_tf_idf('Barack Obama')
obama_tf_idf









    Out[48]:





    
        word
        weight
    
    
        obama
        43.2956530721
    
    
        act
        27.678222623
    
    
        iraq
        17.747378588
    
    
        control
        14.8870608452
    
    
        law
        14.7229357618
    
    
        ordered
        14.5333739509
    
    
        military
        13.1159327785
    
    
        involvement
        12.7843852412
    
    
        response
        12.7843852412
    
    
        democratic
        12.4106886973
    

[273 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [49]:

    
schiliro_tf_idf = top_words_tf_idf('Phil Schiliro')
schiliro_tf_idf









    Out[49]:





    
        word
        weight
    
    
        schiliro
        21.9729907785
    
    
        staff
        15.8564416352
    
    
        congressional
        13.5470876563
    
    
        daschleschiliro
        10.9864953892
    
    
        obama
        9.62125623824
    
    
        waxman
        9.04058524017
    
    
        president
        9.03358661416
    
    
        2014from
        8.68391029623
    
    
        law
        7.36146788088
    
    
        consultant
        6.91310403725
    

[119 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Using the join operation we learned earlier, try your hands at computing the common words shared by Obama's and Schiliro's articles. Sort the common words by their TF-IDF weights in Obama's document.



In [52]:

    
combined_words_tf_idf = obama_tf_idf.join(schiliro_tf_idf, on='word')
combined_words_tf_idf

combined_words_tf_idf = combined_words_tf_idf.rename({'weight': 'Obama', 'weight.1' : 'Schiliro'})
combined_words_tf_idf

combined_words_tf_idf.sort('Obama', ascending=False)
combined_words_tf_idf.print_rows(10)









    



+--------------+---------------+---------------+
|     word     |     Obama     |    Schiliro   |
+--------------+---------------+---------------+
|    obama     | 43.2956530721 | 9.62125623824 |
|     law      | 14.7229357618 | 7.36146788088 |
|  democratic  | 12.4106886973 | 6.20534434867 |
|    senate    | 10.1642881797 |  3.3880960599 |
| presidential |  7.3869554189 | 3.69347770945 |
|  president   | 7.22686929133 | 9.03358661416 |
|    policy    | 6.09538628214 | 3.04769314107 |
|    states    | 5.47320098963 | 1.82440032988 |
|    office    | 5.24817282322 | 2.62408641161 |
|     2011     | 5.10704127031 | 3.40469418021 |
+--------------+---------------+---------------+
[47 rows x 3 columns]

The first 10 words should say: Obama, law, democratic, Senate, presidential, president, policy, states, office, 2011.

Quiz Question. Among the words that appear in both Barack Obama and Phil Schiliro, take the 5 that have largest weights in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?



In [55]:

    
common_words = set(combined_words_tf_idf['word'][:5])
common_words









    Out[55]:





{'democratic', 'law', 'obama', 'presidential', 'senate'}



In [57]:

    
def has_top_words(word_count_vector):
    # extract the keys of word_count_vector and convert it to a set
    unique_words = set(word_count_vector.keys())
    
    # return True if common_words is a subset of unique_words
    # return False otherwise
    return 1 if common_words.issubset(unique_words) else 0

wiki['has_top_words'] = wiki['word_count'].apply(has_top_words)

# use has_top_words column to answer the quiz question
print "#articles in the Wikipedia dataset contain all of those 5 words = " + str(wiki['has_top_words'].sum())









    



#articles in the Wikipedia dataset contain all of those 5 words = 14

Notice the huge difference in this calculation using TF-IDF scores instead of raw word counts. We've eliminated noise arising from extremely common words.

Choosing metrics

You may wonder why Joe Biden, Obama's running mate in two presidential elections, is missing from the query results of model_tf_idf. Let's find out why. First, compute the distance between TF-IDF features of Obama and Biden.

Quiz Question. Compute the Euclidean distance between TF-IDF features of Obama and Biden. Hint: When using Boolean filter in SFrame/SArray, take the index 0 to access the first match.



In [58]:

    
obama = wiki[wiki['name'] == 'Barack Obama']
biden = wiki[wiki['name'] == 'Joe Biden']

obama_biden = graphlab.toolkits.distances.euclidean(obama['tf_idf'][0], biden['tf_idf'][0])
print "distance between obama and biden based on tf-idf = " + str(obama_biden)









    



distance between obama and biden based on tf-idf = 123.29745601

The distance is larger than the distances we found for the 10 nearest neighbors, which we repeat here for readability:



In [59]:

    
model_tf_idf.query(wiki[wiki['name'] == 'Barack Obama'], label='name', k=10)









    




Starting pairwise querying.






    




+--------------+---------+-------------+--------------+






    




| Query points | # Pairs | % Complete. | Elapsed Time |






    




+--------------+---------+-------------+--------------+






    




| 0            | 1       | 0.00169288  | 14.458ms     |






    




| Done         |         | 100         | 266.854ms    |






    




+--------------+---------+-------------+--------------+






    Out[59]:





    
        query_label
        reference_label
        distance
        rank
    
    
        Barack Obama
        Barack Obama
        0.0
        1
    
    
        Barack Obama
        Phil Schiliro
        106.861013691
        2
    
    
        Barack Obama
        Jeff Sessions
        108.871674216
        3
    
    
        Barack Obama
        Jesse Lee (politician)
        109.045697909
        4
    
    
        Barack Obama
        Samantha Power
        109.108106165
        5
    
    
        Barack Obama
        Bob Menendez
        109.781867105
        6
    
    
        Barack Obama
        Eric Stern (politician)
        109.95778808
        7
    
    
        Barack Obama
        James A. Guest
        110.413888718
        8
    
    
        Barack Obama
        Roland Grossenbacher
        110.4706087
        9
    
    
        Barack Obama
        Tulsi Gabbard
        110.696997999
        10
    

[10 rows x 4 columns]

But one may wonder, is Biden's article that different from Obama's, more so than, say, Schiliro's? It turns out that, when we compute nearest neighbors using the Euclidean distances, we unwittingly favor short articles over long ones. Let us compute the length of each Wikipedia document, and examine the document lengths for the 100 nearest neighbors to Obama's page.



In [60]:

    
def compute_length(row):
    return len(row['text'].split(' '))

wiki['length'] = wiki.apply(compute_length)



In [61]:

    
nearest_neighbors_euclidean = model_tf_idf.query(wiki[wiki['name'] == 'Barack Obama'], label='name', k=100)
nearest_neighbors_euclidean = nearest_neighbors_euclidean.join(wiki[['name', 'length']], on={'reference_label':'name'})









    




Starting pairwise querying.






    




+--------------+---------+-------------+--------------+






    




| Query points | # Pairs | % Complete. | Elapsed Time |






    




+--------------+---------+-------------+--------------+






    




| 0            | 1       | 0.00169288  | 13.202ms     |






    




| Done         |         | 100         | 240.009ms    |






    




+--------------+---------+-------------+--------------+



In [62]:

    
nearest_neighbors_euclidean.sort('rank')









    Out[62]:





    
        query_label
        reference_label
        distance
        rank
        length
    
    
        Barack Obama
        Barack Obama
        0.0
        1
        540
    
    
        Barack Obama
        Phil Schiliro
        106.861013691
        2
        208
    
    
        Barack Obama
        Jeff Sessions
        108.871674216
        3
        230
    
    
        Barack Obama
        Jesse Lee (politician)
        109.045697909
        4
        216
    
    
        Barack Obama
        Samantha Power
        109.108106165
        5
        310
    
    
        Barack Obama
        Bob Menendez
        109.781867105
        6
        220
    
    
        Barack Obama
        Eric Stern (politician)
        109.95778808
        7
        255
    
    
        Barack Obama
        James A. Guest
        110.413888718
        8
        215
    
    
        Barack Obama
        Roland Grossenbacher
        110.4706087
        9
        201
    
    
        Barack Obama
        Tulsi Gabbard
        110.696997999
        10
        228
    

[100 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

To see how these document lengths compare to the lengths of other documents in the corpus, let's make a histogram of the document lengths of Obama's 100 nearest neighbors and compare to a histogram of document lengths for all documents.



In [63]:

    
plt.figure(figsize=(10.5,4.5))
plt.hist(wiki['length'], 50, color='k', edgecolor='None', histtype='stepfilled', normed=True,
         label='Entire Wikipedia', zorder=3, alpha=0.8)
plt.hist(nearest_neighbors_euclidean['length'], 50, color='r', edgecolor='None', histtype='stepfilled', normed=True,
         label='100 NNs of Obama (Euclidean)', zorder=10, alpha=0.8)
plt.axvline(x=wiki['length'][wiki['name'] == 'Barack Obama'][0], color='k', linestyle='--', linewidth=4,
           label='Length of Barack Obama', zorder=2)
plt.axvline(x=wiki['length'][wiki['name'] == 'Joe Biden'][0], color='g', linestyle='--', linewidth=4,
           label='Length of Joe Biden', zorder=1)
plt.axis([0, 1000, 0, 0.04])

plt.legend(loc='best', prop={'size':15})
plt.title('Distribution of document length')
plt.xlabel('# of words')
plt.ylabel('Percentage')
plt.rcParams.update({'font.size':16})
plt.tight_layout()

Relative to the rest of Wikipedia, nearest neighbors of Obama are overwhemingly short, most of them being shorter than 300 words. The bias towards short articles is not appropriate in this application as there is really no reason to favor short articles over long articles (they are all Wikipedia articles, after all). Many of the Wikipedia articles are 300 words or more, and both Obama and Biden are over 300 words long.

Note: For the interest of computation time, the dataset given here contains excerpts of the articles rather than full text. For instance, the actual Wikipedia article about Obama is around 25000 words. Do not be surprised by the low numbers shown in the histogram.

Note: Both word-count features and TF-IDF are proportional to word frequencies. While TF-IDF penalizes very common words, longer articles tend to have longer TF-IDF vectors simply because they have more words in them.

To remove this bias, we turn to cosine distances: $$ d(\mathbf{x},\mathbf{y}) = 1 - \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} $$ Cosine distances let us compare word distributions of two articles of varying lengths.

Let us train a new nearest neighbor model, this time with cosine distances. We then repeat the search for Obama's 100 nearest neighbors.



In [64]:

    
model2_tf_idf = graphlab.nearest_neighbors.create(wiki, label='name', features=['tf_idf'],
                                                  method='brute_force', distance='cosine')









    




Starting brute force nearest neighbors model training.



In [65]:

    
nearest_neighbors_cosine = model2_tf_idf.query(wiki[wiki['name'] == 'Barack Obama'], label='name', k=100)
nearest_neighbors_cosine = nearest_neighbors_cosine.join(wiki[['name', 'length']], on={'reference_label':'name'})









    




Starting pairwise querying.






    




+--------------+---------+-------------+--------------+






    




| Query points | # Pairs | % Complete. | Elapsed Time |






    




+--------------+---------+-------------+--------------+






    




| 0            | 1       | 0.00169288  | 13.374ms     |






    




| Done         |         | 100         | 247.814ms    |






    




+--------------+---------+-------------+--------------+



In [66]:

    
nearest_neighbors_cosine.sort('rank')









    Out[66]:





    
        query_label
        reference_label
        distance
        rank
        length
    
    
        Barack Obama
        Barack Obama
        0.0
        1
        540
    
    
        Barack Obama
        Joe Biden
        0.703138676734
        2
        414
    
    
        Barack Obama
        Samantha Power
        0.742981902328
        3
        310
    
    
        Barack Obama
        Hillary Rodham Clinton
        0.758358397887
        4
        580
    
    
        Barack Obama
        Eric Stern (politician)
        0.770561227601
        5
        255
    
    
        Barack Obama
        Robert Gibbs
        0.784677504751
        6
        257
    
    
        Barack Obama
        Eric Holder
        0.788039072943
        7
        232
    
    
        Barack Obama
        Jesse Lee (politician)
        0.790926415366
        8
        216
    
    
        Barack Obama
        Henry Waxman
        0.798322602893
        9
        279
    
    
        Barack Obama
        Joe the Plumber
        0.799466360042
        10
        217
    

[100 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

From a glance at the above table, things look better. For example, we now see Joe Biden as Barack Obama's nearest neighbor! We also see Hillary Clinton on the list. This list looks even more plausible as nearest neighbors of Barack Obama.

Let's make a plot to better visualize the effect of having used cosine distance in place of Euclidean on our TF-IDF vectors.



In [67]:

    
plt.figure(figsize=(10.5,4.5))
plt.figure(figsize=(10.5,4.5))
plt.hist(wiki['length'], 50, color='k', edgecolor='None', histtype='stepfilled', normed=True,
         label='Entire Wikipedia', zorder=3, alpha=0.8)
plt.hist(nearest_neighbors_euclidean['length'], 50, color='r', edgecolor='None', histtype='stepfilled', normed=True,
         label='100 NNs of Obama (Euclidean)', zorder=10, alpha=0.8)
plt.hist(nearest_neighbors_cosine['length'], 50, color='b', edgecolor='None', histtype='stepfilled', normed=True,
         label='100 NNs of Obama (cosine)', zorder=11, alpha=0.8)
plt.axvline(x=wiki['length'][wiki['name'] == 'Barack Obama'][0], color='k', linestyle='--', linewidth=4,
           label='Length of Barack Obama', zorder=2)
plt.axvline(x=wiki['length'][wiki['name'] == 'Joe Biden'][0], color='g', linestyle='--', linewidth=4,
           label='Length of Joe Biden', zorder=1)
plt.axis([0, 1000, 0, 0.04])
plt.legend(loc='best', prop={'size':15})
plt.title('Distribution of document length')
plt.xlabel('# of words')
plt.ylabel('Percentage')
plt.rcParams.update({'font.size': 16})
plt.tight_layout()









    





<matplotlib.figure.Figure at 0x11dea3a50>

Indeed, the 100 nearest neighbors using cosine distance provide a sampling across the range of document lengths, rather than just short articles like Euclidean distance provided.

Moral of the story: In deciding the features and distance measures, check if they produce results that make sense for your particular application.

Problem with cosine distances: tweets vs. long articles

Happily ever after? Not so fast. Cosine distances ignore all document lengths, which may be great in certain situations but not in others. For instance, consider the following (admittedly contrived) example.

+--------------------------------------------------------+
|                                             +--------+ |
|  One that shall not be named                | Follow | |
|  @username                                  +--------+ |
|                                                        |
|  Democratic governments control law in response to     |
|  popular act.                                          |
|                                                        |
|  8:05 AM - 16 May 2016                                 |
|                                                        |
|  Reply   Retweet (1,332)   Like (300)                  |
|                                                        |
+--------------------------------------------------------+

How similar is this tweet to Barack Obama's Wikipedia article? Let's transform the tweet into TF-IDF features, using an encoder fit to the Wikipedia dataset. (That is, let's treat this tweet as an article in our Wikipedia dataset and see what happens.)



In [68]:

    
sf = graphlab.SFrame({'text': ['democratic governments control law in response to popular act']})
sf['word_count'] = graphlab.text_analytics.count_words(sf['text'])

encoder = graphlab.feature_engineering.TFIDF(features=['word_count'], output_column_prefix='tf_idf')
encoder.fit(wiki)
sf = encoder.transform(sf)
sf









    Out[68]:





    
        text
        word_count
        tf_idf.word_count
    
    
        democratic governments
control law in response ...
        {'control': 1,
'democratic': 1, 'in' ...
        {'control':
3.721765211295327, ...
    

[1 rows x 3 columns]

Let's look at the TF-IDF vectors for this tweet and for Barack Obama's Wikipedia entry, just to visually see their differences.



In [69]:

    
tweet_tf_idf = sf[0]['tf_idf.word_count']
tweet_tf_idf









    Out[69]:





{'act': 3.4597778278724887,
 'control': 3.721765211295327,
 'democratic': 3.1026721743330414,
 'governments': 4.167571323949673,
 'in': 0.0009654063501214492,
 'law': 2.4538226269605703,
 'popular': 2.764478952022998,
 'response': 4.261461747058352,
 'to': 0.04694493768179923}



In [70]:

    
obama = wiki[wiki['name'] == 'Barack Obama']
obama









    Out[70]:





    
        URI
        name
        text
        word_count
        has_top_words
    
    
        <http://dbpedia.org/resou
rce/Barack_Obama> ...
        Barack Obama
        barack hussein obama ii
brk husen bm born august ...
        {'operations': 1,
'represent': 1, 'offi ...
        1
    


    
        tf_idf
        length
    
    
        {'operations':
3.811771079388818, ...
        540
    

[? rows x 7 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.

Now, compute the cosine distance between the Barack Obama article and this tweet:



In [71]:

    
obama_tf_idf = obama[0]['tf_idf']
graphlab.toolkits.distances.cosine(obama_tf_idf, tweet_tf_idf)









    Out[71]:





0.7059183777794328

Let's compare this distance to the distance between the Barack Obama article and all of its Wikipedia 10 nearest neighbors:



In [72]:

    
model2_tf_idf.query(obama, label='name', k=10)









    




Starting pairwise querying.






    




+--------------+---------+-------------+--------------+






    




| Query points | # Pairs | % Complete. | Elapsed Time |






    




+--------------+---------+-------------+--------------+






    




| 0            | 1       | 0.00169288  | 23.764ms     |






    




| Done         |         | 100         | 337.913ms    |






    




+--------------+---------+-------------+--------------+






    Out[72]:





    
        query_label
        reference_label
        distance
        rank
    
    
        Barack Obama
        Barack Obama
        0.0
        1
    
    
        Barack Obama
        Joe Biden
        0.703138676734
        2
    
    
        Barack Obama
        Samantha Power
        0.742981902328
        3
    
    
        Barack Obama
        Hillary Rodham Clinton
        0.758358397887
        4
    
    
        Barack Obama
        Eric Stern (politician)
        0.770561227601
        5
    
    
        Barack Obama
        Robert Gibbs
        0.784677504751
        6
    
    
        Barack Obama
        Eric Holder
        0.788039072943
        7
    
    
        Barack Obama
        Jesse Lee (politician)
        0.790926415366
        8
    
    
        Barack Obama
        Henry Waxman
        0.798322602893
        9
    
    
        Barack Obama
        Joe the Plumber
        0.799466360042
        10
    

[10 rows x 4 columns]

With cosine distances, the tweet is "nearer" to Barack Obama than everyone else, except for Joe Biden! This probably is not something we want. If someone is reading the Barack Obama Wikipedia page, would you want to recommend they read this tweet? Ignoring article lengths completely resulted in nonsensical results. In practice, it is common to enforce maximum or minimum document lengths. After all, when someone is reading a long article from The Atlantic, you wouldn't recommend him/her a tweet.



In [ ]:

URI	name	text
<http://dbpedia.org/resou rce/Digby_Morrell> ...	Digby Morrell	digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resou rce/Alfred_J._Lewy> ...	Alfred J. Lewy	alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resou rce/Harpdog_Brown> ...	Harpdog Brown	harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resou rce/Franz_Rottensteiner> ...	Franz Rottensteiner	franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resou rce/G-Enka> ...	G-Enka	henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resou rce/Sam_Henderson> ...	Sam Henderson	sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resou rce/Aaron_LaCrate> ...	Aaron LaCrate	aaron lacrate is an american music producer ...
<http://dbpedia.org/resou rce/Trevor_Ferguson> ...	Trevor Ferguson	trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resou rce/Grant_Nelson> ...	Grant Nelson	grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resou rce/Cathy_Caruth> ...	Cathy Caruth	cathy caruth born 1955 is frank h t rhodes ...

query_label	reference_label	distance	rank
Barack Obama	Barack Obama	0.0	1
Barack Obama	Joe Biden	33.0756708171	2
Barack Obama	George W. Bush	34.3947670438	3
Barack Obama	Lawrence Summers	36.1524549651	4
Barack Obama	Mitt Romney	36.1662826401	5
Barack Obama	Francisco Barrio	36.3318042492	6
Barack Obama	Walter Mondale	36.4005494464	7
Barack Obama	Wynn Normington Hugh- Jones ...	36.4965751818	8
Barack Obama	Don Bonker	36.633318168	9
Barack Obama	Andy Anstett	36.9594372252	10

word	weight
obama	43.2956530721
act	27.678222623
iraq	17.747378588
control	14.8870608452
law	14.7229357618
ordered	14.5333739509
military	13.1159327785
involvement	12.7843852412
response	12.7843852412
democratic	12.4106886973

word	weight
schiliro	21.9729907785
staff	15.8564416352
congressional	13.5470876563
daschleschiliro	10.9864953892
obama	9.62125623824
waxman	9.04058524017
president	9.03358661416
2014from	8.68391029623
law	7.36146788088
consultant	6.91310403725