Document retrieval from Wikipedia data using TF-IDF

Fire up GraphLab Create



In [2]:

    
import graphlab

Load some text data: Wikipedia pages on people



In [7]:

    
people = graphlab.SFrame('people_wiki.gl/')

Let's view the dataset with head, which shows the first few rows. Here we ask for 5 rows.

The data contains: a link to the Wikipedia article, the name of the person, and the text of the article.



In [8]:

    
people.head(5)









    Out[8]:





    
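+-------------------------------------------------------+---------------------+--------------------------------------------------------+
| URI                                                   | name                | text                                                   |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+
| <http://dbpedia.org/resource/Digby_Morrell> ...       | Digby Morrell       | digby morrell born 10 october 1979 is a former ...     |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ...      | Alfred J. Lewy      | alfred j lewy aka sandy lewy graduated from ...        |
| <http://dbpedia.org/resource/Harpdog_Brown> ...       | Harpdog Brown       | harpdog brown is a singer and harmonica player who ... |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ...    |
| <http://dbpedia.org/resource/G-Enka> ...              | G-Enka              | henry krvits born 30 december 1974 in tallinn ...      |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+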
    

[5 rows x 3 columns]



In [9]:

    
len(people)









    Out[9]:





59071

Explore the dataset and check out the text it contains

Exploring the entry for President Obama. Let's take the subset of the data for Obama and store it in the variable obama.



In [10]:

    
obama = people[people['name'] == 'Barack Obama']

The obama variable contains all the information about Barack Obama from the Wikipedia dataset.



In [11]:

    
obama









    Out[11]:





    
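+------------------------------------------------+--------------+------------------------------------------------------+
| URI                                            | name         | text                                                 |
+------------------------------------------------+--------------+------------------------------------------------------+
| <http://dbpedia.org/resource/Barack_Obama> ... | Barack Obama | barack hussein obama ii brk husen bm born august ... |
+------------------------------------------------+--------------+------------------------------------------------------+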
    

[? rows x 3 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.



In [14]:

    
obama['text']

Exploring the entry for actor George Clooney, stored in the variable clooney



In [15]:

    
clooney = people[people['name'] == 'George Clooney']
clooney['text']

Get the word counts for the Obama article and store the result in the same dataset as a new column named word_count



In [16]:

    
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])
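For intuition, the word-count step can be approximated in plain Python. This is a minimal sketch, not GraphLab's implementation: the function name mirrors count_words only for readability, the tokenizer (lowercase runs of letters and digits) is an assumption, and the sample sentence is made up. It is written for Python 3 and does not need GraphLab.

```python
import re
from collections import Counter


def count_words(text):
    """Return a dict mapping each word in `text` to how often it occurs."""
    # Assumed tokenizer: lowercase the text and keep runs of letters/digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(words))


# Made-up sample sentence, not taken from the dataset.
sample = "barack obama is the 44th president and obama served in the senate"
counts = count_words(sample)
print(counts["obama"], counts["the"])  # 2 2
```

The result has the same shape as one entry of the word_count column: a dictionary from word to count.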



In [17]:

    
print obama['word_count']









    



[{'operations': 1, 'represent': 1, 'office': 2, 'unemployment': 1, 'doddfrank': 1, 'over': 1, 'unconstitutional': 1, 'domestic': 2, 'major': 1, 'years': 1, 'against': 1, 'proposition': 1, 'seats': 1, 'graduate': 1, 'debate': 1, 'before': 1, 'death': 1, '20': 2, 'taxpayer': 1, 'representing': 1, 'obamacare': 1, 'barack': 1, 'to': 14, '4': 1, 'policy': 2, '8': 1, 'he': 7, '2011': 3, '2010': 2, '2013': 1, '2012': 1, 'bin': 1, 'then': 1, 'his': 11, 'march': 1, 'gains': 1, 'cuba': 1, 'school': 3, '1992': 1, 'new': 1, 'not': 1, 'during': 2, 'ending': 1, 'continued': 1, 'presidential': 2, 'states': 3, 'husen': 1, 'osama': 1, 'californias': 1, 'equality': 1, 'prize': 1, 'lost': 1, 'made': 1, 'inaugurated': 1, 'january': 3, 'university': 2, 'rights': 1, 'july': 1, 'gun': 1, 'stimulus': 1, 'rodham': 1, 'troop': 1, 'withdrawal': 1, 'brk': 1, 'nine': 1, 'where': 1, 'referred': 1, 'affordable': 1, 'attorney': 1, 'on': 2, 'often': 1, 'senate': 3, 'regained': 1, 'national': 2, 'creation': 1, 'related': 1, 'hawaii': 1, 'born': 2, 'second': 2, 'defense': 1, 'election': 3, 'close': 1, 'operation': 1, 'insurance': 1, 'sandy': 1, 'afghanistan': 2, 'initiatives': 1, 'for': 4, 'reform': 1, 'house': 2, 'review': 1, 'representatives': 2, 'ended': 1, 'current': 1, 'state': 1, 'won': 1, 'limit': 1, 'victory': 1, 'unsuccessfully': 1, 'reauthorization': 1, 'keynote': 1, 'full': 1, 'patient': 1, 'august': 1, 'degree': 1, '44th': 1, 'bm': 1, 'mitt': 1, 'attention': 1, 'delegates': 1, 'lgbt': 1, 'job': 1, 'harvard': 2, 'term': 3, 'served': 2, 'ask': 1, 'november': 2, 'debt': 1, 'by': 1, 'wall': 1, 'care': 1, 'received': 1, 'great': 1, 'signed': 3, 'libya': 1, 'receive': 1, 'of': 18, 'months': 1, 'urged': 1, 'foreign': 2, 'american': 3, 'protection': 2, 'economic': 1, 'act': 8, 'military': 4, 'hussein': 1, 'or': 1, 'first': 3, 'control': 4, 'named': 1, 'clinton': 1, 'dont': 2, 'campaign': 3, 'russia': 1, 'civil': 1, 'reinvestment': 1, 'into': 1, 'address': 1, 'primary': 2, 'community': 1, 
'mccain': 1, 'down': 1, 'hook': 1, '63': 1, 'americans': 1, 'elementary': 1, 'total': 1, 'earning': 1, 'repeal': 1, 'from': 3, 'raise': 1, 'district': 1, 'spending': 1, 'republican': 2, 'legislation': 1, 'three': 1, 'relations': 1, 'nobel': 1, 'start': 1, 'tell': 1, 'iraq': 4, 'convention': 1, 'resulted': 1, 'john': 1, 'was': 5, '2012obama': 1, 'form': 1, 'that': 1, 'tax': 1, 'sufficient': 1, 'republicans': 1, 'strike': 1, 'hillary': 1, 'street': 1, 'arms': 1, 'honolulu': 1, 'filed': 1, 'worked': 1, 'hold': 1, 'with': 3, 'obama': 9, 'ii': 1, 'has': 4, '1997': 1, '1996': 1, 'whether': 1, 'reelected': 1, 'budget': 1, 'us': 6, 'nations': 1, 'recession': 1, 'while': 1, 'taught': 1, 'marriage': 1, 'policies': 1, 'promoted': 1, 'called': 1, 'and': 21, 'supreme': 1, 'ordered': 3, 'nominee': 2, 'process': 1, '2000in': 1, 'is': 2, 'romney': 1, 'briefs': 1, 'defeated': 1, 'general': 1, '13th': 1, 'as': 6, 'at': 2, 'in': 30, 'sought': 1, 'organizer': 1, 'shooting': 1, 'increased': 1, 'normalize': 1, 'lengthy': 1, 'united': 3, 'court': 1, 'recovery': 1, 'laden': 1, 'laureateduring': 1, 'peace': 1, 'administration': 1, '1961': 1, 'illinois': 2, 'other': 1, 'which': 1, 'party': 3, 'primaries': 1, 'sworn': 1, 'relief': 2, 'war': 1, 'columbia': 1, 'combat': 1, 'after': 4, 'islamic': 1, 'running': 1, 'levels': 1, 'two': 1, 'involvement': 3, 'response': 3, 'included': 1, 'president': 4, 'law': 6, 'nomination': 1, '2008': 1, 'a': 7, '2009': 3, 'chicago': 2, 'constitutional': 1, 'defeating': 1, 'treaty': 1, 'federal': 1, '2007': 1, '2004': 3, 'african': 1, 'the': 40, 'democratic': 4, 'consumer': 1, 'began': 1, 'terms': 1}]

Sort the word counts for the Obama article. To sort, we first use stack to split the word_count dictionary into two new columns, word and count.

Turning the dictionary of word counts into a table



In [18]:

    
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])
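As a rough picture of what stacking and then sorting does, the sketch below turns one word-count dictionary into a list of (word, count) rows and orders them by count. The helper name stack_and_sort is hypothetical, and the tie-breaking rule (alphabetical) is an assumption; the counts are a handful copied from the Obama output above.

```python
def stack_and_sort(word_count):
    """Turn {word: count} into rows sorted by descending count."""
    # "Stack": one row per dictionary entry, with two named fields.
    rows = [{"word": w, "count": c} for w, c in word_count.items()]
    # Sort by count, largest first; break ties alphabetically (an assumption).
    rows.sort(key=lambda row: (-row["count"], row["word"]))
    return rows


# A few counts copied from the Obama word_count output.
obama_counts = {"normalize": 1, "obama": 9, "the": 40, "in": 30, "and": 21, "of": 18}
table = stack_and_sort(obama_counts)
for row in table[:3]:
    print(row["word"], row["count"])
```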

Sorting the word counts to show the most common words at the top. Before sorting, the rows come out in no particular order:



In [22]:

    
obama_word_count_table.head(5)









    Out[22]:





    
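+------------------+-------+
| word             | count |
+------------------+-------+
| normalize        | 1     |
| sought           | 1     |
| combat           | 1     |
| continued        | 1     |
| unconstitutional | 1     |
+------------------+-------+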
    

[5 rows x 2 columns]



In [23]:

    
obama_word_count_table.sort('count',ascending=False)









    Out[23]:





    
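+-------+-------+
| word  | count |
+-------+-------+
| the   | 40    |
| in    | 30    |
| and   | 21    |
| of    | 18    |
| to    | 14    |
| his   | 11    |
| obama | 9     |
| act   | 8     |
| a     | 7     |
| he    | 7     |
+-------+-------+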
    

[273 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

The most common words include uninformative words like "the", "in", and "and". These words carry no meaningful information about the article. They are often called stop words, and one option is simply to remove them.
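The notebook does not remove stop words itself, but the idea can be sketched in a few lines. The stop-word set below is a tiny hypothetical list chosen for this example, not a standard one, and remove_stop_words is an illustrative helper name.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "in", "and", "of", "to", "his", "a", "he"}


def remove_stop_words(word_count, stop_words=STOP_WORDS):
    """Return a copy of {word: count} without the stop words."""
    return {w: c for w, c in word_count.items() if w not in stop_words}


# The ten most common words from the sorted Obama table.
top_counts = {"the": 40, "in": 30, "and": 21, "of": 18, "to": 14,
              "his": 11, "obama": 9, "act": 8, "a": 7, "he": 7}
filtered = remove_stop_words(top_counts)
print(sorted(filtered.items()))  # [('act', 8), ('obama', 9)]
```

A fixed list like this needs manual upkeep, which is one reason to prefer a weighting scheme that discounts common words automatically.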

Compute TF-IDF for the corpus

To give more weight to informative words, we weight them by their TF-IDF scores. TF-IDF scores the importance of a word in a document by how often it appears in that document, discounted by how many documents in the corpus contain it. Because the discount depends on the whole corpus, we first compute word counts for every article and then apply TF-IDF to the whole dataset.
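A minimal sketch of one common TF-IDF formulation is shown here: the raw count of a word in a document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. Whether graphlab.text_analytics.tf_idf uses exactly this form is an assumption, and the three toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(word_counts):
    """word_counts: list of {word: count} dicts, one per document."""
    n_docs = len(word_counts)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for counts in word_counts:
        doc_freq.update(counts.keys())
    # Assumed formulation: tf * log(N / df), with tf the raw count.
    return [
        {w: c * math.log(n_docs / float(doc_freq[w])) for w, c in counts.items()}
        for counts in word_counts
    ]


# Made-up toy corpus of three documents.
docs = [
    {"the": 3, "obama": 2, "senate": 1},
    {"the": 2, "football": 2},
    {"the": 1, "senate": 1, "law": 1},
]
scores = tf_idf(docs)
print(max(scores[0], key=scores[0].get))  # obama
```

In this toy corpus "the" appears in every document, so its score is 0 in all of them, while a word confined to one document keeps its full weight.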



In [28]:

    
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people.head(5)









    Out[28]:





    
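+-------------------------------------------------------+---------------------+--------------------------------------------------------+
| URI                                                   | name                | text                                                   |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+
| <http://dbpedia.org/resource/Digby_Morrell> ...       | Digby Morrell       | digby morrell born 10 october 1979 is a former ...     |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ...      | Alfred J. Lewy      | alfred j lewy aka sandy lewy graduated from ...        |
| <http://dbpedia.org/resource/Harpdog_Brown> ...       | Harpdog Brown       | harpdog brown is a singer and harmonica player who ... |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ...    |
| <http://dbpedia.org/resource/G-Enka> ...              | G-Enka              | henry krvits born 30 december 1974 in tallinn ...      |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+

+----------------------------------------------------+--------------------------------------+
| word_count                                         | tfidf                                |
+----------------------------------------------------+--------------------------------------+
| {'since': 1, 'carltons': 1, 'being': 1, '2005' ... | {'since': 1.455376717308041, ...     |
| {'precise': 1, 'thomas': 1, 'closely': 1, ...      | {'precise': 6.44320060695519, ...    |
| {'just': 1, 'issued': 1, 'mainly': 1, 'nominat ... | {'just': 2.7007299687108643, ...     |
| {'all': 1, 'bauforschung': 1, ...                  | {'all': 1.6431112434912472, ...      |
| {'legendary': 1, 'gangstergenka': 1, ...           | {'legendary': 4.280856294365192, ... |
+----------------------------------------------------+--------------------------------------+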
    

[5 rows x 5 columns]



In [30]:

    
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])
tfidf









    Out[30]:





    
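+--------------------------------------------+
| docs                                       |
+--------------------------------------------+
| {'since': 1.455376717308041, ...           |
| {'precise': 6.44320060695519, ...          |
| {'just': 2.7007299687108643, ...           |
| {'all': 1.6431112434912472, ...            |
| {'legendary': 4.280856294365192, ...       |
| {'now': 1.96695239252401, 'currently': ... |
| {'exclusive': 10.455187230695827, ...      |
| {'taxi': 6.0520214560945025, ...           |
| {'houston': 3.935505942157149, ...         |
| {'phenomenon': 5.750053426395245, ...      |
+--------------------------------------------+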
    

[59071 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Add the TF-IDF docs to the people dataset as a new column called tfidf



In [31]:

    
people['tfidf'] = tfidf['docs']

Examine the TF-IDF for the Obama article. We select the Obama row again so that it includes the new tfidf column.



In [32]:

    
obama = people[people['name'] == 'Barack Obama']



In [33]:

    
obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)









    Out[33]:





    
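+-------------+---------------+
| word        | tfidf         |
+-------------+---------------+
| obama       | 43.2956530721 |
| act         | 27.678222623  |
| iraq        | 17.747378588  |
| control     | 14.8870608452 |
| law         | 14.7229357618 |
| ordered     | 14.5333739509 |
| military    | 13.1159327785 |
| involvement | 12.7843852412 |
| response    | 12.7843852412 |
| democratic  | 12.4106886973 |
+-------------+---------------+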
    

[273 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Words with the highest TF-IDF scores are much more informative, which is why we sorted the words by TF-IDF.

Manually compute distances between a few people

Let's manually compare the distances between the articles for a few famous people.



In [34]:

    
clinton = people[people['name'] == 'Bill Clinton']



In [35]:

    
beckham = people[people['name'] == 'David Beckham']

Is Obama closer to Clinton than to Beckham?

We will use cosine distance, which is given by (1 - cosine_similarity), to compute the distance between two documents.

We find that the article about President Obama is closer to the one about former President Clinton than to the one about footballer David Beckham.
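To make the definition concrete, here is a small sketch of cosine distance on sparse vectors stored as {word: weight} dictionaries, the same shape as the tfidf column. It is an illustration of the formula, not the code behind graphlab.distances.cosine; the three toy vectors are made up, and both inputs are assumed to be non-empty with non-zero norm.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for two sparse {word: weight} vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    # Assumes both norms are non-zero.
    return 1.0 - dot / (norm_a * norm_b)


# Made-up toy vectors.
politician_a = {"senate": 2.0, "law": 1.0}
politician_b = {"senate": 1.0, "law": 1.0, "governor": 1.0}
footballer = {"football": 3.0, "league": 1.0}

d_near = cosine_distance(politician_a, politician_b)
d_far = cosine_distance(politician_a, footballer)
print(d_near < d_far)  # True
```

Two documents that share no words have a cosine distance of 1, and the distance shrinks toward 0 as their weighted vocabularies overlap more.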



In [36]:

    
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])









    Out[36]:





0.8339854936884276



In [38]:

    
graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])









    Out[38]:





0.9791305844747478

The smaller the distance between two documents, the more similar they are. Here, Obama and Clinton (0.834) are more similar than Obama and Beckham (0.979).

Build a nearest neighbor model for document retrieval

We now create a nearest-neighbors model and apply it to document retrieval.



In [40]:

    
knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')









    



PROGRESS: Starting brute force nearest neighbors model training.

Applying the nearest-neighbors model for retrieval

Who is closest to Obama? To find out, we query the model with the obama row.



In [41]:

    
knn_model.query(obama)









    



PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 136.686ms    |
PROGRESS: | 0            | 5602    | 9.4835      | 343.897ms    |
PROGRESS: | 0            | 11164   | 18.8993     | 564.768ms    |
PROGRESS: | 0            | 15380   | 26.0365     | 789.289ms    |
PROGRESS: | 0            | 20757   | 35.1391     | 1.01s        |
PROGRESS: | 0            | 25677   | 43.468      | 1.23s        |
PROGRESS: | 0            | 31202   | 52.8212     | 1.47s        |
PROGRESS: | 0            | 36100   | 61.1129     | 1.68s        |
PROGRESS: | 0            | 40930   | 69.2895     | 1.91s        |
PROGRESS: | 0            | 46050   | 77.957      | 2.13s        |
PROGRESS: | 0            | 50726   | 85.8729     | 2.36s        |
PROGRESS: | 0            | 55514   | 93.9784     | 2.58s        |
PROGRESS: | 0            | 58721   | 99.4075     | 2.81s        |
PROGRESS: | Done         |         | 100         | 2.87s        |
PROGRESS: +--------------+---------+-------------+--------------+






    Out[41]:





    
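+-------------+-----------------+----------------+------+
| query_label | reference_label | distance       | rank |
+-------------+-----------------+----------------+------+
| 0           | Barack Obama    | 0.0            | 1    |
| 0           | Joe Biden       | 0.794117647059 | 2    |
| 0           | Joe Lieberman   | 0.794685990338 | 3    |
| 0           | Kelly Ayotte    | 0.811989100817 | 4    |
| 0           | Bill Clinton    | 0.813852813853 | 5    |
+-------------+-----------------+----------------+------+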
    

[5 rows x 4 columns]

As we can see, president Obama's article is closest to the one about his vice-president Biden, and those of other politicians.
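The training log mentions a brute-force model, which amounts to computing the distance from the query to every article and keeping the closest ones. The sketch below shows that idea with cosine distance on toy data. Note that the distance the query reports for Bill Clinton (0.814) differs from the manual cosine distance computed earlier (0.834), which suggests the model's default distance is not cosine; the sketch uses cosine only to stay consistent with the manual computation. The function name, labels, and vectors are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for two sparse {word: weight} vectors."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbors(corpus, query_label, k=3):
    """Brute force: rank every document by distance to the query document."""
    target = corpus[query_label]
    ranked = sorted(
        (cosine_distance(target, vec), label) for label, vec in corpus.items()
    )
    return [(label, dist) for dist, label in ranked[:k]]


# Hypothetical toy corpus; labels and weights are made up.
corpus = {
    "politician_a": {"senate": 2.0, "law": 1.0},
    "politician_b": {"senate": 1.0, "law": 1.0, "governor": 1.0},
    "singer_a": {"album": 2.0, "song": 1.0},
    "footballer_a": {"football": 3.0, "league": 1.0},
}
result = nearest_neighbors(corpus, "politician_a", k=2)
for label, dist in result:
    print(label, round(dist, 3))
```

As in the query output above, the query document itself comes back first with a distance of (approximately) zero.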

Other examples of document retrieval



In [42]:

    
swift = people[people['name'] == 'Taylor Swift']



In [44]:

    
knn_model.query(swift)









    



PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 92.438ms     |
PROGRESS: | 0            | 4012    | 6.79183     | 318.015ms    |
PROGRESS: | 0            | 6950    | 11.7655     | 541.453ms    |
PROGRESS: | 0            | 10953   | 18.5421     | 765.986ms    |
PROGRESS: | 0            | 16173   | 27.3789     | 994.226ms    |
PROGRESS: | 0            | 20450   | 34.6194     | 1.21s        |
PROGRESS: | 0            | 25428   | 43.0465     | 1.45s        |
PROGRESS: | 0            | 30178   | 51.0877     | 1.66s        |
PROGRESS: | 0            | 34714   | 58.7666     | 1.88s        |
PROGRESS: | 0            | 38071   | 64.4496     | 2.11s        |
PROGRESS: | 0            | 42002   | 71.1043     | 2.33s        |
PROGRESS: | 0            | 46054   | 77.9638     | 2.56s        |
PROGRESS: | 0            | 50833   | 86.0541     | 2.78s        |
PROGRESS: | 0            | 54260   | 91.8556     | 3.01s        |
PROGRESS: | 0            | 57665   | 97.6198     | 3.25s        |
PROGRESS: | Done         |         | 100         | 3.33s        |
PROGRESS: +--------------+---------+-------------+--------------+






    Out[44]:





    
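+-------------+------------------+----------------+------+
| query_label | reference_label  | distance       | rank |
+-------------+------------------+----------------+------+
| 0           | Taylor Swift     | 0.0            | 1    |
| 0           | Carrie Underwood | 0.76231884058  | 2    |
| 0           | Alicia Keys      | 0.764705882353 | 3    |
| 0           | Jordin Sparks    | 0.769633507853 | 4    |
| 0           | Leona Lewis      | 0.776119402985 | 5    |
+-------------+------------------+----------------+------+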
    

[5 rows x 4 columns]



In [30]:

    
jolie = people[people['name'] == 'Angelina Jolie']



In [31]:

    
knn_model.query(jolie)









    



PROGRESS: Starting pairwise querying...
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 24.658ms     |
PROGRESS: | Done         |         | 100         | 149.909ms    |
PROGRESS: +--------------+---------+-------------+--------------+






    Out[31]:





    
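+-------------+--------------------+----------------+------+
| query_label | reference_label    | distance       | rank |
+-------------+--------------------+----------------+------+
| 0           | Angelina Jolie     | 0.0            | 1    |
| 0           | Brad Pitt          | 0.784023668639 | 2    |
| 0           | Julianne Moore     | 0.795857988166 | 3    |
| 0           | Billy Bob Thornton | 0.803069053708 | 4    |
| 0           | George Clooney     | 0.8046875      | 5    |
+-------------+--------------------+----------------+------+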
    

[5 rows x 4 columns]



In [45]:

    
arnold = people[people['name'] == 'Arnold Schwarzenegger']



In [46]:

    
knn_model.query(arnold)









    



PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 53.147ms     |
PROGRESS: | 0            | 6114    | 10.3503     | 295.705ms    |
PROGRESS: | 0            | 11182   | 18.9298     | 502.157ms    |
PROGRESS: | 0            | 16511   | 27.9511     | 729.308ms    |
PROGRESS: | 0            | 21738   | 36.7998     | 951.159ms    |
PROGRESS: | 0            | 27106   | 45.8872     | 1.19s        |
PROGRESS: | 0            | 32153   | 54.4311     | 1.40s        |
PROGRESS: | 0            | 37346   | 63.2222     | 1.62s        |
PROGRESS: | 0            | 42853   | 72.5449     | 1.84s        |
PROGRESS: | 0            | 47690   | 80.7334     | 2.07s        |
PROGRESS: | 0            | 53085   | 89.8664     | 2.29s        |
PROGRESS: | 0            | 57544   | 97.415      | 2.52s        |
PROGRESS: | Done         |         | 100         | 2.62s        |
PROGRESS: +--------------+---------+-------------+--------------+






    Out[46]:





    
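+-------------+-----------------------+----------------+------+
| query_label | reference_label       | distance       | rank |
+-------------+-----------------------+----------------+------+
| 0           | Arnold Schwarzenegger | 0.0            | 1    |
| 0           | Jesse Ventura         | 0.818918918919 | 2    |
| 0           | John Kitzhaber        | 0.824615384615 | 3    |
| 0           | Lincoln Chafee        | 0.833876221498 | 4    |
| 0           | Anthony Foxx          | 0.833910034602 | 5    |
+-------------+-----------------------+----------------+------+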
    

[5 rows x 4 columns]


