In [1]:
import graphlab

Load Wikipedia text data: pages on people


In [2]:
people = graphlab.SFrame('./people_wiki.gl/')



[INFO] Start server at: ipc:///tmp/graphlab_server-9810 - Server binary: /usr/local/lib/python2.7/dist-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1445218854.log
[INFO] GraphLab Server Version: 1.6.1

In [5]:
len(people)


Out[5]:
59071

In [6]:
obama = people[people['name'] == 'Barack Obama']

In [7]:
obama


Out[7]:
URI name text
<http://dbpedia.org/resou
rce/Barack_Obama> ...
Barack Obama barack hussein obama ii
brk husen bm born august ...
[? rows x 3 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [8]:
people.head()


Out[8]:
URI name text
<http://dbpedia.org/resou
rce/Digby_Morrell> ...
Digby Morrell digby morrell born 10
october 1979 is a former ...
<http://dbpedia.org/resou
rce/Alfred_J._Lewy> ...
Alfred J. Lewy alfred j lewy aka sandy
lewy graduated from ...
<http://dbpedia.org/resou
rce/Harpdog_Brown> ...
Harpdog Brown harpdog brown is a singer
and harmonica player who ...
<http://dbpedia.org/resou
rce/Franz_Rottensteiner> ...
Franz Rottensteiner franz rottensteiner born
in waidmannsfeld lower ...
<http://dbpedia.org/resou
rce/G-Enka> ...
G-Enka henry krvits born 30
december 1974 in tallinn ...
<http://dbpedia.org/resou
rce/Sam_Henderson> ...
Sam Henderson sam henderson born
october 18 1969 is an ...
<http://dbpedia.org/resou
rce/Aaron_LaCrate> ...
Aaron LaCrate aaron lacrate is an
american music producer ...
<http://dbpedia.org/resou
rce/Trevor_Ferguson> ...
Trevor Ferguson trevor ferguson aka john
farrow born 11 november ...
<http://dbpedia.org/resou
rce/Grant_Nelson> ...
Grant Nelson grant nelson born 27
april 1971 in london ...
<http://dbpedia.org/resou
rce/Cathy_Caruth> ...
Cathy Caruth cathy caruth born 1955 is
frank h t rhodes ...
[10 rows x 3 columns]


In [9]:
clooney = people[people['name']=='George Clooney']

In [10]:
clooney


Out[10]:
URI name text
<http://dbpedia.org/resou
rce/George_Clooney> ...
George Clooney george timothy clooney
born may 6 1961 is an ...
[? rows x 3 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [11]:
clooney['text']


Out[11]:
dtype: str
Rows: ?
['george timothy clooney born may 6 1961 is an american actor writer producer director and activist he has received three golden globe awards for his work as an actor and two academy awards one for acting and the other for producingclooney made his acting debut on television in 1978 and later gained wide recognition in his role as dr doug ross on the longrunning medical drama er from 1994 to 1999 for which he received two emmy award nominations while working on er he began attracting a variety of leading roles in films including the superhero film batman robin 1997 and the crime comedy out of sight 1998 in which he first worked with a director who would become a longtime collaborator steven soderbergh in 1999 clooney took the lead role in three kings a wellreceived war satire set during the gulf warin 2001 clooneys fame widened with the release of his biggest commercial success the heist comedy oceans eleven the first of the film trilogy a remake of the 1960 film with frank sinatra as danny ocean he made his directorial debut a year later with the biographical thriller confessions of a dangerous mind and has since directed the drama good night and good luck 2005 the sports comedy leatherheads 2008 the political drama the ides of march 2011 and the comedydrama war film the monuments men 2014he won an academy award for best supporting actor for the middle east thriller syriana 2005 and subsequently earned best actor nominations for the legal thriller michael clayton 2007 the comedydrama up in the air 2009 and the drama the descendants 2011 in 2013 he received the academy award for best picture for producing the political thriller argo alongside ben affleck and grant heslov he is the only person ever to be nominated for academy awards in six categoriesclooney is sometimes described as one of the most handsome men in the world in 2005 tv guide ranked clooney no 1 on its 50 sexiest stars of all time list in 2009 he was included in times annual time 100 as one of the 
most influential people in the world clooney is also noted for his political activism and has served as one of the united nations messengers of peace since january 31 2008 his humanitarian work includes his advocacy of finding a resolution for the darfur conflict raising funds for the 2010 haiti earthquake 2004 tsunami and 911 victims and creating documentaries such as sand and sorrow to raise awareness about international crises he is also a member of the council on foreign relations', ... ]

Word counts


In [12]:
obama['wordCount'] = graphlab.text_analytics.count_words(obama['text'])

In [13]:
obama


Out[13]:
URI name text wordCount
<http://dbpedia.org/resou
rce/Barack_Obama> ...
Barack Obama barack hussein obama ii
brk husen bm born august ...
{'operations': 1,
'represent': 1, 'offi ...
[1 rows x 4 columns]
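
The word-count step above can be illustrated without GraphLab. This is a minimal sketch, assuming tokens are separated by whitespace and the text is already lowercased and stripped of punctuation (as the article text in this dataset appears to be). The helper name `count_words_sketch` and the toy sentence are hypothetical, not part of GraphLab or the dataset.

```python
from collections import Counter


def count_words_sketch(text):
    """Return a dict mapping each whitespace-separated token to its count.

    Hypothetical stand-in for illustration only; it is not GraphLab's
    count_words and may tokenize differently.
    """
    return dict(Counter(text.split()))


# Toy sentence (made up, not taken from the dataset).
toy_counts = count_words_sketch("the cat sat on the mat")
print(toy_counts["the"])
```

The result is a plain dictionary per document, which is the same shape as the `wordCount` column shown above.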

Sort word counts for the Obama article


In [14]:
obama_wordCountTable = obama[['wordCount']].stack('wordCount', new_column_name=['word', 'count'])

In [15]:
obama_wordCountTable.head()


Out[15]:
word count
normalize 1
sought 1
combat 1
continued 1
unconstitutional 1
8 1
californias 1
1996 1
marriage 1
defense 1
[10 rows x 2 columns]


In [16]:
obama_wordCountTable.sort('count', ascending=False)


Out[16]:
word count
the 40
in 30
and 21
of 18
to 14
his 11
obama 9
act 8
a 7
he 7
[273 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
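
The `stack` call turns one dictionary cell into one row per key/value pair, and `sort` then orders those rows. A rough pure-Python equivalent is sketched below; the function name `stack_and_sort` and the toy counts are assumptions made for illustration.

```python
def stack_and_sort(word_count):
    """Turn a {word: count} dict into rows sorted by descending count."""
    rows = [{"word": w, "count": c} for w, c in word_count.items()]
    # Python's sort is stable, so ties keep their original relative order.
    return sorted(rows, key=lambda row: row["count"], reverse=True)


# Toy counts (hypothetical, not the real Obama article counts).
toy_table = stack_and_sort({"law": 1, "the": 3, "obama": 2})
for row in toy_table:
    print(row["word"], row["count"])
```

As in the output above, the most frequent entries of a raw count table tend to be common function words such as "the" and "in", which motivates the TF-IDF weighting that follows.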

Computing TF-IDF


In [17]:
people['wordCount'] = graphlab.text_analytics.count_words(people['text'])
people.head()


Out[17]:
URI name text wordCount
<http://dbpedia.org/resou
rce/Digby_Morrell> ...
Digby Morrell digby morrell born 10
october 1979 is a former ...
{'since': 1, 'carltons':
1, 'being': 1, '2005' ...
<http://dbpedia.org/resou
rce/Alfred_J._Lewy> ...
Alfred J. Lewy alfred j lewy aka sandy
lewy graduated from ...
{'precise': 1, 'thomas':
1, 'closely': 1, ...
<http://dbpedia.org/resou
rce/Harpdog_Brown> ...
Harpdog Brown harpdog brown is a singer
and harmonica player who ...
{'just': 1, 'issued': 1,
'mainly': 1, 'nominat ...
<http://dbpedia.org/resou
rce/Franz_Rottensteiner> ...
Franz Rottensteiner franz rottensteiner born
in waidmannsfeld lower ...
{'all': 1,
'bauforschung': 1, ...
<http://dbpedia.org/resou
rce/G-Enka> ...
G-Enka henry krvits born 30
december 1974 in tallinn ...
{'legendary': 1,
'gangstergenka': 1, ...
<http://dbpedia.org/resou
rce/Sam_Henderson> ...
Sam Henderson sam henderson born
october 18 1969 is an ...
{'now': 1, 'currently':
1, 'less': 1, 'being' ...
<http://dbpedia.org/resou
rce/Aaron_LaCrate> ...
Aaron LaCrate aaron lacrate is an
american music producer ...
{'exclusive': 2,
'producer': 1, 'tribe': ...
<http://dbpedia.org/resou
rce/Trevor_Ferguson> ...
Trevor Ferguson trevor ferguson aka john
farrow born 11 november ...
{'taxi': 1, 'salon': 1,
'gangs': 1, 'being': 1, ...
<http://dbpedia.org/resou
rce/Grant_Nelson> ...
Grant Nelson grant nelson born 27
april 1971 in london ...
{'houston': 1, 'frankie':
1, 'labels': 1, ...
<http://dbpedia.org/resou
rce/Cathy_Caruth> ...
Cathy Caruth cathy caruth born 1955 is
frank h t rhodes ...
{'phenomenon': 1,
'deborash': 1, ...
[10 rows x 4 columns]


In [18]:
tfidf = graphlab.text_analytics.tf_idf(people['wordCount'])
tfidf


Out[18]:
docs
{'since':
1.455376717308041, ...
{'precise':
6.44320060695519, ...
{'just':
2.7007299687108643, ...
{'all':
1.6431112434912472, ...
{'legendary':
4.280856294365192, ...
{'now': 1.96695239252401,
'currently': ...
{'exclusive':
10.455187230695827, ...
{'taxi':
6.0520214560945025, ...
{'houston':
3.935505942157149, ...
{'phenomenon':
5.750053426395245, ...
[59071 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [19]:
people['tfidf'] = tfidf['docs']
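
To make the weighting concrete, here is a minimal sketch of TF-IDF over per-document word counts. It assumes the common definition tf × log(N / df), where N is the number of documents and df is the number of documents containing the word; GraphLab's exact formula may differ in detail. The name `tf_idf_sketch` and the three-document toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf_sketch(docs_word_counts):
    """Weight each word count by log(N / document frequency).

    Assumes the textbook definition; not guaranteed to match
    graphlab.text_analytics.tf_idf exactly.
    """
    n_docs = len(docs_word_counts)
    doc_freq = Counter()
    for word_count in docs_word_counts:
        # Each document contributes at most 1 to a word's document frequency.
        doc_freq.update(word_count.keys())
    return [
        {w: c * math.log(float(n_docs) / doc_freq[w]) for w, c in wc.items()}
        for wc in docs_word_counts
    ]


# Toy corpus (made up): "the" occurs in every document.
toy_corpus = [
    {"the": 3, "senate": 1},
    {"the": 2, "football": 2},
    {"the": 1, "album": 1},
]
toy_weights = tf_idf_sketch(toy_corpus)
print(toy_weights[0])
```

Under this definition a word that occurs in every document gets weight 0, while a word confined to one document keeps a large weight, which is why rare, topical words dominate the TF-IDF tables in this notebook.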

Examine TF-IDF for the Obama article


In [20]:
obama = people[people['name']=='Barack Obama']

In [22]:
obama_tfidf_table = obama[['tfidf']].stack('tfidf', new_column_name=['word','tfidf']).sort('tfidf', ascending=False)

In [23]:
obama_tfidf_table.head()


Out[23]:
word tfidf
obama 43.2956530721
act 27.678222623
iraq 17.747378588
control 14.8870608452
law 14.7229357618
ordered 14.5333739509
military 13.1159327785
involvement 12.7843852412
response 12.7843852412
democratic 12.4106886973
[10 rows x 2 columns]


In [24]:
people.head()


Out[24]:
URI name text wordCount
<http://dbpedia.org/resou
rce/Digby_Morrell> ...
Digby Morrell digby morrell born 10
october 1979 is a former ...
{'since': 1, 'carltons':
1, 'being': 1, '2005' ...
<http://dbpedia.org/resou
rce/Alfred_J._Lewy> ...
Alfred J. Lewy alfred j lewy aka sandy
lewy graduated from ...
{'precise': 1, 'thomas':
1, 'closely': 1, ...
<http://dbpedia.org/resou
rce/Harpdog_Brown> ...
Harpdog Brown harpdog brown is a singer
and harmonica player who ...
{'just': 1, 'issued': 1,
'mainly': 1, 'nominat ...
<http://dbpedia.org/resou
rce/Franz_Rottensteiner> ...
Franz Rottensteiner franz rottensteiner born
in waidmannsfeld lower ...
{'all': 1,
'bauforschung': 1, ...
<http://dbpedia.org/resou
rce/G-Enka> ...
G-Enka henry krvits born 30
december 1974 in tallinn ...
{'legendary': 1,
'gangstergenka': 1, ...
<http://dbpedia.org/resou
rce/Sam_Henderson> ...
Sam Henderson sam henderson born
october 18 1969 is an ...
{'now': 1, 'currently':
1, 'less': 1, 'being' ...
<http://dbpedia.org/resou
rce/Aaron_LaCrate> ...
Aaron LaCrate aaron lacrate is an
american music producer ...
{'exclusive': 2,
'producer': 1, 'tribe': ...
<http://dbpedia.org/resou
rce/Trevor_Ferguson> ...
Trevor Ferguson trevor ferguson aka john
farrow born 11 november ...
{'taxi': 1, 'salon': 1,
'gangs': 1, 'being': 1, ...
<http://dbpedia.org/resou
rce/Grant_Nelson> ...
Grant Nelson grant nelson born 27
april 1971 in london ...
{'houston': 1, 'frankie':
1, 'labels': 1, ...
<http://dbpedia.org/resou
rce/Cathy_Caruth> ...
Cathy Caruth cathy caruth born 1955 is
frank h t rhodes ...
{'phenomenon': 1,
'deborash': 1, ...
tfidf
{'since':
1.455376717308041, ...
{'precise':
6.44320060695519, ...
{'just':
2.7007299687108643, ...
{'all':
1.6431112434912472, ...
{'legendary':
4.280856294365192, ...
{'now': 1.96695239252401,
'currently': ...
{'exclusive':
10.455187230695827, ...
{'taxi':
6.0520214560945025, ...
{'houston':
3.935505942157149, ...
{'phenomenon':
5.750053426395245, ...
[10 rows x 5 columns]

Manually compute distances for a few people


In [25]:
clinton = people[people['name'] == 'Bill Clinton']

In [26]:
beckham = people[people['name']== 'David Beckham']

Is Obama closer to Clinton than to Beckham?


In [28]:
# There are various ways to measure similarity between two documents.
# Here we use a metric called cosine distance:
# the higher the distance, the farther apart the articles are;
# the lower the distance, the closer the articles are.
graphlab.distances.cosine(obama['tfidf'][0], clinton['tfidf'][0])


Out[28]:
0.8339854936884276

In [29]:
graphlab.distances.cosine(obama['tfidf'][0], beckham['tfidf'][0])


Out[29]:
0.9791305844747478
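
For reference, cosine distance between two sparse word-weight dictionaries can be sketched as 1 minus the cosine of the angle between them. The function name `cosine_distance` and the toy vectors below are illustrative assumptions; this is not GraphLab's implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle) between two sparse vectors stored as dicts."""
    # Only words present in both dicts contribute to the dot product.
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors (hypothetical weights).
no_overlap = cosine_distance({"senate": 1.0}, {"football": 1.0})
same_direction = cosine_distance({"a": 1.0, "b": 2.0}, {"a": 2.0, "b": 4.0})
print(no_overlap, same_direction)
```

With non-negative weights such as TF-IDF scores, the distance falls between 0 (same direction) and 1 (no shared words), so the 0.83 for Clinton versus 0.98 for Beckham above indicates that the Obama article shares more heavily weighted vocabulary with the Clinton article.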

Build a nearest-neighbor model for document retrieval


In [32]:
knnModel = graphlab.nearest_neighbors.create(people, features=['tfidf'], label='name')


PROGRESS: Starting brute force nearest neighbors model training.

Apply kNN for retrieval


In [33]:
knnModel.query(obama)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 7.343ms      |
PROGRESS: | Done         |         | 100         | 263.563ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[33]:
query_label reference_label distance rank
0 Barack Obama 0.0 1
0 Joe Biden 0.794117647059 2
0 Joe Lieberman 0.794685990338 3
0 Kelly Ayotte 0.811989100817 4
0 Bill Clinton 0.813852813853 5
[5 rows x 4 columns]


In [34]:
knnModel


Out[34]:
Class                         : NearestNeighborsModel

Attributes
----------
Method                        : brute_force
Number of distance components : 1
Number of examples            : 59071
Number of feature columns     : 1
Number of unpacked features   : 547979
Total training time (seconds) : 6.7946
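
The model summary reports the `brute_force` method, meaning a query is compared against every reference row. A minimal sketch of that idea follows, assuming cosine distance and a tiny made-up reference set; the names `brute_force_query` and `toy_reference` are hypothetical. Note that the model above was created without a `distance` argument, and its reported distance for Bill Clinton (0.8139) differs from the manually computed cosine distance (0.8340), so its default metric is evidently not cosine.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle) between two sparse vectors stored as dicts."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_query(reference, query_vector, k=5):
    """Score every reference item against the query and keep the k closest."""
    scored = [
        (label, cosine_distance(query_vector, vector))
        for label, vector in reference.items()
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy reference set (made-up labels and weights, all vectors non-zero).
toy_reference = {
    "politician_a": {"senate": 2.0, "law": 1.0},
    "politician_b": {"senate": 1.0, "law": 2.0},
    "footballer": {"football": 3.0, "club": 1.0},
}
toy_neighbors = brute_force_query(toy_reference, toy_reference["politician_a"], k=2)
for label, distance in toy_neighbors:
    print(label, distance)
```

Because the query item is itself in the reference set, it comes back as its own nearest neighbor at rank 1, just as Barack Obama does in the query output above.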

Other examples of document retrieval


In [35]:
swift = people[people['name']=='Taylor Swift']

In [36]:
knnModel.query(swift)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 7.273ms      |
PROGRESS: | Done         |         | 100         | 271.576ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[36]:
query_label reference_label distance rank
0 Taylor Swift 0.0 1
0 Carrie Underwood 0.76231884058 2
0 Alicia Keys 0.764705882353 3
0 Jordin Sparks 0.769633507853 4
0 Leona Lewis 0.776119402985 5
[5 rows x 4 columns]


In [37]:
jolie = people[people['name']=='Angelina Jolie']

In [38]:
knnModel.query(jolie)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 7.73ms       |
PROGRESS: | Done         |         | 100         | 268.975ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[38]:
query_label reference_label distance rank
0 Angelina Jolie 0.0 1
0 Brad Pitt 0.784023668639 2
0 Julianne Moore 0.795857988166 3
0 Billy Bob Thornton 0.803069053708 4
0 George Clooney 0.8046875 5
[5 rows x 4 columns]


In [39]:
arnold = people[people['name']=='Arnold Schwarzenegger']

In [40]:
knnModel.query(arnold)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 7.904ms      |
PROGRESS: | Done         |         | 100         | 269.544ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[40]:
query_label reference_label distance rank
0 Arnold Schwarzenegger 0.0 1
0 Jesse Ventura 0.818918918919 2
0 John Kitzhaber 0.824615384615 3
0 Lincoln Chafee 0.833876221498 4
0 Anthony Foxx 0.833910034602 5
[5 rows x 4 columns]


In [44]:
elton = people[people['name'] == 'Elton John']

In [45]:
elton_wordCountTable = elton[['wordCount']].stack('wordCount', new_column_name=['word', 'count']).sort('count', ascending=False)

In [46]:
elton_wordCountTable


Out[46]:
word count
the 27
in 18
and 15
of 13
a 10
has 9
he 7
john 7
on 6
since 5
[255 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [47]:
elton_tfidf_table = elton[['tfidf']].stack('tfidf', new_column_name=['word','tfidf']).sort('tfidf', ascending=False)

In [48]:
elton_tfidf_table


Out[48]:
word tfidf
furnish 18.38947184
elton 17.48232027
billboard 17.3036809575
john 13.9393127924
songwriters 11.250406447
overallelton 10.9864953892
tonightcandle 10.9864953892
19702000 10.2933482087
fivedecade 10.2933482087
aids 10.262846934
[255 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [55]:
knnCosineModel_tfidf = graphlab.nearest_neighbors.create(people, features=['tfidf'], label='name', distance='cosine')


PROGRESS: Starting brute force nearest neighbors model training.

In [56]:
knnCosineModel_words = graphlab.nearest_neighbors.create(people, features=['wordCount'], label='name', distance='cosine')


PROGRESS: Starting brute force nearest neighbors model training.

In [51]:
victoria = people[people['name']=='Victoria Beckham']

In [52]:
paul = people[people['name']=='Paul McCartney']

In [53]:
graphlab.distances.cosine(elton['tfidf'][0],victoria['tfidf'][0])


Out[53]:
0.9567006376655429

In [54]:
graphlab.distances.cosine(elton['tfidf'][0],paul['tfidf'][0])


Out[54]:
0.8250310029221779

In [57]:
knnCosineModel_words.query(elton)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 4.296ms      |
PROGRESS: | Done         |         | 100         | 199.19ms     |
PROGRESS: +--------------+---------+-------------+--------------+
Out[57]:
query_label reference_label distance rank
0 Elton John 2.22044604925e-16 1
0 Cliff Richard 0.16142415259 2
0 Sandro Petrone 0.16822542751 3
0 Rod Stewart 0.168327165587 4
0 Malachi O'Doherty 0.177315545979 5
[5 rows x 4 columns]


In [58]:
knnCosineModel_tfidf.query(elton)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 12.735ms     |
PROGRESS: | Done         |         | 100         | 313.385ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[58]:
query_label reference_label distance rank
0 Elton John -2.22044604925e-16 1
0 Rod Stewart 0.717219667893 2
0 George Michael 0.747600998969 3
0 Sting (musician) 0.747671954431 4
0 Phil Collins 0.75119324879 5
[5 rows x 4 columns]


In [59]:
knnCosineModel_words.query(victoria)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 3.597ms      |
PROGRESS: | Done         |         | 100         | 199.852ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[59]:
query_label reference_label distance rank
0 Victoria Beckham -2.22044604925e-16 1
0 Mary Fitzgerald (artist) 0.207307036115 2
0 Adrienne Corri 0.214509782788 3
0 Beverly Jane Fry 0.217466468741 4
0 Raman Mundair 0.217695474992 5
[5 rows x 4 columns]


In [60]:
knnCosineModel_tfidf.query(victoria)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 6.234ms      |
PROGRESS: | Done         |         | 100         | 293.866ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[60]:
query_label reference_label distance rank
0 Victoria Beckham 1.11022302463e-16 1
0 David Beckham 0.548169610263 2
0 Stephen Dow Beckham 0.784986706828 3
0 Mel B 0.809585523409 4
0 Caroline Rush 0.819826422919 5
[5 rows x 4 columns]
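
In the cosine queries above, each article's distance to itself is reported as a tiny value such as 2.22044604925e-16 or -2.22044604925e-16 rather than exactly 0.0. These are on the order of double-precision rounding error, so they are best treated as zero. A small sketch of comparing with a tolerance, using the three self-distances printed above (the tolerance of 1e-12 is an arbitrary assumption):

```python
# Self-distances copied from the query outputs above.
self_distances = [2.22044604925e-16, -2.22044604925e-16, 1.11022302463e-16]

TOLERANCE = 1e-12  # assumed threshold; any value well above 1e-15 would do


def is_effectively_zero(distance, tolerance=TOLERANCE):
    """Treat distances within the tolerance of 0.0 as exact matches."""
    return abs(distance) < tolerance


matches = [is_effectively_zero(d) for d in self_distances]
print(matches)
```

Comparing against a tolerance avoids misclassifying an article as different from itself when a check such as `distance == 0.0` would fail.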

