Document retrieval from Wikipedia data

Fire up GraphLab Create


In [1]:
import graphlab

In [2]:
people = graphlab.SFrame('people_wiki.gl/')



In [3]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])

In [6]:
elton = people[people['name'] == 'Elton John']

In [8]:
elton['word_count'] = graphlab.text_analytics.count_words(elton['text'])

In [12]:
elton_word_count_table = elton[['word_count']].stack('word_count', new_column_name=['word', 'count'])

In [14]:
elton_word_count_table.sort('count', ascending=False)


Out[14]:
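+-------+-------+
| word  | count |
+-------+-------+
| the   | 27    |
| in    | 18    |
| and   | 15    |
| of    | 13    |
| a     | 10    |
| has   | 9     |
| he    | 7     |
| john  | 7     |
| on    | 6     |
| since | 5     |
+-------+-------+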
[255 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
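
For readers without GraphLab Create installed, the word-count table above can be approximated in plain Python. This is a minimal sketch, not GraphLab's implementation: the tokenization rule (lowercase, keep runs of letters, digits, and apostrophes) and the helper names `count_words` and `top_words` are assumptions made for illustration, and the sample sentence is invented.

```python
import re
from collections import Counter


def count_words(text):
    """Return a dict mapping each lowercase token to its number of occurrences.

    Assumption: tokens are runs of letters, digits, or apostrophes.
    GraphLab's own tokenizer may split text differently.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return dict(Counter(tokens))


def top_words(word_count, n=10):
    """Sort by descending count, breaking ties alphabetically, and keep the first n."""
    return sorted(word_count.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Invented sample text, used only to exercise the helpers.
sample = "The singer and the pianist. The singer has sold records."
counts = count_words(sample)
print(top_words(counts, 3))
```

As in the table above, function words such as "the" dominate raw counts, which is why raw counts alone say little about what an article is about.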

In [16]:
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])

In [17]:
people['tfidf'] = tfidf['docs']

In [19]:
elton2 = people[people['name'] == 'Elton John']

In [22]:
elton_tfidf_table = elton2[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf'])

In [23]:
elton_tfidf_table.sort('tfidf', ascending=False)


Out[23]:
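+---------------+---------------+
| word          | tfidf         |
+---------------+---------------+
| furnish       | 18.38947184   |
| elton         | 17.48232027   |
| billboard     | 17.3036809575 |
| john          | 13.9393127924 |
| songwriters   | 11.250406447  |
| overallelton  | 10.9864953892 |
| tonightcandle | 10.9864953892 |
| 19702000      | 10.2933482087 |
| fivedecade    | 10.2933482087 |
| aids          | 10.262846934  |
+---------------+---------------+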
[255 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
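
The TF-IDF ranking above drops "the", "in", and "and" because a word that occurs in almost every article gets a weight near zero. The sketch below shows that effect in plain Python. It assumes the common weighting count × log(N / document frequency); GraphLab Create's exact formula may differ, and the function name `tf_idf` and the three toy documents are illustrative only.

```python
import math


def tf_idf(word_counts):
    """Weight each word count by log(N / df).

    word_counts: list of dicts (word -> count), one dict per document.
    Assumption: idf = log(N / df), where df is the number of documents
    that contain the word. Other libraries use smoothed variants.
    """
    n_docs = len(word_counts)
    doc_freq = {}
    for wc in word_counts:
        for word in wc:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        {word: count * math.log(float(n_docs) / doc_freq[word])
         for word, count in wc.items()}
        for wc in word_counts
    ]


# Toy corpus: "the" appears in every document, "football" in only one.
docs = [
    {"the": 3, "piano": 2},
    {"the": 1, "football": 4},
    {"the": 2, "piano": 1, "guitar": 1},
]
weights = tf_idf(docs)
print(weights[1])
```

Here "the" receives weight 0 in every document because log(3 / 3) = 0, while "football", which appears in a single document, keeps a large weight.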

In [26]:
paul = people[people['name'] == 'Paul McCartney']

In [25]:
victoria = people[people['name'] == 'Victoria Beckham']

In [30]:
graphlab.distances.cosine(elton2['tfidf'][0], paul['tfidf'][0])


Out[30]:
0.8250310029221779

In [31]:
graphlab.distances.cosine(elton2['tfidf'][0], victoria['tfidf'][0])


Out[31]:
0.9567006376655429
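
The two values above are distances, so smaller means more similar: by this measure Paul McCartney (0.825) is closer to Elton John than Victoria Beckham (0.957). A minimal sketch of cosine distance on sparse word-weight dictionaries follows, taking distance to be 1 minus cosine similarity. The function name `cosine_distance` and the toy vectors are assumptions for illustration, not GraphLab's code.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse vectors stored as dicts.

    Only keys present in both dicts contribute to the dot product.
    Assumption: neither vector is all zeros (that would divide by zero).
    """
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors: x and y are identical, z shares no words with x.
x = {"piano": 3.0, "album": 4.0}
y = {"piano": 3.0, "album": 4.0}
z = {"football": 2.0}

print(cosine_distance(x, y))
print(cosine_distance(x, z))
```

Identical vectors give a distance of 0 and vectors with no words in common give 1, so TF-IDF distances near 1, like the ones above, indicate that the articles share few distinctive words.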

In [39]:
word_count_knn_model = graphlab.nearest_neighbors.create(people, features=['word_count'], distance='cosine', label='name')


PROGRESS: Starting brute force nearest neighbors model training.

In [41]:
tfidf_knn_model = graphlab.nearest_neighbors.create(people, features=['tfidf'], distance='cosine', label='name')


PROGRESS: Starting brute force nearest neighbors model training.

In [42]:
word_count_knn_model.query(elton2)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 6.001ms      |
PROGRESS: | Done         |         | 100         | 304.018ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[42]:
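+-------------+-------------------+-------------------+------+
| query_label | reference_label   | distance          | rank |
+-------------+-------------------+-------------------+------+
| 0           | Elton John        | 2.22044604925e-16 | 1    |
| 0           | Cliff Richard     | 0.16142415259     | 2    |
| 0           | Sandro Petrone    | 0.16822542751     | 3    |
| 0           | Rod Stewart       | 0.168327165587    | 4    |
| 0           | Malachi O'Doherty | 0.177315545979    | 5    |
+-------------+-------------------+-------------------+------+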
[5 rows x 4 columns]

In [44]:
tfidf_knn_model.query(elton2)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 12.001ms     |
PROGRESS: | Done         |         | 100         | 315.018ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[44]:
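+-------------+------------------+--------------------+------+
| query_label | reference_label  | distance           | rank |
+-------------+------------------+--------------------+------+
| 0           | Elton John       | -2.22044604925e-16 | 1    |
| 0           | Rod Stewart      | 0.717219667893     | 2    |
| 0           | George Michael   | 0.747600998969     | 3    |
| 0           | Sting (musician) | 0.747671954431     | 4    |
| 0           | Phil Collins     | 0.75119324879      | 5    |
+-------------+------------------+--------------------+------+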
[5 rows x 4 columns]

In [45]:
word_count_knn_model.query(victoria)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 6ms          |
PROGRESS: | Done         |         | 100         | 300.017ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[45]:
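+-------------+--------------------------+--------------------+------+
| query_label | reference_label          | distance           | rank |
+-------------+--------------------------+--------------------+------+
| 0           | Victoria Beckham         | -2.22044604925e-16 | 1    |
| 0           | Mary Fitzgerald (artist) | 0.207307036115     | 2    |
| 0           | Adrienne Corri           | 0.214509782788     | 3    |
| 0           | Beverly Jane Fry         | 0.217466468741     | 4    |
| 0           | Raman Mundair            | 0.217695474992     | 5    |
+-------------+--------------------------+--------------------+------+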
[5 rows x 4 columns]

In [46]:
tfidf_knn_model.query(victoria)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 10.001ms     |
PROGRESS: | Done         |         | 100         | 340.02ms     |
PROGRESS: +--------------+---------+-------------+--------------+
Out[46]:
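+-------------+---------------------+-------------------+------+
| query_label | reference_label     | distance          | rank |
+-------------+---------------------+-------------------+------+
| 0           | Victoria Beckham    | 1.11022302463e-16 | 1    |
| 0           | David Beckham       | 0.548169610263    | 2    |
| 0           | Stephen Dow Beckham | 0.784986706828    | 3    |
| 0           | Mel B               | 0.809585523409    | 4    |
| 0           | Caroline Rush       | 0.819826422919    | 5    |
+-------------+---------------------+-------------------+------+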
[5 rows x 4 columns]
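
The queries above report a brute-force search: the query vector is compared against every reference document and the closest ones are returned with a rank. The sketch below mirrors that idea in plain Python under stated assumptions. The names `cosine_distance` and `query`, the tuple layout of the result, and the three-document toy corpus are all hypothetical and are not part of the GraphLab API.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for sparse dict vectors (assumes non-zero vectors)."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def query(reference, query_vector, k=5):
    """Brute-force k-nearest-neighbor search.

    reference: dict mapping a label to its sparse vector.
    Returns a list of (label, distance, rank) tuples, closest first.
    """
    scored = [(label, cosine_distance(query_vector, vec))
              for label, vec in reference.items()]
    scored.sort(key=lambda pair: pair[1])
    return [(label, dist, rank)
            for rank, (label, dist) in enumerate(scored[:k], start=1)]


# Toy corpus with invented labels and weights.
corpus = {
    "A": {"piano": 2.0, "album": 1.0},
    "B": {"piano": 1.0, "album": 1.0},
    "C": {"football": 3.0},
}
result = query(corpus, corpus["A"], k=2)
print(result)
```

Because the query document is itself in the reference set, it comes back at rank 1 with a distance of essentially zero. Values such as 2.2e-16 or -2.2e-16 in the tables above are floating-point rounding error, not real differences.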