In [1]:
'''
ad hoc Information Retrieval System: practicing the vector space model
Imagine we have a collection of documents and we would like to query the
software to retrieve the document most relevant to the query. What technique
should we use? One simple model is the vector space model. The idea is to
create a hyperspace where each unique word (term) in the collection represents
a separate dimension, and each document is represented by a vector of weights
(usually correlated with the number of occurrences) over those terms
(dimensions). For example, if we have 2 recipes in a collection, the fried
chicken recipe fc = ['chicken', 'fried', 'oil', 'pepper'] and the poached
chicken recipe pc = ['chicken', 'water'], we have a hyperspace of 5 dimensions:
['chicken', 'fried', 'oil', 'pepper', 'water']. Further assume that in fc the
weight (word frequency) of each term is [8, 2, 7, 4], and in pc the weights are
[6, 5]; then, represented in our hyperspace, fc = [8, 2, 7, 4, 0] and
pc = [6, 0, 0, 0, 5]. Suppose we have a query q = ['fried', 'chicken'] with
each term weighted 1, i.e. q = [1, 1, 0, 0, 0]. In the vector space model we
then only need to calculate the cosine similarity between (q, fc) and (q, pc)
and compare the results: the more similar the topics, the larger the cosine
similarity. This notebook is a simple implementation of this idea.
Footnote: a collection is usually represented by a so-called term-by-document
sparse matrix, where the rows represent the weights of each term (feature) and
the columns represent the documents. (scikit-learn's vectorizers return the
transpose, a document-by-term matrix.)
'''
__author__ = 'Xia Wang'
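# As a quick sanity check of the recipe example above, here is a minimal
# numpy sketch; the weights are the made-up numbers from the docstring,
# not real data:
import numpy as np

def cosine(a, b):
    # cosine similarity = dot product / product of vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

fc = np.array([8, 2, 7, 4, 0])  # over ['chicken', 'fried', 'oil', 'pepper', 'water']
pc = np.array([6, 0, 0, 0, 5])
q = np.array([1, 1, 0, 0, 0])   # query 'fried chicken'

print(cosine(q, fc))  # higher: fc shares both query terms
print(cosine(q, pc))  # lower: pc shares only 'chicken'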
In [2]:
from __future__ import print_function
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
In [3]:
# create a collection of 2 documents plus a query
d1 = 'woof woof meow'
d2 = 'woof woof squeak'
query = 'meow, woof'
col = (d1, d2, query)
In [4]:
# create a collection matrix (using the count vectorizer)
countVectorizer = CountVectorizer()
# CountVectorizer.fit_transform returns a document-term sparse matrix:
# the rows represent the documents, and the columns represent terms.
# Iterating over a scipy sparse matrix yields its rows, so the 3 row vectors
# (2 documents + the query) can be unpacked into 3 variables.
d1_count, d2_count, q_count = countVectorizer.fit_transform(col)
print(d1_count.shape)
print(d2_count.shape)
print(q_count.shape)
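# To see what the vectorizer actually learned, we can inspect the fitted
# vocabulary and the dense form of the row vectors; vocabulary_ maps each
# term to its column index:
terms = sorted(countVectorizer.vocabulary_, key=countVectorizer.vocabulary_.get)
print(terms)
for row in (d1_count, d2_count, q_count):
    print(row.toarray())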
In [5]:
def content(cs):
    '''
    format the cosine similarity result for printing
    params:
    -----------------
    cs: the matrix returned by cosine_similarity (a 1x1 matrix in our case)
    '''
    return 'The cosine similarity between the two docs is {cs:.4}.'.format(cs=cs[0][0])
In [6]:
# let's see the cosine similarity of the two documents first
cs_docs = cosine_similarity(d1_count, d2_count)
print(content(cs_docs))
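# cosine_similarity computes dot(a, b) / (||a|| * ||b||); we can verify the
# number above by hand on the dense count vectors:
a = d1_count.toarray().ravel()
b = d2_count.toarray().ravel()
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # should match cs_docs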
In [7]:
# compare the query vector against each document vector
d1_q = cosine_similarity(d1_count, q_count)
print(content(d1_q))
d2_q = cosine_similarity(d2_count, q_count)
print(content(d2_q))
# so the query would rank the first document ('woof woof meow') highest
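# In a real retrieval setting we would score the query against every document
# in one call and rank by similarity; a minimal sketch with our two documents,
# reusing the already-fitted vocabulary:
doc_matrix = countVectorizer.transform((d1, d2))
sims = cosine_similarity(doc_matrix, q_count).ravel()
ranking = sims.argsort()[::-1]  # best match first
print(ranking, sims[ranking])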
In [8]:
# repeat the exercise with tf-idf weights instead of raw counts
tfidfVectorizer = TfidfVectorizer()
d1_tf, d2_tf, q_tf = tfidfVectorizer.fit_transform(col)
for i in (d1_tf, d2_tf, q_tf):
    print(i.shape)
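# Unlike raw counts, tf-idf down-weights terms that appear in many texts.
# The fitted idf_ attribute shows this: 'woof' occurs in all 3 texts, so it
# gets the smallest idf weight.
for term, idx in sorted(tfidfVectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(term, tfidfVectorizer.idf_[idx])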
In [9]:
# again, cosine similarities
cs_docs_tf = cosine_similarity(d1_tf, d2_tf)
print(content(cs_docs_tf))
cs_d1_q_tf = cosine_similarity(d1_tf, q_tf)
print(content(cs_d1_q_tf))
cs_d2_q_tf = cosine_similarity(d2_tf, q_tf)
print(content(cs_d2_q_tf))
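# TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the cosine
# similarity of two tf-idf vectors is simply their dot product:
print(d1_tf.dot(d2_tf.T).toarray())  # should match cs_docs_tf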
In [10]:
# refit on a new collection where the query is replaced by a third document;
# the idf weights (and hence the similarities) depend on the collection
d3 = 'meow squeak'
cols = (d1, d2, d3)
d1_tf, d2_tf, d3_tf = tfidfVectorizer.fit_transform(cols)
cs_d1_d2 = cosine_similarity(d1_tf, d2_tf)
print(content(cs_d1_d2))
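# With the refit vectors we can also score the new document against the
# others, e.g. d1 vs d3 (they share only 'meow'):
cs_d1_d3 = cosine_similarity(d1_tf, d3_tf)
print(content(cs_d1_d3))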
In [ ]: