Introduction

We introduce the vector space model in this notebook.

This notebook is adapted from http://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/tfidf.html

From Documents to Bags-of-words

In the Vector Space Model (VSM), we view each document as a bag of words (a bag is a multiset: word order is ignored, but each word's multiplicity is kept).

In Python, a convenient way to represent a bag is collections.Counter.
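As a tiny, self-contained illustration (using a made-up sentence, not one of the documents below):


from collections import Counter

bag = Counter('the cat sat on the mat'.split())
print(bag)          # shows the count of every distinct word; 'the' occurs twice
print(bag['the'])   # 2
print(bag['dog'])   # 0 -- a Counter returns 0 for missing keys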


In [1]:
from pprint import pprint

import numpy as np
import contextlib

@contextlib.contextmanager
def printoptions(*args, **kwargs):
    original = np.get_printoptions()
    np.set_printoptions(*args, **kwargs)
    yield 
    np.set_printoptions(**original)

## compact print (numpy array) 
def cprint(x):
    with printoptions(precision=3, suppress=True, linewidth=120):
        print(x)        
    
# test
x = np.random.random(10)
cprint(x)


[ 0.174  0.711  0.054  0.656  0.928  0.664  0.177  0.23   0.686  0.88 ]

In [2]:
#examples taken from here: http://stackoverflow.com/a/1750187
doc_list = ['Julie loves me more than Linda loves me',
            'Jane likes me more than Julie loves me',
            'He likes basketball more than baseball']

bags = []

from collections import Counter

# we use lower() here so that the sorted vocabulary is easier to see.  
bags = [Counter(doc.lower().split()) for doc in doc_list]

for bag in bags:
    print(bag)


Counter({'loves': 2, 'me': 2, 'more': 1, 'linda': 1, 'julie': 1, 'than': 1})
Counter({'me': 2, 'julie': 1, 'loves': 1, 'more': 1, 'jane': 1, 'than': 1, 'likes': 1})
Counter({'more': 1, 'baseball': 1, 'basketball': 1, 'he': 1, 'than': 1, 'likes': 1})

Now we construct the vocabulary and the doc-term matrix.


In [3]:
import itertools

vocabulary = sorted(set(itertools.chain(*bags)))  # the sorted union of the terms in all bags
vocabulary


Out[3]:
['baseball',
 'basketball',
 'he',
 'jane',
 'julie',
 'likes',
 'linda',
 'loves',
 'me',
 'more',
 'than']

In [4]:
def print_vocabulary_vertically(voc, leading_str = '', spacing=2, align=1):
    # align = 0: align top; otherwise, align bottom
    max_len = max([len(v) for v in voc])
    for i in range(max_len):
        if align == 0:
            line = [v[i] if i < len(v) else ' ' for v in voc]
        else:
            line = [' ' if i < max_len - len(v) else v[i-max_len] for v in voc]
        print('{}{}'.format(leading_str, (' '*spacing).join(line)))

print_vocabulary_vertically(vocabulary, align=0)


b  b  h  j  j  l  l  l  m  m  t
a  a  e  a  u  i  i  o  e  o  h
s  s     n  l  k  n  v     r  a
e  k     e  i  e  d  e     e  n
b  e        e  s  a  s         
a  t                           
l  b                           
l  a                           
   l                           
   l                           

In [5]:
for doc in doc_list:
    print(doc)
print()
vec_list = []
print_vocabulary_vertically(vocabulary, leading_str=' ')
print('-'*70)
for bag in bags:
    vec = [bag[v] for v in vocabulary] # Counter['non-existing-key'] = 0
    vec_list.append(vec)
    print(vec)


Julie loves me more than Linda loves me
Jane likes me more than Julie loves me
He likes basketball more than baseball

    b                           
    a                           
 b  s                           
 a  k                           
 s  e                           
 e  t        j  l  l  l         
 b  b     j  u  i  i  o     m  t
 a  a     a  l  k  n  v     o  h
 l  l  h  n  i  e  d  e  m  r  a
 l  l  e  e  e  s  a  s  e  e  n
----------------------------------------------------------------------
[0, 0, 0, 0, 1, 0, 1, 2, 2, 1, 1]
[0, 0, 0, 1, 1, 1, 0, 1, 2, 1, 1]
[1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

Now we transform the raw term frequency $tf$ into a log-scaled weight, $1 + \log(tf)$ (a raw count of 0 is mapped to 0).
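
For example, a raw count of 2 becomes $1 + \ln 2 \approx 1.693$ (the value you will see below for 'loves' and 'me' in the first document), a raw count of 1 stays $1 + \ln 1 = 1$, and a raw count of 0 is mapped to 0.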


In [6]:
from math import log
def normalize_tf(tf):
    if tf == 0:
        return 0.0
    else:
        return 1.0 + log(tf)

In [7]:
tf_vec_list = []
for vec in vec_list:
    tf_vec_list.append([normalize_tf(val) for val in vec])

print_vocabulary_vertically(vocabulary, leading_str='   ', spacing=6)
print('-'*80)
cprint(np.matrix(tf_vec_list))


          b                                                               
          a                                                               
   b      s                                                               
   a      k                                                               
   s      e                                                               
   e      t                    j      l      l      l                     
   b      b             j      u      i      i      o             m      t
   a      a             a      l      k      n      v             o      h
   l      l      h      n      i      e      d      e      m      r      a
   l      l      e      e      e      s      a      s      e      e      n
--------------------------------------------------------------------------------
[[ 0.     0.     0.     0.     1.     0.     1.     1.693  1.693  1.     1.   ]
 [ 0.     0.     0.     1.     1.     1.     0.     1.     1.693  1.     1.   ]
 [ 1.     1.     1.     0.     0.     1.     0.     0.     0.     1.     1.   ]]

Adding idf

The vectors in vec_list above only keep track of term frequencies (tf). Now we weight each element as $w_{d, t} = tf(d, t) \cdot idf(t)$.

Note that the textbook version of idf, $\log(\frac{ndocs}{df})$, becomes 0 when df == ndocs. We therefore follow the implementation in Lucene (https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html), which uses: $$idf(t) = 1 + \log\left(\frac{ndocs}{df + 1}\right)$$
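
For example, with $ndocs = 3$ and $df = 1$ (e.g. the term 'he'), $idf(\text{'he'}) = 1 + \ln(3/2) \approx 1.405$, which matches the values computed below.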

By the way, on the page above you can see the details and justifications of all major deviations of Lucene's implementation from the standard VSM. You will often need to make this kind of adaptation to tackle the new or specific problems you face once you join the workforce :p. This is why a deep understanding of the principles behind the knowledge you learn at university is all the more important.


In [8]:
from math import log
def idf(cnt, ndocs): 
    # here we use ln(). The base of the log does not matter much. 
    return 1.0 + log(ndocs/(cnt+1))

In [9]:
ndocs = len(doc_list)
# voc = [ (v, [b[v] for b in bags]) for v in vocabulary] # if you want to see the individual counts
voc = [(v, sum([b[v] for b in bags])) for v in vocabulary]
pprint(voc)


[('baseball', 1),
 ('basketball', 1),
 ('he', 1),
 ('jane', 1),
 ('julie', 2),
 ('likes', 2),
 ('linda', 1),
 ('loves', 3),
 ('me', 4),
 ('more', 3),
 ('than', 3)]

We now record the idf values in voc. You can compare the raw counts with the resulting idf values in the outputs of the cells above and below. (Strictly speaking, the count computed here is the total number of occurrences of each term across all documents, i.e. its collection frequency, rather than the number of documents containing it; for this toy collection the two only differ for 'loves' and 'me'.)


In [10]:
voc = [(v, idf( sum([b[v] for b in bags]), ndocs)) for v in vocabulary]
idf_dict = dict(voc)
pprint(voc)
print(idf_dict['he'])


[('baseball', 1.4054651081081644),
 ('basketball', 1.4054651081081644),
 ('he', 1.4054651081081644),
 ('jane', 1.4054651081081644),
 ('julie', 1.0),
 ('likes', 1.0),
 ('linda', 1.4054651081081644),
 ('loves', 0.7123179275482191),
 ('me', 0.4891743762340093),
 ('more', 0.7123179275482191),
 ('than', 0.7123179275482191)]
1.4054651081081644

Now we can look at the final weighted vectors of the documents.

We choose to implement this with matrix operations supported by numpy. E.g., to obtain the elementwise products $\{a_i b_i\}_{i=1}^n$ from two vectors $\vec{A} = \{a_i\}_{i=1}^n$ and $\vec{B} = \{b_i\}_{i=1}^n$, we first construct a diagonal matrix $\mathbf{D_A}$ whose diagonal elements are the $a_i$, and then obtain the desired result by the standard matrix product $\vec{B} \mathbf{D_A}$.
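
As a small illustration (with made-up numbers, not part of this notebook's data), the diagonal-matrix trick gives the same result as numpy's built-in elementwise multiplication:


import numpy as np

a = np.array([1.0, 2.0, 3.0])   # plays the role of the idf vector
b = np.array([4.0, 5.0, 6.0])   # plays the role of a (log-scaled) tf vector

D_a = np.diag(a)                # diagonal matrix with the a_i on the diagonal
print(b @ D_a)                  # [ 4. 10. 18.]  -- the matrix product B D_A
print(a * b)                    # same result via elementwise multiplication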


In [11]:
import numpy as np

def build_idf_matrix(idf_vector):
    idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
    np.fill_diagonal(idf_mat, idf_vector)
    return idf_mat

idf_matrix = build_idf_matrix([v[1] for v in voc])
cprint(idf_matrix)


[[ 1.405  0.     0.     0.     0.     0.     0.     0.     0.     0.     0.   ]
 [ 0.     1.405  0.     0.     0.     0.     0.     0.     0.     0.     0.   ]
 [ 0.     0.     1.405  0.     0.     0.     0.     0.     0.     0.     0.   ]
 [ 0.     0.     0.     1.405  0.     0.     0.     0.     0.     0.     0.   ]
 [ 0.     0.     0.     0.     1.     0.     0.     0.     0.     0.     0.   ]
 [ 0.     0.     0.     0.     0.     1.     0.     0.     0.     0.     0.   ]
 [ 0.     0.     0.     0.     0.     0.     1.405  0.     0.     0.     0.   ]
 [ 0.     0.     0.     0.     0.     0.     0.     0.712  0.     0.     0.   ]
 [ 0.     0.     0.     0.     0.     0.     0.     0.     0.489  0.     0.   ]
 [ 0.     0.     0.     0.     0.     0.     0.     0.     0.     0.712  0.   ]
 [ 0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.712]]

In [12]:
import math
def l2_normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]
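
(Equivalently, one could divide vec by np.linalg.norm(vec); the normalize() helper defined further below does exactly that.)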

In [13]:
doc_term_matrix_tfidf = []

#performing tf-idf matrix multiplication
for vec in tf_vec_list:
    doc_term_matrix_tfidf.append(np.dot(vec, idf_matrix))
cprint(np.matrix(doc_term_matrix_tfidf)) # np.matrix() just to make it easier to look at


#normalizing
print('\nAfter normalization ...')
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))

cprint(np.matrix(doc_term_matrix_tfidf_l2))


[[ 0.     0.     0.     0.     1.     0.     1.405  1.206  0.828  0.712  0.712]
 [ 0.     0.     0.     1.405  1.     1.     0.     0.712  0.828  0.712  0.712]
 [ 1.405  1.405  1.405  0.     0.     1.     0.     0.     0.     0.712  0.712]]

After normalization ...
[[ 0.     0.     0.     0.     0.404  0.     0.568  0.487  0.335  0.288  0.288]
 [ 0.     0.     0.     0.565  0.402  0.402  0.     0.286  0.333  0.286  0.286]
 [ 0.499  0.499  0.499  0.     0.     0.355  0.     0.     0.     0.253  0.253]]

Look at the matrix m printed above (before normalization).

  • m[0][6] = 1.405, because the log-scaled $tf(d_0, \text{'linda'}) = 1$ and $idf(\text{'linda'}) = 1.4054651081081644$.
  • m[0][8] = 0.828, because the log-scaled $tf(d_0, \text{'me'}) = 1 + \ln 2 \approx 1.693$ and $idf(\text{'me'}) = 0.4891743762340093$.
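
A quick way to verify these two entries in an extra cell, reusing the variables defined above:


print(np.isclose(doc_term_matrix_tfidf[0][6], normalize_tf(1) * idf_dict['linda']))  # True
print(np.isclose(doc_term_matrix_tfidf[0][8], normalize_tf(2) * idf_dict['me']))     # True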

Measuring the Similarity of Two Documents

We use the cosine of the angle between two weighted vectors (cosine similarity) to measure how similar the corresponding documents are.
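
Recall that the cosine similarity of two vectors $\vec{u}$ and $\vec{v}$ is $$\cos(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert \, \lVert \vec{v} \rVert},$$ i.e., the inner product of the two vectors after each has been scaled to unit length -- which is exactly what the cosine() function below computes.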


In [14]:
import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm

def cosine(u, v):
    a = normalize(u)
    b = normalize(v)
    return np.inner(a, b)

In [15]:
cosine(doc_term_matrix_tfidf[0], doc_term_matrix_tfidf[1])


Out[15]:
0.5781798652650999

In [16]:
cosine(doc_term_matrix_tfidf[0], doc_term_matrix_tfidf[2])


Out[16]:
0.14544242471587354

In [17]:
cosine(doc_term_matrix_tfidf[1], doc_term_matrix_tfidf[2])


Out[17]:
0.28752866083029266

Modelling a Query as a Document

We implement an equivalent form of the lnc.ltc weighting scheme. There is therefore no idf weighting on the query terms, since idf has already been applied to the document vectors.


In [18]:
q = 'Linda likes me'

In [19]:
def encode(q, voc):
    # log-scaled tf vector of the query over the vocabulary;
    # a Counter returns 0 for missing terms, and normalize_tf(0) == 0.0
    c = Counter(q.lower().split())
    return [normalize_tf(c[v]) for v in voc]

In [20]:
vec_q = encode(q, vocabulary)
print_vocabulary_vertically(vocabulary, leading_str=' ', spacing=4)
vec_q


      b                                             
      a                                             
 b    s                                             
 a    k                                             
 s    e                                             
 e    t              j    l    l    l               
 b    b         j    u    i    i    o         m    t
 a    a         a    l    k    n    v         o    h
 l    l    h    n    i    e    d    e    m    r    a
 l    l    e    e    e    s    a    s    e    e    n
Out[20]:
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]

Scoring Documents


In [21]:
for i, doc_vec in enumerate(doc_term_matrix_tfidf):
    print('score of doc[{}] = {}'.format(i, cosine(doc_vec, vec_q)))


score of doc[0] = 0.5208482997884292
score of doc[1] = 0.4244788020052428
score of doc[2] = 0.20488374908794352

In [22]:
def topk(query, k):
    # return (index, score) pairs of the k highest-scoring documents
    v_q = encode(query, vocabulary)
    out = []
    for i, doc_vec in enumerate(doc_term_matrix_tfidf):
        out.append((i, cosine(doc_vec, v_q)))
    answer = sorted(out, key=lambda r: r[1], reverse=True)[:k]
    return answer

In [23]:
topk('loves baseball', 3)


Out[23]:
[(2, 0.352673810629198), (0, 0.34442828548738263), (1, 0.20255422347635244)]
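
Document 2 narrowly beats document 0 here because 'baseball' is a rare term with the highest idf (about 1.405), whereas 'loves' appears in two of the three documents and thus carries a lower idf (about 0.712); even two occurrences of 'loves' in document 0 are not quite enough to compensate.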
