We introduce the vector space model in this notebook.
This notebook is adapted from http://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/tfidf.html
In [1]:
from pprint import pprint
import numpy as np
import contextlib

@contextlib.contextmanager
def printoptions(*args, **kwargs):
    # temporarily override numpy's print options, then restore the originals
    original = np.get_printoptions()
    np.set_printoptions(*args, **kwargs)
    yield
    np.set_printoptions(**original)

## compact print (numpy array)
def cprint(x):
    with printoptions(precision=3, suppress=True, linewidth=120):
        print(x)

# test
x = np.random.random(10)
cprint(x)
In [2]:
#examples taken from here: http://stackoverflow.com/a/1750187
doc_list = ['Julie loves me more than Linda loves me',
            'Jane likes me more than Julie loves me',
            'He likes basketball more than baseball']

from collections import Counter
# we use lower() here so that the sorted vocabulary is easier to see.
bags = [Counter(doc.lower().split()) for doc in doc_list]
for bag in bags:
    print(bag)
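As an aside, a Counter returns 0 for a key it has never seen instead of raising KeyError; the doc-term matrix construction below relies on this behavior. A quick illustration:

print(bags[0]['basketball'])  # 0: 'basketball' does not occur in doc 0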
Now we construct the vocabulary and the doc-term matrix.
In [3]:
import itertools
vocabulary = sorted(set(itertools.chain.from_iterable(bags)))
vocabulary
Out[3]:
['baseball', 'basketball', 'he', 'jane', 'julie', 'likes', 'linda', 'loves', 'me', 'more', 'than']
In [4]:
def print_vocabulary_vertically(voc, leading_str='', spacing=2, align=1):
    # align = 0: align top; otherwise, align bottom
    max_len = max([len(v) for v in voc])
    for i in range(max_len):
        if align == 0:
            line = [v[i] if i < len(v) else ' ' for v in voc]
        else:
            line = [' ' if i < max_len - len(v) else v[i-max_len] for v in voc]
        print('{}{}'.format(leading_str, (' '*spacing).join(line)))

print_vocabulary_vertically(vocabulary, align=0)
In [5]:
for doc in doc_list:
    print(doc)
print()

vec_list = []
print_vocabulary_vertically(vocabulary, leading_str=' ')
print('-'*70)
for bag in bags:
    vec = [bag[v] for v in vocabulary]  # a Counter gives 0 for a non-existing key
    vec_list.append(vec)
    print(vec)
Now we dampen the raw term frequency: a non-zero tf becomes $1 + \log(tf)$, and a zero tf stays 0.
In [6]:
from math import log

def normalize_tf(tf):
    if tf == 0:
        return 0.0
    else:
        return 1.0 + log(tf)
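A quick illustrative check of the scaling: a count of 2 becomes $1 + \ln 2 \approx 1.693$, which is exactly the tf value of 'me' in the first document below.

print(normalize_tf(0), normalize_tf(1), normalize_tf(2))  # 0.0 1.0 1.6931471805599454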
In [7]:
tf_vec_list = []
for vec in vec_list:
    tf_vec_list.append([normalize_tf(val) for val in vec])

print_vocabulary_vertically(vocabulary, leading_str=' ', spacing=6)
print('-'*80)
cprint(np.matrix(tf_vec_list))
The above vectors in vec_list only keep track of term frequencies (tfs). Now we make each element $w_{d, t} = tf(d, t) \cdot idf(t)$.

Note that the textbook version of idf would be 0 if $df = ndocs$. Therefore, we follow the implementation in Lucene (https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html), which gives:

$$idf(t) = 1 + \log\left(\frac{ndocs}{df + 1}\right)$$

BTW, on the above page you can see the details and justifications of all the major deviations of Lucene's implementation from the standard VSM. You will often need to make this kind of adaptation to tackle the new or specific problems you face when you join the workforce :p. Therefore, a deep understanding of the principles behind the knowledge you learned at university is more important.
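To make the difference concrete, here is a small hand-check (illustrative only, using this notebook's corpus size ndocs = 3): for a term occurring in every document, the textbook idf vanishes, while the Lucene-style variant keeps a small positive weight.

from math import log
ndocs, df = 3, 3
print(log(ndocs / df))               # textbook: ln(3/3) = 0.0
print(1.0 + log(ndocs / (df + 1)))   # Lucene-style: 1 + ln(3/4) ≈ 0.712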
In [8]:
from math import log

def idf(df, ndocs):
    # here we use ln(); the base of the log does not matter much.
    return 1.0 + log(ndocs/(df+1))
In [9]:
ndocs = len(doc_list)
# voc = [ (v, [b[v] for b in bags]) for v in vocabulary] # if you want to see the individual counts
voc = [(v, sum([b[v] for b in bags])) for v in vocabulary]
pprint(voc)
We will eventually record the idf values in voc. You can see the raw df values and the resulting idf values in the outputs of the above and the below cells.
In [10]:
voc = [(v, idf(sum(1 for b in bags if v in b), ndocs)) for v in vocabulary]
idf_dict = dict(voc)
pprint(voc)
print(idf_dict['he'])
Now we can look at the final weighted vectors of the documents.

We choose to implement this using matrix operations supported by numpy. E.g., to obtain $\{a_i b_i\}_{i=1}^n$ from two vectors $\vec{A} = \{a_i\}_{i=1}^n$ and $\vec{B} = \{b_i\}_{i=1}^n$, we first construct a diagonal matrix $\mathbf{D_A}$ whose diagonal elements are the $a_i$; the desired result is then the standard matrix product $\vec{B} \mathbf{D_A}$.
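A minimal sketch of this trick on toy data (the vectors a and b below are illustrative, not from the corpus). Note that plain elementwise multiplication gives the same result; the diagonal matrix is used here for pedagogy.

import numpy as np
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
D_a = np.diag(a)   # diagonal matrix with a on its diagonal
print(b @ D_a)     # [ 4. 10. 18.]
print(a * b)       # the same, via elementwise multiplication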
In [11]:
import numpy as np

def build_idf_matrix(idf_vector):
    idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
    np.fill_diagonal(idf_mat, idf_vector)
    return idf_mat

idf_matrix = build_idf_matrix([v[1] for v in voc])
cprint(idf_matrix)
In [12]:
import math

def l2_normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]
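One caveat worth noting (an observation, not from the original notebook): l2_normalizer would raise ZeroDivisionError on an all-zero vector. That cannot occur here, since every document contains at least one term; the normalize function defined further below guards against this case. A quick usage check:

print(l2_normalizer([3.0, 4.0]))  # [0.6, 0.8] -- a unit vector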
In [13]:
doc_term_matrix_tfidf = []
# performing the tf-idf matrix multiplication
for vec in tf_vec_list:
    doc_term_matrix_tfidf.append(np.dot(vec, idf_matrix))
cprint(np.matrix(doc_term_matrix_tfidf))  # np.matrix() just to make it easier to look at

# normalizing
print('\nAfter normalization ...')
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))
cprint(np.matrix(doc_term_matrix_tfidf_l2))
Look at the matrix m before normalization. m[0][6] = 1.405 because $tf(d_0, \text{'linda'}) = 1$ and $idf(\text{'linda'}) = 1.4054651081081644$. m[0][8] = 1.693 because $tf(d_0, \text{'me'}) = 1 + \ln 2 \approx 1.693$ and $idf(\text{'me'}) = 1.0$.
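We can verify these two entries directly (a quick illustrative check using the idf_dict built above):

print(1.0 * idf_dict['linda'])           # ~1.405
print(normalize_tf(2) * idf_dict['me'])  # ~1.693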
In [14]:
import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm

def cosine(u, v):
    a = normalize(u)
    b = normalize(v)
    return np.inner(a, b)
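Two quick sanity checks on toy vectors (illustrative, not from the corpus): parallel vectors should score 1.0 and orthogonal vectors 0.0.

print(cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0 (same direction)
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (orthogonal)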
In [15]:
cosine(doc_term_matrix_tfidf[0], doc_term_matrix_tfidf[1])
Out[15]:
In [16]:
cosine(doc_term_matrix_tfidf[0], doc_term_matrix_tfidf[2])
Out[16]:
In [17]:
cosine(doc_term_matrix_tfidf[1], doc_term_matrix_tfidf[2])
Out[17]:
In [18]:
q = 'Linda likes me'
In [19]:
def encode(q, voc):
    c = Counter(q.lower().split())
    return [normalize_tf(c[v]) for v in voc]  # c[v] is 0 for terms absent from q
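Note that encode() simply ignores query terms outside the vocabulary, and that the query vector is weighted by tf only, without idf. An illustrative call with the made-up term 'pizza':

cprint(np.array(encode('linda pizza', vocabulary)))  # 'pizza' is silently dropped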
In [20]:
vec_q = encode(q, vocabulary)
print_vocabulary_vertically(vocabulary, leading_str=' ', spacing=4)
vec_q
Out[20]:
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
In [21]:
for i in range(len(doc_term_matrix_tfidf)):
    print('score of doc[{}] = {}'.format(i, cosine(doc_term_matrix_tfidf[i], vec_q)))
In [22]:
def topk(query, k):
    v_q = encode(query, vocabulary)
    out = []
    for i in range(len(doc_term_matrix_tfidf)):
        out.append((i, cosine(doc_term_matrix_tfidf[i], v_q)))
    answer = sorted(out, key=lambda r: r[1], reverse=True)[:k]
    return answer
In [23]:
topk('loves baseball', 3)
Out[23]:
In [ ]: