Improving the Search Index

Inspired by and borrowed heavily from: Collective Intelligence - Luís F. Simões
IR version and assignments by J.E. Hoeksema, 2014-11-12


This notebook's purpose is to improve the search index and query functions that were built in the previous notebook and its assignments.

Loading the Data, Defining some functions

This section is copied from the previous notebook.


In [1]:
import cPickle, bz2, re
from collections import namedtuple, defaultdict, Counter
from IPython.display import display, HTML
from math import log10

data_path = './' # e.g. 'C:\Downloads\' (includes trailing slash)

Summaries_file = data_path + 'evolution__Summaries.pkl.bz2'
Abstracts_file = data_path + 'evolution__Abstracts.pkl.bz2'

Summaries = cPickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
Abstracts = cPickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )

paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )
for (id, paper_info) in Summaries.iteritems():
    Summaries[id] = paper( *paper_info )
    
def display_summary( id, extra_text='' ): # For a non-notebook version: see the Discussion Board
    """
    Function for printing a paper's summary through IPython's Rich Display System.
    Trims long titles or author lists, and links to the paper's DOI (when available).
    """
    s = Summaries[ id ]
    
    title = ( s.title if s.title[-1]!='.' else s.title[:-1] )
    title = title[:150].rstrip() + ('' if len(title)<=150 else '...')
    if s.doi!='':
        title = '<a href=http://dx.doi.org/%s>%s</a>' % (s.doi, title)
    
    authors = ', '.join( s.authors[:5] ) + ('' if len(s.authors)<=5 else ', ...')
    
    lines = [
        title,
        authors,
        str(s.year),
        '<small>id: %d%s</small>' % (id, extra_text)
        ]
    
    display( HTML( '<blockquote>%s</blockquote>' % '<br>'.join(lines) ) )
    
def display_abstract( id, highlights=[]): # For a non-notebook version: see the Discussion Board
    """
    Function for displaying an abstract. Includes optional (naive) highlighting
    """
    a = Abstracts[ id ]
    for h in highlights:
        a = re.sub(r'\b(%s)\b'%h,'<mark>\\1</mark>',a, flags=re.IGNORECASE)
    display( HTML( '<blockquote>%s</blockquote>' % a ) )

In [2]:
def tokenize(text):
    """
    Function that tokenizes a string in a rather naive way. Can be extended later.
    """
    return text.split(' ')

def preprocess(tokens):
    """
    Perform linguistic preprocessing on a list of tokens. Can be extended later.
    """
    result = []
    for token in tokens:
        result.append(token.lower())
    return result

In [3]:
inverted_index = defaultdict(set)

# Takes a while
for (id, abstract) in Abstracts.iteritems():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)

In [4]:
# Short (one-liner) versions of and_query and or_query
def or_query(query):
    return reduce(lambda a, e: a.union(e), [inverted_index[term] for term in preprocess(tokenize(query))])

def and_query(query): 
    return reduce(lambda a, e: a.intersection(e), [inverted_index[term] for term in preprocess(tokenize(query))])

Stemming

As we could see from the results of the last assignment, our simple index doesn't handle punctuation very well, nor the difference between singular and plural forms of the same word. A possible solution to these issues is to apply better tokenization and stemming. Fortunately, Python's NLTK package provides implementations of these algorithms that we can use:


In [5]:
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import EnglishStemmer
import nltk
nltk.download('punkt')
stemmer = EnglishStemmer()

s = '''Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks.'''

print tokenize(s)
print word_tokenize(s)


[nltk_data] Downloading package punkt to /Users/username/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '', 'Please', 'buy', 'me', 'two', 'of', 'them.\n\nThanks.']
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

In [6]:
print stemmer.stem("processes")


process
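
To get a better feel for what the stemmer does, we can run it over a few related word forms. This is just an illustration: the stems are produced by the Snowball rules, they are not always dictionary words, and not every related form necessarily ends up with the same stem.

for word in ['process', 'processes', 'processing', 'evolution', 'evolutionary']:
    # Print each word next to the stem the Snowball stemmer assigns to it
    print word, '->', stemmer.stem(word)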

Ranking

Another way to improve our search results is to rank them. A possible way to do this is to calculate a score for each document based on the matching terms from the query. One such scoring method is tf.idf, as explained in the slides from Lecture 5.

In order to quickly calculate the score for a term/document combination, we'll need quick access to a few quantities:

  • tf(t,d) - the number of times term t occurs in document d
  • df(t) - the number of documents in which term t occurs
  • N - the total number of documents in our index

In [7]:
tf_matrix = defaultdict(Counter)
for (id, abstract) in Abstracts.iteritems():
    tf_matrix[id] = Counter(preprocess(tokenize(abstract)))

def tf(t,d):
    return float(tf_matrix[d][t])

def df(t):
    return float(len(inverted_index[t]))
    
def num_documents():
    return float(len(Abstracts))

In [8]:
print tf('evolution',23144668)
print df('evolution')
print num_documents()


3.0
128206.0
205343.0

Using these three helper functions, we can now easily calculate the tf.idf weights of a term in a document by implementing the weighting formula from the slides, which you will do in the assignments below.
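
As a quick sanity check on the numbers above, we can compute the idf component for a very common term. This sketch assumes the usual idf definition log10(N / df(t)); the exact weighting formula to implement is the one from the Lecture 5 slides.

# 'evolution' appears in more than half of the abstracts, so its idf is small:
print log10(num_documents() / df('evolution')) # log10(205343 / 128206), roughly 0.20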

Assignments

  • Change (in the code cell below) the smarter_tokenize function to use NLTK's word_tokenize function for tokenization, and the smarter_preprocess function to perform stemming in addition to case normalization. Does smarter_and_query("evolutionary process") return our example paper 23144668? Why (not)?

    We are purposely generating this index on a subset of the data, as generating an index with stemming on the entire set would take up to half an hour.


In [9]:
def smarter_tokenize(text):
    # Change this
    return text.split(' ')

def smarter_preprocess(tokens):
    result = []
    for token in tokens:
        # Change this
        result.append(token.lower())
    return result

def smarter_and_query(query): # Regular and_query using smarter_tokenize and smarter_preprocess
    return reduce(lambda a, e: a.intersection(e), [smarter_index[term] for term in smarter_preprocess(smarter_tokenize(query))])

smarter_index = defaultdict(set)
# Create a subset of around 1400 document IDs
subset = set(Abstracts.keys()).intersection(set(xrange(23100000,23200000)))

for (id, abstract) in ((k, Abstracts[k]) for k in subset):
    for term in smarter_preprocess(smarter_tokenize(abstract)):
        smarter_index[term].add(id)
        
print smarter_and_query("evolutionary process")
print 23144668 in smarter_and_query("evolutionary process")


set([23179131, 23185476, 23106710, 23148266, 23193289, 23192074, 23138755, 23198988, 23194053, 23188590, 23168079, 23132463, 23197827, 23136734, 23116822, 23100716, 23166479, 23135324, 23106717, 23188175, 23133349])
False

  • Create a function tfidf(t,d) that returns the tf.idf score of term t in document d by using the tf(t,d), df(t) and num_documents() functions we defined above. The tf.idf formula can be found on the slides for Lecture 5.

    You can use the `log10(n)` function from the `math` module (imported above) to calculate the base-10 logarithm, and you can use the code cell below to verify your results.


In [10]:
#print tfidf('embodied',23144668) # 3.35343851032
#print tfidf('evolution',23144668) # 0.302176987782
#print tfidf('notinthedocument',23144668) # 0.0

  • Create a function query(query_string), which accepts as input a single query string that could consist of one or more words, and returns or prints a list of the (up to) 10 best-matching documents, along with their scores.

    You should use tf.idf to calculate document scores based on the query, and the results should be ordered by score in descending order.

    Hint: Start by copying your `or_query` function from mini-assignment 2, then expand that to rank the results, making use of the `tfidf(t,d)` function you created earlier.

    You can either return or print a list of (id, score) tuples, or use the provided `display_summary(id,extra_text)` function to make the output a bit more 'search engine'-like. Both are valid for completing the assignment.

  • Come up with a few example queries to run, and include the output here. Do the results make sense? Why (not)?