Improving the Search Index

(Inspired by and borrowed heavily from: Collective Intelligence - Luís F. Simões. IR version and assignments by J.E. Hoeksema, 2014-11-12. Converted to Python 3 and minor changes by Tobias Kuhn, 2015-11-10.)


This notebook's purpose is to improve the search index and query functions built in the previous notebooks and assignments.

Loading the Data and Defining Some Functions

This section is copied from the previous notebook.


In [1]:
import pickle, bz2, re
from collections import namedtuple, defaultdict, Counter
from IPython.display import display, HTML
from math import log10

Summaries_file = 'data/air__Summaries.pkl.bz2'
Abstracts_file = 'data/air__Abstracts.pkl.bz2'

Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )

paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )
for (id, paper_info) in Summaries.items():
    Summaries[id] = paper( *paper_info )
    
def display_summary( id, extra_text='' ):
    """
    Function for printing a paper's summary through IPython's Rich Display System.
    Trims long titles or author lists, and links to the paper's DOI (when available).
    """
    s = Summaries[ id ]
    
    title = ( s.title if s.title[-1]!='.' else s.title[:-1] )
    title = title[:150].rstrip() + ('' if len(title)<=150 else '...')
    if s.doi!='':
        title = '<a href=http://dx.doi.org/%s>%s</a>' % (s.doi, title)
    
    authors = ', '.join( s.authors[:5] ) + ('' if len(s.authors)<=5 else ', ...')
    
    lines = [
        title,
        authors,
        str(s.year),
        '<small>id: %d%s</small>' % (id, extra_text)
        ]
    
    display( HTML( '<blockquote>%s</blockquote>' % '<br>'.join(lines) ) )
    
def display_abstract( id, highlights=[]):
    """
    Function for displaying an abstract. Includes optional (naive) highlighting
    """
    a = Abstracts[ id ]
    for h in highlights:
        a = re.sub(r'\b(%s)\b'%h,'<mark>\\1</mark>',a, flags=re.IGNORECASE)
    display( HTML( '<blockquote>%s</blockquote>' % a ) )

In [2]:
def tokenize(text):
    """
    Function that tokenizes a string in a rather naive way. Can be extended later.
    """
    return text.split(' ')

def preprocess(tokens):
    """
    Perform linguistic preprocessing on a list of tokens. Can be extended later.
    """
    result = []
    for token in tokens:
        result.append(token.lower())
    return result

In [3]:
inverted_index = defaultdict(set)

for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)
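
With this index in place, looking up a term gives the set of ids of all abstracts containing it. As a quick illustrative check (the exact counts depend on the data), singular and plural forms of a word end up as completely separate terms in this simple index, which is one of the issues addressed in the next section:

In [ ]:
# Illustrative check: 'sample' and 'samples' have separate posting lists,
# so a query for one does not match documents that only contain the other.
print(len(inverted_index['sample']))
print(len(inverted_index['samples']))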

Stemming

As we saw in the results of the last assignment, our simple index doesn't handle punctuation well, nor the difference between singular and plural forms of the same word. A possible solution to these issues is to apply better tokenization and stemming. Fortunately, Python's NLTK package provides implementations of these algorithms that we can use. If you don't have NLTK installed yet, follow the installation instructions on the NLTK website.


In [4]:
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import EnglishStemmer
import nltk
nltk.download('punkt')
stemmer = EnglishStemmer()

s = '''Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks.'''

print(tokenize(s))
print(word_tokenize(s))


[nltk_data] Downloading package punkt to /home/tk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '', 'Please', 'buy', 'me', 'two', 'of', 'them.\n\nThanks.']
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

In [5]:
print(stemmer.stem("processes"))


process
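
Putting the two together, we can tokenize with word_tokenize and then stem each token. The sketch below applies this combination to the example sentence from above; it is only an illustration, and the assignment below asks you to integrate these steps into the indexing pipeline yourself:

In [ ]:
# Illustrative sketch: NLTK tokenization followed by stemming of each token
# (the Snowball stemmer also lowercases its input).
print([stemmer.stem(token) for token in word_tokenize(s)])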

Ranking

Another way to improve our search results is to rank them. A possible way to do this is to calculate a score for each document based on the matching terms from the query. One such scoring method is tf-idf, as explained in the lecture slides.

In order to quickly calculate the score for a term/document combination, we need quick access to three things:

  • tf(t,d) - how often term t occurs in document d (term frequency)
  • df(t) - in how many documents term t occurs (document frequency)
  • N - the total number of documents in our index

In [6]:
tf_matrix = defaultdict(Counter)
for (id, abstract) in Abstracts.items():
    tf_matrix[id] = Counter(preprocess(tokenize(abstract)))

def tf(t,d):
    return float(tf_matrix[d][t])

def df(t):
    return float(len(inverted_index[t]))
    
def num_documents():
    return float(len(Abstracts))

In [7]:
print(tf('air',16820458))
print(df('air'))
print(num_documents())


2.0
101905.0
190555.0

Using these three helper functions, we can now easily calculate the tf-idf weight of a term in a document by implementing the weighting formula from the slides, which you will do in the assignments below.
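
For reference, one common variant of this weighting is tf-idf(t,d) = tf(t,d) · log10(N / df(t)); the exact formula on the slides may differ (for example by log-scaling the term frequency), so follow the slides for the assignment. With the numbers printed above, this variant would give tf-idf('air', 16820458) = 2 · log10(190555 / 101905) ≈ 0.54.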

Assignments

Your name: ...

  • Change (in the code cell below) the smarter_tokenize function to use NLTK's word_tokenize function for tokenization, and the smarter_preprocess function to perform stemming in addition to case normalization. Does smarter_and_query("air sample") return the paper 26488732? Why (not)?

    Note: We are generating this index on a subset of the data, as generating an index with stemming on the entire set would take up to half an hour.


In [ ]:
# Change this code according to the task above:

from functools import reduce

def smarter_tokenize(text):
    # Change this
    return text.split(' ')

def smarter_preprocess(tokens):
    result = []
    for token in tokens:
        # Change this
        result.append(token.lower())
    return result

def smarter_and_query(query): # Regular and_query using smarter_tokenize and smarter_preprocess
    return reduce(lambda a, e: a.intersection(e), [smarter_index[term] for term in smarter_preprocess(smarter_tokenize(query))])

smarter_index = defaultdict(set)
# The code below creates an inverted index based on a subset of the documents
subset = set(Abstracts.keys()).intersection(set(range(26400000,26500000)))

for (id, abstract) in ((k, Abstracts[k]) for k in subset):
    for term in smarter_preprocess(smarter_tokenize(abstract)):
        smarter_index[term].add(id)

[Write your answer text here]

  • Create a function tfidf(t,d) that returns the tf-idf score of term t in document d, using the tf(t,d), df(t) and num_documents() functions we defined above. The tf-idf formula can be found on the lecture slides. Test your function with the examples shown below.

    You can use our old index for this task and the tasks below: you do not need to use the smarter tokenization and preprocessing functions from above.

    You can use the log10 function, imported above from the math module, to calculate base-10 logarithms.


In [ ]:
# Add your code here

#print(tfidf('air', 26488732))
#print(tfidf('samples', 26488732))
#print(tfidf('monkey', 26488732))

  • Create a function query(query_string) that accepts a single query string, which could consist of one or more words, and returns or prints a list of the (up to) 10 best-matching documents along with their scores.

    You should use tf-idf to calculate document scores based on the query, and the results should be ordered by score in descending order.

    Hint: Start by copying your or_query function from mini-assignment 2, then expand that to rank the results, making use of the tfidf(t,d) function you created earlier.

    Use the provided display_summary(id,extra_text) function to make the output a bit more 'search engine'-like.


In [ ]:
# Add your code here

  • Come up with a few example queries to run, and include the output here. Do the results make sense? Why (not)?

In [ ]:
# Add your code here

[Write your answer text here]