Mini-Assignment 3: Improving the Search Index

In this mini-assignment, we will improve the search index and query functions from the previous mini-assignment.

Loading the Data and Defining Auxiliary Functions

This section is copied from the previous notebook.


In [1]:
import pickle, bz2, re
from collections import namedtuple, defaultdict, Counter
from IPython.display import display, HTML
from math import log10, sqrt

Summaries_file = 'data/malaria__Summaries.pkl.bz2'
Abstracts_file = 'data/malaria__Abstracts.pkl.bz2'

Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )

paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )
for (id, paper_info) in Summaries.items():
    Summaries[id] = paper( *paper_info )

In [2]:
def tokenize(text):
    """
    Function that tokenizes a string in a rather naive way. Can be extended later.
    """
    return text.split(' ')

def preprocess(tokens):
    """
    Perform linguistic preprocessing on a list of tokens. Can be extended later.
    """
    result = []
    for token in tokens:
        result.append(token.lower())
    return result
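
For example, chaining the two functions turns a raw string into a list of lowercased tokens (the sentence below is just an arbitrary illustration):

In [ ]:
# Example: tokenize a sentence and lowercase the resulting tokens
print(preprocess(tokenize("The Plasmodium parasite causes Malaria in humans")))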

In [3]:
def display_summary( id, show_abstract=False, show_id=True, extra_text='' ):
    """
    Function for printing a paper's summary through IPython's Rich Display System.
    Trims long author lists, and adds a link to the paper's DOI (when available).
    """
    s = Summaries[id]
    lines = []
    title = s.title
    if s.doi != '':
        title = '<a href=http://dx.doi.org/%s>%s</a>' % (s.doi, title)
    title = '<strong>' + title + '</strong>'
    lines.append(title)
    authors = ', '.join( s.authors[:20] ) + ('' if len(s.authors) <= 20 else ', ...')
    lines.append(str(s.year) + '. ' + authors)
    if (show_abstract):
        lines.append('<small><strong>Abstract:</strong> <em>%s</em></small>' % Abstracts[id])
    if (show_id):
        lines.append('[ID: %d]' % id)
    if (extra_text != ''):
        lines.append(extra_text)
    display( HTML('<br>'.join(lines)) )
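
For example, the exemplary paper used later in this notebook can be shown as follows (assuming its ID, 24130474, is present in the dataset):

In [ ]:
# Example: display one paper's summary, including its abstract
display_summary(24130474, show_abstract=True)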

In [4]:
inverted_index = defaultdict(set)

for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)
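
The index maps each preprocessed term to the set of IDs of the documents whose abstract contains that term. For example, we can check how many abstracts contain a given term and peek at a few of the matching IDs (the term 'malaria' is just an illustrative choice):

In [ ]:
# Example: inspect the postings set of a single term
print(len(inverted_index['malaria']))
print(sorted(inverted_index['malaria'])[:5])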

Stemming

As we could see from the results of the last assignment, our simple index does not handle punctuation well, nor the difference between singular and plural forms of the same word. Smarter tokenization and stemming can address these two problems. Fortunately, Python's NLTK package provides implementations of both that we can use. If NLTK is not already included with your Python installation, you have to install it first by following the installation instructions on the NLTK website.


In [5]:
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import EnglishStemmer
import nltk
nltk.download('punkt')
stemmer = EnglishStemmer()

s = '''Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks.'''

print(tokenize(s))
print(word_tokenize(s))


[nltk_data] Downloading package punkt to /home/tk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '', 'Please', 'buy', 'me', 'two', 'of', 'them.\n\nThanks.']
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

In [6]:
print(stemmer.stem("processes"))


process
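
Note that the stemmer maps different surface forms of a word to the same stem, which is exactly what we need to conflate singular and plural variants (the word list below is just an illustration):

In [ ]:
# Example: the stemmer conflates morphological variants to a common stem
print([stemmer.stem(w) for w in ['network', 'networks', 'process', 'processes', 'processing']])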

Ranking

Another way to improve our search results is to rank them. One possibility is to calculate a score for each document based on the query terms it matches. One such scoring method is tf-idf, which comes in several variants, as explained in the lecture slides.

In order to quickly calculate the scores for a term/document combination, we'll need quick access to a few quantities:

  • tf(t,d): how often term t occurs in document d
  • df(t): in how many documents term t occurs
  • N: the total number of documents in our index
  • length_tf(d): the length of the document vector of d (using vectors of plain tf values)

In [7]:
tf_matrix = defaultdict(Counter)
length_values = defaultdict(int)

for (doc_id, abstract) in Abstracts.items():
    tokens = preprocess(tokenize(abstract))
    tf_matrix[doc_id] = Counter(tokens)
    l = 0
    for t in tf_matrix[doc_id].keys():
        l += tf_matrix[doc_id][t] ** 2
    length_values[doc_id] = sqrt(l)

def tf(t,d):
    return float(tf_matrix[d][t])

def df(t):
    return float(len(inverted_index[t]))
    
def num_documents():
    return float(len(Abstracts))

def length_tf(d):
    return length_values[d]

Let's test these functions with some examples:


In [8]:
print(tf('network', 24130474))
print(df('network'))
print(num_documents())
print(length_tf(24130474))


3.0
403.0
67028.0
27.110883423451916

Using the tf, df and num_documents helper functions, we can now easily calculate the tf-idf weight of a term in a document by implementing the weighting formula from the slides, which you will do in the assignments below.
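(For reference: in the common ntn weighting scheme, the weight of term t in document d is tf(t,d) · log10(N / df(t)), but check the lecture slides for the exact definition used in this course.)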


Assignments

Your name: ...

Task 1

In the code block below, implement the smarter_tokenize function using NLTK's word_tokenize function, and the smarter_preprocess function, which should perform stemming in addition to case normalization.


In [ ]:
# Smarter linguistic processing

# Your code here:

#def smarter_tokenize(text):
#  ...

#def smarter_preprocess(tokens):
#  ...

# To test it:
print(smarter_preprocess(smarter_tokenize("Mary had a little group of processes worth $5.4")))

Now we can make a smarter index based on these functions. For practical purposes, the code below generates the smarter index on a subset of the data, as generating an index with stemming on the entire set would take too much time. (You don't need to change or add anything here.)


In [ ]:
# Below, we create our smarter index (based on a subset of the documents for demonstration purposes)
smarter_index = defaultdict(set)

# Here we define the subset (somewhat arbitrary):
subset_of_ids = set(key for key in Abstracts.keys() if 24100000 <= key < 24200000)
subset_of_abstracts = ((key, Abstracts[key]) for key in subset_of_ids)

# Uncomment this line to process the whole corpus (might take a long time):
#subset_of_abstracts = Abstracts.items()

# Building our smarter index:
for (id, abstract) in subset_of_abstracts:
    for term in smarter_preprocess(smarter_tokenize(abstract)):
        smarter_index[term].add(id)
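
Once the functions from Task 1 are implemented and the cell above has been run, a quick sanity check is to look at the size of the new index and at the postings of one stemmed term (the stem 'parasit' is only an illustrative guess):

In [ ]:
# Quick sanity check on the smarter index (assumes the index above has been built)
print(len(smarter_index))
print(len(smarter_index['parasit']))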

Now implement the smarter_and_query function, based on the smarter_tokenize and smarter_preprocess functions above. You can start from the code for and_query from the last assignment.


In [ ]:
# Smarter and_query based on the smarter tokenize and preprocess functions

# Your code here:

#def smarter_and_query(query_string):
#  ...

Task 2

Run the queries "red blood cell" and "pfemp1" with the new smarter_and_query function from task 1. Do they return our exemplary paper 24130474? For each of these examples, what do our new smarter functions specifically contribute to the result (as compared to our previous naive implementations for tokenization and preprocessing)?


In [ ]:
# Add your code here

Answer: [Write your answer text here]

Task 3

Create a function tfidf(t,d) that returns the tf-idf score of term t in document d by using the tf(t,d), df(t) and num_documents() functions we defined above. The tf-idf formula can be found on the lecture slides. Use tf-idf with plain (non-logarithmic) term frequency, as applied by scoring variant ntn. Test your function with the examples shown below.

You can use our old index for this task and the tasks below: You do not need to include the results from above with the smarter tokenization and preprocessing functions.

You can use the log10(n) function to calculate the base 10 logarithm.


In [ ]:
#def tfidf(t,d):
#    ...

#print(tfidf('network', 24130474))
#print(tfidf('var', 24130474))
#print(tfidf('surface', 24130474))

Task 4

Create a function query_ntn(query_string) that accepts a single query string, possibly consisting of multiple words, and returns or prints a list of (up to) the 10 best-matching documents together with their scores. Use tf-idf to calculate document scores based on the query, applying variant ntn, as above (see the formula for score_ntn on the lecture slides). Use an auxiliary function score_ntn to calculate the score. The results should be shown in descending order by score.

You can start by copying your or_query function from mini-assignment 2, then expand that to rank the results, making use of the tfidf(t,d) function you created above.


In [ ]:
#def score_ntn(query_words, doc_id):
#    ...

#def query_ntn(query_string):
#    ...

Task 5

Create a second version of the query function from Task 4, and call it query_nnc. As its name suggests, this second version should use variant nnc instead of ntn, and therefore apply the cosine similarity measure. (See the formula for score_nnc on the lecture slides. You can drop the square root of |q| in the formula, as indicated on the slides, because it is the same for every document and therefore does not affect the ranking.) You can use the length_tf function defined above. Again use an auxiliary function called score_nnc. You can start by copy-pasting the code from Task 4.

Use the provided display_summary function to make the output a bit more like the results page of a search engine, and demonstrate your query_nnc function with two or three interesting example queries.


In [ ]:
#def score_nnc(query_words, doc_id):
#    ...

#def query_nnc(query_string):
#    ...