Inspired by and borrowed heavily from: Collective Intelligence - Luís F. Simões
IR version and assignments by J.E. Hoeksema, 2014-11-03
This notebook's purpose is to build a simple search index (to be used for boolean retrieval)
In [1]:
data_path = './' # e.g. 'C:\Downloads\' (includes trailing slash)
Summaries_file = data_path + 'evolution__Summaries.pkl.bz2'
Abstracts_file = data_path + 'evolution__Abstracts.pkl.bz2'
In [2]:
import cPickle, bz2
from collections import namedtuple
Summaries = cPickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )
for (id, paper_info) in Summaries.iteritems():
    Summaries[id] = paper( *paper_info )
Abstracts = cPickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )
Let's have a look at the data for our example paper:
In [3]:
Summaries[23144668]
Out[3]:
In [4]:
Abstracts[23144668]
Out[4]:
We'll now define some utility functions: one to tokenize a string into terms, one to perform linguistic preprocessing on a list of terms, and one to display information about a paper in a nice way. Note that the tokenization and preprocessing functions are rather naive; you may have to make them smarter in a later assignment.
In [5]:
def tokenize(text):
    """
    Function that tokenizes a string in a rather naive way. Can be extended later.
    """
    return text.split(' ')

def preprocess(tokens):
    """
    Perform linguistic preprocessing on a list of tokens. Can be extended later.
    """
    result = []
    for token in tokens:
        result.append(token.lower())
    return result

print preprocess(tokenize("Lorem ipsum dolor sit AMET"))
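To see just how naive these functions still are, try a string containing punctuation (a toy example using the functions defined above):

print preprocess(tokenize("Evolutionary processes, as a term."))
# splitting on spaces only leaves punctuation attached to the tokens:
# ['evolutionary', 'processes,', 'as', 'a', 'term.']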
In [6]:
from IPython.display import display, HTML
import re
def display_summary( id, extra_text='' ):
    """
    Function for printing a paper's summary through IPython's Rich Display System.
    Trims long titles or author lists, and links to the paper's DOI (when available).
    """
    s = Summaries[ id ]
    title = ( s.title if s.title[-1] != '.' else s.title[:-1] )
    title = title[:150].rstrip() + ('' if len(title) <= 150 else '...')
    if s.doi != '':
        title = '<a href=http://dx.doi.org/%s>%s</a>' % (s.doi, title)
    authors = ', '.join( s.authors[:5] ) + ('' if len(s.authors) <= 5 else ', ...')
    lines = [
        title,
        authors,
        str(s.year),
        '<small>id: %d%s</small>' % (id, extra_text)
    ]
    display( HTML( '<blockquote>%s</blockquote>' % '<br>'.join(lines) ) )

def display_abstract( id, highlights=[] ):
    """
    Function for displaying an abstract. Includes optional (naive) highlighting.
    """
    a = Abstracts[ id ]
    for h in highlights:
        a = re.sub(r'\b(%s)\b' % h, '<mark>\\1</mark>', a, flags=re.IGNORECASE)
    display( HTML( '<blockquote>%s</blockquote>' % a ) )

display_summary(23144668)
display_abstract(23144668, ['embodied'])
We will now create an inverted index based on the words in the abstracts of the papers in our dataset. We will once again use the trick of a defaultdict with an empty set as its default value, to ensure each document is added to a posting list only once.
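As a quick illustration of this trick (a toy example, separate from the real index we build below):

from collections import defaultdict

postings = defaultdict(set)
postings['gene'].add(1)   # a missing key automatically starts out as an empty set
postings['gene'].add(1)   # adding the same paper id twice has no effect: sets deduplicate
print postings['gene']    # prints: set([1])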
Our end result will be a dictionary, where each key is a term, and each value is a posting list, represented by a set of paper IDs.
Note that not every paper in our summaries set has an abstract; we will only index papers for which an abstract is present.
In [7]:
from collections import defaultdict
inverted_index = defaultdict(set)
# Takes a while
for (id, abstract) in Abstracts.iteritems():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)
In [8]:
print inverted_index['embodied']
We can now use this inverted index to answer simple one-word queries, for example to get an arbitrary paper that contains the word 'embodied':
In [9]:
query_word = 'embodied'
first_paper = list(inverted_index[query_word])[0] # Note that we convert a set into a list in order to fetch its first element
display_summary(first_paper)
display_abstract(first_paper,[query_word])
Construct two functions (or_query and and_query) that will each take as input a single string, consisting of one or more words, and return a list of matching documents. or_query will return documents that contain at least one of the words in the query, while and_query requires all query terms to be present in the documents. Note that you can use the tokenize and preprocess functions we defined above to tokenize and preprocess your query. You can also exploit the fact that the posting lists are sets, which means you can easily perform set operations such as union, difference and intersection on them.
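For illustration, here is a minimal sketch of what such functions might look like. This is one possible approach building on the inverted_index, tokenize and preprocess defined above, not necessarily the intended solution:

def or_query(query_string):
    terms = preprocess(tokenize(query_string))
    # union of the posting lists of all query terms
    docs = set()
    for term in terms:
        docs = docs | inverted_index.get(term, set())
    return list(docs)

def and_query(query_string):
    terms = preprocess(tokenize(query_string))
    # intersection of the posting lists of all query terms
    docs = inverted_index.get(terms[0], set())
    for term in terms[1:]:
        docs = docs & inverted_index.get(term, set())
    return list(docs)

Using .get with an empty-set default avoids a side effect of plain indexing on a defaultdict: looking up an unseen query term would otherwise insert an empty posting list into the index.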
How many documents does or_query('The Who') return? Given the nature of our dataset, how many documents do you think are actually about The Who? What could you do to prevent this kind of incorrect result? (Note that you do not have to implement this yet.)
Why does and_query('Evolutionary Process') not return our example paper 23144668, even though its abstract does speak about evolutionary processes? (Note that you do not have to implement anything to fix this yet.)