Building a simple search index

Inspired by and borrowed heavily from: Collective Intelligence - Luís F. Simões
IR version and assignments by J.E. Hoeksema, 2014-11-03


This notebook's purpose is to build a simple search index (to be used for Boolean retrieval).

Loading the data


In [1]:
data_path = './' # e.g. 'C:\Downloads\' (includes trailing slash)

Summaries_file = data_path + 'evolution__Summaries.pkl.bz2'
Abstracts_file = data_path + 'evolution__Abstracts.pkl.bz2'

In [2]:
import cPickle, bz2
from collections import namedtuple

Summaries = cPickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )

paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )

for (id, paper_info) in Summaries.iteritems():
    Summaries[id] = paper( *paper_info )
    
Abstracts = cPickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )

Let's have a look at what the data looks like for our example paper:


In [3]:
Summaries[23144668]


Out[3]:
paper(title='Embodied artificial evolution: Artificial evolutionary systems in the 21st Century.', authors=['Eiben AE', 'Kernbach S', 'Haasdijk E'], year=2012, doi='10.1007/s12065-012-0071-x')

In [4]:
Abstracts[23144668]


Out[4]:
'Evolution is one of the major omnipresent powers in the universe that has been studied for about two centuries. Recent scientific and technical developments make it possible to make the transition from passively understanding to actively using evolutionary processes. Today this is possible in Evolutionary Computing, where human experimenters can design and manipulate all components of evolutionary processes in digital spaces. We argue that in the near future it will be possible to implement artificial evolutionary processes outside such imaginary spaces and make them physically embodied. In other words, we envision the "Evolution of Things", rather than just the evolution of digital objects, leading to a new field of Embodied Artificial Evolution (EAE). The main objective of this paper is to present a unifying vision in order to aid the development of this high potential research area. To this end, we introduce the notion of EAE, discuss a few examples and applications, and elaborate on the expected benefits as well as the grand challenges this developing field will have to address.'

Some utility functions

We'll define some utility functions: one that tokenizes a string into terms, one that performs linguistic preprocessing on a list of terms, and two that display information about a paper in a nice way. Note that the tokenization and preprocessing functions are rather naive; you may have to make them smarter in a later assignment.


In [5]:
def tokenize(text):
    """
    Function that tokenizes a string in a rather naive way. Can be extended later.
    """
    return text.split(' ')

def preprocess(tokens):
    """
    Perform linguistic preprocessing on a list of tokens. Can be extended later.
    """
    result = []
    for token in tokens:
        result.append(token.lower())
    return result

print preprocess(tokenize("Lorem ipsum dolor sit AMET"))


['lorem', 'ipsum', 'dolor', 'sit', 'amet']
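
To make "rather naive" concrete, here is a minimal sketch that runs the same two steps on a sentence containing punctuation and a double space. The sample sentence is made up for illustration, and the two functions are redefined only so that the snippet runs on its own (it is written to work under Python 3 as well as the Python 2 used in this notebook).

```python
def tokenize(text):
    # Same naive approach as above: split on single space characters only.
    return text.split(' ')

def preprocess(tokens):
    # Same naive approach as above: lowercase every token.
    return [token.lower() for token in tokens]

# Hypothetical sample sentence, not taken from the dataset.
sample = 'Evolutionary processes,  in digital spaces.'
naive_terms = preprocess(tokenize(sample))
print(naive_terms)
```

The punctuation stays attached to the tokens ('processes,' and 'spaces.') and the double space yields an empty-string token, so a later lookup for the term 'spaces' would not match this text.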

In [6]:
from IPython.display import display, HTML
import re

def display_summary( id, extra_text='' ):
    """
    Function for printing a paper's summary through IPython's Rich Display System.
    Trims long titles or author lists, and links to the paper's DOI (when available).
    """
    s = Summaries[ id ]
    
    # Drop a trailing period; endswith() also copes with an empty title
    title = ( s.title[:-1] if s.title.endswith('.') else s.title )
    title = title[:150].rstrip() + ('' if len(title)<=150 else '...')
    if s.doi!='':
        title = '<a href="http://dx.doi.org/%s">%s</a>' % (s.doi, title)
    
    authors = ', '.join( s.authors[:5] ) + ('' if len(s.authors)<=5 else ', ...')
    
    lines = [
        title,
        authors,
        str(s.year),
        '<small>id: %d%s</small>' % (id, extra_text)
        ]
    
    display( HTML( '<blockquote>%s</blockquote>' % '<br>'.join(lines) ) )
    
def display_abstract( id, highlights=[]):
    """
    Function for displaying an abstract. Includes optional (naive) highlighting
    """
    a = Abstracts[ id ]
    for h in highlights:
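        # Escape the highlight term so that regex metacharacters in it are matched literally
        a = re.sub(r'\b(%s)\b' % re.escape(h), '<mark>\\1</mark>', a, flags=re.IGNORECASE)
    display( HTML( '<blockquote>%s</blockquote>' % a ) )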
    
display_summary(23144668)
display_abstract(23144668, ['embodied'])


Evolution is one of the major omnipresent powers in the universe that has been studied for about two centuries. Recent scientific and technical developments make it possible to make the transition from passively understanding to actively using evolutionary processes. Today this is possible in Evolutionary Computing, where human experimenters can design and manipulate all components of evolutionary processes in digital spaces. We argue that in the near future it will be possible to implement artificial evolutionary processes outside such imaginary spaces and make them physically embodied. In other words, we envision the "Evolution of Things", rather than just the evolution of digital objects, leading to a new field of Embodied Artificial Evolution (EAE). The main objective of this paper is to present a unifying vision in order to aid the development of this high potential research area. To this end, we introduce the notion of EAE, discuss a few examples and applications, and elaborate on the expected benefits as well as the grand challenges this developing field will have to address.

Creating our first index

We will now create an inverted index based on the words in the abstracts of the papers in our dataset. We will once again use the trick of a defaultdict whose default value is an empty set, which ensures that a document is added to a posting list only once.

Our end result will be a dictionary, where each key is a term, and each value is a posting list, represented by a set of paper IDs.

Note that not every paper in our summaries set has an abstract; we will only index papers for which an abstract is present.
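
As a small illustration of the idea, the sketch below builds an inverted index over a made-up three-document corpus. The names `toy_abstracts` and `toy_index`, the IDs, and the texts are all hypothetical; only the defaultdict-of-sets technique is the one described above.

```python
from collections import defaultdict

# Hypothetical toy corpus standing in for Abstracts (IDs and texts are made up).
toy_abstracts = {
    1: 'evolution of digital objects',
    2: 'evolution of evolution',
    3: 'digital spaces',
}

toy_index = defaultdict(set)
for doc_id, text in toy_abstracts.items():
    for term in text.lower().split(' '):
        # Adding the same ID twice has no effect, because posting lists are sets.
        toy_index[term].add(doc_id)

print(sorted(toy_index['evolution']))  # [1, 2]
print(sorted(toy_index['digital']))    # [1, 3]
```

Document 2 contains 'evolution' twice but appears only once in that term's posting list. Keep in mind that reading a missing key from a defaultdict inserts that key; `toy_index.get(term, set())` is one way to look up a term without growing the index.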


In [7]:
from collections import defaultdict

inverted_index = defaultdict(set)

# Takes a while
for (id, abstract) in Abstracts.iteritems():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)

In [8]:
print inverted_index['embodied']


set([1992194, 18701321, 16288782, 18440207, 20418223, 9231910, 19516970, 8412208, 15881782, 18801719, 8837176, 20027964, 9728068, 17764940, 10736215, 12093024, 23762020, 16191591, 23177324, 10352237, 19884148, 18166390, 20015239, 24045704, 10904202, 21829774, 18673296, 11805332, 10263701, 15811222, 23272600, 22688431, 20059328, 20068033, 16922313, 11794638, 16301776, 21466836, 16056021, 4017882, 23144668, 15833311, 23396064, 11783397, 23098601, 20158188, 12556021, 23344886, 1331447, 12405508, 17847050, 7013649, 15631635, 9122581, 17328421, 16170792, 23480626, 16053576, 3136396, 18979384, 20537174, 20416855, 22734053, 22349884, 18415979, 19272557, 11341678, 10652527, 10290033, 16797055, 23141772, 19665560, 19665811, 20573589, 23600022, 11141700, 9209758, 6394275, 17109420, 23979453, 19013054, 18193346, 20452086, 10634185, 22695379, 16240612, 21646161, 22420459, 22947821, 7879666, 21241334])

We can now use this inverted index to answer simple one-word queries, for example to get an arbitrary paper that contains the word 'embodied':


In [9]:
query_word = 'embodied'
first_paper = list(inverted_index[query_word])[0] # Note that we convert a set into a list in order to fetch its first element
display_summary(first_paper) 
display_abstract(first_paper,[query_word])


Thomas Hodgkin and Hodgkin's disease. Two paradigms appropriate to medicine today
Hellman S
1991
id: 1992194
Thomas Hodgkin was an investigator whose contributions extended over a wide range of medicine. While he is known for Hodgkin's disease, this was not his major interest. That this is so has more to do with his successors than him. He had a highly committed social conscience and was outspoken in advocacy of his positions. This greatly limited his professional career. The history of Hodgkin's disease is one of hypothesis generation, which allowed for its effective treatment even without an understanding of its etiology, illustrating the approximate nature of scientific discovery and the importance of chance in historical attribution. Hodgkin, as a scientist, healer, and socially committed individual, embodied the many characteristics that are desirable for today's physician, while the evolution of knowledge about Hodgkin's disease and its treatment is an instructive model for future medical advances.
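
Converting the set to a list works, but it raises an IndexError when the query word has no postings at all. A possible alternative, sketched here with a hypothetical helper named `arbitrary_posting`, takes one element straight from the set and falls back to None for an empty posting list.

```python
def arbitrary_posting(postings):
    """Hypothetical helper: return one element of a set, or None if it is empty."""
    return next(iter(postings), None)

print(arbitrary_posting({23144668}))  # 23144668
print(arbitrary_posting(set()))       # None
```

Since sets are unordered, which element comes back for a larger posting list is arbitrary, just as with the list conversion above.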

Assignments

  • Construct two functions (or_query and and_query) that will each take as input a single string, consisting of one or more words, and return a list of matching documents. or_query will return documents that contain at least one of the words in the query, while and_query requires all query terms to be present in the documents.

Note that you can use the tokenize and preprocess functions we defined above to tokenize and preprocess your query. You can also exploit the fact that the posting lists are sets, which means you can easily perform set operations such as union, difference and intersection on them.

  • How many hits does or_query('The Who') return? Given the nature of our dataset, how many documents do you think are actually about The Who? What could you do to prevent this kind of incorrect result? (Note that you do not have to implement this yet.)
  • Why does and_query('Evolutionary Process') not return our example paper 23144668, even though its abstract does discuss evolutionary processes? (Note that you do not have to implement anything to fix this yet.)
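
As a reminder of the set operations mentioned in the note above, here is a minimal sketch on two made-up posting lists. The variable names and IDs are hypothetical, and the snippet deliberately stops short of implementing or_query and and_query.

```python
# Hypothetical posting lists; the IDs are made up for illustration.
postings_a = {1, 2, 5}
postings_b = {2, 3, 5, 8}

either = postings_a | postings_b   # union: IDs that occur in at least one list
both = postings_a & postings_b     # intersection: IDs that occur in both lists
only_a = postings_a - postings_b   # difference: IDs in postings_a but not in postings_b

print(sorted(either))  # [1, 2, 3, 5, 8]
print(sorted(both))    # [2, 5]
print(sorted(only_a))  # [1]
```

The method forms `postings_a.union(postings_b)`, `postings_a.intersection(postings_b)` and `postings_a.difference(postings_b)` give the same results.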