Building a simple search index

(Inspired by and borrowed heavily from: Collective Intelligence - Luís F. Simões. IR version and assignments by J.E. Hoeksema, 2014-11-03. Converted to Python 3 and minor changes by Tobias Kuhn, 2015-11-02.)


This notebook's purpose is to build a simple search index (to be used for Boolean retrieval).

Loading the data


In [1]:
Summaries_file = 'data/air__Summaries.pkl.bz2'
Abstracts_file = 'data/air__Abstracts.pkl.bz2'

In [2]:
import pickle, bz2
from collections import namedtuple

Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )

paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )

for (id, paper_info) in Summaries.items():
    Summaries[id] = paper( *paper_info )
    
Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )

Let's have a look at the data for our example paper:


In [3]:
Summaries[26488732]


Out[3]:
paper(title='Use of Whole-Genome Sequencing to Link Burkholderia pseudomallei from Air Sampling to Mediastinal Melioidosis, Australia.', authors=['Currie BJ', 'Price EP', 'Mayo M', 'Kaestli M', 'Theobald V', 'Harrington I', 'Harrington G', 'Sarovich DS'], year=2015, doi='10.3201/eid2111.141802')

In [4]:
Abstracts[26488732]


Out[4]:
'The frequency with which melioidosis results from inhalation rather than percutaneous inoculation or ingestion is unknown. We recovered Burkholderia pseudomallei from air samples at the residence of a patient with presumptive inhalational melioidosis and used whole-genome sequencing to link the environmental bacteria to B. pseudomallei recovered from the patient.'

Some utility functions

We'll define some utility functions that allow us to tokenize a string into terms and to perform linguistic preprocessing on a list of terms, as well as a function to display information about a paper in a nice way. Note that these tokenization and preprocessing functions are rather naive - you may have to make them smarter in a later assignment.


In [5]:
def tokenize(text):
    """
    Function that tokenizes a string in a rather naive way. Can be extended later.
    """
    return text.split(' ')

def preprocess(tokens):
    """
    Perform linguistic preprocessing on a list of tokens. Can be extended later.
    """
    result = []
    for token in tokens:
        result.append(token.lower())
    return result

print(preprocess(tokenize("Lorem ipsum dolor sit AMET")))


['lorem', 'ipsum', 'dolor', 'sit', 'amet']
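
These naive functions break down on punctuation and hyphenation: for example, 'samples,' keeps its trailing comma as part of the token. As a sketch of what a smarter tokenizer could look like (illustrative only - the name tokenize_smarter is made up here, and it is not used in the rest of this notebook), one could extract runs of alphanumeric characters with a regular expression:

import re

def tokenize_smarter(text):
    """
    Sketch of a less naive tokenizer: extracts runs of alphanumeric
    characters, so punctuation and hyphens act as token separators.
    """
    return re.findall(r'\w+', text)

print(tokenize_smarter("Air samples, whole-genome sequencing."))
# prints: ['Air', 'samples', 'whole', 'genome', 'sequencing']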

In [6]:
from IPython.display import display, HTML
import re

def display_summary( id, extra_text='' ):
    """
    Function for printing a paper's summary through IPython's Rich Display System.
    Trims long titles or author lists, and links to the paper's DOI (when available).
    """
    s = Summaries[ id ]
    
    title = ( s.title if s.title[-1]!='.' else s.title[:-1] )
    title = title[:150].rstrip() + ('' if len(title)<=150 else '...')
    if s.doi!='':
        title = '<a href="http://dx.doi.org/%s">%s</a>' % (s.doi, title)
    
    authors = ', '.join( s.authors[:5] ) + ('' if len(s.authors)<=5 else ', ...')
    
    lines = [
        title,
        authors,
        str(s.year),
        '<small>id: %d%s</small>' % (id, extra_text)
        ]
    
    display( HTML( '<blockquote>%s</blockquote>' % '<br>'.join(lines) ) )
    
def display_abstract( id, highlights=[]):
    """
    Function for displaying an abstract. Includes optional (naive) highlighting.
    """
    a = Abstracts[ id ]
    for h in highlights:
        a = re.sub(r'\b(%s)\b'%h,'<mark>\\1</mark>',a, flags=re.IGNORECASE)
    display( HTML( '<blockquote>%s</blockquote>' % a ) )
    
display_summary(26488732)
display_abstract(26488732, ['sequencing'])


Use of Whole-Genome Sequencing to Link Burkholderia pseudomallei from Air Sampling to Mediastinal Melioidosis, Australia
Currie BJ, Price EP, Mayo M, Kaestli M, Theobald V, ...
2015
id: 26488732
The frequency with which melioidosis results from inhalation rather than percutaneous inoculation or ingestion is unknown. We recovered Burkholderia pseudomallei from air samples at the residence of a patient with presumptive inhalational melioidosis and used whole-genome sequencing to link the environmental bacteria to B. pseudomallei recovered from the patient.

Creating our first index

We will now create an inverted index based on the words in the abstracts of the papers in our dataset. We will once again use the trick of a defaultdict with an empty set as the default value, which ensures that a document is added to a posting list only once.

Our end result will be a dictionary, where each key is a term, and each value is a posting list, represented by a set of paper IDs.

Note that not every paper in our summaries set has an abstract; we will only index papers for which an abstract is present.


In [7]:
from collections import defaultdict

inverted_index = defaultdict(set)

# Takes a while
for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)

In [8]:
print(inverted_index['sequencing'])


{26431488, 18087937, 18044421, 22678023, 21038088, 17596935, 10852872, 24140300, 24039948, 22278670, 19593230, 21745680, 26157585, 19901455, 9361427, 21395476, 15653909, 18297876, 9570839, 12114969, 11724825, 26412057, 9180697, 23642653, 12171293, 25880607, 4013081, 24083487, 19581474, 18521631, 21468708, 24551973, 21936679, 15303720, 15258153, 24507435, 15943213, 20519470, 25348656, 25109864, 11981877, 21459509, 23698488, 20955193, 9504827, 18430012, 24446525, 24072765, 16436289, 16989772, 23289422, 22038094, 24823374, 26304593, 25877070, 23646291, 24456276, 23275093, 23851093, 18024023, 23880792, 9694291, 17365082, 24719455, 16972383, 14525537, 10747999, 7781991, 2007144, 22993513, 25710698, 15112299, 19700844, 26414193, 9582197, 14717560, 25463417, 24056441, 24739965, 22835837, 25482367, 16408190, 11334785, 25853058, 22624387, 23586944, 11683465, 12573836, 23461517, 18802318, 22771343, 17229452, 16962705, 16722067, 8118931, 2196119, 25405592, 8217167, 16425634, 25047205, 25831590, 24602277, 17229480, 18845862, 11043498, 22365354, 12803246, 23717552, 23604403, 26288819, 8832180, 22877879, 10543800, 24950459, 17709244, 11100862, 20607679, 1833151, 15496384, 26191042, 20133059, 22935230, 10920641, 16010438, 2383558, 18429125, 25093834, 18406093, 25353933, 21871827, 14749908, 26462420, 26473684, 21840084, 25833176, 24196315, 25923292, 15116879, 8876257, 19561186, 17546981, 8735975, 16820458, 19641069, 24977646, 22092528, 9068272, 11164914, 18215160, 21445881, 23237370, 18524922, 17902847, 11536130, 21634308, 22349060, 2793735, 15997704, 23576841, 21056779, 23917836, 21840652, 22288654, 23851275, 17824530, 24224019, 22323479, 23663384, 16406297, 26039063, 26024222, 14991647, 24258337, 11762467, 16722215, 7547690, 9729835, 25648940, 24859438, 17944367, 20188464, 25202990, 19796784, 7702838, 24600887, 22619962, 24382270, 16622910, 18256704, 22188865, 9378113, 15084355, 9796420, 26338118, 8865606, 21779275, 24462668, 25028427, 23690579, 12643668, 24645462, 17391958, 10564443, 20347228, 21791581, 19502432, 21514081, 23179618, 11141472, 24956771, 8311650, 12678502, 24758119, 24784743, 22289767, 23891302, 15656298, 21429612, 24893805, 18457966, 26033517, 10336622, 24490865, 15656303, 25393011, 21228399, 26330487, 15656313, 20039035, 20347772, 21683584, 24851328, 23264642, 17189762, 22072196, 26417541, 18068357, 17646984, 25413003, 23564172, 22193549, 23613838, 24823694, 23883155, 21944214, 23762327, 26066842, 20358043, 26488732, 26252189, 20680606, 20968351, 25921951, 25035165, 17633184, 22516132, 24494501, 25340840, 22917544, 14663080, 9784747, 7380396, 25697709, 24313773, 23372716, 11376046, 19124145, 22503357, 16232877, 25220534, 18359223, 25407927, 16713656, 10488762, 1907645, 19039166, 15923135, 24402366, 16280513, 22974399, 25906115, 2374595, 22386629, 8111040, 11976131, 24121801, 8704971, 16253388, 24973773, 11880397, 25103311, 14719952, 12688848, 24349140, 23372757, 14735321, 8508378, 22503898, 19359165, 1653216, 17766880, 25079268, 25768421, 25351142, 25953766, 7821287, 24053737, 17478634, 17277413, 16907756, 18025457, 24901619, 25713142, 19496439, 25989624, 19065337, 20658686}

We can now use this inverted index to answer simple one-word queries, for example to get an arbitrary paper that contains the word 'sequencing':


In [9]:
query_word = 'sequencing'
first_paper = list(inverted_index[query_word])[0] # Note that we convert a set into a list in order to fetch its first element
display_summary(first_paper) 
display_abstract(first_paper,[query_word])


Metagenomic Human Repiratory Air in a Hospital Environment
Lai YY, Li Y, Lang J, Tong X, Zhang L, ...
2015
id: 26431488
Hospital-acquired infection (HAI) or nosocomial infection is an issue that frequent hospital environment. We believe conventional regulated Petri dish method is insufficient to evaluate HAI. To address this problem, metagenomic sequencing was applied to screen airborne microbes in four rooms of Beijing Hospital. With air-in amount of sampler being setup to one person's respiration quantity, metagenomic sequencing identified huge numbers of species in the rooms which had already qualified widely accepted petridish exposing standard, imposing urgency for new technology. Meanwhile,the comparative culture only got small portion of recovered species and remain blind for even cultivable pathogens reminded us the limitations of old technologies. To the best of our knowledge, the method demonstrated in this study could be broadly applied in hospital indoor environment for various monitoring activities as well as HAI study. It is also potential as a transmissible pathogen real-time modelling system worldwide.

Assignments

Your name: ...

  • Construct two functions (or_query and and_query) that will each take as input a single string, consisting of one or more words, and return a list of matching documents. or_query will return documents that contain at least one of the words in the query, while and_query requires all query terms to be present in the documents.

Note that you can use the tokenize and preprocess functions we defined above to tokenize and preprocess your query. You can also exploit the fact that the posting lists are sets, which means you can easily perform set operations such as union, difference and intersection on them.
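
As a hint, here is a small demonstration of set operations on two posting lists (an illustrative sketch only - the variable names are made up here, and the actual or_query and and_query implementations are left to you):

# Posting lists are plain Python sets, so set operators apply directly.
# Caution: because inverted_index is a defaultdict, looking up a term that
# occurs in no abstract silently inserts an empty set for it.
matches_air = inverted_index['air']
matches_samples = inverted_index['samples']
print(len(matches_air & matches_samples))  # intersection: papers containing both terms
print(len(matches_air | matches_samples))  # union: papers containing at least one term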


In [10]:
# Add your code here

  • How many hits does or_query('The Who') return? Given the nature of our dataset, how many documents do you think are actually about The Who? What could you do to prevent this kind of incorrect result? (Note that you do not have to implement this yet)

[Write your answer text here]


In [11]:
# Add your code here

  • Why does and_query('air sample') not return our example paper 26488732, even though its abstract does speak about air samples? (Note that you do not have to implement anything to fix this yet)

[Write your answer text here]