Mini-Assignment 2: Building a Simple Search Index

In this mini-assignment, we will build a simple search index, which we will use later for Boolean retrieval. The assignment tasks are again at the bottom of this document.

Loading the Data



In [1]:

    
Summaries_file = 'data/malaria__Summaries.pkl.bz2'
Abstracts_file = 'data/malaria__Abstracts.pkl.bz2'



In [2]:

    
import pickle, bz2
from collections import namedtuple

Summaries = pickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )

paper = namedtuple( 'paper', ['title', 'authors', 'year', 'doi'] )

for (id, paper_info) in Summaries.items():
    Summaries[id] = paper( *paper_info )
    
Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )

Let's have a look at what the data looks like for our example paper:



In [3]:

    
Summaries[24130474]









    Out[3]:





paper(title='A network approach to analyzing highly recombinant malaria parasite genes.', authors=['Larremore DB', 'Clauset A', 'Buckee CO'], year=2013, doi='10.1371/journal.pcbi.1003268')



In [4]:

    
Abstracts[24130474]









    Out[4]:





'The var genes of the human malaria parasite Plasmodium falciparum present a challenge to population geneticists due to their extreme diversity, which is generated by high rates of recombination. These genes encode a primary antigen protein called PfEMP1, which is expressed on the surface of infected red blood cells and elicits protective immune responses. Var gene sequences are characterized by pronounced mosaicism, precluding the use of traditional phylogenetic tools that require bifurcating tree-like evolutionary relationships. We present a new method that identifies highly variable regions (HVRs), and then maps each HVR to a complex network in which each sequence is a node and two nodes are linked if they share an exact match of significant length. Here, networks of var genes that recombine freely are expected to have a uniformly random structure, but constraints on recombination will produce network communities that we identify using a stochastic block model. We validate this method on synthetic data, showing that it correctly recovers populations of constrained recombination, before applying it to the Duffy Binding Like-α (DBLα) domain of var genes. We find nine HVRs whose network communities map in distinctive ways to known DBLα classifications and clinical phenotypes. We show that the recombinational constraints of some HVRs are correlated, while others are independent. These findings suggest that this micromodular structuring facilitates independent evolutionary trajectories of neighboring mosaic regions, allowing the parasite to retain protein function while generating enormous sequence diversity. Our approach therefore offers a rigorous method for analyzing evolutionary constraints in var genes, and is also flexible enough to be easily applied more generally to any highly recombinant sequences.'

Some Utility Functions

We'll define some utility functions that allow us to tokenize a string into terms, perform linguistic preprocessing on a list of terms, as well as a function to display information about a paper in a nice way. Note that these tokenization and preprocessing functions are rather naive - you may have to make them smarter in a later assignment.



In [5]:

    
def tokenize(text):
    """
    Function that tokenizes a string in a rather naive way. Can be extended later.
    """
    return text.split(' ')

def preprocess(tokens):
    """
    Perform linguistic preprocessing on a list of tokens. Can be extended later.
    """
    result = []
    for token in tokens:
        result.append(token.lower())
    return result

print(preprocess(tokenize("Lorem ipsum dolor sit AMET")))









    



['lorem', 'ipsum', 'dolor', 'sit', 'amet']



In [6]:

    
from IPython.display import display, HTML
import re

def display_summary( id, show_abstract=False, show_id=True, extra_text='' ):
    """
    Function for printing a paper's summary through IPython's Rich Display System.
    Trims long author lists, and adds a link to the paper's DOI (when available).
    """
    s = Summaries[id]
    lines = []
    title = s.title
    if s.doi != '':
        title = '<a href=http://dx.doi.org/%s>%s</a>' % (s.doi, title)
    title = '<strong>' + title + '</strong>'
    lines.append(title)
    authors = ', '.join( s.authors[:20] ) + ('' if len(s.authors) <= 20 else ', ...')
    lines.append(str(s.year) + '. ' + authors)
    if (show_abstract):
        lines.append('<small><strong>Abstract:</strong> <em>%s</em></small>' % Abstracts[id])
    if (show_id):
        lines.append('[ID: %d]' % id)
    if (extra_text != ''):
         lines.append(extra_text)
    display( HTML('<br>'.join(lines)) )

display_summary(22433778)
display_summary(24130474, show_abstract=True)









    




An analysis of timing and frequency of malaria infection during pregnancy in relation to the risk of low birth weight, anaemia and perinatal mortality in Burkina Faso.
2012. Valea I, Tinto H, Drabo MK, Huybregts L, Sorgho H, Ouedraogo JB, Guiguemde RT, van Geertruyden JP, Kolsteren P, D'Alessandro U, FSP/MISAME study Group
[ID: 22433778]






    




A network approach to analyzing highly recombinant malaria parasite genes.
2013. Larremore DB, Clauset A, Buckee CO
Abstract: The var genes of the human malaria parasite Plasmodium falciparum present a challenge to population geneticists due to their extreme diversity, which is generated by high rates of recombination. These genes encode a primary antigen protein called PfEMP1, which is expressed on the surface of infected red blood cells and elicits protective immune responses. Var gene sequences are characterized by pronounced mosaicism, precluding the use of traditional phylogenetic tools that require bifurcating tree-like evolutionary relationships. We present a new method that identifies highly variable regions (HVRs), and then maps each HVR to a complex network in which each sequence is a node and two nodes are linked if they share an exact match of significant length. Here, networks of var genes that recombine freely are expected to have a uniformly random structure, but constraints on recombination will produce network communities that we identify using a stochastic block model. We validate this method on synthetic data, showing that it correctly recovers populations of constrained recombination, before applying it to the Duffy Binding Like-α (DBLα) domain of var genes. We find nine HVRs whose network communities map in distinctive ways to known DBLα classifications and clinical phenotypes. We show that the recombinational constraints of some HVRs are correlated, while others are independent. These findings suggest that this micromodular structuring facilitates independent evolutionary trajectories of neighboring mosaic regions, allowing the parasite to retain protein function while generating enormous sequence diversity. Our approach therefore offers a rigorous method for analyzing evolutionary constraints in var genes, and is also flexible enough to be easily applied more generally to any highly recombinant sequences.
[ID: 24130474]

Creating our first index

We will now create an inverted index based on the words in the abstracts of the papers in our dataset.

We will implement our inverted index as a Python dictionary with terms as keys and posting lists as values. For the posting lists, instead of using Python lists and then implementing the different operations on them ourselves, we will use Python sets and use the predefined set operations to process these posting "lists". This will also ensure that each document is added at most once per term. The use of Python sets is not the most efficient solution but will work for our purposes. (As an optional additional exercise, you can try to implement the posting lists as Python lists for this and the following mini-assignments.)

Not every paper in our dataset has an abstract; we will only index papers for which an abstract is available.



In [7]:

    
from collections import defaultdict

inverted_index = defaultdict(set)

# This may take a while:
for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)

Let's see what's in the index for the example term 'network':



In [8]:

    
print(inverted_index['network'])









    



{3045377, 26042370, 19863555, 22145028, 18432008, 17428488, 26662925, 26824718, 8740893, 16711714, 26130467, 24248381, 1738819, 24297541, 26343495, 14516296, 27320393, 26052682, 21692492, 23568463, 25288796, 23197789, 15458409, 12902507, 24619116, 25475193, 10080386, 7788680, 17932434, 18313365, 23554200, 18426014, 19779742, 8511649, 23025827, 19460261, 21184684, 26955948, 25325746, 27021497, 25880767, 20406478, 16257247, 24158442, 16922859, 17428722, 19697916, 16888066, 15245577, 15157523, 25325848, 18686233, 16267556, 16267557, 17695014, 16802085, 15114532, 9365801, 6441262, 19884337, 15032634, 23859522, 3787081, 9879889, 16701777, 9599321, 23243107, 17389924, 2056557, 23095668, 27091331, 9148808, 18551176, 26468747, 23437713, 20840856, 24033691, 19278236, 12351900, 2257315, 26022315, 24103345, 19272134, 26411472, 26235344, 19163607, 22178265, 3056092, 11426271, 26259940, 26259942, 26259945, 17482222, 8854004, 8385016, 27441662, 15331846, 15331847, 26831371, 20111887, 27030035, 24062484, 11127318, 23853600, 26247716, 20009515, 25377324, 25377325, 12763694, 25358894, 25377329, 8106545, 22882881, 18225737, 17562188, 10066518, 23380570, 22329949, 8020574, 10998386, 26043001, 20734592, 8233606, 17703561, 26636940, 23282319, 3603091, 22446744, 18434713, 24957599, 26639022, 26772147, 15663795, 22344380, 19864253, 24679119, 27071187, 20966107, 20966111, 7568097, 23696099, 24455910, 16866033, 23161588, 23685876, 18021118, 26456841, 27015947, 17689355, 19739416, 22444826, 12643098, 23825180, 16585514, 11875126, 22195007, 21996352, 20304706, 17822531, 24423236, 17822532, 10769222, 17822535, 17822537, 24085322, 21109587, 22704979, 20566870, 26284889, 19426143, 26100579, 23362409, 25387887, 11803507, 23573367, 22229886, 22188933, 24882058, 22776715, 24267660, 26045328, 24116122, 21881766, 21881769, 24130474, 8919990, 21914552, 22369208, 26725306, 15915970, 8577988, 27055045, 24581067, 14732245, 12358615, 25918423, 23849949, 19968995, 25474020, 19461091, 14744556, 27433965, 15657967, 21605361, 9223165, 12901377, 26430467, 9151492, 24755215, 23544851, 14642202, 26926112, 21879840, 21879842, 27706407, 17249335, 18394169, 22844474, 21480527, 23585879, 25879642, 21832797, 26676319, 22537313, 12708969, 17996908, 1340527, 26438780, 20376716, 15850637, 25001103, 27577491, 21515412, 15000726, 23274650, 2501793, 10212513, 25885859, 11728042, 27317420, 22328505, 27057354, 20849882, 23856348, 20847837, 22709478, 16245996, 21779696, 23747831, 20804860, 25371902, 17136907, 16520460, 20346126, 14724374, 21087510, 21388569, 9479464, 25470252, 18455854, 22342964, 9170231, 26053945, 316734, 19440961, 25886026, 2694475, 7517517, 25890126, 20192592, 26178900, 27313496, 16135514, 25736539, 6397274, 21376357, 24876390, 23876977, 25466232, 19766648, 22232445, 26856829, 25185663, 19094917, 22586766, 27696532, 20714916, 20041125, 25628071, 8066483, 10900917, 21781953, 19000771, 18718153, 18693579, 20413913, 7112163, 20176360, 9653737, 26093035, 26299885, 22355440, 14599665, 15003128, 21816826, 24655358, 17526276, 14628361, 26136076, 22892045, 15302159, 10784278, 11355673, 21640731, 11703843, 7994921, 16551469, 25658925, 15324728, 18042432, 24581698, 22912579, 14530121, 20813387, 25495121, 25523795, 9504343, 23236186, 27436636, 23279197, 21112415, 20852334, 15117937, 19379828, 16606847, 17544832, 22328964, 23078536, 26234510, 11548307, 19682965, 11548310, 19682967, 11548311, 21159576, 11548313, 25214619, 19891865, 12152478, 10088103, 21085871, 18826933, 24121016, 25224909, 24852174, 26552015, 6684365, 27709137, 26607328, 22384356, 20686576, 1668857, 1558265, 18165504, 18165506, 17090308, 22216480, 21741349, 16647974, 20823846, 12379942, 25861939, 26627892, 18411325, 25870149, 25573196, 15906637, 18988878, 17553228, 23773015, 18655063, 26310492, 20109149, 22241127, 24110954, 23433077, 11976582, 25126794, 19079050, 26429335, 27502491, 24010653, 3061675, 26169270, 22990783, 8304577, 11601860, 20223948, 26748879, 20037583, 9185261, 22433778, 20084734}

We can now use this inverted index to answer simple one-word queries, for example to show all papers that contain the word 'amsterdam':



In [9]:

    
query_word = 'amsterdam'
for i in inverted_index[query_word]:
    display_summary(i)









    




[Thijsse, teacher of nature, teacher of malaria].
2005. Visscher-Endeveld L, Verhave JP
[ID: 17154120]






    




The use of quinine for treatment and control of malaria in The Netherlands.
1995. Verhave JP
[ID: 8650735]






    




Malaria control: achievements, problems and strategies.
2001. Nájera JA
[ID: 11921521]






    




Malaria in the State of Amazonas: a typical Brazilian tropical disease influenced by waves of economic development.
2015. Sampaio VS, Siqueira AM, Alecrim Md, Mourão MP, Marchesini PB, Albuquerque BC, Nascimento J, Figueira ÉA, Alecrim WD, Monteiro WM, Lacerda MV
[ID: 26061365]






    




Global malaria challenge: the Amsterdam summit.
1992. Kidson C
[ID: 1298069]






    




[Malaria on the global agenda: control and chemotherapy of malaria in Vanuatu].
1998. Kaneko A
[ID: 9721529]






    




[Increase of malaria among migrants in Amsterdam-Zuidoost].
2000. Makdoembaks AM, Kager PA
[ID: 10674108]

Assignments

Your name: ...

Task 1

Construct a function called and_query that takes as input a single string, consisting of one or more words, and returns a list of matching documents. and_query, as its name suggests, should require that all query terms are present in the documents of the result list. Demonstrate the working of your function with an example (choose one that leads to fewer than 100 hits to not overblow this notebook file).

(You can use the tokenize and preprocess functions we defined above to tokenize and preprocess your query. You can also exploit the fact that the posting lists are sets, which means you can easily perform set operations such as union, difference and intersect on them.)



In [ ]:

    
# Add your code here

Task 2

Construct a second function called or_query that works in the same way as and_query you just implemented, but returns documents that contain at least one of the words in the query. Demonstrate the working of this second function also with an example (again, choose one that leads to fewer than 100 hits).



In [ ]:

    
# Add your code here

Task 3

Show how many hits the query "the who" returns for your two query functions (and_query and or_query).



In [ ]:

    
# Add your code here

Task 4

Given the nature of our dataset, how many of the documents from task 3 do you think are actually about The Who? What could you do to prevent these kind of incorrect results? (You don't have to implement this yet!)

Answer: [Write your answer text here]

Task 5

Why does and_query('red blood cell') not return our example paper 24130474, even though it mentions red blood cells in the abstract? (You do not have to implement anything to fix this yet!)

Answer: [Write your answer text here]

Submission

Submit the answers to the assignment as a modified version of this Notebook file (file with .ipynb extension) that includes your code and your answers via Blackboard. Don't forget to add your name, and remember that the assignments have to be done individually and group submissions are not allowed.