Building an Inverted Index for Search Using NLTK

An inverted index is a table-like data structure that associates each word with the list of documents containing it. Inverted indices are central to modern search, and variants with contextual optimizations are used in many production search engines such as Google and Apache Solr.
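
Conceptually, the index is just a mapping from each term to the set of documents that contain it. Here is a toy sketch in Python (the terms and document ids are made up for illustration):

# A toy inverted index: term -> set of ids of documents containing it
toy_index = {
    "alice": {0, 3},
    "metal": {1},
    "jungle": {4, 5},
}

# Looking up a term returns every document it appears in
toy_index["jungle"]  # {4, 5}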

Since inverted indices are usually built to search natural language documents, we need ways to programmatically analyse natural language texts. Such analysis requires statistics, machine learning techniques, and datasets to train the models involved. Natural Language Processing (NLP), the field that studies such models and techniques, is no doubt interesting in its own right, but it is beyond the scope of our current context. The book Information Retrieval is a very accessible starting point for some of these techniques and also covers inverted indices in the latter half of the book.

For our immediate requirements we will rely on NLTK, an open-source Python package for working with natural language that provides handy operations like tokenization and lemmatization using pre-trained models.

# Install NLTK using pip
pip install nltk

Installing NLTK with pip gives us only the code, not the datasets and pre-trained models. Let's download the ones relevant to our use case.


In [85]:
import nltk
from collections import defaultdict
from nltk.tokenize import sent_tokenize, word_tokenize

The command below opens an inline dialog to explore and select the packages we need. We will select and download the stopwords and punkt packages.

There is an alternative command, nltk.download(), which opens a graphical user interface for selecting packages. While some users may find the interface convenient, it can be buggy and slow on some machines.


In [86]:
nltk.download_shell()
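
If you prefer a non-interactive setup - say, in a provisioning script - the same packages can be fetched directly by their identifiers:

# Download the required datasets without any interactive prompt
nltk.download('stopwords')
nltk.download('punkt')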

As in our earlier sections, we will construct a fictitious text database to build our inverted index.

The following texts have been extracted via Project Gutenberg from these books:

  • The Jungle Book (Rudyard Kipling)
  • Alice in Wonderland (Lewis Carroll)
  • Concrete Construction Methods and Costs (HP Gillette)
  • The Telephone (Dolbear)

In [87]:
corpus = [
    """
    So she swallowed one of the cakes and was delighted to find that she
    began shrinking directly. As soon as she was small enough to get through the door,
    she ran out of the house and found quite a crowd of little animals and birds waiting outside.
    They all made a rush at Alice the moment she appeared, but she ran off as hard as 
    she could and soon found herself safe in a thick wood.
    """,
    """
    If two strips of different metals, such as silver and iron, be soldered together
    at one end, and the other ends be connected with a galvanometer, on heating the
    soldered junction of the metals it will be found that a current of electricity
    traverses the circuit from the iron to the silver. If other metals be used,
    having the same size, and the same degree of heat be applied, the current of
    electricity thus generated will give a greater or a less deflection, which will be
    constant for the metals employed. The two metals generally employed are bismuth
    and antimony, in bars about an inch long and an eighth of an inch square.
    These are soldered together in series so as to present for faces the ends of the
    bars, and these often number as many as fifty pairs. Such a series is called a thermo-pile.
    This method of[25] generating electricity was discovered by Seebeck of Berlin in 1821,
    but the thermo-pile so much in use now in heat investigations was invented by Nobili in 1835.    
    """,
    """
    According to this authority a field testing laboratory will cost for equipment $250 to $350.
    Such a laboratory can be operated by two or three men at a salary charge of from $100 to $200
    per month. Two men will test on an average four samples per day and each additional man will
    test four more samples. The cost of testing will range from $3 to $5 per sample, which
    is roughly equivalent to 3 cts. per barrel of[Pg 5] cement, or from 3 to 5 cts.per cubic yard
    of concrete. These figures are for field laboratory work reasonably well conducted
    under ordinarily favorable conditions. In large laboratories the cost per sample will
    run somewhat lower.
    """,
    """
    Just then Alice ran across the Duchess (who was now out of prison). She tucked her arm
    affectionately into Alice's and they walked off together. Alice was very glad to find her in
    such a pleasant temper. She was a little startled, however, when she heard the voice of the
    Duchess close to her ear. "You're thinking about something, my dear, and that makes you forget to talk."
    """,
    """
    It was the jackal—Tabaqui, the Dish-licker—and the wolves of India despise Tabaqui because he 
    runs about making mischief, and telling tales, and eating rags and pieces of leather from the 
    village rubbish-heaps. But they are afraid of him too, because Tabaqui, more than 
    anyone else in the jungle, is apt to go mad, and then he forgets that he was ever afraid of
    anyone, and runs through the forest biting everything in his way. Even the tiger runs and
    hides when little Tabaqui goes mad, for madness is the most disgraceful thing that can
    overtake a wild creature. We call it hydrophobia, but they call it dewanee—the madness—and run.
    """,
    """
    A great roofless palace crowned the hill, and the marble of the courtyards and the fountains
    was split, and stained with red and green, and the very cobblestones in the courtyard where the
    king’s elephants used to live had been thrust up and apart by grasses and young trees.
    From the palace you could see the rows and rows of roofless houses that made up the city looking
    like empty honeycombs filled with blackness; the shapeless block of stone that had been an idol
    in the square where four roads met; the pits and dimples at street corners where the public wells
    once stood, and the shattered domes of temples with wild figs sprouting on their sides.
    The monkeys called the place their city, and pretended to despise the Jungle-People because they
    lived in the forest. And yet they never knew what the buildings were made for nor how to use them.
    They would sit in circles on the hall of the king’s council chamber, and scratch for fleas and
    pretend to be men; or they would run in and out of the roofless houses and collect pieces of plaster
    and old bricks in a corner, and forget where they had hidden them, and fight and cry in scuffling crowds,
    and then break off to play up and down the terraces of the king’s garden, where they would shake the
    rose trees and the oranges in sport to see the fruit and flowers fall. They explored all the passages
    and dark tunnels in the palace and the hundreds of little dark rooms, but they never remembered what
    they had seen and what they had not; and so drifted about in ones and twos or crowds telling each
    other that they were doing as men did. They drank at the tanks and made the water all muddy,
    and then they fought over it, and then they would all rush together in mobs and shout:
    "There is no one in the jungle so wise and good and clever and strong and gentle as the Bandar-log."
    Then all would begin again till they grew tired of the city and went back to the tree-tops,
    hoping the Jungle-People would notice them.
    """
]

Eliminate Stopwords

Pronouns, articles, and prepositions such as a, you, and them often have little relevance to our search queries. (Unless, of course, we are attempting to isolate the semantic implication of the query.) We can filter out these words to reduce "noise" in our index. Such words are called stopwords. NLTK provides a dataset of such words for the English language, which we can use to filter our texts.

But to eliminate stopwords, we must first derive words from our corpus. A naive approach could be:

[ c.split() for c in corpus ]

While this may work for ordinary runs of whitespace-separated words, it creates problems when our corpus contains punctuation, among other artifacts. Instead, we use a tokenizer: a function, typically based on statistical models, that separates paragraphs into sentences or sentences into words.
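
To see the difference, compare a plain split() with word_tokenize() on a sentence containing punctuation (the outputs in the comments are what we would expect from NLTK's default word tokenizer):

sentence = "They all made a rush at Alice, but she ran off."

sentence.split()
# ['They', 'all', 'made', 'a', 'rush', 'at', 'Alice,', 'but', 'she', 'ran', 'off.']
# 'Alice,' and 'off.' keep their punctuation and would pollute the index

word_tokenize(sentence)
# ['They', 'all', 'made', 'a', 'rush', 'at', 'Alice', ',', 'but', 'she', 'ran', 'off', '.']
# punctuation becomes separate tokens that are easy to filter out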


In [101]:
# This will contain a list of all words in the corpus
corpus_words = []

# Tokenize each paragraph into sentences and each sentence into words
for c in corpus:
    for sent in sent_tokenize(c):
        word_tokens = word_tokenize(sent)
        corpus_words += word_tokens

len(corpus_words)


Out[101]:
1019

In [89]:
# Let's inspect the words in our corpus a bit
corpus_words.index('she'), corpus_words.index('She')


Out[89]:
(1, 427)

In [90]:
corpus_words.index('so'), corpus_words.index('So')


Out[90]:
(212, 0)

The data contains not only duplicates but also case variants of the same word. Outside of some specialised scenarios, we do not differentiate between upper-, lower-, or mixed-case words. So let's convert them all to lower case and eliminate duplicates by building a set.


In [91]:
lower_corpus_words = { x.lower() for x in corpus_words }
len(lower_corpus_words)


Out[91]:
432

In [92]:
# Remove the stopwords
from nltk.corpus import stopwords

stwords = set(stopwords.words('english'))

# Using set difference to eliminate stopwords from our words
stopfree_words = lower_corpus_words - stwords
len(stopfree_words)


Out[92]:
350
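
Out of curiosity, we can peek at which of our corpus words were dropped (the exact list depends on NLTK's stopword data):

# Corpus words that were filtered out as stopwords
sorted(lower_corpus_words & stwords)[:10]
# expect common function words such as 'a', 'about', 'and', ...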

Stemming

The same word is often spelt and pronounced differently depending on grammatical tense and the context it is used in. For example, summarisation and summarise are similar words reflecting different grammatical contexts: the former is a noun, the latter a verb. In most search scenarios, this difference should not matter.

We pass our set of words through a process called stemming, which reduces each word to its common root. In our example, summaris is the root of both summarise and summarisation. This not only helps reduce the size of the index but also improves search matching by looking "deeper" into a word than mere string comparison.

There is no single formula for stemming; rather, there are several algorithms for deriving stems. In our examples, we use the Snowball stemmer built into NLTK.


In [93]:
from nltk.stem import snowball

stemmer = snowball.SnowballStemmer('english')
stemmed_words = {stemmer.stem(x) for x in stopfree_words}
len(stemmed_words)


Out[93]:
330

We have reduced our corpus to a vocabulary of 330 words that we need to look up to match documents.


In [94]:
# Let's look at some of our words
list(stemmed_words)[-10:]


Out[94]:
['voic',
 'split',
 'investig',
 'till',
 'passag',
 'yard',
 'electr',
 'cri',
 'jackal—tabaqui',
 'thrust']
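
Each of these truncated stems stands in for several surface forms. A quick check, using words from our corpus and the summarise example from the prose (the outputs in the comments are what the text above leads us to expect):

for word in ['electricity', 'investigations', 'summarise', 'summarisation']:
    print(word, '->', stemmer.stem(word))

# electricity -> electr          (matches 'electr' in the list above)
# investigations -> investig     (matches 'investig' in the list above)
# summarise -> summaris
# summarisation -> summaris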

Constructing the Index

We will now discard everything we have done so far. Why? Because we never maintained a reference to the document in which each word was found; we worked that way to get a better understanding of each step in building an inverted index. Let us now do everything over again, this time keeping a reference to the source document.


In [95]:
# Our index is a map of word -> documents it is found in
inverted_index = defaultdict(set)

# We maintain the reference to the document by its index in the corpus list
for docid, c in enumerate(corpus):
    for sent in sent_tokenize(c):
        for word in word_tokenize(sent):
            word_lower = word.lower()
            if word_lower not in stwords:
                word_stem = stemmer.stem(word_lower)
                # We add the document to the set against the word in our index
                inverted_index[word_stem].add(docid)

len(inverted_index.keys())


Out[95]:
330
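
We can spot-check a few entries. Remember that the keys of the index are stems, not surface words, so a query word must be stemmed before lookup (the outputs in the comments are what we expect from our corpus):

inverted_index['metal']
# expect {1} - only the thermo-pile text mentions metals

inverted_index[stemmer.stem('alice')]
# expect {0, 3} - stemming the query word gives us the right key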

Searching Using the Index

Just as we processed our documents, we must process the query before looking it up in the index.


In [96]:
def process_and_search(query):
    matched_documents = set()
    for word in word_tokenize(query):
        word_lower = word.lower()
        if word_lower not in stwords:
            word_stem = stemmer.stem(word_lower)
            matches = inverted_index.get(word_stem)
            if matches:
                # The |= operator performs an in-place set union
                matched_documents |= matches
    return matched_documents

In [97]:
process_and_search("alice")


Out[97]:
{0, 3}

The word alice - as per our index - appears in documents 0 and 3.


In [98]:
process_and_search("coward")


Out[98]:
set()

coward does not appear in our corpus.


In [99]:
process_and_search("run")


Out[99]:
{2, 4, 5}

run appears in documents 2, 4 and 5.


In [100]:
process_and_search("metal crowds")


Out[100]:
{0, 1, 5}

Since we used the |= (set union) operation when processing the query, a document matching any one of the words in the query is returned in the result. Production-grade search engines must handle requirements where documents are returned only if they contain all of the words, or even the exact phrase, and many more such conditions.
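
As a sketch of how an all-words (AND) query could look, we can intersect the per-word match sets instead of taking their union. (process_and_search_all is a hypothetical variant, not something we built above.)

def process_and_search_all(query):
    # None means "no query word processed yet"
    matched_documents = None
    for word in word_tokenize(query):
        word_lower = word.lower()
        if word_lower not in stwords:
            word_stem = stemmer.stem(word_lower)
            matches = inverted_index.get(word_stem, set())
            if matched_documents is None:
                matched_documents = set(matches)
            else:
                # &= keeps only documents containing every word seen so far
                matched_documents &= matches
    return matched_documents or set()

process_and_search_all("metal crowds")
# expect set() - no single document contains both words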

Our search also does not rank results, i.e. sort the matched documents so that the one most relevant to the query comes first. Ranking becomes very important when most queries match a very large number of documents.

A common approach for ranking documents is tf-idf, or term frequency-inverse document frequency. Web search engines additionally use statistical signals such as clustering, click analysis, and hyperlink references in ranking.
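
A minimal sketch of tf-idf scoring on top of our index might look like the following. (score_tfidf is illustrative; it assumes the simple tf * log(N/df) formulation, whereas real engines use smoothed and normalised variants.)

import math

def score_tfidf(query):
    # Stemmed, stopword-free tokens per document, for term frequencies
    doc_tokens = []
    for c in corpus:
        doc_tokens.append([
            stemmer.stem(w.lower())
            for s in sent_tokenize(c)
            for w in word_tokenize(s)
            if w.lower() not in stwords
        ])

    n_docs = len(corpus)
    scores = defaultdict(float)
    for word in word_tokenize(query):
        word_lower = word.lower()
        if word_lower in stwords:
            continue
        stem = stemmer.stem(word_lower)
        docs_with_term = inverted_index.get(stem, set())
        if not docs_with_term:
            continue
        # Rarer terms across the corpus get a higher idf weight
        idf = math.log(n_docs / len(docs_with_term))
        for docid in docs_with_term:
            # Term frequency: occurrences of the stem in this document
            tf = doc_tokens[docid].count(stem)
            scores[docid] += tf * idf

    # Most relevant documents first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

score_tfidf("metal crowds")
# expect the thermo-pile text (document 1) to rank first: 'metal' is
# rarer in the corpus and occurs several times there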