In [ ]:

    
%autosave 0
from IPython.core.display import HTML, display
display(HTML('<style>.container { width:100%; !important } </style>'))

Implementing a Local Search Engine

This notebook shows one important application of dictionaries and sets: It implements a local search engine that can be used to index .pdf documents on the local file system. The index can then be used to search for all documents that contain specified words. The main data structure used is a so called inverted index, which is a dictionary mapping words to the sets of documents that contain these words.

The module subprocess enables us to execute shell command and to capture their output via a pipe from which we can read.



In [ ]:

    
import subprocess

The function get_text takes a path specifiying a .pdf file. It converts the .pdf file into a text file and returns the resulting text. This function assumes that the program pdftotext is installed. This program can be dowloaded at https://www.xpdfreader.com/download.html.



In [ ]:

    
def get_text(path):
    command = ['pdftotext', '-enc', 'UTF-8', '-q', path, '-']
    process = subprocess.Popen(command, stdout=subprocess.PIPE)
    Lines   = process.stdout.readlines()
    return ''.join([str(line, 'utf-8', 'ignore') for line in Lines])

Let us test this for one file.



In [ ]:

    
%%time
text = get_text('../Literature/DualPivotQuicksort.pdf')
print(len(text))

In order to split the text contained in a file into words, we need the regular expressions provided by the module re.



In [ ]:

    
import re

The function tokenize takes a string s and returns the set of words that have been found in the string s. We assume that the words contain only latin characters. Furthermore, we convert the words to lower case.



In [ ]:

    
def tokenize(s):
    return set(t.lower() for t in re.findall(r'[A-Za-z]+', s))

Let us check how many different words occur in the file that we have read above.



In [ ]:

    
len(tokenize(text))

We need the module os to traverse directories. os is short for operating system.



In [ ]:

    
import os

The class Document represents a single file. This class maintains three member variables:

path is the absolut file path specifying the location of the file containing the pdf document,
docID is a natural number that serves as a unique identifier for the document,
Words is the set of words contained in the file.

This class only serves as a container of its member variables, hence it has no methods.



In [ ]:

    
class Document:
    def __init__(self, path, docID, Words):
        self.path  = path
        self.docID = docID
        self.Words = Words

The class Index contains three member variables:

InvertedIndex is a dictionary that maps every word to the set of documents containing this word. In this set, the documents are represented by their unique identifiers.
ID2Doc is a dictionary mapping the document identifiers to the corresponding Document objects.
fileCount is a counter that is needed to create unique document identifiers.



In [ ]:

    
class Index:
    def __init__(self):
        self.InvertedIndex = {}
        self.ID2Doc        = {}
        self.fileCount     = 0

The method $\texttt{self}.\texttt{buildIndex}(d)$ takes an Index $\texttt{self}$ and a directory $d$. It traverses the directory $d$ recursively and collects all .pdf files contained in $d$ and its subdirectories. These files are converted to text and their words are added to the InvertedIndex.



In [ ]:

    
def buildIndex(self, directory):   
    for root, _, files in os.walk(directory):
        for fileName in files:
            if fileName[-4:] == '.pdf':
                fullpath = os.path.abspath(os.path.join(root, fileName))
                print('indexing', fullpath, end=' has ')
                try:
                    fileText = get_text(fullpath)
                    tokenSet = tokenize(fileText)
                    print(len(tokenSet), 'different words.')
                    self.fileCount += 1
                    document = Document(fullpath, self.fileCount, tokenSet)
                    self.ID2Doc[self.fileCount] = document
                    self._addToIndex(self.fileCount, tokenSet)
                except:
                    print('unable to read', path)
                    continue
        
Index.buildIndex = buildIndex

The function _addToIndex takes a document identifier $d$ and a set of words $W$ occurring in the document specified by $d$ and extends the InvertedIndex so that for every word $w$ in Words we have that $$ d \in \texttt{InvertedIndex}[w]. $$



In [ ]:

    
def _addToIndex(self, documentID, Words):
    for term in Words:
        try:
            docSet = self.InvertedIndex[term]
            docSet.add(documentID)
        except KeyError:
            self.InvertedIndex[term] = { documentID }
        
Index._addToIndex = _addToIndex

Let us build an Index for a directory containing some literature regarding my lectures on algorithm.



In [ ]:

    
%%time
index = Index()
index.buildIndex("../Literature")

The method $\texttt{self}.\texttt{retrieve}(Q)$ takes an Index self and a query $Q$. $Q$ is a string containing multiple words. The method returns the set of those documents that contain all the words occurring in $Q$.



In [ ]:

    
def retrieve(self, Query):
    SearchStrings = list(tokenize(Query))
    result        = set()
    Documents     = self.InvertedIndex.get(SearchStrings[0], set())
    for word in SearchStrings[1:]:
        Documents &= self.InvertedIndex.get(word, set())           
    return { self.ID2Doc[docID].path for docID in Documents }

Index.retrieve = retrieve



In [ ]:

    
index.retrieve("trie avl")



In [ ]: