In [ ]:
%autosave 0
from IPython.core.display import HTML, display
display(HTML('<style>.container { width:100%; !important } </style>'))
This notebook shows one important application of dictionaries and sets:
It implements a local search engine that can be used to index .pdf
documents on the local file system. The index can then be used to search for all documents that contain specified words. The main data structure used is a so called
inverted index, which is a dictionary mapping words to the
sets of documents that contain these words.
The module subprocess
enables us to execute shell command and to capture their output via a pipe from which we can read.
In [ ]:
import subprocess
The function get_text
takes a path
specifiying a .pdf
file. It converts the .pdf
file into a text file and returns the
resulting text. This function assumes that the program pdftotext
is installed. This program can be dowloaded at
https://www.xpdfreader.com/download.html.
In [ ]:
def get_text(path):
command = ['pdftotext', '-enc', 'UTF-8', '-q', path, '-']
process = subprocess.Popen(command, stdout=subprocess.PIPE)
Lines = process.stdout.readlines()
return ''.join([str(line, 'utf-8', 'ignore') for line in Lines])
Let us test this for one file.
In [ ]:
%%time
text = get_text('../Literature/DualPivotQuicksort.pdf')
print(len(text))
In order to split the text contained in a file into words, we need the regular expressions provided by the module re
.
In [ ]:
import re
The function tokenize
takes a string s
and returns the set of words that have been found in the string s
. We assume that the words contain only latin characters. Furthermore, we convert the words to lower case.
In [ ]:
def tokenize(s):
return set(t.lower() for t in re.findall(r'[A-Za-z]+', s))
Let us check how many different words occur in the file that we have read above.
In [ ]:
len(tokenize(text))
We need the module os
to traverse directories. os
is short for operating system.
In [ ]:
import os
The class Document
represents a single file. This class maintains three member variables:
path
is the absolut file path specifying the location of the file containing the pdf
document,docID
is a natural number that serves as a unique identifier for the document,Words
is the set of words contained in the file.This class only serves as a container of its member variables, hence it has no methods.
In [ ]:
class Document:
def __init__(self, path, docID, Words):
self.path = path
self.docID = docID
self.Words = Words
The class Index
contains three member variables:
InvertedIndex
is a dictionary that maps every word to the set of documents containing this word.
In this set, the documents are represented by their unique identifiers.ID2Doc
is a dictionary mapping the document identifiers to the corresponding Document
objects.fileCount
is a counter that is needed to create unique document identifiers.
In [ ]:
class Index:
def __init__(self):
self.InvertedIndex = {}
self.ID2Doc = {}
self.fileCount = 0
The method $\texttt{self}.\texttt{buildIndex}(d)$ takes an Index
$\texttt{self}$ and a directory $d$. It traverses the directory $d$ recursively and collects all .pdf
files contained in $d$ and its subdirectories. These files are converted to text and their words are added to the InvertedIndex
.
In [ ]:
def buildIndex(self, directory):
for root, _, files in os.walk(directory):
for fileName in files:
if fileName[-4:] == '.pdf':
fullpath = os.path.abspath(os.path.join(root, fileName))
print('indexing', fullpath, end=' has ')
try:
fileText = get_text(fullpath)
tokenSet = tokenize(fileText)
print(len(tokenSet), 'different words.')
self.fileCount += 1
document = Document(fullpath, self.fileCount, tokenSet)
self.ID2Doc[self.fileCount] = document
self._addToIndex(self.fileCount, tokenSet)
except:
print('unable to read', path)
continue
Index.buildIndex = buildIndex
The function _addToIndex
takes a document identifier $d$ and a set of words $W$ occurring in the document specified by $d$ and extends the InvertedIndex
so that for every word $w$ in Words
we have that
$$ d \in \texttt{InvertedIndex}[w]. $$
In [ ]:
def _addToIndex(self, documentID, Words):
for term in Words:
try:
docSet = self.InvertedIndex[term]
docSet.add(documentID)
except KeyError:
self.InvertedIndex[term] = { documentID }
Index._addToIndex = _addToIndex
Let us build an Index
for a directory containing some literature regarding my lectures on algorithm.
In [ ]:
%%time
index = Index()
index.buildIndex("../Literature")
The method $\texttt{self}.\texttt{retrieve}(Q)$ takes an Index self
and a query $Q$. $Q$ is a string containing multiple words.
The method returns the set of those documents that contain all the words occurring in $Q$.
In [ ]:
def retrieve(self, Query):
SearchStrings = list(tokenize(Query))
result = set()
Documents = self.InvertedIndex.get(SearchStrings[0], set())
for word in SearchStrings[1:]:
Documents &= self.InvertedIndex.get(word, set())
return { self.ID2Doc[docID].path for docID in Documents }
Index.retrieve = retrieve
In [ ]:
index.retrieve("trie avl")
In [ ]: