In [1]:
# code for loading the format for the notebook
import os
# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', 'notebook_format'))
from formats import load_style
load_style(plot_style=False)
Out[1]:
In [2]:
os.chdir(path)
import math
import json
import requests
# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
%load_ext watermark
%load_ext autoreload
%autoreload 2
%watermark -a 'Ethen' -d -t -v -p requests
The problem that BM25 (Best Match 25) tries to solve is similar to that of TFIDF (Term Frequency, Inverse Document Frequency), that is representing our text in a vector space (it can be applied to field outside of text, but text is where it has the biggest presence) so we can search/find similar documents for a given document or query.
The gist behind TFIDF is that is relies on two main factors to determine whether a document is similar to our query.
Using these two factors, TFIDF measures the relative concentration of a term in a given piece of document. If the term is common in this article, but relatively rare elsewhere, then the TFIDF score will be high, and documents that have higher TFIDF score would be considered as very relevant to the search term.
BM25 improves upon TFIDF by casting relevance as a probability problem. A relevance score, according to probabilistic information retrieval, ought to reflect the probability a user will consider the result relevant. Instead of going through how the formula was derived, here we'll take a look a the formula and try to digest it to see why it makes some kind of sense.
BM25 (Best Match 25) function scores each document in a corpus according to the document's relevance to a particular text query. For a query $Q$, with terms $q_1, \ldots, q_n$, the BM25 score for document $D$ is:
\begin{align} \mbox{BM25}(D, Q) = \sum_{i=1}^n IDF(q_i, D) \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i) + k_1 \cdot (1-b + b \cdot |D|/d_{avg}))} \end{align}where:
Let's break the formula down into smaller components to see why it makes sense.
From the graph above, we can see that shorter documents hit the asymptote much faster. Hopefully, this resembles our intuition as the more matches we have on shorter documents, the more certain we are about the relevance, whereas, for a lengthy book, it might take us longer to get to a point where we feel confident that the book is indeed relevant to the given query.
As for the inverse document frequency part, ${IDF}(q_i, D)$. For a corpus with $N$ documents, inverse document frequency for term $q_i$ is computed as follows:
\begin{align} \mbox{IDF}(q_i, D) = \log \frac{N - N(q_i) + 0.5}{N(q_i) + 0.5} \end{align}where
The inverse document frequency part is very similar to that of TFIDF, whose role is to make sure that rarer words will have a higher score and contribute more to the final score.
Please note that the IDF formula listed above has a drawback when using it for terms appearing in more than half of the corpus since the value would come out as negative value, resulting in the overall score to become negative. e.g. if we have 10 documents in the corpus, and the term "the" appeared in 6 of them, its IDF would be $log(10 - 6 + 0.5 / 6 + 0.5) = log(4.5 / 6.5)$. Although we can argue that our implementation should have already removed these frequently appearing words as these words are mostly used to form a complete sentence and carry little meaning of note, different softwares/packages still make different adjustments to prevent a negative score from ever occurring. e.g.
Like all hyperparameters in general, the default ones are usually a good starting point, and we should probably focus on tweaking other stuff before jumping into the rabbit hole of hyperparameter tuning. In the context of search, it might be making sure our ranking scores older documents lower in application such as news ranking. But if we were to start tuning, remember to always measure the performance of various settings and the following questions are general starting points that we can reference to.
In [3]:
# we'll generate some fake texts to experiment with
corpus = [
'Human machine interface for lab abc computer applications',
'A survey of user opinion of computer system response time',
'The EPS user interface management system',
'System and human system engineering testing of EPS',
'Relation of user perceived response time to error measurement',
'The generation of random binary unordered trees',
'The intersection graph of paths in trees',
'Graph minors IV Widths of trees and well quasi ordering',
'Graph minors A survey'
]
# remove stop words and tokenize them (we probably want to do some more
# preprocessing with our text in a real world setting, but we'll keep
# it simple here)
stopwords = set(['for', 'a', 'of', 'the', 'and', 'to', 'in'])
texts = [
[word for word in document.lower().split() if word not in stopwords]
for document in corpus
]
# build a word count dictionary so we can remove words that appear only once
word_count_dict = {}
for text in texts:
for token in text:
word_count = word_count_dict.get(token, 0) + 1
word_count_dict[token] = word_count
texts = [[token for token in text if word_count_dict[token] > 1] for text in texts]
texts
Out[3]:
In [4]:
class BM25:
"""
Best Match 25.
Parameters
----------
k1 : float, default 1.5
b : float, default 0.75
Attributes
----------
tf_ : list[dict[str, int]]
Term Frequency per document. So [{'hi': 1}] means
the first document contains the term 'hi' 1 time.
df_ : dict[str, int]
Document Frequency per term. i.e. Number of documents in the
corpus that contains the term.
idf_ : dict[str, float]
Inverse Document Frequency per term.
doc_len_ : list[int]
Number of terms per document. So [3] means the first
document contains 3 terms.
corpus_ : list[list[str]]
The input corpus.
corpus_size_ : int
Number of documents in the corpus.
avg_doc_len_ : float
Average number of terms for documents in the corpus.
"""
def __init__(self, k1=1.5, b=0.75):
self.b = b
self.k1 = k1
def fit(self, corpus):
"""
Fit the various statistics that are required to calculate BM25 ranking
score using the corpus given.
Parameters
----------
corpus : list[list[str]]
Each element in the list represents a document, and each document
is a list of the terms.
Returns
-------
self
"""
tf = []
df = {}
idf = {}
doc_len = []
corpus_size = 0
for document in corpus:
corpus_size += 1
doc_len.append(len(document))
# compute tf (term frequency) per document
frequencies = {}
for term in document:
term_count = frequencies.get(term, 0) + 1
frequencies[term] = term_count
tf.append(frequencies)
# compute df (document frequency) per term
for term, _ in frequencies.items():
df_count = df.get(term, 0) + 1
df[term] = df_count
for term, freq in df.items():
idf[term] = math.log(1 + (corpus_size - freq + 0.5) / (freq + 0.5))
self.tf_ = tf
self.df_ = df
self.idf_ = idf
self.doc_len_ = doc_len
self.corpus_ = corpus
self.corpus_size_ = corpus_size
self.avg_doc_len_ = sum(doc_len) / corpus_size
return self
def search(self, query):
scores = [self._score(query, index) for index in range(self.corpus_size_)]
return scores
def _score(self, query, index):
score = 0.0
doc_len = self.doc_len_[index]
frequencies = self.tf_[index]
for term in query:
if term not in frequencies:
continue
freq = frequencies[term]
numerator = self.idf_[term] * freq * (self.k1 + 1)
denominator = freq + self.k1 * (1 - self.b + self.b * doc_len / self.avg_doc_len_)
score += (numerator / denominator)
return score
In [5]:
# query our corpus to see which document is more relevant
query = 'The intersection of graph survey and trees'
query = [word for word in query.lower().split() if word not in stopwords]
bm25 = BM25()
bm25.fit(texts)
scores = bm25.search(query)
for score, doc in zip(scores, corpus):
score = round(score, 3)
print(str(score) + '\t' + doc)
In the code chunk above, we printed each corpus's BM25 relevance score along with the original text, note that this isn't sorted in decreasing order of the relevance score yet, which is usually what we want to do in a real world setting. That is to find the more relevant document, sort them in decreasing order and present them to the user. Also here, we are computing the scores for every document, this becomes computationally expensive when we start have a large corpus size. Thus search engine uses inverted index to speed things up. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears, this allows us to quickly find the documents that contains the term in our query and only then do we compute the relevance score for this smaller recall set. This link contains a good high level description of this concept.
We can see BM25 in action to rank documents using ElasticSearch, this notebook isn't an ElasticSearch tutorial, so hopefully, the reader are some what familiar with the tool, if not, each code chunk contains links to some helpful references.
We will follow the standard process of creating the index to store our documents, add some sample documents to the index and provide a query against the index to return the relevant documents sorted in decreasing order based on the relevance score, which will be based on BM25.
In [8]:
# installation instructions
# https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html
# creating an index
# https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html
settings = {
'settings': {
'index': {
'number_of_shards': 1,
'number_of_replicas': 1,
# configure our default similarity algorithm explicitly to use bm25,
# this allows it to use it for all the fields
'similarity': {
'default': {
'type': 'BM25'
}
}
}
},
# we will be indexing our documents in the title field using the English analyzer,
# which removes stop words for us, the default standard analyzer doesn't have
# this preprocessing step
# https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
'mappings': {
# this key is the "type", which will be explained in the next code chunk
'_doc': {
'properties': {
'title': {
'type': 'text',
'analyzer': 'english'
}
}
}
}
}
headers = {'Content-Type': 'application/json'}
response = requests.put('http://localhost:9200/experiment', data=json.dumps(settings), headers=headers)
response
Out[8]:
In [9]:
# indexing document
# https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
# https://www.elastic.co/guide/en/elasticsearch/guide/master/index-doc.html
# a document is uniquely identified by the index, the type and id
# it's worth noting that there's a note on removing the capabilities of
# having multiple types under one index, and going forward the type will
# just to set to '_doc'
# https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html
url = 'http://localhost:9200/experiment/_doc'
for document in corpus:
# we insert the document into the 'title' field
data = {'title': document}
response = requests.post(url, data=json.dumps(data), headers=headers)
response
Out[9]:
In [10]:
def search(query, headers):
url = 'http://localhost:9200/experiment/_doc/_search'
response = requests.get(url, data=json.dumps(query), headers=headers)
# the response contains other information, such as time it took to
# give the response back, here we are only interested in the matched
# results, which are stored under 'hits'
search_hits = json.loads(response.text)['hits']['hits']
print('Num\tRelevance Score\tTitle')
for idx, hit in enumerate(search_hits):
print('%s\t%s\t%s' % (idx + 1, hit['_score'], hit['_source']['title']))
In [11]:
# match query
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
query = {
'query': {
'match': {
# search against the 'title' field
'title': 'The intersection of graph survey and trees'
}
}
}
search(query, headers={'Content-Type': 'application/json'})
In [12]:
# we can delete this experimental index to prevent occupying space
response = requests.delete('http://localhost:9200/experiment')
response
Out[12]: