In a text analytics context, document similarity relies on reimagining texts as points in space that can be close (similar) or far apart (dissimilar). However, it's not always straightforward to determine which document features should be encoded into a similarity measure (words/phrases? document length/structure?). Moreover, in practice it can be challenging to find a quick, efficient way of finding similar documents given some input document. In this post I'll explore some of the similarity tools implemented in Elasticsearch, which can enable us to speed up search without sacrificing too much in the way of nuance.
In this post I'll be focusing mostly on getting started with Elasticsearch and comparing the built-in similarity measures currently implemented in ES. However, if you're new to the concept of document similarity, here's a quick overview.
Essentially, to represent the distance between documents, we need two things: first, a way of encoding text as vectors, and second, a way of measuring distance.
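To make this concrete, here's a minimal sketch (mine, not from the book) using scikit-learn: encode two sentences as bag-of-words vectors, then measure the cosine distance between them:

# Minimal sketch: texts become term-count vectors, and the distance
# between vectors stands in for the dissimilarity between texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

docs = [
    "A smoothie is a thick, cold beverage made from pureed raw fruit.",
    "Eggs Benedict is a traditional American breakfast or brunch dish.",
]

vectors = CountVectorizer().fit_transform(docs)
print(cosine_distances(vectors)[0, 1])  # 0.0 = identical, 1.0 = no terms in common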
For more about vector encoding, you can check out Chapter 4 of our book, and for more about different distance metrics check out Chapter 6. In Chapter 10, we prototype a kitchen chatbot that, among other things, uses a nearest neighbor search to recommend recipes that are similar to the ingredients listed by the user. You can also poke around in the code for the book here.
One of my observations during the prototyping phase for that chapter was how slow vanilla nearest neighbor search is. This led me to think about different ways to optimize the search, from using variations like ball tree, to using other Python libraries like Spotify's Annoy, and also to other kinds of tools altogether that attempt to deliver similar results as quickly as possible.
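For context, here's a sketch (again mine, not from the book) of what that kind of exact nearest neighbor search looks like with scikit-learn, using the ball tree variant mentioned above:

# Sketch: exact nearest neighbor search over TF-IDF vectors.
# "ball_tree" is one of the index structures that can speed this up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = [
    "Smoothies are one of our favorite breakfast options year-round.",
    "A smoothie is a thick, cold beverage made from pureed raw fruit.",
    "Eggs Benedict is a traditional American breakfast or brunch dish.",
]

vectors = TfidfVectorizer().fit_transform(docs).toarray()  # ball tree needs dense input
nn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(vectors)

# Nearest neighbors of the first document: itself, then its closest match.
distances, indices = nn.kneighbors(vectors[:1])

Indexes like these help, but at a certain scale it makes sense to reach for a purpose-built search engine. Enter Elasticsearch...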
Elasticsearch is an open source text search engine that leverages the information retrieval library Lucene together with a key-value store to expose deep and rapid search functionalities. It combines the features of a NoSQL document store database, an analytics engine, and a RESTful API, and is particularly useful for indexing and searching text documents.
To run Elasticsearch, you need to have a Java Virtual Machine (version 8 or later) installed. For more on this, read the installation instructions.
In this section, we'll go over the basics of starting up a local elasticsearch instance, creating a new index, querying for all the existing indices, and deleting a given index. If you know how to do this, feel free to skip to the next section!
In the command line, start running an instance by navigating to wherever you have Elasticsearch installed and typing:
$ cd elasticsearch-<version>
$ ./bin/elasticsearch
Now we will create an index. Think of an index as a database in PostgreSQL or MongoDB: an Elasticsearch cluster can contain multiple indices (analogous to databases), which in turn contain multiple types (similar to MongoDB collections or PostgreSQL tables). These types hold multiple documents (similar to MongoDB documents or PostgreSQL rows), and each document has properties (like MongoDB document key-values or PostgreSQL columns).
$ curl -X PUT "localhost:9200/cooking" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        }
    }
}
'
And the response:
{"acknowledged":true,"shards_acknowledged":true,"index":"cooking"}
To list all the indices in the cluster:

$ curl -X GET "localhost:9200/_cat/indices?v"

And to delete a given index:

$ curl -X DELETE "localhost:9200/cooking"
To explore how Elasticsearch approaches document relevance, let's begin by manually adding some documents to the cooking index we created above:
$ curl -X PUT "localhost:9200/cooking/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
"description": "Smoothies are one of our favorite breakfast options year-round."
}
'
$ curl -X PUT "localhost:9200/cooking/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
"description": "A smoothie is a thick, cold beverage made from pureed raw fruit."
}
'
$ curl -X PUT "localhost:9200/cooking/_doc/3?pretty" -H 'Content-Type: application/json' -d'
{
"description": "Eggs Benedict is a traditional American breakfast or brunch dish."
}
'
At a very basic level, we can think of Elasticsearch's standard search functionality as a kind of similarity search, where we are essentially comparing the bag-of-words formed by the search query with the bag-of-words of each of our documents. This allows Elasticsearch not only to return results that explicitly mention the desired search terms, but also to surface a score that conveys some measure of relevance.
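For intuition only, the simplest version of this idea is just counting how many of the query's terms appear in each document; here's a toy sketch (Elasticsearch's actual scoring is far more sophisticated, as we'll see below):

# Toy sketch: score a document by its term overlap with the query.
def overlap_score(query, document):
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)

print(overlap_score(
    "breakfast",
    "Smoothies are one of our favorite breakfast options year-round."
))  # 1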
We now have three breakfast-related documents in our cooking index; let's use the basic search function to find documents that explicitly mention "breakfast":
$ curl -XGET 'localhost:9200/cooking/_search?q=description:breakfast&pretty'
And the response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.48233607,
    "hits" : [
      {
        "_index" : "cooking",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.48233607,
        "_source" : {
          "description" : "Smoothies are one of our favorite breakfast options year-round."
        }
      },
      {
        "_index" : "cooking",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.48233607,
        "_source" : {
          "description" : "Eggs Benedict is a traditional American breakfast or brunch dish."
        }
      }
    ]
  }
}
We get two results back, the first and third documents, which each have the same relevance score, because both include the single search term exactly once.
However if we look for documents that mention "smoothie"...
$ curl -XGET 'localhost:9200/cooking/_search?q=description:smoothie&pretty'
...we only get the second document back, since the word "smoothie" is pluralized in the first document. On the other hand, the relevance score has jumped up to nearly 1, since this is now the only document in the index that contains the search term.
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.9331132,
    "hits" : [
      {
        "_index" : "cooking",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.9331132,
        "_source" : {
          "description" : "A smoothie is a thick, cold beverage made from pureed raw fruit."
        }
      }
    ]
  }
}
We can work around this by using a fuzzy search, which will return both the first and second documents:
$ curl -XGET "localhost:9200/cooking/_search?pretty=true" -H 'Content-Type: application/json' -d'
{
    "query": {
        "fuzzy" : { "description" : "smoothie" }
    }
}
'
With the following results:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.9331132,
    "hits" : [
      {
        "_index" : "cooking",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.9331132,
        "_source" : {
          "description" : "A smoothie is a thick, cold beverage made from pureed raw fruit."
        }
      },
      {
        "_index" : "cooking",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.8807446,
        "_source" : {
          "description" : "Smoothies are one of our favorite breakfast options year-round."
        }
      }
    ]
  }
}
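What makes "fuzzy" work is Levenshtein edit distance: "smoothie" is a single edit away from "smoothies", so both documents match. Here's a rough sketch of the idea (Lucene's real implementation is heavily optimized and caps the allowed distance):

# Rough sketch of Levenshtein edit distance, the measure behind fuzziness.
# Exponential-time recursion; fine for short words, not for production.
def edit_distance(a, b):
    if not a:
        return len(b)
    if not b:
        return len(a)
    cost = 0 if a[-1] == b[-1] else 1
    return min(
        edit_distance(a[:-1], b) + 1,          # delete from a
        edit_distance(a, b[:-1]) + 1,          # insert into a
        edit_distance(a[:-1], b[:-1]) + cost,  # substitute
    )

print(edit_distance("smoothie", "smoothies"))  # 1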
In order to really appreciate the differences and nuances of different similarity measures, we need more than three documents! For convenience, we'll use the sample text corpus that comes with the machine learning visualization library Yellowbrick.
Yellowbrick hosts several datasets wrangled from the UCI Machine Learning Repository or built by District Data Labs to present the examples used throughout its documentation, one of which is a text corpus of news documents collected from different domain area RSS feeds. If you haven't downloaded the data, you can do so by running:
$ python -m yellowbrick.download
This should create a folder named data in your current working directory that contains all of the datasets. You can load a specified dataset as follows:
In [20]:
import os

from sklearn.utils import Bunch
from yellowbrick.download import download_all

## The path to the test data sets
FIXTURES = os.path.join(os.getcwd(), "data")

## Dataset loading mechanisms
datasets = {
    "hobbies": os.path.join(FIXTURES, "hobbies")
}

def load_data(name, download=True):
    """
    Loads and wrangles the passed in text corpus by name.
    If download is specified, this method will download any missing files.
    """
    # Get the path from the datasets
    path = datasets[name]

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        if download:
            download_all()
        else:
            raise ValueError((
                "'{}' dataset has not been downloaded, "
                "use the download.py module to fetch datasets"
            ).format(name))

    # Read the directories in the directory as the categories.
    categories = [
        cat for cat in os.listdir(path)
        if os.path.isdir(os.path.join(path, cat))
    ]

    files = []   # holds the file names relative to the root
    data = []    # holds the text read from the file
    target = []  # holds the string of the category

    # Load the data from the files in the corpus
    for cat in categories:
        for name in os.listdir(os.path.join(path, cat)):
            files.append(os.path.join(path, cat, name))
            target.append(cat)

            with open(os.path.join(path, cat, name), 'r') as f:
                data.append(f.read())

    # Return the data bunch for use similar to the newsgroups example
    return Bunch(
        categories=categories,
        files=files,
        data=data,
        target=target,
    )

corpus = load_data('hobbies')

hobby_types = {}

for category in corpus.categories:
    texts = []
    for idx in range(len(corpus.data)):
        if corpus['target'][idx] == category:
            texts.append(' '.join(corpus.data[idx].split()))
    hobby_types[category] = texts
The categories in the hobbies corpus are "cinema", "books", "cooking", "sports", and "gaming". We can explore them like this:
In [21]:
food_stories = [text for text in hobby_types['cooking']]
print(food_stories[5])
Most of the articles, like the one above, are straightforward and clearly belong in their assigned category, though there are some exceptions:
In [22]:
print(food_stories[23])
We can use the elasticsearch library in Python to hop out of the command line and interact with our Elasticsearch instance a bit more systematically. Here we'll create a class that goes through each of the hobbies categories in the corpus and indexes each document into a new index named after its category:
In [17]:
from elasticsearch.helpers import bulk
from elasticsearch import Elasticsearch

class ElasticIndexer(object):
    """
    Create an ElasticSearch instance, and given a list of documents,
    index the documents into ElasticSearch.
    """
    def __init__(self):
        self.elastic_search = Elasticsearch()

    def make_documents(self, textdict):
        # Yield one bulk action per document, routed to the index
        # named after its category.
        for category, docs in textdict:
            for document in docs:
                yield {
                    "_index": category,
                    "_type": "_doc",
                    "description": document
                }

    def index(self, textdict):
        bulk(self.elastic_search, self.make_documents(textdict))

indexer = ElasticIndexer()
indexer.index(hobby_types.items())
Let's poke around a bit to see what's in our instance. Note: after running the above, you should see the new indices appear when you type curl -X GET "localhost:9200/_cat/indices?v" into the command line.
In [5]:
from pprint import pprint
query = {"match_all": {}}
result = indexer.elastic_search.search(index="cooking", body={"query":query})
print("%d hits \n" % result['hits']['total'])
print("First result:\n")
pprint(result['hits']['hits'][0])
In [6]:
query = {"fuzzy":{"description":"breakfast"}}
result = indexer.elastic_search.search(index="cooking", body={"query":query})
print("%d hits \n" % result['hits']['total'])
print("First result:\n")
pprint(result['hits']['hits'][0])
Elasticsearch exposes a convenient way of doing more advanced querying based on document similarity, which is called "More Like This" (MLT). Given an input document or set of documents, MLT wraps all of the following behavior: it extracts the text from the input fields, analyzes it into terms, selects the most distinctive of those terms*, and builds (and executes) a disjunctive query from them.

*Note: term selection is done using term frequency-inverse document frequency (TF-IDF), an encoding method that normalizes term frequency in a document with respect to the rest of the corpus. TF-IDF measures the relevance of a term to a document by the scaled frequency of the term's appearance in the document, normalized by the inverse of the scaled frequency of the term in the entire corpus. This has the effect of selecting terms that make the input document or documents the most unique.
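To get a feel for which terms TF-IDF would single out, here's a small sketch with scikit-learn (Elasticsearch computes its own term statistics internally; this is just for intuition):

# Sketch: rank one document's terms by TF-IDF weight relative to the
# rest of the cooking stories.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words="english")
weights = tfidf.fit_transform(food_stories)

terms = tfidf.get_feature_names_out()
doc_weights = weights[5].toarray().ravel()
top_terms = doc_weights.argsort()[::-1][:10]
print([terms[i] for i in top_terms])  # the ten most distinctive terms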
We can now build an MLT query in much the same way as we did the "fuzzy" search above. The Elasticsearch MLT query exposes many search parameters, but the only required one is "like", to which we can specify a string, a document, or multiple documents.
Let's see if we can find any documents from our corpus that are similar to a New York Times review for the Italian restaurant Don Angie.
In [7]:
red_sauce_renaissance = """
Ever since Rich Torrisi and Mario Carbone began rehabilitating chicken Parm and
Neapolitan cookies around 2010, I’ve been waiting for other restaurants to carry
the torch of Italian-American food boldly into the future. This is a major branch
of American cuisine, too important for its fate to be left to the Olive Garden.
For the most part, though, the torch has gone uncarried. I have been told that
Palizzi Social Club, in Philadelphia, may qualify, but because Palizzi is a
veritable club — members and guests only, no new applications accepted — I don’t
expect to eat there before the nation’s tricentennial. Then in October, a place
opened in the West Village that seemed to hit all the right tropes. It’s called
Don Angie. Two chefs share the kitchen — Angela Rito and her husband, Scott
Tacinelli — and they make versions of chicken scarpariello, antipasto salad and
braciole. The dining room brings back the high-glitz Italian restaurant décor of
the 1970s and ’80s, the period when Formica and oil paintings of the Bay of Naples
went out and mirrors with gold pinstripes came in. The floor is a black-and-white
checkerboard. The bar is made of polished marble the color of beef carpaccio.
There is a house Chianti, and it comes in a straw-covered bottle. There is hope
for a red-sauce renaissance, after all.
"""
In [8]:
query = {
    "more_like_this" : {
        "fields" : ["description"],
        "like" : red_sauce_renaissance,
        "min_term_freq" : 3,
        "max_query_terms" : 50,
        "min_doc_freq" : 4
    }
}
result = indexer.elastic_search.search(index="cooking", body={"query":query})
print("%d hits \n" % result['hits']['total'])
print("First result:\n")
pprint(result['hits']['hits'][0])
We can also steer the results away from documents we don't want, like the oddly labeled story we saw earlier, by passing them to the optional "unlike" parameter:
In [9]:
query = {
    "more_like_this" : {
        "fields" : ["description"],
        "like" : red_sauce_renaissance,
        "unlike" : [food_stories[23], food_stories[28]],
        "min_term_freq" : 2,
        "max_query_terms" : 50,
        "min_doc_freq" : 4
    }
}
result = indexer.elastic_search.search(index="cooking", body={"query":query})
print("%d hits \n" % result['hits']['total'])
print("First result:\n")
pprint(result['hits']['hits'][0])
We can also expand our search to other indices, to see if there are documents related to our red sauce renaissance article that may appear in other hobbies corpus categories:
In [10]:
query = {
    "more_like_this" : {
        "fields" : ["description"],
        "like" : red_sauce_renaissance,
        "unlike" : [food_stories[23], food_stories[28]],
        "min_term_freq" : 2,
        "max_query_terms" : 50,
        "min_doc_freq" : 4
    }
}
result = indexer.elastic_search.search(index=["cooking","books","sports"], body={"query":query})
print("%d hits \n" % result['hits']['total'])
print("First result:\n")
pprint(result['hits']['hits'][0])
So far we've explored how to get started with Elasticsearch and how to perform basic and fuzzy search. These search tools all use the practical scoring function to compute the relevance score for search results. This scoring function is a variation of TF-IDF that also takes into account a few other things, including the length of the query and of the field that's being searched.
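For reference, the Elasticsearch guide describes the practical scoring function roughly as:

score(q, d) = queryNorm(q) * coord(q, d) * sum over t in q of [ tf(t, d) * idf(t)^2 * boost(t) * norm(t, d) ]

where coord rewards documents that match more of the query's terms and norm is the field-length normalization that penalizes matches in longer fields.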
Now we will look at some of the more advanced tools implemented in Elasticsearch. Similarity algorithms can be set on a per-index or per-field basis. The available similarity computations include:
- BM25: currently the default similarity in Elasticsearch, BM25 is a TF-IDF-based similarity that has built-in term frequency normalization and supposedly works better for short fields (like names).
- classic: plain TF-IDF.
- DFR: a similarity that implements the divergence from randomness framework.
- DFI: a similarity that implements the divergence from independence model.
- IB: information-based model; an algorithm that presumes the content in any symbolic 'distribution' sequence is primarily determined by the repetitive usage of its basic elements.
- LMDirichlet: Bayesian smoothing using Dirichlet priors.
- LMJelinekMercer: attempts to capture important patterns in the text while leaving out the noise.

If you want to change the default similarity after creating an index, you must close your index, send the following request, and open it again afterwards:
$ curl -X POST "localhost:9200/cooking/_close"

$ curl -X PUT "localhost:9200/cooking/_settings" -H 'Content-Type: application/json' -d'
{
    "index": {
        "similarity": {
            "default": {
                "type": "classic"
            }
        }
    }
}
'

$ curl -X POST "localhost:9200/cooking/_open"
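If you'd rather stay in Python, here's a sketch of the same close/update/reopen dance using the elasticsearch client we instantiated earlier:

# Sketch: switch the cooking index to classic TF-IDF similarity via the
# Python client instead of curl.
es = indexer.elastic_search

es.indices.close(index="cooking")
es.indices.put_settings(
    index="cooking",
    body={"index": {"similarity": {"default": {"type": "classic"}}}},
)
es.indices.open(index="cooking")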
Now that we've manually changed the similarity scoring metric (in this case to classic TF-IDF), we can see how this affects the results of our previous queries; we note right away that the first result is the same, but its relevance score is lower.
In [16]:
query = {
    "more_like_this" : {
        "fields" : ["description"],
        "like" : red_sauce_renaissance,
        "unlike" : [food_stories[23], food_stories[28]],
        "min_term_freq" : 2,
        "max_query_terms" : 50,
        "min_doc_freq" : 4
    }
}
result = indexer.elastic_search.search(index=["cooking"], body={"query":query})
print("%d hits \n" % result['hits']['total'])
print("First result:\n")
pprint(result['hits']['hits'][0])
From these simple experiments, we can clearly see that document similarity is not one-size-fits-all, but also that Elasticsearch offers quite a few options for relevance scoring that attempt to take into account the nuances of real-world documents, from variations in length and grammar, to vocabulary and style!