In [ ]:
%matplotlib inline
In [ ]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
If you simply want to calculate the similarity between documents, then you
may want to check out the Similarity Queries Tutorial
<https://radimrehurek.com/gensim/tut3.html>
and the API reference
<https://radimrehurek.com/gensim/similarities/docsim.html>
. The current
tutorial shows the building block of these larger methods, which are a small
suite of distance metrics, including:
Here's a brief summary of this tutorial:
Compare the distributions to each other using a variety of distance metrics:
Discuss the concept of distance metrics in slightly more detail
In [ ]:
from gensim.corpora import Dictionary
# you can use any corpus, this is just illustratory
texts = [
['bank','river','shore','water'],
['river','water','flow','fast','tree'],
['bank','water','fall','flow'],
['bank','bank','water','rain','river'],
['river','water','mud','tree'],
['money','transaction','bank','finance'],
['bank','borrow','money'],
['bank','finance'],
['finance','money','sell','bank'],
['borrow','sell'],
['bank','loan','sell'],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
import numpy
numpy.random.seed(1) # setting random seed to get the same results each time.
from gensim.models import ldamodel
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2, minimum_probability=1e-8)
model.show_topics()
Let's call the 1st topic the water topic and the second topic the finance topic.
Let's take a few sample documents and get them ready to test our distance functions.
In [ ]:
doc_water = ['river', 'water', 'shore']
doc_finance = ['finance', 'money', 'sell']
doc_bank = ['finance', 'bank', 'tree', 'water']
# now let's make these into a bag of words format
bow_water = model.id2word.doc2bow(doc_water)
bow_finance = model.id2word.doc2bow(doc_finance)
bow_bank = model.id2word.doc2bow(doc_bank)
# we can now get the LDA topic distributions for these
lda_bow_water = model[bow_water]
lda_bow_finance = model[bow_finance]
lda_bow_bank = model[bow_bank]
We're now ready to apply our distance metrics. These metrics return a value between 0 and 1, where values closer to 0 indicate a smaller 'distance' and therefore a larger similarity.
Let's start with the popular Hellinger distance.
The Hellinger distance metric gives an output in the range [0,1] for two probability distributions, with values closer to 0 meaning they are more similar.
In [ ]:
from gensim.matutils import hellinger
print(hellinger(lda_bow_water, lda_bow_finance))
print(hellinger(lda_bow_finance, lda_bow_bank))
Makes sense, right? In the first example, Document 1 and Document 2 are hardly similar, so we get a value of roughly 0.5.
In the second case, the documents are a lot more similar, semantically. Trained with the model, they give a much less distance value.
In [ ]:
from gensim.matutils import kullback_leibler
print(kullback_leibler(lda_bow_water, lda_bow_bank))
print(kullback_leibler(lda_bow_finance, lda_bow_bank))
.. important::
KL is not a Distance Metric in the mathematical sense, and hence is not
symmetrical. This means that kullback_leibler(lda_bow_finance,
lda_bow_bank)
is not equal to kullback_leibler(lda_bow_bank,
lda_bow_finance)
.
As you can see, the values are not equal. We'll get more into the details of this later on in the notebook.
In [ ]:
print(kullback_leibler(lda_bow_bank, lda_bow_finance))
In our previous examples we saw that there were lower distance values between bank and finance than for bank and water, even if it wasn't by a huge margin. What does this mean?
The bank
document is a combination of both water and finance related
terms - but as bank in this context is likely to belong to the finance topic,
the distance values are less between the finance and bank bows.
In [ ]:
# just to confirm our suspicion that the bank bow is more to do with finance:
model.get_document_topics(bow_bank)
It's evident that while it isn't too skewed, it it more towards the finance topic.
Distance metrics (also referred to as similarity metrics), as suggested in
the examples above, are mainly for probability distributions, but the methods
can accept a bunch of formats for input. You can do some further reading on
Kullback Leibler <https://en.wikipedia.org/wiki/Kullback–
Leibler_divergence>
_ and Hellinger
<https://en.wikipedia.org/wiki/Hellinger_distance>
_ to figure out what suits
your needs.
In [ ]:
from gensim.matutils import jaccard
print(jaccard(bow_water, bow_bank))
print(jaccard(doc_water, doc_bank))
print(jaccard(['word'], ['word']))
The three examples above feature 2 different input methods.
In the first case, we present to jaccard document vectors already in bag of words format. The distance can be defined as 1 minus the size of the intersection upon the size of the union of the vectors.
We can see (on manual inspection as well), that the distance is likely to be high - and it is.
The last two examples illustrate the ability for jaccard to accept even lists (i.e, documents) as inputs.
In the last case, because they are the same vectors, the value returned is 0
In [ ]:
topic_water, topic_finance = model.show_topics()
# some pre processing to get the topics in a format acceptable to our distance metrics
def parse_topic_string(topic):
# takes the string returned by model.show_topics()
# split on strings to get topics and the probabilities
topic = topic.split('+')
# list to store topic bows
topic_bow = []
for word in topic:
# split probability and word
prob, word = word.split('*')
# get rid of spaces and quote marks
word = word.replace(" ","").replace('"', '')
# convert to word_type
word = model.id2word.doc2bow([word])[0][0]
topic_bow.append((word, float(prob)))
return topic_bow
finance_distribution = parse_topic_string(topic_finance[1])
water_distribution = parse_topic_string(topic_water[1])
# the finance topic in bag of words format looks like this:
print(finance_distribution)
Now that we've got our topics in a format more acceptable by our functions, let's use a Distance metric to see how similar the word distributions in the topics are.
In [ ]:
print(hellinger(water_distribution, finance_distribution))
Our value of roughly 0.36 means that the topics are not TOO distant with respect to their word distributions.
This makes sense again, because of overlapping words like bank
and a
small size dictionary.
In our previous example we didn't use Kullback Leibler to test for similarity
for a reason - KL is not a Distance 'Metric' in the technical sense (you can
see what a metric is here
<https://en.wikipedia.org/wiki/Metric_(mathematics>
_\ ). The nature of it,
mathematically also means we must be a little careful before using it,
because since it involves the log function, a zero can mess things up. For
example:
In [ ]:
# 16 here is the number of features the probability distribution draws from
print(kullback_leibler(water_distribution, finance_distribution, 16))
That wasn't very helpful, right? This just means that we have to be a bit
careful about our inputs. Our old example didn't work out because they were
some missing values for some words (because show_topics()
only returned
the top 10 topics).
This can be remedied, though.
In [ ]:
# return ALL the words in the dictionary for the topic-word distribution.
topic_water, topic_finance = model.show_topics(num_words=len(model.id2word))
# do our bag of words transformation again
finance_distribution = parse_topic_string(topic_finance[1])
water_distribution = parse_topic_string(topic_water[1])
# and voila!
print(kullback_leibler(water_distribution, finance_distribution))
You may notice that the distance for this is quite less, indicating a high similarity. This may be a bit off because of the small size of the corpus, where all topics are likely to contain a decent overlap of word probabilities. You will likely get a better value for a bigger corpus.
So, just remember, if you intend to use KL as a metric to measure similarity or distance between two distributions, avoid zeros by returning the ENTIRE distribution. Since it's unlikely any probability distribution will ever have absolute zeros for any feature/word, returning all the values like we did will make you good to go.
Having seen the practical usages of these measures (i.e, to find similarity), let's learn a little about what exactly Distance Measures and Metrics are.
I mentioned in the previous section that KL was not a distance metric. There are 4 conditons for for a distance measure to be a metric:
That is: it must be non-negative; if x and y are the same, distance must be zero; it must be symmetric; and it must obey the triangle inequality law.
Simple enough, right?
Let's test these out for our measures.
In [ ]:
# normal Hellinger
a = hellinger(water_distribution, finance_distribution)
b = hellinger(finance_distribution, water_distribution)
print(a)
print(b)
print(a == b)
# if we pass the same values, it is zero.
print(hellinger(water_distribution, water_distribution))
# for triangle inequality let's use LDA document distributions
print(hellinger(lda_bow_finance, lda_bow_bank))
# Triangle inequality works too!
print(hellinger(lda_bow_finance, lda_bow_water) + hellinger(lda_bow_water, lda_bow_bank))
So Hellinger is indeed a metric. Let's check out KL.
In [ ]:
a = kullback_leibler(finance_distribution, water_distribution)
b = kullback_leibler(water_distribution, finance_distribution)
print(a)
print(b)
print(a == b)
We immediately notice that when we swap the values they aren't equal! One of the four conditions not fitting is enough for it to not be a metric.
However, just because it is not a metric, (strictly in the mathematical sense) does not mean that it is not useful to figure out the distance between two probability distributions. KL Divergence is widely used for this purpose, and is probably the most 'famous' distance measure in fields like Information Theory.
For a nice review of the mathematical differences between Hellinger and KL,
this
<http://stats.stackexchange.com/questions/130432/differences-between-bhattacharyya-distance-and-kl-divergence>
__
link does a very good job.
Let's plot a graph of our toy dataset using the popular networkx
<https://networkx.github.io/documentation/stable/>
_ library.
Each node will be a document, where the color of the node will be its topic according to the LDA model. Edges will connect documents to each other, where the weight of the edge will be inversely proportional to the Jaccard similarity between two documents. We will also annotate the edges to further aid visualization: strong edges will connect similar documents, and weak (dashed) edges will connect dissimilar documents.
In summary, similar documents will be closer together, different documents will be further apart.
In [ ]:
import itertools
import networkx as nx
def get_most_likely_topic(doc):
bow = model.id2word.doc2bow(doc)
topics, probabilities = zip(*model.get_document_topics(bow))
max_p = max(probabilities)
topic = topics[probabilities.index(max_p)]
return topic
def get_node_color(i):
return 'skyblue' if get_most_likely_topic(texts[i]) == 0 else 'pink'
G = nx.Graph()
for i, _ in enumerate(texts):
G.add_node(i)
for (i1, i2) in itertools.combinations(range(len(texts)), 2):
bow1, bow2 = texts[i1], texts[i2]
distance = jaccard(bow1, bow2)
G.add_edge(i1, i2, weight=1/distance)
#
# https://networkx.github.io/documentation/networkx-1.9/examples/drawing/weighted_graph.html
#
pos = nx.spring_layout(G)
threshold = 1.25
elarge=[(u,v) for (u,v,d) in G.edges(data=True) if d['weight'] > threshold]
esmall=[(u,v) for (u,v,d) in G.edges(data=True) if d['weight'] <= threshold]
node_colors = [get_node_color(i) for (i, _) in enumerate(texts)]
nx.draw_networkx_nodes(G, pos, node_size=700, node_color=node_colors)
nx.draw_networkx_edges(G,pos,edgelist=elarge, width=2)
nx.draw_networkx_edges(G,pos,edgelist=esmall, width=2, alpha=0.2, edge_color='b', style='dashed')
nx.draw_networkx_labels(G, pos, font_size=20, font_family='sans-serif')
We can make several observations from this graph.
First, the graph consists of two connected components (if you ignore the weak edges). Nodes 0, 1, 2, 3, 4 (which all belong to the water topic) form the first connected component. The other nodes, which all belong to the finance topic, form the second connected component.
Second, the LDA model didn't do a very good job of classifying our documents into topics. There were many misclassifications, as you can confirm in the summary below:
In [ ]:
print('id\ttopic\tdoc')
for i, t in enumerate(texts):
print('%d\t%d\t%s' % (i, get_most_likely_topic(t), ' '.join(t)))
This is mostly because the corpus used to train the LDA model is so small. Using a larger corpus should give you much better results, but that is beyond the scope of this tutorial.
That brings us to the end of this small tutorial. To recap, here's what we covered:
The scope for adding new similarity metrics is large, as there exist an even
larger suite of metrics and methods to add to the matutils.py file.
For more details, see Similarity Measures for Text Document Clustering
<http://www.academia.edu/download/32952068/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf>
_
by A. Huang.