This is an example notebook of using UMAP to embed text (but this can be extended to any collection of tokens). We are going to use the 20 newsgroups dataset, which is a collection of forum posts labelled by topic. We are going to embed these documents and see that similar documents (i.e. posts in the same subforum) end up close together. You can use this embedding for other downstream tasks, such as visualizing your corpus or running a clustering algorithm (e.g. HDBSCAN). We will use a bag-of-words model and apply UMAP to both the count vectors and the TF-IDF vectors.
In [1]:
!python --version
First install the dependencies
In [ ]:
!pip install numpy scipy pandas scikit-learn datashader holoviews numba
You will need umap-learn >= 0.4.0, which is currently (at the time of writing) not available via pip/conda. If it is by the time you read this, great! Just run the command below
In [ ]:
#!pip install umap-learn
Otherwise you may need to install it from the master branch of the GitHub repo by following the instructions in the README
In [2]:
import pandas as pd
import umap
import umap.plot
# Used to get the data
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Some plotting libraries
import matplotlib.pyplot as plt
%matplotlib notebook
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)
In [3]:
%%time
dataset = fetch_20newsgroups(subset='all',
                             shuffle=True, random_state=42)
In [4]:
print(f'{len(dataset.data)} documents')
print(f'{len(dataset.target_names)} categories')
Here are the categories of documents. As you can see, many are related to one another (e.g. 'comp.sys.ibm.pc.hardware' and 'comp.sys.mac.hardware'), but others are not (e.g. 'sci.med' and 'rec.sport.baseball').
In [5]:
dataset.target_names
Out[5]:
Let's look at a couple of sample documents
In [6]:
for idx, document in enumerate(dataset.data[:3]):
    category = dataset.target_names[dataset.target[idx]]
    print(f'Category: {category}')
    print('---------------------------')
    # Print the first 500 characters of the post
    print(document[:500])
    print('---------------------------')
In [7]:
category_labels = [dataset.target_names[x] for x in dataset.target]
hover_df = pd.DataFrame(category_labels, columns=['category'])
We are going to use a bag-of-words approach (i.e. word order doesn't matter) and construct a word-document matrix. In this matrix each row corresponds to a document (i.e. post) and each column corresponds to a particular word. The values are the counts of how many times a given word appeared in a particular document.
We will use sklearn's CountVectorizer class to do this for us, along with a couple of other preprocessing steps (see the small toy example after this list):
1) Split the text into tokens (i.e. words) by splitting on whitespace
2) Remove English stopwords (the, and, etc.)
3) Remove all words which appear in fewer than 5 documents in the entire corpus (via the min_df parameter)
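To make the structure of the word-document matrix concrete, here is a small toy sketch. The tiny corpus and the min_df value below are made up purely for illustration and are not part of the 20 newsgroups pipeline.
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'cats and dogs are pets',
]

# min_df=1 here so nothing gets filtered out of this tiny corpus
toy_vectorizer = CountVectorizer(min_df=1, stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# Columns are the vocabulary (use get_feature_names() on older sklearn versions)
print(toy_vectorizer.get_feature_names_out())
# Rows are documents, values are word counts
print(toy_matrix.toarray())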
In [8]:
vectorizer = CountVectorizer(min_df=5, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset.data)
This gives us an 18846x34880 matrix where the 18846 rows are the documents (same count as above) and the 34880 columns are the unique tokens. The matrix is sparse since most words do not appear in most documents.
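If you want to see just how sparse it is, a quick check like the following (an illustrative extra, not part of the original pipeline) reports the fraction of non-zero entries:
In [ ]:
# Fraction of entries in the sparse matrix that are non-zero
density = word_doc_matrix.nnz / (word_doc_matrix.shape[0] * word_doc_matrix.shape[1])
print(f'{density:.4%} of the entries are non-zero')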
In [9]:
word_doc_matrix
Out[9]:
We are going to do dimension reduction using UMAP to reduce the matrix from 34880 dimensions to 2 dimensions (since n_components=2). We need a distance metric and will use Hellinger distance, which measures the similarity between two probability distributions. Each document can be viewed as a set of counts drawn from a multinomial distribution over words, so Hellinger distance is a natural way to measure how similar two documents are.
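For intuition, the Hellinger distance between two count vectors normalizes them into probability distributions and then compares their element-wise square roots. UMAP computes this internally when you pass metric='hellinger'; the small helper below is just an illustrative sketch of the formula.
In [ ]:
import numpy as np

def hellinger(p_counts, q_counts):
    # Normalize raw counts into probability distributions
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    # Hellinger distance: Euclidean distance between the square-root
    # vectors, scaled so the result lies in [0, 1]
    return np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())

hellinger(np.array([3., 0., 1.]), np.array([1., 1., 2.]))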
In [10]:
%%time
embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)
Now we have an embedding of shape 18846x2
In [11]:
embedding.embedding_.shape
Out[11]:
In [12]:
f = umap.plot.interactive(embedding, labels=dataset.target, hover_data=hover_df, point_size=1)
show(f)
As you can see, this does reasonably well. There is some separation, and groups that you would expect to be similar (such as 'rec.sport.baseball' and 'rec.sport.hockey') end up near one another. The big clump in the middle corresponds to a lot of extremely similar newsgroups, like 'comp.sys.ibm.pc.hardware' and 'comp.sys.mac.hardware'.
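As mentioned in the introduction, this embedding can also feed downstream tasks such as clustering. Here is a minimal sketch using HDBSCAN, assuming the hdbscan package is installed; the min_cluster_size value is just an illustrative guess, not a tuned setting.
In [ ]:
import hdbscan

# Cluster the 2D document embedding; -1 labels mark points treated as noise
clusterer = hdbscan.HDBSCAN(min_cluster_size=100)
cluster_labels = clusterer.fit_predict(embedding.embedding_)
pd.Series(cluster_labels).value_counts().head()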
We will now run the same pipeline with the only change being the use of TF-IDF weighting. TF-IDF gives less weight to words that appear frequently across a large number of documents, since they are popular in general. It gives higher weight to words that appear frequently in only a smaller subset of documents, since they are probably important words for those documents.
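Conceptually, the weighting multiplies each term count by a factor that shrinks as the term shows up in more documents. The rough sketch below illustrates the idea on a made-up count matrix; sklearn's TfidfVectorizer additionally applies smoothing and normalization, so this is not an exact reproduction of its output.
In [ ]:
import numpy as np

def tfidf_weight(count_matrix):
    # count_matrix: dense (documents x terms) array of raw counts
    n_docs = count_matrix.shape[0]
    doc_freq = (count_matrix > 0).sum(axis=0)  # how many documents contain each term
    idf = np.log(n_docs / doc_freq)            # rarer terms get a larger weight
    return count_matrix * idf                  # scale raw counts by their idf

toy_counts = np.array([[2, 0, 1],
                       [1, 1, 0],
                       [3, 1, 0]])
# The first term appears in every document, so its weight drops to zero
tfidf_weight(toy_counts)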
To do the TF-IDF weighting we will use sklearn's TfidfVectorizer with the same parameters as the CountVectorizer above.
In [13]:
tfidf_vectorizer = TfidfVectorizer(min_df=5, stop_words='english')
tfidf_word_doc_matrix = tfidf_vectorizer.fit_transform(dataset.data)
We get a matrix of the same size as before
In [14]:
tfidf_word_doc_matrix
Out[14]:
Again we use Hellinger distance and UMAP to embed the documents
In [15]:
%%time
tfidf_embedding = umap.UMAP(metric='hellinger').fit(tfidf_word_doc_matrix)
In [16]:
fig = umap.plot.interactive(tfidf_embedding, labels=dataset.target, hover_data=hover_df, point_size=1)
show(fig)
The results look fairly similar to before, but TF-IDF weighting can be a useful trick to have in your toolbox.