Visualizing text


In [ ]:
import hypertools as hyp
import wikipedia as wiki
%matplotlib inline

In this example, we will download some text from wikipedia, split it up into chunks and then plot it. We will use the wikipedia package to retrieve the wiki pages for 'dog' and 'cat'.


In [ ]:
def chunk(s, count):
    # split string s into consecutive substrings of length count (dropping any remainder)
    return [''.join(x) for x in zip(*[list(s[z::count]) for z in range(count)])]

chunk_size = 5  # number of chunks to split each page into

dog_text = wiki.page('Dog').content
cat_text = wiki.page('Cat').content

dog = chunk(dog_text, int(len(dog_text)/chunk_size))
cat = chunk(cat_text, int(len(cat_text)/chunk_size))

Below is a snippet of the text from the dog wikipedia page. As you can see, the word 'dog' appears in many of the sentences, along with related words like 'wolf' and 'carnivore'.


In [ ]:
dog[0][:1000]

Now we will simply pass the text samples as a list to hyp.plot. By default, hypertools will transform the text data using a topic model that was fit on a variety of wikipedia pages. Specifically, the text is vectorized using the scikit-learn CountVectorizer and then passed to a LatentDirichletAllocation model to estimate topics. As can be seen below, the 5 chunks of text from each of the dog and cat wiki pages cluster together, suggesting the two pages are made up of distinct topics.


In [ ]:
hue=['dog']*chunk_size+['cat']*chunk_size
geo = hyp.plot(dog + cat, 'o', hue=hue, size=[8, 6])
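
Under the hood, the default transformation is roughly equivalent to the scikit-learn pipeline sketched below. This is only an illustration of the idea, not the exact model hypertools uses: the built-in model is pre-fit on wikipedia pages, whereas this sketch fits directly on our own chunks, and the number of topics (an assumed n_components=20 here) may differ.


In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.pipeline import Pipeline

# vectorize the text chunks into word counts, then estimate topic proportions
text_model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('lda', LatentDirichletAllocation(n_components=20)),  # assumed number of topics
])

topic_vectors = text_model.fit_transform(dog + cat)
topic_vectors.shape  # (number of text chunks, number of topics)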

Now, let's add a third very different topic to the plot.


In [ ]:
bball_text = wiki.page('Basketball').content
bball = chunk(bball_text, int(len(bball_text)/chunk_size))

hue=['dog']*chunk_size+['cat']*chunk_size+['bball']*chunk_size
geo = hyp.plot(dog + cat + bball, 'o', hue=hue, labels=hue, size=[8, 6])

As you might expect, the cat and dog text chunks are closer to each other than to the basketball chunks in this topic space. Since cats and dogs are both animals, they share many more features with each other (and thus are described with more similar text) than either shares with basketball.
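
We can roughly quantify this by comparing distances between the group centroids in the reduced space. The following is a sketch that assumes the DataGeometry object returned by hyp.plot exposes the transformed coordinates as geo.xform_data (as in recent hypertools versions):


In [ ]:
import numpy as np

# stack the transformed coordinates and compute each group's centroid
coords = np.vstack(geo.xform_data)
n_dog, n_cat = len(dog), len(cat)
dog_centroid = coords[:n_dog].mean(axis=0)
cat_centroid = coords[n_dog:n_dog + n_cat].mean(axis=0)
bball_centroid = coords[n_dog + n_cat:].mean(axis=0)

print('dog-cat distance:  ', np.linalg.norm(dog_centroid - cat_centroid))
print('dog-bball distance:', np.linalg.norm(dog_centroid - bball_centroid))
print('cat-bball distance:', np.linalg.norm(cat_centroid - bball_centroid))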

Visualizing NIPS papers

The next example is a dataset of all NIPS papers published since 1987, fit and transformed using the text of each paper. This example dataset can be loaded using the code below.


In [ ]:
nips = hyp.load('nips')
nips.plot(size=[8, 6])

Visualizing Wikipedia pages

Here, we will plot a collection of wikipedia pages, transformed using a topic model (the default 'wiki' model) that was fit on the same articles. We will reduce the dimensionality of the data with TSNE, and then discover clusters with the 'HDBSCAN' algorithm.


In [ ]:
# named wiki_geo so it doesn't shadow the wikipedia module imported above as wiki
wiki_geo = hyp.load('wiki')
wiki_geo.plot(size=[8, 6])
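
If the loaded example already stores these settings, replotting it will reuse them. They can also be passed to plot explicitly, as sketched below (clustering with 'HDBSCAN' requires the optional hdbscan package to be installed):


In [ ]:
wiki_geo.plot(reduce='TSNE', cluster='HDBSCAN', size=[8, 6])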

Visualizing State of the Union Addresses

In this example we will plot each State of the Union address from 1989 to the present. The dots are colored and labeled by president. The semantic model used to transform the text is the default 'wiki' model, which is a CountVectorizer->LatentDirichletAllocation pipeline fit to a selection of wikipedia pages. As you can see below, the points generally cluster by president, but also by party affiliation (Democrats mostly on the left and Republicans mostly on the right).


In [ ]:
sotus = hyp.load('sotus')
sotus.plot(size=[10, 8])
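
The defaults described above can also be requested explicitly. This sketch assumes the vectorizer, semantic, and corpus keyword arguments accept these string shortcuts, as in recent hypertools versions:


In [ ]:
# spell out the default text transformation: word counts -> LDA topics, wiki training corpus
sotus.plot(vectorizer='CountVectorizer',
           semantic='LatentDirichletAllocation',
           corpus='wiki',
           size=[10, 8])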

Changing the reduction model

These data are reduced with PCA. Want to visualize them using a different algorithm? Simply change the reduce parameter. This gives a different, but equally interesting, low-dimensional representation of the data.


In [ ]:
sotus.plot(reduce='UMAP', size=[10, 8])
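
The reduce argument can also take a model specification with custom parameters. A sketch, assuming the dictionary form ({'model': ..., 'params': ...}) that hypertools documents; the n_neighbors value here is arbitrary:


In [ ]:
# pass the reduction model as a dict to control its parameters
sotus.plot(reduce={'model': 'UMAP', 'params': {'n_neighbors': 5}}, size=[10, 8])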

Defining a corpus

Now let's change the corpus used to train the text model. Specifically, we'll use the 'nips' text, a collection of scientific papers. To do this, set corpus='nips'. You can also specify your own text (as a list of text samples) to train the model, as sketched at the end of this section.


In [ ]:
sotus.plot(reduce='UMAP', corpus='nips', size=[10, 8])

Interestingly, plotting the data transformed by a different topic model (trained on scientific articles) gives a totally different representation of the data. This is because the themes extracted from a homogeneous set of scientific articles are distinct from the themes extracted from a diverse set of wikipedia articles, so each corpus yields a different transformation function.
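
Finally, to train the model on your own text rather than a built-in corpus, pass a list of text samples as the corpus. A sketch, where custom_corpus is a hypothetical stand-in for your own documents:


In [ ]:
# any list of text samples can serve as the training corpus (hypothetical documents)
custom_corpus = ['first training document...',
                 'second training document...',
                 'third training document...']

sotus.plot(reduce='UMAP', corpus=custom_corpus, size=[10, 8])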