In [ ]:
import hypertools as hyp
import wikipedia as wiki
%matplotlib inline
In this example, we will download some text from Wikipedia, split it into chunks, and then plot it. We will use the wikipedia package to retrieve the wiki pages for 'dog' and 'cat'.
In [ ]:
# Split a string s into consecutive pieces of `count` characters each
# (any leftover characters at the end are dropped).
def chunk(s, count):
    return [''.join(x) for x in zip(*[list(s[z::count]) for z in range(count)])]

chunk_size = 5

dog_text = wiki.page('Dog').content
cat_text = wiki.page('Cat').content

# Dividing each page's length by chunk_size gives the piece length, so each
# page ends up split into chunk_size pieces.
dog = chunk(dog_text, int(len(dog_text) / chunk_size))
cat = chunk(cat_text, int(len(cat_text) / chunk_size))
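If you want to sanity-check the chunking, a quick (purely illustrative) look at a few lengths should show chunk_size equally sized pieces per page:
In [ ]:
# Each page should yield chunk_size pieces of equal length (any remainder is dropped).
print(len(dog), len(cat))
print(len(dog[0]), len(dog[-1]))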
Below is a snippet of the text from the dog Wikipedia page. As you can see, the word dog appears in many of the sentences, but words related to dogs, like wolf and carnivore, also appear.
In [ ]:
dog[0][:1000]
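To see this a bit more concretely, a rough word count over the first chunk (a quick illustrative check, independent of hypertools) might look like this:
In [ ]:
from collections import Counter
import re

# Count lowercase word tokens in the first dog chunk and look up a few
# animal-related words.
word_counts = Counter(re.findall(r'[a-z]+', dog[0].lower()))
print({w: word_counts[w] for w in ['dog', 'wolf', 'carnivore']})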
Now we will simply pass the text samples as a list to hyp.plot. By default, hypertools will transform the text data using a topic model that was fit on a variety of Wikipedia pages. Specifically, the text is vectorized using the scikit-learn CountVectorizer and then passed to a LatentDirichletAllocation model to estimate topics. As can be seen below, the 5 chunks of text from the dog/cat wiki pages cluster together, suggesting they are made up of distinct topics.
In [ ]:
hue = ['dog'] * chunk_size + ['cat'] * chunk_size
geo = hyp.plot(dog + cat, 'o', hue=hue, size=[8, 6])
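For intuition about what that transform is doing, the sketch below builds an analogous scikit-learn pipeline by hand. The topic count, stop-word handling, and the choice to fit on our own chunks are assumptions for the illustration, not hypertools' exact settings.
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.pipeline import Pipeline

# Illustrative stand-in for the default text transform: word counts -> topics.
# The parameter choices here are assumptions for the sketch.
text_model = Pipeline([
    ('vectorize', CountVectorizer(stop_words='english')),
    ('topics', LatentDirichletAllocation(n_components=20, random_state=0)),
])

# Each chunk becomes a vector of topic weights; it is this kind of matrix that
# gets reduced and plotted.
topic_weights = text_model.fit_transform(dog + cat)
print(topic_weights.shape)  # (number of chunks, number of topics)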
Now, let's add a third very different topic to the plot.
In [ ]:
bball_text = wiki.page('Basketball').content
bball = chunk(bball_text, int(len(bball_text) / chunk_size))

hue = ['dog'] * chunk_size + ['cat'] * chunk_size + ['bball'] * chunk_size
geo = hyp.plot(dog + cat + bball, 'o', hue=hue, labels=hue, size=[8, 6])
As you might expect, the cat and dog text chunks are closer to each other than either is to the basketball chunks in this topic space. Since cats and dogs are both animals, their pages share many more features (and are thus described with similar text) than either shares with the basketball page.
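One way to put a rough number on that intuition, using the illustrative text_model pipeline sketched above rather than hypertools' internal model, is to compare the average cosine distance between the topic vectors of each pair of pages:
In [ ]:
from sklearn.metrics.pairwise import cosine_distances

# Refit the illustrative pipeline on all three pages, then compare each pair
# of pages by the average cosine distance between their chunks' topic vectors.
weights = text_model.fit_transform(dog + cat + bball)
groups = {'dog': weights[:chunk_size],
          'cat': weights[chunk_size:2 * chunk_size],
          'bball': weights[2 * chunk_size:]}

for a, b in [('dog', 'cat'), ('dog', 'bball'), ('cat', 'bball')]:
    dist = cosine_distances(groups[a], groups[b]).mean()
    print(f'{a} vs {b}: mean cosine distance = {dist:.3f}')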
Hypertools also comes with a few preloaded example text datasets that can be loaded by name. The 'nips' dataset is a collection of NIPS conference papers.
In [ ]:
nips = hyp.load('nips')
nips.plot(size=[8, 6])
The 'wiki' dataset is a sample of the Wikipedia pages used to fit the default 'wiki' model.
In [ ]:
# Note: this reassigns the name `wiki`, previously bound to the wikipedia
# package, which is no longer needed at this point.
wiki = hyp.load('wiki')
wiki.plot(size=[8, 6])
In this example we will plot each State of the Union address from 1989 to the present. The dots are colored and labeled by president. The semantic model used to transform the text is the default 'wiki' model, a CountVectorizer->LatentDirichletAllocation pipeline fit on a selection of Wikipedia pages. As you can see below, the points generally cluster by president, but also by party affiliation (Democrats mostly on the left and Republicans mostly on the right).
In [ ]:
sotus = hyp.load('sotus')
sotus.plot(size=[10,8])
Hypertools also supports other dimensionality reduction algorithms; here the same data are reduced with UMAP instead of the default.
In [ ]:
sotus.plot(reduce='UMAP', size=[10, 8])
The corpus used to fit the topic model can also be changed; below, the same addresses are transformed with a model fit on the 'nips' corpus instead.
In [ ]:
sotus.plot(reduce='UMAP', corpus='nips', size=[10, 8])
Interestingly, plotting the data transformed by a different topic model (trained on scientific articles) gives a totally different representation of the data. This is because the themes extracted from a homogeneous set of scientific articles are distinct from the themes extracted from a diverse set of Wikipedia articles, so the transformation function will be unique to each corpus.
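To make the corpus dependence concrete, here is a small sketch (again using plain scikit-learn rather than hypertools' internals) that fits two topic models on different corpora and transforms the same chunk of dog text with each. The fit_topic_model helper and its parameters are assumptions for the example:
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.pipeline import Pipeline

def fit_topic_model(corpus, n_topics=10):
    # Word counts -> LDA topics, fit on the given corpus.
    model = Pipeline([
        ('vectorize', CountVectorizer(stop_words='english')),
        ('topics', LatentDirichletAllocation(n_components=n_topics, random_state=0)),
    ])
    model.fit(corpus)
    return model

animal_model = fit_topic_model(dog + cat)  # fit on the animal pages
sports_model = fit_topic_model(bball)      # fit on the basketball page

# The same chunk of dog text maps to different topic vectors under the two
# models, which is why the transformed geometries above look so different.
print(animal_model.transform([dog[0]]))
print(sports_model.transform([dog[0]]))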