In this tutorial, I will explain how to visualize Doc2Vec embeddings, also known as paragraph vectors, via TensorBoard, a data visualization framework for visualizing and inspecting TensorFlow runs and graphs. We will use its built-in visualizer, the Embedding Projector, which lets you interactively visualize and analyze high-dimensional data like embeddings.
For this tutorial, a transformed MovieLens dataset[1] from this repository was used, with the movie titles added afterwards. You can download the prepared csv from here. The input documents for training are the movie synopses, on which the Doc2Vec model is trained.
The visualization will be a scatterplot, as seen in the image above, where each datapoint is labelled by its movie title and colored by its corresponding genre. You can also visit this Projector link, which is configured with my embeddings for the above-mentioned dataset.
In [2]:
import gensim
import pandas as pd
import smart_open
import random
# read data
dataframe = pd.read_csv('movie_plots.csv')
dataframe
Out[2]:
Below, we define a function to read the training documents, pre-process each document using a simple gensim pre-processing tool (i.e., tokenize the text into individual words, remove punctuation, convert to lowercase, etc.), and return a list of words. To train the model, we also need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number.
In [3]:
def read_corpus(documents):
    # Tag each pre-processed plot with its zero-based line number
    for i, plot in enumerate(documents):
        yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(plot, max_len=30), [i])
In [4]:
train_corpus = list(read_corpus(dataframe.Plots))
Let's take a look at the training corpus.
In [5]:
train_corpus[:2]
Out[5]:
We'll instantiate a Doc2Vec model with a vector size of 50 dimensions, iterating over the training corpus 55 times. We set the minimum word count to 2 in order to discard words with very few occurrences. Model accuracy can be improved by increasing the number of iterations, but this generally increases the training time. Small datasets with short documents, like this one, can benefit from more training passes.
In [8]:
# NOTE: in gensim 4.x these parameters are named vector_size and epochs
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)
Out[8]:
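As a quick sanity check before moving on (a minimal sketch, assuming the older gensim docvecs API matching the model trained above), we can infer a vector for one of the training plots and confirm that its nearest neighbour is usually the document itself:
# Infer a vector for training document 0 and look up its most similar
# doctags; the top hit is usually document 0 itself.
inferred = model.infer_vector(train_corpus[0].words)
print(model.docvecs.most_similar([inferred], topn=3))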
Now, we'll save the document embedding vectors per doctag.
In [9]:
model.save_word2vec_format('doc_tensor.w2v', doctag_vec=True, word_vec=False)
TensorBoard takes two input files: one containing the embedding vectors and the other containing relevant metadata. We'll use a gensim script to directly convert the embedding file saved in word2vec format above to the tsv format required by TensorBoard.
In [11]:
%run ../../gensim/scripts/word2vec2tensor.py -i doc_tensor.w2v -o movie_plot
The script above generates two files: movie_plot_tensor.tsv, which contains the embedding vectors, and movie_plot_metadata.tsv, which contains the doctags. But these doctags are simply the unique index values, and hence are not really useful for interpreting which document a point represents while visualizing. So we will overwrite movie_plot_metadata.tsv with a custom metadata file with two columns: the first for the movie titles and the second for their corresponding genres.
In [12]:
with open('movie_plot_metadata.tsv','w') as w:
    w.write('Titles\tGenres\n')
    for i,j in zip(dataframe.Titles, dataframe.Genres):
        w.write("%s\t%s\n" % (i,j))
Now you can go to http://projector.tensorflow.org/ and upload the two files by clicking on Load data in the left panel.
For demo purposes I have uploaded the Doc2Vec embeddings generated from the model trained above here. You can access the Embedding projector configured with these uploaded embeddings at this link.
For visualization, the multi-dimensional embeddings that we get from the Doc2Vec model above need to be reduced to 2 or 3 dimensions, so that we end up with a new 2D or 3D embedding which tries to preserve information from the original multi-dimensional embedding. As the vectors are reduced to a much smaller dimension, the exact cosine/Euclidean distances between them are not preserved, only relative ones, and hence, as you'll see below, the nearest-similarity results may change.
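To see the same effect outside the Projector (a minimal sketch, not part of the original pipeline, assuming scikit-learn is installed), we can project the saved doctag vectors down to 3 dimensions with PCA and check how much information the reduced embedding retains:
import numpy as np
from sklearn.decomposition import PCA

# Load the tensor tsv written by word2vec2tensor.py: one vector per row.
vectors = np.loadtxt('movie_plot_tensor.tsv', delimiter='\t')

# Project the 50-dimensional doctag vectors down to 3 dimensions.
pca = PCA(n_components=3)
reduced = pca.fit_transform(vectors)

# The summed explained variance ratio shows how much of the original
# variance the 3 retained dimensions capture.
print(reduced.shape, pca.explained_variance_ratio_.sum())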
TensorBoard offers two popular dimensionality reduction methods for visualizing the embeddings, PCA and t-SNE, and also provides a custom projection method based on text searches.
You can refer to this doc for instructions on how to use and navigate through different panels available in TensorBoard.
Data is visualized by animating through every iteration of the t-SNE algorithm. The t-SNE menu on the left lets you adjust the values of its two hyperparameters. The first is perplexity, which is basically a measure of information; it may be viewed as a knob that sets the number of effective nearest neighbors[2]. The second is the learning rate, which defines the step size the optimization takes at each iteration.
The above plot was generated with perplexity 8, learning rate 10, and 500 iterations. The results can vary on successive runs, and you may not get exactly the plot above with the same hyperparameter settings, but similar small clusters will start forming, possibly with different orientations.
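The same settings can be reproduced offline (a hedged sketch, again assuming scikit-learn and the vectors array loaded in the PCA sketch above; the Projector's own t-SNE run will not give identical coordinates):
from sklearn.manifold import TSNE

# Mirror the Projector settings: perplexity 8, learning rate 10, 500 iterations.
tsne = TSNE(n_components=2, perplexity=8, learning_rate=10, n_iter=500)
coords = tsne.fit_transform(vectors)
print(coords.shape)  # one 2D point per movie plot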
We learned about visualizing document embeddings through TensorBoard's Embedding Projector. It is a useful tool for visualizing many types of data, for example word embeddings, document embeddings, or gene expressions and biological sequences. It just needs an input of 2D tensors, and then you can explore your data using the provided algorithms. You can also perform a nearest-neighbours search to find the data points most similar to your query point.
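The same nearest-neighbours search can also be done directly in gensim (a minimal sketch using the older docvecs API matching the model trained above; the dataframe lookup is just for readable output):
# Find the movies whose plot vectors are closest to movie 0.
for doc_id, similarity in model.docvecs.most_similar(0, topn=5):
    print(dataframe.Titles[doc_id], round(similarity, 3))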