everything2vec


word2vec for Machine Translation

Translation between languages can be modeled as a rotation and scaling of the embedding vector space.


How are Machine Translations learned?

The transform matrix can be learned by bootstrapping from a small, manually labeled sample of word pairs, then applied to the entire language.

Steps:

  1. Create a word embedding in both languages
  2. Manually specify word pairs (typically simple, concrete nouns)
  3. Learn the translation matrix from those pairs (sketched below)
  4. Apply the translation matrix across the entire vocabulary
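
A minimal sketch of steps 2–4, assuming `src_vecs` and `tgt_vecs` are pretrained gensim `KeyedVectors` for the source and target languages (those names and the seed pairs are illustrative, not prescribed here):

```python
# Sketch only: learn a linear translation matrix from a few manually specified
# word pairs, then use it to translate unseen words. `src_vecs` and `tgt_vecs`
# are assumed to be pretrained gensim KeyedVectors (hypothetical names).
import numpy as np

seed_pairs = [("dog", "perro"), ("cat", "gato"), ("house", "casa")]  # illustrative pairs

X = np.stack([src_vecs[s] for s, _ in seed_pairs])  # source embeddings, shape (n, d)
Y = np.stack([tgt_vecs[t] for _, t in seed_pairs])  # target embeddings, shape (n, d)

# Least-squares fit of the translation matrix W so that X @ W approximates Y.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def translate(word, topn=5):
    # Project a source-language word into the target space, then look up its
    # nearest neighbors among the target-language vectors.
    return tgt_vecs.similar_by_vector(src_vecs[word] @ W, topn=topn)

print(translate("bird"))
```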

word2vec: Review


What is the goal of word2vec, in plain English?
Create a dense vector representation of words that models semantic meaning based on context.
Why are neural networks powerful in machine learning?

Capable of learning complex, arbitrary, non-linear relationships.
What are the inputs and outputs of the word2vec neural network during training?

A word and its context: skip-gram takes the center word as input and predicts the surrounding context words, while CBOW takes the context as input and predicts the center word.
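
A toy skip-gram illustration of those training pairs (the sentence and window size are arbitrary):

```python
# Toy illustration of skip-gram training pairs: the input is a center word and
# the output is one of the words in its surrounding context window.
sentence = "the quick brown fox jumps".split()
window = 2

pairs = [
    (sentence[i], sentence[j])                      # (input word, context word)
    for i in range(len(sentence))
    for j in range(max(0, i - window), min(len(sentence), i + window + 1))
    if i != j
]

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```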
What math operations are most common on word2vec embeddings?

- Arithmetic

- Distance

- Clustering

But any vector operation in the embedded space is meaningful.
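
A brief sketch of those three operations, assuming `wv` is a pretrained gensim `KeyedVectors` model (the model name below is just one possible choice):

```python
# Sketch only: common vector operations on word2vec embeddings.
import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans

wv = api.load("word2vec-google-news-300")  # assumed pretrained model (large download)

# Arithmetic: king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Distance: cosine similarity between two words
print(wv.similarity("coffee", "tea"))

# Clustering: group a handful of word vectors with k-means
words = ["apple", "banana", "grape", "Paris", "London", "Tokyo"]
vectors = np.stack([wv[w] for w in words])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(dict(zip(words, labels)))
```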

Every ML algorithm has to solve 3 basic problems:

  1. Representation
  2. Evaluation
  3. Optimization
How does word2vec solve each of these?

1. Representation: word semantics are represented with dense vectors (the key breakthrough)

2. Evaluation: through word context, i.e., predicting a word from its neighbors (thus it needs lots of words)

3. Optimization: backpropagation (aka, back prop)

By the end of this Notebook, you should be able to:

  • Apply word2vec beyond just words
  • Explain how doc2vec works to extend word2vec
  • Explain how word2vec can be applied to other domains (e.g., emojis 👻)
  • Describe the future of vectorization through thought2vec

doc2vec, a powerful extension of word2vec


How does doc2vec work?

Doc2vec (aka paragraph2vec or sentence embeddings) extends the word2vec algorithm to larger blocks of text, such as sentences, paragraphs, or entire documents.

Every paragraph is mapped to a unique vector, represented by a column in matrix $D$, and every word is also mapped to a unique vector, represented by a column in matrix $W$. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context.

The additional context (the paragraph) does not have to be a fixed length, because it is vectorized and projected into the same space as the words.

The paragraph vectors add parameters, but the updates are sparse, so training is still efficient.
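
A minimal, illustrative sketch of doc2vec with gensim (toy corpus; the parameters are hypothetical):

```python
# Sketch only: training doc2vec (paragraph vectors) with gensim on a toy corpus.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the beer pours a deep amber with a thick head",
    "this stout tastes of coffee and dark chocolate",
    "a crisp lager that is light and refreshing",
]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

# Each document gets its own vector, trained jointly with the word vectors.
model = Doc2Vec(tagged, vector_size=50, window=2, min_count=1, epochs=40)

# Infer a vector for an unseen document and find its nearest training document.
new_vec = model.infer_vector("a roasty beer with chocolate notes".split())
print(model.dv.most_similar([new_vec], topn=1))
```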


doc2vec Example: Descri.beer

How do you make sense of 1.6M beer reviews?


emoji2vec



Notable vectorizations

| Name      | Embedding        |
|-----------|------------------|
| Char2Vec  | Character        |
| Word2Vec  | Word             |
| GloVe     | Word             |
| Doc2Vec   | Sections of text |
| Image2Vec | Image            |
| Video2Vec | Video            |

Source


Thought Vectors - The Future of Vectorization

Geoffrey Hinton's (of Google) "top secret" new algorithm.

Instead of embedding words or documents in vector space, embed thoughts in vector space. Their features will represent how each thought relates to other thoughts.

It hasn't been publicly released, so this is mostly speculation. Keep an eye out for it.

When Google farts 💨, the rest of the world 💩


Summary

  • Word2vec offers another perspective on machine translation: translating between languages becomes a rotation and scaling of the embedded space.
  • Longer pieces of text can also be embedded into the same space as words (i.e., doc2vec).
  • Given the properties of word2vec (e.g., large corpus input, straightforward training, and vector output), it can be applied to a variety of domains, including:
    • emojis
    • thoughts
    • <insert your idea here>





Bonus Material


Document similarity at the concept level

The Sicilian gelato was extremely rich.

vs.

The Italian ice-cream was very velvety.

The statements reference the same idea but share almost no words.


Word Mover’s Distance (WMD)

Represent text documents as a weighted point cloud of embedded words.

The distance between two text documents A and B is the minimum cumulative distance that words from document A need to travel to match exactly the point cloud of document B.
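
A hedged sketch of WMD via gensim's `wmdistance`, assuming `wv` is a pretrained `KeyedVectors` model as in the earlier sketch (an extra optimal-transport dependency may be required):

```python
# Sketch only: Word Mover's Distance between the gelato / ice-cream sentences.
# Assumes `wv` is a pretrained gensim KeyedVectors model (see earlier sketch).
sentence_a = "the sicilian gelato was extremely rich".split()
sentence_b = "the italian ice-cream was very velvety".split()

# A lower distance means the documents are semantically closer,
# even though they share almost no words.
print(wv.wmdistance(sentence_a, sentence_b))
```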


Earth mover’s distance metric (EMD)

Word Mover’s Distance (WMD) is a special case of the earth mover’s distance metric (EMD).

EMD is a method to evaluate dissimilarity between two multi-dimensional distributions in some feature space, given a distance measure between single features (the ground distance). The EMD 'lifts' this distance from individual features to full distributions.

Deep dive on EMD
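
One possible illustration of EMD on two tiny 2-D point clouds, using the POT (Python Optimal Transport) library (an assumed choice, not prescribed by the text):

```python
# Sketch only: EMD between two small 2-D point clouds with uniform weights,
# computed with the POT library (an assumed implementation choice).
import numpy as np
import ot

x = np.array([[0.0, 0.0], [1.0, 0.0]])  # point cloud A (two features in 2-D space)
y = np.array([[0.0, 1.0], [1.0, 1.0]])  # point cloud B
a = b = np.array([0.5, 0.5])            # uniform weight on each feature

ground = ot.dist(x, y, metric="euclidean")  # ground distances between single features
print(ot.emd2(a, b, ground))                # minimum transport cost = EMD (1.0 here)
```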


Example

WMD achieves state-of-the-art kNN document classification accuracy, but it is the slowest metric to compute.

Source: From Word Embeddings To Document Distances

Application to Data Science


lda2vec


Latent Dirichlet allocation (LDA) overview


lda2vec Executive Summary

lda2vec adds additional context to word2vec: the document itself, modeled as a mixture of topics.


lda2vec model

$v_{doc}$ is a mixture:

$v_{doc} = a \cdot v_{topic_1} + b \cdot v_{topic_2} + \dots$
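
A toy numeric sketch of that mixture (random topic vectors and hypothetical weights):

```python
# Toy sketch: a document vector as a sparse mixture of topic vectors.
import numpy as np

rng = np.random.default_rng(0)
topic_vectors = rng.standard_normal((3, 50))  # hypothetical topic embeddings (3 topics, 50-d)
weights = np.array([0.7, 0.3, 0.0])           # mixture weights a, b, ... (sparse, sum to 1)

v_doc = weights @ topic_vectors               # v_doc = a*v_topic1 + b*v_topic2 + ...
print(v_doc.shape)                            # (50,)
```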


word2vec vs. LDA

| algorithm | scope  | prediction                     | numbers                      | visualization | density | metaphor |
|-----------|--------|--------------------------------|------------------------------|---------------|---------|----------|
| word2vec  | local  | one word predicts nearby words | real numbers                 | bar chart     | dense   | location |
| LDA       | global | documents predict global words | percentages that sum to 100% | pie chart     | sparse  | mixture  |

lda2vec implementation

GitHub repo