everything2vec


word2vec for Machine Translation

Translation between languages can be modeled as a rotation and scaling of the embedding vector space.


How are Machine Translations learned?

The transform matrix can be learned by bootstrapping from a small, manually labeled sample of word pairs, then applied to the entire language.

Steps:

  1. Create a word embedding in both languages
  2. Manually specify word pairs (typically simple, concrete nouns)
  3. Learn the translation matrix from those pairs (sketched below)
  4. Apply the translation matrix across the entire vocabulary
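
A minimal sketch of steps 2–4, assuming `src_vecs` and `tgt_vecs` are pretrained gensim `KeyedVectors` for the source and target languages (those names and the seed pairs are illustrative, not prescribed here):

```python
# Sketch only: learn a linear translation matrix from a few manually specified
# word pairs, then use it to translate unseen words. `src_vecs` and `tgt_vecs`
# are assumed to be pretrained gensim KeyedVectors (hypothetical names).
import numpy as np

seed_pairs = [("dog", "perro"), ("cat", "gato"), ("house", "casa")]  # illustrative pairs

X = np.stack([src_vecs[s] for s, _ in seed_pairs])  # source embeddings, shape (n, d)
Y = np.stack([tgt_vecs[t] for _, t in seed_pairs])  # target embeddings, shape (n, d)

# Least-squares fit of the translation matrix W so that X @ W approximates Y.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def translate(word, topn=5):
    # Project a source-language word into the target space, then look up its
    # nearest neighbors among the target-language vectors.
    return tgt_vecs.similar_by_vector(src_vecs[word] @ W, topn=topn)

print(translate("bird"))
```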

word2vec: Review


What is the goal of word2vec, in plain English?
Create a dense vector representation of words that models semantic meaning based on context.
Why are neural networks powerful in machine learning?

Capable of learning complex, arbitrary, non-linear relationships.
What are the inputs and outputs of the word2vec neural network during training?

A word and its context: skip-gram takes the center word as input and predicts the surrounding context words, while CBOW takes the context as input and predicts the center word.
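
A toy skip-gram illustration of those training pairs (the sentence and window size are arbitrary):

```python
# Toy illustration of skip-gram training pairs: the input is a center word and
# the output is one of the words in its surrounding context window.
sentence = "the quick brown fox jumps".split()
window = 2

pairs = [
    (sentence[i], sentence[j])                      # (input word, context word)
    for i in range(len(sentence))
    for j in range(max(0, i - window), min(len(sentence), i + window + 1))
    if i != j
]

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```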
What math operations are most common on word2vec embeddings?

- Arithmetic

- Distance

- Clustering

But any vector operation in the embedded space is meaningful.
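
A brief sketch of those three operations, assuming `wv` is a pretrained gensim `KeyedVectors` model (the model name below is just one possible choice):

```python
# Sketch only: common vector operations on word2vec embeddings.
import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans

wv = api.load("word2vec-google-news-300")  # assumed pretrained model (large download)

# Arithmetic: king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Distance: cosine similarity between two words
print(wv.similarity("coffee", "tea"))

# Clustering: group a handful of word vectors with k-means
words = ["apple", "banana", "grape", "Paris", "London", "Tokyo"]
vectors = np.stack([wv[w] for w in words])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(dict(zip(words, labels)))
```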

Every ML algorithm has to solve 3 basic problems:

  1. Representation
  2. Evaluation
  3. Optimization
How does word2vec solve each of these?

1. Representation: word semantics are represented with dense vectors (the key breakthrough)

2. Evaluation: through word context, i.e., predicting a word from its neighbors (thus it needs lots of words)

3. Optimization: backpropagation (aka, back prop)

By the end of this Notebook, you should be able to:

  • Apply word2vec beyond just words
  • Explain how doc2vec works to extend word2vec
  • Explain how word2vec can be applied to other domains (e.g., emojis 👻)
  • Describe the future of vectorization through thought2vec

doc2vec, a powerful extension of word2vec


How does doc2vec work?

Doc2vec (aka paragraph2vec or sentence embeddings) extends the word2vec algorithm to larger blocks of text, such as sentences, paragraphs, or entire documents.

Every paragraph is mapped to a unique vector, represented by a column in matrix $D$, and every word is also mapped to a unique vector, represented by a column in matrix $W$. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context.

The additional context (the paragraph) does not have to be a fixed length, because it is vectorized and projected into the same space as the words.

The paragraph vectors add parameters, but the updates are sparse, so training is still efficient.
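
A minimal, illustrative sketch of doc2vec with gensim (toy corpus; the parameters are hypothetical):

```python
# Sketch only: training doc2vec (paragraph vectors) with gensim on a toy corpus.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the beer pours a deep amber with a thick head",
    "this stout tastes of coffee and dark chocolate",
    "a crisp lager that is light and refreshing",
]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

# Each document gets its own vector, trained jointly with the word vectors.
model = Doc2Vec(tagged, vector_size=50, window=2, min_count=1, epochs=40)

# Infer a vector for an unseen document and find its nearest training document.
new_vec = model.infer_vector("a roasty beer with chocolate notes".split())
print(model.dv.most_similar([new_vec], topn=1))
```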


doc2vec Example: Descri.beer

How do you make sense of 1.6M beer reviews?


emoji2vec



Notable vectorizations

| Name      | Embedding        |
|-----------|------------------|
| Char2Vec  | Character        |
| Word2Vec  | Word             |
| GloVe     | Word             |
| Doc2Vec   | Sections of text |
| Image2Vec | Image            |
| Video2Vec | Video            |

Source


Thought Vectors - The Future of Vectorization

Geoffrey Hinton's (of Google) "top secret" new algorithm.

Instead of embedding words or documents in vector space, embed thoughts in vector space. Their features will represent how each thought relates to other thoughts.

It hasn't been publicly released, so this is mostly speculation. Keep an eye out for it.

When Google farts 💨, the rest of the world 💩


Summary

  • Word2vec offers another perspective on machine translation: translating between languages becomes a rotation and scaling of the embedded space.
  • Longer pieces of text can also be embedded into the same space as words (i.e., doc2vec).
  • Given the properties of word2vec (e.g., large corpus input, straightforward training, and vector output), it can be applied to a variety of domains, including:
    • emojis
    • thoughts
    • <insert your idea here>





Bonus Material


Document similarity at the concept level

The Sicilian gelato was extremely rich.

vs.

The Italian ice-cream was very velvety.

The statements reference the same idea but share almost no words.


Word Mover’s Distance (WMD)

Represent text documents as a weighted point cloud of embedded words.

The distance between two text documents A and B is the minimum cumulative distance that words from document A need to travel to match exactly the point cloud of document B.
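
A hedged sketch of WMD via gensim's `wmdistance`, assuming `wv` is a pretrained `KeyedVectors` model as in the earlier sketch (an extra optimal-transport dependency may be required):

```python
# Sketch only: Word Mover's Distance between the gelato / ice-cream sentences.
# Assumes `wv` is a pretrained gensim KeyedVectors model (see earlier sketch).
sentence_a = "the sicilian gelato was extremely rich".split()
sentence_b = "the italian ice-cream was very velvety".split()

# A lower distance means the documents are semantically closer,
# even though they share almost no words.
print(wv.wmdistance(sentence_a, sentence_b))
```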


Earth mover’s distance metric (EMD)

Word Mover’s Distance (WMD) is a special case of the earth mover’s distance metric (EMD).

EMD is a method to evaluate dissimilarity between two multi-dimensional distributions in some feature space, given a distance measure between single features (the ground distance). The EMD 'lifts' this distance from individual features to full distributions.

Deep dive on EMD
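
One possible illustration of EMD on two tiny 2-D point clouds, using the POT (Python Optimal Transport) library (an assumed choice, not prescribed by the text):

```python
# Sketch only: EMD between two small 2-D point clouds with uniform weights,
# computed with the POT library (an assumed implementation choice).
import numpy as np
import ot

x = np.array([[0.0, 0.0], [1.0, 0.0]])  # point cloud A (two features in 2-D space)
y = np.array([[0.0, 1.0], [1.0, 1.0]])  # point cloud B
a = b = np.array([0.5, 0.5])            # uniform weight on each feature

ground = ot.dist(x, y, metric="euclidean")  # ground distances between single features
print(ot.emd2(a, b, ground))                # minimum transport cost = EMD (1.0 here)
```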


Example

WMD achieves state-of-the-art kNN document classification accuracy, but it is the slowest metric to compute.

Source: From Word Embeddings To Document Distances

Application to Data Science


lda2vec


Latent Dirichlet allocation (LDA) overview


lda2vec Executive Summary

lda2vec adds additional context to word2vec: the document itself, modeled as a mixture of topics.


lda2vec model

$v_{doc}$ is a mixture:

$v_{doc} = a \cdot v_{topic_1} + b \cdot v_{topic_2} + \dots$
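
A toy numeric sketch of that mixture (random topic vectors and hypothetical weights):

```python
# Toy sketch: a document vector as a sparse mixture of topic vectors.
import numpy as np

rng = np.random.default_rng(0)
topic_vectors = rng.standard_normal((3, 50))  # hypothetical topic embeddings (3 topics, 50-d)
weights = np.array([0.7, 0.3, 0.0])           # mixture weights a, b, ... (sparse, sum to 1)

v_doc = weights @ topic_vectors               # v_doc = a*v_topic1 + b*v_topic2 + ...
print(v_doc.shape)                            # (50,)
```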


word2vec vs. LDA

| algorithm | scope  | prediction                     | numbers                      | visualization | density | metaphor |
|-----------|--------|--------------------------------|------------------------------|---------------|---------|----------|
| word2vec  | local  | one word predicts nearby words | real numbers                 | bar chart     | dense   | location |
| LDA       | global | documents predict global words | percentages that sum to 100% | pie chart     | sparse  | mixture  |

lda2vec implementation

GitHub repo