The transform matrix can be learned by bootstrapping from a small, manually labeled sample, then extended to the entire language.
Steps:
1. Manually label a small seed set of word pairs (the bootstrap sample).
2. Fit a linear transform on those pairs, e.g. by least squares.
3. Apply the transform to every remaining word vector; nearest neighbors in the target space give the mapping for unlabeled words.
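A minimal least-squares sketch of the fitting step; the matrices X (source space) and Z (target space) below are random stand-ins for real seed-pair vectors, and `transform` is a hypothetical helper:

```python
import numpy as np

# Hypothetical seed dictionary: 500 manually labeled pairs of
# 300-dimensional vectors. Random stand-ins, not real embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))  # vectors in the source space
Z = rng.normal(size=(500, 300))  # their labeled counterparts in the target space

# Least squares: find W minimizing ||X @ W - Z||_F^2.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def transform(x):
    """Project a source-space vector into the target space; its nearest
    target-space neighbors are candidate matches."""
    return x @ W
```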
Every ML algorithm has to solve 3 basic problems:
Every paragraph is mapped to a unique vector, represented by a column in matrix D, and every word is also mapped to a unique vector, represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context.
Each additional context does not have to be a fixed length, because it is vectorized and projected into the same space.
Additional parameters (one vector per paragraph), but the updates are sparse and thus still efficient.
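A minimal sketch of the model using gensim's Doc2Vec reimplementation; the toy corpus, tags, and parameters below are illustrative assumptions, not tuned values:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document's tag indexes its column in matrix D.
docs = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["d0"]),
    TaggedDocument(words=["the", "dog", "chased", "the", "cat"], tags=["d1"]),
]

# dm=1 is the distributed-memory model: paragraph and word vectors are
# combined to predict the next word (dm_concat=1 would concatenate
# instead of averaging).
model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=40, dm=1)

print(model.dv["d0"])  # learned paragraph vector (a column of D)
# Inference for an unseen paragraph of any length:
print(model.infer_vector(["a", "cat", "and", "a", "dog"]))
```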
It hasn't been publicly released, so this is mostly speculation. Keep an eye out for it.
When Google farts 💨, the rest of the world 💩
<insert your idea here>
Word Mover’s Distance (WMD) is a special case of the Earth Mover’s Distance (EMD) metric.
EMD is a method to evaluate dissimilarity between two multi-dimensional distributions in some feature space, given a distance measure between single features, which we call the ground distance. The EMD 'lifts' this distance from individual features to full distributions.
State-of-the-art kNN document-classification accuracy, but the slowest metric to compute.
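A minimal WMD sketch using the POT optimal-transport library; the vocabulary, embeddings, and the `wmd` helper below are hypothetical stand-ins (real use would load pretrained word2vec vectors):

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install POT)

# Random stand-ins for real word embeddings.
rng = np.random.default_rng(0)
vocab = ["obama", "president", "speaks", "greets", "media", "press"]
emb = {w: rng.normal(size=50) for w in vocab}

def wmd(doc1, doc2):
    """EMD between two documents' normalized bag-of-words histograms,
    using Euclidean distance between word vectors as the ground distance."""
    w1, w2 = sorted(set(doc1)), sorted(set(doc2))
    a = np.array([doc1.count(w) for w in w1], float)
    b = np.array([doc2.count(w) for w in w2], float)
    a, b = a / a.sum(), b / b.sum()
    # Ground-distance matrix between individual words.
    M = np.array([[np.linalg.norm(emb[u] - emb[v]) for v in w2] for u in w1])
    return ot.emd2(a, b, M)  # optimal transport cost = WMD

print(wmd(["obama", "speaks", "media"], ["president", "greets", "press"]))
```

gensim exposes the same computation on real vectors as KeyedVectors.wmdistance.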