In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Fall 2020"
This notebook is the first in our series about creating effective distributed representations. The focus is on matrix designs, assessing similarity, and methods for matrix reweighting.
The central idea (which takes some getting used to!) is that we can represent words and phrases as dense vectors of real numbers. These take on meaning by being embedded in a larger matrix of representations with comparable structure.
Why build distributed representations? There are potentially many reasons. The two we will emphasize in this course:
Understanding words in context: There is value to linguists in seeing what these data-rich approaches can teach us about natural language lexicons, and there is value for social scientists in understanding how words are being used.
Feature representations for other models: As we will see, many models can benefit from representing examples as distributed representations.
The distributed representations we build will always be vectors of real numbers. The models are often called vector space models (VSMs).
Distributional representations are the special case where the data come entirely from co-occurrence counts in corpora.
We'll look at models that use supervised labels to obtain vector-based word representations. These aren't purely distributional, in that they take advantage of more than just co-occurrence patterns among items in the vocabulary, but they share the idea that words can be modeled with vectors.
If a neural network is used to train the representations, then they might be called neural representations.
The term word embedding is also used for distributed representations, including distributional ones. This term is a reminder that vector representations are meaningful only when embedded in and compared with others in a unified space (usually a matrix) of representations of the same type.
In any case, distributed representation seems like the most general cover term for what we're trying to achieve, and its only downside is that sometimes people think it has something to do with distributed databases.
Make sure your environment meets all the requirements for the cs224u repository. For help getting set up, see setup.ipynb.
Download the course data, unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change DATA_HOME below.)
In [2]:
%matplotlib inline
import numpy as np
import os
import pandas as pd
import vsm
import utils
In [3]:
utils.fix_random_seeds()
In [4]:
DATA_HOME = os.path.join('data', 'vsmdata')
There are many, many ways to define distributional matrices. Here's a schematic overview that highlights the major decisions for building a word $\times$ word matrix:
Define a notion of co-occurrence context. This could be an entire document, a paragraph, a sentence, a clause, an NP — whatever domain seems likely to capture the associations you care about.
Define a count scaling method. The simplest method just counts everything in the context window, giving equal weight to everything inside it. A common alternative is to scale the weights based on proximity to the target word – e.g., $1/d$, where $d$ is the distance in tokens from the target.
Scan through your corpus building a dictionary $d$ mapping word-pairs to co-occurrence values. Every time a pair of words $w$ and $w'$ occurs in the same context (as you defined it in 1), increment $d[(w, w')]$ by whatever value is determined by your weighting scheme. You'd increment by $1$ with the weighting scheme that simply counts co-occurrences.
Using the count dictionary $d$ that you collected in 3, establish your full vocabulary $V$, an ordered list of word types.
Now build a matrix $M$ of dimension $|V| \times |V|$. Both the rows and the columns of $M$ represent words. Each cell $M[i, j]$ is filled with the value $d[(w_{i}, w_{j})]$. (A code sketch of this recipe appears just below.)
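To make the recipe concrete, here is a minimal sketch, assuming the corpus is given as an iterable of token lists. The function name, its arguments, and the helper logic are illustrative rather than part of vsm.py; real implementations need to deal with tokenization, corpus scale, and sparsity more carefully.

from collections import defaultdict

import pandas as pd

def build_cooccurrence_matrix(corpus, window_size=5, vocab_size=5000):
    """Sketch of steps 1-5. `corpus` is an iterable of token lists."""
    d = defaultdict(float)    # step 3: (w, w') -> weighted co-occurrence value
    freqs = defaultdict(int)  # unigram counts, used to fix the vocabulary
    for toks in corpus:
        for i, w in enumerate(toks):
            freqs[w] += 1
            # Steps 1-2: look `window_size` tokens to the right and scale
            # each co-occurrence by 1/d, where d is the distance in tokens.
            for dist in range(1, window_size + 1):
                if i + dist >= len(toks):
                    break
                w2 = toks[i + dist]
                weight = 1.0 / dist   # use 1.0 here for flat counting
                d[(w, w2)] += weight
                d[(w2, w)] += weight  # keep the matrix symmetric
    # Step 4: the vocabulary is the `vocab_size` most frequent word types.
    vocab = sorted(freqs, key=freqs.get, reverse=True)[:vocab_size]
    # Step 5: fill the |V| x |V| matrix from the count dictionary.
    return pd.DataFrame(
        [[d[(w1, w2)] for w2 in vocab] for w1 in vocab],
        index=vocab, columns=vocab)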
The data distribution includes four matrices that we'll use for hands-on exploration. All of them were designed in the same basic way:
They are word $\times$ word matrices with 5K rows and 5K columns.
The vocabulary is the top 5K most frequent unigrams.
Two come from IMDB user-supplied movie reviews, and two come from Gigaword, a collection of newswire and newspaper text. Further details:
filename | source | window size | count weighting |
---|---|---|---|
imdb_window5-scaled.csv.gz | IMDB movie reviews | 5 | 1/d |
imdb_window20-flat.csv.gz | IMDB movie reviews | 20 | 1 |
giga_window5-scaled.csv.gz | Gigaword | 5 | 1/d |
giga_window20-flat.csv.gz | Gigaword | 20 | 1 |
Any hunches about how these matrices might differ from each other?
In [5]:
imdb5 = pd.read_csv(
os.path.join(DATA_HOME, 'imdb_window5-scaled.csv.gz'), index_col=0)
In [6]:
imdb20 = pd.read_csv(
os.path.join(DATA_HOME, 'imdb_window20-flat.csv.gz'), index_col=0)
In [7]:
giga5 = pd.read_csv(
os.path.join(DATA_HOME, 'giga_window5-scaled.csv.gz'), index_col=0)
In [8]:
giga20 = pd.read_csv(
os.path.join(DATA_HOME, 'giga_window20-flat.csv.gz'), index_col=0)
Vector comparisons form the heart of our analyses in this context.
For the most part, we are interested in measuring the distance between vectors. The guiding idea is that semantically related words should be close together in the vector spaces we build, and semantically unrelated words should be far apart.
The scipy.spatial.distance module has a lot of vector comparison methods, so you might check them out if you want to go beyond the functions defined and explored here. Read the documentation closely, though: many of those methods are defined only for binary vectors, whereas the VSMs we'll use allow all float values.
The most basic and intuitive distance measure between vectors is Euclidean distance. The Euclidean distance between two vectors $u$ and $v$ of dimension $n$ is

$$\textbf{euclidean}(u, v) = \sqrt{\sum_{i=1}^{n}|u_{i} - v_{i}|^{2}}$$

In two dimensions, this corresponds to the length of the most direct line between the two points.
In vsm.py, the function euclidean just uses the corresponding scipy.spatial.distance method to define it.
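As a sketch, such a thin wrapper might look like this (the exact code in vsm.py may differ in details):

import scipy.spatial.distance

def euclidean(u, v):
    """Euclidean distance between the 1d arrays u and v."""
    return scipy.spatial.distance.euclidean(u, v)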
Here's the tiny vector space from the screencast on vector comparisons associated with this notebook:
In [9]:
ABC = pd.DataFrame([
[ 2.0, 4.0],
[10.0, 15.0],
[14.0, 10.0]],
index=['A', 'B', 'C'],
columns=['x', 'y'])
In [10]:
ABC
Out[10]:
In [11]:
def plot_ABC(df):
ax = df.plot.scatter(x='x', y='y', marker='.', legend=False)
m = df.values.max(axis=None)
ax.set_xlim([0, m*1.2])
ax.set_ylim([0, m*1.2])
for label, row in df.iterrows():
ax.text(row['x'], row['y'], label)
In [12]:
plot_ABC(ABC)
The euclidean distances align well with raw visual distance in the plot:
In [13]:
def abc_comparisons(df, distfunc):
for a, b in (('A', 'B'), ('B', 'C')):
dist = distfunc(df.loc[a], df.loc[b])
print('{0:}({1:}, {2:}) = {3:7.02f}'.format(
distfunc.__name__, a, b, dist))
In [14]:
abc_comparisons(ABC, vsm.euclidean)
However, suppose we think of the vectors as word meanings in the vector-space sense. In that case, the values don't look good:
The distributions of B and C are more or less directly opposed, suggesting very different meanings, whereas A and B are rather closely aligned, abstracting away from the fact that the first is far less frequent than the second.
In terms of the large models we will soon explore, A and B resemble a pair like superb and good, which have similar meanings but very different frequencies.
In contrast, B and C are like good and disappointing — similar overall frequencies but different distributions with respect to the overall vocabulary.
These affinities are immediately apparent if we normalize the vectors by their length. To do this, we first define the L2-length of a vector:
$$\|u\|_{2} = \sqrt{\sum_{i=1}^{n} u_{i}^{2}}$$

And then the normalization step just divides each value by this quantity:
$$\left[ \frac{u_{1}}{\|u\|_{2}}, \frac{u_{2}}{\|u\|_{2}}, \ldots \frac{u_{n}}{\|u\|_{2}} \right]$$
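As a sketch, these two steps can be written directly in numpy; vsm.vector_length and vsm.length_norm behave this way, though the library versions may differ in details:

import numpy as np

def vector_length(u):
    """L2 length of the 1d array u."""
    return np.sqrt(np.sum(u ** 2))

def length_norm(u):
    """Divide each value of u by the L2 length of u."""
    return u / vector_length(u)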
In [15]:
ABC_normed = ABC.apply(vsm.length_norm, axis=1)
In [16]:
plot_ABC(ABC_normed)
In [17]:
abc_comparisons(ABC_normed, vsm.euclidean)
Here, the connection between A and B is more apparent, as is the opposition between B and C.
Cosine distance takes overall length into account. The cosine distance between two vectors $u$ and $v$ of dimension $n$ is
$$\textbf{cosine}(u, v) = 1 - \frac{\sum_{i=1}^{n} u_{i} \cdot v_{i}}{\|u\|_{2} \cdot \|v\|_{2}}$$

The similarity part of this (the right-hand term of the subtraction) measures the cosine of the angle between the two vectors. The result is the same (in terms of rank order) as one gets from first normalizing both vectors using $\|\cdot\|_{2}$ and then calculating their Euclidean distance.
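A direct transcription of the formula, as a sketch (vsm.cosine itself may be implemented differently, e.g., via scipy):

import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine of the angle between the 1d arrays u and v."""
    similarity = np.dot(u, v) / (np.sqrt(np.dot(u, u)) * np.sqrt(np.dot(v, v)))
    return 1.0 - similarity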
In [18]:
abc_comparisons(ABC, vsm.cosine)
So, by building in the length normalization, cosine distance achieves our goal of associating A with B and separating both from C.
Matching-based methods are also common in the literature. The basic matching measure effectively creates a vector consisting of the smaller of the two values at each coordinate and then sums them:

$$\textbf{matching}(u, v) = \sum_{i=1}^{n} \min(u_{i}, v_{i})$$

This is implemented in vsm as matching.
One approach to normalizing the matching values is the Jaccard coefficient. The numerator is the matching coefficient. The denominator — the normalizer — is intuitively like the set union: for binary vectors, it gives the cardinality of the union of the two being compared:
$$\textbf{jaccard}(u, v) = 1 - \frac{\textbf{matching}(u, v)}{\sum_{i=1}^{n} \max(u_{i}, v_{i})}$$
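Direct transcriptions of these two formulas, as a sketch (the versions in vsm may differ in details):

import numpy as np

def matching(u, v):
    """Sum of the coordinate-wise minima of u and v."""
    return np.sum(np.minimum(u, v))

def jaccard(u, v):
    """Jaccard distance: 1 minus matching(u, v) over the sum of coordinate-wise maxima."""
    return 1.0 - matching(u, v) / np.sum(np.maximum(u, v))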
In [19]:
for m in (vsm.euclidean, vsm.cosine, vsm.jaccard):
fmt = {
'n': m.__name__,
'AB': m(ABC.loc['A'], ABC.loc['B']),
'BC': m(ABC.loc['B'], ABC.loc['C'])}
print('{n:>15}(A, B) = {AB:5.2f} {n:>15}(B, C) = {BC:5.2f}'.format(**fmt))
The neighbors function in vsm is an investigative aid. For a given word w, it ranks all the words in the vocabulary according to their distance from w, as measured by distfunc (default: vsm.cosine).
By playing around with this function, you can start to get a sense for how the distance functions differ. Here are some example uses; you might try some new words to get a feel for what these matrices are like and how different words look.
In [20]:
vsm.neighbors('A', ABC, distfunc=vsm.euclidean)
Out[20]:
In [21]:
vsm.neighbors('A', ABC, distfunc=vsm.cosine)
Out[21]:
In [22]:
vsm.neighbors('good', imdb5, distfunc=vsm.euclidean).head()
Out[22]:
In [23]:
vsm.neighbors('good', imdb20, distfunc=vsm.euclidean).head()
Out[23]:
In [24]:
vsm.neighbors('good', imdb5, distfunc=vsm.cosine).head()
Out[24]:
In [25]:
vsm.neighbors('good', imdb20, distfunc=vsm.cosine).head()
Out[25]:
In [26]:
vsm.neighbors('good', giga20, distfunc=vsm.cosine).head()
Out[26]:
The goal of reweighting is to amplify the important, trustworthy, and unusual, while deemphasizing the mundane and the quirky.
Absent a defined objective function, this will remain fuzzy, but the intuition behind moving away from raw counts is that frequency is a poor proxy for our target semantic ideas.
Normalization (row-wise or column-wise) is perhaps the simplest form of reweighting. With vsm.length_norm, we normalize using vsm.vector_length. We can also normalize each row by the sum of its values, which turns each row into a probability distribution over the columns:
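For example, row-wise probability normalization can be sketched like this with pandas (the helper name prob_norm is illustrative, not necessarily part of vsm.py):

def prob_norm(u):
    """Normalize u so that its values sum to 1."""
    return u / u.sum()

# Each row of the result is a probability distribution over the columns:
imdb5_probnormed = imdb5.apply(prob_norm, axis=1)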
These normalization measures are insensitive to the magnitude of the underlying counts. This is often a mistake in the messy world of large data sets; $[1,10]$ and $[1000,10000]$ are very different vectors in ways that will be partly or totally obscured by normalization.
Reweighting by observed-over-expected values captures one of the central patterns in all of VSMs: we can adjust the actual cell value in a co-occurrence matrix using information from the corresponding row and column.
In the case of observed-over-expected, the rows and columns define our expectation about what the cell value would be if the two co-occurring words were independent. In dividing the observed count by this value, we amplify cells whose values are larger than we would expect.
So that this doesn't look more complex than it is, for an $m \times n$ matrix $X$, define
$$\textbf{rowsum}(X, i) = \sum_{j=1}^{n}X_{ij}$$

$$\textbf{colsum}(X, j) = \sum_{i=1}^{m}X_{ij}$$

$$\textbf{sum}(X) = \sum_{i=1}^{m}\sum_{j=1}^{n} X_{ij}$$

$$\textbf{expected}(X, i, j) = \frac{\textbf{rowsum}(X, i) \cdot \textbf{colsum}(X, j)}{\textbf{sum}(X)}$$

Then the observed-over-expected value is

$$\textbf{oe}(X, i, j) = \frac{X_{ij}}{\textbf{expected}(X, i, j)}$$

In many contexts, it is more intuitive to first normalize the count matrix into a joint probability table and then think of $\textbf{rowsum}$ and $\textbf{colsum}$ as probabilities. Then it is clear that we are comparing the observed joint probability with what we would expect it to be under a null hypothesis of independence. These normalizations do not affect the final results, though.
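As a sanity check on these definitions, here is a direct numpy transcription, offered as a sketch; the notebook itself uses vsm.observed_over_expected below:

import numpy as np

def observed_over_expected_sketch(X):
    """X: a 2d array of co-occurrence counts."""
    rowsums = X.sum(axis=1, keepdims=True)   # rowsum(X, i), shape (m, 1)
    colsums = X.sum(axis=0, keepdims=True)   # colsum(X, j), shape (1, n)
    expected = rowsums * colsums / X.sum()   # expected(X, i, j)
    return X / expected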
Let's do a quick worked-out example. Suppose we have the count matrix $X$ =
. | a | b | rowsum |
---|---|---|---|
x | 34 | 11 | 45 |
y | 47 | 7 | 54 |
colsum | 81 | 18 | 99 |
Then we calculate like this (indexing rows and columns from 0):

$$\textbf{oe}(X, 1, 0) = \frac{47}{\frac{54 \cdot 81}{99}} = 1.06$$

And the full table looks like this:
. | a | b |
---|---|---|
x | 0.92 | 1.34 |
y | 1.06 | 0.71 |
In [27]:
oe_ex = np.array([[ 34., 11.], [ 47., 7.]])
vsm.observed_over_expected(oe_ex).round(2)
Out[27]:
The implementation vsm.observed_over_expected should be pretty efficient.
In [28]:
imdb5_oe = vsm.observed_over_expected(imdb5)
In [29]:
imdb20_oe = vsm.observed_over_expected(imdb20)
In [30]:
vsm.neighbors('good', imdb5_oe).head()
Out[30]:
In [31]:
vsm.neighbors('good', imdb20_oe).head()
Out[31]:
Pointwise Mutual Information (PMI) is observed-over-expected in log-space:
$$\textbf{pmi}(X, i, j) = \log\left(\frac{X_{ij}}{\textbf{expected}(X, i, j)}\right)$$

This basic definition runs into a problem for $0$-count cells. The usual response is to set $\log(0) = 0$, but this is arguably confusing – cell counts that are smaller than expected get negative values, cell counts that are larger than expected get positive values, and 0-count values are placed in the middle of this ranking without real justification.
For this reason, it is more typical to use Positive PMI, which maps all negative PMI values to $0$:
$$\textbf{ppmi}(X, i, j) = \begin{cases} \textbf{pmi}(X, i, j) & \textrm{if } \textbf{pmi}(X, i, j) > 0 \\ 0 & \textrm{otherwise} \end{cases}$$

This is the default for vsm.pmi.
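A numpy sketch of this calculation (the version in vsm.pmi may differ in details, e.g., in how it handles empty rows or columns):

import numpy as np

def ppmi_sketch(X, positive=True):
    """Positive PMI for a 2d count array X."""
    rowsums = X.sum(axis=1, keepdims=True)
    colsums = X.sum(axis=0, keepdims=True)
    expected = rowsums * colsums / X.sum()
    oe = X / expected
    with np.errstate(divide='ignore'):
        pmi = np.log(oe)
    pmi[np.isinf(pmi)] = 0.0           # the log(0) = 0 convention
    if positive:
        pmi = np.maximum(pmi, 0.0)     # map negative PMI values to 0
    return pmi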
In [32]:
imdb5_pmi = vsm.pmi(imdb5)
In [33]:
imdb20_pmi = vsm.pmi(imdb20)
In [34]:
vsm.neighbors('good', imdb5_pmi).head()
Out[34]:
In [35]:
vsm.neighbors('good', imdb20_pmi).head()
Out[35]:
In [36]:
giga20_pmi = vsm.pmi(giga20)
In [37]:
vsm.neighbors('market', giga20_pmi).head()
Out[37]:
Perhaps the best-known reweighting scheme is Term Frequency–Inverse Document Frequency (TF-IDF), which is, I believe, still the backbone of today's Web search technologies. As the name suggests, it is built from TF and IDF measures:
For an $m \times n$ matrix $X$:
$$\textbf{TF}(X, i, j) = \frac{X_{ij}}{\textbf{colsum}(X, j)}$$

$$\textbf{IDF}(X, i, j) = \log\left(\frac{n}{|\{k : X_{ik} > 0\}|}\right)$$

$$\textbf{TF-IDF}(X, i, j) = \textbf{TF}(X, i, j) \cdot \textbf{IDF}(X, i, j)$$

TF-IDF generally performs best with sparse matrices. It severely punishes words that appear in many documents; if a word appears in every document, then its IDF value is 0. As a result, it can even be problematic with very dense word $\times$ word matrices like ours, where most words appear with most other words.
There is an implementation of TF-IDF for dense matrices in vsm.tfidf.
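A sketch of the above definitions in numpy, for an array with words as rows and documents as columns (vsm.tfidf is the version to use in practice, and it may differ in details):

import numpy as np

def tfidf_sketch(X):
    """X: a 2d array with words as rows and documents as columns."""
    colsums = X.sum(axis=0, keepdims=True)     # colsum(X, j)
    tf = X / colsums                           # TF(X, i, j)
    n = X.shape[1]
    df = (X > 0).sum(axis=1, keepdims=True)    # |{k : X_ik > 0}| per row
    idf = np.log(n / df)                       # IDF, constant across each row
    return tf * idf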
Important: sklearn's version, TfidfTransformer, assumes that term frequency (TF) is defined row-wise and document frequency is defined column-wise. That is, it assumes sklearn's document $\times$ word basic design, which makes sense for classification tasks, where the design is example $\times$ features. This is the transpose of the way we've been thinking.
Schütze (1993) pioneered the use of subword information to improve representations by reducing sparsity, thereby increasing the density of connections in a VSM. In recent years, this idea has shown value in numerous contexts.
Bojanowski et al. (2016) (the fastText team) explore a particularly straightforward approach to doing this: represent each word as the sum of the representations for the character-level n-grams it contains.
It is simple to derive character-level n-gram representations from our existing VSMs. The function vsm.ngram_vsm implements the basic step. Here, we create the 4-gram version of imdb5:
In [38]:
imdb5_ngrams = vsm.ngram_vsm(imdb5, n=4)
In [39]:
imdb5_ngrams.shape
Out[39]:
This has the same column dimension as imdb5, but the rows are expanded with all the 4-grams, including boundary symbols <w> and </w>.
vsm.character_level_rep is a simple function for creating new word representations from the associated character-level ones. Many variations on that function are worth trying – for example, you could include the original word vector where available, change the aggregation method from sum to something else, use a real morphological parser instead of just n-grams, and so on.
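The underlying idea can be sketched as follows. The boundary handling and the exact n-gram string format assumed here are illustrative, and the details in vsm.py may differ:

def char_ngrams(word, n=4):
    """Character n-grams of `word`, padded with boundary symbols."""
    chars = ["<w>"] + list(word) + ["</w>"]
    return ["".join(chars[i: i + n]) for i in range(len(chars) - n + 1)]

def character_level_rep_sketch(word, cf, n=4):
    """Sum the rows of a character-level VSM `cf` (e.g., imdb5_ngrams)
    for the n-grams of `word`, skipping n-grams not in cf's index."""
    ngrams = [ng for ng in char_ngrams(word, n=n) if ng in cf.index]
    return cf.loc[ngrams].sum(axis=0)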
One very powerful thing about this is that we can represent words that are not even in the original VSM:
In [40]:
'superbly' in imdb5.index
Out[40]:
In [41]:
superbly = vsm.character_level_rep("superbly", imdb5_ngrams)
In [42]:
superb = vsm.character_level_rep("superb", imdb5_ngrams)
In [43]:
vsm.cosine(superb, superbly)
Out[43]:
You can begin to get a feel for what your matrix is like by poking around with vsm.neighbors to see who is close to or far from whom.
It's very useful to complement this with the more holistic view one can get from looking at a visualization of the entire vector space.
Of course, any visualization will have to be much, much lower dimension than our actual VSM, so we need to proceed cautiously, balancing the high-level view with more fine-grained exploration.
We won't have time this term to cover VSM visualization in detail. scikit-learn has a bunch of functions for doing this in sklearn.manifold, and the user guide for that package is detailed.
It's also worth checking out the online TensorFlow Embedding Projector tool, which includes a fast implementation of t-SNE.
In addition, vsm.tsne_viz is a wrapper around sklearn.manifold.TSNE that handles the basic preprocessing and layout for you. t-SNE stands for t-Distributed Stochastic Neighbor Embedding, a powerful method for visualizing high-dimensional vector spaces in 2d. See also Multiple Maps t-SNE.
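If you want to work with sklearn directly instead, a minimal sketch looks like this; the parameter values and the choice of words to label are illustrative:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the reweighted matrix down to 2d:
coords = TSNE(n_components=2, random_state=42).fit_transform(imdb20_pmi.values)

# Plot the points and label a few words to keep the figure readable:
fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(coords[:, 0], coords[:, 1], s=2)
for word in ['good', 'bad', 'movie']:
    if word in imdb20_pmi.index:   # label only words in the vocabulary
        i = imdb20_pmi.index.get_loc(word)
        ax.annotate(word, coords[i])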
In [44]:
vsm.tsne_viz(imdb20_pmi, random_state=42)