Let us see how we can process textual information to create vector representations, also known as word embeddings or word vectors, which can be used as input to a neural network.
This is the simplest approach, often called a one-hot encoding: for each word we create a vector of length equal to the size of the vocabulary, $R^{|V|}$. We put a $1$ at the index of the word and $0$s everywhere else.
$$W^{apple} = \begin{bmatrix} 1 \\ \vdots \\ \vdots \\ \vdots \\ 0 \\ \end{bmatrix} W^{banana} = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ \vdots \\ 0 \\ \end{bmatrix} W^{king} = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ \end{bmatrix} W^{queen} = \begin{bmatrix} 0 \\ \vdots \\ \vdots \\ 1 \\ 0 \\ \end{bmatrix}$$
All these vectors are orthogonal to each other, so this representation doesn't encode any relationship between words:
$$(W^{apple})^TW^{banana}=(W^{king})^TW^{queen}=0$$
Also, each vector is extremely sparse: a single $1$ among $|V|-1$ zeros. Hence this approach requires a lot of space to encode all our words in vector form.
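To make the idea concrete, here is a minimal sketch using a hypothetical 4-word vocabulary (the names vocab and onehot are introduced only for this illustration):

vocab = ["apple", "banana", "king", "queen"];    # hypothetical toy vocabulary
onehot(word) = Float64.(vocab .== word);         # 1 at the word's index, 0 everywhere else
W_apple  = onehot("apple");
W_banana = onehot("banana");
W_apple' * W_banana                              # 0.0: the dot product reveals no similarity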
"You shall know a word by the company it keeps." (Firth, J. R. 1957:11)
In this approach, we create a matrix where each column represents a document and each row holds the frequency of a word in that document. This matrix grows with the number of documents $D$, so its size is $R^{|V| \times |D|}$, where $V$ is the size of the vocabulary.
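A rough sketch of how such a word-document count matrix could be assembled, assuming whitespace tokenization and two hypothetical documents:

docs = ["i like banana", "i enjoy driving"];     # hypothetical documents
vocab = unique(vcat(split.(docs)...));           # vocabulary collected from all documents
M = zeros(Int, length(vocab), length(docs));     # |V| x |D| word-document count matrix
for (j, doc) in enumerate(docs), w in split(doc)
    M[findfirst(isequal(w), vocab), j] += 1      # row = word, column = document
end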
In this case, we build a co-occurrence matrix where both the columns and the rows represent words from the vocabulary. The benefit of building this matrix is that the co-occurrence count of words which are likely to appear together in a sentence will always be high compared to that of words which rarely appear together, so a decently sized dataset (or collection of documents) is enough. Also, the size of the matrix now depends only on the size of the vocabulary: $R^{|V| \times |V|}$.
The beauty of the last two approaches is that we can apply Singular Value Decomposition (SVD) to the matrix and further reduce the dimensionality. Let us see an example on the word-word matrix.
Consider our data to consist of the following 3 sentences: "I enjoy driving.", "I like banana." and "I like reading."
Treating the full stop as its own token and counting only immediately adjacent words (a context window of size 1), the co-occurrence matrix will look like:
$$X = \begin{array}{c|ccccccc} \text{words} & \text{I} & \text{enjoy} & \text{driving} & \text{like} & \text{banana} & \text{reading} &\text{.}\\ \hline \text{I} & 0 & 1 & 0 & 2 & 0 & 0 & 0 \\ \text{enjoy} & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\ \text{driving} & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\ \text{like} & 2 & 0 & 0 & 0 & 1 & 1 & 0 \\ \text{banana} & 0 & 0 & 0 & 1 & 0 & 0 & 1 \\ \text{reading} & 0 & 0 & 0 & 1 & 0 & 0 & 1 \\ \text{.} & 0 & 0 & 1 & 0 & 1 & 1 & 0 \\ \end{array} $$
In [2]:
words = ["I" "enjoy" "driving" "like" "banana" "reading" "."];   # vocabulary, in the order used above

# Co-occurrence counts copied from the table above
X = [0 1 0 2 0 0 0;
     1 0 1 0 0 0 0;
     0 1 0 0 0 0 1;
     2 0 0 0 1 1 0;
     0 0 0 1 0 0 1;
     0 0 0 1 0 0 1;
     0 0 1 0 1 1 0];
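Rather than typing the counts by hand, the same matrix could also be computed from the sentences. A minimal sketch, assuming whitespace tokenization (the full stop is written as a separate token) and a context window of size 1; the helper idx is introduced only for this illustration:

sentences = ["I enjoy driving .", "I like banana .", "I like reading ."];
idx(w) = findfirst(isequal(w), vec(words));      # map a token to its row/column index
C = zeros(Int, length(words), length(words));
for sent in split.(sentences), i in 1:length(sent)-1
    C[idx(sent[i]), idx(sent[i+1])] += 1         # count each adjacent pair in both directions
    C[idx(sent[i+1]), idx(sent[i])] += 1
end
C == X                                           # should reproduce the matrix above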
In Julia, applying SVD to our matrix $X$ gives us $U$, $S$ and $V$, where $X = U \Sigma V^T$: the columns of $U$ and $V$ are orthonormal and $S$ is the vector of singular values (the diagonal of $\Sigma$) in decreasing order.
In [3]:
using LinearAlgebra   # provides svd on Julia ≥ 0.7 (older versions have it in Base)
U, S, V = svd(X);
In [4]:
U
Out[4]:
In [5]:
S
Out[5]:
In [6]:
V
Out[6]:
"A useful rule of thumb is to retain enough singular values to make up 90% of the energy in Σ. That is, the sum of the squares of the retained singular values should be at least 90% of the sum of the squares of all the singular values." - Jeffrey D. Ullman
The vector $S$ returned by Julia holds the diagonal of $\Sigma$ (the singular values), hence the total energy here is:
In [7]:
totEnergy = sum(S.^2)
Out[7]:
In [8]:
# Cumulative fraction of the total energy retained by the first i singular values
energy = zeros(length(S));
energy[1] = S[1]^2/totEnergy;
for i = 2:length(S)
    energy[i] = energy[i-1] + (S[i]^2/totEnergy);
end
energy
Out[8]:
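The same cumulative fractions can also be computed in one line with cumsum, which should give an equivalent energy vector (up to floating-point rounding):

energy = cumsum(S.^2) ./ totEnergy;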
In [9]:
using PyPlot
In [10]:
plot(1:length(energy), energy)
xlabel("Dimensions")
ylabel("% Energy Retained")
grid("on")
Looking at the plot we can determine that keeping 4 dimensions is good enough for us rather than keeping all of them. We can also print/plot the words based on the first two columns of $U$, which correspond to the two biggest singular values.
In [11]:
# Keep only the 4 largest singular values: the rank-4 approximation of X
Y = U[:, 1:4] * Diagonal(S[1:4]) * V[:, 1:4]'
Out[11]:
In [12]:
U,S,V = svd(Y);
In [13]:
U
Out[13]:
In [14]:
for w = 1:length(words)
    text(U[w,1], U[w,2], words[w]);              # place each word at its 2-d coordinates
end
xlim((minimum(U[:,1])-1, maximum(U[:,1])+1));
ylim((minimum(U[:,2])-1, maximum(U[:,2])+1));
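As a quick sanity check of the reduced representation, we can treat the rows of $U$ as word vectors and compare them with cosine similarity; cossim below is a small helper written only for this illustration. Since banana and reading have identical rows in $X$, their vectors should come out essentially identical:

cossim(a, b) = dot(a, b) / (norm(a) * norm(b));  # cosine similarity between two vectors
vecs = U[:, 1:2];                                # one 2-d vector per word, in the order of words above
cossim(vecs[5, :], vecs[6, :])                   # banana vs reading: expected ≈ 1
cossim(vecs[5, :], vecs[3, :])                   # banana vs driving, for comparison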
In the coming posts, I'll write about more interesting ways of generating word vectors.