In [1]:
%matplotlib inline
from __future__ import division, print_function
import utils
from utils import *


WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
Using Theano backend.

In [2]:
path = 'data/glove/'
res_path = path+'results/'

Preprocessing

This section shows how we processed the original GloVe text files. However, there's no need for you to run this yourself, since we provide the pre-processed GloVe data.


In [3]:
def get_glove(name):
    # each line of the text file is a word followed by its vector components
    with open(path + 'glove.' + name + '.txt', 'r') as f:
        lines = [line.split() for line in f]
    words = [d[0] for d in lines]
    vecs = np.stack([np.array(d[1:], dtype=np.float32) for d in lines])
    wordidx = {o:i for i,o in enumerate(words)}
    save_array(res_path+name+'.dat', vecs)
    pickle.dump(words, open(res_path+name+'_words.pkl','wb'))
    pickle.dump(wordidx, open(res_path+name+'_idx.pkl','wb'))
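
(save_array and load_array come from the course's utils.py. For reference, here's a minimal sketch of what they do, assuming the standard fast.ai definitions built on bcolz.)


In [ ]:
import bcolz

def save_array(fname, arr):
    # write the array to a compressed on-disk bcolz carray
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

def load_array(fname):
    # read the carray back into memory as a regular numpy array
    return bcolz.open(fname)[:]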

In [4]:
get_glove('6B.50d')
get_glove('6B.100d')
get_glove('6B.200d')
get_glove('6B.300d')


Looking at the vectors

After you've downloaded the pre-processed GloVe data, untar it with tar -zxf and put the files in the directory that res_path points to. (If you don't have a fast internet connection, feel free to download only the 50d version, since that's what we'll be using in class.)
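
For example, you could run something like this from within the notebook; the archive name below is a placeholder, so substitute whatever file you actually downloaded:


In [ ]:
!mkdir -p data/glove/results
!tar -zxf 6B.50d.tgz -C data/glove/results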

Then the following function will return the word vectors as a matrix, the word list, and the mapping from word to index.


In [5]:
def load_glove(loc):
    return (load_array(loc+'.dat'),
        pickle.load(open(loc+'_words.pkl','rb')),
        pickle.load(open(loc+'_idx.pkl','rb')))

In [6]:
vecs, words, wordidx = load_glove(res_path+'6B.50d')
vecs.shape


Here are the first 25 "words" in GloVe.


In [12]:
' '.join(words[:25])


Out[12]:
'the , . of to and in a " \'s for - that on is was said with he as it by at ( )'

This is how you can look up a word vector.


In [13]:
def w2v(w): return vecs[wordidx[w]]

In [14]:
w2v('of')


Out[14]:
array([ 0.7085,  0.5709, -0.4716,  0.1805,  0.5445,  0.726 ,  0.1816, -0.5239,  0.1038, -0.1757,
        0.0789, -0.3622, -0.1183, -0.8334,  0.1192, -0.1661,  0.0616, -0.0127, -0.5662,  0.0136,
        0.2285, -0.144 , -0.0675, -0.3816, -0.237 , -1.7037, -0.8669, -0.267 , -0.2589,  0.1767,
        3.8676, -0.1613, -0.1327, -0.6888,  0.1844,  0.0052, -0.3387, -0.079 ,  0.2419,  0.3658,
       -0.3473,  0.2848,  0.0757, -0.0622, -0.3899,  0.229 , -0.2162, -0.2256, -0.0939, -0.8037], dtype=float32)
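
A quick way to sanity-check the vectors (a sketch, not part of the original notebook) is cosine similarity: words that appear in similar contexts have similar vectors, so related words should score close to 1.


In [ ]:
def cos_sim(a, b):
    # cosine of the angle between the two word vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cos_sim(w2v('of'), w2v('to'))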

Just for fun, let's take a look at a 2d projection of the first 350 words, using t-SNE; words with similar vectors should end up near each other in the plot.


In [16]:
# Python 2 workaround: the pickled word list contains non-ASCII (UTF-8) tokens,
# which the default 'ascii' codec can't decode
reload(sys)
sys.setdefaultencoding('utf8')
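
(That hack is Python 2 only. Under Python 3, where sys.setdefaultencoding no longer exists, the equivalent fix, sketched below, is to pass an explicit encoding to pickle.load when reading the Python-2-era pickles; the GloVe tokens are UTF-8.)


In [ ]:
with open(res_path+'6B.50d_words.pkl', 'rb') as f:
    words = pickle.load(f, encoding='utf8')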

In [31]:
# project vectors for the 500 most frequent words down to 2 dimensions
tsne = TSNE(n_components=2, random_state=0)
Y = tsne.fit_transform(vecs[:500])

start=0; end=350
dat = Y[start:end]
plt.figure(figsize=(15,15))
plt.scatter(dat[:, 0], dat[:, 1])
for label, x, y in zip(words[start:end], dat[:, 0], dat[:, 1]):
    plt.text(x, y, label, color=np.random.rand(3)*0.7, fontsize=14)
plt.show()

[Output: scatter plot of the t-SNE projection, with each of the 350 words drawn at its 2d location]