Creating a custom Word2Vec embedding on your data

This notebook illustrates:

  1. Creating a training dataset
  2. Running word2vec
  3. Examining the created embedding
  4. Exporting the embedding into a file you can use in other models
  5. Training the text classification model of [txtcls2.ipynb](txtcls2.ipynb) with this custom embedding.

In [1]:
# change these to try this notebook out
BUCKET = 'alexhanna-dev-ml'
PROJECT = 'alexhanna-dev'
REGION = 'us-central1'

In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

Creating a training dataset

The training dataset consists simply of words, separated by spaces, extracted from your documents. The words appear in the order in which they occur in the documents, and words from successive documents are appended together; in other words, there is no document separator.

The only preprocessing I do is to replace anything that is not a letter, dollar sign, or hyphen with a space, and to lowercase everything.
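To illustrate, here is roughly what that preprocessing does, expressed locally with Python's re module (a minimal sketch; in the notebook itself the cleanup happens inside the BigQuery query below):

import re

def clean(s):
    # Mirror the REGEXP_REPLACE in the query below: keep letters, spaces,
    # '$' and '-', replace everything else with a space, then lowercase.
    return re.sub(r'[^a-zA-Z $-]', ' ', s).lower()

print(clean("Ask HN: Can Google aggregate everything you've read?"))
# -> 'ask hn  can google aggregate everything you ve read '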

Recall that word2vec is unsupervised. There is no label.


In [3]:
import google.datalab.bigquery as bq

query="""
SELECT
  CONCAT( LOWER(REGEXP_REPLACE(title, '[^a-zA-Z $-]', ' ')), 
  " ", 
  LOWER(REGEXP_REPLACE(text, '[^a-zA-Z $-]', ' '))) AS text
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  LENGTH(title) > 100
  AND LENGTH(text) > 100
"""

df = bq.Query(query).execute().result().to_dataframe()

In [4]:
df[:5]


Out[4]:
text
0 reddit bookmarklets allow web site owners to c...
1 why not let online ads fight it out in a geome...
2 smashing the clock bestbuy s location and ho...
3 ask hn can google aggregate everything you ve...
4 ask yc think out loud - like twitter justi...

In [5]:
with open('word2vec/words.txt', 'w') as ofp:
  # Append every document to one space-separated stream (no document separator).
  for txt in df['text']:
    ofp.write(txt + " ")

This is what the resulting file looks like:


In [6]:
!cut -c-1000 word2vec/words.txt


reddit bookmarklets allow web site owners to cheat to get mostly up votes  simple realistic example given   the idea is to associate a positive link and a negative link with your site  you would submit both to reddit  p based on the user s experience  you would switch him her to the positive negative link  p that way  happy users would vote up the positive link while unhappy users would vote down the negative link   your site now has a better chance of making the front page  p as an example  suppose your site has a game puzzle  p when the user visits the site via the positive or negative link  you redirect to the negative link  p if the user plays several levels of the game puzzle  then he she probably likes it and then you can switch him her to the positive link  why not let online ads fight it out in a geometric real-time game played by advertisers and consumers  the advertiser may display his her ad along with all the other ads currently on display    p larger ads have the disadvant

Running word2vec

We can run the existing TensorFlow word2vec tutorial code as-is; the only extra step is to compile the custom op that it uses.
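As a quick reminder of what that code trains on: the skip-gram model slides a window over the word stream and pairs each center word with the words around it. Here is a toy sketch of the (center, context) pairs that --window_size=2 produces (illustrative only; the real sampling happens inside the compiled word2vec op):

def skipgram_pairs(words, window_size=2):
    # Pair each word with its neighbors up to window_size positions away.
    for i, center in enumerate(words):
        for j in range(max(0, i - window_size), min(len(words), i + window_size + 1)):
            if j != i:
                yield center, words[j]

print(list(skipgram_pairs("vote down the negative link".split())))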


In [7]:
%%bash
cd word2vec
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
g++ -std=c++11 \
  -shared word2vec_ops.cc word2vec_kernels.cc \
  -o word2vec_ops.so -fPIC ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} \
  -O2 -D_GLIBCXX_USE_CXX11_ABI=0

#   -I/usr/local/lib/python2.7/dist-packages/tensorflow/include/external/nsync/public \


/usr/local/envs/py3env/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/usr/local/envs/py3env/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

The actual evaluation dataset doesn't matter much; we just need to make sure that some of the words in the eval file also appear in the training data. The analogy dataset is of the form

Athens Greece Cairo Egypt
Baghdad Iraq Beijing China

i.e., four words per line, where the model is supposed to predict the fourth word given the first three. Here, we'll just make up a junk file.
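For reference, analogy evaluation works by vector arithmetic: given the first three words a, b, c, the model predicts the vocabulary word whose embedding is closest to vec(b) - vec(a) + vec(c). A minimal numpy sketch of that idea (the embeddings dict here is a hypothetical word-to-vector mapping; word2vec.py does this internally):

import numpy as np

def predict_fourth(a, b, c, embeddings):
    # embeddings: dict mapping word -> 1-D numpy array (hypothetical example data)
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = np.dot(target, vec) / np.linalg.norm(vec)  # cosine similarity
        if sim > best_sim:
            best, best_sim = word, sim
    return best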


In [8]:
%%writefile word2vec/junk.txt
: analogy-questions-ignored
the user plays several levels
of the game puzzle
vote down the negative


Writing word2vec/junk.txt

In [9]:
%%bash
cd word2vec
rm -rf trained
# Train a small 10-dimensional embedding; min_count=1 keeps every word since our dataset is tiny.
python word2vec.py \
   --train_data=./words.txt --eval_data=./junk.txt --save_path=./trained \
   --min_count=1 --embedding_size=10 --window_size=2


Data file:  ./words.txt
Vocab size:  889  + UNK
Words per epoch:  2912
Eval analogy file:  ./junk.txt
Questions:  2
Skipped:  1
Epoch    1 Step      185: lr = 0.191 loss =  23.78 words/sec =      199
Eval    0/2 accuracy =  0.0%
Epoch    2 Step      545: lr = 0.181 loss =  11.52 words/sec =      402
Eval    0/2 accuracy =  0.0%
Epoch    3 Step      905: lr = 0.172 loss =   9.14 words/sec =      404
Eval    0/2 accuracy =  0.0%
Epoch    4 Step     1265: lr = 0.163 loss =   7.78 words/sec =      394
Eval    0/2 accuracy =  0.0%
Epoch    5 Step     1625: lr = 0.154 loss =   7.44 words/sec =      397
Eval    0/2 accuracy =  0.0%
Epoch    6 Step     1984: lr = 0.145 loss =   6.64 words/sec =      399
Eval    0/2 accuracy =  0.0%
Epoch    7 Step     2531: lr = 0.131 loss =   6.40 words/sec =      601
Eval    0/2 accuracy =  0.0%
Epoch    8 Step     2891: lr = 0.122 loss =   6.05 words/sec =      404
Eval    0/2 accuracy =  0.0%
Epoch    9 Step     3251: lr = 0.113 loss =   6.15 words/sec =      397
Eval    0/2 accuracy =  0.0%
Epoch   10 Step     3611: lr = 0.104 loss =   6.07 words/sec =      393
Eval    0/2 accuracy =  0.0%
Epoch   11 Step     4159: lr = 0.090 loss =   6.12 words/sec =      602
Eval    0/2 accuracy =  0.0%
Epoch   12 Step     4331: lr = 0.085 loss =   6.12 words/sec =      201
Eval    0/2 accuracy =  0.0%
Epoch   13 Step     4878: lr = 0.072 loss =   5.83 words/sec =      605
Eval    0/2 accuracy =  0.0%
Epoch   14 Step     5237: lr = 0.062 loss =   6.08 words/sec =      404
Eval    0/2 accuracy =  0.0%
Epoch   15 Step     5597: lr = 0.053 loss =   5.96 words/sec =      401
Eval    0/2 accuracy =  0.0%
2018-09-13 15:57:12.157304: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-09-13 15:57:12.167545: I word2vec_kernels.cc:200] Data file: ./words.txt contains 16439 bytes, 2912 words, 889 unique words, 889 unique frequent words.
/usr/local/envs/py3env/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

Examining the created embedding

Let's load the embedding into TensorBoard. Start TensorBoard, switch to the "Projector" tab, click the "Load data" button, and load the vocab.txt file from the model's output directory.


In [ ]:
from google.datalab.ml import TensorBoard
TensorBoard().start('word2vec/trained')

Here, for example, is the word "founders" in context -- it is near doing, creative, difficult, and fight, which sounds about right. The numbers next to the words reflect their counts. Ideally, we would have a dataset large enough to use --min_count=10 when training word2vec, but that would take too long in a classroom setting.
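If you would rather inspect the embedding programmatically than in the Projector UI, one rough approach is to read the vocab.txt and vectors.txt files that word2vec.py saved (they are line-aligned, which the merge step below also relies on) and rank words by cosine similarity. A sketch, not part of the original notebook:

import numpy as np

with open('word2vec/trained/vocab.txt') as f:
    words = [line.split()[0] for line in f]
vecs = np.loadtxt('word2vec/trained/vectors.txt')
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows

def neighbors(query, k=5):
    # Cosine similarity of every word against the query word's vector.
    sims = vecs @ vecs[words.index(query)]
    return [words[i] for i in np.argsort(-sims)[1:k + 1]]

print(neighbors("b'the'"))  # vocabulary entries carry a b'...' prefix (see below)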


In [ ]:
for pid in TensorBoard.list()['pid']:
    TensorBoard().stop(pid)
    print('Stopped TensorBoard with pid {}'.format(pid))

Exporting the embedding vectors into a text file

Let's export the embedding into a text file so that we can use it the same way we used the GloVe embeddings in txtcls2.ipynb.

Notice that word2vec wrote the vocabulary and the vectors out to two separate files; we just have to merge them.


In [13]:
!wc word2vec/trained/*.txt


   890   8900 226962 word2vec/trained/vectors.txt
   890   1780  10929 word2vec/trained/vocab.txt
  1780  10680 237891 total

In [14]:
!head -3 word2vec/trained/*.txt


==> word2vec/trained/vectors.txt <==
2.540961503982543945e-01 5.147245526313781738e-01 -1.672663241624832153e-01 -4.278961420059204102e-01 -1.801081560552120209e-02 -1.780755966901779175e-01 -4.884876906871795654e-01 -1.350466255098581314e-02 8.268681168556213379e-02 -4.263160824775695801e-01
-2.084412276744842529e-01 -1.018680110573768616e-01 -4.349195361137390137e-01 4.004665911197662354e-01 -3.781259655952453613e-01 -3.250748813152313232e-01 -2.427831590175628662e-01 2.993280589580535889e-01 4.350312054157257080e-01 -1.009981334209442139e-01
-4.563158750534057617e-01 3.323142826557159424e-01 1.393919438123703003e-01 6.663140654563903809e-02 4.376976191997528076e-01 4.744379520416259766e-01 -1.887555420398712158e-01 -3.682589828968048096e-01 9.675115346908569336e-02 2.453538328409194946e-01

==> word2vec/trained/vocab.txt <==
b'UNK' 0
b'to' 99
b'the' 98

In [15]:
import pandas as pd

# The vocab and vector files are line-aligned, so we can concatenate them column-wise.
vocab = pd.read_csv("word2vec/trained/vocab.txt", sep=r"\s+", header=None, names=('word', 'count'))
vectors = pd.read_csv("word2vec/trained/vectors.txt", sep=r"\s+", header=None)
vectors = pd.concat([vocab, vectors], axis=1)

# Drop the count column and write "word v1 v2 ... v10" lines, GloVe-style.
del vectors['count']
vectors.to_csv("word2vec/trained/embedding.txt.gz", sep=" ", header=False, index=False, compression='gzip')

In [16]:
!zcat word2vec/trained/embedding.txt.gz | head -3


b'UNK' 0.2540961503982544 0.5147245526313781 -0.1672663241624832 -0.4278961420059204 -0.018010815605521202 -0.17807559669017792 -0.4884876906871796 -0.01350466255098581 0.08268681168556212 -0.4263160824775696
b'to' -0.20844122767448423 -0.10186801105737686 -0.434919536113739 0.40046659111976624 -0.3781259655952454 -0.3250748813152313 -0.2427831590175629 0.29932805895805364 0.4350312054157257 -0.10099813342094419
b'the' -0.4563158750534058 0.33231428265571594 0.1393919438123703 0.06663140654563904 0.43769761919975286 0.474437952041626 -0.1887555420398712 -0.3682589828968048 0.09675115346908568 0.2453538328409195

gzip: stdout: Broken pipe

Training the model with the custom embedding

Now you can use this embedding file instead of the GloVe embedding used in txtcls2.ipynb.
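The exported file has the same word-followed-by-coefficients layout as a GloVe file. As a reminder of how such a file typically gets turned into an embedding matrix (a generic sketch, not the exact loader in txtcls2.ipynb; word_index is a hypothetical tokenizer word-to-id mapping):

import gzip
import numpy as np

def load_embedding(path):
    # Parse "word v1 v2 ... vN" lines into a dict of word -> vector.
    embeddings = {}
    with gzip.open(path, 'rt') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings

def embedding_matrix(word_index, embeddings, dim=10):
    # Row i holds the vector for the word with id i; words missing
    # from the embedding file stay all-zero.
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        if word in embeddings:
            matrix[i] = embeddings[word]
    return matrix

embeddings = load_embedding('word2vec/trained/embedding.txt.gz')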


In [17]:
%%bash
gsutil cp word2vec/trained/embedding.txt.gz gs://${BUCKET}/txtcls2/custom_embedding.txt.gz


Copying file://word2vec/trained/embedding.txt.gz [Content-Type=text/plain]...
/ [1 files][ 85.3 KiB/ 85.3 KiB]                                                
Operation completed over 1 objects/85.3 KiB.                                     

In [20]:
%%bash
OUTDIR=gs://${BUCKET}/txtcls2/trained_model
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gsutil cp txtcls1/trainer/*.py $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=$(pwd)/txtcls1/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC_GPU \
   --runtime-version=1.4 \
   -- \
   --bucket=${BUCKET} \
   --output_dir=${OUTDIR} \
   --glove_embedding=gs://${BUCKET}/txtcls2/custom_embedding.txt.gz \
   --train_steps=36000


gs://alexhanna-dev-ml/txtcls2/trained_model us-central1 txtcls_180913_160510
CommandException: 1 files/objects could not be removed.
CommandException: No URLs matched: txtcls/trainer/*.py
ERROR: (gcloud.ml-engine.jobs.submit.training) Source directory [/content/datalab/training-data-analyst/courses/machine_learning/deepdive/09_sequence/txtcls] is not a valid directory.

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License