This notebook illustrates how to train a word2vec embedding on your own text corpus (here, Hacker News stories pulled from BigQuery) and how to use the resulting custom embedding in place of GloVe for text classification.
In [1]:
# change these to try this notebook out
BUCKET = 'alexhanna-dev-ml'
PROJECT = 'alexhanna-dev'
REGION = 'us-central1'
In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
The training dataset consists simply of words separated by spaces, extracted from your documents. The words appear in the order they occur in the documents, and words from successive documents are appended together. In other words, there is no "document separator".
The only preprocessing that I do is to replace anything that is not a letter, dollar sign, or hyphen with a space.
Recall that word2vec is unsupervised. There is no label.
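To make the preprocessing concrete, here is a toy illustration (the sample string is made up, not from the dataset) of the same substitution the BigQuery REGEXP_REPLACE below performs:
In [ ]:
import re

# Replace every character that is not a letter, dollar sign, space, or
# hyphen with a space, then lowercase -- the same cleanup done in the
# query below.
sample = "Show HN: My $5 side-project (2017)!"
print(re.sub(r'[^a-zA-Z $-]', ' ', sample).lower())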
In [3]:
import google.datalab.bigquery as bq
query="""
SELECT
CONCAT( LOWER(REGEXP_REPLACE(title, '[^a-zA-Z $-]', ' ')),
" ",
LOWER(REGEXP_REPLACE(text, '[^a-zA-Z $-]', ' '))) AS text
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
LENGTH(title) > 100
AND LENGTH(text) > 100
"""
df = bq.Query(query).execute().result().to_dataframe()
In [4]:
df[:5]
Out[4]:
In [5]:
with open('word2vec/words.txt', 'w') as ofp:
    for txt in df['text']:
        ofp.write(txt + " ")
This is what the resulting file looks like:
In [6]:
!cut -c-1000 word2vec/words.txt
In [7]:
%%bash
cd word2vec
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
g++ -std=c++11 \
-shared word2vec_ops.cc word2vec_kernels.cc \
-o word2vec_ops.so -fPIC ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} \
-O2 -D_GLIBCXX_USE_CXX11_ABI=0
# -I/usr/local/lib/python2.7/dist-packages/tensorflow/include/external/nsync/public \
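Once the shared library is built, word2vec.py loads it at import time, roughly along these lines (a sketch of the TF 1.x call, not a copy of the actual file; the path assumes you run from the notebook's working directory):
In [ ]:
import tensorflow as tf

# Load the freshly compiled custom ops; the returned object exposes
# the kernels defined in word2vec_ops.cc / word2vec_kernels.cc.
word2vec_ops = tf.load_op_library('word2vec/word2vec_ops.so')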
The actual evaluation dataset doesn't matter; we just need some of the words in the eval file to also appear in the training input. The real analogy dataset is of the form

Athens Greece Cairo Egypt
Baghdad Iraq Beijing China

i.e., four words per line, where the model is supposed to predict the fourth word given the first three. But we'll just make up a junk file.
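For context, the analogy evaluation amounts to vector arithmetic: given a : b :: c : ?, the predicted word is the one whose embedding is nearest to vec(b) - vec(a) + vec(c). A toy numpy sketch with made-up 3-d vectors (the junk file below will, of course, score nothing):
In [ ]:
import numpy as np

# Made-up 3-d "embeddings", purely for illustration.
emb = {
    'athens': np.array([1.0, 0.0, 0.2]),
    'greece': np.array([1.0, 1.0, 0.2]),
    'cairo':  np.array([0.0, 0.1, 0.9]),
    'egypt':  np.array([0.0, 1.1, 0.9]),
}
target = emb['greece'] - emb['athens'] + emb['cairo']
# Pick the word with highest cosine similarity to the target vector
# (real evaluations exclude the three query words from the candidates).
best = max(emb, key=lambda w: emb[w].dot(target) /
           (np.linalg.norm(emb[w]) * np.linalg.norm(target)))
print(best)  # 'egypt' for these toy vectors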
In [8]:
%%writefile word2vec/junk.txt
: analogy-questions-ignored
the user plays several levels
of the game puzzle
vote down the negative
In [9]:
%%bash
cd word2vec
rm -rf trained
python word2vec.py \
--train_data=./words.txt --eval_data=./junk.txt --save_path=./trained \
--min_count=1 --embedding_size=10 --window_size=2
In [ ]:
from google.datalab.ml import TensorBoard
TensorBoard().start('word2vec/trained')
Here, for example, is the word "founders" in context -- it's near doing, creative, difficult, and fight, which sounds about right ... The numbers next to the words reflect the count. Ideally, we would use a corpus large enough that we could set --min_count=10 when training word2vec, but that would take too long in a classroom situation.
In [ ]:
for pid in TensorBoard.list()['pid']:
    TensorBoard().stop(pid)
    print('Stopped TensorBoard with pid {}'.format(pid))
In [13]:
!wc word2vec/trained/*.txt
In [14]:
!head -3 word2vec/trained/*.txt
In [15]:
import pandas as pd
vocab = pd.read_csv("word2vec/trained/vocab.txt", sep=r"\s+", header=None, names=('word', 'count'))
vectors = pd.read_csv("word2vec/trained/vectors.txt", sep=r"\s+", header=None)
vectors = pd.concat([vocab, vectors], axis=1)
del vectors['count']
vectors.to_csv("word2vec/trained/embedding.txt.gz", sep=" ", header=False, index=False, index_label=False, compression='gzip')
In [16]:
!zcat word2vec/trained/embedding.txt.gz | head -3
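As a quick sanity check before shipping the file off, we can look up nearest neighbours directly from the merged dataframe built above (the query word 'founders' is an assumption -- substitute any word that actually appears in your vocab.txt):
In [ ]:
import numpy as np

# Cosine nearest neighbours from the merged dataframe: column 0 is the
# word, the remaining columns are the embedding components.
mat = vectors.iloc[:, 1:].values.astype(float)   # (vocab_size, embedding_size)
words = list(vectors['word'])
query = mat[words.index('founders')]
sims = mat.dot(query) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(query) + 1e-9)
for idx in np.argsort(-sims)[:5]:
    print(words[idx], float(sims[idx]))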
Now you can use this embedding file instead of the GloVe embedding used in txtcls2.ipynb.
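The file has the same layout as a GloVe file (token first, then the space-separated vector components), which is why it can be swapped in directly. A sketch of parsing it:
In [ ]:
import gzip
import numpy as np

# Parse the gzipped embedding file the same way a GloVe file is parsed:
# each line is a token followed by its vector components.
embeddings = {}
with gzip.open('word2vec/trained/embedding.txt.gz', 'rt') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        embeddings[parts[0]] = np.array(parts[1:], dtype=np.float32)
print(len(embeddings), 'words loaded')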
In [17]:
%%bash
gsutil cp word2vec/trained/embedding.txt.gz gs://${BUCKET}/txtcls2/custom_embedding.txt.gz
In [20]:
%%bash
OUTDIR=gs://${BUCKET}/txtcls2/trained_model
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gsutil cp txtcls1/trainer/*.py $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
--region=$REGION \
--module-name=trainer.task \
--package-path=$(pwd)/txtcls1/trainer \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET \
--scale-tier=BASIC_GPU \
--runtime-version=1.4 \
-- \
--bucket=${BUCKET} \
--output_dir=${OUTDIR} \
--glove_embedding=gs://${BUCKET}/txtcls2/custom_embedding.txt.gz \
--train_steps=36000
Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.