This notebook illustrates:
In [ ]:
# Ensure the right version of TensorFlow is installed.
!pip freeze | grep tensorflow==2.1
In [ ]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'
In [35]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '2.1'
if 'COLAB_GPU' in os.environ:  # this is always set on Colab, the value is 0 or 1 depending on whether a GPU is attached
    from google.colab import auth
    auth.authenticate_user()
    # download "sidecar files" since on Colab, this notebook will be on Drive
    !rm -rf txtclsmodel
    !git clone --depth 1 https://github.com/GoogleCloudPlatform/training-data-analyst
    !mv training-data-analyst/courses/machine_learning/deepdive/09_sequence/txtclsmodel/ .
    !rm -rf training-data-analyst
    # downgrade TensorFlow to the version this notebook has been tested with
    !pip install --upgrade tensorflow==$TFVERSION
In [ ]:
import tensorflow as tf
print(tf.__version__)
We will look at the titles of articles and figure out whether the article came from the New York Times, TechCrunch or GitHub.
We will use Hacker News as our data source. It is an aggregator that displays tech-related headlines from various sources.
Hacker News headlines are available as a BigQuery public dataset. The dataset contains all headlines from the site's inception in October 2006 until October 2015.
Here is a sample of the dataset:
In [ ]:
%load_ext google.cloud.bigquery
In [ ]:
%%bigquery --project $PROJECT
SELECT
url, title, score
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
LENGTH(title) > 10
AND score > 10
AND LENGTH(url) > 0
LIMIT 10
Let's do some regular expression parsing in BigQuery to extract the source of each article from its URL. For example, if the URL is http://mobile.nytimes.com/...., we want to be left with nytimes.
In [ ]:
%%bigquery --project $PROJECT
SELECT
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
COUNT(title) AS num_articles
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
AND LENGTH(title) > 10
GROUP BY
source
ORDER BY num_articles DESC
LIMIT 10
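The extraction in the query above can be mimicked in plain Python to see what each step does. This is only a sketch: the function name `extract_source` is hypothetical, and Python's `re` module stands in for BigQuery's regular expression functions, so edge cases may differ.

```python
import re

# Same pattern as in the query: capture the host between "://" and the next "/".
HOST_PATTERN = re.compile(r'.*://(.[^/]+)/')


def extract_source(url):
    """Return the second-to-last label of a .com host (e.g. 'nytimes'), or None."""
    match = HOST_PATTERN.search(url)
    if match is None:
        return None  # REGEXP_EXTRACT would yield NULL for such a URL
    host = match.group(1)
    # The query keeps only hosts matching '.com$'. The pattern is copied as-is;
    # note that the unescaped dot matches any character, not just a literal ".".
    if re.search(r'.com$', host) is None:
        return None
    labels = host.split('.')   # SPLIT(host, '.')
    labels.reverse()           # ARRAY_REVERSE(...)
    # [OFFSET(1)]: the label just before "com". The length guard is an addition
    # for this sketch; BigQuery would raise an error on an out-of-range offset.
    return labels[1] if len(labels) > 1 else None


for url in ['http://mobile.nytimes.com/2015/01/01/story.html',
            'https://github.com/user/repo',
            'https://www.bbc.co.uk/news']:
    print(url, '->', extract_source(url))
```

A URL with no slash after the host (for example `http://example.com`) does not match the pattern at all, which is why such rows never receive a source.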
Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for AI Platform.
In [ ]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)
query="""
SELECT source, LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title FROM
(SELECT
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
title
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""
df = bq.query(query + " LIMIT 5").to_dataframe()
df.head()
For ML training, we will need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset).
A simple, repeatable way to do this is to hash a well-distributed column in our data and assign each row to a split based on the hash value (see https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning).
In [ ]:
traindf = bq.query(query + " AND ABS(MOD(FARM_FINGERPRINT(title), 4)) > 0").to_dataframe()
evaldf = bq.query(query + " AND ABS(MOD(FARM_FINGERPRINT(title), 4)) = 0").to_dataframe()
Below we can see that roughly 75% of the data (three of the four hash buckets) is used for training, and 25% for evaluation.
We can also see that within each dataset, the classes are roughly balanced.
In [ ]:
traindf['source'].value_counts()
In [ ]:
evaldf['source'].value_counts()
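To illustrate why a hash-based split is repeatable, here is a minimal sketch in plain Python. It assumes MD5 from `hashlib` as a stand-in for BigQuery's FARM_FINGERPRINT (the two produce different values, so the buckets will not match the ones computed in BigQuery), and the headlines are synthetic.

```python
import hashlib


def bucket(title, num_buckets=4):
    """Map a title to a stable bucket in [0, num_buckets)."""
    digest = hashlib.md5(title.encode('utf-8')).digest()
    # Use the first 8 bytes as an unsigned integer, then take the modulus.
    return int.from_bytes(digest[:8], 'big') % num_buckets


def split(titles):
    """Bucket 0 goes to evaluation, buckets 1-3 go to training."""
    train = [t for t in titles if bucket(t) > 0]
    evaluation = [t for t in titles if bucket(t) == 0]
    return train, evaluation


# Synthetic titles, used only to show the proportions.
titles = ['sample headline number {}'.format(i) for i in range(10000)]
train, evaluation = split(titles)
print('train fraction:', len(train) / len(titles))
print('eval fraction:', len(evaluation) / len(titles))
```

Because the bucket depends only on the title text, rerunning the split always sends a given title to the same dataset, and identical titles can never appear in both training and evaluation.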
Finally, we will save our data, which is currently in memory, to disk.
In [ ]:
import os, shutil
DATADIR='data/txtcls'
shutil.rmtree(DATADIR, ignore_errors=True)
os.makedirs(DATADIR)
traindf.to_csv(os.path.join(DATADIR, 'train.tsv'), header=False, index=False, encoding='utf-8', sep='\t')
evaldf.to_csv(os.path.join(DATADIR, 'eval.tsv'), header=False, index=False, encoding='utf-8', sep='\t')
In [ ]:
!head -3 data/txtcls/train.tsv
In [ ]:
!wc -l data/txtcls/*.tsv
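The files written above have no header row and two tab-separated columns: the source label, then the title. As a rough sketch of how such a file can be read back, the snippet below uses only the standard `csv` module. The helper name `read_labeled_titles` and the sample rows are hypothetical, and a temporary file is used so that the real data is not touched.

```python
import csv
import os
import tempfile


def read_labeled_titles(path):
    """Read (source, title) pairs from a headerless, tab-separated file."""
    with open(path, newline='', encoding='utf-8') as f:
        return [(row[0], row[1]) for row in csv.reader(f, delimiter='\t')]


# Hypothetical rows in the same layout as train.tsv and eval.tsv.
rows = [
    ('github', 'show hn  a tiny key value store'),
    ('nytimes', 'the economy in 2015'),
]

path = os.path.join(tempfile.mkdtemp(), 'train.tsv')
with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f, delimiter='\t').writerows(rows)

for source, title in read_labeled_titles(path):
    print(source, '|', title)
```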
Please explore the code in this directory: model.py contains the TensorFlow model, and task.py parses command-line arguments and launches the training job.
In particular look for the following:
The embedding layer in the Keras model takes care of one-hot encoding these integers and learning a dense embedding representation from them.
Finally, we pass the embedded text representation through a CNN model.
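To make the embedding step concrete, the sketch below shows that looking up a row of an embedding matrix gives the same result as multiplying a one-hot vector by that matrix. It is written in plain Python with a made-up 4-word vocabulary and 3-dimensional vectors. In a real embedding layer the matrix is a trainable weight, whereas here it is fixed, and none of this is taken from model.py.

```python
# Hypothetical embedding matrix: 4 word ids, embedding dimension 3.
EMBEDDINGS = [
    [0.0, 0.0, 0.0],  # id 0, conventionally reserved for padding
    [0.1, 0.2, 0.3],  # id 1
    [0.4, 0.5, 0.6],  # id 2
    [0.7, 0.8, 0.9],  # id 3
]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix (list of rows)."""
    num_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(num_cols)]


def embed(word_ids):
    """Embedding lookup: pick one row of the matrix per word id."""
    return [EMBEDDINGS[i] for i in word_ids]


word_ids = [2, 3, 1, 0]  # an integerized, padded title (made up)
via_lookup = embed(word_ids)
via_one_hot = [vector_times_matrix(one_hot(i, len(EMBEDDINGS)), EMBEDDINGS)
               for i in word_ids]

print(via_lookup)
print(via_lookup == via_one_hot)
```

The lookup form is what makes embeddings practical: it avoids materializing a vocabulary-sized one-hot vector for every word.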
In [ ]:
%%bash
pip install google-cloud-storage
rm -rf txtcls_trained
gcloud ai-platform local train \
--module-name=trainer.task \
--package-path=${PWD}/txtclsmodel/trainer \
-- \
--output_dir=${PWD}/txtcls_trained \
--train_data_path=${PWD}/data/txtcls/train.tsv \
--eval_data_path=${PWD}/data/txtcls/eval.tsv \
--num_epochs=0.1
In [ ]:
%%bash
gsutil cp data/txtcls/*.tsv gs://${BUCKET}/txtcls/
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/txtcls/trained_fromscratch
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
--region=$REGION \
--module-name=trainer.task \
--package-path=${PWD}/txtclsmodel/trainer \
--job-dir=$OUTDIR \
--scale-tier=BASIC_GPU \
--runtime-version 2.1 \
--python-version 3.7 \
-- \
--output_dir=$OUTDIR \
--train_data_path=gs://${BUCKET}/txtcls/train.tsv \
--eval_data_path=gs://${BUCKET}/txtcls/eval.tsv \
--num_epochs=5
Replace the job name in the command below with the name of the job you submitted. View the job in the console, and wait until it is complete.
In [ ]:
!gcloud ai-platform jobs describe txtcls_190209_224828
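The job name is built from the UTC timestamp at submission time, so it differs on every run. As a small sketch, the same naming scheme as `txtcls_$(date -u +%y%m%d_%H%M%S)` can be reproduced in Python; the helper name `make_job_name` is hypothetical.

```python
from datetime import datetime, timezone


def make_job_name(prefix='txtcls', now=None):
    """Build a job name such as txtcls_190209_224828 from a UTC timestamp."""
    if now is None:
        now = datetime.now(timezone.utc)
    return '{}_{}'.format(prefix, now.strftime('%y%m%d_%H%M%S'))


# A job submitted on 2019-02-09 at 22:48:28 UTC gets the name used above.
example = make_job_name(now=datetime(2019, 2, 9, 22, 48, 28, tzinfo=timezone.utc))
print(example)

# A name for a job submitted right now.
print(make_job_name())
```

Capturing the generated name in a variable at submission time avoids having to copy it by hand into the describe command.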
We will use the popular GloVe embeddings, which are trained on Wikipedia as well as various news sources like the New York Times.
You can read more about GloVe at the project homepage: https://nlp.stanford.edu/projects/glove/
You can download the embedding files directly from the stanford.edu site, but we've rehosted them in a GCS bucket for faster download speed.
In [36]:
!gsutil cp gs://cloud-training-demos/courses/machine_learning/deepdive/09_sequence/text_classification/glove.6B.200d.txt gs://$BUCKET/txtcls/
Once the embedding file has been copied, re-run your cloud training job with the added command-line argument:
--embedding_path=gs://${BUCKET}/txtcls/glove.6B.200d.txt
While the final accuracy may not change significantly, you should notice the model is able to converge to it much more quickly because it no longer has to learn an embedding from scratch.
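For intuition about what a pre-trained embedding file provides, here is a minimal sketch of parsing the GloVe text format, where each line holds a word followed by its space-separated vector components. The function names, the tiny 3-dimensional sample, and the word index are all hypothetical, and this is not necessarily how the trainer package loads the file.

```python
import io


def load_glove(lines):
    """Parse GloVe-style lines ('word v1 v2 ... vN') into a dict of vectors."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with id i; unknown words stay zero."""
    # Row 0 is left as zeros, assuming id 0 is reserved for padding.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Made-up 3-dimensional sample; the real file has 200 values per word.
SAMPLE = io.StringIO('the 0.1 0.2 0.3\ncode 0.4 0.5 0.6\n')
vectors = load_glove(SAMPLE)

word_index = {'the': 1, 'code': 2, 'unseenword': 3}  # hypothetical vocabulary
matrix = build_embedding_matrix(word_index, vectors, dim=3)
for row in matrix:
    print(row)
```

A matrix built this way can be used as the initial weights of the embedding layer, which is why training no longer has to learn word representations from scratch.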
Client-side tokenizing in Python is problematic because every client must reproduce exactly the preprocessing and vocabulary used during training. See "Text classification with native serving" for how to carry out the preprocessing in the serving function itself.
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License