Text Classification using TensorFlow on Cloud ML Engine

This notebook illustrates:

Creating datasets for Machine Learning using BigQuery
Creating a text classification model using the high-level Estimator API
Training on Cloud ML Engine
Deploying model
Predicting with model



In [1]:

    
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'



In [2]:

    
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION



In [3]:

    
%datalab project set -p $PROJECT



In [ ]:

    
!pip install --upgrade tensorflow



In [5]:

    
import tensorflow as tf
print tf.__version__

The idea is to look at the title of a newspaper article and figure out whether the article came from the New York Times or from TechCrunch. There are very sophisticated approaches that we can try, but for now, let's go with something very simple.

Data exploration and preprocessing in BigQuery

What does the Hacker News dataset look like?



In [10]:

    
%bq query
SELECT
  url, title, score
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  LENGTH(title) > 10
  AND score > 10
LIMIT 10









    Out[10]:





    url title score
Ask HN: What are some good gift ideas for hacker types? 44
Ask HN: After Google Apps and Outlook, which email provider for custom domain? 22
http://code.ipstenu.org/2011/the-legality-of-forking/ The Legality of Forking 17
https://medium.com/@polarrist/where-are-chernobyl-s-children-a-photojournalist-s-honest-project-in-the-age-of-disaster-tourism-4cd333ab80c7 Where Are Chernobyl’s Children? 50
http://www.bbc.co.uk/news/technology-19597437 Twitter hands over messages at heart of Occupy case 61
http://weblogs.asp.net/fbouma/archive/2013/08/13/windows-store-account-getting-rid-of-it-is-as-hard-as-signing-up.aspx Windows Store dev account: getting rid of it is as hard as signing up 51
http://www.forbes.com/sites/andyellwood/2012/01/18/being-a-regular/ Being A Regular 28
Ask HN: The Road to Becoming an Angel or VC 12
http://personalmba.com/best-business-books/ Best Business Books 20
http://www.couch.io/migrating-to-couchdb Migrating to CouchDB 48
    
(rows: 10, time: 2.7s,   229MB processed, job: job_xEjVjj18rbx51BN1gSFipUuGdSE)

Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL. For example, if the url is http://mobile.nytimes.com/...., I want to be left with nytimes. To ensure that the parsing works for all URLs of interest, I'll group by the source to make sure there are no weird names left. This was an iterative process.



In [26]:

    
query="""
SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  COUNT(title) AS num_articles
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
GROUP BY
  source
ORDER BY num_articles DESC
LIMIT 10
"""



In [27]:

    
import google.datalab.bigquery as bq
df = bq.Query(query).execute().result().to_dataframe()
df









    Out[27]:






  
    
      
      source
      num_articles
    
  
  
    
      0
      blogspot
      41386
    
    
      1
      github
      36525
    
    
      2
      techcrunch
      30891
    
    
      3
      youtube
      30848
    
    
      4
      nytimes
      28787
    
    
      5
      medium
      18422
    
    
      6
      google
      18235
    
    
      7
      wordpress
      17667
    
    
      8
      arstechnica
      13749
    
    
      9
      wired
      12841

Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.



In [92]:

    
query="""
SELECT source, REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ') AS title FROM
(SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  title
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""
df = bq.Query(query + " LIMIT 10").execute().result().to_dataframe()
df.head()









    Out[92]:






  
    
      
      source
      title
    
  
  
    
      0
      github
      Opinionated Dress Color Simulator
    
    
      1
      techcrunch
      Meteor Raises $20M
    
    
      2
      techcrunch
      DataSift Raises $15M To Help Businesses Mine A...
    
    
      3
      github
      Writing Cross-Platform Games in Rust Using Piston
    
    
      4
      github
      DAws   Advanced Web Shell

For ML training, we will need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset). A simple way to do this is to use the hash of a well-distributed column in our data (See https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning).

So, let's do that and save the results as CSV files.



In [93]:

    
traindf = bq.Query(query + " AND ABS(MOD(FARM_FINGERPRINT(title), 4)) > 0").execute().result().to_dataframe()
evaldf  = bq.Query(query + " AND ABS(MOD(FARM_FINGERPRINT(title), 4)) = 0").execute().result().to_dataframe()
traindf.head()









    Out[93]:






  
    
      
      source
      title
    
  
  
    
      0
      github
      Awesome per directory history for ZSH
    
    
      1
      github
      PHP class which implements the Elo rating system
    
    
      2
      github
      Comic Sans Everything
    
    
      3
      github
      Amazing Flat version of Twitter Bootstrap
    
    
      4
      github
      A year of fun and hard work on learning Dart



In [86]:

    
traindf['source'].value_counts()









    Out[86]:





github        27445
techcrunch    23131
nytimes       21586
Name: source, dtype: int64



In [87]:

    
evaldf['source'].value_counts()









    Out[87]:





github        9080
techcrunch    7760
nytimes       7201
Name: source, dtype: int64



In [94]:

    
traindf.to_csv('train.csv', header=False, index=False, encoding='utf-8', sep='\t')
evaldf.to_csv('eval.csv', header=False, index=False, encoding='utf-8', sep='\t')



In [95]:

    
!head -3 train.csv









    



github	Awesome per directory history for ZSH
github	PHP class which implements the Elo rating system
github	Comic Sans Everything



In [96]:

    
!wc -l *.csv









    



  24041 eval.csv
  72162 train.csv
  72164 vocab.csv
 168367 total



In [97]:

    
%bash
gsutil cp *.csv gs://${BUCKET}/txtcls1/









    



Copying file://eval.csv [Content-Type=text/csv]...
Copying file://train.csv [Content-Type=text/csv]...                             
Copying file://vocab.csv [Content-Type=text/csv]...                             
\ [3 files][  9.4 MiB/  9.4 MiB]                                                
Operation completed over 3 objects/9.4 MiB.

TensorFlow code

Please explore the code in this directory -- model.py contains the key TensorFlow model and task.py has a main() that launches off the training job.

However, the following cells should give you an idea of what the model code does:



In [6]:

    
import tensorflow as tf
from tensorflow.contrib import lookup
from tensorflow.python.platform import gfile

print tf.__version__
MAX_DOCUMENT_LENGTH = 5  
PADWORD = 'ZYXW'

# vocabulary
lines = ['Some title', 'A longer title', 'An even longer title', 'This is longer than doc length']

# create vocabulary
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
vocab_processor.fit(lines)
with gfile.Open('vocab.tsv', 'wb') as f:
    f.write("{}\n".format(PADWORD))
    for word, index in vocab_processor.vocabulary_._mapping.iteritems():
      f.write("{}\n".format(word))
N_WORDS = len(vocab_processor.vocabulary_)
print '{} words into vocab.tsv'.format(N_WORDS)

# can use the vocabulary to convert words to numbers
table = lookup.index_table_from_file(
  vocabulary_file='vocab.tsv', num_oov_buckets=1, vocab_size=None, default_value=-1)
numbers = table.lookup(tf.constant(lines[0].split()))
with tf.Session() as sess:
  tf.tables_initializer().run()
  print "{} --> {}".format(lines[0], numbers.eval())









    



1.2.1
12 words into vocab.tsv
Some title --> [8 4]



In [13]:

    
!cat vocab.tsv









    



ZYXW
A
even
longer
title
This
doc
is
Some
An
length
than
<UNK>



In [7]:

    
# string operations
titles = tf.constant(lines)
words = tf.string_split(titles)
densewords = tf.sparse_tensor_to_dense(words, default_value=PADWORD)
numbers = table.lookup(densewords)

# now pad out with zeros and then slice to constant length
padding = tf.constant([[0,0],[0,MAX_DOCUMENT_LENGTH]])
padded = tf.pad(numbers, padding)
sliced = tf.slice(padded, [0,0], [-1, MAX_DOCUMENT_LENGTH])

with tf.Session() as sess:
  tf.tables_initializer().run()
  print "titles=", titles.eval(), titles.shape
  print "words=", words.eval()
  print "dense=", densewords.eval(), densewords.shape
  print "numbers=", numbers.eval(), numbers.shape
  print "padding=", padding.eval(), padding.shape
  print "padded=", padded.eval(), padded.shape
  print "sliced=", sliced.eval(), sliced.shape









    



titles= ['Some title' 'A longer title' 'An even longer title'
 'This is longer than doc length'] (4,)
words= SparseTensorValue(indices=array([[0, 0],
       [0, 1],
       [1, 0],
       [1, 1],
       [1, 2],
       [2, 0],
       [2, 1],
       [2, 2],
       [2, 3],
       [3, 0],
       [3, 1],
       [3, 2],
       [3, 3],
       [3, 4],
       [3, 5]]), values=array(['Some', 'title', 'A', 'longer', 'title', 'An', 'even', 'longer',
       'title', 'This', 'is', 'longer', 'than', 'doc', 'length'], dtype=object), dense_shape=array([4, 6]))
dense= [['Some' 'title' 'ZYXW' 'ZYXW' 'ZYXW' 'ZYXW']
 ['A' 'longer' 'title' 'ZYXW' 'ZYXW' 'ZYXW']
 ['An' 'even' 'longer' 'title' 'ZYXW' 'ZYXW']
 ['This' 'is' 'longer' 'than' 'doc' 'length']] (?, ?)
numbers= [[ 8  4  0  0  0  0]
 [ 1  3  4  0  0  0]
 [ 9  2  3  4  0  0]
 [ 5  7  3 11  6 10]] (?, ?)
padding= [[0 0]
 [0 5]] (2, 2)
padded= [[ 8  4  0  0  0  0  0  0  0  0  0]
 [ 1  3  4  0  0  0  0  0  0  0  0]
 [ 9  2  3  4  0  0  0  0  0  0  0]
 [ 5  7  3 11  6 10  0  0  0  0  0]] (?, ?)
sliced= [[ 8  4  0  0  0]
 [ 1  3  4  0  0]
 [ 9  2  3  4  0]
 [ 5  7  3 11  6]] (?, 5)



In [31]:

    
%bash
grep "^def" txtcls1/trainer/model.py









    



def init(bucket, num_steps):
def save_vocab(trainfile, txtcolname, outfilename):
def read_dataset(prefix):
def cnn_model(features, target, mode):
def serving_input_fn():
def get_train():
def get_valid():
def experiment_fn(output_dir):

Let's make sure the code works locally on a small dataset for a few steps.



In [ ]:

    
%bash
echo "bucket=${BUCKET}"
rm -rf outputdir
export PYTHONPATH=${PYTHONPATH}:${PWD}/txtcls1
python -m trainer.task \
   --bucket=${BUCKET} \
   --output_dir=outputdir \
   --job-dir=./tmp --train_steps=200

When I ran it, I got a 41% accuracy after a few steps. Because batchsize=32, 200 steps is essentially 6400 examples -- the full dataset is 72,000 examples, so this is not even the full dataset. And already, we are doing better than random chance.

Once the code works in standalone mode, you can run it on Cloud ML Engine. You can monitor the job from the GCP console in the Cloud Machine Learning Engine section. Since we have 72,000 examples and batchsize=32, train_steps=36,000 essentially means 16 epochs.



In [ ]:

    
%bash
OUTDIR=gs://${BUCKET}/txtcls1/trained_model
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gsutil cp txtcls1/trainer/*.py $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=$(pwd)/txtcls1/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC --runtime-version=1.2 \
   -- \
   --bucket=${BUCKET} \
   --output_dir=${OUTDIR} \
   --train_steps=36000

Training finished with an accuracy of 73%. Obviously, this was trained on a really small dataset and with more data will hopefully come even greater accuracy.

Deploy trained model

Deploying the trained model to act as a REST web service is a simple gcloud call.



In [20]:

    
%bash
gsutil ls gs://${BUCKET}/txtcls1/trained_model/export/Servo/









    



gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/
gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/1499296055/



In [ ]:

    
%bash
MODEL_NAME="txtcls"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/txtcls1/trained_model/export/Servo/ | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION}

Use model to predict

Send a JSON request to the endpoint of the service to make it predict which publication the article is more likely to run in. These are actual titles of articles in the New York Times, github, and TechCrunch on June 19. These titles were not part of the training or evaluation datasets.



In [28]:

    
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1beta1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1beta1_discovery.json')

request_data = {'instances':
  [
      {
        'title': 'Supreme Court to Hear Major Case on Partisan Districts'
      },
      {
        'title': 'Furan -- build and push Docker images from GitHub to target'
      },
      {
        'title': 'Time Warner will spend $100M on Snapchat original shows and ads'
      },
  ]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'txtcls', 'v1')
response = api.projects().predict(body=request_data, name=parent).execute()
print "response={0}".format(response)









    



response={u'predictions': [{u'source': u'nytimes', u'prob': [0.7775614857673645, 5.86951500736177e-05, 0.22237983345985413], u'class': 0}, {u'source': u'github', u'prob': [0.1087314561009407, 0.8909648656845093, 0.0003036781563423574], u'class': 1}, {u'source': u'techcrunch', u'prob': [0.0021869686897844076, 1.563105769264439e-07, 0.9978128671646118], u'class': 2}]}

As you can see, the trained model predicts that the Supreme Court article is 78% likely to come from New York Times and 22% from TechCrunch. The Docker article is 89% likely to be from GitHub according to the service and the Time Warner one is 100% likely to be from TechCrunch.

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License

url	title	score
	Ask HN: What are some good gift ideas for hacker types?	44
	Ask HN: After Google Apps and Outlook, which email provider for custom domain?	22
http://code.ipstenu.org/2011/the-legality-of-forking/	The Legality of Forking	17
https://medium.com/@polarrist/where-are-chernobyl-s-children-a-photojournalist-s-honest-project-in-the-age-of-disaster-tourism-4cd333ab80c7	Where Are Chernobyl’s Children?	50
http://www.bbc.co.uk/news/technology-19597437	Twitter hands over messages at heart of Occupy case	61
http://weblogs.asp.net/fbouma/archive/2013/08/13/windows-store-account-getting-rid-of-it-is-as-hard-as-signing-up.aspx	Windows Store dev account: getting rid of it is as hard as signing up	51
http://www.forbes.com/sites/andyellwood/2012/01/18/being-a-regular/	Being A Regular	28
	Ask HN: The Road to Becoming an Angel or VC	12
http://personalmba.com/best-business-books/	Best Business Books	20
http://www.couch.io/migrating-to-couchdb	Migrating to CouchDB	48

	source	num_articles
0	blogspot	41386
1	github	36525
2	techcrunch	30891
3	youtube	30848
4	nytimes	28787
5	medium	18422
6	google	18235
7	wordpress	17667
8	arstechnica	13749
9	wired	12841

	source	title
0	github	Opinionated Dress Color Simulator
1	techcrunch	Meteor Raises $20M
2	techcrunch	DataSift Raises $15M To Help Businesses Mine A...
3	github	Writing Cross-Platform Games in Rust Using Piston
4	github	DAws Advanced Web Shell

	source	title
0	github	Awesome per directory history for ZSH
1	github	PHP class which implements the Elo rating system
2	github	Comic Sans Everything
3	github	Amazing Flat version of Twitter Bootstrap
4	github	A year of fun and hard work on learning Dart