This notebook demonstrates using ML Workbench to create a machine learning model for text classification and set it up for online prediction. It is the "cloud run" version of the previous notebook: preprocessing, training, and batch prediction are all performed in the cloud using various Google Cloud services. Cloud runs are distributed, so they can handle very large datasets. With the small demo data used here there is little benefit, but the purpose is to demonstrate ML Workbench's cloud run mode.
There are only a few things that need to change between "local run" and "cloud run":
- add --cloud to each %%ml command;
- point all input and output paths (data, analysis output, transform output, training output) at GCS locations (gs://...);
- optionally add a cloud_config section for settings such as region, scale_tier, and runtime_version.
Other than this, nothing else changes from local to cloud! (A before/after sketch of the analyze step follows.)
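For illustration, a local analyze step such as the following (the local dataset name and output path here are hypothetical):

%%ml analyze
output: ./analysis
data: newsgroup_data
features:
    news_label:
        transform: target
    text:
        transform: bag_of_words

becomes this in cloud mode (the exact cell appears later in this notebook):

%%ml analyze --cloud
output: gs://datalab-mlworkbench-20newslab/analysis
data: newsgroup_data_gcs
features:
    news_label:
        transform: target
    text:
        transform: bag_of_words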
If you have any feedback, please send it to datalab-feedback@google.com.
In [2]:
    
# Make sure the processed data from the previous notebook is present.
!ls ./data
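To sanity-check the contents as well as the file names, you can peek at the first rows (assuming train.csv was produced by the previous notebook):

!head -n 2 ./data/train.csv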
    
    
    
The MLWorkbench Magics are a set of Datalab commands that provide an easy, mostly code-free experience for training, deploying, and predicting with ML models. This notebook takes the cleaned data from the previous notebook and builds a text classification model. There is a magic command for each step of the ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model.
For details of each command, run it with --help. For example, "%%ml train --help".
This notebook shows the cloud version of every command, which is the normal experience when building models on large datasets. However, we will still use the 20 newsgroups data.
In [4]:
    
!gsutil mb gs://datalab-mlworkbench-20newslab
    
    
    
In [5]:
    
!gsutil -m cp ./data/train.csv ./data/eval.csv gs://datalab-mlworkbench-20newslab
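Optionally, verify that the copies are readable by printing the first rows straight from the bucket:

!gsutil cat gs://datalab-mlworkbench-20newslab/train.csv | head -n 2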
    
    
    
In [1]:
    
import google.datalab.contrib.mlworkbench.commands  # This loads the '%%ml' magics
    
    
In [7]:
    
%%ml dataset create
name: newsgroup_data_gcs
format: csv
schema:
  - name: news_label
    type: STRING
  - name: text
    type: STRING  
train: gs://datalab-mlworkbench-20newslab/train.csv
eval: gs://datalab-mlworkbench-20newslab/eval.csv
    
    
In [8]:
    
%%ml analyze --cloud
output: gs://datalab-mlworkbench-20newslab/analysis
data: newsgroup_data_gcs
features:
    news_label:
        transform: target
    text:
        transform: bag_of_words
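Conceptually, the bag_of_words transform builds a vocabulary over the text column during analysis and later represents each document as (token id, count) pairs. The snippet below is only a minimal Python sketch of that idea, not ML Workbench's actual implementation:

from collections import Counter

def bag_of_words(text, vocab):
    """Toy bag-of-words: map each in-vocabulary token to (token id, count)."""
    counts = Counter(w for w in text.lower().split() if w in vocab)
    return {vocab[w]: n for w, n in counts.items()}

# A tiny hypothetical vocabulary, standing in for what the analyze step produces.
vocab = {'nasa': 0, 'windows': 1, 'space': 2}
print(bag_of_words('NASA launches a new space telescope', vocab))  # {0: 1, 2: 1}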
    
    
    
In [ ]:
    
!gsutil -m rm -rf gs://datalab-mlworkbench-20newslab/transform # Delete previous results if any.
    
In [12]:
    
%%ml transform --cloud
output: gs://datalab-mlworkbench-20newslab/transform
analysis: gs://datalab-mlworkbench-20newslab/analysis
data: newsgroup_data_gcs
    
    
    
Click the links in the output cell to monitor the jobs' progress. Once they are complete (usually within 15 minutes, including job startup overhead), check the output.
In [13]:
    
!gsutil ls gs://datalab-mlworkbench-20newslab/transform
    
    
    
In [2]:
    
%%ml dataset create
name: newsgroup_data_gcs_transformed
format: transformed
train: gs://datalab-mlworkbench-20newslab/transform/train-*
eval: gs://datalab-mlworkbench-20newslab/transform/eval-*
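If you are curious what the transformed files contain: they should be TFRecord files of tf.Example protos (gzip-compressed, to the best of my knowledge). Below is a sketch of peeking at a single record with the TensorFlow 1.x API; the file name is a placeholder, so substitute one from the gsutil ls output above:

import tensorflow as tf

# Placeholder path -- replace with an actual file from the transform output.
path = 'gs://datalab-mlworkbench-20newslab/transform/train-00000-of-00001.tfrecord.gz'
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
for record in tf.python_io.tf_record_iterator(path, options=options):
    print(tf.train.Example.FromString(record))  # print only the first example
    break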
    
    
In [ ]:
    
# Training should use an empty output folder. So if you run training multiple times,
# use different folders or remove the output from the previous run.
!gsutil -m rm -fr gs://datalab-mlworkbench-20newslab/train
    
Note that "runtime_version: '1.2'" specifies which TensorFlow version is used for training. The first training run is a bit slower because of warm-up, but if you train multiple times, the runs after the first will be faster.
In [3]:
    
%%ml train --cloud
output: gs://datalab-mlworkbench-20newslab/train
analysis: gs://datalab-mlworkbench-20newslab/analysis
data: newsgroup_data_gcs_transformed
model_args:
    model: linear_classification
    top-n: 5
cloud_config:
    scale_tier: BASIC
    region: us-central1
    runtime_version: '1.2'
    
    
    
    
In [4]:
    
# Once training is done, check the output.
!gsutil ls gs://datalab-mlworkbench-20newslab/train
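The training output contains two saved models: model, which takes input without the target column, and evaluation_model, which takes input including the target column and passes it through to the prediction output. The latter is what allows the next step: running batch prediction directly on eval.csv and scoring the predictions against the ground truth.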
    
    
    
In [2]:
    
%%ml batch_predict
model: gs://datalab-mlworkbench-20newslab/train/evaluation_model
output: gs://datalab-mlworkbench-20newslab/prediction
format: csv
data:
  csv: gs://datalab-mlworkbench-20newslab/eval.csv
    
    
    
In [2]:
    
!gsutil ls gs://datalab-mlworkbench-20newslab/prediction/
    
    
    
In [5]:
    
%%ml evaluate confusion_matrix --plot
size: 15
csv: gs://datalab-mlworkbench-20newslab/prediction/predict_results_eval.csv
    
    
    
In [6]:
    
%%ml evaluate accuracy
csv: gs://datalab-mlworkbench-20newslab/prediction/predict_results_eval.csv
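You can also compute these metrics yourself with pandas. The column names below ('target' and 'predicted') are assumptions about the results file; inspect its header (or the schema file in the prediction directory, if one exists) before relying on them:

# Copy the results locally, then explore with pandas.
!gsutil cp gs://datalab-mlworkbench-20newslab/prediction/predict_results_eval.csv ./data/

import pandas as pd
df = pd.read_csv('./data/predict_results_eval.csv')
# 'target' and 'predicted' are assumed column names -- check df.columns first.
print((df['target'] == df['predicted']).mean())    # overall accuracy
print(pd.crosstab(df['target'], df['predicted']))  # confusion matrix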
    
    
In [7]:
    
%%ml predict
model: gs://datalab-mlworkbench-20newslab/train/model
data:
  - nasa
  - windows xp
    
    
    
In [8]:
    
%%ml model deploy
name: newsgroup.alpha
path: gs://datalab-mlworkbench-20newslab/train/model
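The deployed model can be called from any client through the Cloud ML Engine online prediction REST API, not just from Datalab. A minimal sketch with the Google API Python client, where 'your-project-id' is a placeholder:

from googleapiclient import discovery

service = discovery.build('ml', 'v1')
# 'your-project-id' is a placeholder for your GCP project id.
name = 'projects/your-project-id/models/newsgroup/versions/alpha'
body = {'instances': ['nasa', 'windows xp']}  # same inputs as %%ml predict above
print(service.projects().predict(name=name, body=body).execute())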
    
    
    
In [9]:
    
# Let's create a CSV file from eval.csv by removing the target (first) column.
with open('./data/eval.csv', 'r') as f, open('./data/test.csv', 'w') as fout:
    for l in f:
        # Split on the first comma only, so any commas inside the text survive.
        fout.write(l.split(',', 1)[1])
    
    
In [12]:
    
!gsutil cp ./data/test.csv gs://datalab-mlworkbench-20newslab/test.csv
    
    
    
In [13]:
    
%%ml batch_predict --cloud
model: newsgroup.alpha
output: gs://datalab-mlworkbench-20newslab/test
format: json
data:
    csv: gs://datalab-mlworkbench-20newslab/test.csv
cloud_config:
    region: us-central1
    
    
    
Once the job is complete, take a look at the results.
In [14]:
    
!gsutil ls -lh gs://datalab-mlworkbench-20newslab/test
    
    
    
In [15]:
    
!gsutil cat gs://datalab-mlworkbench-20newslab/test/prediction.results* | head -n 2
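Since the batch prediction format was json, each line in the results files is one JSON object. A small sketch for loading them in Python; the exact fields depend on the model's output, so print one record to see what is available:

import json

# Capture the JSON-lines output; this uses IPython's ! capture syntax.
lines = !gsutil cat gs://datalab-mlworkbench-20newslab/test/prediction.results*
results = [json.loads(l) for l in lines if l.strip()]
print(len(results))
print(results[0])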
    
    
    
In [16]:
    
%%ml model delete
name: newsgroup.alpha
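Deploying newsgroup.alpha created a model named newsgroup with a version alpha under it. The cell above deleted the version; the next cell deletes the model itself.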
    
    
    
In [ ]:
    
%%ml model delete
name: newsgroup
    
In [ ]:
    
# Delete the files in the GCS bucket, and delete the bucket
!gsutil -m rm -r gs://datalab-mlworkbench-20newslab
    