ML Workbench provides an easy command-line interface for the machine learning life cycle, which involves four stages: analyze, transform, train, and predict.
There are "local" and "cloud" run modes for each stage. The "cloud" run mode is recommended if your data is big.
ML Workbench supports numeric, categorical, text, and image training data. For each type, there is a set of "transforms" to choose from. The "transforms" indicate how to convert the data into numeric features. Images, for example, are converted to fixed-size vectors representing high-level features.
ML Workbench supports image transforms (image to vec) with transfer learning.
This notebook codifies the capabilities discussed in this blog post. In a nutshell, it uses the pre-trained Inception model as a starting point and then uses transfer learning to train it further on additional, customer-specific images. For illustration, simple flower images are used. Compared to training from scratch, the time and costs are drastically reduced.
This notebook does preprocessing, training, and prediction by calling the CloudML API instead of running them "locally" in the Datalab container. It uses the full dataset.
In [3]:
# ML Workbench magics (%%ml) live under the google.datalab.contrib namespace.
# They are not enabled by default, so you need to import them before use.
import google.datalab.contrib.mlworkbench.commands
In [2]:
# Create a temporary GCS bucket. If the bucket already exists and you don't have permissions to it, rename the bucket.
!gsutil mb gs://flower-datalab-demo-bucket-large-data
In the next cell, we create a dataset representing our training data.
In [5]:
%%ml dataset create
name: flower_data_full
format: csv
train: gs://cloud-datalab/sampledata/flower/train3000.csv
eval: gs://cloud-datalab/sampledata/flower/eval670.csv
schema:
- name: image_url
type: STRING
- name: label
type: STRING
The analysis step computes numeric stats (e.g., min/max), categorical classes, text vocabulary and frequency, etc. Run "%%ml analyze --help" for usage. The analysis results are used for transforming raw data into numeric features that the model can consume. For example, a categorical value is converted to a one-hot vector ("Monday" becomes [1, 0, 0, 0, 0, 0, 0]). The data may be very large, so sometimes a cloud run is needed by adding the --cloud flag. A cloud run starts BigQuery jobs, which may incur some costs.
In this case, the analysis step only collects the unique labels.
Note that we run analysis only on the training data, not on the evaluation data.
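As a plain-Python illustration of the one-hot conversion described above (the vocabulary here is a stand-in for the classes that the analyze step collects from the data):

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot vector over a fixed vocabulary."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1
    return vec

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
print(one_hot('Monday', days))  # [1, 0, 0, 0, 0, 0, 0]
```

In this notebook, the collected vocabulary is simply the set of flower labels rather than weekdays.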
In [6]:
%%ml analyze --cloud
output: gs://flower-datalab-demo-bucket-large-data/analysis
data: flower_data_full
features:
image_url:
transform: image_to_vec
label:
transform: target
In [7]:
# Check analysis results
!gsutil list gs://flower-datalab-demo-bucket-large-data/analysis
With the analysis results, we can transform the raw data into numeric features. This must be done for both training and evaluation data. The data may be very large, so sometimes a cloud pipeline is needed by adding --cloud. A cloud run is implemented as Dataflow jobs, so it may incur some costs.
In this case, the transform step is required. It downloads each image, resizes it, and generates an embedding for it by running a pretrained TensorFlow graph. Note that it creates two jobs --- one for the training data and one for the eval data.
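The per-image flow can be sketched as below. The helpers `download`, `resize`, and `embed` are hypothetical stand-ins for the real image download, resize, and pretrained-Inception embedding logic that the Dataflow pipeline runs; the stubs only show the shape of the data flow.

```python
# Sketch of the per-image transform, using hypothetical helper functions.
def transform_row(image_url, label, download, resize, embed):
    image = download(image_url)
    image = resize(image, (299, 299))  # Inception v3 expects 299x299 inputs
    return {'image_embedding': embed(image), 'label': label}

# Stub implementations, just to demonstrate the data flow.
row = transform_row(
    'gs://bucket/img.jpg', 'daisy',
    download=lambda url: b'raw-bytes',
    resize=lambda img, size: img,
    embed=lambda img: [0.1, 0.2, 0.3])
print(row['label'])  # daisy
```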
In [ ]:
# Remove previous results
!gsutil -m rm -r gs://flower-datalab-demo-bucket-large-data/transform
In [12]:
%%ml transform --cloud
analysis: gs://flower-datalab-demo-bucket-large-data/analysis
output: gs://flower-datalab-demo-bucket-large-data/transform
data: flower_data_full
After the transformation is done, create a new dataset referencing the transformed data.
In [15]:
%%ml dataset create
name: flower_data_full_transformed
format: transformed
train: gs://flower-datalab-demo-bucket-large-data/transform/train-*
eval: gs://flower-datalab-demo-bucket-large-data/transform/eval-*
In [ ]:
# Remove previous training results.
!gsutil -m rm -r gs://flower-datalab-demo-bucket-large-data/train
In [16]:
%%ml train --cloud
output: gs://flower-datalab-demo-bucket-large-data/train
analysis: gs://flower-datalab-demo-bucket-large-data/analysis
data: flower_data_full_transformed
model_args:
model: dnn_classification
hidden-layer-size1: 100
top-n: 0
cloud_config:
region: us-central1
scale_tier: BASIC
After training is complete, you should see model files like the following.
In [17]:
# List the model files
!gsutil list gs://flower-datalab-demo-bucket-large-data/train/model
Batch prediction performs prediction in a batched fashion. The data can be large and is specified by files.
Note that we use the "evaluation_model", which sits in the "evaluation_model" directory. Training creates two models: the regular model under the "model" directory, and the "evaluation_model". The difference is that the regular model takes prediction data without targets, while the evaluation model takes data that includes targets and outputs the targets as is. The evaluation model is therefore well suited for evaluating model quality, because both the targets and the predicted values are included in the output.
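The difference between the two serving signatures can be sketched in plain Python. Here `classify` is a toy stand-in for the trained classifier (the real models are TensorFlow SavedModels produced by the train step), and the fixed scores are made up for illustration:

```python
# Toy stand-in for the trained classifier.
def classify(image_url):
    # Hypothetical fixed scores, for illustration only.
    return {'predicted': 'daisy', 'daisy': 0.9, 'tulips': 0.1}

def regular_model(row):
    """Regular model: the input has no target; the output is the prediction only."""
    return classify(row['image_url'])

def evaluation_model(row):
    """Evaluation model: the input includes the target, which is copied to the output."""
    out = classify(row['image_url'])
    out['target'] = row['label']
    return out

row = {'image_url': 'gs://bucket/img.jpg', 'label': 'daisy'}
print(evaluation_model(row))  # includes both 'predicted' and 'target'
```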
In [18]:
%%ml batch_predict --cloud
model: gs://flower-datalab-demo-bucket-large-data/train/evaluation_model
output: gs://flower-datalab-demo-bucket-large-data/evaluation
cloud_config:
region: us-central1
data:
csv: gs://cloud-datalab/sampledata/flower/eval670.csv
In [19]:
# after prediction is done, check the output
!gsutil list -l -h gs://flower-datalab-demo-bucket-large-data/evaluation
In [21]:
# Take a look at the file.
!gsutil cat -r -500 gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-00000-of-00006
Prediction results are in JSON format. We can load the results into a BigQuery table and perform analysis.
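Each line of the output files is one JSON record. As a local sketch (the two sample records below are made up, not taken from the actual output), the files can also be parsed directly with the standard json module:

```python
import json

# Hypothetical sample lines in the JSON-lines format described above.
lines = [
    '{"predicted": "daisy", "target": "daisy", "daisy": 0.97, "tulips": 0.01}',
    '{"predicted": "roses", "target": "tulips", "roses": 0.55, "tulips": 0.40}',
]

records = [json.loads(line) for line in lines]
accuracy = sum(r['predicted'] == r['target'] for r in records) / len(records)
print(accuracy)  # 0.5
```

Loading into BigQuery, as done below, is the practical choice for the full 670-row evaluation output.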
In [23]:
import google.datalab.bigquery as bq

schema = [
    {'name': 'predicted', 'type': 'STRING'},
    {'name': 'target', 'type': 'STRING'},
    {'name': 'daisy', 'type': 'FLOAT'},
    {'name': 'dandelion', 'type': 'FLOAT'},
    {'name': 'roses', 'type': 'FLOAT'},
    {'name': 'sunflowers', 'type': 'FLOAT'},
    {'name': 'tulips', 'type': 'FLOAT'},
]
bq.Dataset('image_classification_results').create()
t = bq.Table('image_classification_results.flower').create(schema=schema, overwrite=True)
t.load('gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-*', mode='overwrite', source_format='json')
Out[23]:
Check the wrong predictions.
In [24]:
%%bq query
SELECT * FROM image_classification_results.flower WHERE predicted != target
Out[24]:
In [26]:
%%ml evaluate confusion_matrix --plot
bigquery: image_classification_results.flower
In [27]:
%%ml evaluate accuracy
bigquery: image_classification_results.flower
Out[27]:
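For reference, the confusion matrix that %%ml evaluate plots can be computed from the same (target, predicted) pairs with plain Python. The pairs below are made up for illustration, not taken from the actual BigQuery table:

```python
from collections import Counter

# Hypothetical (target, predicted) pairs standing in for the BigQuery rows.
pairs = [
    ('daisy', 'daisy'),
    ('daisy', 'tulips'),
    ('roses', 'roses'),
    ('tulips', 'tulips'),
]

# Each cell of the confusion matrix counts one (target, predicted) combination.
confusion = Counter(pairs)
print(confusion[('daisy', 'daisy')])   # 1 correct daisy
print(confusion[('daisy', 'tulips')])  # 1 daisy misclassified as tulips
```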
In [ ]:
!gsutil -m rm -rf gs://flower-datalab-demo-bucket-large-data