ML Workbench provides an easy command line interface for the machine learning life cycle, which involves four stages: analyze, transform, train, and predict.
Each stage supports both "local" and "cloud" run modes. "cloud" mode is recommended if your data is large.
ML Workbench supports numeric, categorical, text, and image training data. For each type, there is a set of "transforms" to choose from; a transform specifies how the data is converted into numeric features. Images, for example, are converted to fixed-size vectors representing high-level features.
In this notebook, we are going to predict the resolution of police reports. We will build a binary classifier --- either some action was taken (arrest, citation, etc.), or nothing was. This is an imaginary scenario; we are not sure the prediction makes sense in reality, but we use the data to demonstrate how to use ML Workbench to create a classification model from structured data.
The original data is stored as a BigQuery public dataset. See San Francisco Police Reports Data for more details.
In [1]:
%%bq tables describe
name: bigquery-public-data.san_francisco.sfpd_incidents
Out[1]:
In [2]:
%%bq query
select * from `bigquery-public-data.san_francisco.sfpd_incidents` LIMIT 10
Out[2]:
In [3]:
%%bq query
select count(*) from `bigquery-public-data.san_francisco.sfpd_incidents`
Out[3]:
In [7]:
%%bq query
SELECT resolution, count(*) as count FROM `bigquery-public-data.san_francisco.sfpd_incidents` GROUP BY resolution
Out[7]:
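The many distinct resolution values will later be collapsed into a binary label (NONE vs ACTION). A minimal pandas sketch of that mapping, on toy data:

```python
import pandas as pd

# Toy sample of resolution values; the real data comes from BigQuery.
df = pd.DataFrame({'resolution': ['NONE', 'ARREST, BOOKED', 'NONE', 'JUVENILE CITED']})

# Everything other than 'NONE' counts as an action taken.
df['label'] = df['resolution'].apply(lambda r: 'NONE' if r == 'NONE' else 'ACTION')
print(df['label'].tolist())  # ['NONE', 'ACTION', 'NONE', 'ACTION']
```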
We can use the Facets tool to visualize an overview of the data.
In [ ]:
from google.datalab.ml import FacetsOverview
from google.datalab.ml import BigQueryDataSet
data = BigQueryDataSet(table = 'bigquery-public-data.san_francisco.sfpd_incidents')
sampled = data.sample(10000)
FacetsOverview().plot({'data': sampled})
From the results above, we learn that "descript" is a text feature (not categorical) because of its large number of unique values. "pddistrict" and "category" are both categorical columns. This information determines which transforms we apply to each column.
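The categorical-vs-text distinction can also be checked directly by counting distinct values per column. A sketch using a toy frame in place of the `sampled` DataFrame above (the cutoff of 3 unique values is purely illustrative):

```python
import pandas as pd

# Toy stand-in for the `sampled` DataFrame; the real sample has 10000 rows.
sampled = pd.DataFrame({
    'pddistrict': ['BAYVIEW', 'TENDERLOIN', 'BAYVIEW', 'INGLESIDE'],
    'descript': ['GRAND THEFT FROM LOCKED AUTO', 'BATTERY',
                 'STOLEN AUTOMOBILE', 'PETTY THEFT OF PROPERTY'],
})

# Columns with few distinct values are good one-hot candidates;
# high-cardinality string columns are better treated as text.
for col in sampled.columns:
    kind = 'categorical' if sampled[col].nunique() <= 3 else 'text'
    print(col, sampled[col].nunique(), kind)
```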
Since we already have the timestamp, "dayofweek" seems redundant. The address can likely be inferred from longitude and latitude too, so we are going to drop these two columns.
The timestamp is useful, but we need to convert it into weekday, day-of-year, and hour features to make it more useful to the model.
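The extraction that the query below performs can be sketched in pandas (toy timestamps, purely illustrative; note pandas numbers weekdays Mon=0 while BigQuery's DAYOFWEEK uses 1=Sunday):

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(['2016-01-15 09:30:00', '2016-07-04 23:10:00']))

# Derive the three time features used as model inputs.
hour = ts.dt.hour.astype(str)          # hour of day, 0-23
weekday = ts.dt.dayofweek.astype(str)  # day of week (pandas: Monday=0)
day = ts.dt.dayofyear.astype(str)      # day of year, 1-366
print(list(hour), list(weekday), list(day))
```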
In [9]:
%%bq query --name sfpd
SELECT
unique_key,
category,
REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(descript, r'[ ]+', '_'), r',', ' '), '[^a-zA-Z0-9 ]+', '') AS descript,
pddistrict,
CASE WHEN resolution = 'NONE' THEN 'NONE' ELSE 'ACTION' END AS resolution,
longitude,
latitude,
CAST(EXTRACT(HOUR FROM timestamp) AS STRING) AS hour,
CAST(EXTRACT(DAYOFWEEK FROM timestamp) AS STRING) AS weekday,
CAST(EXTRACT(DAYOFYEAR FROM timestamp) AS STRING) AS day
FROM `bigquery-public-data.san_francisco.sfpd_incidents`
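The nested REGEXP_REPLACE calls above normalize `descript` so multi-word phrases become single tokens for the bag-of-words transform. An equivalent Python sketch:

```python
import re

def clean_descript(s):
    s = re.sub(r'[ ]+', '_', s)           # collapse runs of spaces into '_'
    s = re.sub(r',', ' ', s)              # commas become word separators
    s = re.sub(r'[^a-zA-Z0-9 ]+', '', s)  # drop remaining punctuation (incl. '_')
    return s

print(clean_descript('GRAND THEFT FROM LOCKED AUTO, ATTEMPTED'))
# GRANDTHEFTFROMLOCKEDAUTO ATTEMPTED
```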
In [47]:
# Sample 3% of the data, and split it into train/eval sets.
import google.datalab.bigquery as bq
import numpy as np
sampling = bq.Sampling.random(percent=3)
job = sfpd.execute(sampling=sampling)
result = job.result()
data_schema = result.schema
df = result.to_dataframe()
msk = np.random.rand(len(df)) < 0.9
train_df = df[msk]
eval_df = df[~msk]
In [11]:
print('Training set includes %d instances.' % len(train_df))
print('Eval set includes %d instances.' % len(eval_df))
In [12]:
# Create a directory to store our training data.
!mkdir -p ./sfpd
In [13]:
train_df.to_csv('./sfpd/train.csv', header=False, index=False)
eval_df.to_csv('./sfpd/eval.csv', header=False, index=False)
In [14]:
# This loads %%ml commands
import google.datalab.contrib.mlworkbench.commands
Create a dataset (data location, schema, train, eval) so we can reference it later.
In [44]:
%%ml dataset create
format: csv
train: ./sfpd/train.csv
eval: ./sfpd/eval.csv
name: sfpd_3pcnt
schema: $data_schema
You can run "%%ml dataset explore" on the dataset we defined to get more insight if needed.
Steps for model building --- Analysis, Transform, Training, Evaluation.
In [48]:
# Delete previous run results.
!rm -r -f ./sfpd/analysis
In [49]:
%%ml analyze
output: ./sfpd/analysis
data: sfpd_3pcnt
features:
unique_key:
transform: key
category:
transform: one_hot
descript:
transform: bag_of_words
pddistrict:
transform: one_hot
resolution:
transform: target
longitude:
transform: scale
latitude:
transform: scale
hour:
transform: one_hot
weekday:
transform: one_hot
day:
transform: one_hot
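What the two main transforms do, roughly, on toy data (illustrative only; ML Workbench's actual implementation builds the vocabularies during this analyze step):

```python
from collections import Counter

# one_hot: each distinct value becomes its own indicator column.
districts = ['BAYVIEW', 'TENDERLOIN', 'BAYVIEW']
vocab = sorted(set(districts))  # ['BAYVIEW', 'TENDERLOIN']
one_hot = [[1 if d == v else 0 for v in vocab] for d in districts]
print(one_hot)  # [[1, 0], [0, 1], [1, 0]]

# bag_of_words: a text value becomes token counts over a vocabulary.
text = 'ROBBERYONTHESTREET STRONGARM'
print(Counter(text.split()))
```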
In [20]:
# Delete previous run results.
!rm -r -f ./sfpd/transform
In [21]:
%%ml transform
output: ./sfpd/transform
analysis: ./sfpd/analysis
shuffle: true
data: sfpd_3pcnt
Create a "transformed" dataset for use in training.
In [23]:
%%ml dataset create
format: transformed
name: sfpd_3pcnt_transformed
train: ./sfpd/transform/train-*
eval: ./sfpd/transform/eval-*
In [24]:
%%ml dataset explore
name: sfpd_3pcnt_transformed
Local training, depending on your Datalab VM size, can take from 1 to 10 minutes. You can click the TensorBoard link in the cell output to watch the progress.
In [25]:
# Delete previous run results.
!rm -r -f ./sfpd/train
In [26]:
%%ml train
output: ./sfpd/train
analysis: ./sfpd/analysis
data: sfpd_3pcnt_transformed
model_args:
model: dnn_classification
hidden-layer-size1: 200
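`dnn_classification` with `hidden-layer-size1: 200` corresponds to a feed-forward network with one 200-unit hidden layer. A minimal numpy sketch of the forward pass (random weights as a stand-in for the trained parameters, purely illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, n_hidden, n_classes = 50, 200, 2  # 2 classes: NONE vs ACTION

# Random weights stand in for the parameters learned during training.
W1 = rng.randn(n_features, n_hidden) * 0.01
b1 = np.zeros(n_hidden)
W2 = rng.randn(n_hidden, n_classes) * 0.01
b2 = np.zeros(n_classes)

x = rng.randn(1, n_features)    # one transformed example
h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer, 200 units
logits = h @ W2 + b2
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
print(probs)  # two class probabilities summing to 1
```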
Let's run our model on the eval data to gather some metrics. Note that after the training step completed, two directories were created: "evaluation_model" and "model". The only difference between the two is that "evaluation_model" takes input that includes the target (truth) column and passes it through to the output, while "model" is used at prediction time on data with no target column.
In [27]:
!rm -r -f ./sfpd/batch_predict # Delete previous results.
In [29]:
%%ml batch_predict
model: ./sfpd/train/evaluation_model/
output: ./sfpd/evaluation
format: csv
data:
csv: ./sfpd/eval.csv
In [32]:
%%ml evaluate confusion_matrix --plot
csv: ./sfpd/evaluation/predict_results_eval.csv
In [33]:
%%ml evaluate accuracy
csv: ./sfpd/evaluation/predict_results_eval.csv
Out[33]:
In [36]:
%%ml evaluate roc --plot
target_class: ACTION
csv: ./sfpd/evaluation/predict_results_eval.csv
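The same metrics can also be recomputed by hand from the prediction output. A sketch on toy predictions (the `predicted` and `target` column names here are assumptions, not the exact batch-predict CSV schema):

```python
import pandas as pd

# Toy stand-in for ./sfpd/evaluation/predict_results_eval.csv.
preds = pd.DataFrame({
    'predicted': ['ACTION', 'NONE', 'NONE', 'ACTION'],
    'target':    ['ACTION', 'NONE', 'ACTION', 'ACTION'],
})

accuracy = (preds['predicted'] == preds['target']).mean()
confusion = pd.crosstab(preds['target'], preds['predicted'])
print(accuracy)  # 0.75
print(confusion)
```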
A model trained with a deep neural network is hard to inspect. We will use LIME to analyze the prediction results. The purpose is to inspect the model as a black box and see how important each feature is to the prediction results.
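The black-box idea behind this kind of inspection can be sketched with permutation importance (a simpler technique than LIME, which fits local linear surrogates): shuffle one feature at a time and measure how much a toy model's accuracy drops.

```python
import numpy as np

rng = np.random.RandomState(7)

# Toy data: feature 0 determines the label, feature 1 is pure noise.
X = rng.randn(500, 2)
y = (X[:, 0] > 0).astype(int)

def model(X):
    # Stand-in black-box model: thresholds feature 0.
    return (X[:, 0] > 0).astype(int)

base_acc = (model(X) == y).mean()  # 1.0 by construction
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # destroy feature j's information
    drop = base_acc - (model(Xp) == y).mean()
    print('feature', j, 'importance (accuracy drop):', round(drop, 3))
```

Feature 0 shows a large accuracy drop when shuffled; the noise feature shows none, matching its irrelevance to the model.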
In [64]:
%%ml predict
model: ./sfpd/train/model
data:
- 120207601,DRUG/NARCOTIC,SALEOFCONTROLLEDSUBSTANCE,INGLESIDE,-122.45116399,37.7455640063,9,1,43
- 30280818,ROBBERY,ROBBERYONTHESTREET STRONGARM,TENDERLOIN,-122.411778296,37.7839805593,14,7,67
- 81312907,OTHER OFFENSES,OBSTRUCTIONSONSTREETSSIDEWALKS,BAYVIEW,-122.387067978,37.7554460266,8,3,344
In [65]:
%%ml explain --overview
training_data: sfpd_3pcnt
model: ./sfpd/train/model
labels: ACTION
data: 120207601,DRUG/NARCOTIC,SALEOFCONTROLLEDSUBSTANCE,INGLESIDE,-122.45116399,37.7455640063,9,1,43
In [66]:
%%ml explain --overview
training_data: sfpd_3pcnt
model: ./sfpd/train/model
labels: ACTION
data: 30280818,ROBBERY,ROBBERYONTHESTREET STRONGARM,TENDERLOIN,-122.411778296,37.7839805593,14,7,67
In [68]:
%%ml explain --overview
training_data: sfpd_3pcnt
model: ./sfpd/train/model
labels: ACTION
data: 81312907,OTHER OFFENSES,OBSTRUCTIONSONSTREETSSIDEWALKS,BAYVIEW,-122.387067978,37.7554460266,8,3,344