ML Workbench Sample --- Classification with Structured Data



Introduction to ML Workbench

ML Workbench provides an easy command line interface for the machine learning life cycle, which involves four stages:

  • analyze: gather statistics and metadata of the training data, such as numeric stats, vocabularies, etc. Analysis results are used to transform raw data into numeric features, which can be consumed by training directly.
  • transform: explicitly transform raw data into numeric features which can be used for training.
  • train: train a model using the transformed data.
  • predict/batch_predict: given a few instances of prediction data, make predictions instantly; given a large number of instances, make predictions in a batched fashion.

There are "local" and "cloud" run modes for each stage. The "cloud" run mode is recommended if your data is big.

ML Workbench supports numeric, categorical, text, and image training data. For each type, there is a set of "transforms" to choose from. The "transforms" indicate how to convert the data into numeric features. Images, for example, are converted to fixed-size vectors representing high-level features.
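To make the idea of a "transform" concrete, here is a minimal sketch (not ML Workbench's actual implementation) of what a `scale` transform for a numeric column and a `one_hot` transform for a categorical column might compute:

```python
def scale(values):
    # Min-max scale numeric values linearly into [-1, 1].
    lo, hi = min(values), max(values)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

def one_hot(value, vocabulary):
    # Encode a categorical value as a one-hot vector over a known vocabulary,
    # which the analyze stage would have collected from the training data.
    return [1 if value == v else 0 for v in vocabulary]

print(scale([0.0, 5.0, 10.0]))                             # [-1.0, 0.0, 1.0]
print(one_hot('PARK', ['BAYVIEW', 'PARK', 'TENDERLOIN']))  # [0, 1, 0]
```

The analyze stage exists precisely to gather the pieces these transforms need (min/max stats, vocabularies) before any transformation runs.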

In this notebook, we are going to predict the resolution of police reports. We will build a binary classifier --- either an action was taken, such as arrest, citation, etc., or nothing. This is an imaginary scenario; we are not sure the prediction makes sense in reality, but we use the data to demonstrate how to use ML Workbench to create a classification model with structured data.

Data Exploration

The original data is stored as a BigQuery public dataset. See San Francisco Police Reports Data for more details.


In [1]:
%%bq tables describe
name: bigquery-public-data.san_francisco.sfpd_incidents


Out[1]:

In [2]:
%%bq query
select * from `bigquery-public-data.san_francisco.sfpd_incidents` LIMIT 10


Out[2]:
unique_key | category | descript | dayofweek | pddistrict | resolution | address | longitude | latitude | location | pdid | timestamp
160793808 | OTHER OFFENSES | FAILURE TO REGISTER AS SEX OFFENDER | Friday | PARK | ARREST, BOOKED | 100 Block of GLENVIEW DR | -122.446827473 | 37.748334585 | (37.748334584991106, -122.44682747260848) | 16079380814020 | 2016-09-09 00:01:00
60103177 | NON-CRIMINAL | TRAFFIC ACCIDENT | Friday | PARK | PSYCHOPATHIC CASE | DUBOCE AV / NOE ST | -122.433575097 | 37.7691767476 | (37.7691767476277, -122.433575097282) | 6010317768050 | 2006-01-27 13:54:00
50791475 | BURGLARY | BURGLARY OF FLAT, FORCIBLE ENTRY | Friday | PARK | NONE | 1000 Block of OAK ST | -122.436592322 | 37.7731833319 | (37.7731833318632, -122.436592321946) | 5079147505021 | 2005-07-15 14:00:00
170605657 | LARCENY/THEFT | PETTY THEFT FROM UNLOCKED AUTO | Monday | PARK | NONE | 1800 Block of HAIGHT ST | -122.452728323 | 37.7693241293 | (37.769324129331906, -122.45272832275934) | 17060565706223 | 2017-07-24 18:00:00
166018573 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Sunday | | NONE | 100 Block of VELASCO AV | -122.413351985 | 37.7082024585 | (37.70820245849022, -122.4133519852842) | 16601857306244 | 2016-01-17 23:54:00
81061687 | DRUNKENNESS | UNDER INFLUENCE OF ALCOHOL IN A PUBLIC PLACE | Sunday | PARK | ARREST, BOOKED | 700 Block of STANYAN ST | -122.453512911 | 37.7686969787 | (37.7686969786551, -122.453512911126) | 8106168719090 | 2008-10-05 14:20:00
90064022 | ROBBERY | ROBBERY ON THE STREET WITH A DANGEROUS WEAPON | Sunday | PARK | ARREST, BOOKED | TURK ST / PIERCE ST | -122.435426672 | 37.7800763782 | (37.7800763782418, -122.435426671978) | 9006402203013 | 2009-01-18 09:45:00
110970181 | BURGLARY | BURGLARY, UNLAWFUL ENTRY | Sunday | PARK | NONE | 1300 Block of HAYES ST | -122.438637447 | 37.7748288206 | (37.7748288206275, -122.438637446729) | 11097018105073 | 2011-10-23 14:00:00
50265408 | VEHICLE THEFT | STOLEN MOTORCYCLE | Tuesday | PARK | ARREST, BOOKED | 0 Block of BEAVER ST | -122.434303188 | 37.7649375062 | (37.7649375061603, -122.434303187888) | 5026540807023 | 2005-03-08 23:44:00
30203105 | WEAPON LAWS | DISCHARGE FIREARM AT AN INHABITED DWELLING | Tuesday | PARK | NONE | CENTRAL AV / FULTON ST | -122.44478254 | 37.7760162801 | (37.7760162800858, -122.444782539633) | 3020310512026 | 2003-02-18 14:45:00

(rows: 10, time: 1.6s, 370MB processed, job: job_Eq-WxGq-tQZg9pikkg1EuI6OUMRD)



In [3]:
%%bq query
select count(*) from `bigquery-public-data.san_francisco.sfpd_incidents`


Out[3]:
f0_
2138115

(rows: 1, time: 1.1s, 0B processed, job: job_-bnRAjiulSiLxZjJNdqYPAWCj2-c)



In [7]:
%%bq query
SELECT resolution, count(*) as count FROM `bigquery-public-data.san_francisco.sfpd_incidents` GROUP BY resolution


Out[7]:
resolution | count
NONE | 1331698
ARREST, BOOKED | 506904
PSYCHOPATHIC CASE | 29183
ARREST, CITED | 154774
UNFOUNDED | 23337
LOCATED | 34463
EXCEPTIONAL CLEARANCE | 4188
DISTRICT ATTORNEY REFUSES TO PROSECUTE | 7955
JUVENILE BOOKED | 13701
COMPLAINANT REFUSES TO PROSECUTE | 8089
NOT PROSECUTED | 7717
PROSECUTED BY OUTSIDE AGENCY | 5070
JUVENILE CITED | 6586
JUVENILE ADMONISHED | 3004
CLEARED-CONTACT JUVENILE FOR MORE INFO | 674
JUVENILE DIVERTED | 688
PROSECUTED FOR LESSER OFFENSE | 84

(rows: 17, time: 1.4s, 20MB processed, job: job_K8TH6YRs9xzMK2j3jDlwsm4-zFQo)


We can use the Facets tool to visualize an overview of the data.


In [ ]:
from google.datalab.ml import FacetsOverview
from google.datalab.ml import BigQueryDataSet

data = BigQueryDataSet(table = 'bigquery-public-data.san_francisco.sfpd_incidents')
sampled = data.sample(10000)
FacetsOverview().plot({'data': sampled})

From the results above, we learned that 'descript' is a text feature (not categorical) because of its large number of unique values. "pddistrict" and "category" are both categorical columns. This information is important in deciding which transformations to apply to these columns.
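The distinction can be approximated with a simple heuristic: treat a string column as free text rather than categorical when it has many distinct values. A hypothetical sketch (the threshold is an arbitrary assumption, not something ML Workbench uses):

```python
def suggest_column_kind(values, categorical_max=100):
    # Treat a string column as categorical when it has few distinct values,
    # and as free text otherwise. The threshold of 100 is arbitrary.
    return 'categorical' if len(set(values)) <= categorical_max else 'text'

districts = ['PARK', 'TENDERLOIN', 'BAYVIEW', 'PARK']
print(suggest_column_kind(districts))  # categorical
```

In practice you would run this over the sampled DataFrame columns and eyeball the results alongside the Facets overview.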

Since we already have the timestamp, dayofweek seems redundant. Also, the address can be inferred from longitude and latitude. So we are going to drop these two columns.

The timestamp is useful, but we need to convert it into weekday, day-of-year, and hour features to make it more useful for training.
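The conversion we will do in SQL can be sketched in plain Python. One detail worth noting: BigQuery's EXTRACT(DAYOFWEEK ...) numbers days 1 (Sunday) through 7 (Saturday), which differs from Python's Monday-based conventions:

```python
from datetime import datetime

def time_features(ts):
    # Mirror EXTRACT(HOUR ...), EXTRACT(DAYOFWEEK ...), EXTRACT(DAYOFYEAR ...).
    # isoweekday() is 1=Monday..7=Sunday; convert to BigQuery's 1=Sunday..7=Saturday.
    weekday = ts.isoweekday() % 7 + 1
    return {'hour': ts.hour, 'weekday': weekday, 'day': ts.timetuple().tm_yday}

print(time_features(datetime(2016, 9, 9, 0, 1)))  # a Friday -> weekday 6
```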

Data Prep

We select the features from the source table, sample the data (so it fits a local run), split it into train/eval sets, and do some feature extraction.


In [9]:
%%bq query --name sfpd
SELECT
  unique_key,
  category,
  REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(descript, r'[ ]+', '_'), r',', ' '), '[^a-zA-Z0-9 ]+', '') AS descript,
  pddistrict,
  CASE WHEN resolution='NONE' THEN 'NONE' ELSE 'ACTION' END AS resolution,
  longitude,
  latitude,
  CAST(EXTRACT(HOUR FROM timestamp) AS STRING) as hour,
  CAST(EXTRACT(DAYOFWEEK FROM timestamp) AS STRING) as weekday,
  CAST(EXTRACT(DAYOFYEAR FROM timestamp) AS STRING) as day
FROM `bigquery-public-data.san_francisco.sfpd_incidents`
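
The nested REGEXP_REPLACE calls above deserve a note: runs of spaces are first fused into underscores, commas become spaces, and then all remaining punctuation (including those underscores) is stripped. The net effect is that multi-word phrases collapse into single tokens, while commas become token boundaries. A Python sketch of the same cleanup:

```python
import re

def clean_descript(descript):
    # Same three steps as the nested REGEXP_REPLACE calls in the query.
    s = re.sub(r'[ ]+', '_', descript)       # runs of spaces -> '_'
    s = re.sub(r',', ' ', s)                 # commas -> spaces
    return re.sub(r'[^a-zA-Z0-9 ]+', '', s)  # drop the rest, incl. '_'

print(clean_descript('ROBBERY ON THE STREET, STRONGARM'))
# ROBBERYONTHESTREET STRONGARM
```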



In [47]:
# Sample 3% data, and split it into train/eval set.

import google.datalab.bigquery as bq
import numpy as np

sampling = bq.Sampling.random(percent=3)
job = sfpd.execute(sampling=sampling)
result = job.result()
data_schema = result.schema
df = result.to_dataframe()

msk = np.random.rand(len(df)) < 0.9
train_df = df[msk]
eval_df = df[~msk]



In [11]:
print('Training set includes %d instances.' % len(train_df))
print('Eval set includes %d instances.' % len(eval_df))


Training set includes 57724 instances.
Eval set includes 6509 instances.

In [12]:
# Create a directory to store our training data.
!mkdir -p ./sfpd



In [13]:
train_df.to_csv('./sfpd/train.csv', header=False, index=False)
eval_df.to_csv('./sfpd/eval.csv', header=False, index=False)


Model Building


In [14]:
# This loads %%ml commands
import google.datalab.contrib.mlworkbench.commands


Create a dataset (data location, schema, train, eval) so we can reference it later.


In [44]:
%%ml dataset create
format: csv
train: ./sfpd/train.csv
eval: ./sfpd/eval.csv
name: sfpd_3pcnt
schema: $data_schema


You can run "%%ml dataset explore" on the dataset we defined to get more insight if needed.

Steps for model building --- Analysis, Transform, Training, Evaluation.


In [48]:
# Delete previous run results.
!rm -r -f ./sfpd/analysis



In [49]:
%%ml analyze
output: ./sfpd/analysis
data: sfpd_3pcnt
features:
  unique_key:
    transform: key
  category:
    transform: one_hot   
  descript:
    transform: bag_of_words
  pddistrict:
    transform: one_hot
  resolution:
    transform: target 
  longitude:
    transform: scale
  latitude:
    transform: scale
  hour:
    transform: one_hot
  weekday:
    transform: one_hot
  day:
    transform: one_hot


Expanding any file patterns...
file list computed.
Analyzing file /content/datalab/docs/samples/contrib/mlworkbench/structured_data_classification_sfcrime/sfpd/train.csv...
file /content/datalab/docs/samples/contrib/mlworkbench/structured_data_classification_sfcrime/sfpd/train.csv analyzed.

In [20]:
# Delete previous run results.
!rm -r -f ./sfpd/transform



In [21]:
%%ml transform
output: ./sfpd/transform
analysis: ./sfpd/analysis
shuffle: true
data: sfpd_3pcnt


/usr/local/lib/python2.7/dist-packages/apache_beam/coders/typecoders.py:135: UserWarning: Using fallback coder for typehint: Any.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)

Create a "transformed" dataset for use in training.


In [23]:
%%ml dataset create
format: transformed
name: sfpd_3pcnt_transformed
train: ./sfpd/transform/train-*
eval: ./sfpd/transform/eval-*



In [24]:
%%ml dataset explore
name: sfpd_3pcnt_transformed


train data instances: 57724
eval data instances: 6509

Local training, depending on your Datalab VM size, can take from 1 minute to 10 minutes. You can click the TensorBoard link in the cell output to watch the progress.


In [25]:
# Delete previous run results.
!rm -r -f ./sfpd/train



In [26]:
%%ml train
output: ./sfpd/train
analysis: ./sfpd/analysis
data: sfpd_3pcnt_transformed
model_args:
    model: dnn_classification
    hidden-layer-size1: 200


TensorBoard was started successfully with pid 1591. Click here to access it.

Let's run our model on the eval data to gather some metrics. Note that after the training step completed, two directories were created: "evaluation_model" and "model". The only difference between the two is that "evaluation_model" takes input containing the target (truth) column and passes it through to the output, while "model" expects data with no target column and is mostly used at prediction time.


In [27]:
!rm -r -f ./sfpd/evaluation # Delete previous results.



In [29]:
%%ml batch_predict
model: ./sfpd/train/evaluation_model/
output: ./sfpd/evaluation
format: csv
data:
  csv: ./sfpd/eval.csv


local prediction...
INFO:tensorflow:Restoring parameters from ./sfpd/train/evaluation_model/variables/variables
done.

In [32]:
%%ml evaluate confusion_matrix --plot
csv: ./sfpd/evaluation/predict_results_eval.csv



In [33]:
%%ml evaluate accuracy
csv: ./sfpd/evaluation/predict_results_eval.csv


Out[33]:
accuracy count target
0 0.771281 2514 ACTION
1 0.893617 3995 NONE
2 0.846367 6509 _all
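
The per-class numbers above are simply the fraction of correct predictions within each target group. Assuming the prediction CSV contains a predicted-label column and the original target column (the column names below are hypothetical stand-ins), the same breakdown can be computed with pandas:

```python
import pandas as pd

# Toy stand-in for ./sfpd/evaluation/predict_results_eval.csv;
# the real file has more columns (class probabilities, features, ...).
df = pd.DataFrame({
    'predicted': ['ACTION', 'NONE', 'NONE', 'ACTION'],
    'target':    ['ACTION', 'ACTION', 'NONE', 'ACTION'],
})
correct = df['predicted'] == df['target']
per_class = correct.groupby(df['target']).mean()
print(per_class)       # per-target accuracy, like the ACTION/NONE rows
print(correct.mean())  # overall accuracy, like the '_all' row
```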

In [36]:
%%ml evaluate roc --plot
target_class: ACTION
csv: ./sfpd/evaluation/predict_results_eval.csv


Explain the Model

A model trained with a deep neural network is hard to inspect. We will use LIME to analyze the prediction results. The purpose is to treat the model as a black box and see how important each feature is to the prediction results.
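The core idea behind black-box explanation — perturb the inputs and watch how the prediction moves — can be sketched as follows. This toy version just measures the average prediction change per feature; actual LIME goes further and fits a locally weighted linear surrogate model around the instance:

```python
import random

def perturbation_importance(predict, instance, n_samples=200, seed=0):
    # For each feature, repeatedly replace its value with random noise and
    # measure the mean absolute change in the model's predicted score.
    rng = random.Random(seed)
    base = predict(instance)
    scores = {}
    for name in instance:
        total = 0.0
        for _ in range(n_samples):
            perturbed = dict(instance)
            perturbed[name] = rng.random()
            total += abs(predict(perturbed) - base)
        scores[name] = total / n_samples
    return scores

# A toy 'model' whose score depends only on feature 'a'.
model = lambda x: 2.0 * x['a']
print(perturbation_importance(model, {'a': 0.5, 'b': 0.5}))
```

Feature 'a' gets a positive importance score while 'b', which the toy model ignores, scores zero.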


In [64]:
%%ml predict
model: ./sfpd/train/model
data:
  - 120207601,DRUG/NARCOTIC,SALEOFCONTROLLEDSUBSTANCE,INGLESIDE,-122.45116399,37.7455640063,9,1,43
  - 30280818,ROBBERY,ROBBERYONTHESTREET STRONGARM,TENDERLOIN,-122.411778296,37.7839805593,14,7,67
  - 81312907,OTHER OFFENSES,OBSTRUCTIONSONSTREETSSIDEWALKS,BAYVIEW,-122.387067978,37.7554460266,8,3,344


ACTION | NONE | predicted | unique_key | category | day | descript | hour | latitude | longitude | pddistrict | weekday
0.663459 | 0.336541 | ACTION | 120207601 | DRUG/NARCOTIC | 43 | SALEOFCONTROLLEDSUBSTANCE | 9 | 37.7455640063 | -122.45116399 | INGLESIDE | 1
0.208403 | 0.791597 | NONE | 30280818 | ROBBERY | 67 | ROBBERYONTHESTREET STRONGARM | 14 | 37.7839805593 | -122.411778296 | TENDERLOIN | 7
0.674107 | 0.325893 | ACTION | 81312907 | OTHER OFFENSES | 344 | OBSTRUCTIONSONSTREETSSIDEWALKS | 8 | 37.7554460266 | -122.387067978 | BAYVIEW | 3

In [65]:
%%ml explain --overview
training_data: sfpd_3pcnt
model: ./sfpd/train/model
labels: ACTION
data: 120207601,DRUG/NARCOTIC,SALEOFCONTROLLEDSUBSTANCE,INGLESIDE,-122.45116399,37.7455640063,9,1,43



Explaining features for label "ACTION"

All Categorical and Numeric Columns

Text Column "descript"

In [66]:
%%ml explain --overview
training_data: sfpd_3pcnt
model: ./sfpd/train/model
labels: ACTION
data: 30280818,ROBBERY,ROBBERYONTHESTREET STRONGARM,TENDERLOIN,-122.411778296,37.7839805593,14,7,67



Explaining features for label "ACTION"

All Categorical and Numeric Columns

Text Column "descript"

In [68]:
%%ml explain --overview
training_data: sfpd_3pcnt
model: ./sfpd/train/model
labels: ACTION
data: 81312907,OTHER OFFENSES,OBSTRUCTIONSONSTREETSSIDEWALKS,BAYVIEW,-122.387067978,37.7554460266,8,3,344



Explaining features for label "ACTION"

All Categorical and Numeric Columns

Text Column "descript"

Model Deployment and Online Prediction

See structured_data_regression_taxi sample and image_classification_flower sample.