ML Workbench provides an easy command line interface for the machine learning life cycle, which involves four stages: analyze, transform, train, and predict.
Each stage supports both "local" and "cloud" run modes. "cloud" mode is recommended if your data is large.
ML Workbench supports numeric, categorical, text, and image training data. For each type, there is a set of "transforms" to choose from; a transform specifies how the data is converted into numeric features. Images, for example, are converted to fixed-size vectors representing high-level features.
In this notebook, we are going to predict the resolution of police reports. We will build a binary classifier --- either some action was taken (arrest, citation, etc.), or nothing was. This is an imaginary scenario; we are not sure the prediction makes sense in reality, but we use the data to demonstrate how to use ML Workbench to create a classification model from structured data.
The original data is stored as a BigQuery public dataset. See San Francisco Police Reports Data for more details.
In [1]:
%%bq tables describe
name: bigquery-public-data.san_francisco.sfpd_incidents
Out[1]:
In [2]:
%%bq query
select * from `bigquery-public-data.san_francisco.sfpd_incidents` LIMIT 10
Out[2]:
In [3]:
%%bq query
select count(*) from `bigquery-public-data.san_francisco.sfpd_incidents`
Out[3]:
In [7]:
%%bq query
SELECT resolution, count(*) as count FROM `bigquery-public-data.san_francisco.sfpd_incidents` GROUP BY resolution
Out[7]:
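The many distinct resolution values will later be collapsed into a binary label (NONE vs ACTION). A minimal pandas sketch of that mapping, on toy data:

```python
import pandas as pd

# Toy sample of resolution values; the real data comes from BigQuery.
df = pd.DataFrame({'resolution': ['NONE', 'ARREST, BOOKED', 'NONE', 'JUVENILE CITED']})

# Everything other than 'NONE' counts as an action taken.
df['label'] = df['resolution'].apply(lambda r: 'NONE' if r == 'NONE' else 'ACTION')
print(df['label'].tolist())  # ['NONE', 'ACTION', 'NONE', 'ACTION']
```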
We can use the Facets tool to visualize an overview of the data.
In [ ]:
from google.datalab.ml import FacetsOverview
from google.datalab.ml import BigQueryDataSet
data = BigQueryDataSet(table = 'bigquery-public-data.san_francisco.sfpd_incidents')
sampled = data.sample(10000)
FacetsOverview().plot({'data': sampled})
From the results above, we learn that "descript" is a text feature (not categorical) because of its large number of unique values. "pddistrict" and "category" are both categorical columns. This information determines which transforms we apply to each column.
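The categorical-vs-text distinction can also be checked directly by counting distinct values per column. A sketch using a toy frame in place of the `sampled` DataFrame above (the cutoff of 3 unique values is purely illustrative):

```python
import pandas as pd

# Toy stand-in for the `sampled` DataFrame; the real sample has 10000 rows.
sampled = pd.DataFrame({
    'pddistrict': ['BAYVIEW', 'TENDERLOIN', 'BAYVIEW', 'INGLESIDE'],
    'descript': ['GRAND THEFT FROM LOCKED AUTO', 'BATTERY',
                 'STOLEN AUTOMOBILE', 'PETTY THEFT OF PROPERTY'],
})

# Columns with few distinct values are good one-hot candidates;
# high-cardinality string columns are better treated as text.
for col in sampled.columns:
    kind = 'categorical' if sampled[col].nunique() <= 3 else 'text'
    print(col, sampled[col].nunique(), kind)
```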
Since we already have the timestamp, "dayofweek" seems redundant. The address can likely be inferred from longitude and latitude too, so we are going to drop these two columns.
The timestamp is useful, but we need to convert it into weekday, day-of-year, and hour features to make it more useful to the model.
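The extraction that the query below performs can be sketched in pandas (toy timestamps, purely illustrative; note pandas numbers weekdays Mon=0 while BigQuery's DAYOFWEEK uses 1=Sunday):

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(['2016-01-15 09:30:00', '2016-07-04 23:10:00']))

# Derive the three time features used as model inputs.
hour = ts.dt.hour.astype(str)          # hour of day, 0-23
weekday = ts.dt.dayofweek.astype(str)  # day of week (pandas: Monday=0)
day = ts.dt.dayofyear.astype(str)      # day of year, 1-366
print(list(hour), list(weekday), list(day))
```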
In [9]:
%%bq query --name sfpd
SELECT
unique_key,
category,
REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(descript, r'[ ]+', '_'), r',', ' '), '[^a-zA-Z0-9 ]+', '') AS descript,
pddistrict,
CASE WHEN resolution = 'NONE' THEN 'NONE' ELSE 'ACTION' END AS resolution,
longitude,
latitude,
CAST(EXTRACT(HOUR FROM timestamp) AS STRING) AS hour,
CAST(EXTRACT(DAYOFWEEK FROM timestamp) AS STRING) AS weekday,
CAST(EXTRACT(DAYOFYEAR FROM timestamp) AS STRING) AS day
FROM `bigquery-public-data.san_francisco.sfpd_incidents`
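The nested REGEXP_REPLACE calls above normalize `descript` so multi-word phrases become single tokens for the bag-of-words transform. An equivalent Python sketch:

```python
import re

def clean_descript(s):
    s = re.sub(r'[ ]+', '_', s)           # collapse runs of spaces into '_'
    s = re.sub(r',', ' ', s)              # commas become word separators
    s = re.sub(r'[^a-zA-Z0-9 ]+', '', s)  # drop remaining punctuation (incl. '_')
    return s

print(clean_descript('GRAND THEFT FROM LOCKED AUTO, ATTEMPTED'))
# GRANDTHEFTFROMLOCKEDAUTO ATTEMPTED
```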
In [47]:
# Sample 3% of the data, and split it into train/eval sets.
import google.datalab.bigquery as bq
import numpy as np
sampling = bq.Sampling.random(percent=3)
job = sfpd.execute(sampling=sampling)
result = job.result()
data_schema = result.schema
df = result.to_dataframe()
msk = np.random.rand(len(df)) < 0.9
train_df = df[msk]
eval_df = df[~msk]
In [11]:
print('Training set includes %d instances.' % len(train_df))
print('Eval set includes %d instances.' % len(eval_df))
In [12]:
# Create a directory to store our training data.
!mkdir -p ./sfpd
In [13]:
train_df.to_csv('./sfpd/train.csv', header=False, index=False)
eval_df.to_csv('./sfpd/eval.csv', header=False, index=False)
In [14]:
# This loads %%ml commands
import google.datalab.contrib.mlworkbench.commands
Create a dataset (data location, schema, train, eval) so we can reference it later.
In [44]:
%%ml dataset create
format: csv
train: ./sfpd/train.csv
eval: ./sfpd/eval.csv
name: sfpd_3pcnt
schema: $data_schema
You can run "%%ml dataset explore" on the dataset we defined to get more insight if needed.
Steps for model building --- Analysis, Transform, Training, Evaluation.
In [48]:
# Delete previous run results.
!rm -r -f ./sfpd/analysis
In [49]:
%%ml analyze
output: ./sfpd/analysis
data: sfpd_3pcnt
features:
unique_key:
transform: key
category:
transform: one_hot
descript:
transform: bag_of_words
pddistrict:
transform: one_hot
resolution:
transform: target
longitude:
transform: scale
latitude:
transform: scale
hour:
transform: one_hot
weekday:
transform: one_hot
day:
transform: one_hot
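What the two main transforms do, roughly, on toy data (illustrative only; ML Workbench's actual implementation builds the vocabularies during this analyze step):

```python
from collections import Counter

# one_hot: each distinct value becomes its own indicator column.
districts = ['BAYVIEW', 'TENDERLOIN', 'BAYVIEW']
vocab = sorted(set(districts))  # ['BAYVIEW', 'TENDERLOIN']
one_hot = [[1 if d == v else 0 for v in vocab] for d in districts]
print(one_hot)  # [[1, 0], [0, 1], [1, 0]]

# bag_of_words: a text value becomes token counts over a vocabulary.
text = 'ROBBERYONTHESTREET STRONGARM'
print(Counter(text.split()))
```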
In [20]:
# Delete previous run results.
!rm -r -f ./sfpd/transform
In [21]:
%%ml transform
output: ./sfpd/transform
analysis: ./sfpd/analysis
shuffle: true
data: sfpd_3pcnt
Create a "transformed" dataset for use in training.
In [23]:
%%ml dataset create
format: transformed
name: sfpd_3pcnt_transformed
train: ./sfpd/transform/train-*
eval: ./sfpd/transform/eval-*
In [24]:
%%ml dataset explore
name: sfpd_3pcnt_transformed
Local training, depending on your Datalab VM size, can take from 1 to 10 minutes. You can click the TensorBoard link in the cell output to watch the progress.
In [25]:
# Delete previous run results.
!rm -r -f ./sfpd/train
In [26]:
%%ml train
output: ./sfpd/train
analysis: ./sfpd/analysis
data: sfpd_3pcnt_transformed
model_args:
model: dnn_classification
hidden-layer-size1: 200
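`dnn_classification` with `hidden-layer-size1: 200` corresponds to a feed-forward network with one 200-unit hidden layer. A minimal numpy sketch of the forward pass (random weights as a stand-in for the trained parameters, purely illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, n_hidden, n_classes = 50, 200, 2  # 2 classes: NONE vs ACTION

# Random weights stand in for the parameters learned during training.
W1 = rng.randn(n_features, n_hidden) * 0.01
b1 = np.zeros(n_hidden)
W2 = rng.randn(n_hidden, n_classes) * 0.01
b2 = np.zeros(n_classes)

x = rng.randn(1, n_features)    # one transformed example
h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer, 200 units
logits = h @ W2 + b2
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
print(probs)  # two class probabilities summing to 1
```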
Let's run our model on the eval data to gather some metrics. Note that after the training step completed, two directories were created: "evaluation_model" and "model". The only difference between the two is that "evaluation_model" takes input that includes the target (truth) column and passes it through to the output, while "model" is used at prediction time on data with no target column.
In [27]:
!rm -r -f ./sfpd/batch_predict # Delete previous results.
In [29]:
%%ml batch_predict
model: ./sfpd/train/evaluation_model/
output: ./sfpd/evaluation
format: csv
data:
csv: ./sfpd/eval.csv
In [32]:
%%ml evaluate confusion_matrix --plot
csv: ./sfpd/evaluation/predict_results_eval.csv
In [33]:
%%ml evaluate accuracy
csv: ./sfpd/evaluation/predict_results_eval.csv
Out[33]:
In [36]:
%%ml evaluate roc --plot
target_class: ACTION
csv: ./sfpd/evaluation/predict_results_eval.csv
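The same metrics can also be recomputed by hand from the prediction output. A sketch on toy predictions (the `predicted` and `target` column names here are assumptions, not the exact batch-predict CSV schema):

```python
import pandas as pd

# Toy stand-in for ./sfpd/evaluation/predict_results_eval.csv.
preds = pd.DataFrame({
    'predicted': ['ACTION', 'NONE', 'NONE', 'ACTION'],
    'target':    ['ACTION', 'NONE', 'ACTION', 'ACTION'],
})

accuracy = (preds['predicted'] == preds['target']).mean()
confusion = pd.crosstab(preds['target'], preds['predicted'])
print(accuracy)  # 0.75
print(confusion)
```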
A model trained with a deep neural network is hard to inspect. We will use LIME to analyze the prediction results. The purpose is to inspect the model as a black box and see how important each feature is to the prediction results.
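The black-box idea behind this kind of inspection can be sketched with permutation importance (a simpler technique than LIME, which fits local linear surrogates): shuffle one feature at a time and measure how much a toy model's accuracy drops.

```python
import numpy as np

rng = np.random.RandomState(7)

# Toy data: feature 0 determines the label, feature 1 is pure noise.
X = rng.randn(500, 2)
y = (X[:, 0] > 0).astype(int)

def model(X):
    # Stand-in black-box model: thresholds feature 0.
    return (X[:, 0] > 0).astype(int)

base_acc = (model(X) == y).mean()  # 1.0 by construction
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # destroy feature j's information
    drop = base_acc - (model(Xp) == y).mean()
    print('feature', j, 'importance (accuracy drop):', round(drop, 3))
```

Feature 0 shows a large accuracy drop when shuffled; the noise feature shows none, matching its irrelevance to the model.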
In [64]:
%%ml predict
model: ./sfpd/train/model
data:
- 120207601,DRUG/NARCOTIC,SALEOFCONTROLLEDSUBSTANCE,INGLESIDE,-122.45116399,37.7455640063,9,1,43
- 30280818,ROBBERY,ROBBERYONTHESTREET STRONGARM,TENDERLOIN,-122.411778296,37.7839805593,14,7,67
- 81312907,OTHER OFFENSES,OBSTRUCTIONSONSTREETSSIDEWALKS,BAYVIEW,-122.387067978,37.7554460266,8,3,344
In [65]:
%%ml explain --overview
training_data: sfpd_3pcnt
model: ./sfpd/train/model
labels: ACTION
data: 120207601,DRUG/NARCOTIC,SALEOFCONTROLLEDSUBSTANCE,INGLESIDE,-122.45116399,37.7455640063,9,1,43
In [66]:
%%ml explain --overview
training_data: sfpd_3pcnt
model: ./sfpd/train/model
labels: ACTION
data: 30280818,ROBBERY,ROBBERYONTHESTREET STRONGARM,TENDERLOIN,-122.411778296,37.7839805593,14,7,67
In [68]:
%%ml explain --overview
training_data: sfpd_3pcnt
model: ./sfpd/train/model
labels: ACTION
data: 81312907,OTHER OFFENSES,OBSTRUCTIONSONSTREETSSIDEWALKS,BAYVIEW,-122.387067978,37.7554460266,8,3,344