This notebook accompanies the article "How to read BigQuery data from TensorFlow 2.0 efficiently"
The example problem is to find credit card fraud from the dataset published in: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015 and available in BigQuery at
bigquery-public-data.ml_datasets.ulb_fraud_detection
In order to compare things, we will do a simple logistic regression in BigQuery ML.
Note that we are using all the columns in the dataset as predictors (except for the Time and Class columns). The Time column is used to split the dataset 80:20 with the first 80% used for training and the last 20% used for evaluation. We will also have BigQuery ML automatically balance the weights.
Because the Amount column has a huge range, we take the log of it in preprocessing.
In [ ]:
%%bash
# create output dataset
bq mk advdata
In [ ]:
%%bigquery
CREATE OR REPLACE MODEL advdata.ulb_fraud_detection
TRANSFORM(
* EXCEPT(Amount),
SAFE.LOG(Amount) AS log_amount
)
OPTIONS(
INPUT_LABEL_COLS=['class'],
AUTO_CLASS_WEIGHTS = TRUE,
DATA_SPLIT_METHOD='seq',
DATA_SPLIT_COL='Time',
MODEL_TYPE='logistic_reg'
) AS
SELECT
*
FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`
In [10]:
%%bigquery
SELECT * FROM ML.EVALUATE(MODEL advdata.ulb_fraud_detection)
Out[10]:
In [15]:
%%bigquery
SELECT predicted_class_probs, Class
FROM ML.PREDICT( MODEL advdata.ulb_fraud_detection,
(SELECT * FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection` WHERE Time = 85285.0)
)
Out[15]:
In [11]:
%%bigquery
WITH counts AS (
SELECT
APPROX_QUANTILES(Time, 5)[OFFSET(4)] AS train_cutoff
, COUNTIF(CLASS > 0) AS pos
, COUNTIF(CLASS = 0) AS neg
FROM `bigquery-public-data`.ml_datasets.ulb_fraud_detection
)
SELECT
train_cutoff
, SAFE.LOG(SAFE_DIVIDE(pos,neg)) AS output_bias
, 0.5*SAFE_DIVIDE(pos + neg, pos) AS weight_pos
, 0.5*SAFE_DIVIDE(pos + neg, neg) AS weight_neg
FROM counts
Out[11]:
The time cutoff is 144803 and the Keras model's output bias needs to be set at -6.36 The class weights need to be 289.4 and 0.5
In [1]:
import tensorflow as tf
from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession
def features_and_labels(features):
label = features.pop('Class') # this is what we will train for
return features, label
def read_dataset(client, row_restriction, batch_size=2048):
GCP_PROJECT_ID='ai-analytics-solutions' # CHANGE
COL_NAMES = ['Time', 'Amount', 'Class'] + ['V{}'.format(i) for i in range(1,29)]
COL_TYPES = [dtypes.float64, dtypes.float64, dtypes.int64] + [dtypes.float64 for i in range(1,29)]
DATASET_GCP_PROJECT_ID, DATASET_ID, TABLE_ID, = 'bigquery-public-data.ml_datasets.ulb_fraud_detection'.split('.')
bqsession = client.read_session(
"projects/" + GCP_PROJECT_ID,
DATASET_GCP_PROJECT_ID, TABLE_ID, DATASET_ID,
COL_NAMES, COL_TYPES,
requested_streams=2,
row_restriction=row_restriction)
dataset = bqsession.parallel_read_rows()
return dataset.prefetch(1).map(features_and_labels).shuffle(batch_size*10).batch(batch_size)
client = BigQueryClient()
In [2]:
temp_df = read_dataset(client, 'Time <= 144803', 2)
for row in temp_df:
print(row)
break
In [3]:
train_df = read_dataset(client, 'Time <= 144803', 2048)
eval_df = read_dataset(client, 'Time > 144803', 2048)
Create Keras model
In [7]:
metrics = [
tf.keras.metrics.BinaryAccuracy(name='accuracy'),
tf.keras.metrics.Precision(name='precision'),
tf.keras.metrics.Recall(name='recall'),
tf.keras.metrics.AUC(name='roc_auc'),
]
# create inputs, and pass them into appropriate types of feature columns (here, everything is numeric)
inputs = {
'V{}'.format(i) : tf.keras.layers.Input(name='V{}'.format(i), shape=(), dtype='float64') for i in range(1, 29)
}
inputs['Amount'] = tf.keras.layers.Input(name='Amount', shape=(), dtype='float64')
input_fc = [tf.feature_column.numeric_column(colname) for colname in inputs.keys()]
# transformations. only the Amount is transformed
transformed = inputs.copy()
transformed['Amount'] = tf.keras.layers.Lambda(
lambda x: tf.math.log(tf.math.maximum(x, 0.01)), name='log_amount')(inputs['Amount'])
input_layer = tf.keras.layers.DenseFeatures(input_fc, name='inputs')(transformed)
# Deep learning model
d1 = tf.keras.layers.Dense(16, activation='relu', name='d1')(input_layer)
d2 = tf.keras.layers.Dropout(0.25, name='d2')(d1)
d3 = tf.keras.layers.Dense(16, activation='relu', name='d3')(d2)
output = tf.keras.layers.Dense(1, activation='sigmoid', name='d4', bias_initializer=tf.keras.initializers.Constant())(d3)
model = tf.keras.Model(inputs, output)
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=metrics)
tf.keras.utils.plot_model(model, rankdir='LR')
Out[7]:
In [9]:
class_weight = {0: 0.5, 1: 289.4}
history = model.fit(train_df, validation_data=eval_df, epochs=20, class_weight=class_weight)
In [23]:
import matplotlib.pyplot as plt
plt.plot(history.history['val_roc_auc']);
plt.xlabel('Epoch');
plt.ylabel('AUC');
In [14]:
BUCKET='ai-analytics-solutions-kfpdemo' # CHANGE TO SOMETHING THAT YOU OWN
model.save('gs://{}/bqexample/export'.format(BUCKET))
In [ ]:
%%bigquery
CREATE OR REPLACE MODEL advdata.keras_fraud_detection
OPTIONS(model_type='tensorflow', model_path='gs://ai-analytics-solutions-kfpdemo/bqexample/export/*')
Now predict with this model (the reason it's called 'd4' is because the output node of my Keras model was called 'd4'). To get probabilities, etc. we'd have to add the corresponding outputs to the Keras model.
In [19]:
%%bigquery
SELECT d4, Class
FROM ML.PREDICT( MODEL advdata.keras_fraud_detection,
(SELECT * FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection` WHERE Time = 85285.0)
)
Out[19]:
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.