Continuous Evaluation

This notebook demonstrates how to use Cloud AI Platform to execute continuous evaluation of a deployed machine learning model. You'll need to have a project set up with Google Cloud Platform.

Set up

Start by creating environment variables for your Google Cloud project and bucket. Also, import the libraries we'll need for this notebook.


In [18]:
# change these to try this notebook out
PROJECT = '<YOUR-GCS-BUCKET>'
BUCKET = '<YOUR-GCS-BUCKET>'

In [19]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['TFVERSION'] = '2.1'

In [131]:
import shutil

import pandas as pd
import tensorflow as tf

from google.cloud import bigquery
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow_hub import KerasLayer
from tensorflow.keras.layers import Dense, Input, Lambda
from tensorflow.keras.models import Model
print(tf.__version__)

%matplotlib inline


2.1.1

Train and deploy the model

For this notebook, we'll build a text classification model using the Hacker News dataset. Each training example consists of an article title and the article source. The model will be trained to classify a given article title as belonging to either nytimes, github or techcrunch.

Load the data


In [25]:
DATASET_NAME = "titles_full.csv"
COLUMNS = ['title', 'source']

titles_df = pd.read_csv(DATASET_NAME, header=None, names=COLUMNS)
titles_df.head()


Out[25]:
title source
0 attempts to fix hn comment problems techcrunch
1 stop trusting yourself nytimes
2 scrollability github
3 toward our 3d future techcrunch
4 open source mechanical split flap display github

We one-hot encode the label...


In [27]:
CLASSES = {
    'github': 0,
    'nytimes': 1,
    'techcrunch': 2
}
N_CLASSES = len(CLASSES)

In [28]:
def encode_labels(sources):
    classes = [CLASSES[source] for source in sources]
    one_hots = to_categorical(classes, num_classes=N_CLASSES)
    return one_hots

In [29]:
encode_labels(titles_df.source[:4])


Out[29]:
array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]], dtype=float32)

...and create a train/test split.


In [30]:
N_TRAIN = int(len(titles_df) * 0.80)

titles_train, sources_train = (
    titles_df.title[:N_TRAIN], titles_df.source[:N_TRAIN])

titles_valid, sources_valid = (
    titles_df.title[N_TRAIN:], titles_df.source[N_TRAIN:])

In [31]:
X_train, Y_train = titles_train.values, encode_labels(sources_train)
X_valid, Y_valid = titles_valid.values, encode_labels(sources_valid)

In [32]:
X_train[:3]


Out[32]:
array(['attempts to fix hn comment problems ', 'stop trusting yourself',
       'scrollability'], dtype=object)

Swivel Model

We'll build a simple text classification model using a Tensorflow Hub embedding module derived from Swivel. Swivel is an algorithm that essentially factorizes word co-occurrence matrices to create the words embeddings. TF-Hub hosts the pretrained gnews-swivel-20dim-with-oov 20-dimensional Swivel module.


In [37]:
SWIVEL = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1"
swivel_module = KerasLayer(SWIVEL, output_shape=[20], input_shape=[], dtype=tf.string, trainable=True)

The build_model function is written so that the TF Hub module can easily be exchanged with another module.


In [46]:
def build_model(hub_module, model_name):
    inputs = Input(shape=[], dtype=tf.string, name="text")
    module = hub_module(inputs)
    h1 = Dense(16, activation='relu', name="h1")(module)
    outputs = Dense(N_CLASSES, activation='softmax', name='outputs')(h1)
    model = Model(inputs=inputs, outputs=[outputs], name=model_name)
    
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

In [47]:
def train_and_evaluate(train_data, val_data, model, batch_size=5000):
    tf.random.set_seed(33)    
    X_train, Y_train = train_data

    history = model.fit(
        X_train, Y_train,
        epochs=100,
        batch_size=batch_size,
        validation_data=val_data,
        callbacks=[EarlyStopping()],
    )
    return history

In [51]:
txtcls_model = build_model(swivel_module, model_name='txtcls_swivel')

In [52]:
txtcls_model.summary()


Model: "txtcls_swivel"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
text (InputLayer)            [(None,)]                 0         
_________________________________________________________________
keras_layer_3 (KerasLayer)   (None, 20)                389380    
_________________________________________________________________
h1 (Dense)                   (None, 16)                336       
_________________________________________________________________
outputs (Dense)              (None, 3)                 51        
=================================================================
Total params: 389,767
Trainable params: 389,767
Non-trainable params: 0
_________________________________________________________________

Train and evaluation the model

With the model defined and data set up, next we'll train and evaluate the model.


In [43]:
# set up train and validation data
train_data = (X_train, Y_train)
val_data = (X_valid, Y_valid)

For training we'll call train_and_evaluate on txtcls_model.


In [45]:
txtcls_history = train_and_evaluate(train_data, val_data, txtcls_model)


Train on 76962 samples, validate on 19241 samples
Epoch 1/100
76962/76962 [==============================] - 3s 44us/sample - loss: 1.3122 - accuracy: 0.3642 - val_loss: 1.2068 - val_accuracy: 0.3918
Epoch 2/100
76962/76962 [==============================] - 1s 10us/sample - loss: 1.1478 - accuracy: 0.4204 - val_loss: 1.0969 - val_accuracy: 0.4471
Epoch 3/100
76962/76962 [==============================] - 1s 9us/sample - loss: 1.0601 - accuracy: 0.4737 - val_loss: 1.0330 - val_accuracy: 0.4915
Epoch 4/100
76962/76962 [==============================] - 1s 9us/sample - loss: 1.0019 - accuracy: 0.5167 - val_loss: 0.9832 - val_accuracy: 0.5313
Epoch 5/100
76962/76962 [==============================] - 1s 9us/sample - loss: 0.9533 - accuracy: 0.5522 - val_loss: 0.9384 - val_accuracy: 0.5624
Epoch 6/100
76962/76962 [==============================] - 1s 9us/sample - loss: 0.9086 - accuracy: 0.5828 - val_loss: 0.8963 - val_accuracy: 0.5905
Epoch 7/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.8663 - accuracy: 0.6088 - val_loss: 0.8560 - val_accuracy: 0.6154
Epoch 8/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.8256 - accuracy: 0.6325 - val_loss: 0.8177 - val_accuracy: 0.6347
Epoch 9/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.7868 - accuracy: 0.6542 - val_loss: 0.7817 - val_accuracy: 0.6544
Epoch 10/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.7503 - accuracy: 0.6733 - val_loss: 0.7481 - val_accuracy: 0.6710
Epoch 11/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.7161 - accuracy: 0.6903 - val_loss: 0.7175 - val_accuracy: 0.6871
Epoch 12/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.6847 - accuracy: 0.7056 - val_loss: 0.6897 - val_accuracy: 0.6999
Epoch 13/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.6559 - accuracy: 0.7196 - val_loss: 0.6649 - val_accuracy: 0.7110
Epoch 14/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.6297 - accuracy: 0.7320 - val_loss: 0.6430 - val_accuracy: 0.7218
Epoch 15/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.6060 - accuracy: 0.7433 - val_loss: 0.6238 - val_accuracy: 0.7318
Epoch 16/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.5846 - accuracy: 0.7521 - val_loss: 0.6067 - val_accuracy: 0.7381
Epoch 17/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.5651 - accuracy: 0.7618 - val_loss: 0.5916 - val_accuracy: 0.7464
Epoch 18/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.5475 - accuracy: 0.7696 - val_loss: 0.5783 - val_accuracy: 0.7517
Epoch 19/100
76962/76962 [==============================] - 1s 9us/sample - loss: 0.5314 - accuracy: 0.7770 - val_loss: 0.5664 - val_accuracy: 0.7570
Epoch 20/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.5167 - accuracy: 0.7840 - val_loss: 0.5558 - val_accuracy: 0.7629
Epoch 21/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.5032 - accuracy: 0.7900 - val_loss: 0.5464 - val_accuracy: 0.7678
Epoch 22/100
76962/76962 [==============================] - 1s 9us/sample - loss: 0.4909 - accuracy: 0.7954 - val_loss: 0.5381 - val_accuracy: 0.7720
Epoch 23/100
76962/76962 [==============================] - 1s 8us/sample - loss: 0.4795 - accuracy: 0.8010 - val_loss: 0.5307 - val_accuracy: 0.7761
Epoch 24/100
76962/76962 [==============================] - 1s 9us/sample - loss: 0.4690 - accuracy: 0.8052 - val_loss: 0.5241 - val_accuracy: 0.7787
Epoch 25/100
76962/76962 [==============================] - 1s 9us/sample - loss: 0.4593 - accuracy: 0.8095 - val_loss: 0.5183 - val_accuracy: 0.7826
Epoch 26/100
76962/76962 [==============================] - 1s 9us/sample - loss: 0.4503 - accuracy: 0.8138 - val_loss: 0.5131 - val_accuracy: 0.7850
Epoch 27/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.4419 - accuracy: 0.8178 - val_loss: 0.5085 - val_accuracy: 0.7866
Epoch 28/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.4341 - accuracy: 0.8214 - val_loss: 0.5043 - val_accuracy: 0.7885
Epoch 29/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.4269 - accuracy: 0.8252 - val_loss: 0.5006 - val_accuracy: 0.7902
Epoch 30/100
76962/76962 [==============================] - 1s 12us/sample - loss: 0.4201 - accuracy: 0.8278 - val_loss: 0.4976 - val_accuracy: 0.7916
Epoch 31/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.4137 - accuracy: 0.8311 - val_loss: 0.4947 - val_accuracy: 0.7932
Epoch 32/100
76962/76962 [==============================] - 1s 12us/sample - loss: 0.4077 - accuracy: 0.8334 - val_loss: 0.4924 - val_accuracy: 0.7948
Epoch 33/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.4021 - accuracy: 0.8363 - val_loss: 0.4902 - val_accuracy: 0.7957
Epoch 34/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.3968 - accuracy: 0.8390 - val_loss: 0.4885 - val_accuracy: 0.7973
Epoch 35/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.3917 - accuracy: 0.8409 - val_loss: 0.4869 - val_accuracy: 0.7979
Epoch 36/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.3869 - accuracy: 0.8434 - val_loss: 0.4858 - val_accuracy: 0.7987
Epoch 37/100
76962/76962 [==============================] - 1s 12us/sample - loss: 0.3824 - accuracy: 0.8455 - val_loss: 0.4844 - val_accuracy: 0.7997
Epoch 38/100
76962/76962 [==============================] - 1s 13us/sample - loss: 0.3781 - accuracy: 0.8475 - val_loss: 0.4837 - val_accuracy: 0.7999
Epoch 39/100
76962/76962 [==============================] - 1s 11us/sample - loss: 0.3740 - accuracy: 0.8495 - val_loss: 0.4832 - val_accuracy: 0.8013
Epoch 40/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.3701 - accuracy: 0.8513 - val_loss: 0.4830 - val_accuracy: 0.8014
Epoch 41/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.3663 - accuracy: 0.8528 - val_loss: 0.4827 - val_accuracy: 0.8027
Epoch 42/100
76962/76962 [==============================] - 1s 10us/sample - loss: 0.3627 - accuracy: 0.8547 - val_loss: 0.4824 - val_accuracy: 0.8036
Epoch 43/100
76962/76962 [==============================] - 1s 9us/sample - loss: 0.3594 - accuracy: 0.8562 - val_loss: 0.4826 - val_accuracy: 0.8032

In [53]:
history = txtcls_history
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()


Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3da455f150>

Calling predicition from model head produces output from final dense layer. This final layer is used to compute categorical cross-entropy when training.


In [54]:
txtcls_model.predict(x=["YouTube introduces Video Chapters to make it easier to navigate longer videos"])


Out[54]:
array([[0.25257826, 0.5127273 , 0.23469436]], dtype=float32)

We can save the model artifacts in the local directory called ./txtcls_swivel.


In [55]:
tf.saved_model.save(txtcls_model, './txtcls_swivel/')


WARNING:tensorflow:From /home/jupyter/.local/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1786: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /home/jupyter/.local/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1786: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: ./txtcls_swivel/assets
INFO:tensorflow:Assets written to: ./txtcls_swivel/assets

....and examine the model's serving default signature. As expected the model takes as input a text string (e.g. an article title) and retrns a 3-dimensional vector of floats (i.e. the softmax output layer).


In [57]:
!saved_model_cli show \
 --tag_set serve \
 --signature_def serving_default \
 --dir ./txtcls_swivel/


2020-06-26 02:27:42.046042: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-06-26 02:27:42.049327: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
The given SavedModel SignatureDef contains the following input(s):
  inputs['text'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: serving_default_text:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['outputs'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 3)
      name: StatefulPartitionedCall_2:0
Method name is: tensorflow/serving/predict

To simplify the returned predictions, we'll modify the model signature so that the model outputs the predicted article source (either nytimes, techcrunch, or github) rather than the final softmax layer. We'll also return the 'confidence' of the model's prediction. This will be the softmax value corresonding to the predicted article source.


In [59]:
@tf.function(input_signature=[tf.TensorSpec([None], dtype=tf.string)])
def source_name(text):
    labels = tf.constant(['github', 'techcrunch', 'nytimes'], dtype=tf.string)
    probs = txtcls_model(text, training=False)
    indices = tf.argmax(probs, axis=1)
    pred_source = tf.gather(params=labels, indices=indices)
    pred_confidence = tf.reduce_max(probs, axis=1)
    
    return {'source': pred_source,
            'confidence': pred_confidence}

Now, we'll re-save the new Swivel model that has this updated model signature by referencing the source_name function for the model's serving_default.


In [60]:
shutil.rmtree('./txtcls_swivel', ignore_errors=True)
txtcls_model.save('./txtcls_swivel', signatures={'serving_default': source_name})


INFO:tensorflow:Assets written to: ./txtcls_swivel/assets
INFO:tensorflow:Assets written to: ./txtcls_swivel/assets

Examine the model signature to confirm the changes:


In [61]:
!saved_model_cli show \
 --tag_set serve \
 --signature_def serving_default \
 --dir ./txtcls_model/


2020-06-26 02:32:11.529183: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-06-26 02:32:11.531565: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
The given SavedModel SignatureDef contains the following input(s):
  inputs['text'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: serving_default_text:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['confidence'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1)
      name: StatefulPartitionedCall_2:0
  outputs['source'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: StatefulPartitionedCall_2:1
Method name is: tensorflow/serving/predict

Now when we call predictions using the updated serving input function, the model will return the predicted article source as a readable string, and the model's confidence for that prediction.


In [66]:
title1 = "House Passes Sweeping Policing Bill Targeting Racial Bias and Use of Force"
title2 = "YouTube introduces Video Chapters to make it easier to navigate longer videos"
title3 = "A native Mac app wrapper for WhatsApp Web"

restored = tf.keras.models.load_model('./txtcls_swivel')
infer = restored.signatures['serving_default']
outputs = infer(text=tf.constant([title1, title2, title3]))

In [67]:
print(outputs['source'].numpy())
print(outputs['confidence'].numpy())


[b'nytimes' b'techcrunch' b'techcrunch']
[0.52479076 0.5127273  0.48214597]

Deploy the model for online serving

Once the model is trained and the assets saved, deploying the model to GCP is straightforward. After some time you should be able to see your deployed model and its version on the model page of GCP console.


In [69]:
%%bash
MODEL_NAME="txtcls"
MODEL_VERSION="swivel"
MODEL_LOCATION="./txtcls_swivel/"

gcloud ai-platform versions create ${MODEL_VERSION} \
--model ${MODEL_NAME} \
--origin ${MODEL_LOCATION} \
--staging-bucket gs://${BUCKET} \
--runtime-version=2.1


Creating version (this might take a few minutes)......
......................................................................................................................................................................................................................................................................................................done.

Set up the Evaluation job on CAIP

Now that the model is deployed, go to Cloud AI Platform to see the model version you've deployed and set up an evaluation job by clicking on the button called "Create Evaluation Job". You will be asked to provide some relevant information:

  • Job description: txtcls_swivel_eval
  • Model objective: text classification
  • Classification type: single-label classification
  • Prediction label file path for the annotation specification set: When you create an evaluation job on CAIP, you must specify a CSV file that defines your annotation specification set. This file must have one row for every possible label your model outputs during prediction. Each row should be a comma-separated pair containing the label and a description of the label: label-name,description
  • Daily sample percentage: We'll set this to 100% so that all online predicitons are captured for evaluation.
  • BigQuery table to house online prediction requests: We'll use the BQ dataset and table txtcls_eval.swivel. If you enter a BigQuery table that doesn’t exist, one with that name will be created with the correct schema.
  • Prediction input
    • Data key: this is The key for the raw prediction data. From examining our deployed model signature, the input data key is text.
    • Data reference key: this is for image models, so we can ignore
  • Prediction output
    • Prediction labels key: This is the prediction key which contains the predicted label (i.e. the article source). For our model, the label key is source.
    • Prediction score key: This is the prediction key which contains the predicted scores (i.e. the model confidence). For our model, the score key is confidence.
  • Ground-truth method: Check the box that indicates we will provide our own labels, and not use a Human data labeling service.

Once the evaluation job is set up, the table will be made in BigQuery to capture the online prediction requests.


In [70]:
%load_ext google.cloud.bigquery


The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery

In [99]:
%%bigquery --project $PROJECT
SELECT * FROM `txtcls_eval.swivel`


Out[99]:
model model_version time raw_data raw_prediction groundtruth

Now, every time this model version receives an online prediction request, this information will be captured and stored in the BQ table. Note, this happens everytime because we set the sampling proportion to 100%.

Send prediction requests to your model

Here are some article titles and their groundtruth sources that we can test with prediciton.

title groundtruth
YouTube introduces Video Chapters to make it easier to navigate longer videos techcrunch
A Filmmaker Put Away for Tax Fraud Takes Us Inside a British Prison nytimes
A native Mac app wrapper for WhatsApp Web github
Astronauts Dock With Space Station After Historic SpaceX Launch nytimes
House Passes Sweeping Policing Bill Targeting Racial Bias and Use of Force nytimes
Scrollability github
iOS 14 lets deaf users set alerts for important sounds, among other clever accessibility perks techcrunch

In [100]:
%%writefile input.json
{"text": "YouTube introduces Video Chapters to make it easier to navigate longer videos"}


Overwriting input.json

In [101]:
!gcloud ai-platform predict \
  --model txtcls \
  --json-instances input.json \
  --version swivel


CONFIDENCE  SOURCE
0.512727    techcrunch

In [102]:
%%writefile input.json
{"text": "A Filmmaker Put Away for Tax Fraud Takes Us Inside a British Prison"}


Overwriting input.json

In [103]:
!gcloud ai-platform predict \
  --model txtcls \
  --json-instances input.json \
  --version swivel


CONFIDENCE  SOURCE
0.429436    techcrunch

In [104]:
%%writefile input.json
{"text": "A native Mac app wrapper for WhatsApp Web"}


Overwriting input.json

In [105]:
!gcloud ai-platform predict \
  --model txtcls \
  --json-instances input.json \
  --version swivel


CONFIDENCE  SOURCE
0.482146    techcrunch

In [106]:
%%writefile input.json
{"text": "Astronauts Dock With Space Station After Historic SpaceX Launch"}


Overwriting input.json

In [107]:
!gcloud ai-platform predict \
  --model txtcls \
  --json-instances input.json \
  --version swivel


CONFIDENCE  SOURCE
0.516605    techcrunch

In [108]:
%%writefile input.json
{"text": "House Passes Sweeping Policing Bill Targeting Racial Bias and Use of Force"}


Overwriting input.json

In [109]:
!gcloud ai-platform predict \
  --model txtcls \
  --json-instances input.json \
  --version swivel


CONFIDENCE  SOURCE
0.524791    nytimes

In [110]:
%%writefile input.json
{"text": "Scrollability"}


Overwriting input.json

In [111]:
!gcloud ai-platform predict \
  --model txtcls \
  --json-instances input.json \
  --version swivel


CONFIDENCE  SOURCE
0.510411    techcrunch

In [112]:
%%writefile input.json
{"text": "iOS 14 lets deaf users set alerts for important sounds, among other clever accessibility perks"}


Overwriting input.json

In [113]:
!gcloud ai-platform predict \
  --model txtcls \
  --json-instances input.json \
  --version swivel


CONFIDENCE  SOURCE
0.484371    nytimes

Summarizing the results from our model:

title groundtruth predicted
YouTube introduces Video Chapters to make it easier to navigate longer videos techcrunch techcrunch
A Filmmaker Put Away for Tax Fraud Takes Us Inside a British Prison nytimes techcrunch
A native Mac app wrapper for WhatsApp Web github techcrunch
Astronauts Dock With Space Station After Historic SpaceX Launch nytimes techcrunch
House Passes Sweeping Policing Bill Targeting Racial Bias and Use of Force nytimes nytimes
Scrollability github techcrunch
iOS 14 lets deaf users set alerts for important sounds, among other clever accessibility perks techcrunch nytimes

In [115]:
%%bigquery --project $PROJECT
SELECT * FROM `txtcls_eval.swivel`


Out[115]:
model model_version time raw_data raw_prediction groundtruth
0 txtcls swivel 2020-06-26 03:15:21+00:00 {"instances": [{"text": "House Passes Sweeping... {"predictions": [{"confidence": 0.524790823459... None
1 txtcls swivel 2020-06-26 03:15:26+00:00 {"instances": [{"text": "iOS 14 lets deaf user... {"predictions": [{"confidence": 0.484371215105... None
2 txtcls swivel 2020-06-26 03:15:09+00:00 {"instances": [{"text": "YouTube introduces Vi... {"predictions": [{"confidence": 0.512727320194... None
3 txtcls swivel 2020-06-26 03:15:12+00:00 {"instances": [{"text": "A Filmmaker Put Away ... {"predictions": [{"confidence": 0.429436147212... None
4 txtcls swivel 2020-06-26 03:15:23+00:00 {"instances": [{"text": "Scrollability"}]} {"predictions": [{"confidence": 0.510410726070... None
5 txtcls swivel 2020-06-26 03:15:15+00:00 {"instances": [{"text": "A native Mac app wrap... {"predictions": [{"confidence": 0.482146084308... None
6 txtcls swivel 2020-06-26 03:15:17+00:00 {"instances": [{"text": "Astronauts Dock With ... {"predictions": [{"confidence": 0.516605079174... None

Provide the ground truth for the raw prediction input

Notice the groundtruth is missing. We'll update the evaluation table to contain the ground truth.


In [117]:
%%bigquery --project $PROJECT
UPDATE `txtcls_eval.swivel`
SET 
    groundtruth = '{"predictions": [{"source": "techcrunch"}]}'
WHERE
    raw_data = '{"instances": [{"text": "YouTube introduces Video Chapters to make it easier to navigate longer videos"}]}';


Out[117]:

In [118]:
%%bigquery --project $PROJECT
UPDATE `txtcls_eval.swivel`
SET 
    groundtruth = '{"predictions": [{"source": "nytimes"}]}'
WHERE
    raw_data = '{"instances": [{"text": "A Filmmaker Put Away for Tax Fraud Takes Us Inside a British Prison"}]}';


Out[118]:

In [125]:
%%bigquery --project $PROJECT
UPDATE `txtcls_eval.swivel`
SET 
    groundtruth = '{"predictions": [{"source": "github"}]}'
WHERE
    raw_data = '{"instances": [{"text": "A native Mac app wrapper for WhatsApp Web"}]}';


Out[125]:

In [119]:
%%bigquery --project $PROJECT
UPDATE `txtcls_eval.swivel`
SET 
    groundtruth = '{"predictions": [{"source": "nytimes"}]}'
WHERE
    raw_data = '{"instances": [{"text": "Astronauts Dock With Space Station After Historic SpaceX Launch"}]}';


Out[119]:

In [120]:
%%bigquery --project $PROJECT
UPDATE `txtcls_eval.swivel`
SET 
    groundtruth = '{"predictions": [{"source": "nytimes"}]}'
WHERE
    raw_data = '{"instances": [{"text": "House Passes Sweeping Policing Bill Targeting Racial Bias and Use of Force"}]}';


Out[120]:

In [121]:
%%bigquery --project $PROJECT
UPDATE `txtcls_eval.swivel`
SET 
    groundtruth = '{"predictions": [{"source": "github"}]}'
WHERE
    raw_data = '{"instances": [{"text": "Scrollability"}]}';


Out[121]:

In [122]:
%%bigquery --project $PROJECT
UPDATE `txtcls_eval.swivel`
SET 
    groundtruth = '{"predictions": [{"source": "techcrunch"}]}'
WHERE
    raw_data = '{"instances": [{"text": "iOS 14 lets deaf users set alerts for important sounds, among other clever accessibility perks"}]}';


Out[122]:

We can confirm that the ground truch has been properly added to the table.


In [126]:
%%bigquery --project $PROJECT
SELECT * FROM `txtcls_eval.swivel`


Out[126]:
model model_version time raw_data raw_prediction groundtruth
0 txtcls swivel 2020-06-26 03:15:15+00:00 {"instances": [{"text": "A native Mac app wrap... {"predictions": [{"confidence": 0.482146084308... {"predictions": [{"source": "github"}]}
1 txtcls swivel 2020-06-26 03:15:23+00:00 {"instances": [{"text": "Scrollability"}]} {"predictions": [{"confidence": 0.510410726070... {"predictions": [{"source": "github"}]}
2 txtcls swivel 2020-06-26 03:15:17+00:00 {"instances": [{"text": "Astronauts Dock With ... {"predictions": [{"confidence": 0.516605079174... {"predictions": [{"source": "nytimes"}]}
3 txtcls swivel 2020-06-26 03:15:12+00:00 {"instances": [{"text": "A Filmmaker Put Away ... {"predictions": [{"confidence": 0.429436147212... {"predictions": [{"source": "nytimes"}]}
4 txtcls swivel 2020-06-26 03:15:21+00:00 {"instances": [{"text": "House Passes Sweeping... {"predictions": [{"confidence": 0.524790823459... {"predictions": [{"source": "nytimes"}]}
5 txtcls swivel 2020-06-26 03:15:09+00:00 {"instances": [{"text": "YouTube introduces Vi... {"predictions": [{"confidence": 0.512727320194... {"predictions": [{"source": "techcrunch"}]}
6 txtcls swivel 2020-06-26 03:15:26+00:00 {"instances": [{"text": "iOS 14 lets deaf user... {"predictions": [{"confidence": 0.484371215105... {"predictions": [{"source": "techcrunch"}]}

Compute evaluation metrics

With the raw prediction input, the model output and the groundtruth in one place, we can evaluation how our model performs. And how the model performs across various aspects (e.g. over time, different model versions, different labels, etc)


In [145]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import classification_report

Using regex we can extract the model predictions, to have an easier to read format:


In [128]:
%%bigquery --project $PROJECT
SELECT
  model,
  model_version,
  time,
  REGEXP_EXTRACT(raw_data, r'.*"text": "(.*)"') AS text,
  REGEXP_EXTRACT(raw_prediction, r'.*"source": "(.*?)"') AS prediction,
  REGEXP_EXTRACT(raw_prediction, r'.*"confidence": (0.\d{2}).*') AS confidence,
  REGEXP_EXTRACT(groundtruth, r'.*"source": "(.*?)"') AS groundtruth,
FROM
  `txtcls_eval.swivel`


Out[128]:
model model_version time text prediction confidence groundtruth
0 txtcls swivel 2020-06-26 03:15:12+00:00 A Filmmaker Put Away for Tax Fraud Takes Us In... techcrunch 0.42 nytimes
1 txtcls swivel 2020-06-26 03:15:21+00:00 House Passes Sweeping Policing Bill Targeting ... nytimes 0.52 nytimes
2 txtcls swivel 2020-06-26 03:15:23+00:00 Scrollability techcrunch 0.51 github
3 txtcls swivel 2020-06-26 03:15:09+00:00 YouTube introduces Video Chapters to make it e... techcrunch 0.51 techcrunch
4 txtcls swivel 2020-06-26 03:15:26+00:00 iOS 14 lets deaf users set alerts for importan... nytimes 0.48 techcrunch
5 txtcls swivel 2020-06-26 03:15:17+00:00 Astronauts Dock With Space Station After Histo... techcrunch 0.51 nytimes
6 txtcls swivel 2020-06-26 03:15:15+00:00 A native Mac app wrapper for WhatsApp Web techcrunch 0.48 github

In [132]:
query = '''
SELECT
  model,
  model_version,
  time,
  REGEXP_EXTRACT(raw_data, r'.*"text": "(.*)"') AS text,
  REGEXP_EXTRACT(raw_prediction, r'.*"source": "(.*?)"') AS prediction,
  REGEXP_EXTRACT(raw_prediction, r'.*"confidence": (0.\d{2}).*') AS confidence,
  REGEXP_EXTRACT(groundtruth, r'.*"source": "(.*?)"') AS groundtruth,
FROM
  `txtcls_eval.swivel`
'''

client = bigquery.Client()
df_results = client.query(query).to_dataframe()

In [133]:
df_results.head(20)


Out[133]:
model model_version time text prediction confidence groundtruth
0 txtcls swivel 2020-06-26 03:15:12+00:00 A Filmmaker Put Away for Tax Fraud Takes Us In... techcrunch 0.42 nytimes
1 txtcls swivel 2020-06-26 03:15:21+00:00 House Passes Sweeping Policing Bill Targeting ... nytimes 0.52 nytimes
2 txtcls swivel 2020-06-26 03:15:15+00:00 A native Mac app wrapper for WhatsApp Web techcrunch 0.48 github
3 txtcls swivel 2020-06-26 03:15:23+00:00 Scrollability techcrunch 0.51 github
4 txtcls swivel 2020-06-26 03:15:09+00:00 YouTube introduces Video Chapters to make it e... techcrunch 0.51 techcrunch
5 txtcls swivel 2020-06-26 03:15:17+00:00 Astronauts Dock With Space Station After Histo... techcrunch 0.51 nytimes
6 txtcls swivel 2020-06-26 03:15:26+00:00 iOS 14 lets deaf users set alerts for importan... nytimes 0.48 techcrunch

In [134]:
prediction = list(df_results.prediction)
groundtruth = list(df_results.groundtruth)

In [135]:
precision, recall, fscore, support = score(groundtruth, prediction)


/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

In [140]:
from tabulate import tabulate
sources = list(CLASSES.keys())
results = list(zip(sources, precision, recall, fscore, support))
print(tabulate(results, headers = ['source', 'precision', 'recall', 'fscore', 'support'],
         tablefmt='orgtbl'))


| source     |   precision |   recall |   fscore |   support |
|------------+-------------+----------+----------+-----------|
| github     |         0   | 0        | 0        |         2 |
| nytimes    |         0.5 | 0.333333 | 0.4      |         3 |
| techcrunch |         0.2 | 0.5      | 0.285714 |         2 |

Or a full classification report from the sklearn library:


In [142]:
print(classification_report(y_true=groundtruth, y_pred=prediction))


              precision    recall  f1-score   support

      github       0.00      0.00      0.00         2
     nytimes       0.50      0.33      0.40         3
  techcrunch       0.20      0.50      0.29         2

    accuracy                           0.29         7
   macro avg       0.23      0.28      0.23         7
weighted avg       0.27      0.29      0.25         7

Can also examine a confusion matrix:


In [144]:
cm = confusion_matrix(groundtruth, prediction, labels=sources)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax, cmap="Blues")

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix') 
ax.xaxis.set_ticklabels(sources)
ax.yaxis.set_ticklabels(sources)
plt.savefig("./txtcls_cm.png")


Examine eval metrics by model version or timestamp

By specifying the same evaluation table, two different model versions can be evaluated. Also, since the timestamp is captured, it is straightforward to evaluation model performance over time.


In [152]:
now = pd.Timestamp.now(tz='UTC')
one_week_ago = now - pd.DateOffset(weeks=1)
one_month_ago = now - pd.DateOffset(months=1)

In [156]:
df_prev_week = df_results[df_results.time > one_week_ago]
df_prev_month = df_results[df_results.time > one_month_ago]

In [157]:
df_prev_month


Out[157]:
model model_version time text prediction confidence groundtruth
0 txtcls swivel 2020-06-26 03:15:12+00:00 A Filmmaker Put Away for Tax Fraud Takes Us In... techcrunch 0.42 nytimes
1 txtcls swivel 2020-06-26 03:15:21+00:00 House Passes Sweeping Policing Bill Targeting ... nytimes 0.52 nytimes
2 txtcls swivel 2020-06-26 03:15:15+00:00 A native Mac app wrapper for WhatsApp Web techcrunch 0.48 github
3 txtcls swivel 2020-06-26 03:15:23+00:00 Scrollability techcrunch 0.51 github
4 txtcls swivel 2020-06-26 03:15:09+00:00 YouTube introduces Video Chapters to make it e... techcrunch 0.51 techcrunch
5 txtcls swivel 2020-06-26 03:15:17+00:00 Astronauts Dock With Space Station After Histo... techcrunch 0.51 nytimes
6 txtcls swivel 2020-06-26 03:15:26+00:00 iOS 14 lets deaf users set alerts for importan... nytimes 0.48 techcrunch

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License