Copyright 2018 Google LLC.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This notebook evaluates a list of deployed models on two dimensions: overall performance and unintended bias.
It takes the following steps:
1. Describe the deployed CMLE models and their serving signature.
2. Load an evaluation dataset through an input_fn and the Dataset API.
3. Run the models to add predictions to the data.
4. Compute performance metrics (per-class AUC and accuracy).
5. Compute and plot bias metrics per identity subgroup.
In [1]:
%load_ext autoreload
In [2]:
%autoreload 2
In [3]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import getpass
from IPython.display import display
import json
import nltk
import numpy as np
import pandas as pd
import pkg_resources
import os
import random
import re
import seaborn as sns
import tensorflow as tf
from tensorflow.python.lib.io import file_io
In [4]:
#from google.colab import auth
#auth.authenticate_user()
In [5]:
#!pip install -U -q git+https://github.com/conversationai/unintended-ml-bias-analysis
In [6]:
from unintended_ml_bias import model_bias_analysis
In [7]:
import input_fn_example
from utils_export.dataset import Dataset, Model
from utils_export import utils_cloudml
from utils_export import utils_tfrecords
In [8]:
os.environ['GCS_READ_CACHE_MAX_SIZE_MB'] = '0'  # Disable the GCS read cache for faster file access; see https://github.com/tensorflow/tensorflow/issues/15530
In [9]:
nltk.download('punkt')
Out[9]:
In [10]:
# User inputs
PROJECT_NAME = 'conversationai-models'
An important user input is the description of the deployed models being evaluated.
1. Define which models will be used: $MODEL_NAMES lists the deployed model names (format: "model_name:version").
2. Define the model signature.
Currently, the Dataset API does not detect the signature of a CMLE model, so this information is given by a Model instance (see the cells below).
You need to describe:
- feature_keys_spec: a dictionary describing the names of the input fields and their types.
- prediction_keys: the name of the prediction field in the model output.
- example_key: a unique identifier for each sentence, generated by the Dataset API (i.e. your input data does not need to contain this field).
In [11]:
# User inputs:
MODEL_NAMES = [
    'tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738',  # ??
    'tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132748',  # ??
    'tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132820',  # ??
    'tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828',  # ??
]
In [12]:
# User inputs: Model description (see above for more info).
TEXT_FEATURE_NAME = 'tokens'  # Input defined in the serving function called in run.py (arg: `text_feature_name`).
SENTENCE_KEY = 'comment_key'  # Input key defined in the serving function called in run.py (arg: `example_key_name`).
#LABEL_NAME_PREDICTION_MODEL = 'scores' # Output prediction: typically $label_name/logistic
LABEL_NAME_PREDICTION_MODEL = 'probabilities' # Output prediction: typically $label_name/logistic
In [13]:
model_input_spec = {
    TEXT_FEATURE_NAME: utils_tfrecords.EncodingFeatureSpec.LIST_STRING}  # The library will use this spec automatically.
model = Model(
    feature_keys_spec=model_input_spec,
    prediction_keys=LABEL_NAME_PREDICTION_MODEL,
    example_key=SENTENCE_KEY,
    model_names=MODEL_NAMES,
    project_name=PROJECT_NAME)
In [14]:
def tokenizer(text, lowercase=True):
    """Converts text to a list of words.

    Args:
      text: piece of text to tokenize (utf-8 encoded string).
      lowercase: whether to lowercase the words (boolean).

    Returns:
      A list of strings (words).
    """
    words = nltk.word_tokenize(text.decode('utf-8'))
    if lowercase:
        words = [w.lower() for w in words]
    return words
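As a quick sanity check (not in the original notebook), the tokenizer can be called directly; note that it expects utf-8 encoded bytes, matching the decode('utf-8') call above. The exact output depends on nltk's punkt tokenizer, but roughly:
In [ ]:
# Example usage of the tokenizer defined above (input is utf-8 encoded bytes).
tokenizer(b'Hello, World!')  # roughly ['hello', ',', 'world', '!']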
We first need to define the input_fn functions that will be fed to the Dataset API.
An input_fn must meet the requirements imposed by the Dataset API; a rough, hedged sketch is given below.
We will define two different input_fn (one for performance, one for bias). The bias input_fn should also contain identity information.
Note: You can use ANY input_fn that meets those requirements. You can find a few examples of input_fn in the file input_fn_example.py (for the toxicity and civil_comments datasets).
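The exact input_fn contract is defined by utils_export.dataset; as a purely illustrative sketch, the cell below assumes the signature input_fn(max_n_examples, random_filter_keep_rate), mirroring the arguments passed to load_data further down, and a pandas DataFrame return value containing the model's feature column and a label. The file 'my_eval_data.csv' and its 'text'/'label' columns are hypothetical.
In [ ]:
# Hedged sketch only: the real contract is defined by utils_export.dataset.Dataset
# and by the examples in input_fn_example.py.
def example_input_fn(max_n_examples, random_filter_keep_rate=1.0):
    # Hypothetical CSV with 'text' and 'label' columns; replace with your data source.
    df = pd.read_csv('my_eval_data.csv')
    # Randomly subsample rows, then cap at max_n_examples.
    df = df[np.random.rand(len(df)) <= random_filter_keep_rate]
    df = df.head(max_n_examples)
    # The model expects pre-tokenized text under TEXT_FEATURE_NAME.
    df[TEXT_FEATURE_NAME] = df['text'].apply(tokenizer)
    return df[[TEXT_FEATURE_NAME, 'label']]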
In [15]:
# User inputs: Choose which one you want to use OR create your own!
INPUT_FN_PERFORMANCE = input_fn_example.create_input_fn_biasbios(
tokenizer,
model_input_comment_field=TEXT_FEATURE_NAME,
)
In [16]:
# User inputs
SIZE_PERFORMANCE_DATA_SET = 10000
In [17]:
# Pattern for path of tf_records
PERFORMANCE_DATASET_DIR = os.path.join(
'gs://conversationai-models/',
getpass.getuser(),
'tfrecords',
'performance_dataset_dir')
print(PERFORMANCE_DATASET_DIR)
In [18]:
dataset_performance = Dataset(INPUT_FN_PERFORMANCE, PERFORMANCE_DATASET_DIR)
random.seed(2018) # Need to set seed before loading data to be able to reload same data in the future
dataset_performance.load_data(SIZE_PERFORMANCE_DATA_SET, random_filter_keep_rate=0.5)
In [19]:
dataset_performance.show_data()
Out[19]:
In [20]:
dataset_performance.show_data().shape
Out[20]:
In [21]:
dataset_performance.show_data().columns
Out[21]:
In [22]:
CLASS_NAMES = range(33)
In [23]:
INPUT_DATA = 'gs://conversationai-models/biosbias/dataflow_dir/data-preparation-20190220165938/eval-00000-of-00003.tfrecord'
record_iterator = tf.python_io.tf_record_iterator(path=INPUT_DATA)
string_record = next(record_iterator)
example = tf.train.Example()
example.ParseFromString(string_record)
text = example.features.feature
print(example)
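To inspect individual fields rather than printing the whole proto, you can iterate over the feature map; this is a generic sketch that does not assume any particular field names.
In [ ]:
# Generic inspection of the parsed tf.train.Example (no field names assumed).
feature_map = example.features.feature
for name, feature in feature_map.items():
    # Each tf.train.Feature populates exactly one of bytes_list, int64_list, float_list.
    values = (feature.bytes_list.value or
              feature.int64_list.value or
              feature.float_list.value)
    print(name, list(values)[:5])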
In [24]:
# Set recompute_predictions=False to save time if predictions are available.
dataset_performance.add_model_prediction_to_data(model, recompute_predictions=False, class_names=CLASS_NAMES)
In [25]:
def _load_predictions(pred_file):
    """Loads JSON predictions from a CMLE batch prediction output file."""
    with file_io.FileIO(pred_file, 'r') as f:
        # The prediction file needs to fit in memory.
        try:
            predictions = [json.loads(line) for line in f]
        except ValueError:
            # Return an empty list if any line cannot be parsed.
            predictions = []
    return predictions

model_name_tmp = MODEL_NAMES[0]
prediction_file = dataset_performance.get_path_prediction(model_name_tmp)
print(prediction_file)
prediction_file = os.path.join(prediction_file,
                               'prediction.results-00000-of-00001')
print(len(_load_predictions(prediction_file)[0]['probabilities']))
In [ ]:
In [26]:
test_performance_df = dataset_performance.show_data()
In [27]:
test_bias_df = test_performance_df.copy()
In [28]:
test_performance_df.head()
Out[28]:
In [29]:
test_bias_df.head()
Out[29]:
At this point, our performance data is in the DataFrame test_performance_df, with the original input columns, a 'label' column, and one prediction column per (model, class) pair named '<model_name>_<class>'.
In [30]:
import sklearn.metrics as metrics
In [31]:
test_performance_df.label.value_counts()
Out[31]:
In [32]:
test_performance_df['label'] == 3
Out[32]:
In [33]:
_model = 'tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738'
_class = 3
test_performance_df['{}_{}'.format(_model, _class)]
Out[33]:
In [34]:
auc_list = []
for _model in MODEL_NAMES:
    for _class in CLASS_NAMES:
        fpr, tpr, thresholds = metrics.roc_curve(
            test_performance_df['label'] == _class,
            test_performance_df['{}_{}'.format(_model, _class)])
        _auc = metrics.auc(fpr, tpr)
        auc_list.append(_auc)
        print('AUC for class {} model {}: {}'.format(_class, _model, _auc))
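As a convenience (not part of the original analysis), the per-class AUCs can be collapsed into one macro-average score per model; this assumes auc_list is ordered as in the loop above (all classes of the first model, then the second, and so on).
In [ ]:
# Macro-average the one-vs-rest AUCs per model.
n_classes = len(CLASS_NAMES)
for i, _model in enumerate(MODEL_NAMES):
    model_aucs = auc_list[i * n_classes:(i + 1) * n_classes]
    print('Macro-average AUC for model {}: {:.4f}'.format(_model, np.mean(model_aucs)))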
In [55]:
def get_class_from_col_name(col_name):
    """Extracts the trailing class index from a prediction column name."""
    pattern = r'^.*_(\d+)$'
    return int(re.search(pattern, col_name).group(1))
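For example, since the prediction column names above end with the class index:
In [ ]:
# The class index is the part after the last underscore.
get_class_from_col_name('tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_3')  # -> 3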
In [62]:
def find_best_class(df, model_name, class_names):
    """Adds a '<model_name>_class' column holding the argmax class for each row."""
    model_class_names = ['{}_{}'.format(model_name, class_name) for class_name in class_names]
    sub_df = df[model_class_names]
    df['{}_class'.format(model_name)] = sub_df.idxmax(axis=1).apply(get_class_from_col_name)
In [63]:
for _model in MODEL_NAMES:
    find_best_class(test_performance_df, _model, CLASS_NAMES)
In [64]:
accuracy_list = []
for _model in MODEL_NAMES:
    is_correct = (test_performance_df['{}_class'.format(_model)] == test_performance_df['label'])
    _acc = sum(is_correct) / len(is_correct)
    accuracy_list.append(_acc)
    print('Accuracy for model {}: {}'.format(_model, _acc))
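As a small convenience (not in the original notebook), the accuracies can be collected into a DataFrame to compare the models side by side.
In [ ]:
# Summarize accuracy per model (assumes accuracy_list is aligned with MODEL_NAMES).
accuracy_df = pd.DataFrame({'model': MODEL_NAMES, 'accuracy': accuracy_list})
display(accuracy_df.sort_values('accuracy', ascending=False))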
At this point, our bias data is in the DataFrame test_bias_df, with the same columns as above plus one boolean column per identity term indicating whether that term appears in the text.
You can run the analysis below on any data in this format. Subgroup labels can be generated from words in the text (as done above, and sketched below), or can come from human labels if you have them.
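The sketch below illustrates one way to generate such subgroup columns from the text itself; it assumes the tokens column holds a list of lowercased words (as produced by the tokenizer above), whereas this notebook actually gets its identity columns from the bias input_fn.
In [ ]:
# Hedged sketch: derive boolean subgroup columns from the tokenized text.
def add_subgroup_columns(df, terms, tokens_column=TEXT_FEATURE_NAME):
    for term in terms:
        df[term] = df[tokens_column].apply(lambda tokens: term in tokens)
    return df

# Example usage (commented out; identity_terms_civil comes from input_fn_example):
# test_bias_df = add_subgroup_columns(test_bias_df, input_fn_example.identity_terms_civil)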
In [35]:
identity_terms_civil_included = []
for _term in input_fn_example.identity_terms_civil:
    if sum(test_bias_df[_term]) >= 20:
        print('keeping {}'.format(_term))
        identity_terms_civil_included.append(_term)
In [ ]:
# Rename the prediction columns to short aliases for the bias analysis below.
# Note: these column names refer to civil_comments models; adjust them to match
# the prediction columns of the models deployed above.
test_bias_df['model_1'] = test_bias_df['tf_gru_attention_civil:v_20181109_164318']
test_bias_df['model_2'] = test_bias_df['tf_gru_attention_civil:v_20181109_164403']
test_bias_df['model_3'] = test_bias_df['tf_gru_attention_civil:v_20181109_164535']
test_bias_df['model_4'] = test_bias_df['tf_gru_attention_civil:v_20181109_164630']
In [ ]:
MODEL_NAMES = ['model_1', 'model_2', 'model_3', 'model_4']
In [ ]:
bias_metrics = model_bias_analysis.compute_bias_metrics_for_models(test_bias_df, identity_terms_civil_included, MODEL_NAMES, 'label')
In [ ]:
model_bias_analysis.plot_auc_heatmap(bias_metrics, MODEL_NAMES)
In [ ]:
model_bias_analysis.plot_aeg_heatmap(bias_metrics, MODEL_NAMES)
In [ ]: