Copyright 2019 The Google Research Authors.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This notebook is a user guide for corrupted sample discovery and robust learning with transfer learning on image data, using "Data Valuation using Reinforcement Learning (DVRL)". We use an Inception-v3 model pre-trained on the ImageNet dataset (http://www.image-net.org/) as the encoder for the image data.
Training sets may contain corrupted samples, e.g. due to cheap label collection methods. An automated corrupted sample discovery method would be highly beneficial for distinguishing samples with clean vs. noisy labels. Data valuation can be used in this setting: given a small clean validation set, it assigns low data values to samples that are likely to have noisy labels. With an optimal data value estimator, all samples with noisy labels would get the lowest data values.
DVRL can also reliably learn with noisy data in an end-to-end way. Ideally, noisy samples should get low data values as DVRL converges, and a high-performance model can be returned.
You need:
Training set (low-quality data, e.g. noisy data) / Validation set (high-quality data, e.g. clean data) / Testing set (high-quality data, e.g. clean data)
Clone https://github.com/google-research/google-research.git to the current directory.
In [1]:
import os
from git import Repo
# Current working directory
repo_dir = os.getcwd() + '/repo'
if not os.path.exists(repo_dir):
  os.makedirs(repo_dir)
# Clones github repository
if not os.listdir(repo_dir):
  git_url = "https://github.com/google-research/google-research.git"
  Repo.clone_from(git_url, repo_dir)
In [1]:
import warnings
import keras
from keras import applications, layers, models
import numpy as np
from sklearn import linear_model
import tensorflow as tf
warnings.filterwarnings("ignore")
# Sets current directory to the cloned repository so that the dvrl package can be imported
os.chdir(repo_dir)
from dvrl import data_loading, dvrl, dvrl_metrics
In [2]:
# Data name: 'cifar10' or 'cifar100' in this notebook
data_name = 'cifar10'
# The number of training, validation and testing samples
dict_no = dict()
dict_no['train'] = 4000
dict_no['valid'] = 1000
dict_no['test'] = 2000
# Label noise ratio
noise_rate = 0.2
# Loads data and corrupts labels
noise_idx = data_loading.load_image_data(data_name, dict_no, noise_rate)
# noise_idx: ground truth noisy label indices
print('Finished data loading.')
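As a rough illustration of what corrupting labels at a given `noise_rate` can mean, the sketch below flips a fixed fraction of labels to a different class chosen uniformly at random and records which indices were changed. The helper `corrupt_labels` and the toy label array are hypothetical; the procedure inside `data_loading.load_image_data` may differ in its details.

```python
import numpy as np


def corrupt_labels(y, noise_rate, num_classes, seed=0):
  """Flips a noise_rate fraction of labels to a different class (illustrative helper)."""
  rng = np.random.default_rng(seed)
  y = np.asarray(y)
  y_noisy = y.copy()
  n_noisy = int(round(len(y) * noise_rate))
  # Picks which samples to corrupt, without replacement.
  noise_idx = rng.choice(len(y), size=n_noisy, replace=False)
  # An offset in [1, num_classes - 1] guarantees that the new label differs.
  offsets = rng.integers(1, num_classes, size=n_noisy)
  y_noisy[noise_idx] = (y[noise_idx] + offsets) % num_classes
  return y_noisy, np.sort(noise_idx)


# Toy labels: 4000 samples spread evenly over 10 classes.
y_clean = np.arange(4000) % 10
y_noisy, noise_idx = corrupt_labels(y_clean, noise_rate=0.2, num_classes=10)
print(len(noise_idx))  # 800
```

The returned indices play the role of ground truth when a discovery method is scored later on.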
In [3]:
# Extracts features and labels.
x_train, y_train, x_valid, y_valid, x_test, y_test = \
    data_loading.load_image_data_from_file('train.npz', 'valid.npz', 'test.npz')
print('Finished data preprocess.')
In [4]:
# Encodes samples
# The preprocessing function used on the pre-training dataset is also applied while encoding the inputs.
preprocess_function = applications.inception_v3.preprocess_input
input_shape = (299, 299)
# Defines the encoder model that produces representations of the image dataset. In this example, we use the
# InceptionV3 model trained on the ImageNet dataset, followed by simple average pooling-based downsampling.
def encoder_model(architecture='inception_v3', pre_trained_dataset='imagenet', downsample_factor=8):
  """Returns encoder model.

  Args:
    architecture: Base architecture of encoder model (e.g. 'inception_v3')
    pre_trained_dataset: The dataset used to pre-train the encoder model
    downsample_factor: Downsample factor for the outputs
  """
  tf_input = layers.Input(shape=(input_shape[0], input_shape[1], 3))
  if architecture == 'inception_v3':
    model = applications.inception_v3.InceptionV3(
        input_tensor=tf_input, weights=pre_trained_dataset, include_top=False)
  else:
    # Only 'inception_v3' is handled in this example.
    raise ValueError('Unsupported architecture: ' + str(architecture))
  output_pooled = layers.AveragePooling2D((downsample_factor, downsample_factor),
                                          strides=(downsample_factor, downsample_factor))(model.output)
  return models.Model(model.input, output_pooled)
# Encodes training samples
enc_x_train = data_loading.encode_image(x_train, encoder_model, input_shape, preprocess_function)
# Encodes validation samples
enc_x_valid = data_loading.encode_image(x_valid, encoder_model, input_shape, preprocess_function)
# Encodes testing samples
enc_x_test = data_loading.encode_image(x_test, encoder_model, input_shape, preprocess_function)
print('Finished data encoding')
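To make the downsampling step concrete, the NumPy sketch below applies non-overlapping average pooling to a random array standing in for the encoder's feature maps. It assumes that InceptionV3 without its top layers maps a 299x299 input to an 8x8 grid with 2048 channels, so that a `downsample_factor` of 8 leaves one 2048-dimensional vector per image. The function `average_pool` and the random `feature_maps` are illustrative only.

```python
import numpy as np


def average_pool(features, factor):
  """Non-overlapping average pooling over the spatial axes of an (N, H, W, C) array."""
  n, h, w, c = features.shape
  if h % factor or w % factor:
    raise ValueError('Spatial dimensions must be divisible by the pooling factor.')
  # Splits each spatial axis into (number of blocks, block size) and averages within blocks.
  blocks = features.reshape(n, h // factor, factor, w // factor, factor, c)
  return blocks.mean(axis=(2, 4))


# Random stand-in for the feature maps of two images (assumed shape: 8 x 8 x 2048).
feature_maps = np.random.default_rng(0).normal(size=(2, 8, 8, 2048))
pooled = average_pool(feature_maps, factor=8)
print(pooled.shape)  # (2, 1, 1, 2048)
```

With a factor equal to the spatial size, the pooling is equivalent to global average pooling, which is why each image ends up as a single feature vector.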
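Input: the data valuator network parameters, the predictive model, the checkpoint file name, and the flags set in the cell below.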
In [5]:
# Resets the graph
tf.reset_default_graph()
keras.backend.clear_session()
# Network parameters
parameters = dict()
parameters['hidden_dim'] = 100
parameters['comb_dim'] = 10
parameters['iterations'] = 1000
parameters['activation'] = tf.nn.relu
parameters['layer_number'] = 5
parameters['batch_size'] = 2000
parameters['learning_rate'] = 0.01
parameters['inner_iterations'] = 100
parameters['batch_size_predictor'] = 256
# Sets checkpoint file name
checkpoint_file_name = './tmp/model.ckpt'
# Defines predictive model
problem = 'classification'
pred_model = keras.models.Sequential()
pred_model.add(keras.layers.Dense(len(set(y_train)), activation='softmax'))
pred_model.compile(optimizer='adam', loss='categorical_crossentropy',
                   metrics=['accuracy'])
# Flags for using stochastic gradient descent / pre-trained model
flags = {'sgd': True, 'pretrain': False}
# Initializes DVRL
dvrl_class = dvrl.Dvrl(enc_x_train, y_train, enc_x_valid, y_valid,
                       problem, pred_model, parameters, checkpoint_file_name, flags)
# Trains DVRL
dvrl_class.train_dvrl('accuracy')
print('Finished DVRL training.')
# Estimates data values
dve_out = dvrl_class.data_valuator(enc_x_train, y_train)
# Predicts with DVRL
y_test_hat = dvrl_class.dvrl_predictor(enc_x_test)
print('Finished data valuation.')
In [6]:
# Defines evaluation model
eval_model = linear_model.LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
# Robust learning (DVRL-weighted learning)
robust_perf = dvrl_metrics.learn_with_dvrl(dve_out, eval_model, enc_x_train, y_train,
                                           enc_x_valid, y_valid, enc_x_test, y_test, 'accuracy')
print('DVRL-weighted learning performance: ' + str(np.round(robust_perf, 4)))
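One way to picture value-weighted learning is to pass the data values as per-sample weights when fitting the evaluation model, so that low-value samples contribute little to the loss. The sketch below does this on toy 2-D data with scikit-learn's `sample_weight` argument. The blobs, the flipped labels, and `toy_values` (a hand-made stand-in for `dve_out`) are all assumptions for illustration, and `dvrl_metrics.learn_with_dvrl` may be implemented differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def make_blobs(n_per_class):
  """Two well-separated Gaussian classes in 2-D (toy data)."""
  x = np.vstack([rng.normal(-2.0, 1.0, size=(n_per_class, 2)),
                 rng.normal(2.0, 1.0, size=(n_per_class, 2))])
  y = np.repeat([0, 1], n_per_class)
  return x, y


x_tr, y_tr_clean = make_blobs(200)
x_te, y_te = make_blobs(200)

# Flips 20% of the training labels to simulate label noise.
flipped = rng.choice(len(y_tr_clean), size=80, replace=False)
y_tr = y_tr_clean.copy()
y_tr[flipped] = 1 - y_tr[flipped]

# Hypothetical stand-in for dve_out: flipped samples receive a small value.
toy_values = np.ones(len(y_tr))
toy_values[flipped] = 0.05

# Each sample's loss term is scaled by its value.
weighted_model = LogisticRegression(solver='lbfgs', max_iter=200)
weighted_model.fit(x_tr, y_tr, sample_weight=toy_values)
weighted_acc = weighted_model.score(x_te, y_te)
print('Weighted accuracy on clean test data:', round(weighted_acc, 4))
```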
Removing low-value samples from the training dataset can improve the predictor model's performance, especially when the training dataset contains corrupted samples. Conversely, removing high-value samples, especially if the dataset is small, would decrease the performance significantly. Overall, the performance after removing high/low-value samples is a strong indicator of the quality of data valuation.
DVRL ranks the training samples by their estimated data value, so both removal experiments can be run directly from its output.
In [7]:
# Evaluates performance after removing high/low values
remove_high_low_performance = \
    dvrl_metrics.remove_high_low(dve_out, eval_model, enc_x_train, y_train,
                                 enc_x_valid, y_valid, enc_x_test, y_test, 'accuracy', plot=True)
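The removal experiment can be sketched as a loop: rank samples by value, drop a growing fraction from one end of the ranking, retrain, and record test accuracy. The version below runs on toy data with hand-made values. The helper `accuracy_after_removal`, the toy blobs, and `toy_values` are hypothetical, and the details of `dvrl_metrics.remove_high_low` may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def accuracy_after_removal(values, x_train, y_train, x_test, y_test,
                           fractions, remove_lowest=True):
  """Retrains after dropping a fraction of samples ranked by value."""
  order = np.argsort(values)  # ascending: least valuable first
  if not remove_lowest:
    order = order[::-1]  # most valuable first
  scores = []
  for frac in fractions:
    keep = order[int(len(order) * frac):]
    model = LogisticRegression(solver='lbfgs', max_iter=200)
    model.fit(x_train[keep], y_train[keep])
    scores.append(model.score(x_test, y_test))
  return scores


rng = np.random.default_rng(0)
x_tr = np.vstack([rng.normal(-2.0, 1.0, (200, 2)), rng.normal(2.0, 1.0, (200, 2))])
y_tr = np.repeat([0, 1], 200)
x_te = np.vstack([rng.normal(-2.0, 1.0, (200, 2)), rng.normal(2.0, 1.0, (200, 2))])
y_te = np.repeat([0, 1], 200)

# Flips 20% of the training labels and gives those samples the lowest toy values.
flipped = rng.choice(400, size=80, replace=False)
y_tr[flipped] = 1 - y_tr[flipped]
toy_values = rng.uniform(0.5, 1.0, size=400)
toy_values[flipped] = rng.uniform(0.0, 0.5, size=80)

fractions = [0.0, 0.1, 0.2]
low_removed = accuracy_after_removal(toy_values, x_tr, y_tr, x_te, y_te, fractions, True)
high_removed = accuracy_after_removal(toy_values, x_tr, y_tr, x_te, y_te, fractions, False)
print('Removing lowest values: ', np.round(low_removed, 4))
print('Removing highest values:', np.round(high_removed, 4))
```

Plotting the two lists against `fractions` gives the kind of curve pair that this evaluation is meant to compare.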
For our synthetically generated noisy training dataset, we can assess how well the method finds the noisy samples by using the known noise indices. Note that, unlike the first two evaluations, this cell is for academic purposes only: it requires the ground-truth noisy sample indices, so users who bring their own .csv files cannot use it.
In [8]:
# Runs only if noise_rate is positive.
if noise_rate > 0:
  # Evaluates true positive rates (TPR) of corrupted sample discovery and plots TPRs
  noise_discovery_performance = \
      dvrl_metrics.discover_corrupted_sample(dve_out, noise_idx, noise_rate, plot=True)
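As a minimal sketch of the discovery metric, the code below inspects samples from the lowest data value upward and reports what fraction of the known noisy samples has been found at each inspection budget. The function `discovery_rate` and the ten toy values are hypothetical; `dvrl_metrics.discover_corrupted_sample` may bin or normalize differently.

```python
import numpy as np


def discovery_rate(values, noise_idx, inspect_fractions):
  """Fraction of noisy samples found when inspecting the lowest-valued samples first."""
  order = np.argsort(values)  # lowest data value first
  is_noisy = np.zeros(len(values), dtype=bool)
  is_noisy[noise_idx] = True
  rates = []
  for frac in inspect_fractions:
    inspected = order[:int(round(len(values) * frac))]
    rates.append(is_noisy[inspected].sum() / len(noise_idx))
  return np.array(rates)


# Toy example: samples 1 and 3 are noisy and also have the two lowest values.
toy_values = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.6, 0.95, 0.85, 0.3, 0.75])
toy_noise_idx = np.array([1, 3])
rates = discovery_rate(toy_values, toy_noise_idx, [0.1, 0.2, 0.5])
print(rates)  # [0.5 1.  1. ]
```

With a perfect ranking, as in this toy case, all noisy samples are found once the inspected fraction reaches the noise rate.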