Copyright 2019 The Google Research Authors.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This notebook is a user guide for the corrupted sample discovery and robust learning applications of "Data Valuation using Reinforcement Learning (DVRL)".
In some scenarios, training samples may be corrupted, e.g., due to cheap label collection methods. An automated method for discovering corrupted samples would be highly beneficial for distinguishing samples with clean labels from samples with noisy labels. Data valuation can be applied in this setting: given a small clean validation set, low data values are assigned to the samples that likely have noisy labels. With an optimal data value estimator, all noisy samples would receive the lowest data values.
DVRL can also learn reliably with noisy data in an end-to-end way. Ideally, noisy samples receive low data values as DVRL converges, and a high-performance model is returned.
You need:
- Training set: low-quality data (e.g., noisy labels)
- Validation set: high-quality data (e.g., clean labels)
- Testing set: high-quality data (e.g., clean labels)
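As a minimal, self-contained sketch of how such a low-quality training set can be simulated, the hypothetical `corrupt_labels` helper below flips a fraction of binary labels (in this notebook, the actual corruption is performed inside `load_tabular_data`):

```python
import numpy as np

def corrupt_labels(y, noise_rate, seed=0):
  """Flips a fraction of binary labels; returns noisy labels and flipped indices.

  Hypothetical helper for illustration only; not the DVRL implementation.
  """
  rng = np.random.RandomState(seed)
  n_noisy = int(len(y) * noise_rate)
  noise_idx = rng.choice(len(y), size=n_noisy, replace=False)
  y_noisy = y.copy()
  y_noisy[noise_idx] = 1 - y_noisy[noise_idx]  # flip the selected binary labels
  return y_noisy, noise_idx

y_clean = np.array([0, 1] * 50)                # 100 clean binary labels
y_noisy, noise_idx = corrupt_labels(y_clean, noise_rate=0.2)
print(len(noise_idx))                          # 20 labels were flipped
```

The returned `noise_idx` plays the same role as the ground-truth noisy indices used for evaluation later in this notebook.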
In [ ]:
# Uses pip3 to install necessary package (lightgbm)
!pip3 install lightgbm
# Resets the IPython kernel to import the installed package.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
In [2]:
import os
from git import Repo
# Directory for the cloned repository
repo_dir = os.getcwd() + '/repo'
if not os.path.exists(repo_dir):
  os.makedirs(repo_dir)
# Clones the GitHub repository if the directory is empty
if not os.listdir(repo_dir):
  git_url = 'https://github.com/google-research/google-research.git'
  Repo.clone_from(git_url, repo_dir)
In [3]:
import numpy as np
import tensorflow as tf
from sklearn import linear_model
import lightgbm
# Sets current directory
os.chdir(repo_dir)
from dvrl.data_loading import load_tabular_data, preprocess_data
from dvrl import dvrl
from dvrl.dvrl_metrics import discover_corrupted_sample, remove_high_low, learn_with_dvrl
In [4]:
# Data name: 'adult' in this notebook
data_name = 'adult'
# The number of training and validation samples
dict_no = dict()
dict_no['train'] = 1000
dict_no['valid'] = 400
# Label noise ratio
noise_rate = 0.2
# Loads data and corrupts labels
noise_idx = load_tabular_data(data_name, dict_no, noise_rate)
# noise_idx: ground truth noisy sample indices
print('Finished data loading.')
In [5]:
# Normalization methods: 'minmax' or 'standard'
normalization = 'minmax'
# Extracts features and labels. Then, normalizes features.
x_train, y_train, x_valid, y_valid, x_test, y_test, _ = \
    preprocess_data(normalization, 'train.csv', 'valid.csv', 'test.csv')
print('Finished data preprocess.')
In [6]:
# Resets the graph
tf.reset_default_graph()
# Network parameters
parameters = dict()
parameters['hidden_dim'] = 100
parameters['comb_dim'] = 10
parameters['iterations'] = 2000
parameters['activation'] = tf.nn.relu
parameters['layer_number'] = 5
parameters['batch_size'] = 2000
parameters['learning_rate'] = 0.01
# Sets checkpoint file name
checkpoint_file_name = './tmp/model.ckpt'
# Defines predictive model
pred_model = linear_model.LogisticRegression(solver='lbfgs')
problem = 'classification'
# Flags for using stochastic gradient descent / pre-trained model
flags = {'sgd': False, 'pretrain': False}
# Initializes DVRL
dvrl_class = dvrl.Dvrl(x_train, y_train, x_valid, y_valid,
                       problem, pred_model, parameters,
                       checkpoint_file_name, flags)
# Trains DVRL
dvrl_class.train_dvrl('auc')
print('Finished dvrl training.')
# Estimates data values
dve_out = dvrl_class.data_valuator(x_train, y_train)
# Predicts with DVRL
y_test_hat = dvrl_class.dvrl_predictor(x_test)
print('Finished data valuation.')
In [7]:
# Defines evaluation model
eval_model = lightgbm.LGBMClassifier()
# Robust learning (DVRL-weighted learning)
robust_perf = learn_with_dvrl(
    dve_out, eval_model, x_train, y_train,
    x_valid, y_valid, x_test, y_test, 'accuracy')
print('DVRL-weighted learning performance: ' + str(np.round(robust_perf, 4)))
Removing low-value samples from the training dataset can improve predictor performance, especially when the training dataset contains corrupted samples. Conversely, removing high-value samples, especially when the dataset is small, can degrade performance significantly. Overall, the performance after removing high/low-value samples is a strong indicator of the quality of the data valuation.
DVRL ranks the training samples by their estimated data values. Removing the lowest-valued samples should significantly improve performance, whereas removing the highest-valued samples should degrade it severely. Thus, for a high-quality data valuation method, a large gap is expected between the performance curves for high-value vs. low-value sample removal.
In [8]:
# Evaluates performance after removing high/low valued samples
remove_high_low_performance = remove_high_low(
    dve_out, eval_model, x_train, y_train,
    x_valid, y_valid, x_test, y_test, 'accuracy', plot=True)
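Conceptually, this evaluation retrains the model after removing increasing fractions of the highest- or lowest-valued samples and records the resulting test performance. The self-contained sketch below illustrates that loop on synthetic data with random stand-in data values; all names and data are illustrative, not the actual `remove_high_low` implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the 'adult' splits and for dve_out.
x_tr, y_tr = make_classification(n_samples=500, random_state=0)
x_te, y_te = make_classification(n_samples=200, random_state=0)
data_values = np.random.RandomState(0).rand(len(x_tr))

def accuracy_after_removal(fraction, remove_high):
  """Retrains after dropping a fraction of highest- or lowest-valued samples."""
  order = np.argsort(data_values)              # ascending: lowest values first
  n_remove = int(len(order) * fraction)
  keep = order[:len(order) - n_remove] if remove_high else order[n_remove:]
  model = LogisticRegression(max_iter=1000).fit(x_tr[keep], y_tr[keep])
  return model.score(x_te, y_te)

for frac in (0.1, 0.2, 0.3):
  print(frac,
        'remove low:', round(accuracy_after_removal(frac, remove_high=False), 3),
        'remove high:', round(accuracy_after_removal(frac, remove_high=True), 3))
```

With well-estimated data values (rather than the random ones above), the "remove low" curve should stay flat or improve while the "remove high" curve drops.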
For our synthetically generated noisy training dataset, we can assess how well the method discovers the noisy samples because the ground-truth noise indices are known. Note that, unlike the first two evaluations, this cell is only for academic purposes: it requires the ground-truth noisy sample indices, so users who bring their own .csv files cannot use it.
In [9]:
# Runs only if the label noise ratio is positive.
if noise_rate > 0:
  # Evaluates the true positive rate (TPR) of corrupted sample discovery and plots it
  noise_discovery_performance = discover_corrupted_sample(
      dve_out, noise_idx, noise_rate, plot=True)
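For reference, the TPR curve that this evaluation reports can be sketched as follows: samples are inspected in order of increasing data value, and for each inspected fraction we count the share of true noisy samples recovered. The toy illustration below uses our own function and variable names, not DVRL's:

```python
import numpy as np

def discovery_tpr(data_values, noise_idx, fractions):
  """Share of true noisy samples found when inspecting the lowest-valued
  fraction of the training set. Illustrative sketch only."""
  order = np.argsort(data_values)              # inspect lowest-valued samples first
  noise_set = set(int(i) for i in noise_idx)
  tprs = []
  for frac in fractions:
    inspected = order[:int(len(order) * frac)]
    found = sum(int(i) in noise_set for i in inspected)
    tprs.append(found / len(noise_set))
  return tprs

# Toy example: 10 samples; samples 0 and 1 are noisy and got the lowest values.
values = np.array([0.01, 0.02, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.95])
tprs = discovery_tpr(values, np.array([0, 1]), [0.2, 0.5])
print(tprs)  # [1.0, 1.0] -- both noisy samples sit in the cheapest 20%
```

An ideal data value estimator reaches TPR = 1.0 after inspecting only a `noise_rate` fraction of the data; a random ordering would recover noisy samples only in proportion to the fraction inspected.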