Copyright 2019 The Google Research Authors.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This notebook is a user guide for a data valuation application based on "Data Valuation using Reinforcement Learning (DVRL)".
Given only a small validation set, DVRL provides a computationally efficient, high-quality ranking of the data values of the training samples.
You need:
Training / Validation / Testing sets
In [1]:
# Uses pip3 to install necessary package (lightgbm)
!pip3 install lightgbm
# Resets the IPython kernel to import the installed package.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
In [2]:
import os
from git import Repo
# Current working directory
repo_dir = os.getcwd() + '/repo'
if not os.path.exists(repo_dir):
  os.makedirs(repo_dir)
# Clones the GitHub repository if the directory is empty
if not os.listdir(repo_dir):
  git_url = "https://github.com/google-research/google-research.git"
  Repo.clone_from(git_url, repo_dir)
In [3]:
import numpy as np
import keras
import tensorflow as tf
import pandas as pd
import lightgbm
# Sets current directory
os.chdir(repo_dir)
from dvrl.data_loading import load_tabular_data, preprocess_data
from dvrl import dvrl
from dvrl.dvrl_metrics import remove_high_low
In [ ]:
# Data name: 'adult' in this notebook
data_name = 'adult'
# The number of training and validation samples
dict_no = dict()
dict_no['train'] = 1000
dict_no['valid'] = 400
# Loads data (the third argument is the label noise rate; 0.0 adds no noise)
_ = load_tabular_data(data_name, dict_no, 0.0)
print('Finished data loading.')
In [5]:
# Normalization methods: either 'minmax' or 'standard'
normalization = 'minmax'
# Extracts features and labels. Then, normalizes features.
x_train, y_train, x_valid, y_valid, x_test, y_test, col_names = \
    preprocess_data(normalization, 'train.csv', 'valid.csv', 'test.csv')
print('Finished data preprocess.')
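The two normalization options transform each feature column independently. A minimal sketch of both transforms on a toy matrix (not the adult data):

```python
import numpy as np

# Toy feature matrix: rows are samples, columns are features.
x = np.array([[1.0, 10.0],
              [2.0, 40.0],
              [3.0, 30.0]])

# 'minmax' rescales each feature (column) to the [0, 1] range.
x_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# 'standard' subtracts each feature's mean and divides by its standard deviation.
x_standard = (x - x.mean(axis=0)) / x.std(axis=0)

print(x_minmax)
print(x_standard)
```

Min-max scaling is sensitive to outliers (a single extreme value compresses the rest of the range), while standardization is not bounded to [0, 1].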
Input: problem specification, network parameters, and the predictive model.
In [6]:
# Resets the graph
tf.reset_default_graph()
keras.backend.clear_session()
# Defines problem
problem = 'classification'
# Network parameters
parameters = dict()
parameters['hidden_dim'] = 100
parameters['comb_dim'] = 10
parameters['iterations'] = 2000
parameters['activation'] = tf.nn.relu
parameters['inner_iterations'] = 100
parameters['layer_number'] = 5
parameters['batch_size'] = 2000
parameters['batch_size_predictor'] = 256
parameters['learning_rate'] = 0.01
# Defines predictive model
pred_model = keras.models.Sequential()
pred_model.add(keras.layers.Dense(parameters['hidden_dim'], activation='relu'))
pred_model.add(keras.layers.Dense(parameters['hidden_dim'], activation='relu'))
pred_model.add(keras.layers.Dense(2, activation='softmax'))
pred_model.compile(optimizer='adam', loss='categorical_crossentropy',
                   metrics=['accuracy'])
# Sets checkpoint file name
checkpoint_file_name = './tmp/model.ckpt'
# Flags for using stochastic gradient descent / pre-trained model
flags = {'sgd': True, 'pretrain': False}
# Initializes DVRL
dvrl_class = dvrl.Dvrl(x_train, y_train, x_valid, y_valid, problem, pred_model,
                       parameters, checkpoint_file_name, flags)
# Trains DVRL
dvrl_class.train_dvrl('auc')
print('Finished dvrl training.')
# Estimates data values
dve_out = dvrl_class.data_valuator(x_train, y_train)
# Predicts with DVRL
y_test_hat = dvrl_class.dvrl_predictor(x_test)
print('Finished data valuation.')
DVRL learns the value of each individual training sample via reinforcement learning, using a small validation set. The training samples can therefore be ranked by their estimated data values.
In [7]:
# Data valuation
sorted_idx = np.argsort(-dve_out)
sorted_x_train = x_train[sorted_idx]
sorted_y_train = y_train[sorted_idx]
# The number of examples
n_exp = 5
# Indices of the n_exp highest-valued samples
print('Indices of the top ' + str(n_exp) + ' highest-valued samples: ' + str(sorted_idx[:n_exp]))
pd.DataFrame(data=sorted_x_train[:n_exp, :], index=range(n_exp), columns=col_names).head()
Out[7]:
In [8]:
# Indices of the n_exp lowest-valued samples
print('Indices of the top ' + str(n_exp) + ' lowest-valued samples: ' + str(sorted_idx[-n_exp:]))
pd.DataFrame(data=sorted_x_train[-n_exp:, :], index=range(n_exp), columns=col_names).head()
Out[8]:
Removing low-value samples from the training dataset can improve predictor performance, especially when the training dataset contains corrupted samples. Conversely, removing high-value samples, especially when the dataset is small, degrades performance significantly. Overall, the performance after removing high/low-value samples is a strong indicator of the quality of the data valuation.
Since DVRL ranks the training samples by their estimated data values, we expect performance to improve as the lowest-valued samples are removed, and to degrade as the highest-valued samples are removed.
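The idea behind this removal metric can be sketched on synthetic data. Here the data values are hand-crafted for illustration (NOT produced by DVRL): corrupted-label samples are deliberately assigned the lowest values, and a simple 1-nearest-neighbor classifier stands in for the LightGBM evaluation model. Separate variable names are used so the notebook's real data is untouched.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic stand-in data: label is the sign of the first feature.
x_tr = rng.randn(200, 5)
y_tr = (x_tr[:, 0] > 0).astype(int)
x_te = rng.randn(100, 5)
y_te = (x_te[:, 0] > 0).astype(int)

# Corrupt 20% of the training labels so genuinely low-value samples exist.
noise_idx = rng.choice(200, 40, replace=False)
y_tr[noise_idx] = 1 - y_tr[noise_idx]

# Hand-crafted data values: corrupted samples score strictly lowest.
values = rng.rand(200)
values[noise_idx] -= 1.0

def knn_predict(xtr, ytr, xte):
  # 1-nearest-neighbor classifier: a minimal stand-in evaluation model.
  d = ((xte[:, None, :] - xtr[None, :, :]) ** 2).sum(axis=2)
  return ytr[d.argmin(axis=1)]

def accuracy_after_removal(frac_removed):
  # Keeps the highest-valued fraction of the training set, then measures
  # test accuracy of a model trained on that subset.
  n_keep = int(len(values) * (1.0 - frac_removed))
  keep = np.argsort(-values)[:n_keep]
  preds = knn_predict(x_tr[keep], y_tr[keep], x_te)
  return (preds == y_te).mean()

for frac in [0.0, 0.1, 0.2]:
  print('Removed %d%% lowest-valued: test accuracy %.3f'
        % (int(frac * 100), accuracy_after_removal(frac)))
```

With a good valuation, removing the lowest-valued 20% discards exactly the corrupted samples, so the retained training set is clean.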
In [6]:
# Defines evaluation model
eval_model = lightgbm.LGBMClassifier()
# Evaluates performances after removing high/low valued samples
remove_high_low_performance = remove_high_low(dve_out, eval_model, x_train, y_train,
                                              x_valid, y_valid, x_test, y_test,
                                              'accuracy', plot=True)