Copyright 2019 The Google Research Authors.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This notebook is a user guide for a domain adaptation application of "Data Valuation using Reinforcement Learning (DVRL)".
We consider the scenario where the training dataset comes from a substantially different distribution than the validation and testing sets. Data valuation is expected to benefit this task by selecting the training samples that best match the distribution of the validation dataset.
You need:
Source / Target / Validation Datasets
We use the Rossmann store sales dataset (https://www.kaggle.com/c/rossmann-store-sales) as an example in this notebook. Please download the dataset (rossmann-store-sales.zip) from https://www.kaggle.com/c/rossmann-store-sales/data and save it to the './repo/data_files/' directory after cloning the GitHub repository.
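After downloading, the archive can be unpacked into the expected directory with the standard library; a minimal sketch, assuming the zip was saved to the notebook's working directory (the `archive_path` below is an assumption, adjust it to wherever you saved the file):

```python
import os
import zipfile

# Assumed location of the downloaded archive (adjust as needed).
archive_path = 'rossmann-store-sales.zip'
data_dir = './repo/data_files/'

os.makedirs(data_dir, exist_ok=True)
if os.path.exists(archive_path):
    with zipfile.ZipFile(archive_path) as zf:
        # Unpacks train.csv, store.csv, etc. into the data directory.
        zf.extractall(data_dir)
```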
In [1]:
# Uses pip3 to install the necessary package (lightgbm)
!pip3 install lightgbm

# Restarts the IPython kernel so the installed package can be imported.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
In [1]:
import os
from git import Repo
# Current working directory
repo_dir = os.getcwd() + '/repo'
if not os.path.exists(repo_dir):
  os.makedirs(repo_dir)

# Clones the GitHub repository if the directory is empty
if not os.listdir(repo_dir):
  git_url = "https://github.com/google-research/google-research.git"
  Repo.clone_from(git_url, repo_dir)
In [2]:
import numpy as np
import tensorflow as tf
import lightgbm
# Sets current directory
os.chdir(repo_dir)
from dvrl.data_loading import load_rossmann_data, preprocess_data
from dvrl import dvrl
from dvrl.dvrl_metrics import learn_with_dvrl, learn_with_baseline
Input: the number of samples, the setting, and the target store type.

For instance, to evaluate performance on store type 'A': (1) the 'train-on-all' setting uses the entire source dataset; (2) the 'train-on-rest' setting uses the source samples with store types 'B', 'C', and 'D'; (3) the 'train-on-specific' setting uses only the source samples with store type 'A'. Therefore, 'train-on-rest' has the largest distribution difference between the source and target datasets.
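The three settings amount to simple filters on the store type; a minimal pandas sketch (the `StoreType` column name and the `select_source` helper are illustrative assumptions, not the repo's `load_rossmann_data` implementation):

```python
import pandas as pd

def select_source(df, setting, target_store_type):
    """Illustrative source-sample selection for the three settings."""
    if setting == 'train-on-all':
        return df
    if setting == 'train-on-rest':
        # Excludes the target store type, maximizing distribution mismatch.
        return df[df['StoreType'] != target_store_type]
    if setting == 'train-on-specific':
        # Keeps only the target store type.
        return df[df['StoreType'] == target_store_type]
    raise ValueError(setting)

# Toy example with one store of each type.
df = pd.DataFrame({'StoreType': ['A', 'B', 'C', 'D'], 'Sales': [1, 2, 3, 4]})
print(select_source(df, 'train-on-rest', 'A')['StoreType'].tolist())  # ['B', 'C', 'D']
```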
In [3]:
# The number of source / validation samples (79% / 1% of the data);
# the remaining ~20% serves as the target set.
dict_no = dict()
dict_no['source'] = 667027  # 79% of data
dict_no['valid'] = 8443  # 1% of data
# Selects a setting and target store type
setting = 'train-on-rest'
target_store_type = 'B'
# Loads data and selects source, target, validation datasets
load_rossmann_data(dict_no, setting, target_store_type)
print('Finished data loading.')
In [6]:
# Normalization methods: either 'minmax' or 'standard'
normalization = 'minmax'
# Extracts features and labels. Then, normalizes features.
x_source, y_source, x_valid, y_valid, x_target, y_target, _ = \
    preprocess_data(normalization, 'source.csv', 'valid.csv', 'target.csv')
print('Finished data preprocess.')
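The two normalization options correspond to the usual min-max and z-score (standard) transforms; a minimal NumPy sketch of the idea (not the repo's `preprocess_data` implementation, and the zero-range guard is an assumption):

```python
import numpy as np

def normalize(x, method='minmax'):
    """Column-wise feature normalization (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    if method == 'minmax':
        # Scales each feature to [0, 1]; constant columns map to 0.
        rng = x.max(axis=0) - x.min(axis=0)
        return (x - x.min(axis=0)) / np.where(rng == 0, 1, rng)
    if method == 'standard':
        # Zero mean, unit variance per feature; constant columns map to 0.
        std = x.std(axis=0)
        return (x - x.mean(axis=0)) / np.where(std == 0, 1, std)
    raise ValueError(method)

x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(normalize(x, 'minmax')[:, 0])  # [0.  0.5 1. ]
```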
Input: DVRL network parameters, the predictive model, and training flags.
In [7]:
# Resets the graph
tf.reset_default_graph()
# Defines the problem
problem = 'regression'
# Network parameters
parameters = dict()
parameters['hidden_dim'] = 100
parameters['comb_dim'] = 10
parameters['iterations'] = 1000
parameters['activation'] = tf.nn.tanh
parameters['layer_number'] = 5
parameters['batch_size'] = 50000
parameters['learning_rate'] = 0.001
# Defines predictive model
pred_model = lightgbm.LGBMRegressor()
# Sets checkpoint file name
checkpoint_file_name = './tmp/model.ckpt'
# Defines flags for using stochastic gradient descent / a pre-trained model
flags = {'sgd': False, 'pretrain': False}
# Initializes DVRL
dvrl_class = dvrl.Dvrl(x_source, y_source, x_valid, y_valid, problem, pred_model, parameters, checkpoint_file_name, flags)
# Trains DVRL
dvrl_class.train_dvrl('rmspe')
# Estimates data values
dve_out = dvrl_class.data_valuator(x_source, y_source)
# Predicts with DVRL
y_target_hat = dvrl_class.dvrl_predictor(x_target)
print('Finished data valuation.')
DVRL learns robustly even though the training data distribution differs from the target distribution, using guidance from the small validation set (which comes from the target distribution) via reinforcement learning.
In [8]:
# Defines evaluation model
eval_model = lightgbm.LGBMRegressor()
# DVRL-weighted learning
dvrl_perf = learn_with_dvrl(dve_out, eval_model,
                            x_source, y_source, x_valid, y_valid,
                            x_target, y_target, 'rmspe')
# Baseline prediction performance (treat all training samples equally)
base_perf = learn_with_baseline(eval_model, x_source, y_source, x_target, y_target, 'rmspe')
print('Finished evaluation.')
print('DVRL learning performance: ' + str(np.round(dvrl_perf, 4)))
print('Baseline performance: ' + str(np.round(base_perf, 4)))
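The 'rmspe' metric used above is Root Mean Squared Percentage Error, standard for the Rossmann competition; a minimal sketch of the formula (the repo's own implementation may differ in detail, e.g. in how zero labels are handled, which is an assumption here):

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root Mean Squared Percentage Error (illustrative sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Skips zero labels to avoid division by zero (assumption).
    mask = y_true != 0
    return np.sqrt(np.mean(((y_true[mask] - y_pred[mask]) / y_true[mask]) ** 2))

print(round(rmspe([100, 200], [110, 180]), 4))  # 0.1
```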