Copyright 2019 The Google Research Authors.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Domain Adaptation using DVRL

  • Jinsung Yoon, Sercan O Arik, Tomas Pfister, "Data Valuation using Reinforcement Learning", arXiv preprint arXiv:1909.11671 (2019) - https://arxiv.org/abs/1909.11671

This notebook is a user guide for a domain adaptation application of "Data Valuation using Reinforcement Learning (DVRL)".

We consider the scenario where the training dataset comes from a distribution substantially different from that of the validation and testing sets. Data valuation is expected to be beneficial for this task because it selects the training samples that best match the distribution of the validation dataset.

You need:

Source / Target / Validation Datasets

  • If there is no explicit validation set, you can use a small portion of the target set as the validation set and keep the remainder as the target set (see the sketch below).
  • If you have your own source / target / validation datasets, save those files as 'source.csv', 'target.csv', and 'valid.csv' in the './repo/data_files/' directory.
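
A minimal sketch of carving a validation set out of an existing target set with pandas (illustrative, not part of the repo; the 5% fraction and the random seed are arbitrary choices):

import pandas as pd

# Illustrative: split an existing target set into a small validation set
# and a reduced target set.
target_df = pd.read_csv('./repo/data_files/target.csv')

valid_df = target_df.sample(frac=0.05, random_state=0)
target_df = target_df.drop(valid_df.index)

valid_df.to_csv('./repo/data_files/valid.csv', index=False)
target_df.to_csv('./repo/data_files/target.csv', index=False)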

Requirements

We use the Rossmann store sales dataset (https://www.kaggle.com/c/rossmann-store-sales) as the example in this notebook. Please download the dataset (rossmann-store-sales.zip) from https://www.kaggle.com/c/rossmann-store-sales/data and save it to the './repo/data_files/' directory after cloning the GitHub repository.
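
Alternatively, a sketch of downloading via the Kaggle CLI (assumptions: the kaggle package with an API token configured in ~/.kaggle/kaggle.json, and that you have accepted the competition rules on Kaggle):

# Optional alternative to the manual browser download.
!pip3 install kaggle
!kaggle competitions download -c rossmann-store-sales -p ./repo/data_files/
!unzip -o ./repo/data_files/rossmann-store-sales.zip -d ./repo/data_files/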

Prerequisite


In [1]:
# Uses pip3 to install necessary package (lightgbm)
!pip3 install lightgbm

# Resets the IPython kernel to import the installed package.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

In [1]:
import os
from git import Repo  # Requires the GitPython package (pip3 install gitpython)

# Directory for the cloned repository (under the current working directory)
repo_dir = os.getcwd() + '/repo'

if not os.path.exists(repo_dir):
    os.makedirs(repo_dir)

# Clones github repository
if not os.listdir(repo_dir):
    git_url = "https://github.com/google-research/google-research.git"
    Repo.clone_from(git_url, repo_dir)

Necessary packages and function calls

  • load_rossmann_data: Data loader for the Rossmann dataset.
  • preprocess_data: Feature extraction and normalization.
  • dvrl: The DVRL data valuation module.
  • dvrl_metrics: Evaluation metrics for the quality of data valuation in the domain adaptation setting (learn_with_dvrl, learn_with_baseline).

In [2]:
import numpy as np
import tensorflow as tf
import lightgbm

# Sets current directory
os.chdir(repo_dir)

from dvrl.data_loading import load_rossmann_data, preprocess_data
from dvrl import dvrl
from dvrl.dvrl_metrics import learn_with_dvrl, learn_with_baseline

Data loading & selection of the source, target, and validation datasets

  • Load the data and save the source, target, and validation datasets as source.csv, target.csv, and valid.csv in the './repo/data_files/' directory.
  • If you have your own source.csv, target.csv, and valid.csv, you can skip this cell and simply save those files to the './repo/data_files/' directory.

Input:

  • dict_no: The number of source and validation samples; we use 79% / 1% / 20% as the ratios of the source / validation / target datasets.
  • setting: One of 'train-on-all', 'train-on-rest', or 'train-on-specific'.
  • target_store_type: The target store type ('A', 'B', 'C', or 'D').

For instance, to evaluate the performance on store type 'A', (1) the 'train-on-all' setting uses the entire source dataset, (2) the 'train-on-rest' setting uses the source samples with store types 'B', 'C', and 'D', and (3) the 'train-on-specific' setting uses only the source samples with store type 'A'. Therefore, 'train-on-rest' has the largest distribution difference between the source and target datasets.


In [3]:
# The number of source / validation / target samples (79%/1%/20%)
dict_no = dict()
dict_no['source'] = 667027 # 79% of data
dict_no['valid'] = 8443 # 1% of data

# Selects a setting and target store type
setting = 'train-on-rest'
target_store_type = 'B'

# Loads data and selects source, target, validation datasets
load_rossmann_data(dict_no, setting, target_store_type)

print('Finished data loading.')


/usr/local/google/home/jinsungyoon/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3214: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
  if (yield from self.run_code(code, result)):
Finished data loading.

Data preprocessing

  • Extract features and labels from source.csv, valid.csv, and target.csv in the './repo/data_files/' directory.
  • Normalize the features of the source, validation, and target sets (see the sketch below).
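
For intuition, a minimal sketch of what the normalization step amounts to (illustrative only; the actual logic lives in preprocess_data). A common choice, assumed here, is to fit the scaler on the source features and reuse it for the validation and target features:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

def normalize_features(x_source_raw, x_valid_raw, x_target_raw, method='minmax'):
    # Fit the scaler on the source features only, then apply the same
    # transformation to all three sets so they share one feature scaling.
    scaler = MinMaxScaler() if method == 'minmax' else StandardScaler()
    scaler.fit(x_source_raw)
    return (scaler.transform(x_source_raw),
            scaler.transform(x_valid_raw),
            scaler.transform(x_target_raw))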

In [6]:
# Normalization methods: either 'minmax' or 'standard'
normalization = 'minmax' 

# Extracts features and labels. Then, normalizes features.
x_source, y_source, x_valid, y_valid, x_target, y_target, _ = \
preprocess_data(normalization, 'source.csv', 'valid.csv', 'target.csv')

print('Finished data preprocess.')


/usr/local/google/home/jinsungyoon/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py:334: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by MinMaxScaler.
  return self.partial_fit(X, y)
Finished data preprocess.

Run DVRL

  1. Input:

    • data valuator network parameters: Set the network parameters of the data valuator.
    • pred_model: The predictor model that maps inputs to outputs. Any machine learning model (e.g., a neural network or a decision tree ensemble) can be used as the predictor model, as long as it provides fit, and predict (for regression) / predict_proba (for classification) as its methods. fit may internally run multiple backpropagation iterations. A minimal sketch of this interface follows this list.
  2. Output:
    • data_valuator: Function that takes the training set as input and estimates data values.
    • dvrl_predictor: Function that predicts the labels of the testing samples.
    • dve_out: Estimated data values for all training samples.
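
A minimal sketch of the required predictor interface, using a hypothetical wrapper around a Ridge regressor (any sklearn-style regressor already satisfies it):

from sklearn.linear_model import Ridge

class MyPredictor:
    # Hypothetical predictor: for regression problems, only fit() and
    # predict() are required by DVRL.
    def __init__(self):
        self.model = Ridge()

    def fit(self, x, y):
        self.model.fit(x, y)
        return self

    def predict(self, x):
        return self.model.predict(x)

For example, pred_model = MyPredictor() could replace the LightGBM model used below.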

In [7]:
# Resets the graph
tf.reset_default_graph()

# Defines the problem
problem = 'regression'

# Network parameters
parameters = dict()
parameters['hidden_dim'] = 100
parameters['comb_dim'] = 10
parameters['iterations'] = 1000
parameters['activation'] = tf.nn.tanh
parameters['layer_number'] = 5
parameters['batch_size'] = 50000
parameters['learning_rate'] = 0.001

# Defines predictive model
pred_model = lightgbm.LGBMRegressor()

# Sets checkpoint file name
checkpoint_file_name = './tmp/model.ckpt'

# Defines flag for using stochastic gradient descent / pre-trained model
flags = {'sgd': False, 'pretrain': False}

# Initializes DVRL
dvrl_class = dvrl.Dvrl(x_source, y_source, x_valid, y_valid, problem, pred_model, parameters, checkpoint_file_name, flags)

# Trains DVRL
dvrl_class.train_dvrl('rmspe')

# Estimates data values
dve_out = dvrl_class.data_valuator(x_source, y_source)

# Predicts with DVRL
y_target_hat = dvrl_class.dvrl_predictor(x_target)

print('Finished data valuation.')


WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /usr/local/google/home/jinsungyoon/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
100%|██████████| 1000/1000 [14:26<00:00,  1.22it/s]
WARNING:tensorflow:From /usr/local/google/home/jinsungyoon/anaconda3/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./tmp/model.ckpt
Finished data valuation.

Evaluations

  • In this notebook, we use LightGBM as the predictor model for evaluation purposes (it can be replaced with any other model).
  • Here, we use Root Mean Squared Percentage Error (RMSPE) as the performance metric (sketched below).
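
For reference, a sketch of RMSPE as commonly defined (the repo computes it internally; masking zero labels to avoid division by zero is a choice made for this sketch):

def rmspe(y_true, y_pred):
    # Root Mean Squared Percentage Error.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # avoid division by zero
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return np.sqrt(np.mean(pct_err ** 2))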

DVRL Performance

DVRL learns robustly even though the training data comes from a distribution substantially different from the target data distribution, using guidance from the small validation set (which comes from the target distribution) via reinforcement learning.

  • Train the predictive model with weighted optimization, using the data values estimated by DVRL as sample weights (see the sketch below).
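
Conceptually, the weighted optimization amounts to passing the estimated data values as per-sample weights (a sketch of the idea; the exact implementation lives in learn_with_dvrl):

# Conceptual sketch: the estimated data values serve as sample weights.
sketch_model = lightgbm.LGBMRegressor()
sketch_model.fit(x_source, y_source, sample_weight=dve_out.flatten())
y_target_sketch = sketch_model.predict(x_target)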

In [8]:
# Defines evaluation model
eval_model = lightgbm.LGBMRegressor()

# DVRL-weighted learning
dvrl_perf = learn_with_dvrl(dve_out, eval_model, 
                            x_source, y_source, x_valid, y_valid, x_target, y_target, 'rmspe')

# Baseline prediction performance (treat all training samples equally)
base_perf = learn_with_baseline(eval_model, x_source, y_source, x_target, y_target, 'rmspe')

print('Finished evaluation.')
print('DVRL learning performance: ' + str(np.round(dvrl_perf, 4)))
print('Baseline performance: ' + str(np.round(base_perf, 4)))


Finished evaluation.
DVRL learning performance: 0.2955
Baseline performance: 0.8471