This notebook illustrates how to use the R interface for TensorFlow to build an ML model that estimates a baby's weight given a number of factors, using the BigQuery natality dataset. We use AI Platform Training to train the TensorFlow model at scale, and then use AI Platform Prediction to serve the trained model for online predictions.
R is one of the most widely used programming languages for statistical modeling, with a large and active community of data scientists and ML professionals. With over 10,000 packages in the open-source CRAN repository, R covers a wide range of statistical data analysis, ML, and visualization applications.
The dataset used in this tutorial is the natality data, which describes all United States births registered in the 50 states, the District of Columbia, and New York City from 1969 to 2008, with more than 137 million records. The dataset is available as a BigQuery public dataset. We use the data extracted from BigQuery and stored as CSV files in Cloud Storage (GCS) in the Exploratory Data Analysis notebook.
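For reference, the extraction step can be reproduced from R with the bigrquery package. The following is a minimal sketch under that assumption (bigrquery is not otherwise used in this notebook), querying the bigquery-public-data.samples.natality public table:
library(bigrquery) # assumption: bigrquery is installed; it is not used elsewhere in this notebook
sql <- "
SELECT weight_pounds, is_male, mother_age, mother_race, plurality,
       gestation_weeks, mother_married, cigarette_use, alcohol_use
FROM `bigquery-public-data.samples.natality`
LIMIT 10"
# Run the query under your project and download the result as a data frame
natality_sample <- bq_table_download(bq_project_query("your-project-id", sql))
head(natality_sample)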
In this notebook, we focus on building, training, and deploying the model; the goal is to predict the baby's weight given a number of factors about the pregnancy and the baby's mother.
The goal of this tutorial is to create a premade TensorFlow estimator, train and export the model locally, train the model at scale with AI Platform Training, deploy the exported model to AI Platform Prediction, and invoke the deployed model for online predictions.
This tutorial uses billable components of Google Cloud Platform (GCP), including AI Platform Training, AI Platform Prediction, and Cloud Storage.
To learn about GCP pricing, use the Pricing Calculator to generate a cost estimate based on your projected usage.
In [1]:
version
Install and import the required libraries.
This may take several minutes if the packages are not already installed.
In [2]:
install.packages(c("tfestimators", "tfdatasets", "cloudml", "jsonlite", "reticulate"))
In [3]:
library(tfestimators) # used for creating tensorflow estimators
library(tfdatasets) # used for creating data input functions
library(cloudml) # used for training and deploying models to AI Platform
install_tensorflow() # installs the TensorFlow runtime used by these packages
Set your PROJECT_ID, BUCKET_NAME, and REGION.
In [4]:
# Set the project id
PROJECT_ID <- "r-on-gcp"
# Set your GCS bucket
BUCKET_NAME <- "r-on-gcp"
# Set your training and model deployment region
REGION <- 'us-central1'
In [5]:
# numerical columns
mother_age <- tf$feature_column$numeric_column("mother_age")
plurality <- tf$feature_column$numeric_column('plurality')
gestation_weeks <- tf$feature_column$numeric_column('gestation_weeks')
# categorical columns
is_male <- tf$feature_column$categorical_column_with_vocabulary_list("is_male", vocabulary_list = c("True", "False"))
mother_race <- tf$feature_column$categorical_column_with_vocabulary_list(
  'mother_race', vocabulary_list = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '18', '28', '38', '48', '58', '69', '78'))
mother_married <- tf$feature_column$categorical_column_with_vocabulary_list('mother_married', c('True', 'False'))
cigarette_use <- tf$feature_column$categorical_column_with_vocabulary_list('cigarette_use', c('True', 'False', 'None'))
alcohol_use <- tf$feature_column$categorical_column_with_vocabulary_list('alcohol_use', c('True', 'False', 'None'))
# extended feature columns
cigarette_use_X_alcohol_use <- tf$feature_column$crossed_column(c("cigarette_use", "alcohol_use"), 9)
mother_race_embedded <- tf$feature_column$embedding_column(mother_race, 3)
mother_age_bucketized <- tf$feature_column$bucketized_column(mother_age, boundaries = c(18, 22, 28, 32, 36, 40, 42, 45, 50))
mother_race_X_mother_age_bucketized <- tf$feature_column$crossed_column(c(mother_age_bucketized, "mother_race"), 120)
mother_race_X_mother_age_bucketized_embedded <- tf$feature_column$embedding_column(mother_race_X_mother_age_bucketized, 5)
# wide and deep columns
wide_columns <- feature_columns(
  is_male, mother_race, plurality, mother_married, cigarette_use, alcohol_use, cigarette_use_X_alcohol_use, mother_age_bucketized)
deep_columns <- feature_columns(
  mother_age, gestation_weeks, mother_race_embedded, mother_race_X_mother_age_bucketized_embedded)
We use the premade dnn_linear_combined_regressor. This is a Wide & Deep model that is useful for generic large-scale regression problems with sparse input features (e.g., categorical features with a large number of possible feature values) and dense input features (numerical features).
In [6]:
model_dir <- 'models/tf_babyweight_estimator'
model <- dnn_linear_combined_regressor(
  model_dir = model_dir,
  linear_feature_columns = wide_columns,
  dnn_feature_columns = deep_columns,
  dnn_optimizer = "Adagrad",
  linear_optimizer = "Ftrl",
  dnn_hidden_units = c(64, 64),
  dnn_activation_fn = "relu",
  dnn_dropout = 0.1
)
If you ran the Exploratory Data Analysis notebook, you should have the train_data.csv and eval_data.csv files uploaded to GCS. You can download them to train your model locally using the following cell. However, if you already have the files locally, you can skip it.
In [7]:
dir.create(file.path('data'), showWarnings = FALSE) # create a local data directory
gcs_data_dir <- paste0("gs://", BUCKET_NAME, "/data/*_data.csv")
gsutil_exec("cp", "-r", gcs_data_dir, "data/") # copy the CSV files from GCS
Create a data input function for training and evaluation, based on the data files.
In [8]:
train_file <- "data/train_data.csv"
eval_file <- "data/eval_data.csv"
header <- c(
"weight_pounds",
"is_male", "mother_age", "mother_race", "plurality", "gestation_weeks",
"mother_married", "cigarette_use", "alcohol_use",
"key")
types <- c(
"double",
"character", "double", "character", "double", "double",
"character", "character", "character",
"character")
target <- "weight_pounds"
key <- "key"
features <- setdiff(header, c(target, key))
data_input_fn <- function(data, batch_size, num_epochs = 1, shuffle = FALSE) {
input_fn(data, features = features, response = target,
batch_size = batch_size, shuffle = shuffle, num_epochs = num_epochs)
}
train_data <- read.table(train_file, col.names = header, sep=",", colClasses = types)
eval_data <- read.table(eval_file, col.names = header, sep=",", colClasses = types)
In [9]:
batch_size <- 64
num_epochs <- 2
# Remove any previously trained model artifacts before retraining
unlink(model_dir, recursive = TRUE)
history <- train(
  model,
  input_fn = data_input_fn(train_data, batch_size = batch_size, num_epochs = num_epochs, shuffle = TRUE)
)
# Plot the training loss
plot(history)
In [10]:
evaluate(
  model,
  input_fn = data_input_fn(eval_data, batch_size = batch_size)
)
In [11]:
# Build the serving feature spec: one placeholder per input feature, with a
# default value ('NA' for character columns, 0 for numeric columns)
feature_spec <- list()
for (i in seq_along(header)) {
  column <- header[i]
  if (column %in% features) {
    default_value <- 'NA'
    column_type <- types[i]
    if (column_type != 'character') {
      default_value <- 0
    }
    default_tensor <- tf$constant(value = default_value, shape = shape(1, 1))
    feature_spec[[column]] <- tf$placeholder_with_default(
      input = default_tensor, shape = shape(NULL, 1))
  }
}
serving_input_receiver_fn <- tf$estimator$export$build_raw_serving_input_receiver_fn(feature_spec)
saved_model_dir <- paste0(model_dir, '/export')
export_savedmodel(model, saved_model_dir, serving_input_receiver_fn = serving_input_receiver_fn)
print(paste("Model exported to:", saved_model_dir))
In order to train your TensorFlow estimator at scale using AI Platform Training, you need to write your training implementation in a model_trainer.R file. The file includes the code from the previous cells to create, train, evaluate, and export the TensorFlow dnn_linear_combined_regressor model.
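As a reference, here is a minimal sketch of what model_trainer.R contains, assuming the same objects (feature columns, data_input_fn, serving_input_receiver_fn) defined in the cells above:
# model_trainer.R (sketch): the same steps as the cells above, packaged as a script
library(tfestimators)
library(tfdatasets)
# ... define feature columns, data_input_fn, and serving_input_receiver_fn as above ...
model <- dnn_linear_combined_regressor(
  model_dir = model_dir,
  linear_feature_columns = wide_columns,
  dnn_feature_columns = deep_columns,
  dnn_hidden_units = c(64, 64)
)
train(model, input_fn = data_input_fn(train_data, batch_size = 64, num_epochs = 2, shuffle = TRUE))
evaluate(model, input_fn = data_input_fn(eval_data, batch_size = 64))
export_savedmodel(model, saved_model_dir, serving_input_receiver_fn = serving_input_receiver_fn)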
The job will take several minutes to complete...
In [13]:
setwd("src/tensorflow")
getwd()
cloudml_train('model_trainer.R', region = REGION)
setwd("../..")
getwd()
After the job finishes, verify that the trained model is in GCS.
In [21]:
model_name <- 'tf_babyweight_estimator'
gcs_model_dir <- paste0("gs://", BUCKET_NAME, "/models/", model_name)
gsutil_exec("ls", gcs_model_dir, echo = TRUE)
In [ ]:
# Alternatively, copy the locally exported SavedModel to GCS
model_name <- 'tf_babyweight_estimator'
gcs_model_dir <- paste0("gs://", BUCKET_NAME, "/models/", model_name)
gsutil_exec("cp", "-r", saved_model_dir, gcs_model_dir)
In [ ]:
gcloud_exec("ai-platform", "models", "list")
# Create model
model_name <- 'tf_babyweight_estimator'
gcloud_exec("ai-platform", "models", "create", model_name, "--regions", REGION)
In [17]:
# List models
gcloud_exec("ai-platform", "models", "list", "--filter","name~tf_babyweight_estimator")
In [ ]:
# Create version
model_version <- 'v01'
framework <- 'tensorflow'
runtime_version <- '1.14'
gcloud_exec("ai-platform", "versions", "create", model_version,
"--model", model_name,
"--framework", framework,
"--runtime-version", runtime_version,
"--origin", gcs_model_dir
)
In [16]:
# List versions
gcloud_exec("ai-platform", "versions", "list", "--model", model_name)
In [20]:
library("rjson")
model_version <- 'v01'
# Example instance; categorical values must match the vocabulary lists
# defined above ('True'/'False'/'None')
instances_string <- '
[
  {
    "is_male": ["True"],
    "mother_age": [28],
    "mother_race": ["8"],
    "plurality": [1],
    "gestation_weeks": [18],
    "mother_married": ["True"],
    "cigarette_use": ["False"],
    "alcohol_use": ["False"]
  }
]
'
instances <- jsonlite::fromJSON(instances_string, simplifyVector = FALSE)
predictions <- cloudml_predict(instances, model_name, version = model_version, verbose = TRUE)
print(paste("Estimated weight(s):", predictions))
Authors: Daniel Sparing & Khalid Salama
Disclaimer: This is not an official Google product. The sample code is provided for educational purposes only.
Copyright 2019 Google LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.