This notebook illustrates how to use the caret R package to build an ML model that estimates a baby's weight given a number of factors, using the BigQuery natality dataset. We use AI Platform Training with custom containers to train the caret model at scale, then use Cloud Run to serve the trained model as a Web API for online predictions.
R is one of the most widely used programming languages for statistical modeling, with a large and active community of data scientists and ML professionals. With more than 10,000 packages in CRAN, the open-source package repository, R caters to all statistical data analysis applications, ML, and visualization.
The dataset used in this tutorial is natality data, which describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008, with more than 137 million records. The dataset is available as a BigQuery public dataset. We use the data extracted from BigQuery and stored as CSV files in Cloud Storage (GCS) in the Exploratory Data Analysis notebook.
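For reference, extracting such a sample directly from the public natality table can be done with the bigrquery package. The snippet below is only a sketch (the package, query, and row limit are assumptions on my part; the actual extraction is performed in the Exploratory Data Analysis notebook):
# A minimal sketch of pulling a sample of the natality data with bigrquery.
library(bigrquery)
sql <- "
SELECT weight_pounds, is_male, mother_age, mother_race, plurality,
gestation_weeks, mother_married, cigarette_use, alcohol_use
FROM `bigquery-public-data.samples.natality`
WHERE year > 2000
LIMIT 10000
"
# Run the query in your own project and download the result as a data frame.
tb <- bq_project_query("your-project-id", sql)  # replace with your project id
natality_sample <- bq_table_download(tb)
head(natality_sample)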
In this notebook, we focus on training and serving the model; the goal is to predict the baby's weight given a number of factors about the pregnancy and the baby's mother.
The goal of this tutorial is to: train the caret model locally in the notebook, train it at scale on AI Platform Training using a custom container, and serve the trained model as a Web API on Cloud Run for online predictions.
This tutorial uses billable components of Google Cloud Platform (GCP):
Learn about GCP pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
In [1]:
version
Install and import the required libraries.
This may take several minutes if the packages are not already installed...
In [2]:
# caret drives training; xgboost, jsonlite, and httr are used later in this notebook
install.packages(c("caret", "xgboost", "jsonlite", "httr"))
In [3]:
library(caret) # used to build a regression model
Set your PROJECT_ID, BUCKET_NAME, and REGION.
In [4]:
# Set the project id
PROJECT_ID <- "r-on-gcp"
# Set your GCS bucket
BUCKET_NAME <- "r-on-gcp"
# Set your training and model deployment region
REGION <- 'europe-west1'
If you ran the Exploratory Data Analysis notebook, you should have the train_data.csv and eval_data.csv files uploaded to GCS. You can download them to train your model locally using the following cell. However, if the files are already available locally, you can skip the following cell.
In [5]:
dir.create(file.path('data'), showWarnings = FALSE)
gcs_data_dir <- paste0("gs://", BUCKET_NAME, "/data/*_data.csv")
command <- paste("gsutil cp -r", gcs_data_dir, "data/")
print(command)
system(command, intern = TRUE)
In [6]:
train_file <- "data/train_data.csv"
eval_file <- "data/eval_data.csv"
header <- c(
"weight_pounds",
"is_male", "mother_age", "mother_race", "plurality", "gestation_weeks",
"mother_married", "cigarette_use", "alcohol_use",
"key")
target <- "weight_pounds"
key <- "key"
features <- setdiff(header, c(target, key))
train_data <- read.table(train_file, col.names = header, sep=",")
eval_data <- read.table(eval_file, col.names = header, sep=",")
In [7]:
# Use bootstrap resampling (10 resamples) to estimate model performance
trainControl <- trainControl(method = 'boot', number = 10)
# A single fixed set of xgboost hyperparameters (no grid search)
hyper_parameters <- expand.grid(
nrounds = 100,
max_depth = 6,
eta = 0.3,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
print('Training the model...')
model <- train(
y=train_data$weight_pounds,
x=train_data[, features],
preProc = c("center", "scale"),
method='xgbTree',
trControl=trainControl,
tuneGrid=hyper_parameters
)
print('Model is trained.')
In [8]:
# Evaluate the trained model on the held-out evaluation data
postResample(pred = predict(model, eval_data[, features]), obs = eval_data$weight_pounds)
In [9]:
model_dir <- "models"
model_name <- "caret_babyweight_estimator"
In [10]:
# Saving the trained model
dir.create(model_dir, showWarnings = FALSE)
dir.create(file.path(model_dir, model_name), showWarnings = FALSE)
saveRDS(model, file.path(model_dir, model_name, "trained_model.rds"))
This is an implementation of a wrapper function around the model to perform prediction. The function expects a list of instances in JSON format and returns a list of predictions (estimated weights). This prediction function will be used when serving the model as a Web API for online predictions.
In [11]:
xgbtree <- readRDS(file.path(model_dir, model_name, "trained_model.rds"))
estimate_babyweights <- function(instances_json){
library("rjson")
instances <- jsonlite::fromJSON(instances_json)
df_instances <- data.frame(instances)
# fix data types
boolean_columns <- c("is_male", "mother_married", "cigarette_use", "alcohol_use")
for(col in boolean_columns){
df_instances[[col]] <- as.logical(df_instances[[col]])
}
estimates <- predict(xgbtree, df_instances)
return(estimates)
}
instances_json <- '
[
{
"is_male": "TRUE",
"mother_age": 28,
"mother_race": 8,
"plurality": 1,
"gestation_weeks": 28,
"mother_married": "TRUE",
"cigarette_use": "FALSE",
"alcohol_use": "FALSE"
},
{
"is_male": "FALSE",
"mother_age": 38,
"mother_race": 18,
"plurality": 1,
"gestation_weeks": 28,
"mother_married": "TRUE",
"cigarette_use": "TRUE",
"alcohol_use": "TRUE"
}
]
'
estimate <- round(estimate_babyweights(instances_json), digits = 2)
print(paste("Estimated weight(s):", estimate))
In order to train your caret model at scale using AI Platform Training, you need to implement your training logic in an R script file, containerize it in a Docker image, and submit a training job that runs this image on AI Platform Training.
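As a rough illustration of what such a training script might contain, the logic mirrors the local training above: download the data from GCS, fit the caret model, and copy the saved model back to GCS. This is only a sketch; file names, the BUCKET_NAME environment variable, and paths are assumptions, and the actual code lives under src/caret/training:
# Illustrative training script (a sketch; names and paths are assumptions).
library(caret)
# Bucket to read data from and write the trained model to (hypothetical env variable).
bucket <- Sys.getenv("BUCKET_NAME", "r-on-gcp")
# Download the training data prepared by the Exploratory Data Analysis notebook.
system(paste0("gsutil cp gs://", bucket, "/data/train_data.csv ."), intern = TRUE)
header <- c("weight_pounds", "is_male", "mother_age", "mother_race", "plurality",
"gestation_weeks", "mother_married", "cigarette_use", "alcohol_use", "key")
features <- setdiff(header, c("weight_pounds", "key"))
train_data <- read.table("train_data.csv", col.names = header, sep = ",")
# Fit the same xgbTree model as in the local training cell above.
hyper_parameters <- expand.grid(nrounds = 100, max_depth = 6, eta = 0.3, gamma = 0,
colsample_bytree = 1, min_child_weight = 1, subsample = 1)
model <- train(
y = train_data$weight_pounds,
x = train_data[, features],
preProc = c("center", "scale"),
method = "xgbTree",
trControl = trainControl(method = "boot", number = 10),
tuneGrid = hyper_parameters
)
# Save the model and copy it to GCS, where the serving step expects to find it.
saveRDS(model, "trained_model.rds")
system(paste0("gsutil cp trained_model.rds gs://", bucket,
"/models/caret_babyweight_estimator/"), intern = TRUE)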
The src/caret/training directory includes the code files for the training logic and the Dockerfile used to build the training container image.
To submit the training job with the custom container to AI Platform Training, you need to do the following steps:
1. Build the base Docker container image and push it to Container Registry.
2. Build the training Docker container image and push it to Container Registry.
3. Submit the training job to AI Platform Training using the custom container image.
In [13]:
# Create base image
base_image_url <- paste0("gcr.io/", PROJECT_ID, "/caret_base")
print(base_image_url)
setwd("src/caret")
getwd()
print("Building the base Docker container image...")
command <- paste0("docker build -f Dockerfile --tag ", base_image_url, " ./")
print(command)
system(command, intern = TRUE)
print("Pushing the baseDocker container image...")
command <- paste0("gcloud docker -- push ", base_image_url)
print(command)
system(command, intern = TRUE)
setwd("../..")
getwd()
In [14]:
training_image_url <- paste0("gcr.io/", PROJECT_ID, "/", model_name, "_training")
print(training_image_url)
setwd("src/caret/training")
getwd()
print("Building the Docker container image...")
command <- paste0("docker build -f Dockerfile --tag ", training_image_url, " ./")
print(command)
system(command, intern = TRUE)
print("Pushing the Docker container image...")
command <- paste0("gcloud docker -- push ", training_image_url)
print(command)
system(command, intern = TRUE)
setwd("../../..")
getwd()
In [15]:
command <- paste0("gcloud container images list --repository=gcr.io/", PROJECT_ID)
system(command, intern = TRUE)
In [16]:
job_name <- paste0("train_caret_contrainer_", format(Sys.time(), "%Y%m%d_%H%M%S"))
command = paste0("gcloud beta ai-platform jobs submit training ", job_name,
" --master-image-uri=", training_image_url,
" --scale-tier=BASIC",
" --region=", REGION
)
print(command)
system(command, intern = TRUE)
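You can also monitor the submitted job from the notebook; for example (a sketch using standard gcloud commands), describe the job or stream its logs until it finishes:
# Check the status of the submitted training job.
command <- paste0("gcloud ai-platform jobs describe ", job_name)
print(command)
system(command, intern = TRUE)
# Optionally stream the job logs (runs until the job completes).
# system(paste0("gcloud ai-platform jobs stream-logs ", job_name), intern = TRUE)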
Verify that the trained model is in GCS after the job finishes.
In [18]:
model_name <- 'caret_babyweight_estimator'
gcs_model_dir <- paste0("gs://", BUCKET_NAME, "/models/", model_name)
command <- paste0("gsutil ls ", gcs_model_dir)
system(command, intern = TRUE)
In order to serve the trained caret model as a Web API, you need to wrap it with a prediction function and serve that function as a REST API. Then you containerize this Web API and deploy it to Cloud Run.
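A common way to expose such a prediction function as a REST API in R is the plumber package. The sketch below is only an illustration of the idea (the file name, port, and exact code are assumptions; see the actual implementation under src/caret/serving). It exposes an estimate endpoint matching the one called later in this notebook:
# serving_api.R -- an illustrative plumber API (a sketch, not the actual serving code).
library(caret)
# Load the trained model that was copied into the serving container or downloaded from GCS.
model <- readRDS("trained_model.rds")
#* Estimate baby weight for a JSON array of instances
#* @post /estimate
function(req) {
instances <- jsonlite::fromJSON(req$postBody)
df_instances <- data.frame(instances)
# Fix data types, as in the wrapper function above.
for (col in c("is_male", "mother_married", "cigarette_use", "alcohol_use")) {
df_instances[[col]] <- as.logical(df_instances[[col]])
}
predict(model, df_instances)
}
# The container entry point would then run something like:
# plumber::plumb("serving_api.R")$run(host = "0.0.0.0", port = 8080)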
The src/caret/serving directory includes the code files for the prediction Web API and the Dockerfile used to build the serving container image.
To deploy the prediction Web API to Cloud Run, you need to do the following steps:
1. Upload the trained model to GCS, where the serving container can load it.
2. Build the serving Docker container image and push it to Container Registry.
3. Deploy the serving container image to Cloud Run.
In [ ]:
model_name <- 'caret_babyweight_estimator'
gcs_model_dir <- paste0("gs://", BUCKET_NAME, "/models/", model_name, "/")
command <- paste0("gsutil cp -r models/", model_name, "/* ", gcs_model_dir)
print(command)
system(command, intern = TRUE)
In [19]:
serving_image_url <- paste0("gcr.io/", PROJECT_ID, "/", model_name, "_serving")
print(serving_image_url)
setwd("src/caret/serving")
getwd()
print("Building the Docker container image...")
command <- paste0("docker build -f Dockerfile --tag ", serving_image_url, " ./")
print(command)
system(command, intern = TRUE)
print("Pushing the Docker container image...")
command <- paste0("gcloud docker -- push ", serving_image_url)
print(command)
system(command, intern = TRUE)
setwd("../../..")
getwd()
In [20]:
command <- paste0("gcloud container images list --repository=gcr.io/", PROJECT_ID)
system(command, intern = TRUE)
In [ ]:
service_name <- "caret-babyweight-estimator"
command <- paste(
"gcloud beta run deploy", service_name,
"--image", serving_image_url,
"--platform managed",
"--allow-unauthenticated",
"--region", REGION
)
print(command)
system(command, intern = TRUE)
When the caret-babyweight-estimator service is deployed to Cloud Run, copy the service URL shown in the deployment output and set it in the url variable below. You can then call the estimate endpoint with a POST request containing the JSON instances.
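If you prefer not to copy the URL manually, you can retrieve it with gcloud (a sketch; assumes the gcloud run command group is available in your installation):
# Retrieve the deployed service URL programmatically.
command <- paste(
"gcloud run services describe", service_name,
"--platform managed",
"--region", REGION,
"--format 'value(status.url)'"
)
print(command)
# url <- system(command, intern = TRUE)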
In [21]:
# Update to the deployed service URL
url <- "https://caret-babyweight-estimator-lbcii4x34q-uc.a.run.app/"
endpoint <- "estimate"
In [22]:
instances_json <- '
[
{
"is_male": "TRUE",
"mother_age": 28,
"mother_race": 8,
"plurality": 1,
"gestation_weeks": 28,
"mother_married": "TRUE",
"cigarette_use": "FALSE",
"alcohol_use": "FALSE"
},
{
"is_male": "FALSE",
"mother_age": 38,
"mother_race": 18,
"plurality": 1,
"gestation_weeks": 28,
"mother_married": "TRUE",
"cigarette_use": "TRUE",
"alcohol_use": "TRUE"
}
]
'
In [23]:
library("httr")
full_url <- paste0(url, endpoint)
response <- POST(full_url, body = instances_json)
estimates <- content(response)
print(paste("Estimated weight(s):", estimate))
Authors: Daniel Sparing & Khalid Salama
Disclaimer: This is not an official Google product. The sample code is provided for educational purposes.
Copyright 2019 Google LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.