This notebook illustrates:
- Creating training and evaluation tables from the natality dataset in BigQuery
- Exporting those tables to Google Cloud Storage as CSV files
- Submitting a training job to Cloud AI Platform
- Deploying the trained model to AI Platform for serving
In [ ]:
# Take ownership of the cloned repository so edits can be saved from this notebook.
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst
In [ ]:
# Ensure the right version of TensorFlow is installed.
!pip freeze | grep tensorflow==2.1
Set environment variables so that we can use them throughout the entire notebook. Replace the bucket, project, and region below with your own values.
In [2]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml' # Replace with your bucket name
PROJECT = 'cloud-training-demos' # Replace with your project-id
REGION = 'us-central1'
In [ ]:
import os
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = "2.1"
os.environ["PYTHONVERSION"] = "3.7"
Check that the Google BigQuery library is installed and if not, install it.
In [1]:
%%bash
sudo pip freeze | grep google-cloud-bigquery==1.6.1 || \
sudo pip install google-cloud-bigquery==1.6.1
In [2]:
import os
from google.cloud import bigquery
In [ ]:
%%bash
export PROJECT=$(gcloud config list project --format "value(core.project)")
echo "Your current GCP Project Name is: "$PROJECT
Our dataset is hosted in BigQuery. The CDC's Natality data has details on US births from 1969 to 2008 and is a publicly available dataset, meaning anyone with a GCP account can access it.
The natality dataset is relatively large at almost 138 million rows and 31 columns, but simple to understand: weight_pounds is the target, the continuous value we'll train a model to predict.
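If you want a quick look at the raw data before transforming it, the BigQuery Python client can run a small preview query. This is a hedged extra step, not part of the original lab, and to_dataframe() assumes pandas is installed in the notebook environment.
In [ ]:
# Optional preview of a few natality rows (illustration only).
from google.cloud import bigquery

client = bigquery.Client()
preview = client.query("""
    SELECT weight_pounds, is_male, mother_age, plurality, gestation_weeks
    FROM publicdata.samples.natality
    LIMIT 5
""").result().to_dataframe()
print(preview)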
In [ ]:
%%bash
# Create a BigQuery dataset for babyweight if it doesn't exist
datasetexists=$(bq ls -d | grep -w babyweight)

if [ -n "$datasetexists" ]; then
    echo -e "BigQuery dataset already exists, let's not recreate it."
else
    echo "Creating BigQuery dataset titled: babyweight"

    bq --location=US mk --dataset \
        --description "Babyweight" \
        $PROJECT:babyweight
    echo "Here are your current datasets:"
    bq ls
fi

## Create GCS bucket if it doesn't exist already...
exists=$(gsutil ls -d | grep -w gs://${BUCKET}/)

if [ -n "$exists" ]; then
    echo -e "Bucket exists, let's not recreate it."
else
    echo "Creating a new GCS bucket."
    gsutil mb -l ${REGION} gs://${BUCKET}
    echo "Here are your current buckets:"
    gsutil ls
fi
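For reference, the same "create the dataset only if it is missing" logic can be done from Python with the BigQuery client. This is an optional sketch, equivalent to the bash cell above, and assumes the google-cloud-bigquery library installed earlier.
In [ ]:
# Optional Python equivalent of the dataset-creation logic above (sketch only).
from google.cloud import bigquery
from google.api_core.exceptions import Conflict

client = bigquery.Client()
dataset = bigquery.Dataset(client.dataset("babyweight"))
dataset.location = "US"
try:
    client.create_dataset(dataset)  # raises Conflict if the dataset already exists
    print("Created dataset: babyweight")
except Conflict:
    print("Dataset babyweight already exists.")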
Since there is already a publicly available dataset, we can simply create the training and evaluation data tables using this raw input data. First we are going to create a subset of the data, limiting our columns to weight_pounds, is_male, mother_age, plurality, and gestation_weeks, applying some simple filtering, and adding a column to hash on for repeatable splitting.
We have some preprocessing and filtering we would like to do to get our data in the right format for training.

Preprocessing:
- Cast is_male from BOOL to STRING
- Cast plurality from INTEGER to STRING, where [1, 2, 3, 4, 5] becomes ["Single(1)", "Twins(2)", "Triplets(3)", "Quadruplets(4)", "Quintuplets(5)"]
- Add a hashcolumn, hashing on year and month (illustrated in the sketch after this list)

Filtering:
- year > 2000
- weight_pounds > 0
- mother_age > 0
- plurality > 0
- gestation_weeks > 0
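To see why hashing gives a repeatable split, here is a small Python illustration. It is concept-only, using hashlib rather than BigQuery's FARM_FINGERPRINT, and the 3-of-4 bucket rule mirrors the train/eval queries further below.
In [ ]:
# Concept illustration: a deterministic hash of (year, month) always sends the
# same month to the same split, so the split never changes between runs.
import hashlib

def assign_split(year, month, total_buckets=4, train_buckets=3):
    key = "{}{}".format(year, month).encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % total_buckets
    return "train" if bucket < train_buckets else "eval"

print(assign_split(2005, 7))  # same result every time it is called
print(assign_split(2005, 7))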
In [6]:
%%bigquery
CREATE OR REPLACE TABLE
babyweight.babyweight_data AS
SELECT
weight_pounds,
CAST(is_male AS STRING) AS is_male,
mother_age,
CASE
WHEN plurality = 1 THEN "Single(1)"
WHEN plurality = 2 THEN "Twins(2)"
WHEN plurality = 3 THEN "Triplets(3)"
WHEN plurality = 4 THEN "Quadruplets(4)"
WHEN plurality = 5 THEN "Quintuplets(5)"
END AS plurality,
gestation_weeks,
FARM_FINGERPRINT(
CONCAT(
CAST(year AS STRING),
CAST(month AS STRING)
)
) AS hashmonth
FROM
publicdata.samples.natality
WHERE
year > 2000
AND weight_pounds > 0
AND mother_age > 0
AND plurality > 0
AND gestation_weeks > 0
Out[6]:
In [7]:
%%bigquery
CREATE OR REPLACE TABLE
babyweight.babyweight_augmented_data AS
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks,
hashmonth
FROM
babyweight.babyweight_data
UNION ALL
SELECT
weight_pounds,
"Unknown" AS is_male,
mother_age,
CASE
WHEN plurality = "Single(1)" THEN plurality
ELSE "Multiple(2+)"
END AS plurality,
gestation_weeks,
hashmonth
FROM
babyweight.babyweight_data
Out[7]:
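To sanity-check the augmentation, a quick aggregate shows the row counts per plurality class in the new table. This is a hedged extra step, and to_dataframe() assumes pandas is installed.
In [ ]:
# Optional check: row counts per plurality class after augmentation.
from google.cloud import bigquery

client = bigquery.Client()
counts = client.query("""
    SELECT plurality, COUNT(*) AS num_rows
    FROM babyweight.babyweight_augmented_data
    GROUP BY plurality
    ORDER BY num_rows DESC
""").result().to_dataframe()
print(counts)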
In [8]:
%%bigquery
CREATE OR REPLACE TABLE
babyweight.babyweight_data_train AS
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks
FROM
babyweight.babyweight_augmented_data
WHERE
ABS(MOD(hashmonth, 4)) < 3
Out[8]:
In [9]:
%%bigquery
CREATE OR REPLACE TABLE
babyweight.babyweight_data_eval AS
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks
FROM
babyweight.babyweight_augmented_data
WHERE
ABS(MOD(hashmonth, 4)) = 3
Out[9]:
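Because ABS(MOD(hashmonth, 4)) < 3 selects three of the four hash buckets, roughly 75% of rows should land in the training table and 25% in the evaluation table. Here is an optional sketch, not part of the original lab, to compare the table sizes.
In [ ]:
# Optional check: compare row counts of the train and eval tables.
from google.cloud import bigquery

client = bigquery.Client()
for split in ["train", "eval"]:
    table_ref = client.dataset("babyweight").table("babyweight_data_{}".format(split))
    table = client.get_table(table_ref)
    print("{}: {} rows".format(split, table.num_rows))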
In [10]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_data_train
LIMIT 0
Out[10]:
In [11]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_data_eval
LIMIT 0
Out[11]:
Use the BigQuery Python API to export our train and eval tables to Google Cloud Storage in CSV format, to be used later for TensorFlow/Keras training. We'll use the babyweight dataset we created above and repeat the export for both the training and evaluation tables.
In [14]:
# Construct a BigQuery client object.
client = bigquery.Client()

dataset_name = "babyweight"

# Create dataset reference object
dataset_ref = client.dataset(
    dataset_id=dataset_name, project=client.project)

# Export both train and eval tables
for step in ["train", "eval"]:
    destination_uri = os.path.join(
        "gs://", BUCKET, dataset_name, "data", "{}*.csv".format(step))
    table_name = "babyweight_data_{}".format(step)
    table_ref = dataset_ref.table(table_name)
    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        # Location must match that of the source table.
        location="US",
    )  # API request
    extract_job.result()  # Waits for job to complete.

    print("Exported {}:{}.{} to {}".format(
        client.project, dataset_name, table_name, destination_uri))
In [15]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/data/*.csv
In [ ]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/data/*000000000000.csv
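If you would like to peek at the exported files from Python, pandas can read the first shard directly. This is a hedged sketch: it assumes pandas and gcsfs are installed so gs:// paths can be read, and that the shard name matches the wildcard listed above.
In [ ]:
# Optional peek at the first exported training shard (illustration only).
import os
import pandas as pd

sample_uri = "gs://{}/babyweight/data/train000000000000.csv".format(os.environ["BUCKET"])
print(pd.read_csv(sample_uri, nrows=5))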
To submit to the Cloud we use gcloud ai-platform jobs submit training [jobname] and simply specify some additional parameters for AI Platform Training Service. Below the -- \ we add in the arguments for our task.py file (a hypothetical sketch of that argument parsing appears after the training cell below).
In [ ]:
%%bash
OUTDIR=gs://${BUCKET}/babyweight/trained_model
JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)

gcloud ai-platform jobs submit training ${JOBID} \
    --region=${REGION} \
    --module-name=trainer.task \
    --package-path=$(pwd)/babyweight/trainer \
    --job-dir=${OUTDIR} \
    --staging-bucket=gs://${BUCKET} \
    --master-machine-type=n1-standard-8 \
    --scale-tier=CUSTOM \
    --runtime-version=${TFVERSION} \
    --python-version=${PYTHONVERSION} \
    -- \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --num_epochs=10 \
    --train_examples=10000 \
    --eval_steps=100 \
    --batch_size=32 \
    --nembeds=8
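For orientation, here is a hypothetical sketch of how trainer/task.py is assumed to consume the flags passed after the bare -- separator; the actual file in the lab repository may differ. parse_known_args also tolerates the --job-dir flag that AI Platform injects.
In [ ]:
# Hypothetical sketch only; not the lab's actual trainer/task.py.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--train_data_path", required=True)
parser.add_argument("--eval_data_path", required=True)
parser.add_argument("--output_dir", required=True)
parser.add_argument("--num_epochs", type=int, default=10)
parser.add_argument("--train_examples", type=int, default=10000)
parser.add_argument("--eval_steps", type=int, default=100)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--nembeds", type=int, default=8)
args, _ = parser.parse_known_args()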
The training job should complete within 15 to 20 minutes. You do not need to wait for this training job to finish before moving forward in the notebook, but you will need a trained model before running the deployment cells below.
Let's check the directory structure of the outputs of our trained model in the folder we exported to. We'll want to deploy the saved_model.pb within the timestamped directory as well as the variable values in the variables folder. Therefore, we need the path of the timestamped directory so that everything within it can be found by Cloud AI Platform's model deployment service.
In [10]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/trained_model
In [11]:
%%bash
MODEL_LOCATION=$(gsutil ls -ld -- gs://${BUCKET}/babyweight/trained_model/2* \
    | tail -1)
gsutil ls ${MODEL_LOCATION}
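Optionally, you can confirm that the timestamped export loads as a SavedModel before deploying it. This is a sketch under the assumption that TensorFlow 2.1 is available in the notebook and can read gs:// paths; replace the placeholder with the directory printed above.
In [ ]:
# Optional check: load the exported SavedModel and list its signatures.
# <TIMESTAMP> is a placeholder for the timestamped directory printed above.
import os
import tensorflow as tf

model_dir = "gs://{}/babyweight/trained_model/<TIMESTAMP>".format(os.environ["BUCKET"])
loaded = tf.saved_model.load(model_dir)
print(list(loaded.signatures.keys()))  # typically ['serving_default']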
In [12]:
%%bash
MODEL_NAME="babyweight"
MODEL_VERSION="ml_on_gcp"
MODEL_LOCATION=$(gsutil ls -ld -- gs://${BUCKET}/babyweight/trained_model/2* \
    | tail -1 | tr -d '[:space:]')
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION"
# gcloud ai-platform versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
# gcloud ai-platform models delete ${MODEL_NAME}
gcloud ai-platform models create ${MODEL_NAME} --regions ${REGION}
gcloud ai-platform versions create ${MODEL_VERSION} \
    --model=${MODEL_NAME} \
    --origin=${MODEL_LOCATION} \
    --runtime-version=2.1 \
    --python-version=3.7
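Once the version is created, you can send an online prediction request from Python. This is a hedged example: the instance keys are assumed to match the model's serving signature, i.e. the four training features used above.
In [ ]:
# Example online prediction request to the deployed version (field names assumed).
from googleapiclient import discovery

service = discovery.build("ml", "v1")
name = "projects/{}/models/{}/versions/{}".format(PROJECT, "babyweight", "ml_on_gcp")
instances = [{
    "is_male": "True",
    "mother_age": 26.0,
    "plurality": "Single(1)",
    "gestation_weeks": 39
}]
response = service.projects().predict(
    name=name, body={"instances": instances}).execute()
print(response)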
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.