Learning Objectives
Pandas is fine for experimenting, but to operationalize your workflow it is better to do the preprocessing in Apache Beam. Beam also supports streaming, so the same approach works if you need to preprocess data in flight.
Execute the following cells to install the necessary libraries if they have not been installed already.
In [ ]:
# Ensure that Apache Beam is installed.
!pip freeze | grep apache-beam || sudo pip install apache-beam[gcp]==2.12.0
In [ ]:
import tensorflow as tf
import apache_beam as beam
import shutil
import os
print(tf.__version__)
Next, set the environment variables related to your GCP Project.
In [ ]:
PROJECT = "cloud-training-demos" # Replace with your PROJECT
BUCKET = "cloud-training-bucket" # Replace with your BUCKET
REGION = "us-central1" # Choose an available region for Cloud MLE
TFVERSION = "1.14" # TF version for CMLE to use
In [ ]:
import os
os.environ["BUCKET"] = BUCKET
os.environ["PROJECT"] = PROJECT
os.environ["REGION"] = REGION
In [ ]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi
The data is natality data (records of births in the US). My goal is to predict the baby's weight given a number of factors about the pregnancy and the baby's mother. Later, we will want to split the data into training and eval datasets; the hash of the year-month will be used for that.
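To see why hashing the year-month gives a repeatable split, here is a minimal Python sketch. It uses hashlib.md5 as a local stand-in for BigQuery's FARM_FINGERPRINT (an assumption made only so the example runs without BigQuery; the two functions do not produce the same values). The helper name assign_split is hypothetical, and the thresholds mirror the MOD(hashmonth, 100) conditions used later in this notebook.

```python
import hashlib

def assign_split(year, month):
    """Map a (year, month) pair to a split via a stable hash bucket in [0, 100)."""
    # Mirrors CONCAT(CAST(year AS STRING), CAST(month AS STRING)) from the query.
    key = "{}{}".format(year, month)
    # md5 is a stand-in for FARM_FINGERPRINT; bucket values differ from BigQuery's.
    bucket = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    elif bucket < 90:
        return "eval"
    return "test"

# Every record from the same year-month lands in the same split, on every run.
same_split = assign_split(2003, 7) == assign_split(2003, 7)
print(same_split)
```

Because the split depends only on the year-month, rows from one month never end up in both training and eval, and rerunning the query reproduces the same partition.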
In [ ]:
# Create SQL query using natality data after the year 2000
query_string = """
SELECT
weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks,
FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
publicdata.samples.natality
WHERE
year > 2000
"""
Use the query_string we defined above to call BigQuery and create a local Pandas dataframe. Look at the documentation for calling BigQuery within a Jupyter notebook if you need a reminder of its usage.
Hint: it might help to add a LIMIT to the query string to control the size of the resulting dataframe.
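As a small illustration of the hint, the sketch below appends a LIMIT clause to a query string before it is sent to BigQuery. The helper name with_limit and the shortened sample query are hypothetical and used only for this example; the BigQuery call itself is left to the exercise.

```python
def with_limit(query, max_rows):
    """Return the query with a LIMIT clause appended."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # Strip trailing whitespace so the clause starts on its own line.
    return "{}\nLIMIT {}".format(query.rstrip(), int(max_rows))

# Shortened sample query, standing in for query_string.
sample_query = """
SELECT weight_pounds
FROM publicdata.samples.natality
WHERE year > 2000
"""
limited_query = with_limit(sample_query, 100)
print(limited_query)
```

Keeping the limit in a helper makes it easy to drop once you are ready to pull the full result.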
In [ ]:
# Call BigQuery and examine in dataframe
from google.cloud import bigquery
bq = # TODO: Your code goes here
df = # TODO: Your code goes here
df.head()
Let's use Cloud Dataflow to read in the BigQuery data, do some preprocessing, and write it out as CSV files.
Instead of using Beam/Dataflow, I had three other options:
However, in this case, I want to do some preprocessing, modifying data so that we can simulate what is known if no ultrasound has been performed. If I didn't need preprocessing, I could have used the web console. Also, I prefer to script it out rather than run queries on the user interface, so I am using Cloud Dataflow for the preprocessing.
The preprocess function below includes an argument in_test_mode. When this is set to True, running preprocess initiates a local Beam job. This is helpful for quickly debugging your pipeline and ensuring it works before submitting a job to the Cloud. Setting in_test_mode to False launches the job on Cloud Dataflow. Go to the Dataflow section of the GCP web console to monitor the running job. It took about 20 minutes for me.
If you wish to continue without doing this step, you can copy my preprocessed output:
gsutil -m cp -r gs://cloud-training-demos/babyweight/preproc gs://YOUR_BUCKET/
The cell block below contains a collection of TODOs that will complete the pipeline for processing the baby weight dataset with Apache Beam and Cloud Dataflow.
In the first block of TODOs we use the original dataset to create synthetic data where we assume no ultrasound has been performed. Look back to the 2_sample.ipynb notebook to remind yourself how this was done.
Note, these operations are done on the row level as that is how the data will be processed in the pipeline via the map function.
The next block of TODOs comes at the bottom of the cell, where we actually create the preprocessing pipeline. There are three TODOs for you to complete:
1. Read in the data from BigQuery with the selquery created before, using the beam.io.Read functionality.
2. Use beam.FlatMap to apply the to_csv function you modified in the previous TODOs.
3. Write the resulting CSV lines to OUTPUT_DIR using the beam.io.Write functionality.
Look at the documentation for Beam to remind yourself of the correct usage of these operations.
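The reason FlatMap (rather than Map) fits to_csv is that the function yields more than one output per input row. The pure-Python sketch below mimics that flattening with itertools.chain so it runs without Beam installed. The function expand_row, the two-column rows, and the "MASKED" placeholder are all hypothetical and are not the lab's answer.

```python
import copy
from itertools import chain

def expand_row(rowdict):
    """Yield two CSV lines per input row: one masked copy and one unchanged copy."""
    masked = copy.deepcopy(rowdict)
    masked["is_male"] = "MASKED"  # hypothetical placeholder value
    for result in (masked, rowdict):
        yield ",".join(str(result[k]) for k in ("weight_pounds", "is_male"))

rows = [
    {"weight_pounds": 7.5, "is_male": True},
    {"weight_pounds": 6.1, "is_male": False},
]

# A Map-style step would keep one generator per row; a FlatMap-style step
# merges the items of every generator into a single flat collection.
csv_lines = list(chain.from_iterable(expand_row(r) for r in rows))
print(csv_lines)
```

Two input rows produce four output lines, which is the same one-to-many behavior the pipeline relies on when it applies to_csv.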
In [ ]:
import apache_beam as beam
import datetime, os

def to_csv(rowdict):
    # Pull columns from BQ and create a line
    import hashlib
    import copy
    CSV_COLUMNS = "weight_pounds,is_male,mother_age,plurality,gestation_weeks".split(',')

    # Create synthetic data where we assume that no ultrasound has been performed
    # and so we don't know the sex of the baby. Let's assume that we can tell the
    # difference between single and multiple, but that determining the exact number
    # is error-prone in the absence of an ultrasound.
    no_ultrasound = copy.deepcopy(rowdict)
    w_ultrasound = copy.deepcopy(rowdict)

    no_ultrasound["is_male"] = # TODO: Your code goes here
    if rowdict["plurality"] > 1:
        no_ultrasound["plurality"] = # TODO: Your code goes here
    else:
        no_ultrasound["plurality"] = # TODO: Your code goes here

    # Change the plurality column to strings
    w_ultrasound["plurality"] = ["Single(1)", "Twins(2)", "Triplets(3)", "Quadruplets(4)", "Quintuplets(5)"][rowdict["plurality"] - 1]

    # Write out two rows for each input row, one with ultrasound and one without
    for result in [no_ultrasound, w_ultrasound]:
        data = ','.join([str(result[k]) if k in result else "None" for k in CSV_COLUMNS])
        yield str("{}".format(data))

def preprocess(in_test_mode):
    import shutil, os, subprocess
    job_name = "preprocess-babyweight-features" + "-" + datetime.datetime.now().strftime("%y%m%d-%H%M%S")

    if in_test_mode:
        print("Launching local job ... hang on")
        OUTPUT_DIR = "./preproc"
        shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
        os.makedirs(OUTPUT_DIR)
    else:
        print("Launching Dataflow job {} ... hang on".format(job_name))
        OUTPUT_DIR = "gs://{0}/babyweight/preproc/".format(BUCKET)
        try:
            subprocess.check_call("gsutil -m rm -r {}".format(OUTPUT_DIR).split())
        except:
            pass

    options = {
        "staging_location": os.path.join(OUTPUT_DIR, "tmp", "staging"),
        "temp_location": os.path.join(OUTPUT_DIR, "tmp"),
        "job_name": job_name,
        "project": PROJECT,
        "teardown_policy": "TEARDOWN_ALWAYS",
        "no_save_main_session": True
    }
    opts = beam.pipeline.PipelineOptions(flags = [], **options)
    if in_test_mode:
        RUNNER = "DirectRunner"
    else:
        RUNNER = "DataflowRunner"
    p = beam.Pipeline(RUNNER, options = opts)
    query = """
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks,
    FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
    publicdata.samples.natality
WHERE
    year > 2000
    AND weight_pounds > 0
    AND mother_age > 0
    AND plurality > 0
    AND gestation_weeks > 0
    AND month > 0
"""
    if in_test_mode:
        query = query + " LIMIT 100"

    for step in ["train", "eval"]:
        if step == "train":
            selquery = "SELECT * FROM ({}) WHERE ABS(MOD(hashmonth, 100)) < 80".format(query)
        elif step == "eval":
            selquery = "SELECT * FROM ({}) WHERE ABS(MOD(hashmonth, 100)) >= 80 AND ABS(MOD(hashmonth, 100)) < 90".format(query)
        else:
            selquery = "SELECT * FROM ({}) WHERE ABS(MOD(hashmonth, 100)) >= 90".format(query)
        (p
         | "{}_read".format(step) >> # TODO: Your code goes here
         | "{}_csv".format(step) >> # TODO: Your code goes here
         | "{}_out".format(step) >> # TODO: Your code goes here
        )
    job = p.run()
    if in_test_mode:
        job.wait_until_finish()
        print("Done!")

preprocess(in_test_mode = True)
For a Cloud preprocessing job (i.e. setting in_test_mode to False), the above step will take 20+ minutes. Go to the GCP web console, navigate to the Dataflow section, and wait for the job to finish before you run the following step.
We can have a look at the elements in our bucket to see the results of our pipeline above.
In [ ]:
!gsutil ls gs://$BUCKET/babyweight/preproc/*-00000*
Create a SQL query for BigQuery that combines the ultrasound and no-ultrasound datasets with UNION ALL.
The cell block below contains a collection of TODOs that will complete the query for processing the baby weight dataset with BigQuery.
In the block of TODOs we use the original dataset to create synthetic data where we assume no ultrasound has been performed. Look back to the 2_sample.ipynb notebook to remind yourself how this was done.
In [4]:
query = """
WITH CTE_Raw_Data AS (
SELECT
weight_pounds,
CAST(is_male AS STRING) AS is_male,
mother_age,
plurality,
gestation_weeks,
FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
publicdata.samples.natality
WHERE
year > 2000
AND weight_pounds > 0
AND mother_age > 0
AND plurality > 0
AND gestation_weeks > 0
AND month > 0)
-- Ultrasound
SELECT
weight_pounds,
is_male,
mother_age,
CASE
# TODO Convert plurality from integers to strings
END AS plurality,
gestation_weeks,
hashmonth
FROM
CTE_Raw_Data
UNION ALL
-- No ultrasound
SELECT
weight_pounds,
# TODO Mask is_male
mother_age,
CASE
# TODO Convert plurality from integers to strings and mask plurality > 1
END AS plurality,
gestation_weeks,
hashmonth
FROM
CTE_Raw_Data
"""
Create temporary BigQuery dataset
In [ ]:
from google.api_core.exceptions import Conflict
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# Set dataset_id to the ID of the dataset to create.
dataset_name = "temp_babyweight_dataset"
dataset_id = "{}.{}".format(client.project, dataset_name)

# Construct a full Dataset object to send to the API.
dataset = bigquery.Dataset.from_string(dataset_id)

# Specify the geographic location where the dataset should reside.
dataset.location = "US"

# Send the dataset to the API for creation.
# Raises google.api_core.exceptions.Conflict if the Dataset already
# exists within the project. Catch only that error so that other
# failures (for example, permission problems) are not reported as
# "already exists".
try:
    dataset = client.create_dataset(dataset)  # API request
    print("Created dataset {}.{}".format(client.project, dataset.dataset_id))
except Conflict:
    print("Dataset {}.{} already exists".format(client.project, dataset.dataset_id))
Execute query and write to BigQuery table.
In [ ]:
job_config = bigquery.QueryJobConfig()

for step in ["train", "eval"]:
    if step == "train":
        selquery = "SELECT * FROM ({}) WHERE ABS(MOD(hashmonth, 100)) < 80".format(query)
    elif step == "eval":
        selquery = "SELECT * FROM ({}) WHERE ABS(MOD(hashmonth, 100)) >= 80 AND ABS(MOD(hashmonth, 100)) < 90".format(query)
    else:
        selquery = "SELECT * FROM ({}) WHERE ABS(MOD(hashmonth, 100)) >= 90".format(query)

    # Set the destination table
    table_name = "babyweight_{}".format(step)
    table_ref = client.dataset(dataset_name).table(table_name)
    job_config.destination = table_ref
    job_config.write_disposition = "WRITE_TRUNCATE"

    # Start the query, passing in the extra configuration.
    query_job = client.query(
        query=selquery,
        # Location must match that of the dataset(s) referenced in the query
        # and of the destination table.
        location="US",
        job_config=job_config)  # API request - starts the query

    query_job.result()  # Waits for the query to finish
    print("Query results loaded to table {}".format(table_ref.path))
Export BigQuery table to CSV in GCS.
In [ ]:
dataset_ref = client.dataset(dataset_id=dataset_name, project=PROJECT)

for step in ["train", "eval"]:
    destination_uri = "gs://{}/{}".format(BUCKET, "babyweight/bq_data/{}*.csv".format(step))
    table_name = "babyweight_{}".format(step)
    table_ref = dataset_ref.table(table_name)
    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        # Location must match that of the source table.
        location="US",
    )  # API request
    extract_job.result()  # Waits for job to complete.

    print("Exported {}:{}.{} to {}".format(PROJECT, dataset_name, table_name, destination_uri))
In [ ]:
!gsutil ls gs://$BUCKET/babyweight/bq_data/*000000000000*
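The * in destination_uri lets BigQuery shard the export across several files. As a rough sketch, assuming the wildcard is replaced by a zero-padded 12-digit shard number (which is what the *000000000000* pattern in the listing above matches), the name of a given shard can be derived as follows. The helper shard_uri and the bucket name YOUR_BUCKET are hypothetical.

```python
def shard_uri(bucket, step, shard_index):
    """Build the expected GCS path of one export shard for a given step."""
    pattern = "gs://{}/babyweight/bq_data/{}*.csv".format(bucket, step)
    # Assumption: the wildcard becomes a 12-digit, zero-padded shard number.
    return pattern.replace("*", "{:012d}".format(shard_index))

first_train_shard = shard_uri("YOUR_BUCKET", "train", 0)
print(first_train_shard)
```

Downstream code that reads the CSV files can keep using the wildcard pattern itself, so it does not need to know how many shards were written.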
Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License