2. Creating a sampled dataset

In this notebook, you will implement:

  1. Sampling a BigQuery dataset to create datasets for ML
  2. Preprocessing with Pandas

In [ ]:
# Check that TensorFlow 2.1 is installed (prints nothing if it is not).
!pip freeze | grep tensorflow==2.1

In [ ]:
# TODO: change these to reflect your environment
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

In [ ]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [ ]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

Create ML dataset by sampling using BigQuery

Sample the BigQuery table publicdata.samples.natality to create a smaller dataset of approximately 10,000 training and 3,000 evaluation records. Restrict your samples to data after the year 2000.
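
One possible way to approach this step is sketched below. It is not necessarily the intended solution. It hashes the year and month of each birth so that all rows from a given month land in the same split, sends roughly three quarters of the months to training and one quarter to evaluation, and then keeps a small random fraction of the rows. The column list, the helper names `build_sample_query` and `run_query`, and the default `sample_rate` are assumptions made for illustration; tune the rate until the row counts are close to the targets above.

```python
def build_sample_query(phase, sample_rate=0.0005):
    """Return a BigQuery Standard SQL query for the 'train' or 'eval' sample.

    Hypothetical helper: the selected columns and the default sample_rate
    are assumptions, not values given in this notebook.
    """
    if phase not in ('train', 'eval'):
        raise ValueError("phase must be 'train' or 'eval'")

    # Hash year+month so every birth in the same month falls in the same split.
    base = """
SELECT
  weight_pounds, is_male, mother_age, plurality, gestation_weeks,
  FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE
  year > 2000
"""
    # Hash buckets 0-2 go to training, bucket 3 goes to evaluation.
    if phase == 'train':
        split = 'ABS(MOD(hashmonth, 4)) < 3'
    else:
        split = 'ABS(MOD(hashmonth, 4)) = 3'

    # RAND() keeps only a small random fraction of the rows in the split.
    return 'SELECT * FROM ({}) WHERE {} AND RAND() < {}'.format(
        base, split, sample_rate)


def run_query(sql, project):
    """Run the query and return a Pandas dataframe (needs google-cloud-bigquery)."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery
    return bigquery.Client(project=project).query(sql).to_dataframe()
```

With these helpers, `traindf = run_query(build_sample_query('train'), PROJECT)` and the matching `'eval'` call would produce the two dataframes. Because of `RAND()`, the exact rows and counts differ from run to run, while the month-to-split assignment stays the same.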


In [ ]:
# TODO

Preprocess data using Pandas

Carry out the following preprocessing operations:

  • Add extra rows to simulate the lack of ultrasound.
  • Change the plurality column to be one of the following strings:
    ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']
    
  • Remove rows where any of the important numeric fields are missing.
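
The three operations above could be sketched as follows. Several details are assumptions rather than part of the exercise text: the list of "important numeric fields", the function name `preprocess`, and the way missing ultrasound is simulated (here, the baby's sex becomes 'Unknown' and any plurality other than a single birth collapses to 'Multiple(2+)', on the assumption that without an ultrasound neither the sex nor the exact number of babies is known).

```python
import pandas as pd

# Strings requested by the exercise, keyed by the numeric plurality value.
PLURALITY_NAMES = {
    1: 'Single(1)',
    2: 'Twins(2)',
    3: 'Triplets(3)',
    4: 'Quadruplets(4)',
    5: 'Quintuplets(5)',
}

# Assumption: these are the "important numeric fields".
NUMERIC_FIELDS = ['weight_pounds', 'mother_age', 'gestation_weeks', 'plurality']


def preprocess(df):
    """Clean the sampled rows and append simulated no-ultrasound rows."""
    # Remove rows where any important numeric field is missing.
    df = df.dropna(subset=NUMERIC_FIELDS).copy()

    # Replace the numeric plurality with its descriptive string.
    df['plurality'] = df['plurality'].astype(int).map(PLURALITY_NAMES)

    # Simulate the lack of ultrasound (assumed encoding, see the text above):
    # sex is unknown, and only single vs. multiple births can be told apart.
    no_ultrasound = df.copy()
    is_multiple = no_ultrasound['plurality'] != 'Single(1)'
    no_ultrasound.loc[is_multiple, 'plurality'] = 'Multiple(2+)'
    no_ultrasound['is_male'] = 'Unknown'

    # Original rows first, simulated rows after them.
    return pd.concat([df, no_ultrasound], ignore_index=True)
```

Applying the function to each sample, for example `traindf = preprocess(traindf)`, doubles the number of rows that survive the missing-value filter.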

In [ ]:
# TODO

Write out CSV files

In the final versions, we want to read from files rather than from Pandas dataframes, so write the dataframes out as CSV files. Reading from CSV files also makes it possible to shuffle the data as it is read. This matters for distributed training: some workers may be slower than others, and shuffling helps prevent the same data from always being assigned to the slow workers.

Modify the code in the next cell as needed (i.e., change the dataframe names to match your own variable names).


In [ ]:
traindf.to_csv('train.csv', index=False, header=False)
evaldf.to_csv('eval.csv', index=False, header=False)

In [ ]:
%%bash
wc -l *.csv
head *.csv
tail *.csv
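
To illustrate what shuffling during read means for the CSV files written above, here is a minimal sketch that uses only the Python standard library. It is not the TensorFlow input pipeline; the function name `shuffled_rows` and its parameters are hypothetical. It keeps a fixed-size buffer of rows and emits a randomly chosen buffered row each time a new row is read, so the output order differs from the file order without loading the whole file into memory.

```python
import csv
import random


def shuffled_rows(path, buffer_size=1000, seed=None):
    """Yield the rows of a headerless CSV file in a locally shuffled order.

    buffer_size must be at least 1; larger buffers give a more thorough
    shuffle at the cost of memory.
    """
    rng = random.Random(seed)
    buffer = []
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if len(buffer) < buffer_size:
                # Still filling the buffer.
                buffer.append(row)
                continue
            # Buffer is full: emit a random buffered row and
            # put the newly read row in its place.
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = row
    # End of file: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer
```

For example, `for row in shuffled_rows('train.csv', buffer_size=500): ...` visits every row exactly once per pass, in a different order each time unless `seed` is fixed.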

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

