This notebook illustrates:
In [ ]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst
In [ ]:
# Ensure the right version of Tensorflow is installed.
!pip freeze | grep tensorflow==2.1
In [ ]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'
In [ ]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
In [ ]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi
Let's sample the BigQuery data to create smaller datasets.
In [ ]:
# Create SQL query using natality data after the year 2000
from google.cloud import bigquery
query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""
Sample the BigQuery result set (above) so that you have approximately 12,000 training examples and 3,000 evaluation examples. The training and evaluation datasets must be well distributed (for example, not all the babies should be born in Jan 2005) and must not overlap (no baby may appear in both the training and the evaluation dataset).
Hint (highlight to see):
You will use ABS(MOD()) on the hashmonth to divide the dataset into non-overlapping training and evaluation datasets, and RAND() to sample these to the desired size.
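One way to read the hint, as a minimal sketch: wrap the base query in an outer `SELECT`, use `ABS(MOD(hashmonth, ...))` to assign each year-month to exactly one split, and use `RAND()` to thin the rows. The helper name `make_split_query`, the 3-to-1 bucket split, and the sampling fraction are all illustrative assumptions, not values from the notebook; the fraction in particular would need tuning against the actual row counts to land near 12,000 and 3,000 rows. The shortened `base_query` stands in for the notebook's `query` so that the block runs on its own.

```python
# Stand-in for the notebook's `query` string, shortened so this block runs alone.
base_query = """
SELECT
  weight_pounds,
  FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""


def make_split_query(base_query, split, sample_fraction, buckets=4):
    """Build a query that returns a random sample of one split.

    Hypothetical helper: `buckets` and `sample_fraction` are assumptions
    chosen for illustration and should be tuned for the real data.
    """
    last_bucket = buckets - 1
    if split == "train":
        # All hash buckets except the last one go to training.
        bucket_filter = "ABS(MOD(hashmonth, {})) < {}".format(buckets, last_bucket)
    elif split == "eval":
        # The last hash bucket is reserved for evaluation.
        bucket_filter = "ABS(MOD(hashmonth, {})) = {}".format(buckets, last_bucket)
    else:
        raise ValueError("split must be 'train' or 'eval'")
    # RAND() keeps roughly `sample_fraction` of the rows that pass the bucket filter.
    return "SELECT * FROM ({}) WHERE {} AND RAND() < {:.8f}".format(
        base_query, bucket_filter, sample_fraction)


train_query = make_split_query(base_query, "train", sample_fraction=0.0005)
eval_query = make_split_query(base_query, "eval", sample_fraction=0.0005)
```

Because every row from a given year and month has the same `hashmonth`, a month falls entirely into one split, which is what keeps the two datasets from overlapping. The resulting strings could then be run with the `bigquery` client imported above.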
Use Pandas to:
Hint (highlight to see):
Filtering:
df = df[df.weight_pounds > 0]
Lack of ultrasound:
nous = df.copy(deep=True)
nous['is_male'] = 'Unknown'
Modify plurality to be a string:
twins_etc = dict(zip([1,2,3,4,5], ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))
df['plurality'].replace(twins_etc, inplace=True)
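The three hints can be combined into a single function, sketched here on a small made-up DataFrame. The function name `preprocess`, the toy rows, the order of the steps, and the choice to concatenate the original rows with their "no ultrasound" copies are assumptions made for illustration; the notebook does not specify them. The sketch also uses `Series.map` with assignment rather than an in-place `replace`, so any plurality value missing from the mapping becomes `NaN`.

```python
import pandas as pd

# Mapping from the numeric plurality code to a string label (taken from the hint).
PLURALITY_LABELS = {
    1: 'Single(1)',
    2: 'Twins(2)',
    3: 'Triplets(3)',
    4: 'Quadruplets(4)',
    5: 'Quintuplets(5)',
}


def preprocess(df):
    """Hypothetical helper that applies the three hinted steps."""
    # Filtering: keep only rows with a positive birth weight.
    df = df[df.weight_pounds > 0].copy()
    # Modify plurality to be a string label.
    df['plurality'] = df['plurality'].map(PLURALITY_LABELS)
    # Lack of ultrasound: copy the rows and hide the baby's sex.
    nous = df.copy(deep=True)
    nous['is_male'] = 'Unknown'
    # Assumption: keep both the original and the "no ultrasound" rows.
    return pd.concat([df, nous], ignore_index=True)


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'weight_pounds': [7.5, 0.0, 6.1],
    'is_male': [True, False, False],
    'mother_age': [28, 31, 24],
    'plurality': [1, 1, 2],
    'gestation_weeks': [39, 40, 37],
})
processed = preprocess(toy)
```

In the notebook, a function like this would presumably be applied to both the training and the evaluation DataFrames before they are written out as `traindf` and `evaldf`.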
In [ ]:
traindf.to_csv('train.csv', index=False, header=False)
evaldf.to_csv('eval.csv', index=False, header=False)
In [ ]:
%%bash
wc -l *.csv
head *.csv
tail *.csv
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License