LAB 1a: Exploring the natality dataset

Learning Objectives

  1. Use BigQuery to explore natality dataset
  2. Use Cloud AI Platform Notebooks to plot data explorations

Introduction

In this notebook, we will explore the natality dataset before we begin model development and training to predict the weight of a baby before it is born. We will use BigQuery to explore the data and use Cloud AI Platform Notebooks to plot data explorations.

Load necessary libraries

Check that the Google BigQuery library is installed and if not, install it.


In [ ]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst

In [1]:
%%bash
sudo pip freeze | grep google-cloud-bigquery==1.6.1 || \
sudo pip install google-cloud-bigquery==1.6.1


google-cloud-bigquery==1.6.1

In [1]:
from google.cloud import bigquery

The source dataset

Our dataset is hosted in BigQuery. The CDC's natality data has details on US births from 1969 to 2008 and is a publicly available dataset, meaning anyone with a GCP account has access. You can browse the table publicdata.samples.natality in the BigQuery console.

The natality dataset is relatively large at almost 138 million rows and 31 columns, but simple to understand. weight_pounds is the target, the continuous value we’ll train a model to predict.

Explore data

The data is natality data (record of births in the US). The goal is to predict the baby's weight given a number of factors about the pregnancy and the baby's mother. Later, we will want to split the data into training and eval datasets. The hash of the year-month will be used for that -- this way, twins born on the same day won't end up in different cuts of the data. We'll first create a SQL query using the natality data after the year 2000.


In [2]:
query = """
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks,
    FARM_FINGERPRINT(
        CONCAT(
            CAST(year AS STRING),
            CAST(month AS STRING)
        )
    ) AS hashmonth
FROM
    publicdata.samples.natality
WHERE
    year > 2000
"""
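The FARM_FINGERPRINT expression above turns each year-month into a stable integer, so every record from the same month always hashes to the same value. The same idea can be sketched in plain Python, using hashlib's MD5 as a stand-in for FARM_FINGERPRINT (a different hash function, but the principle is identical): bucketing records by a deterministic hash of the year-month means we can later route, say, 80% of months to training and 20% to evaluation, and twins born in the same month can never straddle the two splits.

```python
import hashlib


def assign_split(year, month, train_pct=80):
    """Deterministically assigns a year-month to "train" or "eval".

    Mimics the FARM_FINGERPRINT(CONCAT(year, month)) idea from the
    query using MD5 as a stand-in hash: every record sharing a
    year-month always lands in the same split.
    """
    key = "{}{}".format(year, month)
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return "train" if digest % 100 < train_pct else "eval"


# Repeated calls with the same year-month always agree, so the
# split is reproducible across runs.
print(assign_split(2005, 7))
print(assign_split(2005, 7) == assign_split(2005, 7))  # True
```

This is only an illustration of the splitting logic; in the lab itself the hashing happens inside BigQuery via FARM_FINGERPRINT, not in Python.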

Let's create a BigQuery client that we can use throughout the notebook.


In [3]:
bq = bigquery.Client()

Let's now examine the result of a BigQuery call in a Pandas DataFrame using our newly created client.


In [4]:
df = bq.query(query + " LIMIT 100").to_dataframe()
df.head()


Out[4]:
weight_pounds is_male mother_age plurality gestation_weeks hashmonth
0 8.818490 False 17 1 42 1403073183891835564
1 8.141671 False 29 1 38 8904940584331855459
2 5.948072 True 38 1 38 7108882242435606404
3 8.838332 True 27 1 39 3408502330831153141
4 9.259415 True 28 1 38 1088037545023002395

First, let's get the set of all valid column names in the natality dataset. We can do this by accessing the INFORMATION_SCHEMA for the table from the dataset.


In [5]:
# Query to get all column names within table schema
sql = """
SELECT
    column_name
FROM
    publicdata.samples.INFORMATION_SCHEMA.COLUMNS
WHERE
    table_name = "natality"
"""

# Send query through BigQuery client and store output to a dataframe
valid_columns_df = bq.query(sql).to_dataframe()

# Extract the column_name column of the dataframe into a list,
# which serves as our set of valid column names
valid_columns_set = valid_columns_df["column_name"].tolist()

We can print our valid columns set to see all of the possible columns we have available in the dataset. Of course, you could also find this information by going to the Schema tab when selecting the table in the BigQuery UI.


In [6]:
print(valid_columns_set)


['source_year', 'year', 'month', 'day', 'wday', 'state', 'is_male', 'child_race', 'weight_pounds', 'plurality', 'apgar_1min', 'apgar_5min', 'mother_residence_state', 'mother_race', 'mother_age', 'gestation_weeks', 'lmp', 'mother_married', 'mother_birth_state', 'cigarette_use', 'cigarettes_per_day', 'alcohol_use', 'drinks_per_week', 'weight_gain_pounds', 'born_alive_alive', 'born_alive_dead', 'born_dead', 'ever_born', 'father_race', 'father_age', 'record_weight']

Let's write a query to find the unique values for each of the columns and the count of those values. This is important to ensure that we have enough examples of each data value, and to verify our hunch that the parameter has predictive value.


In [7]:
def get_distinct_values(valid_columns_set, column_name):
    """Gets distinct value statistics of BigQuery data column.

    Args:
        valid_columns_set: set, the set of all possible valid column names in
            table.
        column_name: str, name of column in BigQuery.
    Returns:
        Dataframe of unique values, their counts, and averages.
    """
    assert column_name in valid_columns_set, (
        "{column_name} is not a valid column_name".format(
            column_name=column_name))

    sql = """
    SELECT
        {column_name},
        COUNT(1) AS num_babies,
        AVG(weight_pounds) AS avg_wt
    FROM
        publicdata.samples.natality
    WHERE
        year > 2000
    GROUP BY
        {column_name}
    """.format(column_name=column_name)

    return bq.query(sql).to_dataframe()

In [8]:
def plot_distinct_values(valid_columns_set, column_name, logy=False):
    """Plots distinct value statistics of BigQuery data column.

    Args:
        valid_columns_set: set, the set of all possible valid column names in
            table.
        column_name: str, name of column in BigQuery.
        logy: bool, if plotting counts in log scale or not.
    """
    df = get_distinct_values(valid_columns_set, column_name)
    df = df.sort_values(column_name)
    df.plot(
        x=column_name, y="num_babies", logy=logy, kind="bar", figsize=(12, 5))
    df.plot(x=column_name, y="avg_wt", kind="bar", figsize=(12, 5))

Make a bar plot to see is_male with avg_wt linearly scaled and num_babies linearly scaled.


In [9]:
plot_distinct_values(valid_columns_set, column_name="is_male", logy=False)


Make a bar plot to see mother_age with avg_wt linearly scaled and num_babies linearly scaled.


In [10]:
plot_distinct_values(valid_columns_set, column_name="mother_age", logy=False)


Make a bar plot to see plurality with avg_wt linearly scaled and num_babies logarithmically scaled.


In [11]:
plot_distinct_values(valid_columns_set, column_name="plurality", logy=True)


Make a bar plot to see gestation_weeks with avg_wt linearly scaled and num_babies logarithmically scaled.


In [12]:
plot_distinct_values(
    valid_columns_set, column_name="gestation_weeks", logy=True)


All these factors seem to play a part in the baby's weight. Male babies are heavier on average than female babies. Teenage and older moms tend to have lower-weight babies. Twins, triplets, etc. are lower weight than single births. Preemies weigh in lower, as do babies born to single moms. In addition, it is important to check whether you have enough data (number of babies) for each input value. Otherwise, the model's predictions for input values that don't have enough data may not be reliable.
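That sufficiency check can be sketched as a simple filter on the distinct-values dataframe. The snippet below uses a small hypothetical dataframe in place of the real get_distinct_values() output, and an arbitrary illustrative threshold of 1,000 babies, to flag input values the model may not learn reliably:

```python
import pandas as pd

# Hypothetical stand-in for the output of get_distinct_values():
# counts and average weights per distinct gestation_weeks value.
distinct_df = pd.DataFrame({
    "gestation_weeks": [22, 30, 38, 39, 40],
    "num_babies": [40, 5000, 900000, 1200000, 800000],
    "avg_wt": [1.2, 3.5, 7.3, 7.5, 7.6],
})

MIN_EXAMPLES = 1000  # assumed threshold, chosen for illustration only

# Keep only the rows with too few examples to trust the model on.
rare = distinct_df[distinct_df["num_babies"] < MIN_EXAMPLES]
print(rare["gestation_weeks"].tolist())  # prints [22]
```

In practice you would run this against the real per-column dataframes and decide, per feature, whether to drop, bucket, or collect more data for the rare values.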

In the next notebooks, we will develop a machine learning model to combine all of these factors to come up with a prediction of a baby's weight.

Lab Summary:

In this lab, we used BigQuery to explore the data and used Cloud AI Platform Notebooks to plot data explorations.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.