Exploratory Data Analysis with R and BigQuery

Overview

This notebook illustrates how to perform Exploratory Data Analysis (EDA) using R on data extracted from BigQuery. After you analyze and process the data, the transformed data is stored in Cloud Storage for further ML tasks.

R is one of the most widely used programming languages for statistical modeling. It has a large and active community of data scientists and machine learning (ML) professionals. With more than 15,000 packages in the open-source CRAN repository, R offers tools for a wide range of statistical analysis, ML, and visualization tasks.

Dataset

The dataset used in this tutorial is the BigQuery natality dataset. This public dataset includes information about more than 137 million births registered in the United States from 1969 to 2008. The dataset is available in BigQuery as the publicdata.samples.natality table.

In this notebook, we focus on exploratory data analysis and visualization using R and BigQuery, with an eye toward a potential machine learning goal of predicting a baby's weight given a number of factors about the pregnancy and about the baby's mother.

Objective

The goals of this tutorial are to:

  1. Create an AI Platform Notebooks instance that has R support.
  2. Query and analyze data from BigQuery using the bigrquery R library.
  3. Prepare and store data for ML in Cloud Storage.

Costs

This tutorial uses the following billable components of Google Cloud:

  1. BigQuery
  2. AI Platform Notebooks instance
  3. Cloud Storage

Use the Pricing Calculator to generate a cost estimate based on your projected usage.


In [ ]:
# Check the version of R running in this notebook
version

0. Setup

Import the required libraries, all of which come pre-installed in an AI Platform Notebooks instance with R support.


In [ ]:
# Import R libraries
library(bigrquery) # used for querying BigQuery
library(ggplot2) # used for visualization
library(dplyr) # used for data wrangling

Authenticate to BigQuery using out-of-band (OOB) authentication.


In [ ]:
bq_auth(use_oob = TRUE)

Set a variable to the ID of the Google Cloud project that you want to use for this tutorial.


In [ ]:
# Set the project ID
PROJECT_ID <- "gcp-data-science-demo"

Set a variable to the name of the Cloud Storage bucket that you want to use.


In [ ]:
# Set your GCS bucket
BUCKET_NAME <- "r-on-gcp"

1. Querying Data from BigQuery

1.1. Prepare the BigQuery query

The query selects the columns of interest and adds a hash-based key column (computed with FARM_FINGERPRINT from the year, month, and weight) that is used later to split the data into training and evaluation sets.


In [ ]:
sql_query <- "
    SELECT
      ROUND(weight_pounds, 2) AS weight_pounds,
      is_male,
      mother_age,
      plurality,
      gestation_weeks,
      cigarette_use,
      alcohol_use,
      CAST(ABS(FARM_FINGERPRINT(CONCAT(
        CAST(year AS STRING), CAST(month AS STRING),
        CAST(weight_pounds AS STRING)))
        ) AS STRING) AS key
    FROM
        publicdata.samples.natality
    WHERE 
      year > 2000
      AND weight_pounds > 0
      AND mother_age > 0
      AND plurality > 0
      AND gestation_weeks > 0
      AND month > 0
    LIMIT %s
"

1.2. Execute the query

The data will be retrieved from BigQuery, and the results will be stored in an in-memory data frame.


In [ ]:
sample_size <- 10000
sql_query <- sprintf(sql_query, sample_size)

natality_data <- bq_table_download(
    bq_project_query(
        PROJECT_ID, 
        query = sql_query
    )
)

1.3. View the query results


In [ ]:
# View the query result
head(natality_data)

In [ ]:
# Show the number of rows and the data type of each column
str(natality_data)

In [ ]:
# View the results summary
summary(natality_data)
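
Before visualizing the data, it can be useful to check how many values are missing in the retrieved sample, since columns such as cigarette_use and alcohol_use often contain NULLs. This is an optional check, not part of the original walkthrough, and assumes the natality_data data frame downloaded above.


In [ ]:
# Count missing (NA) values in each column of the sample
sapply(natality_data, function(col) sum(is.na(col)))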

2. Visualizing retrieved data


In [ ]:
# Display the distribution of baby weights using a histogram
ggplot(
    data = natality_data, 
    aes(x = weight_pounds)
) + geom_histogram(bins = 200)

In [ ]:
# Display the relationship between gestation weeks and baby weights 
ggplot(
    data = natality_data, 
    aes(x = gestation_weeks, y = weight_pounds)
) + geom_point() + geom_smooth(method = "lm")
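
As an optional extra (not part of the original walkthrough), a boxplot can show how baby weight varies with a categorical factor such as alcohol_use; this sketch assumes the natality_data data frame retrieved above.


In [ ]:
# Compare weight distributions of babies by the mother's alcohol use
ggplot(
    data = natality_data, 
    aes(x = alcohol_use, y = weight_pounds)
) + geom_boxplot()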

Performing the processing in BigQuery

Instead of downloading raw rows and aggregating them in R, you can push the aggregation down to BigQuery and retrieve only the summarized results. Create a function that returns the number of records and the average baby weight for each value of a chosen column.


In [ ]:
get_distinct_values <- function(column_name) {
    query <- paste0(
        'SELECT ', column_name, ', 
            COUNT(1) AS num_babies,
            AVG(weight_pounds) AS avg_wt
        FROM publicdata.samples.natality
        WHERE year > 2000
        GROUP BY ', column_name)
    
    bq_table_download(
        bq_project_query(
            PROJECT_ID, 
            query = query
        )
    )
}

In [ ]:
df <- get_distinct_values('is_male')

ggplot(data = df, aes(x = is_male, y = num_babies)
) + geom_col()

ggplot(data = df, aes(x = is_male, y = avg_wt)
) + geom_col()

In [ ]:
df <- get_distinct_values('mother_age')

ggplot(data = df, aes(x = mother_age, y = num_babies)
) + geom_line()

ggplot(data = df, aes(x = mother_age, y = avg_wt)
) + geom_line()

In [ ]:
df <- get_distinct_values('plurality')

ggplot(data = df, aes(x = plurality, y = num_babies)
) + geom_col() + scale_y_log10()

ggplot(data = df, aes(x = plurality, y = avg_wt)
) + geom_col()

In [ ]:
df <- get_distinct_values('gestation_weeks')

ggplot(data = df, aes(x = gestation_weeks, y = num_babies)
) + geom_col() + scale_y_log10()

ggplot(data = df, aes(x = gestation_weeks, y = avg_wt)
) + geom_col()
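
The same helper can also be applied to the other boolean columns selected in the query, such as cigarette_use; this cell is an optional addition to the original walkthrough.


In [ ]:
df <- get_distinct_values('cigarette_use')

ggplot(data = df, aes(x = cigarette_use, y = num_babies)
) + geom_col() + scale_y_log10()

ggplot(data = df, aes(x = cigarette_use, y = avg_wt)
) + geom_col()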

3. Saving the data as CSV to Google Cloud Storage


In [ ]:
# Prepare training and evaluation data from BigQuery.
# The hash-based key column gives a deterministic split: roughly 75% of rows
# for training and 25% for evaluation. Note that sql_query already has the
# sample size applied from step 1.2.
train_query <- paste('SELECT * FROM (', sql_query, 
  ') WHERE MOD(CAST(key AS INT64), 100) <= 75')
eval_query <- paste('SELECT * FROM (', sql_query,
  ') WHERE MOD(CAST(key AS INT64), 100) > 75')

# Load training data to data frame
train_data <- bq_table_download(
    bq_project_query(
        PROJECT_ID, 
        query = train_query
    )
)

# Load evaluation data to data frame
eval_data <- bq_table_download(
    bq_project_query(
        PROJECT_ID, 
        query = eval_query
    )
)

In [ ]:
print(paste0("Training instances count: ", nrow(train_data)))
print(paste0("Evaluation instances count: ", nrow(eval_data)))

In [ ]:
# Write data frames to local CSV files, without headers or row names
dir.create(file.path('data'), showWarnings = FALSE)
write.table(train_data, "data/train_data.csv", 
   row.names = FALSE, col.names = FALSE, sep = ",")
write.table(eval_data, "data/eval_data.csv", 
   row.names = FALSE, col.names = FALSE, sep = ",")

In [ ]:
# Upload CSV data to Google Cloud Storage by passing gsutil commands to system

# Create the destination bucket (the command reports an error if the bucket
# already exists, but execution continues)
gcs_url <- paste0("gs://", BUCKET_NAME, "/")
command <- paste("gsutil mb", gcs_url)
system(command)

# Copy the local CSV files into the bucket's data/ directory
gcs_data_dir <- paste0("gs://", BUCKET_NAME, "/data")
command <- paste("gsutil cp data/*_data.csv", gcs_data_dir)
system(command)

# List the uploaded files to verify the copy
command <- paste("gsutil ls -l", gcs_data_dir)
system(command, intern = TRUE)
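
Alternatively, the upload can be done natively from R with the googleCloudStorageR package instead of shelling out to gsutil. The package is not used elsewhere in this notebook, so treat this as an optional sketch; it assumes the package is installed and that the bucket was already created in the previous step.


In [ ]:
# Optional alternative: upload with googleCloudStorageR instead of gsutil
library(googleCloudStorageR)
gcs_auth()  # authenticate to Cloud Storage

gcs_upload("data/train_data.csv", bucket = BUCKET_NAME, name = "data/train_data.csv")
gcs_upload("data/eval_data.csv", bucket = BUCKET_NAME, name = "data/eval_data.csv")

# List the uploaded objects to verify
gcs_list_objects(bucket = BUCKET_NAME)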

License

Authors: Daniel Sparing, Khalid Salama, Alok Pattani, Polong Lin


Disclaimer: This is not an official Google product. The sample code is provided for educational purposes.


Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.