This notebook illustrates how to perform Exploratory Data Analysis (EDA) using R, on data extracted from BigQuery. After you analyze and process the data, the transformed data is stored in Cloud Storage for further ML tasks.
R is one of the most widely used programming languages for statistical modeling, with a large and active community of data scientists and machine learning (ML) professionals. With more than 15,000 packages in the open-source CRAN repository, R has tools for a wide range of statistical data analysis, ML, and visualization applications.
The dataset used in this tutorial is the BigQuery natality dataset. This public dataset includes information about more than 137 million births registered in the United States from 1969 to 2008, and is available in BigQuery as the publicdata.samples.natality table.
In this notebook, we focus on exploratory data analysis and visualization using R and BigQuery, with an eye toward a potential machine learning goal of predicting a baby's weight given a number of factors about the pregnancy and about the baby's mother.
The goal of this tutorial is to query and analyze natality data from BigQuery using the bigrquery R library, explore and visualize it with dplyr and ggplot2, and store the transformed data in Cloud Storage for further ML tasks.
This tutorial uses the following billable components of Google Cloud: BigQuery and Cloud Storage.
Use the Pricing Calculator to generate a cost estimate based on your projected usage.
Run the following cell to check which version of R is available in the notebook:
In [ ]:
version
Import the required libraries, all of which have been pre-installed in an R notebook.
In [ ]:
# Import R libraries
library(bigrquery) # used for querying BigQuery
library(ggplot2) # used for visualization
library(dplyr) # used for data wrangling
Authenticate to BigQuery using out-of-band (OOB) authentication:
In [ ]:
bq_auth(use_oob = TRUE)
Set a variable to the name of the project that you want to use for this tutorial.
In [ ]:
# Set the project ID (replace with your own Google Cloud project ID)
PROJECT_ID <- "gcp-data-science-demo"
Set a variable to the name of the Cloud Storage bucket that you want to use.
In [ ]:
# Set the name of your Cloud Storage bucket (replace with your own bucket name)
BUCKET_NAME <- "r-on-gcp"
Define a parameterized query that samples the natality table. A FARM_FINGERPRINT hash of each row is kept as a key column so that the data can later be split deterministically into training and evaluation sets.
In [ ]:
sql_query <- "
SELECT
ROUND(weight_pounds, 2) AS weight_pounds,
is_male,
mother_age,
plurality,
gestation_weeks,
cigarette_use,
alcohol_use,
CAST(ABS(FARM_FINGERPRINT(CONCAT(
CAST(year AS STRING), CAST(month AS STRING),
CAST(weight_pounds AS STRING)))
) AS STRING) AS key
FROM
publicdata.samples.natality
WHERE
year > 2000
AND weight_pounds > 0
AND mother_age > 0
AND plurality > 0
AND gestation_weeks > 0
AND month > 0
LIMIT %s
"
In [ ]:
sample_size <- 10000
sql_query <- sprintf(sql_query, sample_size)
natality_data <- bq_table_download(
bq_project_query(
PROJECT_ID,
query = sql_query
)
)
In [ ]:
# View the query result
head(natality_data)
In [ ]:
# Show # of rows and data types of each column
str(natality_data)
In [ ]:
# View the results summary
summary(natality_data)
In [ ]:
# Display the distribution of baby weights using a histogram
ggplot(
data = natality_data,
aes(x = weight_pounds)
) + geom_histogram(bins = 200)
In [ ]:
# Display the relationship between gestation weeks and baby weights
ggplot(
data = natality_data,
aes(x = gestation_weeks, y = weight_pounds)
) + geom_point() + geom_smooth(method = "lm")
In [ ]:
# For a given column, return the number of babies and the average baby
# weight for each distinct value of that column
get_distinct_values <- function(column_name) {
query <- paste0(
'SELECT ', column_name, ',
COUNT(1) AS num_babies,
AVG(weight_pounds) AS avg_wt
FROM publicdata.samples.natality
WHERE year > 2000
GROUP BY ', column_name)
bq_table_download(
bq_project_query(
PROJECT_ID,
query = query
)
)
}
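To make the aggregation concrete without touching BigQuery, here is a local sketch of the same GROUP BY computation using dplyr on a tiny, made-up data frame (the column names mirror the natality table; the values are illustrative only):

```r
library(dplyr)

# Toy stand-in for a few natality rows
toy <- data.frame(
  is_male = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  weight_pounds = c(7.5, 8.1, 6.9, 7.2, 7.0)
)

# Same shape of result as get_distinct_values('is_male'):
# one row per distinct value, with a count and an average weight
summary_df <- toy %>%
  group_by(is_male) %>%
  summarise(num_babies = n(), avg_wt = mean(weight_pounds))
print(summary_df)
```

Running the aggregation in BigQuery, as get_distinct_values does, matters because the full table has over 137 million rows; only the small aggregated result is downloaded into R.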
In [ ]:
df <- get_distinct_values('is_male')
ggplot(data = df, aes(x = is_male, y = num_babies)
) + geom_col()
ggplot(data = df, aes(x = is_male, y = avg_wt)
) + geom_col()
In [ ]:
df <- get_distinct_values('mother_age')
ggplot(data = df, aes(x = mother_age, y = num_babies)
) + geom_line()
ggplot(data = df, aes(x = mother_age, y = avg_wt)
) + geom_line()
In [ ]:
df <- get_distinct_values('plurality')
ggplot(data = df, aes(x = plurality, y = num_babies)
) + geom_col() + scale_y_log10()
ggplot(data = df, aes(x = plurality, y = avg_wt)
) + geom_col()
In [ ]:
df <- get_distinct_values('gestation_weeks')
ggplot(data = df, aes(x = gestation_weeks, y = num_babies)
) + geom_col() + scale_y_log10()
ggplot(data = df, aes(x = gestation_weeks, y = avg_wt)
) + geom_col()
In [ ]:
# Prepare training and evaluation queries; sql_query already contains the
# sample LIMIT, and the hashed key column gives a deterministic,
# roughly 75/25 split of the sampled rows
train_query <- paste('SELECT * FROM (', sql_query,
    ') WHERE MOD(CAST(key AS INT64), 100) <= 75')
eval_query <- paste('SELECT * FROM (', sql_query,
    ') WHERE MOD(CAST(key AS INT64), 100) > 75')
# Load training data to data frame
train_data <- bq_table_download(
bq_project_query(
PROJECT_ID,
query = train_query
)
)
# Load evaluation data to data frame
eval_data <- bq_table_download(
bq_project_query(
PROJECT_ID,
query = eval_query
)
)
In [ ]:
print(paste0("Training instances count: ", nrow(train_data)))
print(paste0("Evaluation instances count: ", nrow(eval_data)))
In [ ]:
# Write data frames to local CSV files, without headers or row names
dir.create(file.path('data'), showWarnings = FALSE)
write.table(train_data, "data/train_data.csv",
row.names = FALSE, col.names = FALSE, sep = ",")
write.table(eval_data, "data/eval_data.csv",
row.names = FALSE, col.names = FALSE, sep = ",")
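Because the files are written without headers, anything that reads them back must supply the column order itself. A quick local round-trip check, using a toy data frame (no cloud resources involved):

```r
# Write a headerless CSV the same way as above, then read it back
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
write.table(df, "check.csv", row.names = FALSE, col.names = FALSE, sep = ",")
back <- read.csv("check.csv", header = FALSE)  # columns come back as V1, V2
print(back)
```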
In [ ]:
# Create the Cloud Storage bucket (if it doesn't already exist) and upload
# the CSV files by passing gsutil commands to system()
gcs_url <- paste0("gs://", BUCKET_NAME, "/")
command <- paste("gsutil mb", gcs_url)
system(command)
gcs_data_dir <- paste0("gs://", BUCKET_NAME, "/data")
command <- paste("gsutil cp data/*_data.csv", gcs_data_dir)
system(command)
# List the uploaded files and their sizes
command <- paste("gsutil ls -l", gcs_data_dir)
system(command, intern = TRUE)
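system() only returns the command's exit status, so gsutil failures (for example, a bucket name that is already taken) can pass silently. A small wrapper, sketched here with a hypothetical run_cmd helper, makes failures visible:

```r
# Run a shell command and warn if it exits with a nonzero status
run_cmd <- function(command) {
  status <- system(command)
  if (status != 0) {
    warning(paste("Command failed with status", status, ":", command))
  }
  invisible(status)
}

# Example with a harmless built-in command; in the notebook you would pass
# the gsutil commands instead
status <- run_cmd("true")
```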
Authors: Daniel Sparing, Khalid Salama, Alok Pattani, Polong Lin
Disclaimer: This is not an official Google product. The sample code is provided for educational purposes only.
Copyright 2020 Google LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.