In this lab, you will take an existing 2 TB+ TPC-DS benchmark dataset and learn the data warehouse optimization methods you can apply to it in BigQuery to improve performance.
In this lab, you will learn how to:
This is an advanced-level SQL lab. Before taking it, you should have experience with SQL; familiarity with BigQuery is also highly recommended. If you need to get up to speed in these areas, take this Data Analyst series of labs first:
Once you're ready, scroll down to learn about the services you will be using and how to properly set up your lab environment.
BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without managing infrastructure or needing a database administrator. BigQuery uses SQL and takes advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.
In order to benchmark the performance of a data warehouse we first must get tables and data to run queries against. There is a public organization, TPC, that provides large benchmarking datasets to companies explicitly for this purpose. The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users.
The TPC-DS dataset we will be using comprises 25 tables and 99 queries that simulate common data analysis tasks. View the full documentation here.
The TPC-DS tables have been loaded into BigQuery, and you will explore ways to optimize the performance of common queries by using BigQuery data warehousing best practices. We have limited the size to 2 TB to keep this lab's runtime manageable, but the dataset itself can be expanded as needed.
Note: The TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (http://www.tpc.org). The Cloud DW benchmark is derived from the TPC-DS Benchmark and as such is not comparable to published TPC-DS results.
In [80]:
%%bigquery
SELECT
  dataset_id,
  table_id,
  -- Convert bytes to GB.
  ROUND(size_bytes/POW(10,9),2) AS size_gb,
  -- Convert UNIX EPOCH to a timestamp.
  TIMESTAMP_MILLIS(creation_time) AS creation_time,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time,
  row_count,
  CASE
    WHEN type = 1 THEN 'table'
    WHEN type = 2 THEN 'view'
    ELSE NULL
  END AS type
FROM
  `dw-workshop.tpcds_2t_baseline.__TABLES__`
ORDER BY size_gb DESC
Out[80]:
The core tables in the data warehouse are derived from 5 separate core operational systems (each with many tables):
These systems are driven by the core functions of our retail business. As you can see, our store accepts sales from online (web), mail-order (catalog), and in-store. The business must keep track of inventory and can offer promotional discounts on items sold.
In [81]:
%%bigquery
SELECT * FROM
`dw-workshop.tpcds_2t_baseline.INFORMATION_SCHEMA.COLUMNS`
Out[81]:
Question:
In [82]:
%%bigquery
SELECT * FROM
`dw-workshop.tpcds_2t_baseline.INFORMATION_SCHEMA.COLUMNS`
WHERE
is_partitioning_column = 'YES' OR clustering_ordinal_position IS NOT NULL
Out[82]:
Question
In [83]:
%%bigquery
SELECT
COUNT(column_name) AS column_count,
table_name
FROM
`dw-workshop.tpcds_2t_baseline.INFORMATION_SCHEMA.COLUMNS`
GROUP BY table_name
ORDER BY column_count DESC, table_name
Out[83]:
In the BigQuery UI, find the Resources panel and search for catalog_sales. Click on the catalog_sales table name in the tpcds_2t_baseline dataset.
Question
Hint: Use the Details button in the web UI to quickly access table metadata.
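If you prefer to stay in the notebook, roughly the same metadata is available through the __TABLES__ view we queried earlier. Here is a minimal sketch filtered to catalog_sales:

In [ ]:
%%bigquery
SELECT
  table_id,
  row_count,
  -- Convert bytes to GB, as before.
  ROUND(size_bytes/POW(10,9),2) AS size_gb,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time
FROM
  `dw-workshop.tpcds_2t_baseline.__TABLES__`
WHERE
  table_id = 'catalog_sales'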
Question: Preview the data and find the Catalog Sales Extended Sales Price cs_ext_sales_price field (which is calculated based on product quantity cs_quantity * sales price)?
In [84]:
%%bigquery --verbose
SELECT
cs_item_sk,
COUNT(cs_order_number) AS total_orders,
SUM(cs_quantity) AS total_quantity,
SUM(cs_ext_sales_price) AS total_revenue,
SUM(cs_net_profit) AS total_profit
FROM
`dw-workshop.tpcds_2t_baseline.catalog_sales`
GROUP BY
cs_item_sk
ORDER BY
total_orders DESC
LIMIT
10
Out[84]:
A note on our data: The TPC-DS benchmark allows data warehouse practitioners to generate any volume of data programmatically. Since the rows of data are system generated, they may not make the most sense in a business context (like why are we selling our top product at such a huge loss!).
The good news is that to benchmark our performance we care most about the volume of rows and columns to run our benchmark against.
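As a quick sanity check on that volume, we can total the rows and storage across the whole baseline dataset by reusing the __TABLES__ metadata view from the first query (a small sketch, counting tables only):

In [ ]:
%%bigquery
SELECT
  COUNT(*) AS table_count,
  SUM(row_count) AS total_rows,
  -- Convert bytes to TB.
  ROUND(SUM(size_bytes)/POW(10,12),2) AS total_size_tb
FROM
  `dw-workshop.tpcds_2t_baseline.__TABLES__`
WHERE
  type = 1  -- tables only, per the earlier CASE expression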
Click on Execution details
Refer to the chart below (which should be similar to your results) and answer the following questions.
Question
We know the query took 5.1 seconds to run, so what does the 1 hr 24 min slot time metric mean?
Inside of the BigQuery service are lots of virtual machines that massively process your data and query logic in parallel. These workers, or "slots", work together to process a single query job really quickly. For accounts with on-demand pricing, you can have up to 2,000 slots.
So say we had 30 minutes of slot time or 1800 seconds. If the query took 20 seconds in total to run, but it was 1800 seconds worth of work, how many workers at minimum worked on it? 1800/20 = 90
And that's assuming each worker instantly had all the data it needed (no shuffling of data between workers) and was at full capacity for all 20 seconds!
In reality, workers have a variety of tasks (waiting for data, reading it, performing computations, and writing data) and also need to compare notes with each other on what work was already done on the job. The good news for you is that you don't need to worry about optimizing these workers or the underlying data to run perfectly in parallel. That's why BigQuery is a managed service -- there's an entire team dedicated to hardware and data storage optimization.
In case you were wondering, the worker limit for your project is 2,000 slots at once.
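If you want to check this ratio for your own jobs with SQL, the INFORMATION_SCHEMA jobs views expose total_slot_ms alongside each job's start and end time. This is a sketch rather than a required lab step; it assumes the JOBS_BY_PROJECT view is available to your project and that your jobs ran in the US multi-region:

In [ ]:
%%bigquery
SELECT
  job_id,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS elapsed_ms,
  -- Rough lower bound on average parallelism, like the 1800/20 = 90 example above.
  ROUND(total_slot_ms / NULLIF(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND), 0), 1) AS approx_avg_slots
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'QUERY'
ORDER BY
  creation_time DESC
LIMIT 5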
To benchmark the performance of our data warehouse in BigQuery, we need more than just a single SQL report. The good news is that the TPC-DS dataset ships with 99 standard benchmark queries that we can run while logging the performance outcomes.
In this lab, we make no adjustments to the existing data warehouse tables (no partitioning, no clustering, no nesting) so we can establish a performance benchmark to beat in future labs.
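For orientation only, here is the general shape of the partitioning and clustering DDL a later lab might apply. This is a hypothetical sketch: the destination dataset name, the RANGE_BUCKET bounds, and the choice of cs_sold_date_sk / cs_item_sk are illustrative placeholders, not values prescribed by this lab, and running it would materialize a full copy of the table.

In [ ]:
%%bigquery
-- Illustrative only: an integer-range-partitioned, clustered copy of catalog_sales.
-- The bucket bounds are placeholders; a real lab would derive them from the
-- actual range of cs_sold_date_sk values. Running this copies the entire table.
CREATE TABLE IF NOT EXISTS tpcds_2t_flat_part_clust.catalog_sales
PARTITION BY RANGE_BUCKET(cs_sold_date_sk, GENERATE_ARRAY(2400000, 2500000, 1000))
CLUSTER BY cs_item_sk
AS
SELECT * FROM `dw-workshop.tpcds_2t_baseline.catalog_sales`

Again, nothing like this is applied here; the baseline numbers below come from the flat tables as-is.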
In [85]:
!head --lines=50 'sql/example_baseline_queries.sql'
Now let's run the first query against our dataset and note the execution time. Tip: You can use the --verbose flag in %%bigquery magics to return the job and completion time.
In [105]:
%%bigquery --verbose
# start query 1 in stream 0 using template query96.tpl
select count(*)
from `dw-workshop.tpcds_2t_baseline.store_sales` as store_sales
,`dw-workshop.tpcds_2t_baseline.household_demographics` as household_demographics
,`dw-workshop.tpcds_2t_baseline.time_dim` as time_dim,
`dw-workshop.tpcds_2t_baseline.store` as store
where ss_sold_time_sk = time_dim.t_time_sk
and ss_hdemo_sk = household_demographics.hd_demo_sk
and ss_store_sk = s_store_sk
and time_dim.t_hour = 8
and time_dim.t_minute >= 30
and household_demographics.hd_dep_count = 5
and store.s_store_name = 'ese'
order by count(*)
limit 100;
Out[105]:
It should execute in just a few seconds. Then try running it again and see if you get the same performance. BigQuery will automatically cache the results from the first time you ran the query and then serve those same results when you run the query again. We can confirm this by analyzing the query job statistics.
Let's list our five most recent query jobs run on BigQuery using the bq command line interface. Then we will get even more detail on our most recent job with the bq show command. Be sure to replace the job id with your own.
In [87]:
!bq ls -j -a -n 5
In [88]:
!bq show --format=prettyjson -j 612a4b28-cb5c-4e0b-ad5b-ebd51c3b2439
Looking at the job statistics, we can see our most recent query hit cache (cacheHit: true) and therefore totalBytesProcessed: 0. While this is great in normal BigQuery usage (you aren't charged for queries that hit cache), it ruins our performance test. Cache is super useful, but we want to disable it for testing purposes.
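As an aside, the same cacheHit and totalBytesProcessed figures can also be pulled with SQL from the INFORMATION_SCHEMA jobs view used in the earlier sketch, again assuming it is available to your project and your jobs ran in the US multi-region:

In [ ]:
%%bigquery
SELECT
  job_id,
  cache_hit,
  total_bytes_processed,
  creation_time
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'QUERY'
ORDER BY
  creation_time DESC
LIMIT 5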
As of the time this lab was created, you can't pass a flag to %%bigquery IPython notebook magics to disable cache or to quickly see the amount of data processed, so we will use the traditional bq command line interface in bash.

First we will do a dry run of the query without processing any data, just to see how many bytes would be processed. Then we will remove that flag and ensure --nouse_cache is set to avoid hitting cache as well.
In [ ]:
%%bash
bq query \
--dry_run \
--nouse_cache \
--use_legacy_sql=false \
"""\
select count(*)
from \`dw-workshop.tpcds_2t_baseline.store_sales\` as store_sales
,\`dw-workshop.tpcds_2t_baseline.household_demographics\` as household_demographics
,\`dw-workshop.tpcds_2t_baseline.time_dim\` as time_dim, \`dw-workshop.tpcds_2t_baseline.store\` as store
where ss_sold_time_sk = time_dim.t_time_sk
and ss_hdemo_sk = household_demographics.hd_demo_sk
and ss_store_sk = s_store_sk
and time_dim.t_hour = 8
and time_dim.t_minute >= 30
and household_demographics.hd_dep_count = 5
and store.s_store_name = 'ese'
order by count(*)
limit 100;
"""
In [63]:
# Convert bytes to GB
132086388641 / 1e+9
Out[63]:
132 GB will be processed. At the time of writing, BigQuery pricing is $5 per 1 TB (or 1,000 GB) of data after the first free 1 TB each month. Assuming we've exhausted our free 1 TB this month, this query would cost about $0.66 to run.
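If you'd rather do that back-of-the-envelope math in the query editor, here is a one-line sketch using the dry-run byte count from above:

In [ ]:
%%bigquery
SELECT
  -- Dry-run bytes * $5 per TB; assumes the free 1 TB for the month is already used.
  ROUND(132086388641 / POW(10,12) * 5, 2) AS estimated_cost_usd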
Now let's run it and ensure we're not pulling from cache so we get an accurate time-to-completion benchmark.
In [ ]:
%%bash
bq query \
--nouse_cache \
--use_legacy_sql=false \
"""\
select count(*)
from \`dw-workshop.tpcds_2t_baseline.store_sales\` as store_sales
,\`dw-workshop.tpcds_2t_baseline.household_demographics\` as household_demographics
,\`dw-workshop.tpcds_2t_baseline.time_dim\` as time_dim, \`dw-workshop.tpcds_2t_baseline.store\` as store
where ss_sold_time_sk = time_dim.t_time_sk
and ss_hdemo_sk = household_demographics.hd_demo_sk
and ss_store_sk = s_store_sk
and time_dim.t_hour = 8
and time_dim.t_minute >= 30
and household_demographics.hd_dep_count = 5
and store.s_store_name = 'ese'
order by count(*)
limit 100;
"""
If you're an experienced BigQuery user, you likely have seen these same metrics in the Web UI as well as highlighted in the red box below:
It's a matter of preference whether you do your work in the Web UI or the command line -- each has its advantages.
One major advantage of using the bq command line interface is the ability to create a script that will run the remaining 98 benchmark queries for us and log the results.

We will use the BigQuery Data Transfer Service to quickly copy our large dataset from the dw-workshop GCP project into your own so you can perform the benchmarking.
In [ ]:
%%bash
export PROJECT_ID=$(gcloud config list --format 'value(core.project)')
export BENCHMARK_DATASET_NAME=tpcds_2t_baseline # Name of the dataset you want to create

## Create the benchmark BigQuery dataset if it doesn't already exist
datasetexists=$(bq ls -d | grep -w $BENCHMARK_DATASET_NAME)

if [ -n "$datasetexists" ]; then
    echo -e "BigQuery dataset $BENCHMARK_DATASET_NAME already exists, let's not recreate it."
else
    echo "Creating BigQuery dataset titled: $BENCHMARK_DATASET_NAME"
    bq --location=US mk --dataset \
        --description 'Benchmark Dataset' \
        $PROJECT_ID:$BENCHMARK_DATASET_NAME
    echo -e "\nHere are your current datasets:"
    bq ls
fi
In the BigQuery UI, select the tpcds_2t_baseline dataset in the dw-workshop project and click Copy, choosing the tpcds_2t_baseline dataset you just created as the destination. Wait for the transfer to complete.
In [110]:
%%bigquery
SELECT COUNT(*) AS store_transaction_count
FROM tpcds_2t_baseline.store_sales
Out[110]:
Running each of the 99 queries manually via the Console UI would be a tedious effort. We'll show you how you can run all 99 programmatically and automatically log the output (time and GB processed) to a log file for analysis.
Below is a shell script that:
- creates a perf reporting table to track how fast each query runs
- loops through each query in the benchmark SQL file and runs it with the bq query command (with cache disabled)
- logs each query's elapsed time to the perf table
Execute the statement below and follow along with the results as you benchmark a few example queries (don't worry, we've already run the full 99 recently, so you won't have to).
After executing, wait 1-2 minutes for the benchmark test to complete
In [91]:
%%bash
# Runs the SQL queries from the TPC-DS benchmark and records how long each one takes
BQ_DATASET="tpcds_2t_baseline" # let's start by benchmarking our baseline dataset
QUERY_FILE_PATH="./sql/example_baseline_queries.sql" # the full test is on 99_baseline_queries but that will take 80+ mins to run
IFS=";"
# create perf table to keep track of run times for all 99 queries
printf "\033[32;1m Housekeeping tasks... \033[0m\n\n";
printf "Creating a reporting table perf to track how fast each query runs...";
perf_table_ddl="CREATE TABLE IF NOT EXISTS $BQ_DATASET.perf(performance_test_num int64, query_num int64, elapsed_time_sec int64, ran_on int64)"
bq rm -f $BQ_DATASET.perf
bq query --nouse_legacy_sql $perf_table_ddl
start=$(date +%s)
index=0
for select_stmt in $(<$QUERY_FILE_PATH)
do
    # run the test until you hit a line with the string 'END OF BENCHMARK' in the file
    if [[ "$select_stmt" == *'END OF BENCHMARK'* ]]; then
        break
    fi

    printf "\n\033[32;1m Let's benchmark this query... \033[0m\n";
    printf "$select_stmt";

    SECONDS=0;
    bq query --use_cache=false --nouse_legacy_sql $select_stmt # critical to turn cache off for this test
    duration=$SECONDS

    # get the current timestamp in seconds (UNIX epoch)
    ran_on=$(date +%s)
    index=$((index+1))

    printf "\n\033[32;1m Here's how long it took... \033[0m\n\n";
    echo "Query $index ran in $(($duration / 60)) minutes and $(($duration % 60)) seconds."

    printf "\n\033[32;1m Writing to our benchmark table... \033[0m\n\n";
    insert_stmt="insert into $BQ_DATASET.perf(performance_test_num, query_num, elapsed_time_sec, ran_on) values($start, $index, $duration, $ran_on)"
    printf "$insert_stmt"
    bq query --nouse_legacy_sql $insert_stmt
done
end=$(date +%s)
printf "Benchmark test complete"
In [66]:
%%bigquery
SELECT * FROM tpcds_2t_baseline.perf
WHERE
# Let's only pull the results from our most recent test
performance_test_num = (SELECT MAX(performance_test_num) FROM tpcds_2t_baseline.perf)
ORDER BY ran_on
Out[66]:
And finally, the overall statistics for the entire test:
In [67]:
%%bigquery
SELECT
TIMESTAMP_SECONDS(MAX(performance_test_num)) AS test_date,
MAX(performance_test_num) AS latest_performance_test_num,
COUNT(DISTINCT query_num) AS count_queries_benchmarked,
SUM(elapsed_time_sec) AS total_time_sec,
MIN(elapsed_time_sec) AS fastest_query_time_sec,
MAX(elapsed_time_sec) AS slowest_query_time_sec
FROM
tpcds_2t_baseline.perf
WHERE
performance_test_num = (SELECT MAX(performance_test_num) FROM tpcds_2t_baseline.perf)
Out[67]:
In [113]:
%%bigquery
SELECT
TIMESTAMP_SECONDS(performance_test_num) AS test_date,
query_num,
TIMESTAMP_SECONDS(ran_on) AS query_ran_on,
TIMESTAMP_SECONDS(ran_on + elapsed_time_sec) AS query_completed_on,
elapsed_time_sec
FROM `dw-workshop.tpcds_2t_baseline.perf` # public table
WHERE
# Let's only pull the results from our most recent test
performance_test_num = (SELECT MAX(performance_test_num) FROM `dw-workshop.tpcds_2t_baseline.perf`)
ORDER BY ran_on
Out[113]:
And the results of the complete test:
In [117]:
%%bigquery
SELECT
TIMESTAMP_SECONDS(MAX(performance_test_num)) AS test_date,
COUNT(DISTINCT query_num) AS count_queries_benchmarked,
SUM(elapsed_time_sec) AS total_time_sec,
ROUND(SUM(elapsed_time_sec)/60,2) AS total_time_min,
MIN(elapsed_time_sec) AS fastest_query_time_sec,
MAX(elapsed_time_sec) AS slowest_query_time_sec,
ROUND(AVG(elapsed_time_sec),2) AS avg_query_time_sec
FROM
`dw-workshop.tpcds_2t_baseline.perf`
WHERE
performance_test_num = (SELECT MAX(performance_test_num) FROM `dw-workshop.tpcds_2t_baseline.perf`)
Out[117]:
Note the total_time_sec of 4760 seconds (or 79 minutes), which we will look to beat in future labs by applying BigQuery optimization techniques like: