Lab 1: Explore and Benchmark a BigQuery Dataset for Performance

Overview

In this lab you will take an existing 2TB+ TPC-DS benchmark dataset and learn the data warehouse optimization methods you can apply to the dataset in BigQuery to improve performance.

What you'll do

In this lab, you will learn how to:

  • Use BigQuery to access and query the TPC-DS benchmark dataset
  • Run pre-defined queries to establish baseline performance benchmarks

Prerequisites

This is an advanced-level SQL lab. Before taking it, you should have experience writing SQL, and familiarity with BigQuery is highly recommended. If you need to get up to speed in these areas, consider taking the Data Analyst series of labs first.

Once you're ready, scroll down to learn about the services you will be using and how to properly set up your lab environment.

BigQuery

BigQuery is Google's fully managed, NoOps, low-cost analytics database. With BigQuery you can query terabytes of data without managing any infrastructure or needing a database administrator. BigQuery uses standard SQL and a pay-as-you-go pricing model, letting you focus on analyzing data to find meaningful insights.

TPC-DS Background

In order to benchmark the performance of a data warehouse, we first need tables and data to run queries against. The Transaction Processing Performance Council (TPC) provides large benchmarking datasets explicitly for this purpose; the goal of TPC benchmarks is to provide relevant, objective performance data to industry users.

The TPC-DS dataset we will be using comprises 25 tables and 99 queries that simulate common data analysis tasks. View the full documentation here.

Exploring TPC-DS in BigQuery

The TPC-DS tables have been loaded into BigQuery, and you will explore ways to optimize the performance of common queries by using BigQuery data warehousing best practices. We have limited the dataset to 2 TB to fit the timing of this lab, but it can be expanded as needed.

Note: The TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (http://www.tpc.org). The Cloud DW benchmark is derived from the TPC-DS Benchmark and as such is not comparable to published TPC-DS results.

Exploring the Schema with SQL

Question:

  • How many tables are in the dataset?
  • What is the name of the largest table (in GB)? How many rows does it have?

In [80]:
%%bigquery
SELECT 
  dataset_id,
  table_id,
  -- Convert bytes to GB.
  ROUND(size_bytes/pow(10,9),2) as size_gb,
  -- Convert UNIX EPOCH to a timestamp.
  TIMESTAMP_MILLIS(creation_time) AS creation_time,
  TIMESTAMP_MILLIS(last_modified_time) as last_modified_time,
  row_count,
  CASE 
    WHEN type = 1 THEN 'table'
    WHEN type = 2 THEN 'view'
  ELSE NULL
  END AS type
FROM
  `dw-workshop.tpcds_2t_baseline.__TABLES__`
ORDER BY size_gb DESC


Out[80]:
dataset_id table_id size_gb creation_time last_modified_time row_count type
0 tpcds_2t_baseline store_sales 1545.13 2019-10-13 19:15:03.190000+00:00 2019-10-13 19:15:03.190000+00:00 5762820700 table
1 tpcds_2t_baseline catalog_sales 1124.82 2019-10-13 19:14:55.693000+00:00 2019-10-13 19:14:55.693000+00:00 2881495086 table
2 tpcds_2t_baseline web_sales 564.63 2019-10-13 19:15:03.986000+00:00 2019-10-13 19:15:03.986000+00:00 1440681379 table
3 tpcds_2t_baseline store_returns 129.34 2019-10-13 19:15:01.018000+00:00 2019-10-13 19:15:01.018000+00:00 576280209 table
4 tpcds_2t_baseline catalog_returns 81.51 2019-10-13 19:14:54.124000+00:00 2019-10-13 19:14:54.124000+00:00 288154642 table
5 tpcds_2t_baseline web_returns 36.43 2019-10-13 19:15:03.043000+00:00 2019-10-13 19:15:03.043000+00:00 144074630 table
6 tpcds_2t_baseline customer 1.44 2019-10-13 19:14:54.618000+00:00 2019-10-13 19:14:54.618000+00:00 9100000 table
7 tpcds_2t_baseline inventory 1.19 2019-10-13 19:14:57.402000+00:00 2019-10-13 19:14:57.402000+00:00 37584000 table
8 tpcds_2t_baseline customer_address 0.61 2019-10-13 19:14:55.631000+00:00 2019-10-13 19:14:55.631000+00:00 4550000 table
9 tpcds_2t_baseline customer_demographics 0.13 2019-10-13 19:14:55.340000+00:00 2019-10-13 19:14:55.340000+00:00 1920800 table
10 tpcds_2t_baseline item 0.02 2019-10-13 19:14:58.165000+00:00 2019-10-13 19:14:58.165000+00:00 48000 table
11 tpcds_2t_baseline date_dim 0.01 2019-10-13 19:14:55.691000+00:00 2019-10-13 19:14:55.691000+00:00 73049 table
12 tpcds_2t_baseline time_dim 0.01 2019-10-13 19:15:01.382000+00:00 2019-10-13 19:15:01.382000+00:00 86400 table
13 tpcds_2t_baseline call_center 0.00 2019-10-13 19:14:52.907000+00:00 2019-10-13 19:14:52.907000+00:00 8 table
14 tpcds_2t_baseline catalog_page 0.00 2019-10-13 19:14:53.022000+00:00 2019-10-13 19:14:53.022000+00:00 11718 table
15 tpcds_2t_baseline dbgen_version 0.00 2019-10-13 19:14:56.109000+00:00 2019-10-13 19:14:56.109000+00:00 1 table
16 tpcds_2t_baseline household_demographics 0.00 2019-10-13 19:14:56.855000+00:00 2019-10-13 19:14:56.855000+00:00 7200 table
17 tpcds_2t_baseline income_band 0.00 2019-10-13 19:14:57.134000+00:00 2019-10-13 19:14:57.134000+00:00 20 table
18 tpcds_2t_baseline perf 0.00 2019-10-14 07:04:04.205000+00:00 2019-10-14 08:29:47.141000+00:00 99 table
19 tpcds_2t_baseline promotion 0.00 2019-10-13 19:14:58.728000+00:00 2019-10-13 19:14:58.728000+00:00 450 table
20 tpcds_2t_baseline reason 0.00 2019-10-13 19:14:59.216000+00:00 2019-10-13 19:14:59.216000+00:00 36 table
21 tpcds_2t_baseline ship_mode 0.00 2019-10-13 19:15:00.140000+00:00 2019-10-13 19:15:00.140000+00:00 20 table
22 tpcds_2t_baseline store 0.00 2019-10-13 19:15:00.160000+00:00 2019-10-13 19:15:00.160000+00:00 186 table
23 tpcds_2t_baseline warehouse 0.00 2019-10-13 19:15:01.593000+00:00 2019-10-13 19:15:01.593000+00:00 6 table
24 tpcds_2t_baseline web_page 0.00 2019-10-13 19:15:01.927000+00:00 2019-10-13 19:15:01.927000+00:00 360 table
25 tpcds_2t_baseline web_site 0.00 2019-10-13 19:15:03.301000+00:00 2019-10-13 19:15:03.301000+00:00 36 table

The core tables in the data warehouse are derived from 5 separate core operational systems (each with many tables):

These systems are driven by the core functions of our retail business. As you can see, the business takes sales online (web), by mail order (catalog), and in-store. It must keep track of inventory and can offer promotional discounts on items sold.

Exploring all available columns of data

Question:

  • How many columns of data are in the entire dataset (all tables)?

In [81]:
%%bigquery
SELECT * FROM 
 `dw-workshop.tpcds_2t_baseline.INFORMATION_SCHEMA.COLUMNS`


Out[81]:
table_catalog table_schema table_name column_name ordinal_position is_nullable data_type is_generated generation_expression is_stored is_hidden is_updatable is_system_defined is_partitioning_column clustering_ordinal_position
0 dw-workshop tpcds_2t_baseline household_demographics hd_demo_sk 1 NO INT64 NEVER None None NO None NO NO None
1 dw-workshop tpcds_2t_baseline household_demographics hd_income_band_sk 2 YES INT64 NEVER None None NO None NO NO None
2 dw-workshop tpcds_2t_baseline household_demographics hd_buy_potential 3 YES STRING NEVER None None NO None NO NO None
3 dw-workshop tpcds_2t_baseline household_demographics hd_dep_count 4 YES INT64 NEVER None None NO None NO NO None
4 dw-workshop tpcds_2t_baseline household_demographics hd_vehicle_count 5 YES INT64 NEVER None None NO None NO NO None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
428 dw-workshop tpcds_2t_baseline customer c_birth_year 14 YES INT64 NEVER None None NO None NO NO None
429 dw-workshop tpcds_2t_baseline customer c_birth_country 15 YES STRING NEVER None None NO None NO NO None
430 dw-workshop tpcds_2t_baseline customer c_login 16 YES STRING NEVER None None NO None NO NO None
431 dw-workshop tpcds_2t_baseline customer c_email_address 17 YES STRING NEVER None None NO None NO NO None
432 dw-workshop tpcds_2t_baseline customer c_last_review_date_sk 18 YES INT64 NEVER None None NO None NO NO None

433 rows × 15 columns
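
The 433 rows returned above are themselves the answer: INFORMATION_SCHEMA.COLUMNS returns one row per column, so there are 433 columns across all tables. If you'd rather compute the number directly, here is a minimal sketch using the Python client library (assuming the notebook already has default credentials for a project with access to dw-workshop):

In [ ]:
from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials are configured

sql = """
SELECT COUNT(*) AS total_columns
FROM `dw-workshop.tpcds_2t_baseline.INFORMATION_SCHEMA.COLUMNS`
"""
# One row per column, so COUNT(*) is the total number of columns in the dataset.
row = list(client.query(sql).result())[0]
print(f"Columns across all tables: {row['total_columns']}")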

Question:

  • Are any of the columns of data in this baseline dataset partitioned or clustered?

In [82]:
%%bigquery
SELECT * FROM 
 `dw-workshop.tpcds_2t_baseline.INFORMATION_SCHEMA.COLUMNS`
WHERE 
  is_partitioning_column = 'YES' OR clustering_ordinal_position IS NOT NULL


Out[82]:
table_catalog table_schema table_name column_name ordinal_position is_nullable data_type is_generated generation_expression is_stored is_hidden is_updatable is_system_defined is_partitioning_column clustering_ordinal_position

Question

  • How many columns of data does each table have (sorted by most to least)?
  • Which table has the most columns of data?

In [83]:
%%bigquery
SELECT 
  COUNT(column_name) AS column_count, 
  table_name 
FROM 
 `dw-workshop.tpcds_2t_baseline.INFORMATION_SCHEMA.COLUMNS`
GROUP BY table_name
ORDER BY column_count DESC, table_name


Out[83]:
column_count table_name
0 34 catalog_sales
1 34 web_sales
2 31 call_center
3 29 store
4 28 date_dim
5 27 catalog_returns
6 26 web_site
7 24 web_returns
8 23 store_sales
9 22 item
10 20 store_returns
11 19 promotion
12 18 customer
13 14 warehouse
14 14 web_page
15 13 customer_address
16 10 time_dim
17 9 catalog_page
18 9 customer_demographics
19 6 ship_mode
20 5 household_demographics
21 4 dbgen_version
22 4 inventory
23 4 perf
24 3 income_band
25 3 reason

Previewing sample rows of data values

In the BigQuery UI, find the Resources panel and search for catalog_sales

Click on the catalog_sales table name for the tpcds_2t_baseline dataset

Question

  • How many rows are in the table?
  • How large is the table in TB?

Hint: Use the Details button in the web UI to quickly access table metadata (or fetch it programmatically, as sketched below).
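
A minimal sketch of fetching the same metadata with the Python client library (assuming default credentials); this reads table metadata only and scans no data:

In [ ]:
from google.cloud import bigquery

client = bigquery.Client()

# Fetch table metadata without scanning any data (equivalent to the Details view).
table = client.get_table("dw-workshop.tpcds_2t_baseline.catalog_sales")
print(f"Rows: {table.num_rows:,}")
print(f"Size: {table.num_bytes / 1e12:.2f} TB")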

Question:

  • Preview the data and find the Catalog Sales Extended Sales Price cs_ext_sales_price field (which is calculated based on product quantity * sales price)
  • Are there any missing data values for Catalog Sales Quantity (cs_quantity)?
  • Are there any missing values for cs_ext_ship_cost? For what type of product might this be expected? (Digital products.) A quick way to check is sketched below.
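
Here is that check: a minimal sketch using the Python client (assuming default credentials) that counts missing values directly rather than relying on the preview pane. Note that it scans the referenced columns across the full table:

In [ ]:
from google.cloud import bigquery

client = bigquery.Client()

# Count NULLs directly instead of eyeballing the preview.
sql = """
SELECT
  COUNTIF(cs_quantity IS NULL)      AS missing_quantity,
  COUNTIF(cs_ext_ship_cost IS NULL) AS missing_ext_ship_cost,
  COUNT(*)                          AS total_rows
FROM `dw-workshop.tpcds_2t_baseline.catalog_sales`
"""
for row in client.query(sql).result():
    print(dict(row))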

Create an example sales report

Write a query that shows key sales stats for each item sold from the Catalog:

  • total orders
  • total unit quantity
  • total revenue
  • total profit
  • sorted by total orders highest to lowest (the example below returns the top 10)

In [84]:
%%bigquery --verbose
SELECT
  cs_item_sk,
  COUNT(cs_order_number) AS total_orders,
  SUM(cs_quantity) AS total_quantity,
  SUM(cs_ext_sales_price) AS total_revenue,
  SUM(cs_net_profit) AS total_profit
FROM
  `dw-workshop.tpcds_2t_baseline.catalog_sales`
GROUP BY
  cs_item_sk
ORDER BY
  total_orders DESC
LIMIT
  10


Executing query with job ID: 94a5a441-178c-4efa-a81a-bd450b27bca7
Query executing: 69.08s
Query complete after 69.67s
Out[84]:
cs_item_sk total_orders total_quantity total_revenue total_profit
0 9253 121222 6103140 307670152.68 -30932953.61
1 1927 121217 6106523 309693527.62 -29261211.24
2 16603 121181 6099067 307870874.85 -31511409.31
3 193 121150 6074650 305044085.7 -31913535.21
4 38845 121142 6096548 306754379.03 -30723801.34
5 14335 121109 6098853 307479510.28 -30412623.1
6 21145 121101 6076957 306499554.37 -30122065.46
7 7255 121098 6075854 308807049.12 -30202631
8 37315 121094 6075936 307200736.08 -31155035.24
9 9973 121091 6089771 308538961.44 -31130175.79

A note on our data: The TPC-DS benchmark allows data warehouse practitioners to generate any volume of data programmatically. Since the rows of data are system generated, they may not always make sense in a business context (for example, why are we selling our top products at such a huge loss?).

The good news is that, for benchmarking performance, what matters most is the volume of rows and columns the queries run against.

Analyzing query performance

Click on Execution details

Refer to the chart below (which should be similar to your results) and answer the following questions.

Question

  • How long did it take the query to run? 5.1s
  • How much data in GB was processed? 150GB
  • How much slot time was consumed? 1hr 24min
  • How many rows were input? 2,881,495,086
  • How many rows were output as the end result (before the limit)? 23,300
  • What does the output rows mean in the context of our query? (23,300 unique cs_item_sk)

Side note: Slot Time

We know the query took 5.1 seconds to run, so what does the 1 hr 24 min slot time metric mean?

Inside the BigQuery service are lots of virtual machines that process your data and query logic in massively parallel fashion. These workers, or "slots", work together to process a single query job very quickly. For accounts with on-demand pricing, you can have up to 2,000 slots.

So say we had 30 minutes of slot time, or 1,800 seconds. If the query took 20 seconds in total to run but represented 1,800 seconds' worth of work, how many workers at minimum worked on it? 1800/20 = 90

And that's assuming each worker instantly had all the data it needed (no shuffling of data between workers) and was at full capacity for all 20 seconds!

In reality, workers have a variety of tasks (waiting for data, reading it, performing computations, and writing data) and also need to compare notes with each other on what work has already been done on the job. The good news for you is that you don't need to worry about optimizing these workers or the underlying data to run perfectly in parallel. That's why BigQuery is a managed service -- there's an entire team dedicated to hardware and data storage optimization.

As noted above, the slot limit for an on-demand project is 2,000 slots at once.
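
You can pull these statistics for any completed job and reproduce the arithmetic above with the Python client library. A minimal sketch, assuming a recent client version that exposes slot_millis, and substituting one of your own job IDs (for example from bq ls -j):

In [ ]:
from google.cloud import bigquery

client = bigquery.Client()

# Replace with one of your own job IDs (see `bq ls -j`).
job = client.get_job("94a5a441-178c-4efa-a81a-bd450b27bca7", location="US")

elapsed_sec = (job.ended - job.started).total_seconds()
slot_sec = job.slot_millis / 1000

print(f"Elapsed time: {elapsed_sec:.1f} s")
print(f"Slot time:    {slot_sec / 60:.1f} min")
# Lower bound on how many slots must have been working in parallel.
print(f"Implied minimum parallel slots: {slot_sec / elapsed_sec:.0f}")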

Running a performance benchmark

To benchmark the performance of our data warehouse in BigQuery, we need more than a single SQL report. The good news is that the TPC-DS benchmark ships with 99 standard queries that we can run while logging the performance outcomes.

In this lab, we make no adjustments to the existing data warehouse tables (no partitioning, no clustering, no nesting) so that we can establish a performance baseline to beat in future labs.

Viewing the 99 pre-made SQL queries

We have a long SQL file with 99 standard queries against this dataset stored in our /sql/ directory.

Let's view the first 50 lines of those baseline queries to get familiar with how we will be performance benchmarking our dataset.


In [85]:
!head --lines=50 'sql/example_baseline_queries.sql'


-- start query 1 in stream 0 using template query96.tpl
select  count(*) 
from tpcds_2t_baseline.store_sales
    ,tpcds_2t_baseline.household_demographics 
    ,tpcds_2t_baseline.time_dim, tpcds_2t_baseline.store
where ss_sold_time_sk = time_dim.t_time_sk   
    and ss_hdemo_sk = household_demographics.hd_demo_sk 
    and ss_store_sk = s_store_sk
    and time_dim.t_hour = 8
    and time_dim.t_minute >= 30
    and household_demographics.hd_dep_count = 5
    and store.s_store_name = 'ese'
order by count(*)
limit 100;

-- end query 1 in stream 0 using template query96.tpl
-- start query 2 in stream 0 using template query7.tpl
select  i_item_id, 
        avg(ss_quantity) agg1,
        avg(ss_list_price) agg2,
        avg(ss_coupon_amt) agg3,
        avg(ss_sales_price) agg4 
 from tpcds_2t_baseline.store_sales, tpcds_2t_baseline.customer_demographics, tpcds_2t_baseline.date_dim, tpcds_2t_baseline.item, tpcds_2t_baseline.promotion
 where ss_sold_date_sk = d_date_sk and
       ss_item_sk = i_item_sk and
       ss_cdemo_sk = cd_demo_sk and
       ss_promo_sk = p_promo_sk and
       cd_gender = 'M' and 
       cd_marital_status = 'M' and
       cd_education_status = '4 yr Degree' and
       (p_channel_email = 'N' or p_channel_event = 'N') and
       d_year = 2001 
 group by i_item_id
 order by i_item_id
 limit 100;

-- end query 2 in stream 0 using template query7.tpl
-- start query 3 in stream 0 using template query75.tpl
WITH all_sales AS (
 SELECT d_year
       ,i_brand_id
       ,i_class_id
       ,i_category_id
       ,i_manufact_id
       ,SUM(sales_cnt) AS sales_cnt
       ,SUM(sales_amt) AS sales_amt
 FROM (SELECT d_year
             ,i_brand_id
             ,i_class_id

Running the first benchmark test

Now let's run the first query against our dataset and note the execution time. Tip: You can use the --verbose flag in %%bigquery magics to return the job and completion time.


In [105]:
%%bigquery --verbose
# start query 1 in stream 0 using template query96.tpl
select  count(*) 
from `dw-workshop.tpcds_2t_baseline.store_sales` as store_sales
    ,`dw-workshop.tpcds_2t_baseline.household_demographics` as household_demographics 
    ,`dw-workshop.tpcds_2t_baseline.time_dim` as time_dim, 
    `dw-workshop.tpcds_2t_baseline.store` as store
where ss_sold_time_sk = time_dim.t_time_sk   
    and ss_hdemo_sk = household_demographics.hd_demo_sk 
    and ss_store_sk = s_store_sk
    and time_dim.t_hour = 8
    and time_dim.t_minute >= 30
    and household_demographics.hd_dep_count = 5
    and store.s_store_name = 'ese'
order by count(*)
limit 100;


Executing query with job ID: 0a6ca437-346e-4416-aa3a-2d279a3631ae
Query executing: 1.08s
Query complete after 1.90s
Out[105]:
f0_
0 1488312

It should execute in just a few seconds. Then try running it again and see if you get the same performance. BigQuery automatically caches the results from the first time you ran the query and serves those same results when you run the query again. We can confirm this by analyzing the query job statistics.
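
You can also confirm the caching behavior without leaving the notebook: the Python client exposes cache_hit on the finished job. A minimal sketch (assuming default credentials) that runs a small query twice:

In [ ]:
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT COUNT(*) AS ese_stores
FROM `dw-workshop.tpcds_2t_baseline.store`
WHERE s_store_name = 'ese'
"""

# Run the same query twice; the second run should be served from cache.
first = client.query(sql)
first.result()
second = client.query(sql)
second.result()

print("First run:  cache_hit =", first.cache_hit, "| bytes processed =", first.total_bytes_processed)
print("Second run: cache_hit =", second.cache_hit, "| bytes processed =", second.total_bytes_processed)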

Viewing BigQuery job statistics

Let's list our five most recent query jobs run on BigQuery using the bq command line interface. Then we will get even more detail on our most recent job with the bq show command. Be sure to replace the job id with your own.


In [87]:
!bq ls -j -a -n 5


                 jobId                   Job Type    State      Start Time         Duration     
 -------------------------------------- ---------- --------- ----------------- ---------------- 
  14ec0f66-91ab-41f7-bf1f-3e5660cb9d3f   query      SUCCESS   15 Oct 15:34:27   0:00:03.835000  
  <REDACTED>                             None       FAILURE   15 Oct 15:33:01   0:00:17.561000  
  94a5a441-178c-4efa-a81a-bd450b27bca7   query      SUCCESS   15 Oct 15:32:15   0:01:09.065000  
  7fab5973-be13-4b84-acd5-63c2c3946225   query      SUCCESS   15 Oct 15:31:40   0:00:00.671000  
  5224adc7-d6d1-463a-9820-bb71139e27aa   query      SUCCESS   15 Oct 15:31:32   0:00:00.572000  

In [88]:
!bq show --format=prettyjson -j 612a4b28-cb5c-4e0b-ad5b-ebd51c3b2439


{
  "configuration": {
    "dryRun": false, 
    "jobType": "QUERY", 
    "query": {
      "createDisposition": "CREATE_IF_NEEDED", 
      "destinationTable": {
        "datasetId": "_26faa896567219ad1a8a420ad92caebbb6af636a", 
        "projectId": "dw-workshop", 
        "tableId": "anon0e42104bd9f24ce9a8762cb60cc3f67fa6aadf66"
      }, 
      "priority": "INTERACTIVE", 
      "query": "# start query 1 in stream 0 using template query96.tpl\nselect  count(*) \nfrom tpcds_2t_baseline.store_sales\n    ,tpcds_2t_baseline.household_demographics \n    ,tpcds_2t_baseline.time_dim, tpcds_2t_baseline.store\nwhere ss_sold_time_sk = time_dim.t_time_sk   \n    and ss_hdemo_sk = household_demographics.hd_demo_sk \n    and ss_store_sk = s_store_sk\n    and time_dim.t_hour = 8\n    and time_dim.t_minute >= 30\n    and household_demographics.hd_dep_count = 5\n    and store.s_store_name = 'ese'\norder by count(*)\nlimit 100;\n", 
      "useLegacySql": false, 
      "writeDisposition": "WRITE_TRUNCATE"
    }
  }, 
  "etag": "8VQdnl1bV4S/9jpByl9Qvg==", 
  "id": "dw-workshop:US.612a4b28-cb5c-4e0b-ad5b-ebd51c3b2439", 
  "jobReference": {
    "jobId": "612a4b28-cb5c-4e0b-ad5b-ebd51c3b2439", 
    "location": "US", 
    "projectId": "dw-workshop"
  }, 
  "kind": "bigquery#job", 
  "selfLink": "https://bigquery.googleapis.com/bigquery/v2/projects/dw-workshop/jobs/612a4b28-cb5c-4e0b-ad5b-ebd51c3b2439?location=US", 
  "statistics": {
    "creationTime": "1571002972799", 
    "endTime": "1571002973547", 
    "query": {
      "cacheHit": true, 
      "statementType": "SELECT", 
      "totalBytesBilled": "0", 
      "totalBytesProcessed": "0"
    }, 
    "startTime": "1571002973360", 
    "totalBytesProcessed": "0"
  }, 
  "status": {
    "state": "DONE"
  }, 
  "user_email": "796822948179-compute@developer.gserviceaccount.com"
}

Looking at the job statistics, we can see that our most recent query hit the cache:

  • cacheHit: true and therefore
  • totalBytesProcessed: 0.

While this is great for normal BigQuery use (you aren't charged for queries that hit the cache), it ruins our performance test. Cache is super useful, but we want to disable it for testing purposes.

Disabling Cache and Dry Running Queries

As of the time this lab was created, you can't pass a flag to the %%bigquery IPython notebook magic to disable the cache or to quickly see the amount of data processed. So we will use the traditional bq command line interface in bash.

First we will do a dry run of the query, which processes no data and simply reports how many bytes the query would process. Then we will remove the dry-run flag and keep --nouse_cache set so the real run doesn't return cached results.
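
If you prefer to stay in Python rather than bash, the client library exposes the same two switches through QueryJobConfig. A minimal sketch of the dry run and the uncached run (assuming default credentials; the cost estimate uses the $5/TB on-demand rate discussed below):

In [ ]:
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT COUNT(*)
FROM `dw-workshop.tpcds_2t_baseline.store_sales` AS store_sales,
     `dw-workshop.tpcds_2t_baseline.household_demographics` AS household_demographics,
     `dw-workshop.tpcds_2t_baseline.time_dim` AS time_dim,
     `dw-workshop.tpcds_2t_baseline.store` AS store
WHERE ss_sold_time_sk = time_dim.t_time_sk
  AND ss_hdemo_sk = household_demographics.hd_demo_sk
  AND ss_store_sk = s_store_sk
  AND time_dim.t_hour = 8
  AND time_dim.t_minute >= 30
  AND household_demographics.hd_dep_count = 5
  AND store.s_store_name = 'ese'
"""

# Dry run: estimate bytes scanned without actually running the query.
dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
gb = dry.total_bytes_processed / 1e9
print(f"Would process {gb:.1f} GB (~${gb / 1000 * 5:.2f} at the $5/TB on-demand rate)")

# Real run with the cache disabled so the timing is an honest benchmark.
job = client.query(sql, job_config=bigquery.QueryJobConfig(use_query_cache=False))
job.result()
print(f"Elapsed: {(job.ended - job.started).total_seconds():.1f} s | cache_hit: {job.cache_hit}")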


In [ ]:
%%bash 
bq query \
--dry_run \
--nouse_cache \
--use_legacy_sql=false \
"""\
select  count(*) 
from \`dw-workshop.tpcds_2t_baseline.store_sales\` as store_sales
    ,\`dw-workshop.tpcds_2t_baseline.household_demographics\` as household_demographics  
    ,\`dw-workshop.tpcds_2t_baseline.time_dim\` as time_dim, \`dw-workshop.tpcds_2t_baseline.store\` as store
where ss_sold_time_sk = time_dim.t_time_sk   
    and ss_hdemo_sk = household_demographics.hd_demo_sk 
    and ss_store_sk = s_store_sk
    and time_dim.t_hour = 8
    and time_dim.t_minute >= 30
    and household_demographics.hd_dep_count = 5
    and store.s_store_name = 'ese'
order by count(*)
limit 100;
"""

In [63]:
# Convert bytes to GB
132086388641 / 1e+9


Out[63]:
132.086388641

132 GB will be processed. At the time of writing, BigQuery pricing is $5 per 1 TB (1,000 GB) of data processed, after the first free 1 TB each month. Assuming we've already exhausted our free 1 TB this month, this query would cost about $0.66 to run.

Now let's run it and ensure we're not pulling from the cache so we get an accurate time-to-completion benchmark.


In [ ]:
%%bash 
bq query \
--nouse_cache \
--use_legacy_sql=false \
"""\
select  count(*) 
from \`dw-workshop.tpcds_2t_baseline.store_sales\` as store_sales
    ,\`dw-workshop.tpcds_2t_baseline.household_demographics\` as household_demographics  
    ,\`dw-workshop.tpcds_2t_baseline.time_dim\` as time_dim, \`dw-workshop.tpcds_2t_baseline.store\` as store
where ss_sold_time_sk = time_dim.t_time_sk   
    and ss_hdemo_sk = household_demographics.hd_demo_sk 
    and ss_store_sk = s_store_sk
    and time_dim.t_hour = 8
    and time_dim.t_minute >= 30
    and household_demographics.hd_dep_count = 5
    and store.s_store_name = 'ese'
order by count(*)
limit 100;
"""

If you're an experienced BigQuery user, you likely have seen these same metrics in the Web UI as well as highlighted in the red box below:

It's a matter of preference whether you do your work in the Web UI or the command line -- each has its advantages.

One major advantage of using the bq command line interface is the ability to create a script that will run the remaining 98 benchmark queries for us and log the results.

Copy the dw-workshop dataset into your own GCP project

We will use the BigQuery Data Transfer Service to quickly copy the large dataset from the dw-workshop GCP project into your own project so you can perform the benchmarking. A programmatic alternative is sketched after the console steps below.

Create a new baseline dataset in your project


In [ ]:
%%bash

export PROJECT_ID=$(gcloud config list --format 'value(core.project)')
export BENCHMARK_DATASET_NAME=tpcds_2t_baseline # Name of the dataset you want to create

## Create the BigQuery dataset $BENCHMARK_DATASET_NAME if it doesn't already exist
datasetexists=$(bq ls -d | grep -w $BENCHMARK_DATASET_NAME)

if [ -n "$datasetexists" ]; then
    echo -e "BigQuery dataset $BENCHMARK_DATASET_NAME already exists, let's not recreate it."

else
    echo "Creating BigQuery dataset titled: $BENCHMARK_DATASET_NAME"
    
    bq --location=US mk --dataset \
        --description 'Benchmark Dataset' \
        $PROJECT_ID:$BENCHMARK_DATASET_NAME
   echo -e "\nHere are your current datasets:"
   bq ls
fi

Use the BigQuery Data Transfer Service to copy an existing dataset

  1. Enable the BigQuery Data Transfer Service API
  2. Navigate to the BigQuery console and the existing dw-workshop dataset
  3. Click Copy Dataset
  4. In the pop-up, choose your project name and the newly created dataset name from the previous step
  5. Click Copy
  6. Wait for the transfer to complete
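
The console steps above are all this lab requires. If you would rather script the copy, the Data Transfer Service also has a Python client; here is a rough sketch, assuming the google-cloud-bigquery-datatransfer package is installed and with your-project-id as a placeholder for your own project:

In [ ]:
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

destination_project_id = "your-project-id"  # placeholder: replace with your project ID

# A dataset copy is a transfer config that uses the cross_region_copy data source.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="tpcds_2t_baseline",
    display_name="Copy TPC-DS baseline from dw-workshop",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": "dw-workshop",
        "source_dataset_id": "tpcds_2t_baseline",
    },
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(destination_project_id),
    transfer_config=transfer_config,
)
print("Created transfer config:", transfer_config.name)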

Verify you now have the baseline data in your project

Run the below query and confirm you see data. Note that if you omit the project-id ahead of the dataset name in the FROM clause, BigQuery will assume your default project.


In [110]:
%%bigquery
SELECT COUNT(*) AS store_transaction_count
FROM tpcds_2t_baseline.store_sales


Out[110]:
store_transaction_count
0 5762820700

Set up an automated test

Running each of the 99 queries manually via the Console UI would be a tedious effort. Instead, we'll show you how to run all 99 programmatically and automatically log the output (time and GB processed) to a BigQuery table for analysis.

Below is a shell script that:

  1. Accepts a BigQuery dataset to benchmark
  2. Accepts a list of semicolon-separated queries to run
  3. Loops through each query and calls the bq query command
  4. Records the execution time into a separate BigQuery performance table perf

Execute the statement below and follow along with the results as you benchmark a few example queries (don't worry, we've already run the full 99 recently so you won't have to).

After executing, wait 1-2 minutes for the benchmark test to complete


In [91]:
%%bash
# runs the SQL queries from the TPCDS benchmark 


BQ_DATASET="tpcds_2t_baseline" # let's start by benchmarking our baseline dataset 
QUERY_FILE_PATH="./sql/example_baseline_queries.sql" # the full test is on 99_baseline_queries but that will take 80+ mins to run
IFS=";"

# create perf table to keep track of run times for all 99 queries
printf "\033[32;1m Housekeeping tasks... \033[0m\n\n";
printf "Creating a reporting table perf to track how fast each query runs...";
perf_table_ddl="CREATE TABLE IF NOT EXISTS $BQ_DATASET.perf(performance_test_num int64, query_num int64, elapsed_time_sec int64, ran_on int64)"
bq rm -f $BQ_DATASET.perf
bq query --nouse_legacy_sql $perf_table_ddl 

start=$(date +%s)
index=0
for select_stmt in $(<$QUERY_FILE_PATH) 
do 
  # run the test until you hit a line with the string 'END OF BENCHMARK' in the file
  if [[ "$select_stmt" == *'END OF BENCHMARK'* ]]; then
    break
  fi

  printf "\n\033[32;1m Let's benchmark this query... \033[0m\n";
  printf "$select_stmt";
  
  SECONDS=0;
  bq query --use_cache=false --nouse_legacy_sql $select_stmt # critical to turn cache off for this test
  duration=$SECONDS

  # get the current timestamp in epoch seconds
  ran_on=$(date +%s)

  index=$((index+1))

  printf "\n\033[32;1m Here's how long it took... \033[0m\n\n";
  echo "Query $index ran in $(($duration / 60)) minutes and $(($duration % 60)) seconds."

  printf "\n\033[32;1m Writing to our benchmark table... \033[0m\n\n";
  insert_stmt="insert into $BQ_DATASET.perf(performance_test_num, query_num, elapsed_time_sec, ran_on) values($start, $index, $duration, $ran_on)"
  printf "$insert_stmt"
  bq query --nouse_legacy_sql $insert_stmt
done

end=$(date +%s)

printf "Benchmark test complete"


Process is interrupted.

Viewing the benchmark results

As part of the benchmark test, we stored the processing time of each query into a new perf BigQuery table. We can query that table and get some performance stats for our test.

First, here are the individual queries from our most recent test:


In [66]:
%%bigquery
SELECT * FROM tpcds_2t_baseline.perf
WHERE 
 # Let's only pull the results from our most recent test
 performance_test_num = (SELECT MAX(performance_test_num) FROM tpcds_2t_baseline.perf)
ORDER BY ran_on


Out[66]:
performance_test_num query_num elapsed_time_sec ran_on
0 1571014412 1 5 1571014417
1 1571014412 2 10 1571014430
2 1571014412 3 25 1571014458
3 1571014412 4 25 1571014487

And finally, the overall statistics for the entire test:


In [67]:
%%bigquery
SELECT
  TIMESTAMP_SECONDS(MAX(performance_test_num)) AS test_date,
  MAX(performance_test_num) AS latest_performance_test_num,
  COUNT(DISTINCT query_num) AS count_queries_benchmarked,
  SUM(elapsed_time_sec) AS total_time_sec,
  MIN(elapsed_time_sec) AS fastest_query_time_sec,
  MAX(elapsed_time_sec) AS slowest_query_time_sec
FROM
  tpcds_2t_baseline.perf
WHERE
  performance_test_num = (SELECT MAX(performance_test_num) FROM tpcds_2t_baseline.perf)


Out[67]:
test_date latest_performance_test_num count_queries_benchmarked total_time_sec fastest_query_time_sec slowest_query_time_sec
0 2019-10-14 00:53:32+00:00 1571014412 4 65 5 25

Benchmarking all 99 queries

As we mentioned before, we already ran all 99 queries, recorded the results, and made them available for you to query in a public table:


In [113]:
%%bigquery
SELECT 
    TIMESTAMP_SECONDS(performance_test_num) AS test_date,
    query_num,
    TIMESTAMP_SECONDS(ran_on) AS query_ran_on,
    TIMESTAMP_SECONDS(ran_on + elapsed_time_sec) AS query_completed_on,
    elapsed_time_sec
FROM `dw-workshop.tpcds_2t_baseline.perf` # public table
WHERE 
 # Let's only pull the results from our most recent test
 performance_test_num = (SELECT MAX(performance_test_num) FROM `dw-workshop.tpcds_2t_baseline.perf`)
ORDER BY ran_on


Out[113]:
test_date query_num query_ran_on query_completed_on elapsed_time_sec
0 2019-10-15 15:36:24+00:00 1 2019-10-15 15:36:28+00:00 2019-10-15 15:36:32+00:00 4
1 2019-10-15 15:36:24+00:00 2 2019-10-15 15:37:22+00:00 2019-10-15 15:38:13+00:00 51
2 2019-10-15 15:36:24+00:00 3 2019-10-15 15:38:16+00:00 2019-10-15 15:39:06+00:00 50
3 2019-10-15 15:36:24+00:00 4 2019-10-15 15:39:10+00:00 2019-10-15 15:40:00+00:00 50
4 2019-10-15 15:36:24+00:00 5 2019-10-15 15:39:23+00:00 2019-10-15 15:39:32+00:00 9
5 2019-10-15 15:36:24+00:00 6 2019-10-15 15:40:53+00:00 2019-10-15 15:42:20+00:00 87
6 2019-10-15 15:36:24+00:00 7 2019-10-15 15:41:33+00:00 2019-10-15 15:42:09+00:00 36
7 2019-10-15 15:36:24+00:00 8 2019-10-15 15:41:50+00:00 2019-10-15 15:42:03+00:00 13
8 2019-10-15 15:36:24+00:00 9 2019-10-15 15:43:44+00:00 2019-10-15 15:45:35+00:00 111
9 2019-10-15 15:36:24+00:00 10 2019-10-15 15:47:33+00:00 2019-10-15 15:51:19+00:00 226
10 2019-10-15 15:36:24+00:00 11 2019-10-15 15:48:02+00:00 2019-10-15 15:48:27+00:00 25
11 2019-10-15 15:36:24+00:00 12 2019-10-15 15:49:13+00:00 2019-10-15 15:50:21+00:00 68
12 2019-10-15 15:36:24+00:00 13 2019-10-15 15:49:33+00:00 2019-10-15 15:49:50+00:00 17
13 2019-10-15 15:36:24+00:00 14 2019-10-15 15:49:41+00:00 2019-10-15 15:49:46+00:00 5
14 2019-10-15 15:36:24+00:00 15 2019-10-15 15:50:21+00:00 2019-10-15 15:50:57+00:00 36
15 2019-10-15 15:36:24+00:00 16 2019-10-15 15:50:49+00:00 2019-10-15 15:51:14+00:00 25
16 2019-10-15 15:36:24+00:00 17 2019-10-15 15:51:57+00:00 2019-10-15 15:52:48+00:00 51
17 2019-10-15 15:36:24+00:00 18 2019-10-15 15:54:46+00:00 2019-10-15 15:57:32+00:00 166
18 2019-10-15 15:36:24+00:00 19 2019-10-15 15:55:15+00:00 2019-10-15 15:55:41+00:00 26
19 2019-10-15 15:36:24+00:00 20 2019-10-15 15:59:36+00:00 2019-10-15 16:03:54+00:00 258
20 2019-10-15 15:36:24+00:00 21 2019-10-15 16:00:04+00:00 2019-10-15 16:00:29+00:00 25
21 2019-10-15 15:36:24+00:00 22 2019-10-15 16:00:59+00:00 2019-10-15 16:01:50+00:00 51
22 2019-10-15 15:36:24+00:00 23 2019-10-15 16:02:10+00:00 2019-10-15 16:03:18+00:00 68
23 2019-10-15 15:36:24+00:00 24 2019-10-15 16:02:31+00:00 2019-10-15 16:02:48+00:00 17
24 2019-10-15 15:36:24+00:00 25 2019-10-15 16:05:55+00:00 2019-10-15 16:09:12+00:00 197
25 2019-10-15 15:36:24+00:00 26 2019-10-15 16:07:06+00:00 2019-10-15 16:08:14+00:00 68
26 2019-10-15 15:36:24+00:00 27 2019-10-15 16:07:17+00:00 2019-10-15 16:07:25+00:00 8
27 2019-10-15 15:36:24+00:00 28 2019-10-15 16:08:12+00:00 2019-10-15 16:09:03+00:00 51
28 2019-10-15 15:36:24+00:00 29 2019-10-15 16:09:06+00:00 2019-10-15 16:09:57+00:00 51
29 2019-10-15 15:36:24+00:00 30 2019-10-15 16:10:00+00:00 2019-10-15 16:10:51+00:00 51
30 2019-10-15 15:36:24+00:00 31 2019-10-15 16:10:56+00:00 2019-10-15 16:11:46+00:00 50
31 2019-10-15 15:36:24+00:00 32 2019-10-15 16:11:43+00:00 2019-10-15 16:12:20+00:00 37
32 2019-10-15 15:36:24+00:00 33 2019-10-15 16:12:54+00:00 2019-10-15 16:14:02+00:00 68
33 2019-10-15 15:36:24+00:00 34 2019-10-15 16:14:05+00:00 2019-10-15 16:15:13+00:00 68
34 2019-10-15 15:36:24+00:00 35 2019-10-15 16:16:27+00:00 2019-10-15 16:18:44+00:00 137
35 2019-10-15 15:36:24+00:00 36 2019-10-15 16:17:12+00:00 2019-10-15 16:17:51+00:00 39
36 2019-10-15 15:36:24+00:00 37 2019-10-15 16:18:05+00:00 2019-10-15 16:18:55+00:00 50
37 2019-10-15 15:36:24+00:00 38 2019-10-15 16:20:55+00:00 2019-10-15 16:23:41+00:00 166
38 2019-10-15 15:36:24+00:00 39 2019-10-15 16:21:26+00:00 2019-10-15 16:21:52+00:00 26
39 2019-10-15 15:36:24+00:00 40 2019-10-15 16:21:48+00:00 2019-10-15 16:22:06+00:00 18
40 2019-10-15 15:36:24+00:00 41 2019-10-15 16:25:39+00:00 2019-10-15 16:29:26+00:00 227
41 2019-10-15 15:36:24+00:00 42 2019-10-15 16:27:33+00:00 2019-10-15 16:29:24+00:00 111
42 2019-10-15 15:36:24+00:00 43 2019-10-15 16:31:37+00:00 2019-10-15 16:35:24+00:00 227
43 2019-10-15 15:36:24+00:00 44 2019-10-15 16:32:08+00:00 2019-10-15 16:32:33+00:00 25
44 2019-10-15 15:36:24+00:00 45 2019-10-15 16:32:22+00:00 2019-10-15 16:32:33+00:00 11
45 2019-10-15 15:36:24+00:00 46 2019-10-15 16:34:42+00:00 2019-10-15 16:36:59+00:00 137
46 2019-10-15 15:36:24+00:00 47 2019-10-15 16:36:14+00:00 2019-10-15 16:37:43+00:00 89
47 2019-10-15 15:36:24+00:00 48 2019-10-15 16:37:26+00:00 2019-10-15 16:38:34+00:00 68
48 2019-10-15 15:36:24+00:00 49 2019-10-15 16:38:06+00:00 2019-10-15 16:38:43+00:00 37
49 2019-10-15 15:36:24+00:00 50 2019-10-15 16:39:01+00:00 2019-10-15 16:39:53+00:00 52
50 2019-10-15 15:36:24+00:00 51 2019-10-15 16:43:30+00:00 2019-10-15 16:47:47+00:00 257

And the results of the complete test:


In [117]:
%%bigquery
SELECT
  TIMESTAMP_SECONDS(MAX(performance_test_num)) AS test_date,
  COUNT(DISTINCT query_num) AS count_queries_benchmarked,
  SUM(elapsed_time_sec) AS total_time_sec,
  ROUND(SUM(elapsed_time_sec)/60,2) AS total_time_min,
  MIN(elapsed_time_sec) AS fastest_query_time_sec,
  MAX(elapsed_time_sec) AS slowest_query_time_sec,
  ROUND(AVG(elapsed_time_sec),2) AS avg_query_time_sec
FROM
  `dw-workshop.tpcds_2t_baseline.perf`
WHERE
  performance_test_num = (SELECT MAX(performance_test_num) FROM `dw-workshop.tpcds_2t_baseline.perf`)


Out[117]:
test_date count_queries_benchmarked total_time_sec total_time_min fastest_query_time_sec slowest_query_time_sec avg_query_time_sec
0 2019-10-15 15:36:24+00:00 51 3801 63.35 4 258 74.53

Note the total_time_sec (3,801 seconds, or roughly 63 minutes, in the run shown above), which we will look to beat in future labs by applying BigQuery optimization techniques like the following (a rough sketch of the DDL involved follows the list):

  • Partitioning and Clustering
  • Nesting repeated fields
  • Denormalizing with STRUCT data types
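
To preview where those labs are headed, below is a rough sketch of the kind of DDL involved; the target dataset name and the integer-range bucket bounds are illustrative placeholders, not the solution used in later labs:

In [ ]:
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative only: integer-range partitioning on the date surrogate key plus
# clustering on the item key. The dataset name and bucket bounds are placeholders,
# and this statement rewrites ~1.5 TB of data, so don't run it casually.
ddl = """
CREATE TABLE IF NOT EXISTS tpcds_2t_optimized.store_sales
PARTITION BY RANGE_BUCKET(ss_sold_date_sk, GENERATE_ARRAY(2450000, 2500000, 1000))
CLUSTER BY ss_item_sk
AS
SELECT * FROM `dw-workshop.tpcds_2t_baseline.store_sales`
"""
client.query(ddl).result()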

Congratulations!

And there you have it! You successfully ran a performance benchmark test against your data warehouse. Continue on with the labs in this series to learn optimization strategies to boost your performance.