In this lab, you will take an existing 2 TB+ TPC-DS benchmark dataset and learn the data warehouse optimization methods you can apply to it in BigQuery to improve performance.
In this lab, you will learn how to:
This is an advanced-level SQL lab. Before taking it, you should have experience with SQL; familiarity with BigQuery is also highly recommended. If you need to get up to speed in these areas, take this Data Analyst series of labs first:
Once you're ready, scroll down to learn about the services you will be using and how to properly set up your lab environment.
BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without managing infrastructure or needing a database administrator. BigQuery uses SQL and takes advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.
In order to benchmark the performance of a data warehouse we first must get tables and data to run queries against. There is a public organization, TPC, that provides large benchmarking datasets to companies explicitly for this purpose. The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users.
The TPC-DS dataset we will be using comprises 25 tables and 99 queries that simulate common data analysis tasks. View the full documentation here.
The TPC-DS tables have been loaded into BigQuery, and you will explore ways to optimize the performance of common queries by using BigQuery data warehousing best practices. We have limited the size to 2 TB to keep this lab's runtime manageable, but the dataset itself can be expanded as needed.
Note: The TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (http://www.tpc.org). The Cloud DW benchmark is derived from the TPC-DS Benchmark and as such is not comparable to published TPC-DS results.
In [80]:
%%bigquery
SELECT
  dataset_id,
  table_id,
  -- Convert bytes to GB.
  ROUND(size_bytes/POW(10,9),2) AS size_gb,
  -- Convert UNIX EPOCH to a timestamp.
  TIMESTAMP_MILLIS(creation_time) AS creation_time,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time,
  row_count,
  CASE
    WHEN type = 1 THEN 'table'
    WHEN type = 2 THEN 'view'
    ELSE NULL
  END AS type
FROM
  `dw-workshop.tpcds_2t_baseline.__TABLES__`
ORDER BY size_gb DESC
Out[80]:
The core tables in the data warehouse are derived from 5 separate core operational systems (each with many tables):
These systems are driven by the core functions of our retail business. As you can see, our store accepts sales from online (web), mail-order (catalog), and in-store. The business must keep track of inventory and can offer promotional discounts on items sold.
In [81]:
%%bigquery
SELECT * FROM
`dw-workshop.tpcds_2t_baseline.INFORMATION_SCHEMA.COLUMNS`
Out[81]:
Question:
In [82]:
%%bigquery
SELECT * FROM
`dw-workshop.tpcds_2t_baseline.INFORMATION_SCHEMA.COLUMNS`
WHERE
is_partitioning_column = 'YES' OR clustering_ordinal_position IS NOT NULL
Out[82]:
Question
In [83]:
%%bigquery
SELECT
COUNT(column_name) AS column_count,
table_name
FROM
`dw-workshop.tpcds_2t_baseline.INFORMATION_SCHEMA.COLUMNS`
GROUP BY table_name
ORDER BY column_count DESC, table_name
Out[83]:
In the BigQuery UI, find the Resources panel and search for catalog_sales. Click on the catalog_sales table name in the tpcds_2t_baseline dataset.
Question
Hint: Use the Details button in the web UI to quickly access table metadata.
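If you prefer to stay in the notebook, roughly the same metadata is available through the __TABLES__ view we queried earlier. Here is a minimal sketch filtered to catalog_sales:

In [ ]:
%%bigquery
SELECT
  table_id,
  row_count,
  -- Convert bytes to GB, as before.
  ROUND(size_bytes/POW(10,9),2) AS size_gb,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time
FROM
  `dw-workshop.tpcds_2t_baseline.__TABLES__`
WHERE
  table_id = 'catalog_sales'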
Question: Preview the data and find the Catalog Sales Extended Sales Price cs_ext_sales_price field (which is calculated based on product quantity cs_quantity * sales price)?
In [84]:
%%bigquery --verbose
SELECT
cs_item_sk,
COUNT(cs_order_number) AS total_orders,
SUM(cs_quantity) AS total_quantity,
SUM(cs_ext_sales_price) AS total_revenue,
SUM(cs_net_profit) AS total_profit
FROM
`dw-workshop.tpcds_2t_baseline.catalog_sales`
GROUP BY
cs_item_sk
ORDER BY
total_orders DESC
LIMIT
10
Out[84]:
A note on our data: The TPC-DS benchmark allows data warehouse practitioners to generate any volume of data programmatically. Since the rows of data are system generated, they may not make the most sense in a business context (like why are we selling our top product at such a huge loss!).
The good news is that to benchmark our performance we care most about the volume of rows and columns to run our benchmark against.
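As a quick sanity check on that volume, we can total the rows and storage across the whole baseline dataset by reusing the __TABLES__ metadata view from the first query (a small sketch, counting tables only):

In [ ]:
%%bigquery
SELECT
  COUNT(*) AS table_count,
  SUM(row_count) AS total_rows,
  -- Convert bytes to TB.
  ROUND(SUM(size_bytes)/POW(10,12),2) AS total_size_tb
FROM
  `dw-workshop.tpcds_2t_baseline.__TABLES__`
WHERE
  type = 1  -- tables only, per the earlier CASE expression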
Click on Execution details
Refer to the chart below (which should be similar to your results) and answer the following questions.
Question
We know the query took 5.1 seconds to run, so what does the 1 hr 24 min slot time metric mean?
Inside of the BigQuery service are lots of virtual machines that massively process your data and query logic in parallel. These workers, or "slots", work together to process a single query job really quickly. For accounts with on-demand pricing, you can have up to 2,000 slots.
So say we had 30 minutes of slot time or 1800 seconds. If the query took 20 seconds in total to run, but it was 1800 seconds worth of work, how many workers at minimum worked on it? 1800/20 = 90
And that's assuming each worker instantly had all the data it needed (no shuffling of data between workers) and was at full capacity for all 20 seconds!
In reality, workers have a variety of tasks (waiting for data, reading it, performing computations, and writing data) and also need to compare notes with each other on what work was already done on the job. The good news for you is that you don't need to worry about optimizing these workers or the underlying data to run perfectly in parallel. That's why BigQuery is a managed service -- there's an entire team dedicated to hardware and data storage optimization.
In case you were wondering, the worker limit for your project is 2,000 slots at once.
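If you want to check this ratio for your own jobs with SQL, the INFORMATION_SCHEMA jobs views expose total_slot_ms alongside each job's start and end time. This is a sketch rather than a required lab step; it assumes the JOBS_BY_PROJECT view is available to your project and that your jobs ran in the US multi-region:

In [ ]:
%%bigquery
SELECT
  job_id,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS elapsed_ms,
  -- Rough lower bound on average parallelism, like the 1800/20 = 90 example above.
  ROUND(total_slot_ms / NULLIF(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND), 0), 1) AS approx_avg_slots
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'QUERY'
ORDER BY
  creation_time DESC
LIMIT 5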
To benchmark the performance of our data warehouse in BigQuery, we need more than just a single SQL report. The good news is that the TPC-DS dataset ships with 99 standard benchmark queries that we can run while logging the performance outcomes.
In this lab, we make no adjustments to the existing data warehouse tables (no partitioning, no clustering, no nesting) so we can establish a performance benchmark to beat in future labs.
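For orientation only, here is the general shape of the partitioning and clustering DDL a later lab might apply. This is a hypothetical sketch: the destination dataset name, the RANGE_BUCKET bounds, and the choice of cs_sold_date_sk / cs_item_sk are illustrative placeholders, not values prescribed by this lab, and running it would materialize a full copy of the table.

In [ ]:
%%bigquery
-- Illustrative only: an integer-range-partitioned, clustered copy of catalog_sales.
-- The bucket bounds are placeholders; a real lab would derive them from the
-- actual range of cs_sold_date_sk values. Running this copies the entire table.
CREATE TABLE IF NOT EXISTS tpcds_2t_flat_part_clust.catalog_sales
PARTITION BY RANGE_BUCKET(cs_sold_date_sk, GENERATE_ARRAY(2400000, 2500000, 1000))
CLUSTER BY cs_item_sk
AS
SELECT * FROM `dw-workshop.tpcds_2t_baseline.catalog_sales`

Again, nothing like this is applied here; the baseline numbers below come from the flat tables as-is.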
In [85]:
!head --lines=50 'sql/example_baseline_queries.sql'
Now let's run the first query against our dataset and note the execution time. Tip: You can use the --verbose flag in %%bigquery magics to return the job and completion time.
In [105]:
%%bigquery --verbose
# start query 1 in stream 0 using template query96.tpl
select count(*)
from `dw-workshop.tpcds_2t_baseline.store_sales` as store_sales
,`dw-workshop.tpcds_2t_baseline.household_demographics` as household_demographics
,`dw-workshop.tpcds_2t_baseline.time_dim` as time_dim,
`dw-workshop.tpcds_2t_baseline.store` as store
where ss_sold_time_sk = time_dim.t_time_sk
and ss_hdemo_sk = household_demographics.hd_demo_sk
and ss_store_sk = s_store_sk
and time_dim.t_hour = 8
and time_dim.t_minute >= 30
and household_demographics.hd_dep_count = 5
and store.s_store_name = 'ese'
order by count(*)
limit 100;
Out[105]:
It should execute in just a few seconds. Then try running it again and see if you get the same performance. BigQuery will automatically cache the results from the first time you ran the query and then serve those same results when you run the query again. We can confirm this by analyzing the query job statistics.
Let's list our five most recent query jobs run on BigQuery using the bq command line interface. Then we will get even more detail on our most recent job with the bq show command. Be sure to replace the job id with your own.
In [87]:
!bq ls -j -a -n 5
In [88]:
!bq show --format=prettyjson -j 612a4b28-cb5c-4e0b-ad5b-ebd51c3b2439
Looking at the job statistics, we can see our most recent query hit cache (cacheHit: true) and therefore totalBytesProcessed: 0. While this is great in normal BigQuery usage (you aren't charged for queries that hit cache), it ruins our performance test. Cache is super useful, but we want to disable it for testing purposes.
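As an aside, the same cacheHit and totalBytesProcessed figures can also be pulled with SQL from the INFORMATION_SCHEMA jobs view used in the earlier sketch, again assuming it is available to your project and your jobs ran in the US multi-region:

In [ ]:
%%bigquery
SELECT
  job_id,
  cache_hit,
  total_bytes_processed,
  creation_time
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'QUERY'
ORDER BY
  creation_time DESC
LIMIT 5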
As of the time this lab was created, you can't pass a flag to %%bigquery IPython notebook magics to disable cache or to quickly see the amount of data processed, so we will use the traditional bq command line interface in bash.

First we will do a dry run of the query without processing any data, just to see how many bytes would be processed. Then we will remove that flag and ensure --nouse_cache is set to avoid hitting cache as well.
In [ ]:
%%bash
bq query \
--dry_run \
--nouse_cache \
--use_legacy_sql=false \
"""\
select count(*)
from \`dw-workshop.tpcds_2t_baseline.store_sales\` as store_sales
,\`dw-workshop.tpcds_2t_baseline.household_demographics\` as household_demographics
,\`dw-workshop.tpcds_2t_baseline.time_dim\` as time_dim, \`dw-workshop.tpcds_2t_baseline.store\` as store
where ss_sold_time_sk = time_dim.t_time_sk
and ss_hdemo_sk = household_demographics.hd_demo_sk
and ss_store_sk = s_store_sk
and time_dim.t_hour = 8
and time_dim.t_minute >= 30
and household_demographics.hd_dep_count = 5
and store.s_store_name = 'ese'
order by count(*)
limit 100;
"""
In [63]:
# Convert bytes to GB
132086388641 / 1e+9
Out[63]:
132 GB will be processed. At the time of writing, BigQuery pricing is $5 per 1 TB (or 1,000 GB) of data after the first free 1 TB each month. Assuming we've exhausted our free 1 TB this month, this query would cost about $0.66 to run.
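If you'd rather do that back-of-the-envelope math in the query editor, here is a one-line sketch using the dry-run byte count from above:

In [ ]:
%%bigquery
SELECT
  -- Dry-run bytes * $5 per TB; assumes the free 1 TB for the month is already used.
  ROUND(132086388641 / POW(10,12) * 5, 2) AS estimated_cost_usd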
Now let's run it and ensure we're not pulling from cache so we get an accurate time-to-completion benchmark.
In [ ]:
%%bash
bq query \
--nouse_cache \
--use_legacy_sql=false \
"""\
select count(*)
from \`dw-workshop.tpcds_2t_baseline.store_sales\` as store_sales
,\`dw-workshop.tpcds_2t_baseline.household_demographics\` as household_demographics
,\`dw-workshop.tpcds_2t_baseline.time_dim\` as time_dim, \`dw-workshop.tpcds_2t_baseline.store\` as store
where ss_sold_time_sk = time_dim.t_time_sk
and ss_hdemo_sk = household_demographics.hd_demo_sk
and ss_store_sk = s_store_sk
and time_dim.t_hour = 8
and time_dim.t_minute >= 30
and household_demographics.hd_dep_count = 5
and store.s_store_name = 'ese'
order by count(*)
limit 100;
"""
If you're an experienced BigQuery user, you likely have seen these same metrics in the Web UI as well as highlighted in the red box below:
It's a matter of preference whether you do your work in the Web UI or the command line -- each has its advantages.
One major advantage of using the bq command line interface is the ability to create a script that will run the remaining 98 benchmark queries for us and log the results.

We will use the BigQuery Data Transfer Service to quickly copy our large dataset from the dw-workshop GCP project into your own so you can perform the benchmarking.
In [ ]:
%%bash
export PROJECT_ID=$(gcloud config list --format 'value(core.project)')
export BENCHMARK_DATASET_NAME=tpcds_2t_baseline # Name of the dataset you want to create

## Create the benchmark BigQuery dataset if it doesn't already exist
datasetexists=$(bq ls -d | grep -w $BENCHMARK_DATASET_NAME)

if [ -n "$datasetexists" ]; then
    echo -e "BigQuery dataset $BENCHMARK_DATASET_NAME already exists, let's not recreate it."
else
    echo "Creating BigQuery dataset titled: $BENCHMARK_DATASET_NAME"
    bq --location=US mk --dataset \
        --description 'Benchmark Dataset' \
        $PROJECT_ID:$BENCHMARK_DATASET_NAME
    echo -e "\nHere are your current datasets:"
    bq ls
fi
In the BigQuery UI, select the tpcds_2t_baseline dataset in the dw-workshop project and click Copy, choosing the tpcds_2t_baseline dataset you just created as the destination. Wait for the transfer to complete.
In [110]:
%%bigquery
SELECT COUNT(*) AS store_transaction_count
FROM tpcds_2t_baseline.store_sales
Out[110]:
Running each of the 99 queries manually via the Console UI would be a tedious effort. We'll show you how you can run all 99 programmatically and automatically log the output (time and GB processed) to a log file for analysis.
Below is a shell script that:
- creates a perf reporting table to track how fast each query runs
- loops through each query in the benchmark SQL file and runs it with the bq query command (with cache disabled)
- logs each query's elapsed time to the perf table
Execute the statement below and follow along with the results as you benchmark a few example queries (don't worry, we've already run the full 99 recently, so you won't have to).
After executing, wait 1-2 minutes for the benchmark test to complete
In [91]:
%%bash
# Runs the SQL queries from the TPC-DS benchmark and records how long each one takes
BQ_DATASET="tpcds_2t_baseline" # let's start by benchmarking our baseline dataset
QUERY_FILE_PATH="./sql/example_baseline_queries.sql" # the full test is on 99_baseline_queries but that will take 80+ mins to run
IFS=";"
# create perf table to keep track of run times for all 99 queries
printf "\033[32;1m Housekeeping tasks... \033[0m\n\n";
printf "Creating a reporting table perf to track how fast each query runs...";
perf_table_ddl="CREATE TABLE IF NOT EXISTS $BQ_DATASET.perf(performance_test_num int64, query_num int64, elapsed_time_sec int64, ran_on int64)"
bq rm -f $BQ_DATASET.perf
bq query --nouse_legacy_sql $perf_table_ddl
start=$(date +%s)
index=0
for select_stmt in $(<$QUERY_FILE_PATH)
do
    # run the test until you hit a line with the string 'END OF BENCHMARK' in the file
    if [[ "$select_stmt" == *'END OF BENCHMARK'* ]]; then
        break
    fi

    printf "\n\033[32;1m Let's benchmark this query... \033[0m\n";
    printf "$select_stmt";

    SECONDS=0;
    bq query --use_cache=false --nouse_legacy_sql $select_stmt # critical to turn cache off for this test
    duration=$SECONDS

    # get the current timestamp in seconds (UNIX epoch)
    ran_on=$(date +%s)
    index=$((index+1))

    printf "\n\033[32;1m Here's how long it took... \033[0m\n\n";
    echo "Query $index ran in $(($duration / 60)) minutes and $(($duration % 60)) seconds."

    printf "\n\033[32;1m Writing to our benchmark table... \033[0m\n\n";
    insert_stmt="insert into $BQ_DATASET.perf(performance_test_num, query_num, elapsed_time_sec, ran_on) values($start, $index, $duration, $ran_on)"
    printf "$insert_stmt"
    bq query --nouse_legacy_sql $insert_stmt
done
end=$(date +%s)
printf "Benchmark test complete"
In [66]:
%%bigquery
SELECT * FROM tpcds_2t_baseline.perf
WHERE
# Let's only pull the results from our most recent test
performance_test_num = (SELECT MAX(performance_test_num) FROM tpcds_2t_baseline.perf)
ORDER BY ran_on
Out[66]:
And finally, the overall statistics for the entire test:
In [67]:
%%bigquery
SELECT
TIMESTAMP_SECONDS(MAX(performance_test_num)) AS test_date,
MAX(performance_test_num) AS latest_performance_test_num,
COUNT(DISTINCT query_num) AS count_queries_benchmarked,
SUM(elapsed_time_sec) AS total_time_sec,
MIN(elapsed_time_sec) AS fastest_query_time_sec,
MAX(elapsed_time_sec) AS slowest_query_time_sec
FROM
tpcds_2t_baseline.perf
WHERE
performance_test_num = (SELECT MAX(performance_test_num) FROM tpcds_2t_baseline.perf)
Out[67]:
In [113]:
%%bigquery
SELECT
TIMESTAMP_SECONDS(performance_test_num) AS test_date,
query_num,
TIMESTAMP_SECONDS(ran_on) AS query_ran_on,
TIMESTAMP_SECONDS(ran_on + elapsed_time_sec) AS query_completed_on,
elapsed_time_sec
FROM `dw-workshop.tpcds_2t_baseline.perf` # public table
WHERE
# Let's only pull the results from our most recent test
performance_test_num = (SELECT MAX(performance_test_num) FROM `dw-workshop.tpcds_2t_baseline.perf`)
ORDER BY ran_on
Out[113]:
And the results of the complete test:
In [117]:
%%bigquery
SELECT
TIMESTAMP_SECONDS(MAX(performance_test_num)) AS test_date,
COUNT(DISTINCT query_num) AS count_queries_benchmarked,
SUM(elapsed_time_sec) AS total_time_sec,
ROUND(SUM(elapsed_time_sec)/60,2) AS total_time_min,
MIN(elapsed_time_sec) AS fastest_query_time_sec,
MAX(elapsed_time_sec) AS slowest_query_time_sec,
ROUND(AVG(elapsed_time_sec),2) AS avg_query_time_sec
FROM
`dw-workshop.tpcds_2t_baseline.perf`
WHERE
performance_test_num = (SELECT MAX(performance_test_num) FROM `dw-workshop.tpcds_2t_baseline.perf`)
Out[117]:
Note the total_time_sec of 4760 seconds (or 79 minutes), which we will look to beat in future labs by applying BigQuery optimization techniques like: