Google BigQuery Standard SQL supports query parameterization, which makes it possible to use Python variables defined in the notebook as parameter values in SQL.
This notebook is an example of how to use parameterized queries.
In [2]:
%%bq query -n logs_query
SELECT * FROM `cloud-datalab-samples.httplogs.logs_20140615`
In [3]:
%bq sample -q logs_query --count 10
Out[3]:
In [6]:
%%bq query
SELECT endpoint FROM `cloud-datalab-samples.httplogs.logs_20140615` GROUP BY endpoint
Out[6]:
Parameters are declared in SQL queries using the @name
syntax, and name
's value is supplied when the query is executed. Note that you have to define the query and execute it in two separate cells: the shorthand way of running queries (using %%bq query
without --name
) gives you little control over the execution of the query.
In [7]:
%%bq query -n endpoint_stats
SELECT *
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = @endpoint
LIMIT 10
In [8]:
%%bq execute -q endpoint_stats
parameters:
- name: endpoint
type: STRING
value: Interact2
Out[8]:
This defined a SQL query with a string parameter named endpoint
, whose value can be supplied when executing the query. Let's give it a value in a separate cell:
In [9]:
endpoint_val = 'Interact3'
To reference the variable defined above, Google Cloud Datalab offers the $var
syntax, which can be used in the magic command:
In [10]:
%%bq execute -q endpoint_stats
parameters:
- name: endpoint
type: STRING
value: $endpoint_val
Out[10]:
The same can also be achieved using the Python API instead of the magic commands (%%bq
). This is how to create and execute a parameterized query using the API:
In [11]:
import google.datalab.bigquery as bq

endpoint_stats2 = bq.Query(sql='''
SELECT *
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = @endpoint
LIMIT 10
''')

endpoint_value = 'Interact3'

query_parameters = [
  {
    'name': 'endpoint',
    'parameterType': {'type': 'STRING'},
    'parameterValue': {'value': endpoint_value}
  }
]

job = endpoint_stats2.execute(query_params=query_parameters)
job.result()
Out[11]:
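The parameter list passed to execute() follows the shape of the BigQuery REST API (name, parameterType, parameterValue). As a sketch, a small helper (hypothetical, not part of the Datalab API) could build that structure from plain Python values, assuming only the scalar types listed in its type map:

```python
# Hypothetical helper (not part of the Datalab API) that builds
# BigQuery query-parameter dictionaries from plain Python values.
def make_query_parameters(**values):
    # Map Python types to BigQuery Standard SQL parameter types.
    type_map = {bool: 'BOOL', int: 'INT64', float: 'FLOAT64', str: 'STRING'}
    params = []
    for name, value in values.items():
        params.append({
            'name': name,
            'parameterType': {'type': type_map[type(value)]},
            'parameterValue': {'value': value},
        })
    return params

# Builds the same structure as the query_parameters list above.
print(make_query_parameters(endpoint='Interact3'))
```

This would let the execute() call above be written as endpoint_stats2.execute(query_params=make_query_parameters(endpoint=endpoint_value)).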
Parameterization enables one direction of the SQL and Python integration: taking values from Python code in the notebook and passing them into the query when retrieving data from BigQuery.
The next notebook will cover the other part of the SQL and Python integration: retrieving query results into the notebook for use with Python code.