UDFs in BigQuery

This notebook shows how to use UDFs (user-defined functions) in Google BigQuery. A UDF lets you operate on the columns of a table, perform an arbitrary transformation, and return the result of that transformation as a value. Datalab currently supports temporary UDFs, which live only within the query that uses them.

You can read more about UDFs in the BigQuery documentation.
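
Before working with real data, it may help to see the basic shape of a temporary UDF. The following minimal sketch (the function name and values are purely illustrative) declares a SQL UDF with CREATE TEMP FUNCTION and then calls it like any built-in function in the query that follows:

CREATE TEMP FUNCTION multiply_by_two(x INT64)
RETURNS INT64
AS (x * 2);

-- The temporary function exists only for the duration of this query.
SELECT multiply_by_two(21) AS doubled;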

Scenario

In this notebook we are going to look at anonymized logs that originated in Google App Engine. These logs include the paths of requested URIs, which contain a number of query parameters of interest. To help us use the logs, we will create a UDF that extracts the query parameter values and puts them into a new column of type ARRAY.

Examining the Data

We are going to look at logs from a week in October 2015. These logs were imported from Google App Engine, with a few relevant fields extracted and anonymized. There is a separate table for each day. Let's look at an example table, starting with its schema and then at a few sample rows:


In [1]:
%bq tables describe --name cloud-datalab-samples.appenginelogs.sample_logs_20151027


Out[1]:

In [2]:
%bq sample --count 5 --table cloud-datalab-samples.appenginelogs.sample_logs_20151027


Out[2]:
timestamp | method | status | latency | path
2015-10-27 22:00:47.660171 | POST | 204 | 0.00337 | /log/signin?project=5&instance=2&user=54&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 01:03:10.959946 | POST | 204 | 0.003195 | /log/page?project=36&instance=40&user=131&page=detail&path=38&version=0.1.1&release=alpha
2015-10-27 01:24:18.065954 | POST | 204 | 0.003023 | /log/page?project=20&instance=29&user=42&page=detail&path=6&version=0.1.1&release=alpha
2015-10-27 00:57:44.694484 | POST | 204 | 0.003418 | /log/start?project=143&instance=215&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 20:10:19.547390 | POST | 204 | 0.00368 | /log/start?project=149&instance=232&user=2&page=master&path=3&version=0.1.1&release=alpha

(rows: 5, time: 2.0s, 256KB processed, job: job_RDk8HuzdIx5_W-CMxKKuKXBhKaA)

You can see we have five columns; the 'path' column needs the most processing. Each URI has the form /log/event?params, where the query parameters include project, instance, user, page, path, version, and release. We are going to extract these parameter values into a separate column of type ARRAY<STRING>.
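
As an aside (not part of the original notebook's flow), the same kind of value extraction can be sketched in pure Standard SQL with REGEXP_EXTRACT_ALL, which likewise returns an ARRAY<STRING>. Unlike the JavaScript UDF we build next, though, this version does not URL-decode the extracted values:

-- Pure-SQL sketch: pull every query parameter value out of path.
SELECT
  path,
  REGEXP_EXTRACT_ALL(path, r'[?&][^=]*=([^&]*)') AS parameters
FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`
LIMIT 5;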

Creating and Testing the UDF

UDFs are functions written in one of the supported languages (currently SQL and JavaScript) that take one or more columns as input, perform some computation, and produce a value. The BigQuery UDF documentation explains that a CREATE TEMP FUNCTION statement is needed to define a UDF, including its parameter names and types, return type, and language. Datalab simplifies this syntax: it uses jsdoc-style // @param comments to achieve the same result. It also exposes UDFs as a Python class and a magic command, to make building queries simpler. Let's see how we can do this:
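
For reference, here is roughly what the raw CREATE TEMP FUNCTION syntax looks like for a trivial hand-written JavaScript UDF (the function is purely illustrative); the %%bq udf magic below generates an equivalent declaration for our real function, and we will see its expanded form at the end of this notebook:

CREATE TEMP FUNCTION greet(name STRING)
RETURNS STRING
LANGUAGE js AS """
  // JavaScript body: build a greeting from the input column value.
  return 'Hello, ' + name + '!';
""";

SELECT greet('Datalab') AS message;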


In [3]:
%%bq query
SELECT * FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`
LIMIT 5


Out[3]:
timestamp | method | status | latency | path
2015-10-27 22:00:47.660171 | POST | 204 | 0.00337 | /log/signin?project=5&instance=2&user=54&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 01:03:10.959946 | POST | 204 | 0.003195 | /log/page?project=36&instance=40&user=131&page=detail&path=38&version=0.1.1&release=alpha
2015-10-27 01:24:18.065954 | POST | 204 | 0.003023 | /log/page?project=20&instance=29&user=42&page=detail&path=6&version=0.1.1&release=alpha
2015-10-27 00:57:44.694484 | POST | 204 | 0.003418 | /log/start?project=143&instance=215&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 20:10:19.547390 | POST | 204 | 0.00368 | /log/start?project=149&instance=232&user=2&page=master&path=3&version=0.1.1&release=alpha

(rows: 5, time: 1.7s, 256KB processed, job: job__gM_SgEw_T7aztsO--sc4fZ6ytY)

In [4]:
%%bq udf --name extract_params -l js
// A function to split a set of URL query parameters into an array
// @param path STRING
// @returns ARRAY<STRING>
var re = /[?&]([^=]*)=([^&]*)/g;
var result = [];
var match;
while ((match = re.exec(path)) != null) {
  result.push(decodeURIComponent(match[2]));
}
return result;

Now we can try calling the UDF. To do this, we define a query that calls the UDF like any regular function, passing one or more columns as input.


In [6]:
%%bq query -n extract_params_query --udfs extract_params
SELECT *, extract_params(path) as parameters FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`
LIMIT 5

In [7]:
%bq execute -q extract_params_query


Out[7]:
timestamp | method | status | latency | path | parameters
2015-10-27 22:00:47.660171 | POST | 204 | 0.00337 | /log/signin?project=5&instance=2&user=54&page=master&path=3&version=0.1.1&release=alpha | ['5', '2', '54', 'master', '3', '0.1.1', 'alpha']
2015-10-27 01:03:10.959946 | POST | 204 | 0.003195 | /log/page?project=36&instance=40&user=131&page=detail&path=38&version=0.1.1&release=alpha | ['36', '40', '131', 'detail', '38', '0.1.1', 'alpha']
2015-10-27 01:24:18.065954 | POST | 204 | 0.003023 | /log/page?project=20&instance=29&user=42&page=detail&path=6&version=0.1.1&release=alpha | ['20', '29', '42', 'detail', '6', '0.1.1', 'alpha']
2015-10-27 00:57:44.694484 | POST | 204 | 0.003418 | /log/start?project=143&instance=215&user=2&page=master&path=3&version=0.1.1&release=alpha | ['143', '215', '2', 'master', '3', '0.1.1', 'alpha']
2015-10-27 20:10:19.547390 | POST | 204 | 0.00368 | /log/start?project=149&instance=232&user=2&page=master&path=3&version=0.1.1&release=alpha | ['149', '232', '2', 'master', '3', '0.1.1', 'alpha']

(rows: 5, time: 6.0s, 256KB processed, job: job_zCt57EKsBwXt2oYDBZftWfinv8s)
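
Once the parameter values are available as an array, they can be used like any other ARRAY column. As a sketch (the query name page_counts_query is ours, and it assumes the parameters always appear in the order seen above: project, instance, user, page, path, version, release, so index 3 is the page), you could count requests per page like this and then run it with %bq execute -q page_counts_query:

%%bq query -n page_counts_query --udfs extract_params
SELECT
  extract_params(path)[SAFE_OFFSET(3)] AS page,
  COUNT(*) AS requests
FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`
GROUP BY page
ORDER BY requests DESC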

To see the actual expanded SQL, including the UDF defined above, we can inspect the query object by typing its name:


In [8]:
extract_params_query


Out[8]:
CREATE TEMPORARY FUNCTION extract_params (path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js
AS """
// A function to split a set of URL query parameters into an array
// @param path STRING
// @returns ARRAY<STRING>
var re = /[?&]([^=]*)=([^&]*)/g;
var result = [];
var match;
while ((match = re.exec(path)) != null) {
  result.push(decodeURIComponent(match[2]));
}
return result;
"""
OPTIONS (

);
SELECT *, extract_params(path) as parameters FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`
LIMIT 5

Next Steps

You can learn how to test your UDF in the notebook by following the UDF Testing in the Notebook tutorial. If you have code that you regularly use in your UDFs, you can factor it out and put it in Google Cloud Storage, then import it. This technique is covered in the UDFs using Code in Cloud Storage tutorial.
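
As a preview of that second technique, a JavaScript UDF can pull in external code through the OPTIONS clause of CREATE TEMP FUNCTION. The sketch below follows the pattern described in the BigQuery documentation; the bucket path and the shared_parse function are hypothetical:

CREATE TEMP FUNCTION parse_with_lib(path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js
OPTIONS (
  -- Hypothetical library location; the file would define shared_parse().
  library = ['gs://my-bucket/path/to/shared_lib.js']
)
AS """
  return shared_parse(path);
""";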