SQL and Pandas DataFrames

Google Cloud Datalab allows you to use SQL to interact with Google BigQuery. However, SQL is just the starting point. Datalab enables you to combine the power of declarative SQL with imperative code (Python) to perform interesting data analysis, visualization, and transformation tasks.

You can use a number of Python data analysis, data wrangling, and visualization libraries, such as numpy, pandas, matplotlib, and many others. Several of these libraries build on top of the pandas DataFrame object.

This notebook shows how easily you can combine these capabilities in a single notebook. The functionality is provided by the BigQuery APIs, so the first step is, of course, to import the module along with pandas.


In [4]:
import google.datalab.bigquery as bq
import pandas as pd

From SQL to DataFrames

In the BigQuery APIs notebook, we've covered how to define a SQL query, execute it, and get a reference to the results in the form of a DataFrame. Let's start with a query to see what more we can do.


In [1]:
%%bq query -n requests
SELECT timestamp, latency, endpoint
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Popular' OR endpoint = 'Recent'

In [2]:
%%bq sample --count 5 --query requests


Out[2]:
timestamp                   latency  endpoint
2014-06-15 07:00:01.973441       21  Popular
2014-06-15 07:00:02.288373       21  Popular
2014-06-15 07:00:01.700211       21  Popular
2014-06-15 07:00:02.043231       21  Popular
2014-06-15 07:00:01.514197       22  Popular

(rows: 5, time: 2.0s, 15MB processed, job: job_EWjDOLxn2S0zCH3eWnS56Jlq8O8)

In [5]:
df = requests.execute(output_options=bq.QueryOutput.dataframe()).result()
len(df)


Out[5]:
8392
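
The same query can also be constructed directly in Python, without the %%bq cell magic. A minimal sketch using the bq.Query class (requests_py and df_py are hypothetical names; the SQL mirrors the cell above):

sql = """
SELECT timestamp, latency, endpoint
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Popular' OR endpoint = 'Recent'
"""
requests_py = bq.Query(sql)  # same query as the named %%bq cell above
df_py = requests_py.execute(output_options=bq.QueryOutput.dataframe()).result()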

Data Manipulation

The resulting DataFrame can be further transformed, sliced, projected, and inspected using the extensive set of APIs available on the DataFrame class. Let's print out the first five rows.


In [6]:
df.head(5)


Out[6]:
timestamp latency endpoint
0 2014-06-15 19:40:43.414446 256 Recent
1 2014-06-16 03:32:55.519498 256 Recent
2 2014-06-16 01:38:26.398083 256 Recent
3 2014-06-15 17:25:47.865026 256 Recent
4 2014-06-15 14:41:43.465275 256 Recent
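
Boolean filtering and column projection combine just as easily; for example, to look at the slowest requests (a quick sketch, where the threshold of 1000 is arbitrary):

# Keep only high-latency rows, then project two columns.
df[df['latency'] > 1000][['endpoint', 'latency']].head()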

Or it can be inspected for schema,


In [7]:
df.dtypes


Out[7]:
timestamp    datetime64[ns]
latency               int64
endpoint             object
dtype: object

or further transformed locally, for example to perform grouping,


In [8]:
groups = df.groupby('endpoint')
groups.dtypes


Out[8]:
          latency timestamp
endpoint
Popular     int64       NaT
Recent      int64       NaT

In [9]:
for name, df_group in groups:
  print('%s - %d items' % (name, len(df_group)))
  print(df_group.head(3))
  print()


Popular - 7658 items
                     timestamp  latency endpoint
734 2014-06-15 12:32:44.619481      256  Popular
735 2014-06-15 15:33:51.416032      256  Popular
736 2014-06-15 07:11:53.720912      256  Popular

Recent - 734 items
                   timestamp  latency endpoint
0 2014-06-15 19:40:43.414446      256   Recent
1 2014-06-16 03:32:55.519498      256   Recent
2 2014-06-16 01:38:26.398083      256   Recent

and then analyze a dimension per group,


In [10]:
groups['latency'].describe()


Out[10]:
endpoint       
Popular   count     7658.000000
          mean       274.439018
          std        187.050788
          min          2.000000
          25%        132.000000
          50%        313.000000
          75%        392.000000
          max       6443.000000
Recent    count      734.000000
          mean       246.268392
          std        690.888990
          min          2.000000
          25%        184.250000
          50%        229.000000
          75%        288.000000
          max      18715.000000
Name: latency, dtype: float64
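
With the pandas version used here, describe() on a group returns a stacked, multi-indexed Series; unstack() pivots it into a table with one row per endpoint (a quick sketch):

# One row per endpoint, one column per summary statistic.
groups['latency'].describe().unstack()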

or even run a set of custom aggregation functions.


In [11]:
def unique(x):
  # Distinct latency values, sorted ascending.
  return sorted(set(x))

groups['latency'].agg({'list': list, 'unique': unique})


Out[11]:
          unique                                              list
endpoint
Popular   [2, 3, 4, 5, 6, 7, 8, 9, 11, 15, 19, 20, 21, 2...  [256, 256, 256, 256, 256, 256, 256, 256, 512, ...
Recent    [2, 3, 4, 5, 6, 10, 21, 22, 23, 24, 29, 31, 35...  [256, 256, 256, 256, 256, 257, 257, 257, 257, ...
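
Note that newer pandas releases deprecate passing a dict of functions to a series groupby; named aggregation expresses the same thing (a sketch, assuming pandas 0.25 or later):

# Equivalent named aggregation for newer pandas versions.
groups['latency'].agg(as_list=list, unique=unique)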

Data Visualization

DataFrame provides built-in visualization capabilities using matplotlib.
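
For example, plotting the latency distribution takes a single call (a quick sketch; the bin count is arbitrary, and a log-scaled count axis helps with the skew):

# Histogram of request latencies, straight from the DataFrame.
df['latency'].plot(kind='hist', bins=50, logy=True)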

Conversion to TimeSeries Data

Our DataFrame instance contains timestamps, latencies, and endpoints. Let's reshape this DataFrame so that each endpoint is a column, that is, an independent series, with the timestamp itself as the index. For timestamps at which one of the endpoints has no data point, we'll fill in the next available value for that endpoint (a backfill; a forward fill would use the previous value instead).


In [16]:
df_series = df.pivot(index='timestamp', columns='endpoint', values='latency').fillna(method='backfill')
df_series[10:20]


Out[16]:
endpoint Popular Recent
timestamp
2014-06-15 07:00:20.649279 177.0 427.0
2014-06-15 07:00:54.666471 352.0 427.0
2014-06-15 07:01:28.139716 471.0 427.0
2014-06-15 07:01:33.672656 394.0 427.0
2014-06-15 07:01:37.050927 321.0 427.0
2014-06-15 07:02:21.453438 390.0 427.0
2014-06-15 07:02:31.533299 390.0 331.0
2014-06-15 07:02:51.365249 130.0 331.0
2014-06-15 07:02:57.402798 121.0 331.0
2014-06-15 07:02:57.703736 139.0 331.0

In [13]:
len(df_series)


Out[13]:
8392

In [20]:
df_series.plot(logy=True)


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f51c8f8d150>

Resampling

The plot above is not very meaningful: the time-series data is sampled at arbitrary timestamps, whenever a request happened to be processed, so it is irregularly spaced and contains a large number of data points. DataFrames provide the ability to resample a time series into more meaningful time windows.


In [21]:
df_series.resample(rule='10min').mean().plot(logy=True)


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f51a97a9210>
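
The same pattern works with other aggregations; for example, the per-window maximum surfaces latency spikes that the mean smooths over (a quick sketch):

# Maximum latency per 10-minute window highlights outliers.
df_series.resample(rule='10min').max().plot(logy=True)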

Looking Ahead

pandas provides an extensive toolbox of capabilities for working with data, and Datalab combines that toolbox with the power of BigQuery.

Subsequent notebooks cover additional SQL capabilities that let you harness the full power of BigQuery, running queries close to your entire dataset and bringing only a useful subset of the data into the notebook.