Google BigQuery can query data directly from Google Cloud Storage (a feature called "External Data Sources"). This is useful for small amounts of data that you may not want to load into a BigQuery table. It is not recommended for large queries, however, because BigQuery billing is based on the amount of data read to process a query. BigQuery stores its own tables in columnar format, so it can query a subset of columns very efficiently: unused columns are not read and do not add to the cost. Data in Cloud Storage, by contrast, is typically a compressed CSV file, so the entire file usually must be read for every query. Hence, while querying data in Cloud Storage can be helpful, it should be used judiciously.
In this notebook we will show you how to download a CSV file from a source on the Internet, put it in Cloud Storage, and then query it directly. We will use the Seattle bike station data from the Pronto 2015 Data Challenge dataset.
In [1]:
from google.datalab import Context
import google.datalab.bigquery as bq
import google.datalab.storage as gs
In [1]:
try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3

data_source = "https://storage.googleapis.com/cloud-datalab-samples/udfsample/2015_station_data.csv"

f = urlopen(data_source)
data = f.read()
f.close()

print('Read %d bytes' % len(data))
In [3]:
# Derive a bucket name from the current project
project = Context.default().project_id
sample_bucket_name = project + '-station_data'

# Create the bucket and write the CSV data to an object in it
sample_bucket = gs.Bucket(sample_bucket_name)
sample_bucket.create()
sample_object = sample_bucket.object('station_data.csv')
sample_object.write_stream(data, 'text/plain')
Now we need to create a special ExternalDataSource object that refers to the data; it can, in turn, be used as a table in our BigQuery queries. We also need to provide a schema so that BigQuery can interpret the data. The CSV file has a header row that we want to skip; we will use a CSVOptions object for that.
In [23]:
options = bq.CSVOptions(skip_leading_rows=1)  # Skip the header row

schema = bq.Schema([
    {'name': 'id',        'type': 'INTEGER'},  # row ID
    {'name': 'name',      'type': 'STRING'},   # friendly name
    {'name': 'terminal',  'type': 'STRING'},   # terminal ID
    {'name': 'lat',       'type': 'FLOAT'},    # latitude
    {'name': 'long',      'type': 'FLOAT'},    # longitude
    {'name': 'dockcount', 'type': 'INTEGER'},  # bike capacity
    {'name': 'online',    'type': 'STRING'}    # date station opened
])

drivedata = bq.ExternalDataSource(source=sample_object.uri,  # the gs:// URL of the file
                                  csv_options=options,
                                  schema=schema,
                                  max_bad_records=10)
In [24]:
drivedata
Out[24]:
In [21]:
bq.Query('SELECT * FROM drivedatasource LIMIT 5',
         data_sources={'drivedatasource': drivedata}).execute().result()
Out[21]:
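Because the external source is a CSV, every query still reads the whole file, but you can trim what comes back. As a small variation on the query above (not in the original notebook), this sketch reuses the same drivedata source and the column names from the schema to list only the larger stations:

bq.Query('SELECT name, dockcount FROM drivedatasource WHERE dockcount >= 20 ORDER BY dockcount DESC',
         data_sources={'drivedatasource': drivedata}).execute().result()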
Finally, let's clean up.
In [25]:
sample_object.delete()
sample_bucket.delete()
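As the introduction notes, querying a CSV in Cloud Storage reads the whole file on every query, so for larger or recurring workloads you would normally load the data into a native BigQuery table instead. The following is only a sketch of what that could look like with the same datalab API, and it would have to run before the cleanup cell above; the dataset name station_ds and the exact Dataset.create, Table.create, and Table.load signatures are assumptions rather than code from the original notebook:

# Hypothetical sketch: load the CSV from Cloud Storage into a native BigQuery table.
# Run before deleting sample_object; method names and signatures are assumed and may need adjusting.
bq.Dataset('station_ds').create()
station_table = bq.Table('station_ds.stations').create(schema=schema, overwrite=True)
job = station_table.load(sample_object.uri, mode='overwrite',
                         source_format='csv', csv_options=options)
job.wait()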