Google BigQuery can query data directly from Google Cloud Storage (a feature called "External Data Sources"). This is useful for small amounts of data that you may not want to load into a BigQuery table. It is not recommended for large queries, however, because BigQuery billing is based on the amount of data read to process a query. BigQuery stores its own tables in columnar format, so it can query a subset of columns very efficiently: unused columns are not read and do not add to the cost. Data in Cloud Storage, by contrast, is typically a compressed CSV file, so the entire file usually must be read for every query. Hence, while querying data in Cloud Storage can be helpful, it should be used judiciously.
In this notebook we will show you how to download a CSV file from a source on the Internet, put it in Cloud Storage, and then query it directly. We will use the Seattle bike station data from the Pronto 2015 Data Challenge dataset.
In [1]:
from google.datalab import Context
import google.datalab.bigquery as bq
import google.datalab.storage as gs
In [1]:
try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3

data_source = "https://storage.googleapis.com/cloud-datalab-samples/udfsample/2015_station_data.csv"

f = urlopen(data_source)
data = f.read()
f.close()

print('Read %d bytes' % len(data))
In [3]:
# Derive a bucket name from the current project
project = Context.default().project_id
sample_bucket_name = project + '-station_data'

# Create the bucket and write the CSV data to an object in it
sample_bucket = gs.Bucket(sample_bucket_name)
sample_bucket.create()
sample_object = sample_bucket.object('station_data.csv')
sample_object.write_stream(data, 'text/plain')
Now we need to create a special ExternalDataSource object that refers to the data; it can, in turn, be used as a table in our BigQuery queries. We also need to provide a schema so that BigQuery can interpret the data. The CSV file has a header row that we want to skip; we will use a CSVOptions object for that.
In [23]:
options = bq.CSVOptions(skip_leading_rows=1)  # Skip the header row

schema = bq.Schema([
    {'name': 'id',        'type': 'INTEGER'},  # row ID
    {'name': 'name',      'type': 'STRING'},   # friendly name
    {'name': 'terminal',  'type': 'STRING'},   # terminal ID
    {'name': 'lat',       'type': 'FLOAT'},    # latitude
    {'name': 'long',      'type': 'FLOAT'},    # longitude
    {'name': 'dockcount', 'type': 'INTEGER'},  # bike capacity
    {'name': 'online',    'type': 'STRING'}    # date station opened
])

drivedata = bq.ExternalDataSource(source=sample_object.uri,  # the gs:// URL of the file
                                  csv_options=options,
                                  schema=schema,
                                  max_bad_records=10)
In [24]:
drivedata
Out[24]:
In [21]:
bq.Query('SELECT * FROM drivedatasource LIMIT 5',
         data_sources={'drivedatasource': drivedata}).execute().result()
Out[21]:
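Because the external source is a CSV, every query still reads the whole file, but you can trim what comes back. As a small variation on the query above (not in the original notebook), this sketch reuses the same drivedata source and the column names from the schema to list only the larger stations:

bq.Query('SELECT name, dockcount FROM drivedatasource WHERE dockcount >= 20 ORDER BY dockcount DESC',
         data_sources={'drivedatasource': drivedata}).execute().result()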
Finally, let's clean up.
In [25]:
sample_object.delete()
sample_bucket.delete()
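As the introduction notes, querying a CSV in Cloud Storage reads the whole file on every query, so for larger or recurring workloads you would normally load the data into a native BigQuery table instead. The following is only a sketch of what that could look like with the same datalab API, and it would have to run before the cleanup cell above; the dataset name station_ds and the exact Dataset.create, Table.create, and Table.load signatures are assumptions rather than code from the original notebook:

# Hypothetical sketch: load the CSV from Cloud Storage into a native BigQuery table.
# Run before deleting sample_object; method names and signatures are assumed and may need adjusting.
bq.Dataset('station_ds').create()
station_table = bq.Table('station_ds.stations').create(schema=schema, overwrite=True)
job = station_table.load(sample_object.uri, mode='overwrite',
                         source_format='csv', csv_options=options)
job.wait()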