Data can be imported into Google BigQuery from a CSV file stored in Google Cloud Storage, or it can be streamed directly into BigQuery from Python code.
Similarly, query results can be exported to Google Cloud Storage as a set of sharded files, or they can be streamed directly into a file within Datalab. For larger result sets, the sharded export is recommended.
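These four paths correspond to a handful of google.datalab.bigquery calls, each of which is demonstrated in the cells that follow. As a rough preview (a sketch only; 'mydataset.mytable' and 'gs://my-bucket' are placeholder names, not resources created in this notebook):
import google.datalab.bigquery as bq
import pandas as pd

table = bq.Table('mydataset.mytable')  # placeholder; assumed to exist already

# Import, option 1: load a CSV file that already lives in Cloud Storage.
table.load('gs://my-bucket/data.csv', mode='append', source_format='csv',
           csv_options=bq.CSVOptions(skip_leading_rows=1))

# Import, option 2: stream rows directly from Python, e.g. a pandas DataFrame.
table.insert(pd.DataFrame([{'Year': 2012, 'Make': 'Tesla', 'Model': 'S'}]))

# Export, option 1: extract the table to Cloud Storage.
table.extract(destination='gs://my-bucket/export/data.csv')

# Export, option 2: stream the table into a local file inside Datalab.
table.to_file('/tmp/data.csv')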
In [1]:
from google.datalab import Context
import google.datalab.bigquery as bq
import google.datalab.storage as storage
import pandas as pd
# Python 2/3 compatibility: on Python 3 the StringIO module does not exist and
# the object contents are read as bytes, so fall back to io.BytesIO.
try:
  from StringIO import StringIO
except ImportError:
  from io import BytesIO as StringIO
To interact with Google Cloud Storage, Datalab includes the %%gcs command. First, see the available options for %%gcs:
In [2]:
%gcs -h
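The help output lists the available sub-commands; this notebook uses read (to pull an object into a Python variable) and list (to enumerate the objects in a bucket). The same operations are also available from Python through the google.datalab.storage module imported above, for example (a sketch against the public samples bucket):
data = storage.Object('cloud-datalab-samples', 'cars.csv').read_stream()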
Let's use the read option to read a storage object into a local Python variable:
In [3]:
%%gcs read --object gs://cloud-datalab-samples/cars.csv --variable cars
In [4]:
print(cars)
In [6]:
# Create the schema, conveniently using a DataFrame example.
df = pd.read_csv(StringIO(cars))
schema = bq.Schema.from_data(df)
# Create the dataset
bq.Dataset('importingsample').create()
# Create the table
sample_table = bq.Table('importingsample.cars').create(schema = schema, overwrite = True)
In [27]:
# Load the CSV from Cloud Storage into the table, skipping the header row.
sample_table.load('gs://cloud-datalab-samples/cars.csv', mode='append',
                  source_format='csv', csv_options=bq.CSVOptions(skip_leading_rows=1))
In [8]:
%%bq query -n importingSample
SELECT * FROM importingsample.cars
In [9]:
%bq execute -q importingSample
Out[9]:
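The same query can also be run through the Python API rather than the %%bq magics; a minimal sketch, assuming the Query.execute().result() pattern from the Datalab samples:
# Run the query as a job, wait for it, and fetch the results as a DataFrame.
results = bq.Query('SELECT * FROM importingsample.cars').execute().result()
results.to_dataframe()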
In [10]:
cars2 = storage.Object('cloud-datalab-samples', 'cars2.csv').read_stream()
df2 = pd.read_csv(StringIO(cars2))
df2
Out[10]:
In [11]:
df2.fillna(value='', inplace=True)
df2
Out[11]:
In [12]:
sample_table.insert(df2)
sample_table.to_dataframe()
Out[12]:
In [25]:
project = Context.default().project_id
sample_bucket_name = project + '-datalab-samples'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/tmp/cars.csv'
print('Bucket: ' + sample_bucket_name)
print('Object: ' + sample_bucket_object)
In [14]:
sample_bucket = storage.Bucket(sample_bucket_name)
sample_bucket.create()
sample_bucket.exists()
Out[14]:
In [26]:
table = bq.Table('importingsample.cars')
table.extract(destination = sample_bucket_object)
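The call above writes a single CSV object. For larger tables, the sharded export mentioned in the introduction can be requested by putting a wildcard in the destination URI; a sketch, assuming BigQuery's standard wildcard behavior for extract jobs applies here:
# The '*' is replaced by a shard number, producing cars-000000000000.csv,
# cars-000000000001.csv, and so on when the table is large enough to split.
table.extract(destination = sample_bucket_path + '/tmp/cars-*.csv')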
In [16]:
%gcs list --objects gs://$sample_bucket_name/*
Out[16]:
In [17]:
bucket = storage.Bucket(sample_bucket_name)
In [18]:
obj = list(bucket.objects())[0]
In [19]:
data = obj.read_stream()
In [20]:
print(data)
In [21]:
table.to_file('/tmp/cars.csv')
In [22]:
%%bash
ls -l /tmp/cars.csv
In [23]:
lines = None
with open('/tmp/cars.csv') as datafile:
  lines = datafile.readlines()
print(''.join(lines))
In [24]:
# Clean up: delete the sample object, bucket, and dataset created above.
sample_bucket.object('tmp/cars.csv').delete()
sample_bucket.delete()
bq.Dataset('importingsample').delete(delete_contents = True)