Importing and Exporting Data

Data can be imported into Google BigQuery from a CSV file stored within Google Cloud Storage, or it can be streamed directly into BigQuery from Python code.

Similarly, the results of a query can be exported to Google Cloud Storage as a set of shards, or they can be streamed directly into a file within Datalab. Note that for larger data sizes, it is recommended to choose the sharded method.



In [1]:

    
from google.datalab import Context
import google.datalab.bigquery as bq
import google.datalab.storage as storage
import pandas as pd
try:
  from StringIO import StringIO
except ImportError:
  from io import BytesIO as StringIO

Importing Data

The first step to analyzing and querying your data is importing it. For this demo, we'll create a temporary table in a temporary dataset within BigQuery, using a small data file within Cloud Storage.

Importing Data from Cloud Storage

To interact with Google Cloud Storage, Datalab includes the %%gcs command. First, see the available options on %%gcs:



In [2]:

    
%gcs -h









    



usage: gcs [-h] {copy,create,delete,list,read,view,write} ...

Execute various Google Cloud Storage related operations. Use "%gcs <command>
-h" for help on a specific command.

positional arguments:
  {copy,create,delete,list,read,view,write}
                        commands
    copy                Copy one or more Google Cloud Storage objects to a
                        different location.
    create              Create one or more Google Cloud Storage buckets.
    delete              Delete one or more Google Cloud Storage buckets or
                        objects.
    list                List buckets in a project, or contents of a bucket.
    read                Read the contents of a Google Cloud Storage object
                        into a Python variable.
    view                View the contents of a Google Cloud Storage object.
    write               Write the value of a Python variable to a Google Cloud
                        Storage object.

optional arguments:
  -h, --help            show this help message and exit
None

Let's use the read option to read a storage object into a local Python variable:



In [3]:

    
%%gcs read --object gs://cloud-datalab-samples/cars.csv --variable cars



In [4]:

    
print(cars)









    



Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture Extended Edition","",4900.00
1999,Chevy,"Venture Extended Edition",Very Large,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00



In [6]:

    
# Create the schema, conveniently using a DataFrame example.
df = pd.read_csv(StringIO(cars))
schema = bq.Schema.from_data(df)

# Create the dataset
bq.Dataset('importingsample').create()

# Create the table
sample_table = bq.Table('importingsample.cars').create(schema = schema, overwrite = True)



In [27]:

    
sample_table.load('gs://cloud-datalab-samples/cars.csv', mode='append',
                  source_format = 'csv', csv_options=bq.CSVOptions(skip_leading_rows = 1))









    



Job mysampleproject/job_RInz5SUINexdU5RUrA2CBgDBsUo completed



In [8]:

    
%%bq query -n importingSample
SELECT * FROM importingsample.cars



In [9]:

    
%bq execute -q importingSample









    Out[9]:





    Year Make Model Description Price
1996 Jeep Grand Cherokee MUST SELL! air, moon roof, loaded 4799.0
1997 Ford E350 ac, abs, moon 3000.0
1999 Chevy Venture Extended Edition Very Large 5000.0
1999 Chevy Venture Extended Edition 4900.0
    
(rows: 4, time: 1.7s,   228B processed, job: job_v9Pm_m1RXKg5vkpxPmKI0VAnX2s)

Importing Data from a DataFrame



In [10]:

    
cars2 = storage.Object('cloud-datalab-samples', 'cars2.csv').read_stream()
df2 = pd.read_csv(StringIO(cars2))
df2









    Out[10]:






  
    
      
      Year
      Make
      Model
      Description
      Price
    
  
  
    
      0
      2010
      Honda
      Civic
      NaN
      15000.0
    
    
      1
      2015
      Tesla
      Model S
      NaN
      64900.0



In [11]:

    
df2.fillna(value='', inplace=True)
df2









    Out[11]:






  
    
      
      Year
      Make
      Model
      Description
      Price
    
  
  
    
      0
      2010
      Honda
      Civic
      
      15000.0
    
    
      1
      2015
      Tesla
      Model S
      
      64900.0



In [12]:

    
sample_table.insert(df2)
sample_table.to_dataframe()









    Out[12]:






  
    
      
      Year
      Make
      Model
      Description
      Price
    
  
  
    
      0
      1997
      Ford
      E350
      ac, abs, moon
      3000.0
    
    
      1
      1999
      Chevy
      Venture Extended Edition
      
      4900.0
    
    
      2
      1999
      Chevy
      Venture Extended Edition
      Very Large
      5000.0
    
    
      3
      1996
      Jeep
      Grand Cherokee
      MUST SELL! air, moon roof, loaded
      4799.0
    
    
      4
      2010
      Honda
      Civic
      
      15000.0
    
    
      5
      2015
      Tesla
      Model S
      
      64900.0

Exporting Data

Exporting Data to Cloud Storage



In [25]:

    
project = Context.default().project_id
sample_bucket_name = project + '-datalab-samples'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/tmp/cars.csv'
print('Bucket: ' + sample_bucket_name)
print('Object: ' + sample_bucket_object)









    



Bucket: mysampleproject-datalab-samples
Object: gs://mysampleproject-datalab-samples/tmp/cars.csv



In [14]:

    
sample_bucket = storage.Bucket(sample_bucket_name)
sample_bucket.create()
sample_bucket.exists()









    Out[14]:





True



In [26]:

    
table = bq.Table('importingsample.cars')
table.extract(destination = sample_bucket_object)









    



Job mysampleproject/job_b6TqQiGuFu5d-8UiB5U5osV0cf8 completed



In [16]:

    
%gcs list --objects gs://$sample_bucket_name/*









    Out[16]:




Name Type Size Updated
tmp/cars.csv application/octet-stream 232 2017-03-07 09:34:30.347000+00:00



In [17]:

    
bucket = storage.Bucket(sample_bucket_name)



In [18]:

    
obj = list(bucket.objects())[0]



In [19]:

    
data = obj.read_stream()



In [20]:

    
print(data)









    



Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000
1999,Chevy,Venture Extended Edition,"",4900
1999,Chevy,Venture Extended Edition,Very Large,5000
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799

Exporting Data to a Local File



In [21]:

    
table.to_file('/tmp/cars.csv')



In [22]:

    
%%bash
ls -l /tmp/cars.csv









    



-rw-r--r-- 1 root root 299 Mar  7 09:34 /tmp/cars.csv



In [23]:

    
lines = None
with open('/tmp/cars.csv') as datafile:
  lines = datafile.readlines()
print(''.join(lines))









    



Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.0
1999,Chevy,Venture Extended Edition,,4900.0
1999,Chevy,Venture Extended Edition,Very Large,5000.0
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.0
2010,Honda,Civic,,15000.0
2015,Tesla,Model S,,64900.0

Cleanup



In [24]:

    
sample_bucket.object('tmp/cars.csv').delete()
sample_bucket.delete()
bq.Dataset('importingsample').delete(delete_contents = True)

Year	Make	Model	Description	Price
1996	Jeep	Grand Cherokee	MUST SELL! air, moon roof, loaded	4799.0
1997	Ford	E350	ac, abs, moon	3000.0
1999	Chevy	Venture Extended Edition	Very Large	5000.0
1999	Chevy	Venture Extended Edition		4900.0