Project Setup

Run the following steps only if you are running Datalab on your local desktop or laptop (not on a GCE VM):

  1. Make sure you have a GCP project with the Machine Learning API and Dataflow API enabled.
  2. Run "%datalab project set --project [project-id]" to set the default project in Datalab.

If you run Datalab on a GCE VM, make sure the VM's project has the Machine Learning API and Dataflow API enabled.


In [1]:
bucket = 'gs://' + datalab_project_id() + '-coast'

In [2]:
!gsutil mb $bucket


Creating gs://bradley-playground-coast/...
ServiceException: 409 Bucket bradley-playground-coast already exists.
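
The 409 above only says that the bucket was already created on an earlier run, so it is safe to continue. As a minimal sketch of making this step rerunnable from Python, the helper below checks for the bucket with `gsutil ls -b` and calls `gsutil mb` only when that check fails. The name `ensure_bucket` and its injectable `run` parameter are hypothetical and not part of Datalab or gsutil; the sketch assumes that a non-zero exit code from `gsutil ls -b` means the bucket is not visible to you.

```python
import subprocess


def ensure_bucket(bucket_url, run=subprocess.call):
    """Create bucket_url with gsutil unless it can already be listed.

    `run` (hypothetical hook) takes an argument list and returns the exit
    code, so the logic can be exercised without gsutil installed.
    Returns True if the bucket was created, False if it was already there.
    """
    # `gsutil ls -b` lists the bucket itself; exit code 0 means it is visible.
    if run(['gsutil', 'ls', '-b', bucket_url]) == 0:
        return False
    # Bucket not visible: try to create it, and fail loudly if that fails too.
    if run(['gsutil', 'mb', bucket_url]) != 0:
        raise RuntimeError('could not create ' + bucket_url)
    return True
```

In the notebook this could be called as `ensure_bucket(bucket)`. If the name is taken by a bucket you cannot list (for example one owned by another project), `gsutil mb` still fails and the helper raises instead of hiding the error.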

Data Preparation

All data is under gs://cloud-datalab/sampledata/coast. See https://storage.googleapis.com/tamucc_coastline/GooglePermissionForImages_20170119.pdf for details.

Load the data from the CSV files into BigQuery tables.


In [3]:
import google.datalab.bigquery as bq

# Create the dataset
bq.Dataset('coast').create()

schema = [
  {'name':'image_url', 'type': 'STRING'},
  {'name':'label', 'type': 'STRING'},
]

# Create the table
train_table = bq.Table('coast.train').create(schema=schema, overwrite=True)
train_table.load('gs://cloud-datalab/sampledata/coast/train.csv', mode='overwrite', source_format='csv')
eval_table = bq.Table('coast.eval').create(schema=schema, overwrite=True)
eval_table.load('gs://cloud-datalab/sampledata/coast/eval.csv', mode='overwrite', source_format='csv')


Out[3]:
Job bradley-playground/job_ip-qacbM3liCTrLhGEhOrpHdmo8 completed
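
To make the two-column schema concrete, here is a small Python 3 sketch of how rows in that layout map onto the `image_url` and `label` fields. It assumes the CSV files are comma-separated with no header row, which the source does not state. The two sample rows are rebuilt from values that appear in the training table, and the helper `parse_rows` is hypothetical.

```python
import csv
import io

# Sample text in the ASSUMED file layout: no header row, two columns.
SAMPLE_CSV = (
    'gs://tamucc_coastline/esi_images/IMG_5894_SecC_Sum12_Pt1.JPG,1\n'
    'gs://tamucc_coastline/esi_images/IMG_2319_SecOP_Sum12_Pt3.jpg,1\n'
)

# Same field names and order as the BigQuery schema.
FIELDS = ['image_url', 'label']


def parse_rows(text):
    """Parse CSV text into dicts keyed by the schema's field names."""
    rows = list(csv.DictReader(io.StringIO(text), fieldnames=FIELDS))
    for row in rows:
        # DictReader stores extra columns under the key None and fills
        # missing columns with the value None; reject both cases.
        if None in row or None in row.values():
            raise ValueError('unexpected column count: %r' % (row,))
    return rows


rows = parse_rows(SAMPLE_CSV)
```

Labels stay strings here, matching the `STRING` type declared in the schema.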

See the following file for the label description:


In [4]:
!gsutil cat gs://cloud-datalab/sampledata/coast/dict_explanation.csv



In [5]:
%%bq query --name coast_train
SELECT image_url, label FROM coast.train

In [6]:
coast_train.execute().result()


Out[6]:
image_url                                                       label
gs://tamucc_coastline/esi_images/IMG_5894_SecC_Sum12_Pt1.JPG    1
gs://tamucc_coastline/esi_images/IMG_2319_SecOP_Sum12_Pt3.jpg   1
gs://tamucc_coastline/esi_images/IMG_6674_SecQN_Sum12_Pt3.jpg   1
gs://tamucc_coastline/esi_images/IMG_9483_SecMO_Sum12_Pt3.jpg   1
gs://tamucc_coastline/esi_images/IMG_0223_SecEFG_Sum12_Pt1.JPG  1
gs://tamucc_coastline/esi_images/IMG_1012_SecEFG_Sum12_Pt1.JPG  1
gs://tamucc_coastline/esi_images/IMG_9996_SecBC_Spr12.jpg       1
gs://tamucc_coastline/esi_images/IMG_0899_SecEFG_Sum12_Pt1.JPG  1
gs://tamucc_coastline/esi_images/IMG_4996_SecEGH_Sum12_Pt2.jpg  1
gs://tamucc_coastline/esi_images/IMG_1281_SecEFG_Sum12_Pt1.JPG  1
gs://tamucc_coastline/esi_images/IMG_9888_SecBC_Spr12.jpg       1
gs://tamucc_coastline/esi_images/IMG_5805_SecFG2_Spr12.jpg      1
gs://tamucc_coastline/esi_images/IMG_9896_SecBC_Spr12.jpg       1
gs://tamucc_coastline/esi_images/IMG_5917_SecOPQ_Sum12_Pt3.jpg  1
gs://tamucc_coastline/esi_images/IMG_7881_SecQN_Sum12_Pt3.jpg   1
gs://tamucc_coastline/esi_images/IMG_6718_SecQN_Sum12_Pt3.jpg   1
gs://tamucc_coastline/esi_images/IMG_5814_SecFG2_Spr12.jpg      1
gs://tamucc_coastline/esi_images/IMG_6518_SecEFG_Sum12_Pt1.JPG  1
gs://tamucc_coastline/esi_images/IMG_5906_SecC_Sum12_Pt1.JPG    1
gs://tamucc_coastline/esi_images/IMG_7443_SecEFG_Sum12_Pt1.JPG  1
gs://tamucc_coastline/esi_images/IMG_0314_SecMO_Sum12_Pt3.jpg   1
gs://tamucc_coastline/esi_images/IMG_5476_SecFG_Spr12.jpg       1
gs://tamucc_coastline/esi_images/IMG_7458_SecEFG_Sum12_Pt1.JPG  1
gs://tamucc_coastline/esi_images/IMG_2369_SecHKL_Sum12_Pt2.jpg  1
gs://tamucc_coastline/esi_images/IMG_1328_SecOP_Sum12_Pt3.jpg   1

(rows: 8850, time: 3.8s, 590KB processed, job: job_0-8eMy4DVJC6r1VmFxhnWLY8ZDc)

Explore Your Data

Sample about 1000 instances from each table for visualization. The data is simple (an image URL and a label per row), so we just plot a bar chart of the label counts and compare the training and eval data.


In [7]:
# Import only the class used below instead of a wildcard import.
from google.datalab.ml import BigQueryDataSet

ds_train = BigQueryDataSet(table='coast.train')
ds_eval = BigQueryDataSet(table='coast.eval')

df_train = ds_train.sample(1000)
df_eval = ds_eval.sample(1000)

In [8]:
df_train.label.value_counts().plot(kind='bar');



In [9]:
df_eval.label.value_counts().plot(kind='bar');
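
Two separate bar charts of raw counts are hard to compare by eye, especially when the samples differ in size. A rough Python 3 sketch of a numeric comparison: convert each set of labels to shares of the total and list the labels whose shares differ most. The helpers `label_shares` and `compare_shares` are hypothetical and not part of Datalab; they accept any iterable of labels.

```python
from collections import Counter


def label_shares(labels):
    """Return {label: fraction of instances} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def compare_shares(train_labels, eval_labels):
    """List (label, train_share, eval_share) tuples, largest gap first."""
    train = label_shares(train_labels)
    evl = label_shares(eval_labels)
    rows = [(label, train.get(label, 0.0), evl.get(label, 0.0))
            for label in set(train) | set(evl)]
    # Sort by absolute difference (descending), then by label for stable ties.
    return sorted(rows, key=lambda r: (-abs(r[1] - r[2]), str(r[0])))
```

With the DataFrames above, this would presumably be called as `compare_shares(df_train.label, df_eval.label)`. A label that appears in only one of the two samples shows up with a share of 0.0 on the other side.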


