Guide to Google Cloud Datalab

Thanks for using Google Cloud Datalab!

This notebook serves as your guide to the documentation and samples that accompany Datalab. It describes how you can use interactive notebooks, Python, and BigQuery SQL to explore, visualize, analyze, and transform your data on the Google Cloud Platform.

As an aside, you'll notice that this content is itself distributed in notebook form, very much like the notebooks you can use to make your data analysis iterative, self-documenting, and shareable.

Documentation Outline

Introduction

Please browse through the following notebooks for a basic orientation to Datalab and how it works.

  • Introduction to Python - Python is essential to working in Datalab. This notebook provides a quick overview of the Python environment, with links to in-depth online language tutorials if you're new to Python.
  • Using Datalab - Accessing Cloud Data - This notebook briefly describes the Datalab environment, including how the Datalab workspace is configured, as well as important information about authorization.

Tutorials

These notebooks describe the product and its features, including the tools and Python APIs that you can use within notebooks.

Google BigQuery

  • Hello BigQuery - Datalab puts Google BigQuery at your fingertips. For the most basic example, start here.
  • BigQuery Commands - Use simple, declarative commands to do everything from exploring to interactively analyzing, transforming, and visualizing your data (a minimal example follows this list).
  • BigQuery APIs - Use an extensive and intuitive library of Python APIs, designed with notebooks in mind, to query data and work with BigQuery objects such as Datasets, Tables, and Schemas.
  • SQL Query Composition - Use nested SQL statements and big joins to harness the full power of BigQuery, building them up one step at a time.
  • UDFs in BigQuery - An introduction to using UDFs (user-defined functions) to perform custom transformations not possible through plain SQL.
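
As a taste of what these notebooks cover, here is a minimal sketch of running the same query two ways against the public bigquery-public-data.samples.shakespeare table. The names below (the %%bq query cell magic and the google.datalab.bigquery module) are those of recent Datalab builds; older releases use %%sql and datalab.bigquery instead, so treat this as a sketch rather than a definitive reference.

First, as a declarative cell magic (the %%bq line must be the first line of its cell):

    %%bq query
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10

And the equivalent call through the Python API, which hands the results back as a pandas DataFrame for further analysis:

    import google.datalab.bigquery as bq

    # Define the query once; nothing runs until execute() is called.
    query = bq.Query('''
        SELECT word, SUM(word_count) AS total
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY word
        ORDER BY total DESC
        LIMIT 10''')

    # Run the query and convert the result set to a pandas DataFrame.
    df = query.execute().result().to_dataframe()
    df.head()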

Storage

  • Storage Commands - Use simple, declarative commands to quickly manage your Cloud Storage objects (see the sketch after this list).
  • Storage APIs - Use the equivalent Python APIs, designed with notebooks in mind, to read data from and write data to Google Cloud Storage.
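
A comparable sketch for Cloud Storage follows. It assumes the %%gcs cell magic and the google.datalab.storage module of recent Datalab builds (older releases expose %%storage and datalab.storage), and the bucket and object names are placeholders for your own; flag and method names vary between releases, so verify them against the Storage notebooks.

In one cell, the declarative form reads an object into a Python variable:

    %%gcs read --object gs://my-bucket/data/sample.csv --variable sample_text

And roughly the equivalent through the Python API:

    import google.datalab.storage as storage

    # 'my-bucket' is a placeholder; substitute one of your own buckets.
    bucket = storage.Bucket('my-bucket')

    # Enumerate the bucket's objects, then read one of them directly.
    for obj in bucket.objects():
        print(obj.key)

    contents = bucket.object('data/sample.csv').read_stream()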

Stackdriver Monitoring

The tutorials for this module use a sample project that is not readable by everyone. To run them, you will need to set a default project of your own and make sure that it has at least one Compute Engine (GCE) instance.

  • Getting Started - Datalab allows you to access and analyze your Stackdriver monitoring data. For an overview of how to access the time series data, start here (a short sketch follows this list).
  • Group Metrics - How to list Stackdriver groups and query time series data for a given group.
  • Time-shifted Data - How to compare today's metric data against the past week by time-shifting the time series data.
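
By way of preview, the sketch below pulls recent CPU utilization for the default project's Compute Engine instances into a pandas DataFrame. It assumes the google.datalab.stackdriver.monitoring module of recent Datalab builds; the metric type is the standard Compute Engine CPU metric, and the two-hour window and instance_name label are arbitrary choices.

    from google.datalab.stackdriver import monitoring as gcm

    # Query the last two hours of CPU utilization for GCE instances
    # in the default project.
    query = gcm.Query('compute.googleapis.com/instance/cpu/utilization', hours=2)

    # Convert the matching time series into a DataFrame, one column per instance.
    df = query.as_dataframe(label='instance_name')
    df.tail()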

Data Visualization

  • Interactive Charts with Google Charting APIs - Google Charts provides a rich selection of interactive charts, rendered on the client using JavaScript and SVG. In addition to standard charts such as bar charts, line charts, and pie charts, this notebook demonstrates map viewers, time-series viewers, Sankey diagrams, and more (a minimal charting sketch follows).
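
As a preview, the sketch below charts a small pandas DataFrame with the %%chart cell magic. The chart type and flag names are assumptions based on recent Datalab builds, so check the notebook itself for the exact syntax.

    import pandas as pd

    # A small DataFrame to plot; the values are made up for illustration.
    df = pd.DataFrame({'Language': ['Python', 'SQL', 'JavaScript'],
                       'Notebooks': [120, 85, 40]})

Then, in a separate cell (the %%chart line must come first):

    %%chart columns --data df --fields Language,Notebooks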

Samples

This set of notebooks builds on the techniques and concepts illustrated in the documentation and puts them into practice.

TensorFlow

  • Text Classification with TensorFlow - Demonstrates two models built with TensorFlow for text classification on the 20 Newsgroups dataset: one feed-forward and one recurrent.

MLToolbox

BigQuery

  • Anomaly Detection in HTTP Logs - Demonstrates using SQL to convert raw HTTP logs stored in BigQuery into a time series that can be used for detecting anomalies in a web application.
  • Programming Language Correlation - Demonstrates combining SQL with pandas-based Python data analysis to determine how programming languages correlate (or not), by tapping into open-source developer activity on GitHub.
  • Exploring Genomics Data - Demonstrates browsing and understanding genomics data provided as publicly accessible BigQuery datasets.

Updating Documentation

Datalab documentation is distributed as notebooks, which are updated automatically at installation time. To update, save all open notebooks, close any running sessions, and then re-run the installation script; it will pull the latest notebooks.

Committing and Refreshing

Once you have updated the local copy of the documents, you can commit them to the git repository.
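
For example, from a notebook cell you can use shell escapes to stage and commit the refreshed documents; the path below is a placeholder for wherever your Datalab notebooks are checked out.

    # '/content/datalab/docs' is a placeholder path; substitute your own checkout.
    !cd /content/datalab/docs && git add -A && git commit -m "Refresh Datalab docs"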

Next, refresh your notebooks so that they load the updated documents.