Copyright 2018 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Datathon Tutorial

The aim of this tutorial is to familiarize you with BigQuery so that you can query, filter, aggregate, and export data with Python.

Prerequisites

You will need a valid Google account to be able to log in to Google Cloud Platform. If you do not have one, you can create one at https://accounts.google.com. If you will also be accessing restricted datasets, you may need to notify the datathon organizers to register your account for data access.

Setup

To be able to run the queries in this tutorial, you need to create a copy of this Colab notebook via the "File" > "Save a copy in Drive..." menu. You can share your copy with your teammates by clicking the "SHARE" button in the top-right corner of your copy. Everyone with "Edit" permission can modify the notebook at the same time, which makes it a great way to collaborate as a team. Before running any cell in this Colab, please make sure there is a green check mark before "CONNECTED" in the top-right corner; if not, click the "CONNECTED" button to connect to a random backend.

Now that you have done the initial setup, let us start playing with the data. First, you need to run some initialization code. You can run the following cell by clicking the triangle button that appears when you hover over the [ ] space in the top-left corner of the code cell below.


In [0]:
# Import libraries
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.path as path

# The imports below are used to print out pretty pandas dataframes
from IPython.display import display, HTML

# Imports for accessing Datathon data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

Before running any queries using BigQuery, you first need to authenticate yourself by running the following cell. If you are running it for the first time, it will ask you to follow a link to log in with your Gmail account and accept the data access requests to your profile. Once this is done, it will generate a verification code, which you should paste back into the cell below and press Enter.


In [0]:
auth.authenticate_user()

The data-hosting project bigquery-public-data only grants read-only access; as a result, you need to set a default project that you have BigQuery access to. A shared project should be created by the event organizers, and we will be using it throughout this tutorial.

Note that during the datathon, all participants will be divided into teams, and a Google Cloud project will be created specifically for each team. That project is the preferred one to use. For now, we'll stick with the shared project for the purposes of this tutorial.

After the datathon is finished, the shared project may either have its access locked down or be deleted. It is still possible to run queries from a project you own personally, as long as you have access to the data-hosting project.

Change the variable project_id below to the ID of the project you are using.


In [0]:
# The project to run BigQuery jobs in; replace with your own project ID.
project_id = 'REPLACE_WITH_PROJECT_ID'
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id
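
If you prefer the BigQuery client library imported earlier over the pandas helper used in the rest of this tutorial, the following minimal sketch verifies that authentication and the project setup work; the trivial SELECT 1 query is just an illustrative connectivity check, not part of the official tutorial flow.


In [0]:
# A minimal sanity check using the google.cloud.bigquery client imported above.
# Assumes credentials from auth.authenticate_user() and the project set above.
client = bigquery.Client(project=project_id)

# Run a trivial query; this fails if auth or the project is misconfigured.
print(client.query('SELECT 1 AS ok').to_dataframe())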

Let's define a few methods to wrap BigQuery operations, so that we don't have to write the configurations again and again.


In [0]:
# Read data from BigQuery into pandas dataframes.
def run_query(query):
  # Run the query under the default project, using standard (non-legacy) SQL.
  return pd.io.gbq.read_gbq(
      query,
      project_id=project_id,
      verbose=False,
      configuration={'query': {'useLegacySql': False}})
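
As a quick sanity check, you can exercise this wrapper with a trivial query (the SELECT 1 below is only an illustration; any valid standard SQL works):


In [0]:
# Sanity check: this should return a one-row, one-column dataframe.
run_query('SELECT 1 AS sanity_check')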

OK, that's it for setup. Now let's get our hands on the CMS Medicare public data!

Analysis

Let's now run some queries.

First, let's run the following query to produce the data for a histogram showing the cost of heart transplants and heart assist system implants in 2015, in ten-thousand-dollar buckets (i.e. bucket 12 = $120,000-$130,000, bucket 13 = $130,000-$140,000, ...).


In [0]:
df = run_query('''
WITH costs AS (
  SELECT
    CAST(FLOOR(average_medicare_payments / 10000) AS INT64) AS average_cost_bucket_in_ten_thousands
    FROM `bigquery-public-data.cms_medicare.inpatient_charges_2015`
    WHERE drg_definition = '001 - HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC')
SELECT
  COUNT(average_cost_bucket_in_ten_thousands) AS number_of_procedures,
  average_cost_bucket_in_ten_thousands
  FROM costs
  GROUP BY average_cost_bucket_in_ten_thousands
  ORDER BY average_cost_bucket_in_ten_thousands ASC
''')

df.set_index('average_cost_bucket_in_ten_thousands').plot(title='costs', kind='bar', legend=False)

This cell consists of three parts:

  1. First, we retrieve the costs of heart transplants and round each cost down to the nearest $10,000, saving the result in a temporary table costs.
  2. The bucketed costs are aggregated to produce only the information required: the cost bucket and the number of procedures in each bucket. This is stored in a pandas dataframe.
  3. We plot the chart from the dataframe.
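
If you want to inspect the raw numbers behind the chart, the display helper imported at the beginning renders the dataframe as a table (a small illustrative step, not part of the three parts above):


In [0]:
# Show the first few cost buckets and their procedure counts.
display(df.head())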

Congratulations! You have now finished this datathon tutorial and are ready to explore more data by querying Google BigQuery. To do so, simply replace bigquery-public-data.cms_medicare with the name of the project and dataset you are interested in. One thing to note, though: it is highly recommended to aggregate data aggressively wherever possible, because large dataframes may drastically degrade Colab's performance or even cause out-of-memory errors.
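
For example, the following sketch pushes the aggregation into BigQuery so that only one row per state reaches the notebook; it assumes the provider_state and total_discharges columns of the same inpatient_charges_2015 table.


In [0]:
# A sketch of server-side aggregation: one small row per state reaches the
# notebook instead of every underlying charge record.
df_states = run_query('''
SELECT
  provider_state,
  SUM(total_discharges) AS state_discharges
FROM `bigquery-public-data.cms_medicare.inpatient_charges_2015`
GROUP BY provider_state
ORDER BY state_discharges DESC
''')
df_states.head()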

Enjoy!