Copyright 2018 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
The aim of this tutorial is to familiarize you with BigQuery: querying, filtering, aggregating, and exporting data with Python.
You will need a valid Google account to log in to Google Cloud Platform. If you do not have one, you can create one at https://accounts.google.com. If you will also be accessing restricted datasets, you may need to ask the datathon organizers to register your account for data access.
To run the queries in this tutorial, you need to create a copy of this Colab notebook via the "File" > "Save a copy in Drive..." menu. You can share your copy with your teammates by clicking the "SHARE" button in the top-right corner of your copy. Everyone with "Edit" permission can modify the notebook at the same time, which makes it a great tool for team collaboration. Before running any cell in this Colab, make sure there is a green check mark next to "CONNECTED" in the top-right corner; if there isn't, click the "CONNECT" button to connect the notebook to a backend.
Now that you have done the initial setup, let's start playing with the data. First, you need to run some initialization code. You can run the following cell by clicking the triangle button that appears when you hover over the [ ] in the top-left corner of the code cell below.
In [0]:
# Import libraries
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.path as path
# Below imports are used to print out pretty pandas dataframes
from IPython.display import display, HTML
# Imports for accessing Datathon data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery
Before running any queries using BigQuery, you need to authenticate yourself by running the following cell. The first time you run it, it will ask you to follow a link and log in with your Google account, then accept the data-access requests for your profile. Once this is done, it will show a verification code, which you should paste back into the cell below and press Enter.
In [0]:
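# Authenticate the Colab user so that the BigQuery calls below are authorized.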
auth.authenticate_user()
The data-hosting project bigquery-public-data is read-only to you, so you need to set a default project that you have BigQuery access to; your queries will run under (and be billed to) that project. A shared project should have been created by the event organizers, and we will be using it throughout this tutorial.
Note that during the datathon, all participants will be divided into teams, and a Google Cloud project will be created specifically for each team. That team project is the preferred one to use; for now we'll stick with the shared project for the purposes of this tutorial.
After the datathon is finished, the shared project may have its access locked down or be deleted entirely. You can still run queries from a project you own personally, as long as you have access to the data-hosting project.
Change the variable project_id below to the ID of the project you are using.
In [0]:
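# All BigQuery queries in this notebook run under (and are billed to) this project.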
project_id='REPLACE_WITH_PROJECT_ID'
os.environ["GOOGLE_CLOUD_PROJECT"]=project_id
Let's define a helper method to wrap the BigQuery read, so that we don't have to repeat the same configuration for every query.
In [0]:
# Read data from BigQuery into a pandas dataframe, using Standard SQL.
def run_query(query):
  return pd.read_gbq(query, project_id=project_id, dialect='standard')
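Since we already imported google.cloud.bigquery above, the same helper can also be written against the BigQuery client library directly. This is just an equivalent sketch, not something the rest of the tutorial depends on; both versions return a pandas dataframe.
In [0]:
# Alternative: query through the google.cloud.bigquery client instead of pandas-gbq.
def run_query_with_client(query):
  client = bigquery.Client(project=project_id)
  # QueryJob.to_dataframe() waits for the query to finish and converts the rows.
  return client.query(query).to_dataframe()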
OK, that's it for setup, now let's get our hands on the CMS Medicare public data!
In [0]:
df = run_query('''
WITH costs AS (
SELECT
CAST(FLOOR(average_medicare_payments / 10000) AS INT64) AS average_cost_bucket_in_ten_thousands
FROM `bigquery-public-data.cms_medicare.inpatient_charges_2015`
WHERE drg_definition = '001 - HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC')
SELECT
COUNT(average_cost_bucket_in_ten_thousands) AS number_of_procedures,
average_cost_bucket_in_ten_thousands
FROM costs
GROUP BY average_cost_bucket_in_ten_thousands
ORDER BY average_cost_bucket_in_ten_thousands ASC
''')
df.set_index('average_cost_bucket_in_ten_thousands').plot(title='costs',kind='bar',legend=False)
This consists of 3 parts:
1. The costs subquery (the WITH clause) takes the average Medicare payment for each heart-transplant record and buckets it into $10,000 ranges, so e.g. a $95,000 average payment lands in bucket 9.
2. The outer SELECT counts how many records fall into each bucket, grouping and ordering by bucket.
3. The last line plots the bucket counts as a bar chart, with the cost bucket (in ten-thousands of dollars) on the x-axis.
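If you want to sanity-check the numbers behind the chart, the display and HTML helpers imported in the setup cell can render the dataframe as a table. This is an optional inspection step, nothing more:
In [0]:
# Render the bucket counts behind the bar chart as an HTML table.
display(HTML(df.to_html(index=False)))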
Congratulations! You have now finished this datathon tutorial and are ready to explore more data by querying Google BigQuery. To do so, simply replace bigquery-public-data.cms_medicare
with the project and dataset name you are interested in. One thing to note, though: it is highly recommended to aggregate data aggressively wherever possible, because large dataframes may drastically degrade Colab's performance or even cause out-of-memory errors. See the sketch below.
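For example, instead of downloading raw rows and grouping them in pandas, push the GROUP BY into SQL so that only the small aggregated result reaches the notebook. A hypothetical sketch over the same dataset (this assumes the table has a provider_state column; check the schema in the BigQuery console):
In [0]:
# Let BigQuery do the aggregation; only ~50 rows reach the dataframe.
df_states = run_query('''
SELECT
  provider_state,
  COUNT(*) AS number_of_records
FROM `bigquery-public-data.cms_medicare.inpatient_charges_2015`
GROUP BY provider_state
ORDER BY number_of_records DESC
''')
df_states.head(10)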
Enjoy!