Before you begin

  1. Use the Cloud Resource Manager to Create a Cloud Platform project if you do not already have one.
  2. Enable billing for the project.
  3. Enable BigQuery APIs for the project.

Provide your credentials to the runtime


In [0]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Optional: Enable data table display

Colab includes the google.colab.data_table package that can be used to display large pandas dataframes as an interactive data table. It can be enabled with:


In [0]:
%load_ext google.colab.data_table

If you would prefer to return to the classic Pandas dataframe display, you can disable this by running:

%unload_ext google.colab.data_table

Use BigQuery via magics

The google.cloud.bigquery library also includes a magic command which runs a query and either displays the result or saves it to a variable as a DataFrame.


In [0]:
# Display query output immediately

%%bigquery --project yourprojectid
SELECT 
  COUNT(*) as total_rows
FROM `bigquery-public-data.samples.gsod`


Out[0]:
total_rows
0 114420316

In [0]:
# Save output in a variable `df`

%%bigquery --project yourprojectid df
SELECT 
  COUNT(*) as total_rows
FROM `bigquery-public-data.samples.gsod`

In [0]:
df


Out[0]:
total_rows
0 114420316

Use BigQuery through google-cloud-bigquery

See BigQuery documentation and library reference documentation.

The GSOD sample table contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.

Declare the Cloud project ID which will be used throughout this notebook


In [0]:
project_id = '[your project ID]'

Sample approximately 2000 random rows


In [0]:
from google.cloud import bigquery

client = bigquery.Client(project=project_id)

sample_count = 2000
row_count = client.query('''
  SELECT 
    COUNT(*) as total
  FROM `bigquery-public-data.samples.gsod`''').to_dataframe().total[0]

df = client.query('''
  SELECT
    *
  FROM
    `bigquery-public-data.samples.gsod`
  WHERE RAND() < %d/%d
''' % (sample_count, row_count)).to_dataframe()

print('Full dataset has %d rows' % row_count)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-1c78a0ab70ae> in <module>()
      1 from google.cloud import bigquery
      2 
----> 3 client = bigquery.Client(project=project_id)
      4 
      5 sample_count = 2000

NameError: name 'project_id' is not defined

Describe the sampled data


In [0]:
df.describe()


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-ea8415b8a3ee> in <module>()
----> 1 df.describe()

NameError: name 'df' is not defined

View the first 10 rows


In [0]:
df.head(10)


Out[0]:
station_number wban_number year month day mean_temp num_mean_temp_samples mean_dew_point num_mean_dew_point_samples mean_sealevel_pressure num_mean_sealevel_pressure_samples mean_station_pressure num_mean_station_pressure_samples mean_visibility num_mean_visibility_samples mean_wind_speed num_mean_wind_speed_samples max_sustained_wind_speed max_gust_wind_speed max_temperature max_temperature_explicit min_temperature min_temperature_explicit total_precipitation snow_depth fog rain snow hail thunder tornado
0 105780 99999 1968 9 13 46.000000 8 44.200001 8.0 NaN NaN NaN NaN 4.700000 8.0 15.3 8.0 21.0 NaN 43.000000 False None None 0.02 NaN True True True True True True
1 25710 99999 1989 5 2 51.299999 23 44.900002 23.0 NaN NaN NaN NaN 6.600000 23.0 7.5 23.0 12.0 NaN 42.799999 True None None 0.00 NaN False False False False False False
2 475160 99999 2003 4 26 45.200001 16 44.500000 16.0 NaN NaN NaN NaN 4.500000 16.0 7.4 16.0 13.0 NaN 39.200001 True None None NaN NaN False False False False False False
3 476720 99999 1989 12 8 51.599998 4 34.000000 4.0 1005.400024 4.0 NaN NaN 9.000000 4.0 5.0 4.0 8.0 NaN 41.900002 True None None 0.00 NaN False False False False False False
4 940040 99999 1991 6 9 84.300003 4 75.199997 4.0 1009.900024 4.0 1009.099976 4.0 24.900000 4.0 6.0 4.0 9.9 NaN 79.199997 True None None 0.00 NaN False False False False False False
5 103250 99999 1976 7 23 63.599998 13 48.599998 13.0 1022.099976 5.0 NaN NaN 24.600000 13.0 3.0 12.0 6.0 NaN 48.200001 False None None 0.00 NaN False False False False False False
6 919280 99999 1981 2 21 83.800003 5 75.900002 5.0 1007.900024 5.0 NaN NaN 17.100000 4.0 8.5 5.0 8.9 NaN 77.000000 False None None 0.00 NaN False False False False False False
7 961710 99999 2004 6 23 81.900002 6 74.900002 6.0 1010.500000 6.0 1008.000000 6.0 4.800000 6.0 1.5 6.0 5.1 NaN 74.800003 False None None 0.00 NaN False False False False False False
8 172400 99999 1990 6 6 65.400002 7 43.900002 7.0 1018.200012 7.0 906.099976 7.0 18.600000 7.0 2.0 7.0 4.1 NaN 45.000000 False None None 0.00 NaN False False False False False False
9 38790 99999 1973 4 7 44.799999 8 32.900002 8.0 1018.299988 8.0 NaN NaN 17.200001 8.0 9.1 8.0 15.0 NaN 35.599998 True None None NaN NaN False False False False False False

In [0]:
# 10 highest total_precipitation samples
df.sort_values('total_precipitation', ascending=False).head(10)[['station_number', 'year', 'month', 'day', 'total_precipitation']]


Out[0]:
station_number year month day total_precipitation
644 230220 1964 7 15 5.91
1155 985430 2008 12 8 3.46
1196 248260 1961 11 1 2.95
1588 257670 1959 8 9 2.95
980 299150 1962 3 1 2.95
1325 470250 1965 11 25 2.95
1917 288380 1994 8 6 2.32
1211 585190 1995 4 14 2.32
250 647000 2005 8 19 2.20
1418 964710 1975 9 8 1.97

Use BigQuery through pandas-gbq

The pandas-gbq library is a community led project by the pandas community. It covers basic functionality, such as writing a DataFrame to BigQuery and running a query, but as a third-party library it may not handle all BigQuery features or use cases.

Pandas GBQ Documentation


In [0]:
import pandas as pd

sample_count = 2000
df = pd.io.gbq.read_gbq('''
  SELECT name, SUM(number) as count
  FROM `bigquery-public-data.usa_names.usa_1910_2013`
  WHERE state = 'TX'
  GROUP BY name
  ORDER BY count DESC
  LIMIT 100
''', project_id=project_id, dialect='standard')

df.head()


Out[0]:
name count
0 James 272793
1 John 235139
2 Michael 225320
3 Robert 220399
4 David 219028