Kaggle works great with Google Big Query. Especially when using the 'bq_helper' package (python).
// Credits and a big thanks to Rachael Tatman e.a.

However, there is a caveat. There is a 5TB query limit. This refers to the scanning of the dataset not the size of the 'response'.

This kernel uses the openAQ dataset and the bq_helper package (python) to demonstrate how to see the 'scan' size of your SQL query before actually sending it.


In [47]:
# import the python helper package for bigqueey (thank you Rachael Tatman e.a.)
import bq_helper

In [48]:
# create the helper object
open_aq = bq_helper.BigQueryHelper(active_project="bigquery-public-data", dataset_name="openaq")

In [49]:
#print the tables in the dataset to check everthing went ok so far
open_aq.list_tables()


Out[49]:
['global_air_quality']

In [50]:
# print the first couple of rows to look at the structure of the dataset
# not this is somewhat different from the usual way with dataframes; datafram_name.head(number_of_rows_to_show) ...
# you can still only show e.g. 3 rows by tpying: open_aq.head("global_air_quality", num_rows=3)
open_aq.head("global_air_quality")


Out[50]:
location city country pollutant value timestamp unit source_name latitude longitude averaged_over_in_hours
0 Mobile_Cle Elum 037 US pm25 0.00 2017-09-26 20:00:00+00:00 µg/m³ AirNow 47.197630 -120.958230 1.0
1 Mobile_WhiteSalmon 039 US pm25 0.00 2017-09-26 20:00:00+00:00 µg/m³ AirNow 45.732414 -121.492330 1.0
2 Mobile_Newport 051 US pm25 0.00 2017-09-21 18:00:00+00:00 µg/m³ AirNow 48.186485 -117.049160 1.0
3 FR33305 Ain FR no2 37.95 2018-07-13 06:00:00+00:00 µg/m³ EEA France 46.211666 5.226389 1.0
4 FR33305 Ain FR o3 14.38 2018-07-13 06:00:00+00:00 µg/m³ EEA France 46.211666 5.226389 1.0

Measuring SQL bigquery size before acrtually executing it with the bq_helper package;


In [55]:
query = """SELECT value
           FROM `bigquery-public-data.openaq.global_air_quality`
           WHERE value > 0"""
# ! the quotations marks around 'bigquery..._quality' are NOT quotation marks, they are and have to be 'backticks': ` !

In [56]:
open_aq.estimate_query_size(query)


Out[56]:
0.00012377649545669556

this means the SQL query above would take 0.000124 TB to run.


In [66]:
query2 = """SELECT value
            FROM `bigquery-public-data.openaq.global_air_quality`
            WHERE country = 'NL'"""
# ! the quotations marks around 'bigquery..._quality' are NOT quotation marks, they are and have to be 'backticks': ` !

In [67]:
open_aq.estimate_query_size(query2)


Out[67]:
0.00018566474318504333

and this one would cost 0.000186TB


In [70]:
#or in megabyte;
open_aq.estimate_query_size(query2) * 1000


Out[70]:
0.18566474318504333

In [ ]: