Kaggle works great with Google BigQuery, especially when using the 'bq_helper' package (Python).
Credits and a big thanks to Rachael Tatman et al.
However, there is a caveat: Kaggle imposes a 5 TB query limit. This refers to how much of the dataset the query scans, not the size of the 'response'.
This kernel uses the OpenAQ dataset and the bq_helper package (Python) to demonstrate how to check the 'scan' size of your SQL query before actually sending it.
In [47]:
# import the Python helper package for BigQuery (thank you Rachael Tatman et al.)
import bq_helper
In [48]:
# create the helper object
open_aq = bq_helper.BigQueryHelper(active_project="bigquery-public-data", dataset_name="openaq")
In [49]:
# print the tables in the dataset to check everything went ok so far
open_aq.list_tables()
Out[49]:
In [50]:
# print the first couple of rows to look at the structure of the dataset
# note this is somewhat different from the usual DataFrame syntax: dataframe_name.head(number_of_rows_to_show)
# you can still show e.g. only 3 rows by typing: open_aq.head("global_air_quality", num_rows=3)
open_aq.head("global_air_quality")
Out[50]:
Measuring a BigQuery SQL query's scan size before actually executing it with the bq_helper package:
In [55]:
query = """SELECT value
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE value > 0"""
# ! the marks around `bigquery-public-data.openaq.global_air_quality` are NOT quotation marks, they are and have to be backticks: ` !
In [56]:
open_aq.estimate_query_size(query)
Out[56]:
This means the SQL query above would scan about 0.000124 GB to run (estimate_query_size reports its estimate in gigabytes).
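To put that in perspective, a quick back-of-the-envelope calculation (a sketch; the per-query scan size below is the hypothetical figure from this kernel, and the 5 TB cap is taken as the budget) shows how many queries of this size would fit under the limit:

```python
# Hypothetical numbers: one query's scan estimate (in GB) and the 5 TB budget.
scan_gb = 0.000124           # scan size of a single small query, in gigabytes
budget_gb = 5 * 1024         # 5 TB expressed in GB (1 TB = 1024 GB)

# how many queries of this size fit inside the budget?
queries_in_budget = int(budget_gb / scan_gb)
print(queries_in_budget)
```

So a small single-column query like this one barely dents the budget; it is the `SELECT *` style scans over large tables that eat it up.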
In [66]:
query2 = """SELECT value
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE country = 'NL'"""
# ! the marks around `bigquery-public-data.openaq.global_air_quality` are NOT quotation marks, they are and have to be backticks: ` !
In [67]:
open_aq.estimate_query_size(query2)
Out[67]:
and this one would scan about 0.000186 GB
In [70]:
# or roughly in megabytes (1 GB ≈ 1000 MB):
open_aq.estimate_query_size(query2) * 1000
Out[70]:
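Since the estimate comes back as a bare number, a tiny helper (hypothetical, not part of bq_helper) can express it in several units at once for readability, assuming the input is in gigabytes:

```python
# Hypothetical helper: convert a gigabyte figure into MB, GB and TB.
def gb_to_units(gb):
    """Return the same size expressed in decimal MB, GB and TB."""
    return {
        "MB": gb * 1000,   # same conversion as the cell above
        "GB": gb,
        "TB": gb / 1000,
    }

print(gb_to_units(0.000186))
```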
In [ ]: