Python Guide

Google Cloud Storage

Refer to the storage directory in the same folder.


BigQuery

Refer to the bigquery directory in the same folder.


Cloud SQL

Refer to the sql directory in the same folder.


Installing Python modules using pip

  • Below is the command to install the pymysql package from the command line:

    pip install pymysql

  • To run shell commands from inside JupyterLab notebooks, prefix the command with an exclamation mark:

    !pip install pymysql


In [ ]:
!pip install pymysql

The JupyterLab notebook VMs created using the initialization script come pre-installed with a wide range of Python packages. Please refer to Section 3.1.2 of the TDD document for the full list.
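
To check which packages are already available on the VM, you can list them from a notebook cell, for example:


In [ ]:
!pip list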


Authenticating Python API clients for GCS, BigQuery and Cloud SQL

Retrieve Service Account JSON key for authentication

An authentication call to GCS, BigQuery and Cloud SQL can be made using a valid Service Account JSON key. If you do not have a Service Account yet, follow the steps given here to retrieve a Service Account JSON key. Retrieving the key involves the steps below (a minimal gcloud sketch follows the list):

1. Create a Service Account
2. Assign it the required IAM permissions
3. Download the Service Account JSON key
4. Upload the key to a location from where it can be used
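
As an illustration, the first three of these steps can be carried out with the gcloud CLI along the lines of the sketch below. The project ID, Service Account name and IAM role shown are placeholders only; replace them with your own values.


In [ ]:
# Placeholders: my-project, my-service-account and roles/storage.objectViewer are examples only
!gcloud iam service-accounts create my-service-account --project=my-project
!gcloud projects add-iam-policy-binding my-project --member="serviceAccount:my-service-account@my-project.iam.gserviceaccount.com" --role="roles/storage.objectViewer"
!gcloud iam service-accounts keys create service_account.json --iam-account=my-service-account@my-project.iam.gserviceaccount.com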

Once the key has been obtained, code snippets provided below can be used to authenticate the user and connect to the services via Python APIs.

Google Cloud Storage

Provide the path to the JSON key of a valid Service Account in order to authenticate a GCS client.


In [ ]:
from google.cloud import storage

# TODO(developer):
# Set key_path to the path to the service account key file.
# Set bucket_name to the name of the bucket you want to read from.
# Set file_name to the path of the file inside the bucket, including any folder structure.
# bucket_name and file_name together form the GCS object URL: gs://bucket_name/file_name
key_path = "path/to/service_account.json"
bucket_name = "name-of-your-bucket"
file_name = "filename.extension"

storage_client = storage.Client.from_service_account_json(key_path)
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob(file_name)

# Download the entire contents of the specified GCS object as a byte string
contents = blob.download_as_string()
print(contents)
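
If you would rather save the object to a local file instead of keeping it in memory, the same blob can be used; for example (the local file name below is just a placeholder):


In [ ]:
# Download the GCS object and write it to a local file
blob.download_to_filename("local_file.extension")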

BigQuery

Provide the path to the JSON key of a valid Service Account in order to authenticate a BigQuery client.


In [ ]:
from google.cloud import bigquery
from google.oauth2 import service_account
# TODO(developer):
# Set key_path to the path to the service account key file.
key_path = "path/to/service_account.json"

credentials = service_account.Credentials.from_service_account_file(
    key_path,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

client = bigquery.Client(
    credentials=credentials,
    project=credentials.project_id,
)
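
As a quick check that the client is authenticated, you can, for example, list the datasets visible in the project:


In [ ]:
# List the datasets in the project the credentials belong to
for dataset in client.list_datasets():
    print(dataset.dataset_id)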

Cloud SQL

The Cloud SQL proxy connection can be authenticated at the time of connection creation. Please refer to this guide to learn more about Cloud SQL authentication.

Download the Cloud SQL proxy


In [ ]:
!wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
!chmod +x cloud_sql_proxy

Run the command below to start the Cloud SQL proxy.


In [ ]:
# TODO(developer):
# Set <INSTANCE_NAME> to the instance connection name of the Cloud SQL instance,
# in the form PROJECT_ID:REGION:INSTANCE_ID.
# Set <PATH_TO_KEY_FILE> to the path to the service account key file.
# Example: <PATH_TO_KEY_FILE> = "path/to/service_account.json"

!./cloud_sql_proxy -instances=<INSTANCE_NAME>=tcp:3306 -credential_file=<PATH_TO_KEY_FILE> &

In [ ]:
import pymysql

# TODO(developer): 
# Change USERNAME and PASSWORD to the user and password created on the Cloud SQL instance.
# Set DB to the name of the database to be connected to.

connection = pymysql.connect(host='127.0.0.1',
                             user='USERNAME',
                             password='PASSWORD',
                             db='DB')
mycursor = connection.cursor()
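
As a quick sanity check of the connection, the cursor can be used to run a simple query, for example:


In [ ]:
# List the tables in the connected database to verify the connection
mycursor.execute("SHOW TABLES")
for row in mycursor.fetchall():
    print(row)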

Creating Pandas DataFrames

BigQuery

Use the Client.query method to run the query, and the QueryJob.to_dataframe method to return the results as a pandas DataFrame.


In [ ]:
from google.cloud import bigquery
import pandas
client = bigquery.Client(location="US")
print("Client creating using default project: {}".format(client.project))
query = """
    SELECT name, SUM(number) as total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
query_job = client.query(
    query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
)  # API request - starts the query

df = query_job.to_dataframe()
df

Google Cloud Storage

The pandas read_csv method supports reading DataFrames directly from a GCS file URL.


In [ ]:
import pandas as pd
df = pd.read_csv('gs://BUCKET/your_file.csv')
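
Behind the scenes pandas uses the gcsfs package to read gs:// URLs, so the call above needs valid Google Cloud credentials available in the environment. If you are not running with application default credentials, one option (depending on the pandas and gcsfs versions installed) is to pass the Service Account key explicitly via storage_options:


In [ ]:
import pandas as pd

# storage_options is forwarded to gcsfs; "token" can point to a Service Account key file
df = pd.read_csv('gs://BUCKET/your_file.csv',
                 storage_options={"token": "path/to/service_account.json"})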

Cloud SQL

The pandas read_sql method can be used to read a table from a Cloud SQL database into a DataFrame.


In [ ]:
import pymysql
import pandas as pd

# TODO(developer):
# Change USERNAME and PASSWORD to the user and password created on the Cloud SQL instance.
# Set DB to the name of the database to be connected to.
# Set start_date and end_date to the date range you want to query.

connection = pymysql.connect(host='127.0.0.1',
                             user='USERNAME',
                             password='PASSWORD',
                             db='DB')

# pymysql uses %s-style placeholders for query parameters
query = "SELECT * FROM orders WHERE date_time BETWEEN %s AND %s"
df = pd.read_sql(query, connection, params=(start_date, end_date))
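
Note that newer pandas versions emit a warning when read_sql is used with a raw DBAPI connection. If that is a concern, one option is to wrap the connection in a SQLAlchemy engine instead; the sketch below assumes SQLAlchemy is installed and reuses the same USERNAME, PASSWORD and DB placeholders.


In [ ]:
import pandas as pd
from sqlalchemy import create_engine

# mysql+pymysql URL pointing at the local Cloud SQL proxy; replace the placeholders
engine = create_engine("mysql+pymysql://USERNAME:PASSWORD@127.0.0.1:3306/DB")
df = pd.read_sql("SELECT * FROM orders LIMIT 10", engine)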