Imports and dependencies

We recommend you install Anaconda Python (2.7). Please also ensure you have have psycopg2 installed & configured for your operating system.


In [8]:
%matplotlib inline
import psycopg2
import pandas as pd
import pandas.io.sql as psql
import ConfigParser
import os
import IPython
from IPython.display import display
from IPython.display import HTML
#Maximum height of a result set
pd.set_option('display.max_rows', 500)
#Maximum number of rows to display in a cell.
pd.set_option('display.max_columns', 100)
#Show the full content of columns
pd.set_option('display.max_colwidth', -1)

Setting up database connectivity

  1. Create a file in your home directory containing the database credentials for your target environment. For example:
    ~ vim ~/.dslab_user.cred
    The content of this file should look like so (with appropriate values for HOSTNAME, PORT, USER, DATABASE & PASSWORD).
    [database_creds]
    host: HOSTNAME
    port: PORT
    user: USER
    database: DATABASE
    password: PASSWORD
  2. Please set the permissions of this file to u+rwx (700), so that only you can access this file.
    ~ chmod 700 ~/.dslab_user.cred
    You should see the following
    ~ ls -l ~/.dslab_user.cred 
    -rwx------  1 USER  720748206  93 Jun 29 17:27 $HOME/.dslab_user.cred

Creating database connection string


In [9]:
USER_CRED_FILE = os.path.join(os.path.expanduser('~'), '.dslab_user.cred')
def fetchDBCredentials(dbcred_file=USER_CRED_FILE):
    """
       Read database access credentials from the file in $HOME/.ipynb_dslab.cred
    """
    #Read database credentials from user supplied file
    conf = ConfigParser.ConfigParser()
    conf.read(dbcred_file)
    #host, port, user, database, password
    host = conf.get('database_creds','host')
    port = conf.get('database_creds','port')
    user = conf.get('database_creds','user')
    database = conf.get('database_creds','database')
    password = conf.get('database_creds','password')

    #Initialize connection string
    conn_str =  """dbname='{database}' user='{user}' host='{host}' port='{port}' password='{password}'""".format(                       
                    database=database,
                    host=host,
                    port=port,
                    user=user,
                    password=password
            )
    return conn_str

Create a connection object to the database


In [10]:
conn = psycopg2.connect(fetchDBCredentials())

Define magic commands to run SQL inline


In [21]:
from IPython.core.magic import (register_line_magic, register_cell_magic,
                                register_line_cell_magic)

@register_cell_magic
def showsql(line, cell):
    """
        Extract the code in the specific cell (should be valid SQL), and execute
        it using the connection object to the backend database. 
        The resulting Pandas dataframe
        is rendered inline below the cell using IPython.display.
        You'd use this for SELECT
    """
    #Use the global connection object defined above.
    global conn
    _ = psql.read_sql(cell, conn)
    conn.commit()
    display(_)
    return
    
@register_cell_magic
def execsql(line, cell):
    """
        Extract the code in the specific cell (should be valid SQL), and execute
        it using the connection object to the backend database. 
        You'd use this for CREATE/UPDATE/DELETE
    """
    #Use the global connection object defined above.
    global conn
    _ = psql.execute(cell, conn)
    conn.commit()
    return

# We delete these to avoid name conflicts for automagic to work
del execsql, showsql

Test your connection to the database

Using the showsql magic command we defined above

%%showsql
select
    random() as x,
    random() as y
from
    generate_series(1, 5);

Alternatively you could also use plain-old pandas if you'd like to get a dataframe returned

df = psql.read_sql("""select random() as x, random() as y from generate_series(1, 10) q;""", conn)
df.head()

If you see an HTML table of the result set above, your connection to the database was successful.

Opening Connections to Multiple Clusters

If you want to open multiple connections (say one for GPDB and for your HAWQ cluster), you can create another file similar to ~/.dslab_user.cred, populate the appropriate credentials, and supply this file as input to the fetchDBCredentials() function shown above.

For instance, let's say you created another file ~/.dslab_user.cred.gpdb containing the appropriate credentials to connect to your GPDB cluster, then you can open a connection to this cluster with psycopg2 as follows:

db_cred_file_gpdb = os.path.join(os.path.expanduser('~'), '.dslab_user.cred.gpdb')
conn_gpdb = psycopg2.connect(fetchDBCredentials(db_cred_file_gpdb))
df = psql.read_sql("""select random() as x, random() as y from generate_series(1, 10) q;""", conn_gpdb)
df.head()

For all subsequent instance where you want to query the data on GPDB, you can use conn_gpdb in place of conn in your code.