We recommend you install Anaconda Python (2.7). Please also ensure you have have psycopg2 installed & configured for your operating system.
In [8]:
%matplotlib inline
import psycopg2
import pandas as pd
import pandas.io.sql as psql
import ConfigParser
import os
import IPython
from IPython.display import display
from IPython.display import HTML
#Maximum height of a result set
pd.set_option('display.max_rows', 500)
#Maximum number of rows to display in a cell.
pd.set_option('display.max_columns', 100)
#Show the full content of columns
pd.set_option('display.max_colwidth', -1)
~ vim ~/.dslab_user.cred
The content of this file should look like so (with appropriate values for HOSTNAME, PORT, USER, DATABASE & PASSWORD).
[database_creds]
host: HOSTNAME
port: PORT
user: USER
database: DATABASE
password: PASSWORD
~ chmod 700 ~/.dslab_user.cred
You should see the following
~ ls -l ~/.dslab_user.cred
-rwx------ 1 USER 720748206 93 Jun 29 17:27 $HOME/.dslab_user.cred
In [9]:
USER_CRED_FILE = os.path.join(os.path.expanduser('~'), '.dslab_user.cred')
def fetchDBCredentials(dbcred_file=USER_CRED_FILE):
"""
Read database access credentials from the file in $HOME/.ipynb_dslab.cred
"""
#Read database credentials from user supplied file
conf = ConfigParser.ConfigParser()
conf.read(dbcred_file)
#host, port, user, database, password
host = conf.get('database_creds','host')
port = conf.get('database_creds','port')
user = conf.get('database_creds','user')
database = conf.get('database_creds','database')
password = conf.get('database_creds','password')
#Initialize connection string
conn_str = """dbname='{database}' user='{user}' host='{host}' port='{port}' password='{password}'""".format(
database=database,
host=host,
port=port,
user=user,
password=password
)
return conn_str
In [10]:
conn = psycopg2.connect(fetchDBCredentials())
In [21]:
from IPython.core.magic import (register_line_magic, register_cell_magic,
register_line_cell_magic)
@register_cell_magic
def showsql(line, cell):
"""
Extract the code in the specific cell (should be valid SQL), and execute
it using the connection object to the backend database.
The resulting Pandas dataframe
is rendered inline below the cell using IPython.display.
You'd use this for SELECT
"""
#Use the global connection object defined above.
global conn
_ = psql.read_sql(cell, conn)
conn.commit()
display(_)
return
@register_cell_magic
def execsql(line, cell):
"""
Extract the code in the specific cell (should be valid SQL), and execute
it using the connection object to the backend database.
You'd use this for CREATE/UPDATE/DELETE
"""
#Use the global connection object defined above.
global conn
_ = psql.execute(cell, conn)
conn.commit()
return
# We delete these to avoid name conflicts for automagic to work
del execsql, showsql
%%showsql
select
random() as x,
random() as y
from
generate_series(1, 5);
df = psql.read_sql("""select random() as x, random() as y from generate_series(1, 10) q;""", conn)
df.head()
If you see an HTML table of the result set above, your connection to the database was successful.
If you want to open multiple connections (say one for GPDB and for your HAWQ cluster), you can create another file similar to ~/.dslab_user.cred, populate the appropriate credentials, and supply this file as input to the fetchDBCredentials() function shown above.
For instance, let's say you created another file ~/.dslab_user.cred.gpdb containing the appropriate credentials to connect to your GPDB cluster, then you can open a connection to this cluster with psycopg2 as follows:
db_cred_file_gpdb = os.path.join(os.path.expanduser('~'), '.dslab_user.cred.gpdb')
conn_gpdb = psycopg2.connect(fetchDBCredentials(db_cred_file_gpdb))
df = psql.read_sql("""select random() as x, random() as y from generate_series(1, 10) q;""", conn_gpdb)
df.head()
For all subsequent instance where you want to query the data on GPDB, you can use conn_gpdb in place of conn in your code.