Dev Setups -- Connecting Python and SQL

The purpose of this IPython notebook is to demonstrate the usefulness of connecting python to a relational database by using a python toolkit called SQLAlchemy.

First off, what is a relational database?

Basically, it is a way to store information such that information can be retrieved from it.

MySQL and PostgreSQL are examples of relational databases. For the purposes of an Insight project, you can use either one.

Why would you use a relational database instead of a csv or two?

A few reasons:

  • They scale easily

  • They are easy to query

  • It’s possible to do transactions in those cases where you need to write to a database, not just read from it

  • Everyone in industry uses them, so you should get familiar with them, too.

What does a relational database look like?

We can take a look. First we need to set up a few things. The first thing we want to do is to get a PostgreSQL server up and running. Go to http://postgresapp.com/ and follow the three steps listed in the Quick Installation Guide. (If you aren't running a Mac, you can download PostgreSQL at http://www.postgresql.org/) -- you can also use homebrew, but your path will change below --

If you are running Linux a Fellow just posted install directions here: https://gist.github.com/mskoh52/a01d1af3acae43c2c341101a28e504be

We'll come back to PostgreSQL in a moment. First, we'll set up SQLAlchemy. Go ahead and try to implement the following.


In [1]:
## Python packages - you may have to pip install sqlalchemy, sqlalchemy_utils, and psycopg2.
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import pandas as pd

If working in an anaconda environment, we recommend using their install

conda install psycopg2

If you have trouble installing psycopg2 and get the error "pg_config executable not found", try adding "/Applications/Postgres.app/Contents/Versions/9.6/bin" to your PATH by typing the following in your terminal (you may have to check your version number):

export PATH="/Applications/Postgres.app/Contents/Versions/9.6/bin:$PATH"

Then try installing again:

pip install psycopg2

In [2]:
#In Python: Define a database name (we're using a dataset on births, so I call it 
# birth_db), and your username for your computer (CHANGE IT BELOW). 
dbname = 'birth_db'
username = 'rangel'

Start your postgresql server

There are multiple ways to launch a postgres server:

1) Launching Postres.app from LaunchPad will automatically start a server. In Mac OS, you should see an elephant icon in the upper right corner.

2) Launch from the terminal with the following command (CHANGE USER NAME):

postgres -D /Users/rockson/Library/Application\ Support/Postgres/var-9.6

3) Have launchd start postgresql at login:

ln -sfv /usr/local/opt/postgresql/*.plist ~/Library/LaunchAgents

Then to load postgresql now:

launchctl load ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist

Create a database


In [4]:
## 'engine' is a connection to a database
## Here, we're using postgres, but sqlalchemy can connect to other things too.
engine = create_engine('postgres://%s@localhost/%s'%(username,dbname))
print (engine.url)


postgres://rangel@localhost/birth_db

In [5]:
## create a database (if it doesn't exist)
if not database_exists(engine.url):
    create_database(engine.url)
print(database_exists(engine.url))


True

In [6]:
# read a database from CSV and load it into a pandas dataframe
birth_data = pd.DataFrame.from_csv('births2012_downsampled.csv')

In [7]:
## insert data into database from Python (proof of concept - this won't be useful for big data, of course)
birth_data.to_sql('birth_data_table', engine, if_exists='replace')

The above line (to_sql) is doing a lot of heavy lifting. It's reading a dataframe, it's creating a table, and adding the data to the table. So SQLAlchemy is quite useful!

How this works outside of python:

open up the PostgreSQL app, click on the "Open psql" button in the bottom right corner,
or alternatively type

psql -h localhost

into the command line

Connect to the "birth_db" database we created

\c birth_db

You should see something like the following

You are now connected to database "birth_db" as user "rockson".

Then try the following query:

SELECT * FROM birth_data_table;

Note that the semi-colon indicates an end-of-statement.

You can see the table we created! But it's kinda ugly and hard to read.

Try a few other sample queries. Before you type in each one, ask yourself what you think the output will look like:

SELECT * FROM birth_data_table WHERE infant_sex='M';

SELECT COUNT(infant_sex) FROM birth_data_table WHERE infant_sex='M';

SELECT COUNT(gestation_weeks), infant_sex FROM birth_data_table WHERE infant_sex = 'M' GROUP BY gestation_weeks, infant_sex;

SELECT gestation_weeks, COUNT(gestation_weeks) FROM birth_data_table WHERE infant_sex = 'M' GROUP BY gestation_weeks;


In [8]:
## Now try the same queries, but in python!

# connect:
con = None
con = psycopg2.connect(database = dbname, user = username)

# query:
sql_query = """
SELECT * FROM birth_data_table WHERE delivery_method='Cesarean';
"""
birth_data_from_sql = pd.read_sql_query(sql_query,con)

birth_data_from_sql.head()


Out[8]:
index alcohol_use anencephaly attendant birth_loc_type birth_month birth_state birth_weight birth_year cigarette_use ... mother_state population pregnancy_weight resident revision spina_bifida table timestamp uses_tobacco weight_gain
0 1 None NaN MD NaN Jan None 4500.0 2012 None ... None None None Intra-State/Territor Non-resident (diff county) S NaN births12.txt 1325882986 None 49.0
1 2 None NaN MD NaN Jan None 2500.0 2012 None ... None None None Resident S NaN births12.txt 1326367089 None 30.0
2 13 None NaN MD NaN Mar None 4500.0 2012 None ... None None None Resident S NaN births12.txt 1331645804 None 27.0
3 14 None NaN MD NaN Mar None 5000.0 2012 None ... None None None Resident S NaN births12.txt 1332142969 None 70.0
4 18 None NaN MD NaN Apr None 4500.0 2012 None ... None None None Resident S NaN births12.txt 1334107348 None 10.0

5 rows × 38 columns

Is reading from a SQL database faster than from a Pandas dataframe? Probably not for the amount of data you can fit on your machine.


In [10]:
import time

t0 = time.time()
birth_data_from_sql = pd.read_sql_query(sql_query,con)
t1 = time.time()
total = t1-t0
print (total)

birth_data_from_sql.head()


0.04458189010620117
Out[10]:
index alcohol_use anencephaly attendant birth_loc_type birth_month birth_state birth_weight birth_year cigarette_use ... mother_state population pregnancy_weight resident revision spina_bifida table timestamp uses_tobacco weight_gain
0 1 None NaN MD NaN Jan None 4500.0 2012 None ... None None None Intra-State/Territor Non-resident (diff county) S NaN births12.txt 1325882986 None 49.0
1 2 None NaN MD NaN Jan None 2500.0 2012 None ... None None None Resident S NaN births12.txt 1326367089 None 30.0
2 13 None NaN MD NaN Mar None 4500.0 2012 None ... None None None Resident S NaN births12.txt 1331645804 None 27.0
3 14 None NaN MD NaN Mar None 5000.0 2012 None ... None None None Resident S NaN births12.txt 1332142969 None 70.0
4 18 None NaN MD NaN Apr None 4500.0 2012 None ... None None None Resident S NaN births12.txt 1334107348 None 10.0

5 rows × 38 columns


In [12]:
birth_data = pd.DataFrame.from_csv('births2012_downsampled.csv')

t0 = time.time()
birth_data=birth_data.loc[(birth_data['delivery_method'] == 'Cesarean')]
t1 = time.time()
total = t1-t0
print (total)

birth_data.head()


0.006116151809692383
Out[12]:
alcohol_use anencephaly attendant birth_loc_type birth_month birth_state birth_weight birth_year cigarette_use cigarettes_per_day ... mother_state population pregnancy_weight resident revision spina_bifida table timestamp uses_tobacco weight_gain
1 NaN NaN MD NaN Jan NaN 4500.0 2012 NaN NaN ... NaN NaN NaN Intra-State/Territor Non-resident (diff county) S NaN births12.txt 1325882986 NaN 49.0
2 NaN NaN MD NaN Jan NaN 2500.0 2012 NaN NaN ... NaN NaN NaN Resident S NaN births12.txt 1326367089 NaN 30.0
13 NaN NaN MD NaN Mar NaN 4500.0 2012 NaN NaN ... NaN NaN NaN Resident S NaN births12.txt 1331645804 NaN 27.0
14 NaN NaN MD NaN Mar NaN 5000.0 2012 NaN NaN ... NaN NaN NaN Resident S NaN births12.txt 1332142969 NaN 70.0
18 NaN NaN MD NaN Apr NaN 4500.0 2012 NaN NaN ... NaN NaN NaN Resident S NaN births12.txt 1334107348 NaN 10.0

5 rows × 37 columns

This should have given you a quick taste of how to use SQLALchemy, as well as how to run a few SQL queries both at the command line and in python. You can see that pandas is actually a little faster than PostgreSQL here - that is because of the extra time it takes to communicate between python and PostGreSQL. But as your database gets bigger (and certainly when it's too large to store in memory), working with relational databases becomes a necessity.