Dev Setups -- Connecting Python and SQL

The purpose of this IPython notebook is to demonstrate the usefulness of connecting python to a relational database by using a python toolkit called SQLAlchemy.

First off, what is a relational database?

Basically, it is a way to store information such that information can be retrieved from it.

MySQL and PostgreSQL are examples of relational databases. For the purposes of an Insight project, you can use either one.

Why would you use a relational database instead of a csv or two?

A few reasons:

They scale easily
They are easy to query
It’s possible to do transactions in those cases where you need to write to a database, not just read from it
Everyone in industry uses them, so you should get familiar with them, too.

What does a relational database look like?

We can take a look. First we need to set up a few things. The first thing we want to do is to get a PostgreSQL server up and running. Go to http://postgresapp.com/ and follow the three steps listed in the Quick Installation Guide. (If you aren't running a Mac, you can download PostgreSQL at http://www.postgresql.org/) -- you can also use homebrew, but your path will change below --

If you are running Linux a Fellow just posted install directions here: https://gist.github.com/mskoh52/a01d1af3acae43c2c341101a28e504be

We'll come back to PostgreSQL in a moment. First, we'll set up SQLAlchemy. Go ahead and try to implement the following.



In [1]:

    
## Python packages - you may have to pip install sqlalchemy, sqlalchemy_utils, and psycopg2.
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import pandas as pd

If working in an anaconda environment, we recommend using their install

conda install psycopg2

If you have trouble installing psycopg2 and get the error "pg_config executable not found", try adding "/Applications/Postgres.app/Contents/Versions/9.6/bin" to your PATH by typing the following in your terminal (you may have to check your version number):

export PATH="/Applications/Postgres.app/Contents/Versions/9.6/bin:$PATH"

Then try installing again:

pip install psycopg2



In [2]:

    
#In Python: Define a database name (we're using a dataset on births, so I call it 
# birth_db), and your username for your computer (CHANGE IT BELOW). 
dbname = 'birth_db'
username = 'rangel'

Start your postgresql server

There are multiple ways to launch a postgres server:

1) Launching Postres.app from LaunchPad will automatically start a server. In Mac OS, you should see an elephant icon in the upper right corner.

2) Launch from the terminal with the following command (CHANGE USER NAME):

postgres -D /Users/rockson/Library/Application\ Support/Postgres/var-9.6

3) Have launchd start postgresql at login:

ln -sfv /usr/local/opt/postgresql/*.plist ~/Library/LaunchAgents

Then to load postgresql now:

launchctl load ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist

Create a database



In [4]:

    
## 'engine' is a connection to a database
## Here, we're using postgres, but sqlalchemy can connect to other things too.
engine = create_engine('postgres://%s@localhost/%s'%(username,dbname))
print (engine.url)









    



postgres://rangel@localhost/birth_db



In [5]:

    
## create a database (if it doesn't exist)
if not database_exists(engine.url):
    create_database(engine.url)
print(database_exists(engine.url))









    



True



In [6]:

    
# read a database from CSV and load it into a pandas dataframe
birth_data = pd.DataFrame.from_csv('births2012_downsampled.csv')



In [7]:

    
## insert data into database from Python (proof of concept - this won't be useful for big data, of course)
birth_data.to_sql('birth_data_table', engine, if_exists='replace')

The above line (to_sql) is doing a lot of heavy lifting. It's reading a dataframe, it's creating a table, and adding the data to the table. So SQLAlchemy is quite useful!

How this works outside of python:

open up the PostgreSQL app, click on the "Open psql" button in the bottom right corner,
or alternatively type

psql -h localhost

into the command line

Connect to the "birth_db" database we created

\c birth_db

You should see something like the following

You are now connected to database "birth_db" as user "rockson".

Then try the following query:

SELECT * FROM birth_data_table;

Note that the semi-colon indicates an end-of-statement.

You can see the table we created! But it's kinda ugly and hard to read.

Try a few other sample queries. Before you type in each one, ask yourself what you think the output will look like:

SELECT * FROM birth_data_table WHERE infant_sex='M';

SELECT COUNT(infant_sex) FROM birth_data_table WHERE infant_sex='M';

SELECT COUNT(gestation_weeks), infant_sex FROM birth_data_table WHERE infant_sex = 'M' GROUP BY gestation_weeks, infant_sex;

SELECT gestation_weeks, COUNT(gestation_weeks) FROM birth_data_table WHERE infant_sex = 'M' GROUP BY gestation_weeks;



In [8]:

    
## Now try the same queries, but in python!

# connect:
con = None
con = psycopg2.connect(database = dbname, user = username)

# query:
sql_query = """
SELECT * FROM birth_data_table WHERE delivery_method='Cesarean';
"""
birth_data_from_sql = pd.read_sql_query(sql_query,con)

birth_data_from_sql.head()









    Out[8]:







  
    
      
      index
      alcohol_use
      anencephaly
      attendant
      birth_loc_type
      birth_month
      birth_state
      birth_weight
      birth_year
      cigarette_use
      ...
      mother_state
      population
      pregnancy_weight
      resident
      revision
      spina_bifida
      table
      timestamp
      uses_tobacco
      weight_gain
    
  
  
    
      0
      1
      None
      NaN
      MD
      NaN
      Jan
      None
      4500.0
      2012
      None
      ...
      None
      None
      None
      Intra-State/Territor Non-resident (diff county)
      S
      NaN
      births12.txt
      1325882986
      None
      49.0
    
    
      1
      2
      None
      NaN
      MD
      NaN
      Jan
      None
      2500.0
      2012
      None
      ...
      None
      None
      None
      Resident
      S
      NaN
      births12.txt
      1326367089
      None
      30.0
    
    
      2
      13
      None
      NaN
      MD
      NaN
      Mar
      None
      4500.0
      2012
      None
      ...
      None
      None
      None
      Resident
      S
      NaN
      births12.txt
      1331645804
      None
      27.0
    
    
      3
      14
      None
      NaN
      MD
      NaN
      Mar
      None
      5000.0
      2012
      None
      ...
      None
      None
      None
      Resident
      S
      NaN
      births12.txt
      1332142969
      None
      70.0
    
    
      4
      18
      None
      NaN
      MD
      NaN
      Apr
      None
      4500.0
      2012
      None
      ...
      None
      None
      None
      Resident
      S
      NaN
      births12.txt
      1334107348
      None
      10.0
    
  

5 rows × 38 columns

Is reading from a SQL database faster than from a Pandas dataframe? Probably not for the amount of data you can fit on your machine.



In [10]:

    
import time

t0 = time.time()
birth_data_from_sql = pd.read_sql_query(sql_query,con)
t1 = time.time()
total = t1-t0
print (total)

birth_data_from_sql.head()









    



0.04458189010620117






    Out[10]:







  
    
      
      index
      alcohol_use
      anencephaly
      attendant
      birth_loc_type
      birth_month
      birth_state
      birth_weight
      birth_year
      cigarette_use
      ...
      mother_state
      population
      pregnancy_weight
      resident
      revision
      spina_bifida
      table
      timestamp
      uses_tobacco
      weight_gain
    
  
  
    
      0
      1
      None
      NaN
      MD
      NaN
      Jan
      None
      4500.0
      2012
      None
      ...
      None
      None
      None
      Intra-State/Territor Non-resident (diff county)
      S
      NaN
      births12.txt
      1325882986
      None
      49.0
    
    
      1
      2
      None
      NaN
      MD
      NaN
      Jan
      None
      2500.0
      2012
      None
      ...
      None
      None
      None
      Resident
      S
      NaN
      births12.txt
      1326367089
      None
      30.0
    
    
      2
      13
      None
      NaN
      MD
      NaN
      Mar
      None
      4500.0
      2012
      None
      ...
      None
      None
      None
      Resident
      S
      NaN
      births12.txt
      1331645804
      None
      27.0
    
    
      3
      14
      None
      NaN
      MD
      NaN
      Mar
      None
      5000.0
      2012
      None
      ...
      None
      None
      None
      Resident
      S
      NaN
      births12.txt
      1332142969
      None
      70.0
    
    
      4
      18
      None
      NaN
      MD
      NaN
      Apr
      None
      4500.0
      2012
      None
      ...
      None
      None
      None
      Resident
      S
      NaN
      births12.txt
      1334107348
      None
      10.0
    
  

5 rows × 38 columns



In [12]:

    
birth_data = pd.DataFrame.from_csv('births2012_downsampled.csv')

t0 = time.time()
birth_data=birth_data.loc[(birth_data['delivery_method'] == 'Cesarean')]
t1 = time.time()
total = t1-t0
print (total)

birth_data.head()









    



0.006116151809692383






    Out[12]:







  
    
      
      alcohol_use
      anencephaly
      attendant
      birth_loc_type
      birth_month
      birth_state
      birth_weight
      birth_year
      cigarette_use
      cigarettes_per_day
      ...
      mother_state
      population
      pregnancy_weight
      resident
      revision
      spina_bifida
      table
      timestamp
      uses_tobacco
      weight_gain
    
  
  
    
      1
      NaN
      NaN
      MD
      NaN
      Jan
      NaN
      4500.0
      2012
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      Intra-State/Territor Non-resident (diff county)
      S
      NaN
      births12.txt
      1325882986
      NaN
      49.0
    
    
      2
      NaN
      NaN
      MD
      NaN
      Jan
      NaN
      2500.0
      2012
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      Resident
      S
      NaN
      births12.txt
      1326367089
      NaN
      30.0
    
    
      13
      NaN
      NaN
      MD
      NaN
      Mar
      NaN
      4500.0
      2012
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      Resident
      S
      NaN
      births12.txt
      1331645804
      NaN
      27.0
    
    
      14
      NaN
      NaN
      MD
      NaN
      Mar
      NaN
      5000.0
      2012
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      Resident
      S
      NaN
      births12.txt
      1332142969
      NaN
      70.0
    
    
      18
      NaN
      NaN
      MD
      NaN
      Apr
      NaN
      4500.0
      2012
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      Resident
      S
      NaN
      births12.txt
      1334107348
      NaN
      10.0
    
  

5 rows × 37 columns

This should have given you a quick taste of how to use SQLALchemy, as well as how to run a few SQL queries both at the command line and in python. You can see that pandas is actually a little faster than PostgreSQL here - that is because of the extra time it takes to communicate between python and PostGreSQL. But as your database gets bigger (and certainly when it's too large to store in memory), working with relational databases becomes a necessity.

	index	alcohol_use	anencephaly	attendant	birth_loc_type	birth_month	birth_state	birth_weight	birth_year	cigarette_use	...	mother_state	population	pregnancy_weight	resident	revision	spina_bifida	table	timestamp	uses_tobacco	weight_gain
0	1	None	NaN	MD	NaN	Jan	None	4500.0	2012	None	...	None	None	None	Intra-State/Territor Non-resident (diff county)	S	NaN	births12.txt	1325882986	None	49.0
1	2	None	NaN	MD	NaN	Jan	None	2500.0	2012	None	...	None	None	None	Resident	S	NaN	births12.txt	1326367089	None	30.0
2	13	None	NaN	MD	NaN	Mar	None	4500.0	2012	None	...	None	None	None	Resident	S	NaN	births12.txt	1331645804	None	27.0
3	14	None	NaN	MD	NaN	Mar	None	5000.0	2012	None	...	None	None	None	Resident	S	NaN	births12.txt	1332142969	None	70.0
4	18	None	NaN	MD	NaN	Apr	None	4500.0	2012	None	...	None	None	None	Resident	S	NaN	births12.txt	1334107348	None	10.0