First off, what is a relational database?
Basically, it is a way to store information such that information can be retrieved from it.
MySQL and PostgreSQL are examples of relational databases. For the purposes of an Insight project, you can use either one.
Why would you use a relational database instead of a csv or two?
A few reasons:
They scale easily
They are easy to query
It’s possible to do transactions in those cases where you need to write to a database, not just read from it
Everyone in industry uses them, so you should get familiar with them, too.
What does a relational database look like?
We can take a look. First we need to set up a few things. The first thing we want to do is to get a PostgreSQL server up and running. Go to http://postgresapp.com/ and follow the three steps listed in the Quick Installation Guide. (If you aren't running a Mac, you can download PostgreSQL at http://www.postgresql.org/) -- you can also use homebrew, but your path will change below --
If you are running Linux a Fellow just posted install directions here: https://gist.github.com/mskoh52/a01d1af3acae43c2c341101a28e504be
We'll come back to PostgreSQL in a moment. First, we'll set up SQLAlchemy. Go ahead and try to implement the following.
In [1]:
## Python packages - you may have to pip install sqlalchemy, sqlalchemy_utils, and psycopg2.
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import pandas as pd
If working in an anaconda environment, we recommend using their install
conda install psycopg2
If you have trouble installing psycopg2 and get the error "pg_config executable not found", try adding "/Applications/Postgres.app/Contents/Versions/9.6/bin" to your PATH by typing the following in your terminal (you may have to check your version number):
export PATH="/Applications/Postgres.app/Contents/Versions/9.6/bin:$PATH"
Then try installing again:
pip install psycopg2
In [2]:
#In Python: Define a database name (we're using a dataset on births, so I call it
# birth_db), and your username for your computer (CHANGE IT BELOW).
dbname = 'birth_db'
username = 'rangel'
There are multiple ways to launch a postgres server:
1) Launching Postres.app from LaunchPad will automatically start a server. In Mac OS, you should see an elephant icon in the upper right corner.
2) Launch from the terminal with the following command (CHANGE USER NAME):
postgres -D /Users/rockson/Library/Application\ Support/Postgres/var-9.6
3) Have launchd start postgresql at login:
ln -sfv /usr/local/opt/postgresql/*.plist ~/Library/LaunchAgents
Then to load postgresql now:
launchctl load ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist
In [4]:
## 'engine' is a connection to a database
## Here, we're using postgres, but sqlalchemy can connect to other things too.
engine = create_engine('postgres://%s@localhost/%s'%(username,dbname))
print (engine.url)
In [5]:
## create a database (if it doesn't exist)
if not database_exists(engine.url):
create_database(engine.url)
print(database_exists(engine.url))
In [6]:
# read a database from CSV and load it into a pandas dataframe
birth_data = pd.DataFrame.from_csv('births2012_downsampled.csv')
In [7]:
## insert data into database from Python (proof of concept - this won't be useful for big data, of course)
birth_data.to_sql('birth_data_table', engine, if_exists='replace')
The above line (to_sql) is doing a lot of heavy lifting. It's reading a dataframe, it's creating a table, and adding the data to the table. So SQLAlchemy is quite useful!
Connect to the "birth_db" database we created
\c birth_db
You should see something like the following
You are now connected to database "birth_db" as user "rockson".
Then try the following query:
SELECT * FROM birth_data_table;
Note that the semi-colon indicates an end-of-statement.
Try a few other sample queries. Before you type in each one, ask yourself what you think the output will look like:
SELECT * FROM birth_data_table WHERE infant_sex='M';
SELECT COUNT(infant_sex) FROM birth_data_table WHERE infant_sex='M';
SELECT COUNT(gestation_weeks), infant_sex FROM birth_data_table WHERE infant_sex = 'M' GROUP BY gestation_weeks, infant_sex;
SELECT gestation_weeks, COUNT(gestation_weeks) FROM birth_data_table WHERE infant_sex = 'M' GROUP BY gestation_weeks;
In [8]:
## Now try the same queries, but in python!
# connect:
con = None
con = psycopg2.connect(database = dbname, user = username)
# query:
sql_query = """
SELECT * FROM birth_data_table WHERE delivery_method='Cesarean';
"""
birth_data_from_sql = pd.read_sql_query(sql_query,con)
birth_data_from_sql.head()
Out[8]:
In [10]:
import time
t0 = time.time()
birth_data_from_sql = pd.read_sql_query(sql_query,con)
t1 = time.time()
total = t1-t0
print (total)
birth_data_from_sql.head()
Out[10]:
In [12]:
birth_data = pd.DataFrame.from_csv('births2012_downsampled.csv')
t0 = time.time()
birth_data=birth_data.loc[(birth_data['delivery_method'] == 'Cesarean')]
t1 = time.time()
total = t1-t0
print (total)
birth_data.head()
Out[12]:
This should have given you a quick taste of how to use SQLALchemy, as well as how to run a few SQL queries both at the command line and in python. You can see that pandas is actually a little faster than PostgreSQL here - that is because of the extra time it takes to communicate between python and PostGreSQL. But as your database gets bigger (and certainly when it's too large to store in memory), working with relational databases becomes a necessity.