SQL

SQL is a language for getting data from databases. It is also becoming essential to have some basic familiarity with SQL because it is universally used to slice and dice data across many different data storage formats and technologies.

We will only show the use of SQL as a query language here, since that is probably all that a statistician will use SQL for. However, SQL can also be used to create and modify tables, as well as manage database permissions.

Three ways of using SQL in Python will be shown:

  • Using SQL to query pandas data frames
  • Using SQL to query traditional database
  • Using SQL via Jupyter magic functions

SQL via pandas DataFrames

We will use the pandas-sql package to practice SQL syntax for querying a pandas DataFrame. Later, we will see how to use an actual relational database.


In [1]:
import seaborn as sns
import pandas as pd
import numpy as np

In [5]:
! pip install pandasql


Requirement already satisfied (use --upgrade to upgrade): pandasql in /opt/conda/lib/python3.5/site-packages
Requirement already satisfied (use --upgrade to upgrade): pandas in /opt/conda/lib/python3.5/site-packages (from pandasql)
Requirement already satisfied (use --upgrade to upgrade): numpy in /opt/conda/lib/python3.5/site-packages (from pandasql)
Requirement already satisfied (use --upgrade to upgrade): sqlalchemy in /opt/conda/lib/python3.5/site-packages (from pandasql)
Requirement already satisfied (use --upgrade to upgrade): python-dateutil>=2 in /opt/conda/lib/python3.5/site-packages (from pandas->pandasql)
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in /opt/conda/lib/python3.5/site-packages (from pandas->pandasql)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in /opt/conda/lib/python3.5/site-packages (from python-dateutil>=2->pandas->pandasql)
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

In [6]:
from pandasql import PandaSQL
pdsql = PandaSQL()

In [7]:
tips = sns.load_dataset('tips')
tips.head()


Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Using SQL for Queries

Note that SQL is case-insensitive, but it is traditional to use ALL CAPS for SQL keywords. It is also standard to end SQL statements with a semi-colon.

Simple Queries


In [10]:
pdsql('SELECT * FROM tips LIMIT 5;')


Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

In [18]:
q = """
SELECT sex as gender, smoker, day, size 
FROM tips
WHERE size <= 3 AND smoker = 'Yes'
LIMIT 5"""

In [19]:
pdsql(q)


Out[19]:
gender smoker day size
0 Male Yes Sat 2
1 Male Yes Sat 2
2 Male Yes Sat 2
3 Male Yes Sat 2
4 Female Yes Sat 1

Filtering on strings


In [24]:
q = """
SELECT sex as gender, smoker, day, size , time
FROM tips
WHERE time LIKE '%u_c%'
LIMIT 5"""

In [25]:
pdsql(q)


Out[25]:
gender smoker day size time
0 Male No Thur 4 Lunch
1 Male No Thur 2 Lunch
2 Male No Thur 2 Lunch
3 Male Yes Thur 2 Lunch
4 Male No Thur 2 Lunch

Ordering


In [27]:
q = """
SELECT sex as gender, smoker, day, size , time
FROM tips
WHERE time LIKE '%u_c%'
ORDER BY size DESC
LIMIT 5"""

pdsql(q)


Out[27]:
gender smoker day size time
0 Female No Thur 6 Lunch
1 Male No Thur 6 Lunch
2 Female No Thur 6 Lunch
3 Male No Thur 5 Lunch
4 Male No Thur 4 Lunch

Aggregate queries


In [28]:
pdsql('select * from tips limit 5;')


Out[28]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

In [33]:
q = """
SELECT sex, smoker, sum(total_bill) as total, max(tip) as max_tip 
FROM tips
WHERE time = 'Lunch'
GROUP BY sex, smoker
ORDER BY max_tip DESC
"""
pdsql(q)


Out[33]:
sex smoker total max_tip
0 Male No 369.73 6.70
1 Female No 397.56 5.17
2 Female Yes 174.31 5.00
3 Male Yes 225.87 5.00

Joins

A join occurs when you combine information from two or more database tables, based on information in a column that is common among the tables. As usual, it is easier to understand the concept with examples.


In [34]:
student = pd.read_csv('data/student.txt')
student


Out[34]:
student_id first last email major_id
0 1 frodo baggins frodo.baggins@duke.edu 1
1 2 bilbo baggins b_baggins@duke.edu 3
2 3 golum golum golum.golum@duke.edu 2
3 4 gandalf white g.white@duke.edu 5
4 5 gandalf grey g.grey@duke.edu 6
5 6 saruman wise s.wise@duke.edu 2

In [35]:
cls = pd.read_csv('data/class.txt')
cls


Out[35]:
class_id code name credits
0 1 ANT01 Introduction to Hobbits 4
1 2 MAT802 Abstrct Nonsense 8
2 3 ENG234 Jabberwocky 2
3 4 STA007 Statistics for Secret Agens 4
4 5 PHY211 Physics of Star Wars 4

In [36]:
major = pd.read_csv('data/major.txt')
major


Out[36]:
major_id name
0 1 Computer Science
1 2 Physics
2 3 Statisitcs
3 4 English
4 5 History

In [37]:
student_cls = pd.read_csv('data/student_class.txt')
student_cls


Out[37]:
student_id class_id
0 1 3
1 1 4
2 2 1
3 2 4
4 3 1
5 3 2
6 3 3
7 3 5
8 4 2
9 4 5

Matching students and majors

Inner join


In [40]:
pdsql("""
SELECT * from student s
INNER JOIN major m
ON s.major_id = m.major_id;
""")


Out[40]:
student_id first last email major_id major_id name
0 1 frodo baggins frodo.baggins@duke.edu 1 1 Computer Science
1 2 bilbo baggins b_baggins@duke.edu 3 3 Statisitcs
2 3 golum golum golum.golum@duke.edu 2 2 Physics
3 4 gandalf white g.white@duke.edu 5 5 History
4 6 saruman wise s.wise@duke.edu 2 2 Physics

Left outer join

SQL also has RIGHT OUTER JOIN and FULL OUTER JOIN but these are not currently supported by SQLite3 (the database engine used by pdsql).


In [43]:
pdsql("""
SELECT s.*, m.name from student s
LEFT JOIN major m
ON s.major_id = m.major_id;
""")


Out[43]:
student_id first last email major_id name
0 1 frodo baggins frodo.baggins@duke.edu 1 Computer Science
1 2 bilbo baggins b_baggins@duke.edu 3 Statisitcs
2 3 golum golum golum.golum@duke.edu 2 Physics
3 4 gandalf white g.white@duke.edu 5 History
4 5 gandalf grey g.grey@duke.edu 6 None
5 6 saruman wise s.wise@duke.edu 2 Physics

Emulating a full outer join with UNION ALL

Only necessary if the database does not proivde FULL OUTER JOIN


In [ ]:

Using linker tables to match students to classes (a MANY TO MANY join)


In [ ]:

Using SQLite3

SQLite3 is part of the standard library. However, the mechanics of using essentially any database in Python is similar, because of the Python DB-API.


In [44]:
import sqlite3

In [45]:
c = sqlite3.connect('data/Chinook_Sqlite.sqlite')

SQLite specific commands to get metadata

Unlike SQL syntax for queries, how you get metadata from a relational database is vendor-specific. You'll have to read the docs to find out what is needed for your SQL flavor.

What tables are there in the database?


In [46]:
list(c.execute("SELECT name FROM sqlite_master WHERE type='table';"))


Out[46]:
[('Album',),
 ('Artist',),
 ('Customer',),
 ('Employee',),
 ('Genre',),
 ('Invoice',),
 ('InvoiceLine',),
 ('MediaType',),
 ('Playlist',),
 ('PlaylistTrack',),
 ('Track',)]

What are the columns of the table "Album"?


In [47]:
list(c.execute("PRAGMA table_info(Album);"))


Out[47]:
[(0, 'AlbumId', 'INTEGER', 1, None, 1),
 (1, 'Title', 'NVARCHAR(160)', 1, None, 0),
 (2, 'ArtistId', 'INTEGER', 1, None, 0)]

Standard SQL statements with parameter substitution

Note: Using Python string substitution for Python defined parameters is dangerous because of the risk of SQL injection attacks. Use parameter substitution with ? instead.

Do this


In [48]:
t = ['%rock%', 10]
list(c.execute("SELECT * FROM Album WHERE Title like ? AND ArtistID > ? LIMIT 5;", t))


Out[48]:
[(59, 'Deep Purple In Rock', 58),
 (108, 'Rock In Rio [CD1]', 90),
 (109, 'Rock In Rio [CD2]', 90),
 (213,
  'Pure Cult: The Best Of The Cult (For Rockers, Ravers, Lovers & Sinners) [UK]',
  139),
 (216, 'Hot Rocks, 1964-1971 (Disc 1)', 142)]

Not this


In [49]:
t = ("'%rock%'", 10)
list(c.execute("SELECT * FROM Album WHERE Title like %s AND ArtistID > %d LIMIT 5;" % t))


Out[49]:
[(59, 'Deep Purple In Rock', 58),
 (108, 'Rock In Rio [CD1]', 90),
 (109, 'Rock In Rio [CD2]', 90),
 (213,
  'Pure Cult: The Best Of The Cult (For Rockers, Ravers, Lovers & Sinners) [UK]',
  139),
 (216, 'Hot Rocks, 1964-1971 (Disc 1)', 142)]

User defined functions

Sometimes it is useful to have custom functions that run on the database server rather than on the client. These are called User Defined Functions or UDF. How do to do this varies with the database used, but it is fairly simple with Python and SQLite.

A standard UDF


In [50]:
def encode(text, offset):
    """Caesar cipher of text with given offset."""
    from string import ascii_lowercase, ascii_uppercase
    tbl = dict(zip(map(ord, ascii_lowercase + ascii_uppercase), 
               ascii_lowercase[offset:] + ascii_lowercase[:offset] + 
               ascii_uppercase[offset:] + ascii_uppercase[:offset]))
    return text.translate(tbl)

In [51]:
c.create_function("encode", 2, encode)

In [52]:
list(c.execute("SELECT Title, encode(Title, 2) FROM Album limit 5;"))


Out[52]:
[('For Those About To Rock We Salute You',
  'Hqt Vjqug Cdqwv Vq Tqem Yg Ucnwvg Aqw'),
 ('Balls to the Wall', 'Dcnnu vq vjg Ycnn'),
 ('Restless and Wild', 'Tguvnguu cpf Yknf'),
 ('Let There Be Rock', 'Ngv Vjgtg Dg Tqem'),
 ('Big Ones', 'Dki Qpgu')]

An aggregate UDF

We can also add aggregate UDFs similar to SQL MIN, SUM, COUNT etc. Aggregate UDFs require you to write a class __init__, step and finalize methods.


In [53]:
class CV:
    """Aggregate UDF for coefficient of varation in %."""

    def __init__(self):
        self.s = []

    def step(self, value):
        self.s.append(value)

    def finalize(self):
        if len(self.s) < 2:
            return 0
        else:
            return 100.0*np.std(self.s)/np.mean(self.s)

In [54]:
c.create_aggregate("cv", 1, CV)

In [55]:
list(c.execute("PRAGMA table_info(Invoice);"))


Out[55]:
[(0, 'InvoiceId', 'INTEGER', 1, None, 1),
 (1, 'CustomerId', 'INTEGER', 1, None, 0),
 (2, 'InvoiceDate', 'DATETIME', 1, None, 0),
 (3, 'BillingAddress', 'NVARCHAR(70)', 0, None, 0),
 (4, 'BillingCity', 'NVARCHAR(40)', 0, None, 0),
 (5, 'BillingState', 'NVARCHAR(40)', 0, None, 0),
 (6, 'BillingCountry', 'NVARCHAR(40)', 0, None, 0),
 (7, 'BillingPostalCode', 'NVARCHAR(10)', 0, None, 0),
 (8, 'Total', 'NUMERIC(10,2)', 1, None, 0)]

In [56]:
list(c.execute("SELECT cv(Total) from Invoice limit 10;"))


Out[56]:
[(83.85715075292478,)]

In [57]:
c.close()

Using SQL magic functions

We will use the ipython-sql notebook extension for convenience. This will only work in notebooks and IPython scripts with the .ipy extension.


In [58]:
%load_ext sql

Configuring the SqlMagic extension


In [59]:
%config SqlMagic


SqlMagic options
--------------
SqlMagic.autolimit=<Int>
    Current: 0
    Automatically limit the size of the returned result sets
SqlMagic.autopandas=<Bool>
    Current: False
    Return Pandas DataFrames instead of regular result sets
SqlMagic.column_local_vars=<Bool>
    Current: False
    Return data into local variables from column names
SqlMagic.displaylimit=<Int>
    Current: 0
    Automatically limit the number of rows displayed (full result set is still
    stored)
SqlMagic.dsn_filename=<Unicode>
    Current: 'odbc.ini'
    Path to DSN file. When the first argument is of the form [section], a
    sqlalchemy connection string is formed from the matching section in the DSN
    file.
SqlMagic.feedback=<Bool>
    Current: True
    Print number of rows affected by DML
SqlMagic.short_errors=<Bool>
    Current: True
    Don't display the full traceback on SQL Programming Error
SqlMagic.style=<Unicode>
    Current: 'DEFAULT'
    Set the table printing style to any of prettytable's defined styles
    (currently DEFAULT, MSWORD_FRIENDLY, PLAIN_COLUMNS, RANDOM)

In [60]:
%config SqlMagic.displaylimit=10

Connect to SQLite3 database


In [61]:
%sql sqlite:///data/Chinook_Sqlite.sqlite


Out[61]:
'Connected: None@data/Chinook_Sqlite.sqlite'

Other databases

See SQLAlchemy connection strings for how to connect to other databases such as Oracle, MySQL or PostgreSQL.

Line magic


In [62]:
%sql SELECT * from Album LIMIT 5;


Done.
Out[62]:
AlbumId Title ArtistId
1 For Those About To Rock We Salute You 1
2 Balls to the Wall 2
3 Restless and Wild 2
4 Let There Be Rock 1
5 Big Ones 3

In [63]:
%sql SELECT * from Artist LIMIT 5;


Done.
Out[63]:
ArtistId Name
1 AC/DC
2 Accept
3 Aerosmith
4 Alanis Morissette
5 Alice In Chains

Cell magic


In [64]:
%%sql
SELECT Artist.Name, Album.Title
FROM Album
INNER JOIN Artist on Album.ArtistId = Artist.ArtistId
ORDER BY Artist.Name ASC
LIMIT 5;


Done.
Out[64]:
Name Title
AC/DC For Those About To Rock We Salute You
AC/DC Let There Be Rock
Aaron Copland & London Symphony Orchestra A Copland Celebration, Vol. I
Aaron Goldberg Worlds
Academy of St. Martin in the Fields & Sir Neville Marriner The World of Classical Favourites

You can assign results of queries to Python names


In [65]:
result = %sql SELECT * from Album;


Done.

In [66]:
type(result)


Out[66]:
sql.run.ResultSet

Resutls behave like lists


In [67]:
result[2:4]


Out[67]:
[(3, 'Restless and Wild', 2), (4, 'Let There Be Rock', 1)]

You can use Python variables in your queires.

Use :varname where you want to use a Python variable in your query.


In [68]:
artist_id = 10

%sql select * from Artist where ArtistId < :artist_id;


Done.
Out[68]:
ArtistId Name
1 AC/DC
2 Accept
3 Aerosmith
4 Alanis Morissette
5 Alice In Chains
6 Antônio Carlos Jobim
7 Apocalyptica
8 Audioslave
9 BackBeat

In [69]:
word = '%rock%'

%sql select * from Album WHERE Title LIKE :word;


Done.
Out[69]:
AlbumId Title ArtistId
1 For Those About To Rock We Salute You 1
4 Let There Be Rock 1
59 Deep Purple In Rock 58
108 Rock In Rio [CD1] 90
109 Rock In Rio [CD2] 90
213 Pure Cult: The Best Of The Cult (For Rockers, Ravers, Lovers & Sinners) [UK] 139
216 Hot Rocks, 1964-1971 (Disc 1) 142

Convert to pandas dataframe


In [70]:
df = result.DataFrame()
df.head(5)


Out[70]:
AlbumId Title ArtistId
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2
3 4 Let There Be Rock 1
4 5 Big Ones 3

In [ ]: