SQL

SQL is a language for getting data from databases. It is also becoming essential to have some basic familiarity with SQL because it is universally used to slice and dice data across many different data storage formats and technologies.

We will only show the use of SQL as a query language here, since that is probably all that a statistician will use SQL for. However, SQL can also be used to create and modify tables, as well as manage database permissions.

SQL via `pandas` DataFrames

We will use the pandas-sql package to practice SQL syntax for querying a pandas DataFrame. Later, we will see how to use an actual relational database.



In [1]:

    
import seaborn as sns
import pandas as pd
import numpy as np



In [2]:

    
from pandasql import PandaSQL
pdsql = PandaSQL()



In [3]:

    
tips = sns.load_dataset('tips')
tips.head()









    Out[3]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
    
  
  
    
      0
      16.99
      1.01
      Female
      No
      Sun
      Dinner
      2
    
    
      1
      10.34
      1.66
      Male
      No
      Sun
      Dinner
      3
    
    
      2
      21.01
      3.50
      Male
      No
      Sun
      Dinner
      3
    
    
      3
      23.68
      3.31
      Male
      No
      Sun
      Dinner
      2
    
    
      4
      24.59
      3.61
      Female
      No
      Sun
      Dinner
      4

Using SQL for Queries

Note that SQL is case-insensitive, but it is traditional to use ALL CAPS for SQL keywords. It is also standard to end SQL statements with a semi-colon.



In [4]:

    
pdsql("SELECT * FROM tips LIMIT 5;")









    Out[4]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
    
  
  
    
      0
      16.99
      1.01
      Female
      No
      Sun
      Dinner
      2
    
    
      1
      10.34
      1.66
      Male
      No
      Sun
      Dinner
      3
    
    
      2
      21.01
      3.50
      Male
      No
      Sun
      Dinner
      3
    
    
      3
      23.68
      3.31
      Male
      No
      Sun
      Dinner
      2
    
    
      4
      24.59
      3.61
      Female
      No
      Sun
      Dinner
      4



In [5]:

    
pdsql("SELECT * FROM tips WHERE sex='Female' LIMIT 5;")









    Out[5]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
    
  
  
    
      0
      16.99
      1.01
      Female
      No
      Sun
      Dinner
      2
    
    
      1
      24.59
      3.61
      Female
      No
      Sun
      Dinner
      4
    
    
      2
      35.26
      5.00
      Female
      No
      Sun
      Dinner
      4
    
    
      3
      14.83
      3.02
      Female
      No
      Sun
      Dinner
      2
    
    
      4
      10.33
      1.67
      Female
      No
      Sun
      Dinner
      3



In [6]:

    
pdsql("SELECT tip, sex, size FROM tips WHERE total_bill< 10 LIMIT 5;")

Ordering



In [7]:

    
query = """
SELECT * FROM tips
WHERE sex='Female' and smoker='Yes'
ORDER BY total_bill ASC
LIMIT 5;
"""
pdsql(query)









    Out[7]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
    
  
  
    
      0
      3.07
      1.00
      Female
      Yes
      Sat
      Dinner
      1
    
    
      1
      5.75
      1.00
      Female
      Yes
      Fri
      Dinner
      2
    
    
      2
      9.60
      4.00
      Female
      Yes
      Sun
      Dinner
      2
    
    
      3
      10.09
      2.00
      Female
      Yes
      Fri
      Lunch
      2
    
    
      4
      10.59
      1.61
      Female
      Yes
      Sat
      Dinner
      2

Aggregate queries



In [8]:

    
query = """
SELECT count(*) AS count, max(tip) AS max, min(tip) AS min FROM tips
WHERE size > 1
GROUP BY sex, day
HAVING max < 6
ORDER BY count DESC
LIMIT 5;
"""
pdsql(query)

Joins

A join occurs when you combine information from two or more database tables, based on information in a column that is common among the tables. As usual, it is easier to understand the concept with examples.



In [9]:

    
student = pd.read_csv('data/student.txt')
student









    Out[9]:






  
    
      
      student_id
      first
      last
      email
      major_id
    
  
  
    
      0
      1
      frodo
      baggins
      frodo.baggins@duke.edu
      1
    
    
      1
      2
      bilbo
      baggins
      b_baggins@duke.edu
      3
    
    
      2
      3
      golum
      golum
      golum.golum@duke.edu
      2
    
    
      3
      4
      gandalf
      white
      g.white@duke.edu
      5
    
    
      4
      5
      gandalf
      grey
      g.grey@duke.edu
      6
    
    
      5
      6
      saruman
      wise
      s.wise@duke.edu
      2



In [10]:

    
cls = pd.read_csv('data/class.txt')
cls









    Out[10]:






  
    
      
      class_id
      code
      name
      credits
    
  
  
    
      0
      1
      ANT01
      Introduction to Hobbits
      4
    
    
      1
      2
      MAT802
      Abstrct Nonsense
      8
    
    
      2
      3
      ENG234
      Jabberwocky
      2
    
    
      3
      4
      STA007
      Statistics for Secret Agens
      4
    
    
      4
      5
      PHY211
      Physics of Star Wars
      4



In [11]:

    
major = pd.read_csv('data/major.txt')
major









    Out[11]:






  
    
      
      major_id
      name
    
  
  
    
      0
      1
      Computer Science
    
    
      1
      2
      Physics
    
    
      2
      3
      Statisitcs
    
    
      3
      4
      English
    
    
      4
      5
      History



In [12]:

    
student_cls = pd.read_csv('data/student_class.txt')
student_cls

Matching students and majors

Inner join



In [13]:

    
query = """
SELECT s.first, s.last, m.name
FROM student s
INNER JOIN major m 
ON s.major_id = m.major_id;
"""
pdsql(query)









    Out[13]:






  
    
      
      first
      last
      name
    
  
  
    
      0
      frodo
      baggins
      Computer Science
    
    
      1
      bilbo
      baggins
      Statisitcs
    
    
      2
      golum
      golum
      Physics
    
    
      3
      gandalf
      white
      History
    
    
      4
      saruman
      wise
      Physics

Left outer join

SQL also has RIGHT OUTER JOIN and FULL OUTER JOIN but these are not currently supported by SQLite3 (the database engine used by pdsql).



In [14]:

    
query = """
SELECT s.first, s.last, m.name
FROM student s
LEFT OUTER JOIN major m 
ON s.major_id = m.major_id;
"""
pdsql(query)









    Out[14]:






  
    
      
      first
      last
      name
    
  
  
    
      0
      frodo
      baggins
      Computer Science
    
    
      1
      bilbo
      baggins
      Statisitcs
    
    
      2
      golum
      golum
      Physics
    
    
      3
      gandalf
      white
      History
    
    
      4
      gandalf
      grey
      None
    
    
      5
      saruman
      wise
      Physics

Emulating a full outer join with UNION ALL

Only necessary if the database does not proivde FULL OUTER JOIN



In [15]:

    
query = """
SELECT s.first, s.last, m.name
FROM   student s 
       LEFT JOIN major m 
          ON s.major_id = m.major_id
UNION All
SELECT s.first, s.last, m.name
FROM   major m
       LEFT JOIN student s
          ON s.major_id = m.major_id
WHERE  s.major_id IS NULL;
"""
pdsql(query)









    Out[15]:






  
    
      
      first
      last
      name
    
  
  
    
      0
      frodo
      baggins
      Computer Science
    
    
      1
      bilbo
      baggins
      Statisitcs
    
    
      2
      golum
      golum
      Physics
    
    
      3
      gandalf
      white
      History
    
    
      4
      gandalf
      grey
      None
    
    
      5
      saruman
      wise
      Physics
    
    
      6
      None
      None
      English

Using linker tables to match students to classes (a MANY TO MANY join)



In [16]:

    
query = """
SELECT s.first, s.last, c.code, c.name, c.credits
FROM student s
INNER JOIN student_cls sc ON s.student_id = sc.student_id
INNER JOIN cls c ON c.class_id = sc.class_id;
"""
pdsql(query)









    Out[16]:






  
    
      
      first
      last
      code
      name
      credits
    
  
  
    
      0
      frodo
      baggins
      ENG234
      Jabberwocky
      2
    
    
      1
      frodo
      baggins
      STA007
      Statistics for Secret Agens
      4
    
    
      2
      bilbo
      baggins
      ANT01
      Introduction to Hobbits
      4
    
    
      3
      bilbo
      baggins
      STA007
      Statistics for Secret Agens
      4
    
    
      4
      golum
      golum
      ANT01
      Introduction to Hobbits
      4
    
    
      5
      golum
      golum
      MAT802
      Abstrct Nonsense
      8
    
    
      6
      golum
      golum
      ENG234
      Jabberwocky
      2
    
    
      7
      golum
      golum
      PHY211
      Physics of Star Wars
      4
    
    
      8
      gandalf
      white
      MAT802
      Abstrct Nonsense
      8
    
    
      9
      gandalf
      white
      PHY211
      Physics of Star Wars
      4

Same thing but including students with no majors



In [17]:

    
query = """
SELECT s.first, s.last, c.code, c.name, c.credits
FROM student s
LEFT OUTER JOIN student_cls sc ON s.student_id = sc.student_id
LEFT OUTER JOIN cls c ON c.class_id = sc.class_id;
"""
pdsql(query)









    Out[17]:






  
    
      
      first
      last
      code
      name
      credits
    
  
  
    
      0
      frodo
      baggins
      ENG234
      Jabberwocky
      2.0
    
    
      1
      frodo
      baggins
      STA007
      Statistics for Secret Agens
      4.0
    
    
      2
      bilbo
      baggins
      ANT01
      Introduction to Hobbits
      4.0
    
    
      3
      bilbo
      baggins
      STA007
      Statistics for Secret Agens
      4.0
    
    
      4
      golum
      golum
      ANT01
      Introduction to Hobbits
      4.0
    
    
      5
      golum
      golum
      MAT802
      Abstrct Nonsense
      8.0
    
    
      6
      golum
      golum
      ENG234
      Jabberwocky
      2.0
    
    
      7
      golum
      golum
      PHY211
      Physics of Star Wars
      4.0
    
    
      8
      gandalf
      white
      MAT802
      Abstrct Nonsense
      8.0
    
    
      9
      gandalf
      white
      PHY211
      Physics of Star Wars
      4.0
    
    
      10
      gandalf
      grey
      None
      None
      NaN
    
    
      11
      saruman
      wise
      None
      None
      NaN

Using SQLite3

SQLite3 is part of the standard library. However, the mechanics of using essentially any database in Python is similar, because of the Python DB-API.



In [18]:

    
import sqlite3



In [19]:

    
c = sqlite3.connect('data/Chinook_Sqlite.sqlite')

SQLite specific commands to get metadata

Unlike SQL syntax for queries, how you get metadata from a relational database is vendor-specific. You'll have to read the docs to find out what is needed for your SQL flavor.

What tables are there in the database?



In [20]:

    
list(c.execute("SELECT name FROM sqlite_master WHERE type='table';"))









    Out[20]:





[('Album',),
 ('Artist',),
 ('Customer',),
 ('Employee',),
 ('Genre',),
 ('Invoice',),
 ('InvoiceLine',),
 ('MediaType',),
 ('Playlist',),
 ('PlaylistTrack',),
 ('Track',)]

What are the columns of the table "Album"?



In [21]:

    
list(c.execute("PRAGMA table_info(Album);"))









    Out[21]:





[(0, 'AlbumId', 'INTEGER', 1, None, 1),
 (1, 'Title', 'NVARCHAR(160)', 1, None, 0),
 (2, 'ArtistId', 'INTEGER', 1, None, 0)]

Standard SQL statements with parameter substitution

Note: Using Python string substitution for Python defined parameters is dangerous because of the risk of SQL injection attacks. Use parameter substitution with ? instead.

Do this



In [22]:

    
t = ['%rock%']
list(c.execute("SELECT * FROM Album WHERE Title like ? LIMIT 5;", t))









    Out[22]:





[(1, 'For Those About To Rock We Salute You', 1),
 (4, 'Let There Be Rock', 1),
 (59, 'Deep Purple In Rock', 58),
 (108, 'Rock In Rio [CD1]', 90),
 (109, 'Rock In Rio [CD2]', 90)]

Not this



In [23]:

    
t = ["'%rock%'"]
list(c.execute("SELECT * FROM Album WHERE Title like %s LIMIT 5;" % t[0]))









    Out[23]:





[(1, 'For Those About To Rock We Salute You', 1),
 (4, 'Let There Be Rock', 1),
 (59, 'Deep Purple In Rock', 58),
 (108, 'Rock In Rio [CD1]', 90),
 (109, 'Rock In Rio [CD2]', 90)]

User defined functions

Sometimes it is useful to have custom functions that run on the database server rather than on the client. These are called User Defined Functions or UDF. How do to do this varies with the database used, but it is fairly simple with Python and SQLite.

A standard UDF



In [24]:

    
def encode(text, offset):
    """Caesar cipher of text with given offset."""
    from string import ascii_lowercase, ascii_uppercase
    tbl = dict(zip(map(ord, ascii_lowercase + ascii_uppercase), 
               ascii_lowercase[offset:] + ascii_lowercase[:offset] + 
               ascii_uppercase[offset:] + ascii_uppercase[:offset]))
    return text.translate(tbl)



In [25]:

    
c.create_function("encode", 2, encode)



In [26]:

    
list(c.execute("SELECT Title, encode(Title, 3) FROM Album limit 5;"))









    Out[26]:





[('For Those About To Rock We Salute You',
  'Iru Wkrvh Derxw Wr Urfn Zh Vdoxwh Brx'),
 ('Balls to the Wall', 'Edoov wr wkh Zdoo'),
 ('Restless and Wild', 'Uhvwohvv dqg Zlog'),
 ('Let There Be Rock', 'Ohw Wkhuh Eh Urfn'),
 ('Big Ones', 'Elj Rqhv')]

An aggregate UDF

We can also add aggregate UDFs similar to SQL MIN, SUM, COUNT etc. Aggregate UDFs require you to write a class __init__, step and finalize methods.



In [27]:

    
class CV:
    """Aggregate UDF for coefficient of varation in %."""

    def __init__(self):
        self.s = []

    def step(self, value):
        self.s.append(value)

    def finalize(self):
        if len(self.s) < 2:
            return 0
        else:
            return 100.0*np.std(self.s)/np.mean(self.s)



In [28]:

    
c.create_aggregate("cv", 1, CV)



In [29]:

    
list(c.execute("PRAGMA table_info(Invoice);"))









    Out[29]:





[(0, 'InvoiceId', 'INTEGER', 1, None, 1),
 (1, 'CustomerId', 'INTEGER', 1, None, 0),
 (2, 'InvoiceDate', 'DATETIME', 1, None, 0),
 (3, 'BillingAddress', 'NVARCHAR(70)', 0, None, 0),
 (4, 'BillingCity', 'NVARCHAR(40)', 0, None, 0),
 (5, 'BillingState', 'NVARCHAR(40)', 0, None, 0),
 (6, 'BillingCountry', 'NVARCHAR(40)', 0, None, 0),
 (7, 'BillingPostalCode', 'NVARCHAR(10)', 0, None, 0),
 (8, 'Total', 'NUMERIC(10,2)', 1, None, 0)]



In [30]:

    
list(c.execute("SELECT cv(Total) from Invoice limit 10;"))









    Out[30]:





[(83.85715075292478,)]



In [31]:

    
c.close()

Using SQL magic functions

We will use the ipython-sql notebook extension for convenience. This will only work in notebooks and IPython scripts with the .ipy extension.



In [33]:

    
import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    %load_ext sql

Configuring the SqlMagic extension



In [34]:

    
%config SqlMagic









    



SqlMagic options
--------------
SqlMagic.autolimit=<Int>
    Current: 0
    Automatically limit the size of the returned result sets
SqlMagic.autopandas=<Bool>
    Current: False
    Return Pandas DataFrames instead of regular result sets
SqlMagic.column_local_vars=<Bool>
    Current: False
    Return data into local variables from column names
SqlMagic.displaylimit=<Int>
    Current: 0
    Automatically limit the number of rows displayed (full result set is still
    stored)
SqlMagic.dsn_filename=<Unicode>
    Current: 'odbc.ini'
    Path to DSN file. When the first argument is of the form [section], a
    sqlalchemy connection string is formed from the matching section in the DSN
    file.
SqlMagic.feedback=<Bool>
    Current: True
    Print number of rows affected by DML
SqlMagic.short_errors=<Bool>
    Current: True
    Don't display the full traceback on SQL Programming Error
SqlMagic.style=<Unicode>
    Current: 'DEFAULT'
    Set the table printing style to any of prettytable's defined styles
    (currently DEFAULT, MSWORD_FRIENDLY, PLAIN_COLUMNS, RANDOM)



In [35]:

    
%config SqlMagic.displaylimit=10

Connect to SQLite3 database



In [36]:

    
%sql sqlite:///data/Chinook_Sqlite.sqlite









    Out[36]:





'Connected: None@data/Chinook_Sqlite.sqlite'

Other databases

See SQLAlchemy connection strings for how to connect to other databases such as Oracle, MySQL or PostgreSQL.

Line magic



In [37]:

    
%sql SELECT * from Album LIMIT 5;









    



Done.






    Out[37]:





    
        AlbumId
        Title
        ArtistId
    
    
        1
        For Those About To Rock We Salute You
        1
    
    
        2
        Balls to the Wall
        2
    
    
        3
        Restless and Wild
        2
    
    
        4
        Let There Be Rock
        1
    
    
        5
        Big Ones
        3



In [38]:

    
%sql SELECT * from Artist LIMIT 5;









    



Done.






    Out[38]:





    
        ArtistId
        Name
    
    
        1
        AC/DC
    
    
        2
        Accept
    
    
        3
        Aerosmith
    
    
        4
        Alanis Morissette
    
    
        5
        Alice In Chains

Cell magic



In [39]:

    
%%sql
SELECT Artist.Name, Album.Title
FROM Album
INNER JOIN Artist on Album.ArtistId = Artist.ArtistId
ORDER BY Artist.Name ASC
LIMIT 5;









    



Done.






    Out[39]:





    
        Name
        Title
    
    
        AC/DC
        For Those About To Rock We Salute You
    
    
        AC/DC
        Let There Be Rock
    
    
        Aaron Copland & London Symphony Orchestra
        A Copland Celebration, Vol. I
    
    
        Aaron Goldberg
        Worlds
    
    
        Academy of St. Martin in the Fields & Sir Neville Marriner
        The World of Classical Favourites

You can assign results of queries to Python names



In [40]:

    
result = %sql SELECT * from Album;









    



Done.



In [41]:

    
type(result)









    Out[41]:





sql.run.ResultSet

Resutls behave like lists



In [42]:

    
result[2:4]









    Out[42]:





[(3, 'Restless and Wild', 2), (4, 'Let There Be Rock', 1)]

You can use Python variables in your queires.

Use :varname where you want to use a Python variable in your query.



In [43]:

    
artist_id = 10

%sql select * from Artist where ArtistId < :artist_id;









    



Done.






    Out[43]:





    
        ArtistId
        Name
    
    
        1
        AC/DC
    
    
        2
        Accept
    
    
        3
        Aerosmith
    
    
        4
        Alanis Morissette
    
    
        5
        Alice In Chains
    
    
        6
        Antônio Carlos Jobim
    
    
        7
        Apocalyptica
    
    
        8
        Audioslave
    
    
        9
        BackBeat



In [44]:

    
word = '%rock%'

%sql select * from Album WHERE Title LIKE :word;









    



Done.






    Out[44]:





    
        AlbumId
        Title
        ArtistId
    
    
        1
        For Those About To Rock We Salute You
        1
    
    
        4
        Let There Be Rock
        1
    
    
        59
        Deep Purple In Rock
        58
    
    
        108
        Rock In Rio [CD1]
        90
    
    
        109
        Rock In Rio [CD2]
        90
    
    
        213
        Pure Cult: The Best Of The Cult (For Rockers, Ravers, Lovers & Sinners) [UK]
        139
    
    
        216
        Hot Rocks, 1964-1971 (Disc 1)
        142

Convert to `pandas` dataframe



In [45]:

    
df = result.DataFrame()
df.head(5)









    Out[45]:






  
    
      
      AlbumId
      Title
      ArtistId
    
  
  
    
      0
      1
      For Those About To Rock We Salute You
      1
    
    
      1
      2
      Balls to the Wall
      2
    
    
      2
      3
      Restless and Wild
      2
    
    
      3
      4
      Let There Be Rock
      1
    
    
      4
      5
      Big Ones
      3

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

	total_bill	tip	sex	smoker	day	time	size
0	3.07	1.00	Female	Yes	Sat	Dinner	1
1	5.75	1.00	Female	Yes	Fri	Dinner	2
2	9.60	4.00	Female	Yes	Sun	Dinner	2
3	10.09	2.00	Female	Yes	Fri	Lunch	2
4	10.59	1.61	Female	Yes	Sat	Dinner	2

	student_id	first	last	email	major_id
0	1	frodo	baggins	frodo.baggins@duke.edu	1
1	2	bilbo	baggins	b_baggins@duke.edu	3
2	3	golum	golum	golum.golum@duke.edu	2
3	4	gandalf	white	g.white@duke.edu	5
4	5	gandalf	grey	g.grey@duke.edu	6
5	6	saruman	wise	s.wise@duke.edu	2

	class_id	code	name	credits
0	1	ANT01	Introduction to Hobbits	4
1	2	MAT802	Abstrct Nonsense	8
2	3	ENG234	Jabberwocky	2
3	4	STA007	Statistics for Secret Agens	4
4	5	PHY211	Physics of Star Wars	4

	first	last	name
0	frodo	baggins	Computer Science
1	bilbo	baggins	Statisitcs
2	golum	golum	Physics
3	gandalf	white	History
4	saruman	wise	Physics

AlbumId	Title	ArtistId
1	For Those About To Rock We Salute You	1
2	Balls to the Wall	2
3	Restless and Wild	2
4	Let There Be Rock	1
5	Big Ones	3

Name	Title
AC/DC	For Those About To Rock We Salute You
AC/DC	Let There Be Rock
Aaron Copland & London Symphony Orchestra	A Copland Celebration, Vol. I
Aaron Goldberg	Worlds
Academy of St. Martin in the Fields & Sir Neville Marriner	The World of Classical Favourites

ArtistId	Name
1	AC/DC
2	Accept
3	Aerosmith
4	Alanis Morissette
5	Alice In Chains
6	Antônio Carlos Jobim
7	Apocalyptica
8	Audioslave
9	BackBeat

AlbumId	Title	ArtistId
1	For Those About To Rock We Salute You	1
4	Let There Be Rock	1
59	Deep Purple In Rock	58
108	Rock In Rio [CD1]	90
109	Rock In Rio [CD2]	90
213	Pure Cult: The Best Of The Cult (For Rockers, Ravers, Lovers & Sinners) [UK]	139
216	Hot Rocks, 1964-1971 (Disc 1)	142

SQL

SQL via pandas DataFrames