SQL is a language for getting data from databases. It is also becoming essential to have some basic familiarity with SQL because it is universally used to slice and dice data across many different data storage formats and technologies.
We will only show the use of SQL as a query language here, since that is probably all that a statistician will use SQL for. However, SQL can also be used to create and modify tables, as well as manage database permissions.
pandas
DataFramesWe will use the pandas-sql package to practice SQL syntax for querying a pandas
DataFrame. Later, we will see how to use an actual relational database.
In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
In [2]:
from pandasql import PandaSQL
pdsql = PandaSQL()
In [3]:
tips = sns.load_dataset('tips')
tips.head()
Out[3]:
In [4]:
pdsql("SELECT * FROM tips LIMIT 5;")
Out[4]:
In [5]:
pdsql("SELECT * FROM tips WHERE sex='Female' LIMIT 5;")
Out[5]:
In [6]:
pdsql("SELECT tip, sex, size FROM tips WHERE total_bill< 10 LIMIT 5;")
Out[6]:
In [7]:
query = """
SELECT * FROM tips
WHERE sex='Female' and smoker='Yes'
ORDER BY total_bill ASC
LIMIT 5;
"""
pdsql(query)
Out[7]:
In [8]:
query = """
SELECT count(*) AS count, max(tip) AS max, min(tip) AS min FROM tips
WHERE size > 1
GROUP BY sex, day
HAVING max < 6
ORDER BY count DESC
LIMIT 5;
"""
pdsql(query)
Out[8]:
In [9]:
student = pd.read_csv('data/student.txt')
student
Out[9]:
In [10]:
cls = pd.read_csv('data/class.txt')
cls
Out[10]:
In [11]:
major = pd.read_csv('data/major.txt')
major
Out[11]:
In [12]:
student_cls = pd.read_csv('data/student_class.txt')
student_cls
Out[12]:
In [13]:
query = """
SELECT s.first, s.last, m.name
FROM student s
INNER JOIN major m
ON s.major_id = m.major_id;
"""
pdsql(query)
Out[13]:
In [14]:
query = """
SELECT s.first, s.last, m.name
FROM student s
LEFT OUTER JOIN major m
ON s.major_id = m.major_id;
"""
pdsql(query)
Out[14]:
In [15]:
query = """
SELECT s.first, s.last, m.name
FROM student s
LEFT JOIN major m
ON s.major_id = m.major_id
UNION All
SELECT s.first, s.last, m.name
FROM major m
LEFT JOIN student s
ON s.major_id = m.major_id
WHERE s.major_id IS NULL;
"""
pdsql(query)
Out[15]:
In [16]:
query = """
SELECT s.first, s.last, c.code, c.name, c.credits
FROM student s
INNER JOIN student_cls sc ON s.student_id = sc.student_id
INNER JOIN cls c ON c.class_id = sc.class_id;
"""
pdsql(query)
Out[16]:
In [17]:
query = """
SELECT s.first, s.last, c.code, c.name, c.credits
FROM student s
LEFT OUTER JOIN student_cls sc ON s.student_id = sc.student_id
LEFT OUTER JOIN cls c ON c.class_id = sc.class_id;
"""
pdsql(query)
Out[17]:
In [18]:
import sqlite3
In [19]:
c = sqlite3.connect('data/Chinook_Sqlite.sqlite')
In [20]:
list(c.execute("SELECT name FROM sqlite_master WHERE type='table';"))
Out[20]:
In [21]:
list(c.execute("PRAGMA table_info(Album);"))
Out[21]:
Note: Using Python string substitution for Python defined parameters is dangerous because of the risk of SQL injection attacks. Use parameter substitution with ?
instead.
In [22]:
t = ['%rock%']
list(c.execute("SELECT * FROM Album WHERE Title like ? LIMIT 5;", t))
Out[22]:
In [23]:
t = ["'%rock%'"]
list(c.execute("SELECT * FROM Album WHERE Title like %s LIMIT 5;" % t[0]))
Out[23]:
In [24]:
def encode(text, offset):
"""Caesar cipher of text with given offset."""
from string import ascii_lowercase, ascii_uppercase
tbl = dict(zip(map(ord, ascii_lowercase + ascii_uppercase),
ascii_lowercase[offset:] + ascii_lowercase[:offset] +
ascii_uppercase[offset:] + ascii_uppercase[:offset]))
return text.translate(tbl)
In [25]:
c.create_function("encode", 2, encode)
In [26]:
list(c.execute("SELECT Title, encode(Title, 3) FROM Album limit 5;"))
Out[26]:
In [27]:
class CV:
"""Aggregate UDF for coefficient of varation in %."""
def __init__(self):
self.s = []
def step(self, value):
self.s.append(value)
def finalize(self):
if len(self.s) < 2:
return 0
else:
return 100.0*np.std(self.s)/np.mean(self.s)
In [28]:
c.create_aggregate("cv", 1, CV)
In [29]:
list(c.execute("PRAGMA table_info(Invoice);"))
Out[29]:
In [30]:
list(c.execute("SELECT cv(Total) from Invoice limit 10;"))
Out[30]:
In [31]:
c.close()
We will use the ipython-sql notebook extension for convenience. This will only work in notebooks and IPython scripts with the .ipy extension.
In [33]:
import warnings
with warnings.catch_warnings():
warnings.simplefilter('ignore')
%load_ext sql
In [34]:
%config SqlMagic
In [35]:
%config SqlMagic.displaylimit=10
In [36]:
%sql sqlite:///data/Chinook_Sqlite.sqlite
Out[36]:
See SQLAlchemy connection strings for how to connect to other databases such as Oracle, MySQL or PostgreSQL.
In [37]:
%sql SELECT * from Album LIMIT 5;
Out[37]:
In [38]:
%sql SELECT * from Artist LIMIT 5;
Out[38]:
In [39]:
%%sql
SELECT Artist.Name, Album.Title
FROM Album
INNER JOIN Artist on Album.ArtistId = Artist.ArtistId
ORDER BY Artist.Name ASC
LIMIT 5;
Out[39]:
In [40]:
result = %sql SELECT * from Album;
In [41]:
type(result)
Out[41]:
In [42]:
result[2:4]
Out[42]:
In [43]:
artist_id = 10
%sql select * from Artist where ArtistId < :artist_id;
Out[43]:
In [44]:
word = '%rock%'
%sql select * from Album WHERE Title LIKE :word;
Out[44]:
In [45]:
df = result.DataFrame()
df.head(5)
Out[45]: