SQL is a language for getting data from databases. It is also becoming essential to have some basic familiarity with SQL because it is universally used to slice and dice data across many different data storage formats and technologies.
We will only show the use of SQL as a query language here, since that is probably all that a statistician will use SQL for. However, SQL can also be used to create and modify tables, as well as manage database permissions.
Three ways of using SQL in Python will be shown:
pandas
data framesmagic
functionspandas
DataFramesWe will use the pandas-sql package to practice SQL syntax for querying a pandas
DataFrame. Later, we will see how to use an actual relational database.
In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
In [5]:
! pip install pandasql
In [6]:
from pandasql import PandaSQL
pdsql = PandaSQL()
In [7]:
tips = sns.load_dataset('tips')
tips.head()
Out[7]:
In [10]:
pdsql('SELECT * FROM tips LIMIT 5;')
Out[10]:
In [18]:
q = """
SELECT sex as gender, smoker, day, size
FROM tips
WHERE size <= 3 AND smoker = 'Yes'
LIMIT 5"""
In [19]:
pdsql(q)
Out[19]:
In [24]:
q = """
SELECT sex as gender, smoker, day, size , time
FROM tips
WHERE time LIKE '%u_c%'
LIMIT 5"""
In [25]:
pdsql(q)
Out[25]:
In [27]:
q = """
SELECT sex as gender, smoker, day, size , time
FROM tips
WHERE time LIKE '%u_c%'
ORDER BY size DESC
LIMIT 5"""
pdsql(q)
Out[27]:
In [28]:
pdsql('select * from tips limit 5;')
Out[28]:
In [33]:
q = """
SELECT sex, smoker, sum(total_bill) as total, max(tip) as max_tip
FROM tips
WHERE time = 'Lunch'
GROUP BY sex, smoker
ORDER BY max_tip DESC
"""
pdsql(q)
Out[33]:
In [34]:
student = pd.read_csv('data/student.txt')
student
Out[34]:
In [35]:
cls = pd.read_csv('data/class.txt')
cls
Out[35]:
In [36]:
major = pd.read_csv('data/major.txt')
major
Out[36]:
In [37]:
student_cls = pd.read_csv('data/student_class.txt')
student_cls
Out[37]:
In [40]:
pdsql("""
SELECT * from student s
INNER JOIN major m
ON s.major_id = m.major_id;
""")
Out[40]:
In [43]:
pdsql("""
SELECT s.*, m.name from student s
LEFT JOIN major m
ON s.major_id = m.major_id;
""")
Out[43]:
In [ ]:
In [ ]:
In [44]:
import sqlite3
In [45]:
c = sqlite3.connect('data/Chinook_Sqlite.sqlite')
In [46]:
list(c.execute("SELECT name FROM sqlite_master WHERE type='table';"))
Out[46]:
In [47]:
list(c.execute("PRAGMA table_info(Album);"))
Out[47]:
Note: Using Python string substitution for Python defined parameters is dangerous because of the risk of SQL injection attacks. Use parameter substitution with ?
instead.
In [48]:
t = ['%rock%', 10]
list(c.execute("SELECT * FROM Album WHERE Title like ? AND ArtistID > ? LIMIT 5;", t))
Out[48]:
In [49]:
t = ("'%rock%'", 10)
list(c.execute("SELECT * FROM Album WHERE Title like %s AND ArtistID > %d LIMIT 5;" % t))
Out[49]:
In [50]:
def encode(text, offset):
"""Caesar cipher of text with given offset."""
from string import ascii_lowercase, ascii_uppercase
tbl = dict(zip(map(ord, ascii_lowercase + ascii_uppercase),
ascii_lowercase[offset:] + ascii_lowercase[:offset] +
ascii_uppercase[offset:] + ascii_uppercase[:offset]))
return text.translate(tbl)
In [51]:
c.create_function("encode", 2, encode)
In [52]:
list(c.execute("SELECT Title, encode(Title, 2) FROM Album limit 5;"))
Out[52]:
In [53]:
class CV:
"""Aggregate UDF for coefficient of varation in %."""
def __init__(self):
self.s = []
def step(self, value):
self.s.append(value)
def finalize(self):
if len(self.s) < 2:
return 0
else:
return 100.0*np.std(self.s)/np.mean(self.s)
In [54]:
c.create_aggregate("cv", 1, CV)
In [55]:
list(c.execute("PRAGMA table_info(Invoice);"))
Out[55]:
In [56]:
list(c.execute("SELECT cv(Total) from Invoice limit 10;"))
Out[56]:
In [57]:
c.close()
We will use the ipython-sql notebook extension for convenience. This will only work in notebooks and IPython scripts with the .ipy extension.
In [58]:
%load_ext sql
In [59]:
%config SqlMagic
In [60]:
%config SqlMagic.displaylimit=10
In [61]:
%sql sqlite:///data/Chinook_Sqlite.sqlite
Out[61]:
See SQLAlchemy connection strings for how to connect to other databases such as Oracle, MySQL or PostgreSQL.
In [62]:
%sql SELECT * from Album LIMIT 5;
Out[62]:
In [63]:
%sql SELECT * from Artist LIMIT 5;
Out[63]:
In [64]:
%%sql
SELECT Artist.Name, Album.Title
FROM Album
INNER JOIN Artist on Album.ArtistId = Artist.ArtistId
ORDER BY Artist.Name ASC
LIMIT 5;
Out[64]:
In [65]:
result = %sql SELECT * from Album;
In [66]:
type(result)
Out[66]:
In [67]:
result[2:4]
Out[67]:
In [68]:
artist_id = 10
%sql select * from Artist where ArtistId < :artist_id;
Out[68]:
In [69]:
word = '%rock%'
%sql select * from Album WHERE Title LIKE :word;
Out[69]:
In [70]:
df = result.DataFrame()
df.head(5)
Out[70]:
In [ ]: