Created: 2014-08-02 / Last Updated: 2014-10-14
Package Dependencies:
numpy
: Numerical computing and array processing capabilities for Python.
pandas
: Data structures and data manipulation tools for analysis and statistics.
pandasql
: Perform SQL SELECT statements on Python Pandas DataFrame objects.
The pandasql package creates tables in a local SQLite database, runs the requested SQL query against those tables, and then drops the tables from the SQLite database; it is modeled after the sqldf library in R. Although this can be slower than subsetting a data frame directly, when the data frame is very complex and conditions get difficult to express, SQL provides a straightforward, declarative way of requesting a particular data subset.
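What pandasql does under the hood can be sketched with the standard library's sqlite3 module. This is a simplified illustration of that round-trip, not the package's actual code, and the table contents here are invented:

```python
import sqlite3

# Sketch of the pandasql round-trip: load rows into a temporary SQLite
# table, run the SQL query against it, then drop the table again.
rows = [("DOWNTOWN BID", 6), ("CAPITOL HILL BID", 6), ("GEORGETOWN BID", 2)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bbl (bid TEXT, ward INTEGER)")
conn.executemany("INSERT INTO bbl VALUES (?, ?)", rows)

result = conn.execute("SELECT bid FROM bbl WHERE ward = 6").fetchall()
conn.execute("DROP TABLE bbl")
conn.close()

print(result)  # → [('DOWNTOWN BID',), ('CAPITOL HILL BID',)]
```

pandasql automates exactly this bookkeeping for every query, which is where its overhead comes from.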
The pandasql package depends on the pandas package, which provides data manipulation and DataFrame support and in turn depends on the numpy package. If you are starting with Python, I recommend the Anaconda distribution from Continuum Analytics, which includes both numpy and pandas.
If the pandasql package doesn't exist in your environment, install the package. Once you have it, load it.
$ pip install -U pandasql
In [1]:
import pandas
from pandasql import sqldf
pandasql package: SQL on Pandas DataFrames in Python

To use the sqldf function (the main function in this package), we must provide two parameters:

1. the SQL query to run, as a string
2. the set of session/environment variables in which to look up DataFrames (locals() or globals())

To make this work more like how R's sqldf operates, and to remove the annoyance of having to specify the session/environment variables on every call, we can write a lambda proxy function.
In [2]:
mysqldf = lambda q: sqldf(q, globals())
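One detail worth noting: because the lambda calls globals() at query time rather than at definition time, DataFrames created after the lambda is defined are still visible to later queries. A toy illustration of that late lookup, where toy_sqldf is an invented stand-in for the real sqldf:

```python
# Invented stand-in for sqldf: resolves a table name in the given environment.
def toy_sqldf(table_name, env):
    return env[table_name]

# The lambda defers the globals() lookup until it is actually called...
my_toy = lambda name: toy_sqldf(name, globals())

# ...so a name defined *after* the lambda still resolves at query time.
bbl_demo = [1, 2, 3]
print(my_toy("bbl_demo"))  # → [1, 2, 3]
```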
Now that we have the function and know how to use it, we'll load some test data.
For this example, we'll use the Basic Business Licenses data from the DC government open data site.
The data is available as a zip file, so we'll download it if it's not already available, and then unzip it if it hasn't been already.
In [5]:
import os
import urllib
import zipfile

url = "http://data.octo.dc.gov/feeds/bbl/bbl_current_csv.zip"
data_dir = "../data/"
current_dir = os.getcwd()
os.chdir(data_dir)
local_zipped = "bbl_current_csv.zip"
local_data = "bbl_current_csv.csv"
# Download the zip only if neither the zip nor the extracted CSV is present
if not (os.path.isfile(local_data) or os.path.isfile(local_zipped)):
    urllib.urlretrieve(url, local_zipped)
# Unzip only if the CSV hasn't been extracted yet
if not os.path.isfile(local_data):
    with zipfile.ZipFile(local_zipped, 'r') as zipdata:
        zipdata.extractall()
bbl = pandas.DataFrame.from_csv(local_data)
os.chdir(current_dir)
Now that we have the data frame bbl, let's take a look at what's in it.
In [6]:
bbl.head()
Out[6]:
In [8]:
mysqldf("select * from bbl where BUSINESSIMPROVEMENTDISTRICT = 'DOWNTOWN BID'")
Out[8]:
In [9]:
mysqldf("select * from bbl where WARD = 6")
Out[9]:
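For comparison, the same ward filter written as direct pandas boolean indexing, the approach the SQL version trades some speed against. This is a sketch on a toy frame, since the full bbl data isn't reproduced here; the column names follow the real data, the values are invented:

```python
import pandas

# Toy frame standing in for bbl; column names match the real data.
df = pandas.DataFrame({"WARD": [6, 2, 6],
                       "LICENSESTATUS": ["ACTIVE", "ACTIVE", "CANCELLED"]})

# Equivalent of: select * from bbl where WARD = 6
ward6_rows = df[df["WARD"] == 6]
print(len(ward6_rows))  # → 2
```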
The sqldf() function returns a data frame, so we can store it in another variable for use in a plot, for example.
In [10]:
ward6 = mysqldf("select LICENSECATEGORY, count(*) as LICENSECOUNT from bbl where WARD = 6 group by LICENSECATEGORY")
ward6
Out[10]:
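The same count-by-category can also be done natively in pandas with groupby. A sketch on a toy frame (the column names follow the real data; the rows are invented):

```python
import pandas

# Toy frame with the columns used in the SQL query above; values invented.
df = pandas.DataFrame({"LICENSECATEGORY": ["GENERAL", "GENERAL", "RESTAURANT"],
                       "WARD": [6, 6, 2]})

# Equivalent of: select LICENSECATEGORY, count(*) from bbl
#                where WARD = 6 group by LICENSECATEGORY
counts = df[df["WARD"] == 6].groupby("LICENSECATEGORY").size()
print(counts)
```

Which approach reads better is largely a matter of taste; the SQL version shines as the conditions multiply.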
In [24]:
%matplotlib inline
import matplotlib.pyplot as plt
# DataFrame.plot creates its own figure, so no explicit plt.figure() is needed
ward6.plot(kind='barh')
Out[24]:
In [26]:
query = """
select BBL_LICENSE_FACT_ID as bbl_id,
LATITUDE as lat, LONGITUDE as lon,
WARD as ward, LICENSESTATUS as status,
BUSINESSIMPROVEMENTDISTRICT as bid
from bbl
where LICENSESTATUS != 'CANCELLED'
and WARD != 'NA'
and LATITUDE <> 0
AND LONGITUDE <> 0"""
results = mysqldf(query)
In [27]:
results.head()
Out[27]:
In [39]:
from ggplot import *
ggplot(aes(x='lat', y='lon', color='bid'), data=results) + \
geom_point() + ggtitle("Business Improvement District Locations") + \
xlab("Latitude") + ylab("Longitude") + facet_grid("ward", "status", scales="free_xy")
Out[39]: