Manipulating Data with Style in SQL

Polyglot Programming DC : October 2014 Meetup

Created: 2014-08-02 / Last Updated: 2014-10-14

Contents:

  • The pandasql package: SQL on Pandas DataFrames in Python
  • Loading some test data
  • Querying and subsetting that test data
  • Example of a complex query

Package Dependencies:

References:

The pandasql package is a Python package that creates tables in a local SQLite database, runs the requested SQL query on those tables, and then drops the tables from the SQLite database, modeled after the sqldf library in R. Although this can be slower than subsetting a data frame directly, in cases where the data frame is very complex and conditions get difficult to parse, SQL can provide a straightforward, declarative way of requesting a particular data subset.

The pandasql package depends on the pandas package, which provides data manipulation and DataFrame support, which in turn depends on the numpy package. If you are starting with Python, I recommend the Anaconda distribution from Continuum Analytics, which has both numpy and pandas.

If the pandasql package doesn't exist in your environment, install the package. Once you have it, load it.

$ pip install -U pandasql


In [1]:
import pandas
from pandasql import sqldf

The pandasql package: SQL on Pandas DataFrames in Python

To use the sqldf function (the main function in this package), we must provide two parameters:

  • The SQL statement, with Pandas DataFrame object references instead of tables
  • A set of session/environment variables (locals() or globals())

To make this look more like R's sqldf operates, and to remove the annoyance of having to specify the session/environment variables we can write a lambda proxy function.


In [2]:
mysqldf = lambda q: sqldf(q, globals())

Now that we have the function and know how to use it, we'll load some test data.

Loading some test data

For this example, we'll use the Basic Business Licenses data from the DC government open data site.

The data is available as a zip file, so we'll download it if it's not already available, and then unzip it if it hasn't been already.


In [5]:
import os
import urllib2
import zipfile

url = "http://data.octo.dc.gov/feeds/bbl/bbl_current_csv.zip"
data_dir = "../data/"

current_dir = os.getcwd()
os.chdir(data_dir)

local_zipped = "bbl_current_csv.zip"
local_data = "bbl_current_csv.csv"

if not (os.path.isfile(local_data) and os.path.isfile(local_zipped)):
  filename = os.path.join(os.getcwd(), local_zipped)
  urllib2.urlretrieve(url, local_zipped)

if not os.path.isfile(local_data): 
  with ZipFile(local_zipped, 'r') as zipdata:
    zipdata.extractall()

bbl = pandas.DataFrame.from_csv(local_data)

os.chdir(data_dir)

Now that we have the data frame bbl, let's take a look at what's in it.


In [6]:
bbl.head()


Out[6]:
LICENSESTATUS LICENSECATEGORY CUST_NUM TRADE_NAME LICENSE_START_DATE LICENSE_EXPIRATION_DATE LICENSE_ISSUE_DATE AGENT_PHONE LASTMODIFIEDDATE CITY ... WARD ANC SMD DISTRICT PSA NEIGHBORHOODCLUSTER HOTSPOT2006NAME HOTSPOT2005NAME HOTSPOT2004NAME BUSINESSIMPROVEMENTDISTRICT
BBL_LICENSE_FACT_ID
144376 CANCELLED Charitable Solicitation 65990715 WASHINGTON ARCHITECTURAL FOUNDATION 9/1/2011 12:00:00 AM 8/31/2013 12:00:00 AM 7/15/2014 12:00:00 AM 9999999999 7/17/2014 6:45:37 AM WASHINGTON ... 2 2B 2B07 SECOND 208 6 NONE NONE NONE NONE
146806 CANCELLED General Business Licenses 400312000769 NaN 3/1/2012 12:00:00 AM 2/28/2014 12:00:00 AM 7/15/2014 12:00:00 AM 2028350680 7/17/2014 6:45:37 AM WASHINGTON ... 2 2B 2B05 SECOND 207 6 NONE NONE NONE GOLDEN TRIANGLE BID
151842 CANCELLED Cigarette Wholesale 410411000056 WAREHOUSE ON THE WHEEL 7/1/2011 12:00:00 AM 6/30/2013 12:00:00 AM 7/15/2014 12:00:00 AM 2025488000 7/17/2014 6:45:37 AM WASHINGTON ... 5 5B 5B09 FIFTH 505 23 NONE NONE NONE NONE
152391 ACTIVE Gen Contr-Construction Mngr 410512000305 REDLINE PROJECTS LLC 4/1/2014 12:00:00 AM 3/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2024210949 7/17/2014 6:45:37 AM WASHINGTON ... 5 5C 5C11 FOURTH 405 21 NONE NONE NONE NONE
162782 READY TO BATCH PRINT One Family Rental 500512001166 THE DUMAN KIM 2008 IRREVOCABLE TRUST 3/1/2014 12:00:00 AM 2/28/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2024664747 7/17/2014 6:45:37 AM WASHINGTON ... 2 2A 2A02 SECOND 207 5 NONE NONE NONE NONE

5 rows × 28 columns

SQL Queries on Data Frames

Now, let's do a few queries on it just to get the hang of using sqldf (and our version mysqldf).


In [8]:
mysqldf("select * from bbl where BUSINESSIMPROVEMENTDISTRICT = 'DOWNTOWN BID'")


Out[8]:
BBL_LICENSE_FACT_ID LICENSESTATUS LICENSECATEGORY CUST_NUM TRADE_NAME LICENSE_START_DATE LICENSE_EXPIRATION_DATE LICENSE_ISSUE_DATE AGENT_PHONE LASTMODIFIEDDATE ... WARD ANC SMD DISTRICT PSA NEIGHBORHOODCLUSTER HOTSPOT2006NAME HOTSPOT2005NAME HOTSPOT2004NAME BUSINESSIMPROVEMENTDISTRICT
0 168986 ACTIVE Hotel 5.000594e+07 CCMH METRO CENTER LLC 5/1/2014 12:00:00 AM 4/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2027372200 7/17/2014 6:45:37 AM ... 2 2C 2C03 FIRST 101 8 NONE NONE NONE DOWNTOWN BID
1 240380 READY TO BATCH PRINT General Business Licenses 7.010787e+07 KIPLINGER WASHINGTON EDITORS, INC. (THE) 9/1/2014 12:00:00 AM 8/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2025723100 7/17/2014 6:45:37 AM ... 2 2F 2F03 THIRD 307 8 NONE NONE NONE DOWNTOWN BID
2 240701 ACTIVE Cigarette Retail 5.300160e+07 MAHIL CORPORATION 8/1/2014 12:00:00 AM 7/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 9999999999 7/17/2014 6:45:37 AM ... 2 2F 2F03 SECOND 207 8 NONE NONE NONE DOWNTOWN BID
3 249449 ACTIVE Charitable Solicitation 4.002120e+11 WASHINGTON AREA WOMEN'S FOUNDATION 8/1/2014 12:00:00 AM 7/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2023477737 7/17/2014 6:45:37 AM ... 2 2F 2F03 FIRST 101 8 NONE NONE NONE DOWNTOWN BID
4 249571 READY TO BATCH PRINT One Family Rental 5.005120e+11 None 8/1/2014 12:00:00 AM 7/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2022979315 7/17/2014 6:45:37 AM ... 2 2F 2F06 FIRST 101 8 NONE NONE NONE DOWNTOWN BID
5 272574 ACTIVE Charitable Solicitation 4.002140e+11 CITYDANCE ENSEMBLE INC 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2025722100 7/17/2014 6:45:38 AM ... 2 2B 2B05 SECOND 207 6 NONE NONE NONE DOWNTOWN BID
6 272576 ACTIVE Charitable Solicitation 4.002140e+11 WASHINGTON ARCHITECHTURAL FOUNDATION 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2026831566 7/17/2014 6:45:38 AM ... 2 6C 6C09 FIRST 102 8 NONE NONE NONE DOWNTOWN BID
7 272589 ACTIVE Parking Facility Attendant 4.110140e+11 COLONIAL PARKING, INC. 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2023931747 7/17/2014 6:45:38 AM ... 2 2F 2F03 SECOND 207 8 NONE NONE NONE DOWNTOWN BID

8 rows × 29 columns


In [9]:
mysqldf("select * from bbl where WARD = 6")


Out[9]:
BBL_LICENSE_FACT_ID LICENSESTATUS LICENSECATEGORY CUST_NUM TRADE_NAME LICENSE_START_DATE LICENSE_EXPIRATION_DATE LICENSE_ISSUE_DATE AGENT_PHONE LASTMODIFIEDDATE ... WARD ANC SMD DISTRICT PSA NEIGHBORHOODCLUSTER HOTSPOT2006NAME HOTSPOT2005NAME HOTSPOT2004NAME BUSINESSIMPROVEMENTDISTRICT
0 164337 READY TO BATCH PRINT One Family Rental 6.600302e+07 LAWRENCE PLUTKIN & RUTH HANSEN 5/1/2014 12:00:00 AM 4/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2025443900 7/17/2014 6:45:37 AM ... 6 6C 6C06 FIRST 104 25 NONE NONE NONE NONE
1 239799 READY TO BATCH PRINT Charitable Solicitation 7.010666e+07 CENTER FOR STATE & LOCAL GOVERNMENT EXCELLENCE... 8/1/2014 12:00:00 AM 7/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2029999999 7/17/2014 6:45:37 AM ... 6 6C 6C05 FIRST 103 25 NONE NONE NONE NOMA BID
2 242055 READY TO BATCH PRINT One Family Rental 6.600593e+07 CHRISTOPHER SIMPSON 9/1/2014 12:00:00 AM 8/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2025467000 7/17/2014 6:45:37 AM ... 6 6B 6B08 FIRST 108 26 NONE NONE NONE NONE
3 242787 READY TO BATCH PRINT One Family Rental 6.800585e+07 UNITED HOUSE OF PRAYER FOR ALL PEOPLE OF THE C... 9/1/2014 12:00:00 AM 8/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2028823956 7/17/2014 6:45:37 AM ... 6 2C 2C03 THIRD 308 8 NONE NONE NONE NONE
4 245630 CANCELLED Cigarette Retail 4.103120e+11 RASO CORPORATION 5/1/2014 12:00:00 AM 4/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2025752450 7/17/2014 6:45:37 AM ... 6 6A 6A01 FIRST 104 25 NONE NONE NONE NONE
5 245768 ACTIVE Restaurant 9.313120e+11 PACIFICO ON EIGHTH, LLC 6/1/2014 12:00:00 AM 5/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2025365650 7/17/2014 6:45:37 AM ... 6 6B 6B04 FIRST 107 26 NONE NONE NONE CAPITOL HILL BID
6 248720 READY TO BATCH PRINT One Family Rental 5.005120e+11 YARMOUTH MANAGEMENT 8/1/2014 12:00:00 AM 7/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2025473511 7/17/2014 6:45:37 AM ... 6 6A 6A06 FIRST 108 25 NONE NONE NONE NONE
7 249244 READY TO BATCH PRINT One Family Rental 5.005120e+11 YARMOUTH MANAGEMENT 8/1/2014 12:00:00 AM 7/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2026473511 7/17/2014 6:45:37 AM ... 6 6A 6A05 FIRST 107 25 NONE NONE NONE NONE
8 249666 READY TO BATCH PRINT One Family Rental 5.005120e+11 YARMOUTH MANAGEMENT 8/1/2014 12:00:00 AM 7/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2025473511 7/17/2014 6:45:37 AM ... 6 6A 6A03 FIRST 107 25 NONE NONE NONE NONE
9 249875 READY TO BATCH PRINT One Family Rental 5.005120e+11 None 8/1/2014 12:00:00 AM 7/31/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2025473511 7/17/2014 6:45:37 AM ... 6 6B 6B10 FIRST 108 26 NONE NONE NONE NONE
10 272577 ACTIVE Charitable Solicitation 4.002140e+11 SHAW COMMUNITY MINISTRY 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM None 7/17/2014 6:45:38 AM ... 6 2C 2C01 THIRD 307 7 NONE NONE NONE NONE
11 272580 ACTIVE General Business Licenses 4.003149e+11 DC Tree LLC; Matthew T. Jones 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 202-294-1570 7/17/2014 6:45:38 AM ... 6 2C 2C02 THIRD 308 7 NONE NONE NONE NONE
12 272584 ACTIVE General Business Licenses 4.003149e+11 Health Strategies and Associates; Dianne Harris 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 202-578-4031 7/17/2014 6:45:38 AM ... 6 6D 6D07 FIRST 106 27 NONE NONE NONE CAPITOL RIVERFRONT BID
13 272593 ACTIVE One Family Rental 5.005140e+11 None 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 2022907971 7/17/2014 6:45:38 AM ... 6 6A 6A04 FIRST 108 26 NONE NONE NONE NONE
14 272600 ACTIVE One Family Rental 5.005149e+11 MEI MEI PENG 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 202-546-1771 7/17/2014 6:45:38 AM ... 6 6B 6B03 FIRST 107 26 NONE NONE NONE NONE
15 272601 ACTIVE One Family Rental 5.005149e+11 Mark Muenchrath; Mark Muenchrath 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 202-997-3575 7/17/2014 6:45:38 AM ... 6 6B 6B01 FIRST 106 26 NONE NONE NONE NONE
16 272602 ACTIVE One Family Rental 5.005149e+11 Holmes Enterprises; Paloma Holmes 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 202-462-7200 7/17/2014 6:45:38 AM ... 6 6B 6B04 FIRST 107 26 NONE NONE NONE NONE
17 272614 ACTIVE Hotel 5.107140e+11 MHF NOMA OPERATING IV LLC 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 8004831140 7/17/2014 6:45:38 AM ... 6 6C 6C04 FIFTH 506 25 NONE NONE NONE NOMA BID
18 272616 ACTIVE Swimming Pool 9.211140e+11 MHF NOMA OPERATING IV LLC 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 8004831140 7/17/2014 6:45:38 AM ... 6 6C 6C04 FIFTH 506 25 NONE NONE NONE NOMA BID
19 272618 ACTIVE Food Products 9.306140e+11 MHF NOMA OPERATING IV LLC 7/1/2014 12:00:00 AM 6/30/2016 12:00:00 AM 7/15/2014 12:00:00 AM 8004831140 7/17/2014 6:45:38 AM ... 6 6C 6C04 FIFTH 506 25 NONE NONE NONE NOMA BID

20 rows × 29 columns

The sqldf() function returns a data frame, so we can store it in another variable for use in a plot, for example.


In [10]:
ward6 = mysqldf("select LICENSECATEGORY, count(*) as LICENSECOUNT from bbl where WARD = 6 group by LICENSECATEGORY")
ward6


Out[10]:
LICENSECATEGORY LICENSECOUNT
0 Charitable Solicitation 2
1 Cigarette Retail 1
2 Food Products 1
3 General Business Licenses 2
4 Hotel 1
5 One Family Rental 11
6 Restaurant 1
7 Swimming Pool 1

In [24]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure()

ward6.plot(kind='barh')


Out[24]:
<matplotlib.axes.AxesSubplot at 0x10adc6710>
<matplotlib.figure.Figure at 0x10a82ab90>

Example of a Complex Query


In [26]:
query = """
select BBL_LICENSE_FACT_ID as bbl_id, 
LATITUDE as lat, LONGITUDE as lon, 
WARD as ward, LICENSESTATUS as status, 
BUSINESSIMPROVEMENTDISTRICT as bid
from bbl
where LICENSESTATUS != 'CANCELLED'
and WARD != 'NA' 
and LATITUDE <> 0 
AND LONGITUDE <> 0"""

results = mysqldf(query)

In [27]:
results.head()


Out[27]:
bbl_id lat lon ward status bid
0 152391 38.928336 -77.005566 5 ACTIVE NONE
1 162782 38.904394 -77.049809 2 READY TO BATCH PRINT NONE
2 164337 38.895944 -76.999045 6 READY TO BATCH PRINT NONE
3 168986 38.899361 -77.027786 2 ACTIVE DOWNTOWN BID
4 174936 38.923781 -77.051009 3 ACTIVE NONE

In [39]:
from ggplot import *

ggplot(aes(x='lat', y='lon', color='bid'), data=results) + \
        geom_point() + ggtitle("Business Improvement District Locations") + \
        xlab("Latitude") + ylab("Longitude") + facet_grid("ward", "status", scales="free_xy")


Out[39]:
<ggplot: (282735529)>