SQL, the Sequel : Data Wranglers DC : August 2014 Meetup

Author: Ryan B. Harvey

Created: 2014-07-18 / Last Updated: 2014-07-18

Contents:

  • The sqldf function in Julia (using the SQLite.jl package)
  • Loading some test data
  • Querying and subsetting that test data
  • Example of a complex query

Package Dependencies:

  • SQLite.jl: Interface to SQLite databases from Julia; also allows SQL operations on data frames via the sqldf function, based on the package of the same name in R

References:

The sqldf function in Julia (using the SQLite.jl package)

The SQLite.jl package is a Julia package that provides an interface to the SQLite library and support for operations on DataFrames through the sqldf function. The sqldf function creates tables in a local SQLite database, runs the requested SQL query on those tables, and then drops the tables from the SQLite database. Although this can be slower than subsetting a data frame directly, in cases where the data frame is very complex and conditions get difficult to parse, SQL can provide a straightforward, declarative way of requesting a particular data subset.

If the SQLite.jl package doesn't exist in your environment, add the package. Once you have it, use it.


In [ ]:
Pkg.update()

In [ ]:
Pkg.add("SQLite")

In [ ]:
Pkg.add("DataFrames")

In [ ]:
using SQLite
using DataFrames

Load some data to work with

For this example, we'll use the Basic Business Licenses data from the DC government open data site.

The data is available as a zip file, so we'll download it if it's not already available, and then unzip it if it hasn't been already.


In [ ]:
url = "http://data.octo.dc.gov/feeds/bbl/bbl_current_csv.zip"

data_dir = "../data/"            
current_dir = pwd()
cd(data_dir)

local_zipped = "bbl_current_csv.zip"
local_data = "bbl_current_csv.csv"

if !(isfile(local_zipped) && isfile(local_data))
  download(url, local_zipped)
  fileDownloadedDate = strftime("%F", time())
  fileDownloadedDate
end

In [ ]:
if !isfile(local_data)
  run(`unzip $local_zipped`)
end

In [ ]:
bbl = readtable(local_data)

In [ ]:
summary(bbl)

In [ ]:
names(bbl)

Querying and subsetting that test data

Now, let's do a few queries on it just to get the hang of using sqldf.


In [ ]:
sqldf("select * from bbl where BUSINESSIMPROVEMENTDISTRICT = 'DOWNTOWN BID'")

In [ ]:
sqldf("select * from bbl where WARD = 6")

The sqldf() function returns a data frame, so we can store it in another variable for use in a plot, for example.


In [ ]:
ward6 = sqldf("select LICENSECATEGORY, count(*) as LICENSECOUNT from bbl where WARD = 6 group by LICENSECATEGORY")
ward6

In [ ]:
Pkg.add("PyPlot")
using PyPlot

In [ ]:
barh(ward6["LICENSECOUNT"], ward6["LICENSECATEGORY"])

In [ ]: