Assignment 4 - Due Tuesday, October 13 at 10pm

Objectives: Gain experience loading a CSV dataset into a database and using SQL to explore its contents. Write and execute a number of SQL queries using common syntax and functions.

Grading criteria: The tasks should all be completed, and questions should all be answered. Results should be correct, according to a specific answer key. The notebook itself should be completely reproducible; from start to finish, another person should be able to use the same code to obtain the same results as yours.

Deadline: Tuesday, October 13, 2015, 10pm

Reminder: We will not meet on Tuesday, October 13.

In this notebook we'll download a clean CSV dataset from data.gov, load it into a SQLite database, and perform a series of queries to answer several questions. For each problem, write and execute queries that provides the answer. Turn in your notebook with your output and a PDF version with your output together in a single zip file.

For this assignment, you need not add narrative description to most of your queries, although you may do so if something you see in the data prompts you.

Suggestion: if you have worked through the Software Carpentry SQL lessons and have run through the Exercise notebook from class October 6, this should all be fairly easy. If you have done neither, do them now, before you begin.

Setup - obtain data and create database

The Connecticut DMV Boating Registrations dataset comprises several years of summary records. It is available from data.gov.

First we download the dataset (note: it might take several seconds):


In [ ]:
!wget --quiet -O boating.csv "https://data.ct.gov/api/views/mrb6-7ee5/rows.csv?accessType=DOWNLOAD"

Verify that it's what we think it is on the commandline:


In [ ]:
!head boating.csv | csvlook

Looks right. How many records are there?


In [ ]:
!wc boating.csv

So that should be 144, counting the header. And the basic stats:


In [ ]:
!csvstat boating.csv

Looks about right!

Note, though, that the column names have spaces, punctuation, and CamelCase. That's annoying! First let's rename the header line.


In [ ]:
!mv boating.csv boating-orig.csv

In [ ]:
!cat boating-orig.csv | header -r "year,tx_type,num" > boating.csv

Now we convert the updated csv file into a SQLite database using CSVkit.


In [ ]:
!csvsql --db sqlite:///boating.db --insert boating.csv

To work with it, we'll need the ipython-sql extension loaded, and then we'll need to connect to the db.


In [ ]:
%load_ext sql

In [ ]:
%sql sqlite:///boating.db

In [ ]:
%%sql
SELECT COUNT(*)
FROM boating

Looks like the same number of rows! We're good to go.

Basic queries

In the following queries, we'll do some basic exploration of the data. Let's first see what a few records look like.


In [ ]:
%%sql
SELECT *
FROM boating
LIMIT 10

This should look familiar!

Let's look at just the "change" types.


In [ ]:
%%sql
SELECT *
FROM boating
WHERE tx_type = "BOAT CHANGE OF TYPE"

How many is that, and which year had the most?


In [ ]:
%%sql
SELECT COUNT(*)
FROM boating
WHERE tx_type = "BOAT CHANGE OF TYPE"

In [ ]:
%%sql
SELECT *
FROM boating
WHERE tx_type = "BOAT CHANGE OF TYPE"
ORDER BY num DESC

...alright, your turn.

Question 1

Use "DISTINCT" to determine the unique set of years in this dataset.


In [ ]:
%%sql
-- write your query here (replace this comment line!)

Question 2

Use "DISTINCT" to determine the unique set of transaction types in this dataset.


In [ ]:
%%sql
-- write your query here

Question 3

Use "GROUP BY" to determine the overall number of transactions (across all types) per year.


In [ ]:
%%sql
-- write your query here

Question 4

Use "GROUP BY" to determine the overall number of transactions (across all years) per type.


In [ ]:
%%sql
-- write your query here

Question 5

Use "ORDER BY" to determine the top five types of transactions overall.


In [ ]:
%%sql
-- write your query here

Question 6

What were the top five types of transactions in 2010?


In [ ]:
%%sql
-- write your query here

Question 7

Use a wildcard search to determine how many transactions in 2012 were renewals.


In [ ]:
%%sql
-- write your query here

Question 8

How many transactions overall involve canoes?


In [ ]:
%%sql
-- write your query here

Question 9

How many transactions in 2011 involved a new registration (as opposed to a renewal or change of type)?


In [ ]:
%%sql
-- write your query here

Question 10

How do the transaction trends over time involving pontoons compare to overall boating transaction activity? Discuss as appropriate.


In [ ]:
%%sql
-- write your query or queries here

Bonus (10 points)

Make a plot (inline, here, using python) that demonstrates one or more overall trends in boat registrations in Connecticut, drawing data directly from the database.