Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE", as well as your name and collaborators below:


In [ ]:
NAME = "dchud"
COLLABORATORS = ""

Exercise 03 - Due Friday, September 30 at 12pm

Objectives: Gain experience loading a CSV dataset into a database and using SQL to explore its contents. Write and execute a number of SQL queries using common syntax and functions.

Grading criteria: The tasks should all be completed, and questions should all be answered with SQL queries in the space provided, unless a text answer is requested. Results should be correct, according to the embedded tests. The notebook itself should be completely reproducible; from start to finish, another person should be able to use the same code to obtain the same results as yours.

Deadline: Friday, September 30, 12pm. Submit your notebook to Blackboard and push it to your GitHub repository.

In this notebook we'll download a clean CSV dataset from data.gov, load it into a SQLite database, and perform a series of queries to answer several questions. For each problem, write and execute queries that provides the answer in the cells provided, with your SQL queries in the places marked. For each problem after executing your query, immediately execute the following test cell. If the tests complete without error, you have completed that question successfully. If errors arise, you might be missing something. Do not change the tests, just execute them as they are, and update or refine your query until the tests pass.

For this assignment, you need not add narrative description to most of your queries (except where explicitly noted), although you may do so if something you see in the data prompts you. If you do, add new text cells and use Markdown formatting.

Suggestion: if you have worked through the Software Carpentry SQL lessons and have run through the last two lecture notes notebooks, this should all be fairly easy. If you have done neither, do them now, before you begin.

Setup - obtain data and create database

The Connecticut DMV Boating Registrations dataset comprises several years of summary records. It is available from data.gov.

First we download the dataset (note: it might take several seconds):


In [1]:
!wget --quiet -O boating.csv "https://data.ct.gov/api/views/mrb6-7ee5/rows.csv?accessType=DOWNLOAD"

Verify that it's what we think it is on the commandline:


In [2]:
!head boating.csv | csvlook


|--------------+-----------------------------+---------------|
|  Fiscal Year | Type of Transactions        | No .of Boats  |
|--------------+-----------------------------+---------------|
|  2008        | BOAT CHANGE OF TYPE         | 59            |
|  2009        | BOAT CHANGE OF TYPE         | 50            |
|  2010        | BOAT CHANGE OF TYPE         | 21            |
|  2011        | BOAT CHANGE OF TYPE         | 41            |
|  2012        | BOAT CHANGE OF TYPE         | 16            |
|  2013        | BOAT CHANGE OF TYPE         | 27            |
|  2014        | BOAT CHANGE OF TYPE         | 54            |
|  2008        | BOAT RECORD XFER NO PROBATE | 6             |
|  2009        | BOAT RECORD XFER NO PROBATE | 12            |
|--------------+-----------------------------+---------------|

Looks right. How many records are there?


In [3]:
!wc -l boating.csv


     145 boating.csv

So that should be 145, counting the header. And the basic stats:


In [4]:
!csvstat boating.csv


  1. Fiscal Year
	<class 'int'>
	Nulls: False
	Min: 2008
	Max: 2014
	Sum: 289578
	Mean: 2010.9583333333333
	Median: 2011.0
	Standard Deviation: 1.9960899278339177
	Unique values: 7
	5 most frequent values:
		2008:	21
		2010:	21
		2011:	21
		2009:	21
		2014:	20
  2. Type of Transactions
	<class 'str'>
	Nulls: False
	Unique values: 21
	5 most frequent values:
		RENEW BOAT EDUCATIONAL:	7
		BOAT REG, CANOE:	7
		BOAT REG, PONTOON:	7
		RENEW BOAT MILTRY PERSONNEL EXMP:	7
		BOAT, REMAINDER FEE:	7
	Max length: 32
  3. No .of Boats
	<class 'int'>
	Nulls: False
	Min: 0
	Max: 95270
	Sum: 767584
	Mean: 5330.444444444444
	Median: 43.5
	Standard Deviation: 19683.78413284855
	Unique values: 96
	5 most frequent values:
		2:	9
		1:	4
		16:	4
		19:	4
		21:	4

Row count: 144

Looks about right!

Note, though, that the column names have spaces, punctuation, and Upper Cased Names. That's annoying! First let's rename the header line.


In [5]:
!mv boating.csv boating-orig.csv

Okay, using output redirection and tail we can write a new header line.


In [6]:
!echo "year,tx_type,num" > boating.csv

In [7]:
!tail -n +2 boating-orig.csv >> boating.csv

In [8]:
!head boating.csv | csvlook


|-------+-----------------------------+------|
|  year | tx_type                     | num  |
|-------+-----------------------------+------|
|  2008 | BOAT CHANGE OF TYPE         | 59   |
|  2009 | BOAT CHANGE OF TYPE         | 50   |
|  2010 | BOAT CHANGE OF TYPE         | 21   |
|  2011 | BOAT CHANGE OF TYPE         | 41   |
|  2012 | BOAT CHANGE OF TYPE         | 16   |
|  2013 | BOAT CHANGE OF TYPE         | 27   |
|  2014 | BOAT CHANGE OF TYPE         | 54   |
|  2008 | BOAT RECORD XFER NO PROBATE | 6    |
|  2009 | BOAT RECORD XFER NO PROBATE | 12   |
|-------+-----------------------------+------|

Much easier to work with now.

Next we convert the updated csv file into a SQLite database using CSVkit.


In [9]:
!csvsql --db sqlite:///boating.db --insert boating.csv


(sqlite3.OperationalError) table boating already exists [SQL: '\nCREATE TABLE boating (\n\tyear INTEGER NOT NULL, \n\ttx_type VARCHAR(32) NOT NULL, \n\tnum INTEGER NOT NULL\n)\n\n']

To work with it, we'll need the ipython-sql extension loaded, and then we'll need to connect to the db.


In [10]:
%load_ext sql


/usr/local/var/pyenv/versions/miniconda3-latest/envs/mc3-2016-class/lib/python3.5/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.
  "You should import from traitlets.config instead.", ShimWarning)
/usr/local/var/pyenv/versions/miniconda3-latest/envs/mc3-2016-class/lib/python3.5/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.
  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")

Note if you see a pink box above with six lines of "ShimWarning" and "UserWarning", don't panic. This is just a warning message from the ipython-sql package. You are good to go.


In [11]:
%sql sqlite:///boating.db


Out[11]:
'Connected: None@boating.db'

In [15]:
%%sql
SELECT COUNT(*)
FROM boating;


Done.
Out[15]:
COUNT(*)
144

Looks like the same number of rows! We're good to go.

Basic queries

In the following queries, we'll do some basic exploration of the data. Let's first see what a few records look like.


In [16]:
%%sql
SELECT *
FROM boating
LIMIT 10;


Done.
Out[16]:
year tx_type num
2008 BOAT CHANGE OF TYPE 59
2009 BOAT CHANGE OF TYPE 50
2010 BOAT CHANGE OF TYPE 21
2011 BOAT CHANGE OF TYPE 41
2012 BOAT CHANGE OF TYPE 16
2013 BOAT CHANGE OF TYPE 27
2014 BOAT CHANGE OF TYPE 54
2008 BOAT RECORD XFER NO PROBATE 6
2009 BOAT RECORD XFER NO PROBATE 12
2010 BOAT RECORD XFER NO PROBATE 15

This should look familiar!

Let's look at just the "change" types.


In [17]:
%%sql
SELECT *
FROM boating
WHERE tx_type = "BOAT CHANGE OF TYPE";


Done.
Out[17]:
year tx_type num
2008 BOAT CHANGE OF TYPE 59
2009 BOAT CHANGE OF TYPE 50
2010 BOAT CHANGE OF TYPE 21
2011 BOAT CHANGE OF TYPE 41
2012 BOAT CHANGE OF TYPE 16
2013 BOAT CHANGE OF TYPE 27
2014 BOAT CHANGE OF TYPE 54

How many is that, and which year had the most?


In [18]:
%%sql
SELECT COUNT(*)
FROM boating
WHERE tx_type = "BOAT CHANGE OF TYPE";


Done.
Out[18]:
COUNT(*)
7

Which year had the most of these transactions?


In [ ]:
%%sql
SELECT *
FROM boating
WHERE tx_type = "BOAT CHANGE OF TYPE"
ORDER BY num DESC;

...alright, your turn.

Question 1

Use DISTINCT to determine the unique set of years in this dataset.


In [23]:
%%sql
SELECT year
FROM boating;


Done.
Out[23]:
year
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014
2008
2009
2010
2011
2012
2013
2014

In [24]:
df = _.DataFrame()
for y in [2008, 2009, 2010, 2011, 2012, 2013, 2014]:
    assert y in df.year.values
assert len(df) == 7


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-24-0ef4bde685c0> in <module>()
      2 for y in [2008, 2009, 2010, 2011, 2012, 2013, 2014]:
      3     assert y in df.year.values
----> 4 assert len(df) == 7

AssertionError: 

Question 2

Use DISTINCT to determine the unique set of transaction types in this dataset.


In [ ]:
-- YOUR CODE HERE

In [ ]:
df = _.DataFrame()
assert len(df) == 21
for tx_type in ["BOAT, DUPLICATE REGISTRATION", "REN BOAT PONTOON", "BOAT REG, CANOE"]:
    assert tx_type in df.tx_type.values

Question 3

Use GROUP BY to determine the overall number of transactions (across all types) per year.


In [ ]:
-- YOUR CODE HERE

In [ ]:
df = _.DataFrame()
df.rename(columns=lambda x: "year" if x == "year" else "count", inplace=True)
assert (df.loc[df['year'] == 2009]['count'] == 21).all()
assert (df.loc[df['year'] == 2013]['count'] == 20).all()

Question 4

Use SUM and GROUP BY to determine the overall number of transactions (across all years) per type.


In [ ]:
-- YOUR CODE HERE

In [ ]:
df = _.DataFrame()
df.rename(columns=lambda x: "tx_type" if x == "tx_type" else "sum", inplace=True)
assert (df.loc[df['tx_type'] == "BOAT REG, CANOE"]['sum'] == 1129).all()
assert (df.loc[df['tx_type'] == "REN BOAT REGISTRATION"]['sum'] == 640790).all()

Question 5

Use ORDER BY and LIMIT to determine the top five types of transactions overall.


In [ ]:
-- YOUR CODE HERE

In [ ]:
df = _.DataFrame()
df.rename(columns=lambda x: "tx_type" if x == "tx_type" else "sum_num", inplace=True)
assert len(df) == 5
assert (df.loc[df['tx_type'] == "REN BOAT PONTOON"]['sum_num'] == 15556).all()
assert (df.loc[df['tx_type'] == "REN BOAT REGISTRATION"]['sum_num'] == 640790).all()
assert df['sum_num'].max() == 640790
assert df['sum_num'].min() == 2390

Question 6

Using ORDER BY and LIMIT again, what were the top five types of transactions in 2010?


In [ ]:
-- YOUR CODE HERE

In [ ]:
df = _.DataFrame()
assert len(df) == 5
assert (df.loc[df['tx_type'] == "REN BOAT PONTOON"]['num'] == 2118).all()
assert (df.loc[df['tx_type'] == "BOAT REG, NEW"]['num'] == 12569).all()
assert df['num'].max() == 94005
assert df['num'].min() == 555

Question 7

Use a wildcard search to determine how many transactions in 2012 were renewals.


In [ ]:
-- YOUR CODE HERE

In [ ]:
df = _.DataFrame()
assert df.values[0][0] in [7, 9]

Question 8

How many transactions overall involve canoes?


In [ ]:
-- YOUR CODE HERE

In [ ]:
df = _.DataFrame()
assert df.values[0][0] == 14

Question 9

How many transactions in 2011 involved a new registration (as opposed to a renewal or change of type)?


In [ ]:
-- YOUR CODE HERE

In [ ]:
df = _.DataFrame()
assert df.values[0][0] in [12071, 14466]

Question 10

How do the transaction trends over time involving pontoons compare to overall boating transaction activity? Discuss as appropriate, adding Markdown cells for your discussion after your exploratory queries.


In [ ]:
-- YOUR CODE HERE

Bonus (10 points)

Make a plot (inline, here, using python) that demonstrates one or more overall trends in boat registrations in Connecticut, drawing data directly from the database.