Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).
Make sure you fill in any place that says YOUR CODE HERE
or "YOUR ANSWER HERE", as well as your name and collaborators below:
In [ ]:
NAME = "dchud"
COLLABORATORS = ""
Objectives: Gain experience loading a CSV dataset into a database and using SQL to explore its contents. Write and execute a number of SQL queries using common syntax and functions.
Grading criteria: The tasks should all be completed, and questions should all be answered with SQL queries in the space provided, unless a text answer is requested. Results should be correct, according to the embedded tests. The notebook itself should be completely reproducible; from start to finish, another person should be able to use the same code to obtain the same results as yours.
Deadline: Friday, September 30, 12pm. Submit your notebook to Blackboard and push it to your GitHub repository.
In this notebook we'll download a clean CSV dataset from data.gov, load it into a SQLite database, and perform a series of queries to answer several questions. For each problem, write and execute queries that provides the answer in the cells provided, with your SQL queries in the places marked. For each problem after executing your query, immediately execute the following test cell. If the tests complete without error, you have completed that question successfully. If errors arise, you might be missing something. Do not change the tests, just execute them as they are, and update or refine your query until the tests pass.
For this assignment, you need not add narrative description to most of your queries (except where explicitly noted), although you may do so if something you see in the data prompts you. If you do, add new text cells and use Markdown formatting.
Suggestion: if you have worked through the Software Carpentry SQL lessons and have run through the last two lecture notes notebooks, this should all be fairly easy. If you have done neither, do them now, before you begin.
The Connecticut DMV Boating Registrations dataset comprises several years of summary records. It is available from data.gov.
First we download the dataset (note: it might take several seconds):
In [1]:
!wget --quiet -O boating.csv "https://data.ct.gov/api/views/mrb6-7ee5/rows.csv?accessType=DOWNLOAD"
Verify that it's what we think it is on the commandline:
In [2]:
!head boating.csv | csvlook
Looks right. How many records are there?
In [3]:
!wc -l boating.csv
So that should be 145, counting the header. And the basic stats:
In [4]:
!csvstat boating.csv
Looks about right!
Note, though, that the column names have spaces, punctuation, and Upper Cased Names. That's annoying! First let's rename the header line.
In [5]:
!mv boating.csv boating-orig.csv
Okay, using output redirection and tail
we can write a new header line.
In [6]:
!echo "year,tx_type,num" > boating.csv
In [7]:
!tail -n +2 boating-orig.csv >> boating.csv
In [8]:
!head boating.csv | csvlook
Much easier to work with now.
Next we convert the updated csv file into a SQLite database using CSVkit.
In [9]:
!csvsql --db sqlite:///boating.db --insert boating.csv
To work with it, we'll need the ipython-sql extension loaded, and then we'll need to connect to the db.
In [10]:
%load_ext sql
Note if you see a pink box above with six lines of "ShimWarning" and "UserWarning", don't panic. This is just a warning message from the ipython-sql
package. You are good to go.
In [11]:
%sql sqlite:///boating.db
Out[11]:
In [15]:
%%sql
SELECT COUNT(*)
FROM boating;
Out[15]:
Looks like the same number of rows! We're good to go.
In [16]:
%%sql
SELECT *
FROM boating
LIMIT 10;
Out[16]:
This should look familiar!
Let's look at just the "change" types.
In [17]:
%%sql
SELECT *
FROM boating
WHERE tx_type = "BOAT CHANGE OF TYPE";
Out[17]:
How many is that, and which year had the most?
In [18]:
%%sql
SELECT COUNT(*)
FROM boating
WHERE tx_type = "BOAT CHANGE OF TYPE";
Out[18]:
Which year had the most of these transactions?
In [ ]:
%%sql
SELECT *
FROM boating
WHERE tx_type = "BOAT CHANGE OF TYPE"
ORDER BY num DESC;
In [23]:
%%sql
SELECT year
FROM boating;
Out[23]:
In [24]:
df = _.DataFrame()
for y in [2008, 2009, 2010, 2011, 2012, 2013, 2014]:
assert y in df.year.values
assert len(df) == 7
In [ ]:
-- YOUR CODE HERE
In [ ]:
df = _.DataFrame()
assert len(df) == 21
for tx_type in ["BOAT, DUPLICATE REGISTRATION", "REN BOAT PONTOON", "BOAT REG, CANOE"]:
assert tx_type in df.tx_type.values
In [ ]:
-- YOUR CODE HERE
In [ ]:
df = _.DataFrame()
df.rename(columns=lambda x: "year" if x == "year" else "count", inplace=True)
assert (df.loc[df['year'] == 2009]['count'] == 21).all()
assert (df.loc[df['year'] == 2013]['count'] == 20).all()
In [ ]:
-- YOUR CODE HERE
In [ ]:
df = _.DataFrame()
df.rename(columns=lambda x: "tx_type" if x == "tx_type" else "sum", inplace=True)
assert (df.loc[df['tx_type'] == "BOAT REG, CANOE"]['sum'] == 1129).all()
assert (df.loc[df['tx_type'] == "REN BOAT REGISTRATION"]['sum'] == 640790).all()
In [ ]:
-- YOUR CODE HERE
In [ ]:
df = _.DataFrame()
df.rename(columns=lambda x: "tx_type" if x == "tx_type" else "sum_num", inplace=True)
assert len(df) == 5
assert (df.loc[df['tx_type'] == "REN BOAT PONTOON"]['sum_num'] == 15556).all()
assert (df.loc[df['tx_type'] == "REN BOAT REGISTRATION"]['sum_num'] == 640790).all()
assert df['sum_num'].max() == 640790
assert df['sum_num'].min() == 2390
In [ ]:
-- YOUR CODE HERE
In [ ]:
df = _.DataFrame()
assert len(df) == 5
assert (df.loc[df['tx_type'] == "REN BOAT PONTOON"]['num'] == 2118).all()
assert (df.loc[df['tx_type'] == "BOAT REG, NEW"]['num'] == 12569).all()
assert df['num'].max() == 94005
assert df['num'].min() == 555
In [ ]:
-- YOUR CODE HERE
In [ ]:
df = _.DataFrame()
assert df.values[0][0] in [7, 9]
In [ ]:
-- YOUR CODE HERE
In [ ]:
df = _.DataFrame()
assert df.values[0][0] == 14
How many transactions in 2011 involved a new registration (as opposed to a renewal or change of type)?
In [ ]:
-- YOUR CODE HERE
In [ ]:
df = _.DataFrame()
assert df.values[0][0] in [12071, 14466]
In [ ]:
-- YOUR CODE HERE