More SQL

Let's grab a fairly large dataset, load it into a database, and work with it.

Getting your data

Capital Bikeshare trip data is a fun source of transactional data. We can work with one quarter's data to show a few key concepts.

The following few cells should be feeling like old hat to you by now.


In [ ]:
!wget https://www.capitalbikeshare.com/assets/files/trip-history-data/2013-Q1-Trips-History-Data.zip

It's in a zip format, so unzip it:


In [ ]:
!unzip 2013-Q1-Trips-History-Data.zip

How big is it?


In [ ]:
!wc 2013-Q1-Trips-History-Data.csv

What are its columns?


In [ ]:
!csvcut -n 2013-Q1-Trips-History-Data.csv

Okay, let's have a look.


In [ ]:
!head -5 2013-Q1-Trips-History-Data.csv | csvlook

Ah, that's kinda wordy. Let's cut out that first column, which we can compute for ourselves later.


In [ ]:
!head 2013-Q1-Trips-History-Data.csv | csvcut -C1 | csvlook

That's a little bit cleaner, and the rest of the data should be useful. Let's clean up the data by removing that column and renaming the headers so they're a little easier to query.


In [ ]:
!csvcut -C1 2013-Q1-Trips-History-Data.csv | \
  header -r "start_date,end_date,start_station,end_station,bike_id,sub_type" \
  > bikeshare.csv

Make sure you haven't lost anything!


In [ ]:
!wc bikeshare.csv

Prepping and loading data into the database

Alright, then, let's get loading.


In [ ]:
%load_ext sql

NOTE: See a bunch of ShimWarnings with a pink background? That's normal. It's just a heads-up about ongoing changes to IPython/Jupyter code. You can keep going.

First, we create a database in mysql. Note: you can do the same thing on the command line by issuing the CREATE DATABASE command part before the pipe within the mysql shell, which you get to with the second part after the pipe. Here we'll pipe the one into the other so it reads well in the notebook.


In [ ]:
!echo "CREATE DATABASE bikedb" | mysql --user=mysqluser --password=mysqlpass

Here's how we connect the notebook up to the mysql database using a username and password. Remember that this shorthand version is possible thanks to the excellent ipython-sql Jupyter extension that we're using, otherwise you'd have to establish the connection, get a cursor, etc., like you've done explicitly in python in your other class.

Not that there's anything wrong with that.


In [ ]:
%sql mysql://mysqluser:mysqlpass@localhost/bikedb

Very easy, no?

First, clean up if we're not running this for the first time.


In [ ]:
%%sql
DROP TABLE IF EXISTS bikeshare;

Next, create a table schema using DDL.


In [ ]:
%%sql
CREATE TABLE bikeshare (
  start_date    DATETIME,
  end_date      DATETIME,
  start_station VARCHAR(100),
  end_station   VARCHAR(100),
  bike_id       CHAR(7),
  sub_type      CHAR(10)
  )

Just to verify it worked:


In [ ]:
%%sql
SELECT COUNT(*)
FROM bikeshare

It worked! We just don't have any data in there yet.

Now we load the data using LOAD DATA INFILE. You can do pretty much the same thing from the bash shell using mysqlimport and a bunch of options. It'll read better here in the notebook with the options spelled out.

Docs for LOAD DATA INFILE are available at https://dev.mysql.com/doc/refman/5.1/en/load-data.html.

Note: this assumes you've placed your bikeshare file in the directory /vagrant.

Note also: I had to look up the mysql date formatting docs to get this date format conversion correct. It took me a few trials and errors before I got it right. This is an extremely common thing to have to do if you ever spend time wrangling data - every system handles dates in its own way.


In [ ]:
%%sql
LOAD DATA INFILE '/vagrant/bikeshare.csv'
REPLACE
INTO TABLE bikeshare
FIELDS TERMINATED BY ','
       OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES
(@start_date, @end_date, start_station, end_station, bike_id, sub_type) 
SET start_date = STR_TO_DATE(@start_date, '%c/%e/%Y %k:%i'),
    end_date = STR_TO_DATE(@end_date, '%c/%e/%Y %k:%i')

Note: if the above command fails for you with a "file not found" error, please read these notes about apparmor. Follow that advice, and add a line like it shows, e.g.:

/vagrant/* r

...to the file, or whatever path you have your data on, reload apparmor, and try again. I had to do this, and it worked perfectly after I made that change.

Exploring your data

Now that we've loaded our data, or we think we have, let's just verify it. Should be the same row count as what csvkit and wc gave us.


In [ ]:
%%sql
SELECT COUNT(*) 
FROM bikeshare

Looks good! Let's look at the data a little.


In [ ]:
%%sql
SELECT *
FROM bikeshare
LIMIT 5

How does MySQL construct this query, or more specifically, what's its execution plan? We can find out with EXPLAIN.

For more about how to read MySQL 5.5's query plan, see https://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html.


In [ ]:
%%sql
EXPLAIN SELECT COUNT(*)
FROM bikeshare
LIMIT 5

This says "using no keys, we're going to just scan roughly 395,390 rows, sans indexes, to answer this query."


In [ ]:
%%sql
SELECT MAX(start_date)
FROM bikeshare

In [ ]:
%%sql
EXPLAIN SELECT MAX(start_date)
FROM bikeshare

Pretty much the same thing. You can't get the max without looking at all of the values if there is no index.


In [ ]:
%%sql
SELECT COUNT(*)
FROM bikeshare
WHERE start_station LIKE "%dupont%"

In [ ]:
%%sql
EXPLAIN SELECT COUNT(*)
FROM bikeshare
WHERE start_station LIKE "%dupont%"

Now we see "using where" under "extra", so we know there's a filter operation, but that's about the only change. What if we add more things to filter on?


In [ ]:
%%sql
EXPLAIN SELECT start_station, end_station, COUNT(*)
FROM bikeshare
WHERE start_station LIKE "%dupont%"
AND end_station LIKE "%21st%"
AND start_date LIKE "2013-02-14%"
GROUP BY start_station, end_station
ORDER BY start_station, end_station

Ah, some more info - it looks like it's using a temporary relation to store intermediate results, perhaps for the GROUP BY, then a sort to handle ORDER BY.

Still no indexes, though. Let's change that.


In [ ]:
%%sql
CREATE INDEX idx_start_station ON bikeshare (start_station)

In [ ]:
%%sql
EXPLAIN SELECT start_station, end_station, COUNT(*)
FROM bikeshare
WHERE start_station LIKE "21st%"
AND start_date LIKE "2013-02-14%"
GROUP BY start_station, end_station
ORDER BY start_station, end_station

I changed the query a little bit to use the index, do you see the difference? It found search keys in the index, and the row count went down by an order of magnitude. That's the power of indexes.

It helps even on simple queries like this.


In [ ]:
%%sql
EXPLAIN SELECT DISTINCT start_station
FROM bikeshare
ORDER BY start_station

What's that 201 value for rows? Maybe the actual count of distinct values. We can test that:


In [ ]:
%%sql
SELECT COUNT(*) 
FROM (
    SELECT DISTINCT start_station 
    FROM bikeshare
    ) made_up_subquery_alias_name

There you go, that's exactly the answer.

How about that MAX() query we tried a little while back?


In [ ]:
%%sql
SELECT MAX(start_date)
FROM bikeshare

In [ ]:
%%sql
EXPLAIN SELECT MAX(start_date)
FROM bikeshare

Let's create another index on start_date to see what the effect on the query plan will be.


In [ ]:
%%sql
CREATE INDEX idx_start_date ON bikeshare (start_date)

In [ ]:
%%sql
SELECT MAX(start_date)
FROM bikeshare

Same result, but...


In [ ]:
%%sql
EXPLAIN SELECT MAX(start_date)
FROM bikeshare

That's new! In this case it doesn't have to look at any rows, it can just look at one end of the index. We've optimized away the need to even look at the table.

Let's go back to COUNT() and try a few more things before we move on.


In [ ]:
%%sql
EXPLAIN SELECT COUNT(*)
FROM bikeshare

In [ ]:
%%sql
EXPLAIN SELECT COUNT(start_date)
FROM bikeshare

In [ ]:
%%sql
EXPLAIN SELECT COUNT(end_date)
FROM bikeshare

Do you see what happened there?

Normalizing attributes

Let's look at a few tasks you might need to perform if you were normalizing this dataset. Remember that in normalization, we reduce redundancy with the goal of consistency.

What's redundant? Well, the station names for one.


In [ ]:
%%sql
SELECT COUNT(DISTINCT start_station)
FROM bikeshare

In [ ]:
%%sql
SELECT COUNT(DISTINCT end_station)
FROM bikeshare

Hmm, they're different. Let's put them together.


In [ ]:
%%sql
SELECT COUNT(DISTINCT station) FROM
(
  SELECT start_station AS station FROM bikeshare
    UNION
  SELECT end_station AS station FROM bikeshare
) a

We'll create a table to hold the names of stations. Each station name should be represented once, and we'll assign a primary key to each in the form of a unique integer.


In [ ]:
%%sql
CREATE TABLE station (
  id SMALLINT NOT NULL AUTO_INCREMENT,
  name VARCHAR(100),
  PRIMARY KEY (id)
)

In [ ]:
%%sql
SELECT COUNT(*) 
FROM station

Looks good. Now we can load the data with an INSERT that draws from our previous query. We can skip specifying the id because MySQL will do that for us.

Note: every database handles this issue its own way. This is a nice convenience in MySQL; other database backends require more work.


In [ ]:
%%sql
INSERT INTO station (name) 
SELECT DISTINCT station AS name
FROM
(
  SELECT start_station AS station FROM bikeshare
    UNION
  SELECT end_station AS station FROM bikeshare
) a

In [ ]:
%%sql
SELECT * 
FROM station
LIMIT 10

It worked. Now we can update the bikeshare table to add columns for station identifiers.


In [ ]:
%%sql
ALTER TABLE bikeshare
ADD COLUMN start_station_id SMALLINT
AFTER start_station

Looks good. But what exactly just happened?


In [ ]:
%%sql
DESCRIBE bikeshare

In [ ]:
%%sql
SELECT * 
FROM bikeshare
LIMIT 5

What just happened? Why are all the start_station_id values None?

Let's fill in those values with our new identifiers from the station table.


In [ ]:
%%sql
UPDATE bikeshare
INNER JOIN station
  ON bikeshare.start_station = station.name
SET bikeshare.start_station_id = station.id

In [ ]:
%%sql
SELECT * FROM bikeshare LIMIT 5

In [ ]:
%%sql
SELECT * FROM station WHERE id = 161

Great, now we can drop start_station from bikeshare and save a lot of space.


In [ ]:
%%sql
ALTER TABLE bikeshare
DROP COLUMN start_station

In [ ]:
%%sql
DESCRIBE bikeshare

In [ ]:
%%sql
SELECT * FROM bikeshare LIMIT 5

Worked!

And we can repeat the process for end_station.


In [ ]:
%%sql
ALTER TABLE bikeshare
ADD COLUMN end_station_id SMALLINT
AFTER end_station

In [ ]:
%%sql
UPDATE bikeshare
INNER JOIN station
  ON bikeshare.end_station = station.name
SET bikeshare.end_station_id = station.id

In [ ]:
%%sql
ALTER TABLE bikeshare
DROP COLUMN end_station

In [ ]:
%%sql
SELECT * FROM bikeshare LIMIT 5

A lot leaner, right?

JOINs and indexes

Now let's look at queries that return station names, thus requiring a JOIN across the two tables. Keep in mind our two table schema.


In [ ]:
%%sql
DESCRIBE station

In [ ]:
%%sql
DESCRIBE bikeshare

Let's try a basic query that looks for the most busy station pairs.


In [ ]:
%%sql
SELECT COUNT(*) AS c, start_station_id, end_station_id
FROM bikeshare
GROUP BY start_station_id, end_station_id
ORDER BY c DESC
LIMIT 5

Now let's liven it up by joining to station and including station names. We'll need to join twice, using two aliases.

Worked just fine. Let's look under the hood, though.


In [ ]:
%%sql
SELECT COUNT(*) AS c, station_1.name AS start_station, station_2.name AS end_station
FROM bikeshare, station AS station_1, station AS station_2
WHERE station_1.id = bikeshare.start_station_id
  AND station_2.id = bikeshare.end_station_id
GROUP BY bikeshare.start_station_id, bikeshare.end_station_id
ORDER BY c DESC
LIMIT 5

Looks good, and it's in my neighborhood. :)

Let's look at the query plan for all this:


In [ ]:
%%sql
EXPLAIN SELECT COUNT(*) AS c, station_1.name AS start_station, station_2.name AS end_station
FROM station AS station_1, station AS station_2, bikeshare
WHERE bikeshare.start_station_id = station_1.id
  AND bikeshare.end_station_id = station_2.id
GROUP BY bikeshare.start_station_id, bikeshare.end_station_id
ORDER BY c DESC
LIMIT 5

Not bad, but it's doing a full table scan on bikeshare. Let's see if some indexes would help with the two joins.


In [ ]:
%%sql
CREATE INDEX idx_start_station_id ON bikeshare (start_station_id)

In [ ]:
%%sql
CREATE INDEX idx_end_station_id ON bikeshare (end_station_id)

In [ ]:
%%sql
EXPLAIN SELECT COUNT(*) AS c, station_1.name AS s1_name, station_2.name AS s2_name
FROM bikeshare, station AS station_1, station AS station_2
WHERE station_1.id = bikeshare.start_station_id
  AND station_2.id = bikeshare.end_station_id
GROUP BY bikeshare.start_station_id, bikeshare.end_station_id
ORDER BY c DESC
LIMIT 5

Well, it's hard to say how much better this will perform without a lot more data. A COUNT operation simply needs to be able to count everything, if the level of granularity it's counting doesn't already have an easy lookup like we saw before. Sometimes you just don't feel the pain of scale until you hit a scaling threshold that varies with the shape of your data.

But - see the possible_keys in the first row? That means the optimizer sees the indexes present and will attempt to use those to at least organize the query a little better than it would be able to do without them.

Let's try one more thing - we can create an index on multiple columns that matches our query more precisely. It's inefficient tot look up one column, then another, after all, we're looking for combinations of both. A multiple column index can precompute that.


In [ ]:
%%sql
CREATE INDEX idx_stations ON bikeshare (start_station_id, end_station_id)

In [ ]:
%%sql
EXPLAIN SELECT COUNT(*) AS c, station_1.name AS s1_name, station_2.name AS s2_name
FROM bikeshare, station AS station_1, station AS station_2
WHERE station_1.id = bikeshare.start_station_id
  AND station_2.id = bikeshare.end_station_id
GROUP BY bikeshare.start_station_id, bikeshare.end_station_id
ORDER BY c DESC
LIMIT 5

Finally, looks like a big difference!