Let's grab a fairly large dataset, load it into a database, and work with it.
Capital Bikeshare trip data is a fun source of transactional data. We can work with one quarter's data to show a few key concepts.
The following few cells should be feeling like old hat to you by now.
In [ ]:
!wget https://www.capitalbikeshare.com/assets/files/trip-history-data/2013-Q1-Trips-History-Data.zip
It's in a zip format, so unzip it:
In [ ]:
!unzip 2013-Q1-Trips-History-Data.zip
How big is it?
In [ ]:
!wc 2013-Q1-Trips-History-Data.csv
What are its columns?
In [ ]:
!csvcut -n 2013-Q1-Trips-History-Data.csv
Okay, let's have a look.
In [ ]:
!head -5 2013-Q1-Trips-History-Data.csv | csvlook
Ah, that's kinda wordy. Let's cut out that first column, which we can compute for ourselves later.
In [ ]:
!head 2013-Q1-Trips-History-Data.csv | csvcut -C1 | csvlook
That's a little bit cleaner, and the rest of the data should be useful. Let's clean up the data by removing that column and renaming the headers so they're a little easier to query.
In [ ]:
!csvcut -C1 2013-Q1-Trips-History-Data.csv | \
header -r "start_date,end_date,start_station,end_station,bike_id,sub_type" \
> bikeshare.csv
Make sure you haven't lost anything!
In [ ]:
!wc bikeshare.csv
In [ ]:
%load_ext sql
NOTE: See a bunch of ShimWarnings with a pink background? That's normal. It's just a heads-up about ongoing changes to IPython/Jupyter code. You can keep going.
First, we create a database in mysql. Note: you can do the same thing on the command line by issuing the CREATE DATABASE
command part before the pipe within the mysql shell, which you get to with the second part after the pipe. Here we'll pipe the one into the other so it reads well in the notebook.
In [ ]:
!echo "CREATE DATABASE bikedb" | mysql --user=mysqluser --password=mysqlpass
Here's how we connect the notebook up to the mysql database using a username and password. Remember that this shorthand version is possible thanks to the excellent ipython-sql Jupyter extension that we're using, otherwise you'd have to establish the connection, get a cursor, etc., like you've done explicitly in python in your other class.
Not that there's anything wrong with that.
In [ ]:
%sql mysql://mysqluser:mysqlpass@localhost/bikedb
Very easy, no?
First, clean up if we're not running this for the first time.
In [ ]:
%%sql
DROP TABLE IF EXISTS bikeshare;
Next, create a table schema using DDL.
In [ ]:
%%sql
CREATE TABLE bikeshare (
start_date DATETIME,
end_date DATETIME,
start_station VARCHAR(100),
end_station VARCHAR(100),
bike_id CHAR(7),
sub_type CHAR(10)
)
Just to verify it worked:
In [ ]:
%%sql
SELECT COUNT(*)
FROM bikeshare
It worked! We just don't have any data in there yet.
Now we load the data using LOAD DATA INFILE
. You can do pretty much the same thing from the bash shell using mysqlimport
and a bunch of options. It'll read better here in the notebook with the options spelled out.
Docs for LOAD DATA INFILE
are available at https://dev.mysql.com/doc/refman/5.1/en/load-data.html.
Note: this assumes you've placed your bikeshare file in the directory /vagrant
.
Note also: I had to look up the mysql date formatting docs to get this date format conversion correct. It took me a few trials and errors before I got it right. This is an extremely common thing to have to do if you ever spend time wrangling data - every system handles dates in its own way.
In [ ]:
%%sql
LOAD DATA INFILE '/vagrant/bikeshare.csv'
REPLACE
INTO TABLE bikeshare
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES
(@start_date, @end_date, start_station, end_station, bike_id, sub_type)
SET start_date = STR_TO_DATE(@start_date, '%c/%e/%Y %k:%i'),
end_date = STR_TO_DATE(@end_date, '%c/%e/%Y %k:%i')
Note: if the above command fails for you with a "file not found" error, please read these notes about apparmor. Follow that advice, and add a line like it shows, e.g.:
/vagrant/* r
...to the file, or whatever path you have your data on, reload apparmor, and try again. I had to do this, and it worked perfectly after I made that change.
In [ ]:
%%sql
SELECT COUNT(*)
FROM bikeshare
Looks good! Let's look at the data a little.
In [ ]:
%%sql
SELECT *
FROM bikeshare
LIMIT 5
How does MySQL construct this query, or more specifically, what's its execution plan? We can find out with EXPLAIN
.
For more about how to read MySQL 5.5's query plan, see https://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html.
In [ ]:
%%sql
EXPLAIN SELECT COUNT(*)
FROM bikeshare
LIMIT 5
This says "using no keys, we're going to just scan roughly 395,390 rows, sans indexes, to answer this query."
In [ ]:
%%sql
SELECT MAX(start_date)
FROM bikeshare
In [ ]:
%%sql
EXPLAIN SELECT MAX(start_date)
FROM bikeshare
Pretty much the same thing. You can't get the max without looking at all of the values if there is no index.
In [ ]:
%%sql
SELECT COUNT(*)
FROM bikeshare
WHERE start_station LIKE "%dupont%"
In [ ]:
%%sql
EXPLAIN SELECT COUNT(*)
FROM bikeshare
WHERE start_station LIKE "%dupont%"
Now we see "using where" under "extra", so we know there's a filter operation, but that's about the only change. What if we add more things to filter on?
In [ ]:
%%sql
EXPLAIN SELECT start_station, end_station, COUNT(*)
FROM bikeshare
WHERE start_station LIKE "%dupont%"
AND end_station LIKE "%21st%"
AND start_date LIKE "2013-02-14%"
GROUP BY start_station, end_station
ORDER BY start_station, end_station
Ah, some more info - it looks like it's using a temporary relation to store intermediate results, perhaps for the GROUP BY
, then a sort to handle ORDER BY
.
Still no indexes, though. Let's change that.
In [ ]:
%%sql
CREATE INDEX idx_start_station ON bikeshare (start_station)
In [ ]:
%%sql
EXPLAIN SELECT start_station, end_station, COUNT(*)
FROM bikeshare
WHERE start_station LIKE "21st%"
AND start_date LIKE "2013-02-14%"
GROUP BY start_station, end_station
ORDER BY start_station, end_station
I changed the query a little bit to use the index, do you see the difference? It found search keys in the index, and the row count went down by an order of magnitude. That's the power of indexes.
It helps even on simple queries like this.
In [ ]:
%%sql
EXPLAIN SELECT DISTINCT start_station
FROM bikeshare
ORDER BY start_station
What's that 201 value for rows? Maybe the actual count of distinct values. We can test that:
In [ ]:
%%sql
SELECT COUNT(*)
FROM (
SELECT DISTINCT start_station
FROM bikeshare
) made_up_subquery_alias_name
There you go, that's exactly the answer.
How about that MAX()
query we tried a little while back?
In [ ]:
%%sql
SELECT MAX(start_date)
FROM bikeshare
In [ ]:
%%sql
EXPLAIN SELECT MAX(start_date)
FROM bikeshare
Let's create another index on start_date
to see what the effect on the query plan will be.
In [ ]:
%%sql
CREATE INDEX idx_start_date ON bikeshare (start_date)
In [ ]:
%%sql
SELECT MAX(start_date)
FROM bikeshare
Same result, but...
In [ ]:
%%sql
EXPLAIN SELECT MAX(start_date)
FROM bikeshare
That's new! In this case it doesn't have to look at any rows, it can just look at one end of the index. We've optimized away the need to even look at the table.
Let's go back to COUNT()
and try a few more things before we move on.
In [ ]:
%%sql
EXPLAIN SELECT COUNT(*)
FROM bikeshare
In [ ]:
%%sql
EXPLAIN SELECT COUNT(start_date)
FROM bikeshare
In [ ]:
%%sql
EXPLAIN SELECT COUNT(end_date)
FROM bikeshare
Do you see what happened there?
In [ ]:
%%sql
SELECT COUNT(DISTINCT start_station)
FROM bikeshare
In [ ]:
%%sql
SELECT COUNT(DISTINCT end_station)
FROM bikeshare
Hmm, they're different. Let's put them together.
In [ ]:
%%sql
SELECT COUNT(DISTINCT station) FROM
(
SELECT start_station AS station FROM bikeshare
UNION
SELECT end_station AS station FROM bikeshare
) a
We'll create a table to hold the names of stations. Each station name should be represented once, and we'll assign a primary key to each in the form of a unique integer.
In [ ]:
%%sql
CREATE TABLE station (
id SMALLINT NOT NULL AUTO_INCREMENT,
name VARCHAR(100),
PRIMARY KEY (id)
)
In [ ]:
%%sql
SELECT COUNT(*)
FROM station
Looks good. Now we can load the data with an INSERT that draws from our previous query. We can skip specifying the id
because MySQL will do that for us.
Note: every database handles this issue its own way. This is a nice convenience in MySQL; other database backends require more work.
In [ ]:
%%sql
INSERT INTO station (name)
SELECT DISTINCT station AS name
FROM
(
SELECT start_station AS station FROM bikeshare
UNION
SELECT end_station AS station FROM bikeshare
) a
In [ ]:
%%sql
SELECT *
FROM station
LIMIT 10
It worked. Now we can update the bikeshare table to add columns for station identifiers.
In [ ]:
%%sql
ALTER TABLE bikeshare
ADD COLUMN start_station_id SMALLINT
AFTER start_station
Looks good. But what exactly just happened?
In [ ]:
%%sql
DESCRIBE bikeshare
In [ ]:
%%sql
SELECT *
FROM bikeshare
LIMIT 5
What just happened? Why are all the start_station_id
values None
?
Let's fill in those values with our new identifiers from the station
table.
In [ ]:
%%sql
UPDATE bikeshare
INNER JOIN station
ON bikeshare.start_station = station.name
SET bikeshare.start_station_id = station.id
In [ ]:
%%sql
SELECT * FROM bikeshare LIMIT 5
In [ ]:
%%sql
SELECT * FROM station WHERE id = 161
Great, now we can drop start_station
from bikeshare
and save a lot of space.
In [ ]:
%%sql
ALTER TABLE bikeshare
DROP COLUMN start_station
In [ ]:
%%sql
DESCRIBE bikeshare
In [ ]:
%%sql
SELECT * FROM bikeshare LIMIT 5
Worked!
And we can repeat the process for end_station
.
In [ ]:
%%sql
ALTER TABLE bikeshare
ADD COLUMN end_station_id SMALLINT
AFTER end_station
In [ ]:
%%sql
UPDATE bikeshare
INNER JOIN station
ON bikeshare.end_station = station.name
SET bikeshare.end_station_id = station.id
In [ ]:
%%sql
ALTER TABLE bikeshare
DROP COLUMN end_station
In [ ]:
%%sql
SELECT * FROM bikeshare LIMIT 5
In [ ]:
%%sql
DESCRIBE station
In [ ]:
%%sql
DESCRIBE bikeshare
Let's try a basic query that looks for the most busy station pairs.
In [ ]:
%%sql
SELECT COUNT(*) AS c, start_station_id, end_station_id
FROM bikeshare
GROUP BY start_station_id, end_station_id
ORDER BY c DESC
LIMIT 5
Now let's liven it up by joining to station
and including station names. We'll need to join twice, using two aliases.
Worked just fine. Let's look under the hood, though.
In [ ]:
%%sql
SELECT COUNT(*) AS c, station_1.name AS start_station, station_2.name AS end_station
FROM bikeshare, station AS station_1, station AS station_2
WHERE station_1.id = bikeshare.start_station_id
AND station_2.id = bikeshare.end_station_id
GROUP BY bikeshare.start_station_id, bikeshare.end_station_id
ORDER BY c DESC
LIMIT 5
Looks good, and it's in my neighborhood. :)
Let's look at the query plan for all this:
In [ ]:
%%sql
EXPLAIN SELECT COUNT(*) AS c, station_1.name AS start_station, station_2.name AS end_station
FROM station AS station_1, station AS station_2, bikeshare
WHERE bikeshare.start_station_id = station_1.id
AND bikeshare.end_station_id = station_2.id
GROUP BY bikeshare.start_station_id, bikeshare.end_station_id
ORDER BY c DESC
LIMIT 5
Not bad, but it's doing a full table scan on bikeshare
. Let's see if some indexes would help with the two joins.
In [ ]:
%%sql
CREATE INDEX idx_start_station_id ON bikeshare (start_station_id)
In [ ]:
%%sql
CREATE INDEX idx_end_station_id ON bikeshare (end_station_id)
In [ ]:
%%sql
EXPLAIN SELECT COUNT(*) AS c, station_1.name AS s1_name, station_2.name AS s2_name
FROM bikeshare, station AS station_1, station AS station_2
WHERE station_1.id = bikeshare.start_station_id
AND station_2.id = bikeshare.end_station_id
GROUP BY bikeshare.start_station_id, bikeshare.end_station_id
ORDER BY c DESC
LIMIT 5
Well, it's hard to say how much better this will perform without a lot more data. A COUNT
operation simply needs to be able to count everything, if the level of granularity it's counting doesn't already have an easy lookup like we saw before. Sometimes you just don't feel the pain of scale until you hit a scaling threshold that varies with the shape of your data.
But - see the possible_keys
in the first row? That means the optimizer sees the indexes present and will attempt to use those to at least organize the query a little better than it would be able to do without them.
Let's try one more thing - we can create an index on multiple columns that matches our query more precisely. It's inefficient tot look up one column, then another, after all, we're looking for combinations of both. A multiple column index can precompute that.
In [ ]:
%%sql
CREATE INDEX idx_stations ON bikeshare (start_station_id, end_station_id)
In [ ]:
%%sql
EXPLAIN SELECT COUNT(*) AS c, station_1.name AS s1_name, station_2.name AS s2_name
FROM bikeshare, station AS station_1, station AS station_2
WHERE station_1.id = bikeshare.start_station_id
AND station_2.id = bikeshare.end_station_id
GROUP BY bikeshare.start_station_id, bikeshare.end_station_id
ORDER BY c DESC
LIMIT 5
Finally, looks like a big difference!