Let's grab a fairly large dataset, load it into a database, and create a simple dimensional model with one fact table and one dimension from it.
Let's return to the Capital Bikeshare trip data. This time, though, we mean it - we'll get a whole year's worth of data.
In [ ]:
!wget https://www.capitalbikeshare.com/assets/files/trip-history-data/2013-Q1-Trips-History-Data.zip
!wget https://www.capitalbikeshare.com/assets/files/trip-history-data/2013-Q2-Trips-History-Data.zip
!wget https://www.capitalbikeshare.com/assets/files/trip-history-data/2013-Q3-Trips-History-Data.zip
!wget https://www.capitalbikeshare.com/assets/files/trip-history-data/2013-4th-quarter.zip
These are in the zip format, so unzip:
In [ ]:
!for f in 2013-*.zip; do unzip $f; done
How big are they?
In [ ]:
!wc 2013-Q*.csv
Yum, 2.5M records, this should be fun.
Do they all have the same columns?
In [ ]:
!for f in 2013-Q*.csv; do echo $f; csvcut -n $f; done
Nope! Let's cut out the duration column and rearrange the columns (especially Q4's) so all four files match.
In [ ]:
!csvcut -C1 2013-Q1-Trips-History-Data.csv | \
header -r "start_date,end_date,start_station,end_station,bike_id,sub_type" \
> bikeshare-q1.csv
In [ ]:
!csvcut -C1 2013-Q2-Trips-History-Data.csv | \
header -r "start_date,end_date,start_station,end_station,bike_id,sub_type" \
> bikeshare-q2.csv
In [ ]:
!csvcut -C1 2013-Q3-Trips-History-Data.csv | \
header -r "start_date,end_date,start_station,end_station,bike_id,sub_type" \
> bikeshare-q3.csv
In [ ]:
!csvcut -c2,4,3,5,6,7 2013-Q4-Trips-History-Data2.csv | \
header -r "start_date,end_date,start_station,end_station,bike_id,sub_type" \
> bikeshare-q4.csv
And verify those counts:
In [ ]:
!wc bikeshare-q*.csv
Looks right, so now let's stack them up into one file.
In [ ]:
!csvstack bikeshare-q*.csv > bikeshare-2013.csv
If that worked correctly, our count should go down by three header lines.
In [ ]:
!wc bikeshare-2013.csv
That also looks right! Now let's just check to make sure things like station names and subscriber status look consistent.
In [ ]:
!shuf -n 1000 bikeshare-2013.csv | csvcut -c3 | sort | uniq | head -25
In [ ]:
!shuf -n 1000 bikeshare-2013.csv | csvcut -c6 | sort | uniq -c
Not bad... though there's at least one issue in the station names that indicates further cleanup is needed. Do you see it?
Let's ignore it for now, but with real projects, you can't just skip details like this if you want reliable results. This is why "80% of the work is in data wrangling."
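If you wanted to chase the issue down, one quick tactic is to normalize each name and look for groups with more than one spelling. A minimal sketch (the station names here are invented, and the normalization shown only catches case and whitespace differences):

```python
# Surface station names that differ only in whitespace or capitalization,
# one common source of "same station, different string" inconsistencies.
from collections import defaultdict

def suspect_duplicates(names):
    """Group names by a normalized form; return groups with >1 spelling."""
    groups = defaultdict(set)
    for name in names:
        key = " ".join(name.lower().split())  # collapse case and whitespace
        groups[key].add(name)
    return {k: v for k, v in groups.items() if len(v) > 1}

names = ["14th & V St NW", "14th & V St NW ", "7th & Water St SW"]
print(suspect_duplicates(names))
```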
In [ ]:
%load_ext sql
NOTE: See a bunch of ShimWarnings with a pink background? That's normal. It's just a heads-up about ongoing changes to IPython/Jupyter code. You can keep going.
First, we create a database in MySQL. Note: you can do the same thing from the command line by starting the mysql shell (the command after the pipe) and issuing the CREATE DATABASE statement (the part before the pipe) at its prompt. Here we pipe the one into the other so it reads well in the notebook.
In [ ]:
!echo "DROP DATABASE IF EXISTS bikedb; CREATE DATABASE bikedb" | mysql --user=mysqluser --password=mysqlpass
And connect to it:
In [ ]:
%sql mysql://mysqluser:mysqlpass@localhost/bikedb
First, clean up if we're not running this for the first time.
In [ ]:
%%sql
DROP TABLE IF EXISTS bikeshare;
Next, create a table schema using DDL. A little sampling of your data can supply domain and range information for use here; below, we take roughly a 10% sample to find a maximum length for the bike_id
column.
We take a sample because it's faster. The risk of sampling is that if there's one bad record, there's only a 1 in 10 chance that a 10% sample will spot it.
If you need to know for sure, don't sample. But it'll cost you time.
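That back-of-the-envelope risk can be sketched in Python (approximate: it treats each bad record's inclusion as independent, rather than doing the exact hypergeometric calculation):

```python
# Probability that a random sample misses ALL of the bad records:
# drawing a fraction f of the rows, each bad record is excluded with
# probability (1 - f), so k bad records are all missed with
# probability roughly (1 - f) ** k.
def p_miss_all(sample_fraction, n_bad):
    return (1 - sample_fraction) ** n_bad

print(p_miss_all(0.10, 1))   # one bad record, 10% sample: missed 90% of the time
print(p_miss_all(0.10, 20))  # twenty bad records: missed only ~12% of the time
```

So sampling is a reasonable bet when problems are either absent or plentiful; it's the lone outlier that slips through.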
In [ ]:
!shuf -n 250000 bikeshare-2013.csv | csvcut -c5 | csvstat
In [ ]:
%%sql
CREATE TABLE bikeshare (
start_date DATETIME,
end_date DATETIME,
start_station VARCHAR(100),
end_station VARCHAR(100),
bike_id CHAR(7),
sub_type CHAR(10)
)
Just to verify it worked:
In [ ]:
%%sql
SELECT COUNT(*)
FROM bikeshare
It worked! We just don't have any data in there yet.
Now we load the data using LOAD DATA INFILE
. You can do pretty much the same thing from the bash shell using mysqlimport
and a bunch of options. It'll read better here in the notebook with the options spelled out.
Docs for LOAD DATA INFILE
are available at https://dev.mysql.com/doc/refman/5.1/en/load-data.html.
Note: this assumes you've placed your bikeshare file in the directory /vagrant
.
Note also: I had to look up the mysql date formatting docs to get this date format conversion correct. It took me a few trials and errors before I got it right. This is an extremely common thing to have to do if you ever spend time wrangling data - every system handles dates in its own way.
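For comparison, here's a sketch of the same conversion in Python. The timestamp string is made up but matches the source format (MySQL's '%c/%e/%Y %k:%i': non-padded month, day, and 24-hour hour); CPython's strptime is lenient about missing zero-padding, so '%m/%d/%Y %H:%M' parses the same strings.

```python
# The conversion STR_TO_DATE does in the LOAD DATA statement,
# sketched with Python's datetime.strptime.
from datetime import datetime

raw = "3/7/2013 9:05"  # a made-up timestamp in the source format
parsed = datetime.strptime(raw, "%m/%d/%Y %H:%M")
print(parsed.isoformat(sep=" "))  # 2013-03-07 09:05:00
```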
In [ ]:
!cp bikeshare-2013.csv /vagrant/bikeshare.csv
In [ ]:
%%sql
LOAD DATA INFILE '/vagrant/bikeshare.csv'
REPLACE
INTO TABLE bikeshare
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES
(@start_date, @end_date, start_station, end_station, bike_id, sub_type)
SET start_date = STR_TO_DATE(@start_date, '%c/%e/%Y %k:%i'),
end_date = STR_TO_DATE(@end_date, '%c/%e/%Y %k:%i')
Note: if the above command fails for you with a "file not found" error, please read these notes about apparmor. Follow that advice, and add a line like it shows, e.g.:
/vagrant/* r
...to the file, or whatever path you have your data on, reload apparmor, and try again. I had to do this, and it worked perfectly after I made that change.
Remember what we said before about sampling value ranges? It looks like there are a few bad values in the bike_id
column. We'll ignore them for now, but for a real project you'd want to be sure the issue was either just a few bad records or something you could correct for in your table design.
Think through what we might want to measure here, and what context we might want to measure it within. Design two new tables: a fact table and a dimension, and migrate the data from this base table into them both.
There is a good example we can follow on page 44 of Star Schema, with an Orders fact table and a Day dimension. Let's do something similar, using a Rides fact table and a Day dimension. We could normalize out a Station dimension as we did before, and add details like lat/lon, zip code, census block/group, and neighborhood, but we'll hold off on that. If we had more customer information, we'd probably have a Customer dimension as well, and we can assume the bike_id
might be a key to a Bike dimension. (Wouldn't it be fun if the bikes all had names, and colors, or some such, and the bikes themselves might be interesting to study? Oh well, we only have the ids, so let's leave it at that for now.)
Looking at the Rides fact table, we'll want to capture several surrogate keys and additional values, such as:
- day_key_start and day_key_stop, surrogate keys into the Day dimension
- the full start and stop dates, times, and hours
- the ride duration in minutes
- the start and stop stations
- the subscription type and bike id
And this implies a Day dimension that should include attributes like these (taken largely from p. 44):
- the full date
- day of week (number, name, and abbreviation)
- day of month
- holiday, weekday, and weekend flags
- month (number, name, and abbreviation)
- quarter and year
Note that the Rides fact table will need to reference the Day dimension by foreign (surrogate) key, so we'll create Day first.
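The mechanics of that surrogate-key reference can be sketched in Python, with a plain dict standing in for the Day dimension (the dates and key values here are made up):

```python
# The essence of a surrogate-key join: the Day dimension maps each
# calendar date to a small integer key, and fact rows store only the key.
from datetime import date

day_dim = {date(2013, 1, 1): 1, date(2013, 1, 2): 2}  # full_date -> day_key

ride = {"start_date": date(2013, 1, 1), "end_date": date(2013, 1, 2)}
fact_row = {
    "day_key_start": day_dim[ride["start_date"]],
    "day_key_stop": day_dim[ride["end_date"]],
}
print(fact_row)  # {'day_key_start': 1, 'day_key_stop': 2}
```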
Let's start by building up a full_date
column, then define the table and fill in the remaining attributes with date functions in the load query.
(Pop quiz: why take a union of the two columns?)
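The query below builds that union; its effect can be mimicked with Python sets (the dates here are made up):

```python
# A date belongs in the Day dimension if any ride starts OR ends on it,
# and a set union (like SQL's UNION, unlike UNION ALL) drops duplicates.
from datetime import date

starts = [date(2013, 1, 1), date(2013, 1, 1), date(2013, 1, 2)]
ends = [date(2013, 1, 1), date(2013, 1, 3)]  # one overnight ride

dates = sorted(set(starts) | set(ends))
print([d.isoformat() for d in dates])  # ['2013-01-01', '2013-01-02', '2013-01-03']
```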
In [ ]:
%%sql
SELECT DISTINCT date
FROM
(SELECT DATE(start_date) AS date FROM bikeshare
UNION
SELECT DATE(end_date) AS date FROM bikeshare) b
ORDER BY date
LIMIT 5
Okay, a good start. Let's define the dimension table.
In [ ]:
%%sql
DROP TABLE IF EXISTS day_dim;
CREATE TABLE day_dim (
day_key INT NOT NULL AUTO_INCREMENT,
full_date DATE,
day_of_week_number SMALLINT(1),
day_of_week_name CHAR(9),
day_of_week_abbr CHAR(3),
day_of_month SMALLINT(1),
holiday_flag BOOLEAN,
weekday_flag BOOLEAN,
weekend_flag BOOLEAN,
month_number SMALLINT(2),
month_name CHAR(9),
month_abbr CHAR(3),
quarter SMALLINT(1),
year YEAR,
PRIMARY KEY (day_key)
)
Note: a year_month CHAR(6)
column caused an error at first: YEAR_MONTH is a reserved word in MySQL (it's an INTERVAL unit), so to use it as a column name you'd have to quote it with backticks.
Okay, let's start loading that up with a query on our source table. We'll have to reach into the MySQL manual for some details here.
In [ ]:
%%sql
DELETE FROM day_dim;
INSERT INTO day_dim (full_date,
day_of_week_number, day_of_week_name, day_of_week_abbr,
day_of_month, holiday_flag, weekday_flag, weekend_flag,
month_number, month_name, month_abbr,
quarter, year)
SELECT DISTINCT date,
DAYOFWEEK(date), DAYNAME(date), DATE_FORMAT(date, "%a"),
DAYOFMONTH(date), 0, WEEKDAY(date) <= 4, WEEKDAY(date) > 4,
MONTH(date), MONTHNAME(date), DATE_FORMAT(date, "%b"),
QUARTER(date), YEAR(date)
FROM
(SELECT DATE(start_date) AS date FROM bikeshare
UNION
SELECT DATE(end_date) AS date FROM bikeshare) b
ORDER BY date
Wait, 367 days, how's that?
In [ ]:
%%sql
SELECT COUNT(*) FROM day_dim
In [ ]:
%%sql
SELECT MIN(full_date), MAX(full_date) FROM day_dim
Oh! So there must be rides that ended one and two days after the new year.
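As a sanity check on the weekday/weekend flags we computed above: MySQL's WEEKDAY() numbers Monday as 0 through Sunday as 6, the same convention as Python's datetime.weekday(), so WEEKDAY(date) <= 4 marks Monday through Friday. A quick sketch:

```python
# Verify the flag logic with Python's matching weekday numbering.
from datetime import date

for d in [date(2013, 7, 1), date(2013, 7, 6)]:  # a Monday and a Saturday
    weekday_flag = d.weekday() <= 4
    weekend_flag = d.weekday() > 4
    print(d, weekday_flag, weekend_flag)
```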
Looks like our dimension is all together, let's have a look at a sampling of values.
In [ ]:
%%sql
SELECT *
FROM day_dim
ORDER BY RAND()
LIMIT 20
In [ ]:
%%sql
DROP TABLE IF EXISTS ride_fact;
CREATE TABLE ride_fact (
id INT NOT NULL AUTO_INCREMENT,
full_date_start DATE,
day_key_start INT,
time_start TIME,
hour_start SMALLINT(2),
full_date_stop DATE,
day_key_stop INT,
time_stop TIME,
hour_stop SMALLINT(2),
duration_minutes INT,
station_start VARCHAR(100),
station_stop VARCHAR(100),
sub_type CHAR(10),
bike_id CHAR(7),
PRIMARY KEY (id)
)
Let's start by punting on the date lookups (though you could fit them in if you wanted!), as there's a lot of other stuff to pull out at first. We'll just set the day_key
values to 1
and 2
for now and go back to update them later.
In [ ]:
%%sql
DELETE FROM ride_fact;
INSERT INTO ride_fact (full_date_start, day_key_start, time_start, hour_start,
full_date_stop, day_key_stop, time_stop, hour_stop,
duration_minutes,
station_start, station_stop,
sub_type, bike_id)
SELECT DATE(start_date), 1, TIME(start_date), HOUR(start_date),
DATE(end_date), 2, TIME(end_date), HOUR(end_date),
TIMESTAMPDIFF(MINUTE, start_date, end_date),
start_station, end_station,
sub_type, bike_id
FROM bikeshare
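TIMESTAMPDIFF(MINUTE, ...) counts the number of complete minutes between the two datetimes, truncating any leftover seconds. The same arithmetic, sketched in Python with a hypothetical ride:

```python
# Complete minutes between two datetimes, matching TIMESTAMPDIFF(MINUTE, ...).
from datetime import datetime

start = datetime(2013, 7, 1, 9, 0)
end = datetime(2013, 7, 1, 9, 17, 45)  # 17 minutes 45 seconds later
duration_minutes = int((end - start).total_seconds() // 60)
print(duration_minutes)  # 17
```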
Okay then, let's go back and fix those dates.
In [ ]:
%%sql
UPDATE ride_fact
INNER JOIN day_dim
ON ride_fact.full_date_start = day_dim.full_date
SET ride_fact.day_key_start = day_dim.day_key
Oof, that was sloooow. Maybe an index would help before the next one.
In [ ]:
%%sql
CREATE INDEX idx_full_date_stop
ON ride_fact (full_date_stop)
In [ ]:
%%sql
UPDATE ride_fact
INNER JOIN day_dim
ON ride_fact.full_date_stop = day_dim.full_date
SET ride_fact.day_key_stop = day_dim.day_key
So far so good. Did the updated day_key references come out right?
In [ ]:
%%sql
SELECT day_key_start, day_key_stop, COUNT(*)
FROM ride_fact
GROUP BY day_key_start, day_key_stop
ORDER BY day_key_start, day_key_stop
LIMIT 20
Looks right! Weird, but right. If you think about it, it makes sense: most rides that start on a given day should also end on that same day, with a handful spilling over to the next day (presuming the days are in key order). But look at day_key
6: one ride was returned two days later, and another was returned 23 days later! Maybe it was a zombie, they probably ride slowly.
Let's go back and create that other index before we forget.
In [ ]:
%%sql
CREATE INDEX idx_full_date_start
ON ride_fact (full_date_start)
Explore the new tables, taking advantage of the structure you've set up. What queries are now very easy that were more complicated before?
Here's an example of looking at average ride length by day. You could do this query on our original table, but it would require a bunch of date functions mixed in with the query logic. Now we've already done all that, so the query is cleaner and simpler to write.
In [ ]:
%%sql
SELECT AVG(duration_minutes), day_of_week_name
FROM ride_fact, day_dim
WHERE day_dim.day_key = ride_fact.day_key_start
GROUP BY day_of_week_name
ORDER BY day_of_week_number
In [ ]:
%matplotlib inline
In [ ]:
result = _
result.bar()
How do rides vary by length on weekends?
In [ ]:
%%sql
SELECT hour_start, AVG(duration_minutes)
FROM ride_fact, day_dim
WHERE weekend_flag = 1
AND day_dim.day_key = ride_fact.day_key_start
GROUP BY hour_start
ORDER BY hour_start
In [ ]:
_.bar()
And how does that compare to weekdays?
In [ ]:
%%sql
SELECT hour_start, AVG(duration_minutes)
FROM ride_fact, day_dim
WHERE weekday_flag = 1
AND day_dim.day_key = ride_fact.day_key_start
GROUP BY hour_start
ORDER BY hour_start
In [ ]:
_.bar()