Turning transaction data into dimensional data

Let's grab a fairly large dataset, load it into a database, and create a simple dimensional model with one fact table and one dimension from it.

Getting your data

Let's return to the Capital Bikeshare trip data. This time, though, we mean it - we'll get a whole year's worth of data.


In [ ]:
!wget https://www.capitalbikeshare.com/assets/files/trip-history-data/2013-Q1-Trips-History-Data.zip
!wget https://www.capitalbikeshare.com/assets/files/trip-history-data/2013-Q2-Trips-History-Data.zip
!wget https://www.capitalbikeshare.com/assets/files/trip-history-data/2013-Q3-Trips-History-Data.zip
!wget https://www.capitalbikeshare.com/assets/files/trip-history-data/2013-4th-quarter.zip

These are in the zip format, so unzip:


In [ ]:
!for f in 2013-*.zip; do unzip $f; done

How big are they?


In [ ]:
!wc 2013-Q*.csv

Yum, 2.5M records, this should be fun.

Do they all have the same columns?


In [ ]:
!for f in 2013-Q*.csv; do echo $f; csvcut -n $f; done

Nope! Let's cut out the duration column and rearrange each file (especially Q4) so the columns all match.


In [ ]:
!csvcut -C1 2013-Q1-Trips-History-Data.csv | \
  header -r "start_date,end_date,start_station,end_station,bike_id,sub_type" \
  > bikeshare-q1.csv

In [ ]:
!csvcut -C1 2013-Q2-Trips-History-Data.csv | \
  header -r "start_date,end_date,start_station,end_station,bike_id,sub_type" \
  > bikeshare-q2.csv

In [ ]:
!csvcut -C1 2013-Q3-Trips-History-Data.csv | \
  header -r "start_date,end_date,start_station,end_station,bike_id,sub_type" \
  > bikeshare-q3.csv

In [ ]:
!csvcut -c2,4,3,5,6,7 2013-Q4-Trips-History-Data2.csv | \
  header -r "start_date,end_date,start_station,end_station,bike_id,sub_type" \
  > bikeshare-q4.csv

And verify those counts:


In [ ]:
!wc bikeshare-q*.csv

Looks right, so now let's stack them up into one file.


In [ ]:
!csvstack bikeshare-q*.csv > bikeshare-2013.csv

If that worked correctly, our count should go down by three header lines.


In [ ]:
!wc bikeshare-2013.csv

That also looks right! Now let's just check to make sure things like station names and subscriber status look consistent.


In [ ]:
!shuf -n 1000 bikeshare-2013.csv | csvcut -c3 | sort | uniq | head -25

In [ ]:
!shuf -n 1000 bikeshare-2013.csv | csvcut -c6 | sort | uniq -c

Not bad... though there's at least one issue in the station names that indicates further cleanup is needed. Do you see it?

We'll ignore it for now, but on a real project you can't skip details like this if you want reliable results. This is why people say "80% of the work is in data wrangling."
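One way to hunt for that kind of inconsistency is to normalize each name (trim and collapse whitespace, lowercase) and see which normalized forms map to more than one raw spelling. A minimal Python sketch, using made-up station names rather than real ones:

```python
from collections import defaultdict

def find_suspect_names(names):
    """Group raw station names by a normalized form (trimmed, collapsed
    whitespace, case-folded); any group containing more than one raw
    spelling is a likely inconsistency worth cleaning up."""
    groups = defaultdict(set)
    for name in names:
        groups[" ".join(name.split()).casefold()].add(name)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Hypothetical example: a trailing space makes two "different" stations.
names = ["7th & R St NW", "7th & R St NW ", "Adams Mill & Columbia Rd NW"]
print(find_suspect_names(names))
```

On the real file you'd feed it the start and end station columns instead of a hand-built list.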

Prepping and loading data into the database

Alright, then, let's get loading.


In [ ]:
%load_ext sql

NOTE: See a bunch of ShimWarnings with a pink background? That's normal. It's just a heads-up about ongoing changes to IPython/Jupyter code. You can keep going.

First, we create a database in MySQL. Note: you can do the same thing from the command line by issuing the CREATE DATABASE statement (the part before the pipe here) inside the mysql shell (which the part after the pipe starts). We pipe the one into the other so it reads well in the notebook.


In [ ]:
!echo "DROP DATABASE IF EXISTS bikedb; CREATE DATABASE bikedb" | mysql --user=mysqluser --password=mysqlpass

And connect to it:


In [ ]:
%sql mysql://mysqluser:mysqlpass@localhost/bikedb

Next, clean up in case we've run this before.


In [ ]:
%%sql
DROP TABLE IF EXISTS bikeshare;

Next, create a table schema using DDL. A little sampling of your data can give you domain and range information to use here; below, we take roughly a 10% sample to find a maximum length for the bike_id column.

We take a sample because it's faster. The risk of sampling is that if there's one bad record, there's only a 1 in 10 chance that a 10% sample will spot it.

If you need to know for sure, don't sample. But it'll cost you time.
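To put a number on that risk: if bad records are rare and the sample is drawn uniformly, the chance of catching at least one of them is roughly 1 - (1 - p)^k for sample fraction p and k bad records. A quick sketch:

```python
def p_spot(sample_fraction, n_bad):
    """Chance that a uniform random sample covering `sample_fraction`
    of the rows catches at least one of `n_bad` bad records
    (independence approximation; the exact answer is hypergeometric)."""
    return 1 - (1 - sample_fraction) ** n_bad

print(p_spot(0.1, 1))   # one bad record, 10% sample: ~0.1
print(p_spot(0.1, 10))  # ten bad records: ~0.65
```

So sampling gets much safer as the number of bad records grows; it's the lone outlier that slips through.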


In [ ]:
!shuf -n 250000 bikeshare-2013.csv | csvcut -c5 | csvstat

In [ ]:
%%sql
CREATE TABLE bikeshare (
  start_date    DATETIME,
  end_date      DATETIME,
  start_station VARCHAR(100),
  end_station   VARCHAR(100),
  bike_id       CHAR(7),
  sub_type      CHAR(10)
  )

Just to verify it worked:


In [ ]:
%%sql
SELECT COUNT(*)
FROM bikeshare

It worked! We just don't have any data in there yet.

Now we load the data using LOAD DATA INFILE. You can do pretty much the same thing from the bash shell using mysqlimport and a bunch of options. It'll read better here in the notebook with the options spelled out.

Docs for LOAD DATA INFILE are available at https://dev.mysql.com/doc/refman/5.1/en/load-data.html.

Note: this assumes you've placed your bikeshare file in the directory /vagrant.

Note also: I had to look up the mysql date formatting docs to get this date format conversion correct. It took me a few rounds of trial and error before I got it right. This is an extremely common thing to have to do if you ever spend time wrangling data - every system handles dates in its own way.
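For comparison, here's the same parse in Python. MySQL's '%c/%e/%Y %k:%i' means non-zero-padded month and day, a four-digit year, and a 24-hour clock with no seconds; Python's strptime spells the same thing with different codes (and happens to accept non-padded values for %m, %d, and %H):

```python
from datetime import datetime

# Python equivalent of MySQL's STR_TO_DATE(..., '%c/%e/%Y %k:%i'),
# for timestamps like '3/5/2013 9:07'.
fmt = "%m/%d/%Y %H:%M"

print(datetime.strptime("3/5/2013 9:07", fmt))     # 2013-03-05 09:07:00
print(datetime.strptime("12/31/2013 23:59", fmt))  # 2013-12-31 23:59:00
```

Same idea, different dialect - which is exactly the point about every system handling dates in its own way.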


In [ ]:
!cp bikeshare-2013.csv /vagrant/bikeshare.csv

In [ ]:
%%sql
LOAD DATA INFILE '/vagrant/bikeshare.csv'
REPLACE
INTO TABLE bikeshare
FIELDS TERMINATED BY ','
       OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES
(@start_date, @end_date, start_station, end_station, bike_id, sub_type) 
SET start_date = STR_TO_DATE(@start_date, '%c/%e/%Y %k:%i'),
    end_date = STR_TO_DATE(@end_date, '%c/%e/%Y %k:%i')

Note: if the above command fails for you with a "file not found" error, please read these notes about apparmor. Follow that advice, and add a line like it shows, e.g.:

/vagrant/* r

...to the file, or whatever path you have your data on, reload apparmor, and try again. I had to do this, and it worked perfectly after I made that change.

Remember what we said before about sampling value ranges? It looks like there are a few bad values in the bike_id column. We'll ignore them for now, but for a real project you'd want to be sure the issue was either just a few bad records or something you could correct for in your table design.
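If you wanted to quantify the bad values rather than just eyeball them, a regex check works. The pattern below is an assumption (that well-formed ids look like 'W' followed by digits, e.g. 'W00123'); adjust it to whatever your csvstat sample actually shows:

```python
import re

# Assumed pattern for a well-formed bike id; anything else is suspect.
BIKE_ID = re.compile(r"^W\d+$")

def suspect_bike_ids(ids):
    """Return the values that don't match the assumed bike_id pattern."""
    return [i for i in ids if not BIKE_ID.match(i)]

# Hypothetical mix of clean and dirty values:
print(suspect_bike_ids(["W00123", "W20845", "w123", ""]))  # ['w123', '']
```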

Facts and dimensions

Think through what we might want to measure here, and what context we might want to measure it within. Design two new tables: a fact table and a dimension, and migrate the data from this base table into them both.

There is a good example we can follow on page 44 of Star Schema, with an Orders fact table and a Day dimension. Let's do something similar, with a Rides fact table and a Day dimension. We could normalize out a Station dimension as we did before, and add details like lat/lon, zip code, census block/group, and neighborhood, but we'll hold off on that. If we had more customer information, we'd probably have a Customer dimension as well, and we can assume the bike_id might be a key to a Bike dimension. (Wouldn't it be fun if the bikes all had names, or colors, or some such, so the bikes themselves might be interesting to study? Alas, we only have the ids, so let's leave it at that for now.)

Looking at the Rides fact table, we'll want to capture several surrogate keys and additional values, such as:

  • day_key_start
  • time_start
  • hour_start
  • day_key_stop
  • time_stop
  • hour_stop
  • duration_minutes
  • station_start (we'd use a key if we made Station a dimension)
  • station_stop (same here)
  • sub_type
  • bike_id (a degenerate dimension here, but could also be a key to a Bike dimension)

And this implies a Day dimension that should include attributes like these (taken largely from p. 44):

  • day_key
  • full_date
  • day_of_week_number
  • day_of_week_name
  • day_of_week_abbr
  • day_of_month
  • holiday_flag
  • weekday_flag
  • weekend_flag
  • month_number
  • month_name
  • month_abbr
  • quarter
  • quarter_month
  • year
  • year_month
  • year_quarter
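Every one of these attributes - except day_key, which is just a surrogate, and holiday_flag, which needs a holiday list of its own - can be derived mechanically from full_date. A Python sketch of that derivation, mirroring MySQL's DAYOFWEEK() numbering (1 = Sunday ... 7 = Saturday):

```python
from datetime import date

def day_attributes(d: date) -> dict:
    """Derive the Day-dimension attributes for one date.
    holiday_flag is left at 0 here, since holidays need a lookup
    table of their own."""
    quarter = (d.month - 1) // 3 + 1
    iso = d.isoweekday()  # 1 = Monday ... 7 = Sunday
    return {
        "full_date": d.isoformat(),
        "day_of_week_number": iso % 7 + 1,   # MySQL DAYOFWEEK convention
        "day_of_week_name": d.strftime("%A"),
        "day_of_week_abbr": d.strftime("%a"),
        "day_of_month": d.day,
        "holiday_flag": 0,
        "weekday_flag": int(iso <= 5),
        "weekend_flag": int(iso > 5),
        "month_number": d.month,
        "month_name": d.strftime("%B"),
        "month_abbr": d.strftime("%b"),
        "quarter": quarter,
        "quarter_month": (d.month - 1) % 3 + 1,  # month's position in quarter
        "year": d.year,
        "year_month": f"{d.year}{d.month:02d}",
        "year_quarter": f"{d.year}Q{quarter}",
    }

print(day_attributes(date(2013, 7, 4)))
```

The SQL below does the same work with MySQL's date functions; this is just the same logic in one place for reference.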

Note that the Rides fact table will need to reference the Day dimension by foreign (surrogate) key, so we'll create Day first.

Building out the Day dimension

Let's start by building up the full_date values with a query, then define the full table and work out the query bits for the remaining columns.

(Pop quiz: why take a union of the two columns?)


In [ ]:
%%sql
SELECT DISTINCT date
FROM 
  (SELECT DATE(start_date) AS date FROM bikeshare
     UNION
   SELECT DATE(end_date) AS date FROM bikeshare) b
ORDER BY date
LIMIT 5

Okay, a good start. Let's define the dimension table.


In [ ]:
%%sql
DROP TABLE IF EXISTS day_dim;
CREATE TABLE day_dim (
  day_key INT NOT NULL AUTO_INCREMENT,
  full_date DATE,
  day_of_week_number SMALLINT(1),
  day_of_week_name CHAR(9),
  day_of_week_abbr CHAR(3),
  day_of_month SMALLINT(1),
  holiday_flag BOOLEAN,
  weekday_flag BOOLEAN,
  weekend_flag BOOLEAN,
  month_number SMALLINT(2),
  month_name CHAR(9),
  month_abbr CHAR(3),
  quarter SMALLINT(1),
  year YEAR,
  PRIMARY KEY (day_key)
)

Note: year_month CHAR(6) caused an error here. YEAR_MONTH turns out to be a reserved word in MySQL (it's an interval unit for date arithmetic), so the column name would have to be quoted in backticks; we'll just leave it out.

Okay, let's start loading that up with a query on our source table. We'll have to reach into the MySQL manual for some details here.


In [ ]:
%%sql
DELETE FROM day_dim;
INSERT INTO day_dim (full_date,
                    day_of_week_number, day_of_week_name, day_of_week_abbr,
                    day_of_month, holiday_flag, weekday_flag, weekend_flag,
                    month_number, month_name, month_abbr,
                    quarter, year)
SELECT DISTINCT date,
  DAYOFWEEK(date), DAYNAME(date), DATE_FORMAT(date, "%a"),
  DAYOFMONTH(date), 0, WEEKDAY(date) <= 4, WEEKDAY(date) > 4,
  MONTH(date), MONTHNAME(date), DATE_FORMAT(date, "%b"),
  QUARTER(date), YEAR(date)
FROM 
  (SELECT DATE(start_date) AS date FROM bikeshare
     UNION
   SELECT DATE(end_date) AS date FROM bikeshare) b
ORDER BY date

Wait, 367 days, how's that?


In [ ]:
%%sql
SELECT COUNT(*) FROM day_dim

In [ ]:
%%sql
SELECT MIN(full_date), MAX(full_date) FROM day_dim

Oh! So there must be rides that ended one and two days after the new year.

Looks like our dimension is all together; let's have a look at a sampling of values.


In [ ]:
%%sql
SELECT * 
FROM day_dim
ORDER BY RAND()
LIMIT 20

Building out the Ride fact table

Now we follow a similar process to generate the fact table. In this case, though, we'll be careful to use the newly generated day_key values.


In [ ]:
%%sql
DROP TABLE IF EXISTS ride_fact;
CREATE TABLE ride_fact (
  id INT NOT NULL AUTO_INCREMENT,
  full_date_start DATE,
  day_key_start INT,
  time_start TIME,
  hour_start SMALLINT(2),
  full_date_stop DATE,
  day_key_stop INT,
  time_stop TIME,
  hour_stop SMALLINT(2),
  duration_minutes INT,
  station_start VARCHAR(100),
  station_stop VARCHAR(100),
  sub_type CHAR(10),
  bike_id CHAR(7),
  PRIMARY KEY (id)
)

Let's start by punting on the date lookups (though you could fit them in if you wanted!), as there's a lot of other stuff to pull out at first. We'll just set the day_key values to 1 and 2 for now and go back to update them later.


In [ ]:
%%sql
DELETE FROM ride_fact;
INSERT INTO ride_fact (full_date_start, day_key_start, time_start, hour_start,
                      full_date_stop, day_key_stop, time_stop, hour_stop,
                      duration_minutes,
                      station_start, station_stop,
                      sub_type, bike_id)
SELECT DATE(start_date), 1, TIME(start_date), HOUR(start_date),
       DATE(end_date), 2, TIME(end_date), HOUR(end_date),
       TIMESTAMPDIFF(MINUTE, start_date, end_date),
       start_station, end_station,
       sub_type, bike_id
FROM bikeshare
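A note on TIMESTAMPDIFF(MINUTE, start, end): it returns the number of whole minutes between the two datetimes, truncating any leftover seconds. In Python terms:

```python
from datetime import datetime

def duration_minutes(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end, truncating leftover
    seconds - the same behavior as MySQL's
    TIMESTAMPDIFF(MINUTE, start, end) for start <= end."""
    return int((end - start).total_seconds()) // 60

print(duration_minutes(datetime(2013, 1, 1, 9, 0, 0),
                       datetime(2013, 1, 1, 9, 12, 59)))  # 12
```

Our source timestamps have no seconds anyway, but it's worth knowing the truncation is there if you ever load finer-grained data.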

Okay then, let's go back and fix those dates.


In [ ]:
%%sql
UPDATE ride_fact
INNER JOIN day_dim
  ON ride_fact.full_date_start = day_dim.full_date
SET ride_fact.day_key_start = day_dim.day_key

Oof, that was sloooow. Maybe an index would help before the next one.


In [ ]:
%%sql
CREATE INDEX idx_full_date_stop 
ON ride_fact (full_date_stop)

In [ ]:
%%sql
UPDATE ride_fact
INNER JOIN day_dim
  ON ride_fact.full_date_stop = day_dim.full_date
SET ride_fact.day_key_stop = day_dim.day_key

So far so good. Did the updated day_key references come out right?


In [ ]:
%%sql
SELECT day_key_start, day_key_stop, COUNT(*) 
FROM ride_fact
GROUP BY day_key_start, day_key_stop
ORDER BY day_key_start, day_key_stop
LIMIT 20

Looks right! Weird, but right. If you think about it, it makes sense: most rides that start on a given day also end that same day, with a handful finishing the next day (presuming the days are in key order). But look at day_key 6: one ride was returned two days later, and another was returned 23 days later! Maybe it was a zombie; they probably ride slowly.

Let's go back and create that other index before we forget.


In [ ]:
%%sql
CREATE INDEX idx_full_date_start 
ON ride_fact (full_date_start)

Exploring the data dimensionally

Explore the new tables, taking advantage of the structure you've set up. What queries are now very easy that were more complicated before?

Here's an example of looking at average ride length by day. You could do this query on our original table, but it would require a bunch of date functions mixed in with the query logic. Now we've already done all that, so the query is cleaner and simpler to write.


In [ ]:
%%sql
SELECT AVG(duration_minutes), day_of_week_name
FROM ride_fact, day_dim
WHERE day_dim.day_key = ride_fact.day_key_start
GROUP BY day_of_week_name
ORDER BY day_of_week_number

In [ ]:
%matplotlib inline

In [ ]:
result = _
result.bar()

How does ride length vary by hour on weekends?


In [ ]:
%%sql
SELECT hour_start, AVG(duration_minutes)
FROM ride_fact, day_dim
WHERE weekend_flag = 1
  AND day_dim.day_key = ride_fact.day_key_start
GROUP BY hour_start
ORDER BY hour_start

In [ ]:
_.bar()

And how does that compare to weekdays?


In [ ]:
%%sql
SELECT hour_start, AVG(duration_minutes)
FROM ride_fact, day_dim
WHERE weekday_flag = 1
  AND day_dim.day_key = ride_fact.day_key_start
GROUP BY hour_start
ORDER BY hour_start

In [ ]:
_.bar()