SQL Indexes and basic ETL

This week we'll take a look at working with indexes and how examining query plans can shed light on query performance and database design. Then we'll switch gears and look at extracting values from transactional data in a variety of ways.

Setup - bikeshare data, again

We'll download the same Bikeshare data you've worked with before, and we'll create some database tables and indexes more deliberately before using PostgreSQL.


In [ ]:
%load_ext sql

In [ ]:
!createdb week6

In [ ]:
%sql postgresql://dchud@localhost:5432/week6

In [ ]:
!wget https://raw.githubusercontent.com/gwsb-istm-6212-fall-2016/syllabus-and-schedule/master/projects/project-01/2016q1.csv.zip

In [ ]:
!unzip 2016q1.csv.zip

In [ ]:
!wc -l 2016q1.csv

In [ ]:
!csvcut -n 2016q1.csv

Two ways to create this table

First, what we did before, renaming the header line by hand to make it easier to read and type queries.


In [ ]:
!echo "duration_ms,start_date,end_date,start_station_id,start_station,end_station_id,end_station,bike_id,member_type" > rides.csv

In [ ]:
!tail -n +2 2016q1.csv >> rides.csv

Let's make sure we did that correctly.


In [ ]:
!head -3 rides.csv | csvlook

In [ ]:
!csvsql --db postgresql://dchud@localhost:5432/week6 --insert rides.csv

In [ ]:
%%sql
SELECT COUNT(*) FROM rides;

That was a little slow. Let's look at something more direct, using PostgreSQL's support for CSV import.

First, we take a look at a sample of the data to determine its attributes' domains and ranges.


In [ ]:
!gshuf -n 10000 rides.csvcsv.csv | csvstat

Based on these values, I expect we can work with the following:


In [ ]:
%%sql
DROP TABLE IF EXISTS rides;
CREATE TABLE rides (
    duration_ms INTEGER,
    start_date TIMESTAMP,
    end_date TIMESTAMP,
    start_station_id INTEGER,
    start_station VARCHAR(64),
    end_station_id INTEGER,
    end_station VARCHAR(64),
    bike_number CHAR(6),
    member_type CHAR(10)
)

Now we'll load the data in more simply. Note that this requires the use of an absolute path, so adjust it to your location:


In [ ]:
!pwd

In [ ]:
%%sql
COPY rides FROM '/Users/dchud/Documents/Dropbox/teach/2016-fall-data-management/syllabus-and-schedule/lectures/week-06/2016q1.csv'
CSV
HEADER
QUOTE '"'
DELIMITER ',';

In [ ]:
%%sql
SELECT COUNT(*) FROM rides;

By the way, you can extract a schema from a pgsql instance with the following query, which uses the INFORMATION_SCHEMA metadata database.


In [ ]:
%%sql
SELECT column_name, data_type, character_maximum_length, is_nullable
FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'rides';

Working with indexes

Let's find a query that will go a little slow, and see how pgsql plans to implement it.

For example, what are the popular station pairs that result in the slowest average rides?


In [ ]:
%%sql
SELECT start_station, end_station, ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, COUNT(*) AS c, member_type
FROM rides
GROUP BY start_station, end_station, member_type
ORDER BY minutes DESC
LIMIT 10;

That was pretty slow... so we can definitely use indexes to speed things up. First, though, let's look at the counts and member_types. We want more popular station pairs, first of all, so let's set a minimum count level.


In [ ]:
%%sql
SELECT start_station, end_station, ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, COUNT(*) AS c, member_type
FROM rides
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

That's better. But it's slow. Let's see how pgsql goes about it using EXPLAIN:


In [ ]:
%%sql
EXPLAIN
SELECT start_station, end_station, ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, COUNT(*) AS c, member_type
FROM rides
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

There's a lot to unpack in there. Read it from the inside out to figure out what it's doing.

  • Seq Scan on rides - this is a table scan, which will be slow
  • Sort Key: start_station, end_station, member_type - we're performing a sort across all three attributes
  • Filter: (count(*) > 90) - there's our selection constraint
  • Sort Key: (round(avg((duration_ms / 60000)), 1)) DESC - look, another sort! what's the difference between the two?
  • Limit (cost=101201.41..101201.44 rows=10 width=63) - can you guess what each element here means?

To speed things up, we need an index. Let's start with one on start_station.


In [ ]:
%%sql
DROP INDEX IF EXISTS idx_start_station;
CREATE INDEX idx_start_station ON rides (start_station);

Note that it takes a few seconds - it's building that indexing structure, then storing it on disk. Remember the metrics we discussed for different index types? This step invokes both the insert time and storage overhead metrics.

The key thing is whether the query will go faster, so let's check


In [ ]:
%%sql
EXPLAIN
SELECT start_station, end_station, ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, COUNT(*) AS c, member_type
FROM rides
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

It doesn't look like it will be any faster? Why not?

We need to create a different kind of index. That most-nested sort is being done on a combination of three attributes at once. So let's create an index on all three.


In [ ]:
%%sql
DROP INDEX IF EXISTS idx_stations_member_type;
CREATE INDEX idx_stations_member_type ON rides (start_station, end_station, member_type);

In [ ]:
%%sql
EXPLAIN
SELECT start_station, end_station, ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, COUNT(*) AS c, member_type
FROM rides
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

Now we're getting somewhere! Look at that last line:

  • Index Scan using idx_stations_member_type - this means the optimizer found our index and will use it

And the result is speed:


In [ ]:
%%sql
SELECT start_station, end_station, ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, COUNT(*) AS c, member_type
FROM rides
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

Much better, right?

Let's refine our query a bit more. First, something weird is going on with those rides that start and end at the same place, so let's dump 'em.


In [ ]:
%%sql
SELECT start_station, end_station, ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, COUNT(*) AS c, member_type
FROM rides
WHERE start_station != end_station
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

Next, because there are clearly a lot of tourists circling the National Mall, let's look at regular riders.


In [ ]:
%%sql
SELECT start_station, end_station, ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, COUNT(*) AS c, member_type
FROM rides
WHERE start_station != end_station
  AND member_type = 'Registered'
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 90
ORDER BY minutes DESC
LIMIT 10;

And the same, but let's recreate our station counts from the last project this time.


In [ ]:
%%sql
SELECT start_station, ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, COUNT(*) AS c, member_type
FROM rides
WHERE start_station != end_station
GROUP BY start_station, member_type
HAVING COUNT(*) > 90
ORDER BY COUNT(*) DESC
LIMIT 10;

In [ ]:
%%sql
SELECT end_station, ROUND(AVG(duration_ms / (1000 * 60)), 1) AS minutes, COUNT(*) AS c, member_type
FROM rides
WHERE start_station != end_station
GROUP BY end_station, member_type
HAVING COUNT(*) > 90
ORDER BY COUNT(*) DESC
LIMIT 10;

Do these look familiar? Let's do the same, but for bikes.


In [ ]:
%%sql
SELECT bike_number, COUNT(*) AS c
FROM rides
WHERE start_station = 'Columbus Circle / Union Station'
GROUP BY bike_number
ORDER BY COUNT(*) DESC
LIMIT 10;

Let's try this again all in one with a subquery. First we make sure we get the nested subquery part right.


In [ ]:
%%sql
SELECT start_station, COUNT(*)
FROM rides
GROUP BY start_station
ORDER BY COUNT(*) DESC
LIMIT 10;

Looks good. Now let's insert the one into the other, removing the COUNT() from the subquery.


In [ ]:
%%sql
SELECT bike_number, COUNT(*) AS c
FROM rides
WHERE start_station IN
    (SELECT start_station
     FROM rides
     GROUP BY start_station
     ORDER BY COUNT(*) DESC
     LIMIT 1)
GROUP BY bike_number
ORDER BY COUNT(*) DESC
LIMIT 10;

Review the query plan from EXPLAIN on that one. See anything familiar in there?


In [ ]:
%%sql
EXPLAIN
SELECT bike_number, COUNT(*) AS c
FROM rides
WHERE start_station IN
    (SELECT start_station
     FROM rides
     GROUP BY start_station
     ORDER BY COUNT(*) DESC
     LIMIT 1)
GROUP BY bike_number
ORDER BY COUNT(*) DESC
LIMIT 10;

Basic ETL with SQL

Today we'll look at examples of how to extract consistent sets of values out of your database. ETL as a whole consists of a lot more than just this, but because every environment has their own tools and approach, we'll just be getting a taste of it here.

First let's look at extracting simple details like station names.


In [ ]:
%%sql
SELECT DISTINCT start_station
FROM rides
ORDER BY start_station;

In [ ]:
%%sql
SELECT DISTINCT end_station
FROM rides
ORDER BY end_station;

To be sure we get them all, we need to combine them into a union set.


In [ ]:
%%sql
SELECT station
FROM (
    SELECT DISTINCT start_station AS station FROM rides
    UNION
    SELECT DISTINCT end_station AS station FROM rides
    ) AS d
;

Now we can create a new table to house the unique station names.


In [ ]:
%%sql
DROP TABLE IF EXISTS stations;
CREATE TABLE stations (
    id SERIAL,
    name VARCHAR(64)
);

In [ ]:
%%sql
INSERT INTO stations (name)
SELECT station FROM (
    SELECT DISTINCT start_station AS station FROM rides
    UNION
    SELECT DISTINCT end_station AS station FROM rides
) AS d;

In [ ]:
%%sql
SELECT * FROM stations LIMIT 10;

We can even add these new identifiers back to the original table now.


In [ ]:
%%sql
ALTER TABLE rides 
ADD COLUMN start_station_nid INTEGER;

In [ ]:
%%sql
UPDATE rides AS r
SET start_station_nid = s.id
FROM stations AS s
WHERE r.start_station = s.name;

In [ ]:
%%sql
SELECT * FROM rides
LIMIT 5;

This isn't the most useful approach, as we already have IDs for the stations, but you can see how we might create them if we didn't yet.

More useful might be recording the minutes as a new column so we don't have to calculate from milliseconds every time.


In [ ]:
%%sql
ALTER TABLE rides
ADD COLUMN duration_min NUMERIC;

In [ ]:
%%sql
UPDATE rides
SET duration_min = ROUND(CAST(duration_ms AS NUMERIC) / (1000 * 60), 1);

In [ ]:
%%sql
SELECT duration_ms, duration_min FROM rides
LIMIT 5;

Another valuable pattern is to use date functions to extract particular time intervals, such as months or days. Every RDBMS has its own set of date functions, unfortunately you will likely just have to learn the ones used by the system in your environment.

Read more in the documentation for PostgreSQL date formatting.


In [ ]:
%%sql
SELECT EXTRACT(DAY FROM start_date), EXTRACT(MONTH FROM start_date), EXTRACT(YEAR FROM start_date)
FROM rides
LIMIT 10;

In data warehouse models and in statistical model feature engineering, it can be particularly useful to extract all kinds of parts of dates out into variables. You never know where you'll find significance.

This kind of extraction is quite common.


In [ ]:
%%sql
SELECT TO_CHAR(start_date, 'YYYY-MM-DD') AS day_of_year, 
    TO_CHAR(start_date, 'YYYY') AS year,
    TO_CHAR(start_date, 'MM') AS month,
    TO_CHAR(start_date, 'DD') AS day,
    TO_CHAR(start_date, 'Day') AS day_of_week_str,
    TO_CHAR(start_date, 'D') AS day_of_week,
    CASE WHEN CAST(TO_CHAR(start_date, 'D') AS INTEGER) >= 6 
        THEN 1 
        ELSE 0
    END AS is_weekend,
    CASE WHEN CAST(TO_CHAR(start_date, 'D') AS INTEGER) < 6 
        THEN 1 
        ELSE 0
    END AS is_weekday,
    TO_CHAR(start_date, 'HH24') AS hour_24,
    TO_CHAR(start_date, 'Q') AS quarter
FROM rides
LIMIT 10;

GROUPING SETS, ROLLUP, CUBE


In [ ]:
%%sql
SELECT start_station, end_station, member_type, COUNT(*) AS c
FROM rides
GROUP BY start_station, end_station, member_type
HAVING COUNT(*) > 300
ORDER BY start_station, end_station, member_type;

In [ ]:
%%sql
SELECT start_station, end_station, member_type, COUNT(*) AS c
FROM rides
GROUP BY GROUPING SETS ((start_station, end_station, member_type), (start_station), (member_type), ())
HAVING COUNT(*) > 300;

This generates a list with both the top station pairs and the total counts for each start station.

ROLLUP generates a similar set of increasingly aggregated summaries, lopping off one column from the right at a time.


In [ ]:
%%sql
SELECT start_station, end_station, member_type, COUNT(*) AS c
FROM rides
GROUP BY ROLLUP (start_station, end_station, member_type)
HAVING COUNT(*) > 300;

CUBE generates summaries for the entire set of attributes and its possible subsets.


In [ ]:
%%sql
SELECT start_station, end_station, member_type, COUNT(*) AS c
FROM rides
GROUP BY CUBE (start_station, end_station, member_type)
HAVING COUNT(*) > 300;