Filtering Observed Arrivals

As the API Exploration Notebook shows, each poll of the scraper produces 3 predicted arrival times for each line direction at a station. We want to transform and reduce these data to only feature observed train arrivals at stations (per this issue).

This notebook explores how to do this. It was initially developed using a day of data 2017-06-14, but now uses a more recent day of data from the serverless data pipeline in 2019.


In [2]:
import datetime
from psycopg2 import connect
import configparser
import pandas as pd
import pandas.io.sql as pandasql
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import matplotlib.ticker as ticker

In [3]:
con = connect(user='rad', database='ttc')

In [ ]:
CONFIG = configparser.ConfigParser(interpolation=None)
CONFIG.read('../db.cfg')
dbset = CONFIG['DBSETTINGS']
con = connect(**dbset)

Goal

We want to generate observed arrival times, in a format similar to GTFS. The GTFS schedule will be useful in this process, data was downloaded from Transit Feeds, the schema of the data is in ttc_gtfs_create.sql and it is processed to a more useful format in PostgreSQL with ttc_gtfs_process.sql.

From gtfs, we can get a sense of the target data count, how many stops are scheduled on the three subway lines for which we have data?


In [8]:
sql = '''SELECT COUNT(1)

FROM gtfs.stop_times 
INNER JOIN gtfs.trips USING (trip_id)
INNER JOIN gtfs.routes USING (route_id)
INNER JOIN gtfs.calendar USING (service_id)
WHERE monday AND route_type = 1 AND route_short_name != '3'
'''

with con.cursor() as cur:
    cur.execute(sql)
    print(cur.fetchone()[0])


46404

This is a ball park figure we are aiming for in our filtering. Creating a materialized view of the raw poll data for a given day Wednesday, June 14th 2017


In [6]:
sql = '''DROP MATERIALIZED VIEW IF EXISTS test_day CASCADE; 
CREATE MATERIALIZED VIEW test_day AS 
SELECT requestid, stationid, lineid, create_date, request_date, station_char, subwayline, system_message_type, 
            timint, traindirection, trainid, train_message
FROM requests_serverless
INNER JOIN ntas_data_serverless USING (requestid)
WHERE request_date >= '2019-07-17'::DATE + interval '5 hours' 
AND request_date <  '2019-07-17'::DATE + interval '29 hours' 
''' 
with con:
    with con.cursor() as cur:
        cur.execute(sql)

In [7]:
with con.cursor() as cur:
    cur.execute('SELECT COUNT(1) FROM test_day')
    print(cur.fetchone()[0])


551408

Cool. Definitely some work to do.

Trying out a very basic filter, which has a Known Issue


In [8]:
sql = '''SELECT COUNT(DISTINCT (requestid, lineid, trainid, traindirection, stationid))
FROM test_day 
WHERE train_message = 'AtStation' OR timint < 1'''

with con.cursor() as cur:
    cur.execute(sql)
    print(cur.fetchone()[0])


91665

It's a start.

If every line takes more than an hour to do a round-trip, we might be able to look for a distinct train-line-direction-station combination for each hour.


In [11]:
sql = '''WITH trips AS (SELECT route_short_name, (SELECT trip_id FROM gtfs.trips WHERE trips.route_id = routes.route_id LIMIT 1) 
FROM gtfs.routes
WHERE route_type = 1 AND route_short_name != '3' )

SELECT route_short_name, MIN(arrival_time) AS "Start Time", MIN(stop_sequence) ||'-'||MAX(stop_sequence) AS "Stops", MAX(arrival_time) - MIN(arrival_time) AS "Half-trip time"
FROM gtfs.stop_times
INNER JOIN trips USING(trip_id)
GROUP BY route_short_name, trip_id
ORDER BY route_short_name, trip_id
'''

trips = pandasql.read_sql(sql, con)

In [12]:
trips


Out[12]:
route_short_name Start Time Stops Half-trip time
0 1 05:37:49 1-32 00:57:45
1 2 05:40:30 1-31 00:51:27
2 4 05:30:15 1-2 00:02:39

So any given train on line 1 or 2 shouldn't pass the same station going the same direction in an hour. So we could add the hour in a DISTINCT query.

What's up with Line 4? It's short, but not two stations short... According to TransitFeeds, a GTFS host and exploration platform, Line 4 trains start the day at non-terminus stations. Line 4 actually makes 5 stops, and it takes 8 minutes to go from one terminus to another, with a two and a half minute layover at each terminus.

Potential issues:

  • headways varying throughout the day
  • delays can push a unique train stop arrival at a station into multiple time boxes.

Better solution:

  • look at each train chronologically through the day and
    • identify when there are multiple observations in a same trip at a same station
    • identify when the train changes direction

In [19]:
sql = ''' WITH unique_trains AS 
    (SELECT lineid::TEXT, COUNT(DISTINCT trainid) AS "Number of trains in a day"
     FROM test_day
     GROUP BY lineid)
     , unique_trips AS(SELECT route_short_name AS lineid, COUNT(DISTINCT trip_id) AS "Number of scheduled trips"
            FROM gtfs.routes -- ON lineid::TEXT = route_short_name
            INNER JOIN gtfs.trips USING (route_id)
            INNER JOIN gtfs.calendar USING (service_id)
            WHERE monday AND route_type = 1 AND route_short_name != '3'
            GROUP BY route_short_name)
            
    SELECT *
        FROM unique_trains
        INNER JOIN unique_trips USING (lineid)
        ORDER BY lineid'''
pandasql.read_sql(sql, con)


Out[19]:
lineid Number of trains in a day Number of scheduled trips
0 1 163 738
1 2 157 700
2 4 72 457

According to wikipedia the number of trains for each line is:

line number of trains
1 76
2 62
4 6

So the


In [1]:
sql = ''' SELECT trainid, lineid, traindirection, stationid, station_char, create_date, request_date, timint, train_message
        FROM test_day
        INNER JOIN (SELECT trainid FROM test_day WHERE lineid = 1 AND create_date::TIME > '07:00'::TIME LIMIT 1) one_train USING (trainid)
        WHERE (timint < 1 OR train_message != 'Arriving') AND lineid = 1
        ORDER BY create_date
'''
one_train = pandasql.read_sql(sql, con)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-db6eee5d7548> in <module>
      5         ORDER BY create_date
      6 '''
----> 7 one_train = pandasql.read_sql(sql, con)

NameError: name 'pandasql' is not defined

In [ ]:


In [ ]:


In [5]:
one_train


Out[5]:
trainid lineid traindirection stationid station_char create_date request_date timint train_message
0 141 1 South 30 SHP2 2017-06-14 05:50:06 2017-06-14 05:50:01.529971 0.000000 AtStation
1 141 1 South 29 YKM2 2017-06-14 05:52:06 2017-06-14 05:52:01.566292 0.896471 Arriving
2 141 1 South 29 YKM2 2017-06-14 05:53:07 2017-06-14 05:53:02.486092 0.000000 AtStation
3 141 1 South 28 LAW2 2017-06-14 05:55:05 2017-06-14 05:55:01.532195 0.812800 Arriving
4 141 1 South 27 EGL2 2017-06-14 05:59:06 2017-06-14 05:59:02.506222 0.274918 Arriving
5 141 1 South 27 EGL2 2017-06-14 06:00:06 2017-06-14 06:00:02.395306 0.000000 AtStation
6 141 1 South 27 EGL2 2017-06-14 06:01:06 2017-06-14 06:01:02.224179 0.000000 AtStation
7 141 1 South 11 MUS2 2017-06-14 06:02:06 2017-06-14 06:02:02.127224 18.099605 Delayed
8 141 1 South 12 QPK2 2017-06-14 06:02:06 2017-06-14 06:02:02.127553 16.777805 Delayed
9 141 1 South 13 STP2 2017-06-14 06:02:06 2017-06-14 06:02:02.127905 15.583304 Delayed
10 141 1 South 15 STA2 2017-06-14 06:02:06 2017-06-14 06:02:02.128542 14.078819 Delayed
11 141 1 South 19 DUN2 2017-06-14 06:02:06 2017-06-14 06:02:02.134013 10.627591 Delayed
12 141 1 South 20 COL2 2017-06-14 06:02:06 2017-06-14 06:02:02.134354 9.724741 Delayed
13 141 1 South 23 ROS2 2017-06-14 06:02:06 2017-06-14 06:02:02.135237 6.368941 Delayed
14 141 1 South 24 SUM2 2017-06-14 06:02:06 2017-06-14 06:02:02.135474 5.305325 Delayed
15 141 1 South 25 STC2 2017-06-14 06:02:06 2017-06-14 06:02:02.135727 3.833515 Delayed
16 141 1 South 27 EGL2 2017-06-14 06:02:06 2017-06-14 06:02:02.136214 0.000000 Delayed
17 141 1 South 7 SCW2 2017-06-14 06:02:06 2017-06-14 06:02:02.125818 24.339155 Delayed
18 141 1 South 4 LWW2 2017-06-14 06:02:06 2017-06-14 06:02:02.119089 32.761508 Delayed
19 141 1 South 2 WIL2 2017-06-14 06:02:06 2017-06-14 06:02:02.115756 37.015169 Delayed
20 141 1 South 3 YKD2 2017-06-14 06:02:06 2017-06-14 06:02:04.741088 34.891925 Delayed
21 141 1 South 5 GCN2 2017-06-14 06:02:06 2017-06-14 06:02:02.124873 31.332132 Delayed
22 141 1 South 8 DUP2 2017-06-14 06:02:06 2017-06-14 06:02:02.126152 21.575233 Delayed
23 141 1 South 10 SGU2 2017-06-14 06:02:06 2017-06-14 06:02:02.126807 19.312334 Delayed
24 141 1 South 17 KNG2 2017-06-14 06:02:06 2017-06-14 06:02:02.133276 12.273120 Delayed
25 141 1 South 16 UNI2 2017-06-14 06:02:06 2017-06-14 06:02:02.132949 13.112619 Delayed
26 141 1 South 14 OSG2 2017-06-14 06:02:06 2017-06-14 06:02:02.128224 14.799984 Delayed
27 141 1 South 9 SPA2 2017-06-14 06:02:06 2017-06-14 06:02:02.126467 20.533729 Delayed
28 141 1 South 6 EGW2 2017-06-14 06:02:06 2017-06-14 06:02:02.125432 29.420454 Delayed
29 141 1 South 26 DAV2 2017-06-14 06:02:06 2017-06-14 06:02:02.135980 1.689353 Delayed
... ... ... ... ... ... ... ... ... ...
1017 141 1 South 18 QUN2 2017-06-15 01:16:12 2017-06-15 01:16:01.899701 0.756791 Arriving
1018 141 1 South 19 DUN2 2017-06-15 01:16:12 2017-06-15 01:16:01.900142 0.000000 AtStation
1019 141 1 South 18 QUN2 2017-06-15 01:17:13 2017-06-15 01:17:01.668437 0.000000 AtStation
1020 141 1 South 17 KNG2 2017-06-15 01:17:13 2017-06-15 01:17:01.668099 0.825653 Arriving
1021 141 1 South 16 UNI2 2017-06-15 01:18:13 2017-06-15 01:18:01.908305 0.807027 Arriving
1022 141 1 South 17 KNG2 2017-06-15 01:18:13 2017-06-15 01:18:01.908617 0.000000 AtStation
1023 141 1 South 15 STA2 2017-06-15 01:19:13 2017-06-15 01:19:01.493000 0.926689 Arriving
1024 141 1 South 16 UNI2 2017-06-15 01:19:13 2017-06-15 01:19:01.493331 0.000000 AtStation
1025 141 1 South 15 STA2 2017-06-15 01:20:13 2017-06-15 01:20:02.061234 0.632178 Arriving
1026 141 1 South 15 STA2 2017-06-15 01:21:13 2017-06-15 01:21:02.070662 0.000000 AtStation
1027 141 1 South 14 OSG2 2017-06-15 01:21:13 2017-06-15 01:21:02.070335 0.695267 Arriving
1028 141 1 South 14 OSG2 2017-06-15 01:22:14 2017-06-15 01:22:02.439572 0.000000 AtStation
1029 141 1 South 13 STP2 2017-06-15 01:22:14 2017-06-15 01:22:02.439261 0.753969 Arriving
1030 141 1 South 13 STP2 2017-06-15 01:23:12 2017-06-15 01:23:01.788040 0.000000 AtStation
1031 141 1 South 12 QPK2 2017-06-15 01:24:13 2017-06-15 01:24:02.265015 0.549204 Arriving
1032 141 1 South 11 MUS2 2017-06-15 01:26:12 2017-06-15 01:26:02.167976 0.000000 AtStation
1033 141 1 South 10 SGU2 2017-06-15 01:27:13 2017-06-15 01:27:01.512989 0.365478 Arriving
1034 141 1 South 9 SPA2 2017-06-15 01:28:14 2017-06-15 01:28:01.936863 0.912707 Arriving
1035 141 1 South 8 DUP2 2017-06-15 01:29:14 2017-06-15 01:29:01.940110 0.997809 Arriving
1036 141 1 South 9 SPA2 2017-06-15 01:29:14 2017-06-15 01:29:01.940508 0.000000 AtStation
1037 141 1 South 8 DUP2 2017-06-15 01:30:13 2017-06-15 01:30:01.827342 0.279964 Arriving
1038 141 1 South 7 SCW2 2017-06-15 01:32:12 2017-06-15 01:32:02.045869 0.960120 Arriving
1039 141 1 South 7 SCW2 2017-06-15 01:33:13 2017-06-15 01:33:02.293891 0.000000 AtStation
1040 141 1 South 6 EGW2 2017-06-15 01:37:14 2017-06-15 01:37:02.000244 0.282222 Arriving
1041 141 1 South 5 GCN2 2017-06-15 01:39:14 2017-06-15 01:39:01.928819 0.000000 AtStation
1042 141 1 South 4 LWW2 2017-06-15 01:40:14 2017-06-15 01:40:02.068115 0.500662 Arriving
1043 141 1 South 3 YKD2 2017-06-15 01:42:14 2017-06-15 01:42:01.954696 0.275449 Arriving
1044 141 1 South 2 WIL2 2017-06-15 01:44:14 2017-06-15 01:44:02.325968 0.291253 Arriving
1045 141 1 North 1 SHW1 2017-06-15 01:48:14 2017-06-15 01:48:02.248363 0.255000 Arriving
1046 141 1 South 1 SHW2 2017-06-15 01:48:14 2017-06-15 01:48:02.248363 0.000000 AtStation

1047 rows × 9 columns

Using the filtered schema instead


In [7]:
sql = '''CREATE MATERIALIZED VIEW filtered.test_day AS 
SELECT requestid, stationid, lineid, create_date, request_date, station_char, subwayline, system_message_type, 
            timint, traindirection, trainid, train_message
FROM filtered.requests
INNER JOIN filtered.ntas_data USING (requestid)
WHERE request_date >= '2017-06-14'::DATE + interval '5 hours' 
AND request_date <  '2017-06-14'::DATE + interval '29 hours' 
''' 
with con:
    with con.cursor() as cur:
        cur.execute(sql)

In [8]:
sql = ''' SELECT trainid, lineid, traindirection, stationid, station_char, create_date, request_date, timint, train_message
        FROM filtered.test_day
        INNER JOIN (SELECT trainid FROM filtered.test_day WHERE lineid = 1 AND create_date::TIME > '07:00'::TIME LIMIT 1) one_train USING (trainid)
        WHERE (timint < 1 OR train_message != 'Arriving') AND lineid = 1
        ORDER BY create_date
'''
one_train = pandasql.read_sql(sql, con)

In [9]:
one_train


Out[9]:
trainid lineid traindirection stationid station_char create_date request_date timint train_message
0 136 1 North 2 WIL1 2017-06-14 06:59:26 2017-06-14 06:59:22.142971 0.000000 AtStation
1 136 1 North 2 WIL1 2017-06-14 06:59:46 2017-06-14 06:59:42.124398 0.000000 AtStation
2 136 1 North 2 WIL1 2017-06-14 07:00:05 2017-06-14 07:00:01.656501 0.000000 AtStation
3 136 1 North 3 YKD1 2017-06-14 07:01:25 2017-06-14 07:01:21.823157 0.000000 AtStation
4 136 1 North 3 YKD1 2017-06-14 07:01:46 2017-06-14 07:01:41.887395 0.000000 AtStation
5 136 1 North 4 LWW1 2017-06-14 07:03:06 2017-06-14 07:03:02.301730 0.000000 AtStation
6 136 1 North 4 LWW1 2017-06-14 07:03:26 2017-06-14 07:03:22.111316 0.000000 AtStation
7 136 1 North 5 GCN1 2017-06-14 07:04:45 2017-06-14 07:04:41.595215 0.000000 AtStation
8 136 1 North 6 EGW1 2017-06-14 07:06:26 2017-06-14 07:06:21.852403 0.000000 AtStation
9 136 1 North 6 EGW1 2017-06-14 07:06:45 2017-06-14 07:06:41.815578 0.000000 AtStation
10 136 1 North 7 SCW1 2017-06-14 07:10:26 2017-06-14 07:10:22.069010 0.000000 AtStation
11 136 1 North 7 SCW1 2017-06-14 07:10:45 2017-06-14 07:10:42.084671 0.000000 AtStation
12 136 1 North 8 DUP1 2017-06-14 07:13:06 2017-06-14 07:13:01.914047 0.000000 AtStation
13 136 1 North 9 SPA1 2017-06-14 07:14:06 2017-06-14 07:14:02.357905 0.000000 AtStation
14 136 1 North 9 SPA1 2017-06-14 07:14:26 2017-06-14 07:14:22.170577 0.000000 AtStation
15 136 1 North 10 SGU1 2017-06-14 07:15:46 2017-06-14 07:15:41.504648 0.000000 AtStation
16 136 1 North 10 SGU1 2017-06-14 07:16:04 2017-06-14 07:16:01.909830 0.000000 AtStation
17 136 1 North 11 MUS1 2017-06-14 07:17:24 2017-06-14 07:17:21.285904 0.000000 AtStation
18 136 1 North 11 MUS1 2017-06-14 07:17:45 2017-06-14 07:17:41.299348 0.000000 AtStation
19 136 1 North 12 QPK1 2017-06-14 07:18:46 2017-06-14 07:18:41.909137 0.000000 AtStation
20 136 1 North 12 QPK1 2017-06-14 07:19:06 2017-06-14 07:19:01.943607 0.000000 AtStation
21 136 1 North 13 STP1 2017-06-14 07:20:06 2017-06-14 07:20:01.635841 0.000000 AtStation
22 136 1 North 13 STP1 2017-06-14 07:20:23 2017-06-14 07:20:21.450990 0.000000 AtStation
23 136 1 North 15 STA1 2017-06-14 07:22:05 2017-06-14 07:22:01.788703 0.000000 AtStation
24 136 1 North 15 STA1 2017-06-14 07:22:26 2017-06-14 07:22:21.566391 0.000000 AtStation
25 136 1 North 16 UNI1 2017-06-14 07:23:45 2017-06-14 07:23:42.172236 0.000000 AtStation
26 136 1 North 16 UNI1 2017-06-14 07:24:07 2017-06-14 07:24:01.977092 0.000000 AtStation
27 136 1 North 16 UNI1 2017-06-14 07:24:24 2017-06-14 07:24:21.772028 0.000000 AtStation
28 136 1 North 17 KNG1 2017-06-14 07:25:06 2017-06-14 07:25:02.305812 0.000000 AtStation
29 136 1 North 17 KNG1 2017-06-14 07:25:26 2017-06-14 07:25:22.105715 0.000000 AtStation
... ... ... ... ... ... ... ... ... ...
1535 136 1 North 11 MUS1 2017-06-15 01:51:14 2017-06-15 01:51:02.420292 20.268782 Delayed
1536 136 1 North 20 COL1 2017-06-15 01:51:14 2017-06-15 01:51:02.423412 28.415942 Delayed
1537 136 1 North 6 EGW1 2017-06-15 01:51:14 2017-06-15 01:51:02.418494 9.612574 Delayed
1538 136 1 North 9 SPA1 2017-06-15 01:51:14 2017-06-15 01:51:02.419538 17.937178 Delayed
1539 136 1 North 16 UNI1 2017-06-15 01:51:14 2017-06-15 01:51:02.422011 25.174006 Delayed
1540 136 1 North 23 ROS1 2017-06-15 01:51:15 2017-06-15 01:51:02.424313 31.616520 Delayed
1541 136 1 North 21 WEL1 2017-06-15 01:51:15 2017-06-15 01:51:02.423765 29.333600 Delayed
1542 136 1 North 24 SUM1 2017-06-15 01:51:15 2017-06-15 01:51:02.424552 32.645937 Delayed
1543 136 1 North 22 BLO1 2017-06-15 01:51:16 2017-06-15 01:51:05.774717 30.287946 Delayed
1544 136 1 North 12 QPK1 2017-06-15 01:51:33 2017-06-15 01:51:22.205588 21.531315 Delayed
1545 136 1 North 14 OSG1 2017-06-15 01:51:33 2017-06-15 01:51:22.206234 23.433235 Delayed
1546 136 1 North 20 COL1 2017-06-15 01:51:33 2017-06-15 01:51:22.208288 28.415942 Delayed
1547 136 1 North 7 SCW1 2017-06-15 01:51:33 2017-06-15 01:51:22.203779 14.312010 Delayed
1548 136 1 North 21 WEL1 2017-06-15 01:51:33 2017-06-15 01:51:22.208632 29.333600 Delayed
1549 136 1 North 11 MUS1 2017-06-15 01:51:33 2017-06-15 01:51:22.205248 20.268782 Delayed
1550 136 1 North 18 QUN1 2017-06-15 01:51:33 2017-06-15 01:51:22.207543 26.794551 Delayed
1551 136 1 North 5 GCN1 2017-06-15 01:51:33 2017-06-15 01:51:22.202781 7.792370 Delayed
1552 136 1 North 6 EGW1 2017-06-15 01:51:33 2017-06-15 01:51:22.203401 9.612574 Delayed
1553 136 1 North 22 BLO1 2017-06-15 01:51:33 2017-06-15 01:51:22.208937 30.287946 Delayed
1554 136 1 North 19 DUN1 2017-06-15 01:51:33 2017-06-15 01:51:22.207922 27.549084 Delayed
1555 136 1 North 2 WIL1 2017-06-15 01:51:33 2017-06-15 01:51:22.200788 2.400054 Delayed
1556 136 1 North 13 STP1 2017-06-15 01:51:33 2017-06-15 01:51:22.205914 22.682088 Delayed
1557 136 1 North 16 UNI1 2017-06-15 01:51:33 2017-06-15 01:51:22.206887 25.174006 Delayed
1558 136 1 North 17 KNG1 2017-06-15 01:51:33 2017-06-15 01:51:22.207207 25.968897 Delayed
1559 136 1 North 15 STA1 2017-06-15 01:51:33 2017-06-15 01:51:22.206556 24.267073 Delayed
1560 136 1 North 4 LWW1 2017-06-15 01:51:33 2017-06-15 01:51:22.202098 6.433881 Delayed
1561 136 1 North 3 YKD1 2017-06-15 01:51:33 2017-06-15 01:51:22.201337 4.415556 Delayed
1562 136 1 North 8 DUP1 2017-06-15 01:51:33 2017-06-15 01:51:22.204115 16.936547 Delayed
1563 136 1 North 9 SPA1 2017-06-15 01:51:33 2017-06-15 01:51:22.204482 17.937178 Delayed
1564 136 1 North 10 SGU1 2017-06-15 01:51:33 2017-06-15 01:51:22.204836 19.066785 Delayed

1565 rows × 9 columns

Ah. We can see train 136 skipped station 14. Fortunately, we have unfiltered data from the same day


In [3]:
sql = ''' SELECT trainid, lineid, traindirection, stationid, station_char, create_date, create_date + timint * interval '1 minute' AS expected_arrival, timint, train_message
        FROM test_day
        WHERE trainid = 136 AND (timint < 1 OR train_message != 'Arriving') AND lineid = 1
        ORDER BY create_date + timint * interval '1 minute'
'''
train_136 = pandasql.read_sql(sql, con)

In [4]:
train_136


Out[4]:
trainid lineid traindirection stationid station_char create_date expected_arrival timint train_message
0 136 1 North 2 WIL1 2017-06-14 06:59:06 2017-06-14 06:59:13.001077 0.116685 Arriving
1 136 1 North 2 WIL1 2017-06-14 07:00:05 2017-06-14 07:00:05.000000 0.000000 AtStation
2 136 1 North 3 YKD1 2017-06-14 07:01:06 2017-06-14 07:01:43.651765 0.627529 Arriving
3 136 1 North 4 LWW1 2017-06-14 07:03:06 2017-06-14 07:03:06.000000 0.000000 AtStation
4 136 1 North 6 EGW1 2017-06-14 07:06:06 2017-06-14 07:06:19.733929 0.228899 Arriving
5 136 1 North 7 SCW1 2017-06-14 07:10:07 2017-06-14 07:10:14.404847 0.123414 Arriving
6 136 1 North 8 DUP1 2017-06-14 07:13:06 2017-06-14 07:13:06.000000 0.000000 AtStation
7 136 1 North 9 SPA1 2017-06-14 07:14:06 2017-06-14 07:14:06.000000 0.000000 AtStation
8 136 1 North 10 SGU1 2017-06-14 07:15:07 2017-06-14 07:15:40.796941 0.563282 Arriving
9 136 1 North 10 SGU1 2017-06-14 07:16:04 2017-06-14 07:16:04.000000 0.000000 AtStation
10 136 1 North 11 MUS1 2017-06-14 07:17:06 2017-06-14 07:17:22.548847 0.275814 Arriving
11 136 1 North 12 QPK1 2017-06-14 07:18:07 2017-06-14 07:18:58.529129 0.858819 Arriving
12 136 1 North 12 QPK1 2017-06-14 07:19:06 2017-06-14 07:19:06.000000 0.000000 AtStation
13 136 1 North 13 STP1 2017-06-14 07:20:06 2017-06-14 07:20:06.000000 0.000000 AtStation
14 136 1 North 14 OSG1 2017-06-14 07:20:06 2017-06-14 07:20:52.819906 0.780332 Arriving
15 136 1 North 15 STA1 2017-06-14 07:21:06 2017-06-14 07:21:42.773224 0.612887 Arriving
16 136 1 North 15 STA1 2017-06-14 07:22:05 2017-06-14 07:22:05.000000 0.000000 AtStation
17 136 1 North 16 UNI1 2017-06-14 07:22:05 2017-06-14 07:23:01.716941 0.945282 Arriving
18 136 1 North 16 UNI1 2017-06-14 07:23:07 2017-06-14 07:23:24.221200 0.287020 Arriving
19 136 1 North 16 UNI1 2017-06-14 07:24:07 2017-06-14 07:24:07.000000 0.000000 AtStation
20 136 1 North 17 KNG1 2017-06-14 07:24:07 2017-06-14 07:24:56.598965 0.826649 Arriving
21 136 1 North 17 KNG1 2017-06-14 07:25:06 2017-06-14 07:25:06.000000 0.000000 AtStation
22 136 1 North 18 QUN1 2017-06-14 07:25:06 2017-06-14 07:25:57.553271 0.859221 Arriving
23 136 1 North 18 QUN1 2017-06-14 07:26:06 2017-06-14 07:26:06.000000 0.000000 AtStation
24 136 1 North 19 DUN1 2017-06-14 07:26:06 2017-06-14 07:26:53.035059 0.783918 Arriving
25 136 1 North 19 DUN1 2017-06-14 07:27:06 2017-06-14 07:27:11.701553 0.095026 Arriving
26 136 1 North 20 COL1 2017-06-14 07:27:06 2017-06-14 07:28:05.872518 0.997875 Arriving
27 136 1 North 20 COL1 2017-06-14 07:28:06 2017-06-14 07:28:44.870965 0.647849 Arriving
28 136 1 North 21 WEL1 2017-06-14 07:29:07 2017-06-14 07:29:49.098259 0.701638 Arriving
29 136 1 North 22 BLO1 2017-06-14 07:30:07 2017-06-14 07:30:51.429082 0.740485 Arriving
... ... ... ... ... ... ... ... ... ...
1006 136 1 North 13 STP1 2017-06-15 01:51:14 2017-06-15 02:13:54.925306 22.682088 Delayed
1007 136 1 North 16 UNI1 2017-06-15 01:49:12 2017-06-15 02:14:22.440372 25.174006 Delayed
1008 136 1 North 15 STA1 2017-06-15 01:50:14 2017-06-15 02:14:30.024372 24.267073 Delayed
1009 136 1 North 14 OSG1 2017-06-15 01:51:14 2017-06-15 02:14:39.994106 23.433235 Delayed
1010 136 1 North 17 KNG1 2017-06-15 01:49:12 2017-06-15 02:15:10.133839 25.968897 Delayed
1011 136 1 North 16 UNI1 2017-06-15 01:50:14 2017-06-15 02:15:24.440372 25.174006 Delayed
1012 136 1 North 15 STA1 2017-06-15 01:51:14 2017-06-15 02:15:30.024372 24.267073 Delayed
1013 136 1 North 18 QUN1 2017-06-15 01:49:12 2017-06-15 02:15:59.673039 26.794551 Delayed
1014 136 1 North 17 KNG1 2017-06-15 01:50:14 2017-06-15 02:16:12.133839 25.968897 Delayed
1015 136 1 North 16 UNI1 2017-06-15 01:51:14 2017-06-15 02:16:24.440372 25.174006 Delayed
1016 136 1 North 19 DUN1 2017-06-15 01:49:12 2017-06-15 02:16:44.945039 27.549084 Delayed
1017 136 1 North 18 QUN1 2017-06-15 01:50:14 2017-06-15 02:17:01.673039 26.794551 Delayed
1018 136 1 North 17 KNG1 2017-06-15 01:51:14 2017-06-15 02:17:12.133839 25.968897 Delayed
1019 136 1 North 20 COL1 2017-06-15 01:49:12 2017-06-15 02:17:36.956506 28.415942 Delayed
1020 136 1 North 19 DUN1 2017-06-15 01:50:14 2017-06-15 02:17:46.945039 27.549084 Delayed
1021 136 1 North 18 QUN1 2017-06-15 01:51:14 2017-06-15 02:18:01.673039 26.794551 Delayed
1022 136 1 North 21 WEL1 2017-06-15 01:49:15 2017-06-15 02:18:35.015972 29.333600 Delayed
1023 136 1 North 20 COL1 2017-06-15 01:50:14 2017-06-15 02:18:38.956506 28.415942 Delayed
1024 136 1 North 19 DUN1 2017-06-15 01:51:14 2017-06-15 02:18:46.945039 27.549084 Delayed
1025 136 1 North 22 BLO1 2017-06-15 01:49:15 2017-06-15 02:19:32.276772 30.287946 Delayed
1026 136 1 North 21 WEL1 2017-06-15 01:50:14 2017-06-15 02:19:34.015972 29.333600 Delayed
1027 136 1 North 20 COL1 2017-06-15 01:51:14 2017-06-15 02:19:38.956506 28.415942 Delayed
1028 136 1 North 22 BLO1 2017-06-15 01:50:14 2017-06-15 02:20:31.276772 30.287946 Delayed
1029 136 1 North 21 WEL1 2017-06-15 01:51:15 2017-06-15 02:20:35.015972 29.333600 Delayed
1030 136 1 North 23 ROS1 2017-06-15 01:49:15 2017-06-15 02:20:51.991172 31.616520 Delayed
1031 136 1 North 22 BLO1 2017-06-15 01:51:16 2017-06-15 02:21:33.276772 30.287946 Delayed
1032 136 1 North 23 ROS1 2017-06-15 01:50:14 2017-06-15 02:21:50.991172 31.616520 Delayed
1033 136 1 North 23 ROS1 2017-06-15 01:51:15 2017-06-15 02:22:51.991172 31.616520 Delayed
1034 136 1 North 24 SUM1 2017-06-15 01:50:14 2017-06-15 02:22:52.756239 32.645937 Delayed
1035 136 1 North 24 SUM1 2017-06-15 01:51:15 2017-06-15 02:23:53.756239 32.645937 Delayed

1036 rows × 9 columns


In [7]:
train_136.iloc[[14]]


Out[7]:
trainid lineid traindirection stationid station_char create_date expected_arrival timint train_message
14 136 1 North 14 OSG1 2017-06-14 07:20:06 2017-06-14 07:20:52.819906 0.780332 Arriving

So we have an expected arrival time at Osgoode station from the unfiltered dataset, meaning that it can have some use after all! However, we can see at the end that the train is super delayed


In [9]:
train_136[train_136['create_date'] > datetime.datetime(2017, 6, 15, 1,30)]


Out[9]:
trainid lineid traindirection stationid station_char create_date expected_arrival timint train_message
964 136 1 South 3 YKD2 2017-06-15 01:31:13 2017-06-15 01:31:29.526933 0.275449 Arriving
965 136 1 South 2 WIL2 2017-06-15 01:33:13 2017-06-15 01:33:13.000000 0.000000 AtStation
966 136 1 South 1 SHW2 2017-06-15 01:36:14 2017-06-15 01:37:04.535556 0.842259 Arriving
967 136 1 North 1 SHW1 2017-06-15 01:37:14 2017-06-15 01:37:32.438889 0.307315 Arriving
968 136 1 North 2 WIL1 2017-06-15 01:49:12 2017-06-15 01:51:36.003239 2.400054 Delayed
969 136 1 North 2 WIL1 2017-06-15 01:50:14 2017-06-15 01:52:38.003239 2.400054 Delayed
970 136 1 North 3 YKD1 2017-06-15 01:49:12 2017-06-15 01:53:36.933372 4.415556 Delayed
971 136 1 North 2 WIL1 2017-06-15 01:51:14 2017-06-15 01:53:38.003239 2.400054 Delayed
972 136 1 North 3 YKD1 2017-06-15 01:50:14 2017-06-15 01:54:38.933372 4.415556 Delayed
973 136 1 North 4 LWW1 2017-06-15 01:49:12 2017-06-15 01:55:38.032839 6.433881 Delayed
974 136 1 North 3 YKD1 2017-06-15 01:51:14 2017-06-15 01:55:38.933372 4.415556 Delayed
975 136 1 North 4 LWW1 2017-06-15 01:50:14 2017-06-15 01:56:40.032839 6.433881 Delayed
976 136 1 North 5 GCN1 2017-06-15 01:49:12 2017-06-15 01:56:59.542172 7.792370 Delayed
977 136 1 North 4 LWW1 2017-06-15 01:51:14 2017-06-15 01:57:40.032839 6.433881 Delayed
978 136 1 North 5 GCN1 2017-06-15 01:50:14 2017-06-15 01:58:01.542172 7.792370 Delayed
979 136 1 North 6 EGW1 2017-06-15 01:49:12 2017-06-15 01:58:48.754439 9.612574 Delayed
980 136 1 North 5 GCN1 2017-06-15 01:51:14 2017-06-15 01:59:01.542172 7.792370 Delayed
981 136 1 North 6 EGW1 2017-06-15 01:50:14 2017-06-15 01:59:50.754439 9.612574 Delayed
982 136 1 North 6 EGW1 2017-06-15 01:51:14 2017-06-15 02:00:50.754439 9.612574 Delayed
983 136 1 North 7 SCW1 2017-06-15 01:49:17 2017-06-15 02:03:35.720572 14.312010 Delayed
984 136 1 North 7 SCW1 2017-06-15 01:50:14 2017-06-15 02:04:32.720572 14.312010 Delayed
985 136 1 North 7 SCW1 2017-06-15 01:51:14 2017-06-15 02:05:32.720572 14.312010 Delayed
986 136 1 North 8 DUP1 2017-06-15 01:49:12 2017-06-15 02:06:08.192839 16.936547 Delayed
987 136 1 North 9 SPA1 2017-06-15 01:49:12 2017-06-15 02:07:08.230706 17.937178 Delayed
988 136 1 North 8 DUP1 2017-06-15 01:50:14 2017-06-15 02:07:10.192839 16.936547 Delayed
989 136 1 North 8 DUP1 2017-06-15 01:51:14 2017-06-15 02:08:10.192839 16.936547 Delayed
990 136 1 North 9 SPA1 2017-06-15 01:50:14 2017-06-15 02:08:10.230706 17.937178 Delayed
991 136 1 North 10 SGU1 2017-06-15 01:49:12 2017-06-15 02:08:16.007106 19.066785 Delayed
992 136 1 North 9 SPA1 2017-06-15 01:51:14 2017-06-15 02:09:10.230706 17.937178 Delayed
993 136 1 North 10 SGU1 2017-06-15 01:50:14 2017-06-15 02:09:18.007106 19.066785 Delayed
... ... ... ... ... ... ... ... ... ...
1006 136 1 North 13 STP1 2017-06-15 01:51:14 2017-06-15 02:13:54.925306 22.682088 Delayed
1007 136 1 North 16 UNI1 2017-06-15 01:49:12 2017-06-15 02:14:22.440372 25.174006 Delayed
1008 136 1 North 15 STA1 2017-06-15 01:50:14 2017-06-15 02:14:30.024372 24.267073 Delayed
1009 136 1 North 14 OSG1 2017-06-15 01:51:14 2017-06-15 02:14:39.994106 23.433235 Delayed
1010 136 1 North 17 KNG1 2017-06-15 01:49:12 2017-06-15 02:15:10.133839 25.968897 Delayed
1011 136 1 North 16 UNI1 2017-06-15 01:50:14 2017-06-15 02:15:24.440372 25.174006 Delayed
1012 136 1 North 15 STA1 2017-06-15 01:51:14 2017-06-15 02:15:30.024372 24.267073 Delayed
1013 136 1 North 18 QUN1 2017-06-15 01:49:12 2017-06-15 02:15:59.673039 26.794551 Delayed
1014 136 1 North 17 KNG1 2017-06-15 01:50:14 2017-06-15 02:16:12.133839 25.968897 Delayed
1015 136 1 North 16 UNI1 2017-06-15 01:51:14 2017-06-15 02:16:24.440372 25.174006 Delayed
1016 136 1 North 19 DUN1 2017-06-15 01:49:12 2017-06-15 02:16:44.945039 27.549084 Delayed
1017 136 1 North 18 QUN1 2017-06-15 01:50:14 2017-06-15 02:17:01.673039 26.794551 Delayed
1018 136 1 North 17 KNG1 2017-06-15 01:51:14 2017-06-15 02:17:12.133839 25.968897 Delayed
1019 136 1 North 20 COL1 2017-06-15 01:49:12 2017-06-15 02:17:36.956506 28.415942 Delayed
1020 136 1 North 19 DUN1 2017-06-15 01:50:14 2017-06-15 02:17:46.945039 27.549084 Delayed
1021 136 1 North 18 QUN1 2017-06-15 01:51:14 2017-06-15 02:18:01.673039 26.794551 Delayed
1022 136 1 North 21 WEL1 2017-06-15 01:49:15 2017-06-15 02:18:35.015972 29.333600 Delayed
1023 136 1 North 20 COL1 2017-06-15 01:50:14 2017-06-15 02:18:38.956506 28.415942 Delayed
1024 136 1 North 19 DUN1 2017-06-15 01:51:14 2017-06-15 02:18:46.945039 27.549084 Delayed
1025 136 1 North 22 BLO1 2017-06-15 01:49:15 2017-06-15 02:19:32.276772 30.287946 Delayed
1026 136 1 North 21 WEL1 2017-06-15 01:50:14 2017-06-15 02:19:34.015972 29.333600 Delayed
1027 136 1 North 20 COL1 2017-06-15 01:51:14 2017-06-15 02:19:38.956506 28.415942 Delayed
1028 136 1 North 22 BLO1 2017-06-15 01:50:14 2017-06-15 02:20:31.276772 30.287946 Delayed
1029 136 1 North 21 WEL1 2017-06-15 01:51:15 2017-06-15 02:20:35.015972 29.333600 Delayed
1030 136 1 North 23 ROS1 2017-06-15 01:49:15 2017-06-15 02:20:51.991172 31.616520 Delayed
1031 136 1 North 22 BLO1 2017-06-15 01:51:16 2017-06-15 02:21:33.276772 30.287946 Delayed
1032 136 1 North 23 ROS1 2017-06-15 01:50:14 2017-06-15 02:21:50.991172 31.616520 Delayed
1033 136 1 North 23 ROS1 2017-06-15 01:51:15 2017-06-15 02:22:51.991172 31.616520 Delayed
1034 136 1 North 24 SUM1 2017-06-15 01:50:14 2017-06-15 02:22:52.756239 32.645937 Delayed
1035 136 1 North 24 SUM1 2017-06-15 01:51:15 2017-06-15 02:23:53.756239 32.645937 Delayed

72 rows × 9 columns

So this doesn't seem like a particularly good example, since the train is just ultimately stuck at Sheppard West station until the end of the (scraping) day. The solution in this case would probably be to just filter out any of these observations where train_message == 'Delayed' and timint > 2. Let's try to see if we can find anything else.


In [10]:
train_136[train_136['train_message'] == 'Delayed']


Out[10]:
trainid lineid traindirection stationid station_char create_date expected_arrival timint train_message
98 136 1 South 7 SCW2 2017-06-14 08:50:07 2017-06-14 08:50:07.000000 0.000000 Delayed
99 136 1 South 7 SCW2 2017-06-14 08:51:07 2017-06-14 08:51:07.000000 0.000000 Delayed
100 136 1 South 7 SCW2 2017-06-14 08:52:05 2017-06-14 08:52:05.000000 0.000000 Delayed
101 136 1 South 7 SCW2 2017-06-14 08:53:06 2017-06-14 08:53:06.000000 0.000000 Delayed
102 136 1 South 6 EGW2 2017-06-14 08:50:07 2017-06-14 08:55:11.877929 5.081299 Delayed
103 136 1 South 6 EGW2 2017-06-14 08:51:07 2017-06-14 08:56:11.877929 5.081299 Delayed
104 136 1 South 6 EGW2 2017-06-14 08:52:05 2017-06-14 08:57:09.877929 5.081299 Delayed
105 136 1 South 6 EGW2 2017-06-14 08:53:06 2017-06-14 08:58:10.877929 5.081299 Delayed
165 136 1 South 29 YKM2 2017-06-14 10:19:00 2017-06-14 10:19:00.000000 0.000000 Delayed
167 136 1 South 28 LAW2 2017-06-14 10:19:00 2017-06-14 10:22:41.602937 3.693382 Delayed
283 136 1 South 27 EGL2 2017-06-14 12:55:07 2017-06-14 12:55:07.000000 0.000000 Delayed
285 136 1 South 26 DAV2 2017-06-14 12:55:07 2017-06-14 12:56:48.080847 1.684681 Delayed
287 136 1 South 25 STC2 2017-06-14 12:55:07 2017-06-14 12:58:56.361371 3.822690 Delayed
289 136 1 South 24 SUM2 2017-06-14 12:55:07 2017-06-14 13:00:24.432122 5.290535 Delayed
292 136 1 South 23 ROS2 2017-06-14 12:55:07 2017-06-14 13:01:28.091075 6.351518 Delayed
294 136 1 South 22 BLO2 2017-06-14 12:55:07 2017-06-14 13:02:51.336654 7.738944 Delayed
297 136 1 South 21 WEL2 2017-06-14 12:55:07 2017-06-14 13:03:50.921016 8.732017 Delayed
299 136 1 South 22 BLO2 2017-06-14 13:04:07 2017-06-14 13:04:07.000000 0.000000 Delayed
300 136 1 South 20 COL2 2017-06-14 12:55:07 2017-06-14 13:04:48.932730 9.698879 Delayed
301 136 1 South 21 WEL2 2017-06-14 13:04:07 2017-06-14 13:05:06.584362 0.993073 Delayed
303 136 1 South 19 DUN2 2017-06-14 12:55:07 2017-06-14 13:05:42.977079 10.599618 Delayed
304 136 1 South 20 COL2 2017-06-14 13:04:07 2017-06-14 13:06:04.596076 1.959935 Delayed
307 136 1 South 18 QUN2 2017-06-14 12:55:07 2017-06-14 13:06:30.051734 11.384196 Delayed
308 136 1 South 19 DUN2 2017-06-14 13:04:07 2017-06-14 13:06:58.640425 2.860674 Delayed
309 136 1 South 17 KNG2 2017-06-14 12:55:07 2017-06-14 13:07:21.486916 12.241449 Delayed
311 136 1 South 18 QUN2 2017-06-14 13:04:07 2017-06-14 13:07:45.715080 3.645251 Delayed
313 136 1 South 17 KNG2 2017-06-14 13:04:07 2017-06-14 13:08:37.150262 4.502504 Delayed
316 136 1 South 16 UNI2 2017-06-14 13:04:07 2017-06-14 13:09:27.405957 5.340099 Delayed
319 136 1 South 15 STA2 2017-06-14 13:04:07 2017-06-14 13:10:25.238960 6.303983 Delayed
322 136 1 South 14 OSG2 2017-06-14 13:04:07 2017-06-14 13:11:08.417736 7.023629 Delayed
... ... ... ... ... ... ... ... ... ...
1006 136 1 North 13 STP1 2017-06-15 01:51:14 2017-06-15 02:13:54.925306 22.682088 Delayed
1007 136 1 North 16 UNI1 2017-06-15 01:49:12 2017-06-15 02:14:22.440372 25.174006 Delayed
1008 136 1 North 15 STA1 2017-06-15 01:50:14 2017-06-15 02:14:30.024372 24.267073 Delayed
1009 136 1 North 14 OSG1 2017-06-15 01:51:14 2017-06-15 02:14:39.994106 23.433235 Delayed
1010 136 1 North 17 KNG1 2017-06-15 01:49:12 2017-06-15 02:15:10.133839 25.968897 Delayed
1011 136 1 North 16 UNI1 2017-06-15 01:50:14 2017-06-15 02:15:24.440372 25.174006 Delayed
1012 136 1 North 15 STA1 2017-06-15 01:51:14 2017-06-15 02:15:30.024372 24.267073 Delayed
1013 136 1 North 18 QUN1 2017-06-15 01:49:12 2017-06-15 02:15:59.673039 26.794551 Delayed
1014 136 1 North 17 KNG1 2017-06-15 01:50:14 2017-06-15 02:16:12.133839 25.968897 Delayed
1015 136 1 North 16 UNI1 2017-06-15 01:51:14 2017-06-15 02:16:24.440372 25.174006 Delayed
1016 136 1 North 19 DUN1 2017-06-15 01:49:12 2017-06-15 02:16:44.945039 27.549084 Delayed
1017 136 1 North 18 QUN1 2017-06-15 01:50:14 2017-06-15 02:17:01.673039 26.794551 Delayed
1018 136 1 North 17 KNG1 2017-06-15 01:51:14 2017-06-15 02:17:12.133839 25.968897 Delayed
1019 136 1 North 20 COL1 2017-06-15 01:49:12 2017-06-15 02:17:36.956506 28.415942 Delayed
1020 136 1 North 19 DUN1 2017-06-15 01:50:14 2017-06-15 02:17:46.945039 27.549084 Delayed
1021 136 1 North 18 QUN1 2017-06-15 01:51:14 2017-06-15 02:18:01.673039 26.794551 Delayed
1022 136 1 North 21 WEL1 2017-06-15 01:49:15 2017-06-15 02:18:35.015972 29.333600 Delayed
1023 136 1 North 20 COL1 2017-06-15 01:50:14 2017-06-15 02:18:38.956506 28.415942 Delayed
1024 136 1 North 19 DUN1 2017-06-15 01:51:14 2017-06-15 02:18:46.945039 27.549084 Delayed
1025 136 1 North 22 BLO1 2017-06-15 01:49:15 2017-06-15 02:19:32.276772 30.287946 Delayed
1026 136 1 North 21 WEL1 2017-06-15 01:50:14 2017-06-15 02:19:34.015972 29.333600 Delayed
1027 136 1 North 20 COL1 2017-06-15 01:51:14 2017-06-15 02:19:38.956506 28.415942 Delayed
1028 136 1 North 22 BLO1 2017-06-15 01:50:14 2017-06-15 02:20:31.276772 30.287946 Delayed
1029 136 1 North 21 WEL1 2017-06-15 01:51:15 2017-06-15 02:20:35.015972 29.333600 Delayed
1030 136 1 North 23 ROS1 2017-06-15 01:49:15 2017-06-15 02:20:51.991172 31.616520 Delayed
1031 136 1 North 22 BLO1 2017-06-15 01:51:16 2017-06-15 02:21:33.276772 30.287946 Delayed
1032 136 1 North 23 ROS1 2017-06-15 01:50:14 2017-06-15 02:21:50.991172 31.616520 Delayed
1033 136 1 North 23 ROS1 2017-06-15 01:51:15 2017-06-15 02:22:51.991172 31.616520 Delayed
1034 136 1 North 24 SUM1 2017-06-15 01:50:14 2017-06-15 02:22:52.756239 32.645937 Delayed
1035 136 1 North 24 SUM1 2017-06-15 01:51:15 2017-06-15 02:23:53.756239 32.645937 Delayed

135 rows × 9 columns

Lucky for us, train 136 is delayed a second time in our day, around 12:55.


In [12]:
train_136[(train_136['create_date'] > datetime.datetime(2017, 6, 14, 12, 50)) & (train_136['create_date'] < datetime.datetime(2017, 6, 14, 13, 30))]


Out[12]:
trainid lineid traindirection stationid station_char create_date expected_arrival timint train_message
280 136 1 South 28 LAW2 2017-06-14 12:50:07 2017-06-14 12:50:07.000000 0.000000 AtStation
281 136 1 South 27 EGL2 2017-06-14 12:53:08 2017-06-14 12:53:08.000000 0.000000 AtStation
282 136 1 South 27 EGL2 2017-06-14 12:54:08 2017-06-14 12:54:08.000000 0.000000 AtStation
283 136 1 South 27 EGL2 2017-06-14 12:55:07 2017-06-14 12:55:07.000000 0.000000 Delayed
284 136 1 South 26 DAV2 2017-06-14 12:56:06 2017-06-14 12:56:21.333326 0.255555 Arriving
285 136 1 South 26 DAV2 2017-06-14 12:55:07 2017-06-14 12:56:48.080847 1.684681 Delayed
286 136 1 South 25 STC2 2017-06-14 12:58:07 2017-06-14 12:58:07.000000 0.000000 AtStation
287 136 1 South 25 STC2 2017-06-14 12:55:07 2017-06-14 12:58:56.361371 3.822690 Delayed
288 136 1 South 24 SUM2 2017-06-14 12:59:06 2017-06-14 12:59:49.855458 0.730924 Arriving
289 136 1 South 24 SUM2 2017-06-14 12:55:07 2017-06-14 13:00:24.432122 5.290535 Delayed
290 136 1 South 23 ROS2 2017-06-14 13:00:06 2017-06-14 13:00:54.358952 0.805983 Arriving
291 136 1 South 23 ROS2 2017-06-14 13:01:07 2017-06-14 13:01:07.000000 0.000000 AtStation
292 136 1 South 23 ROS2 2017-06-14 12:55:07 2017-06-14 13:01:28.091075 6.351518 Delayed
293 136 1 South 22 BLO2 2017-06-14 13:02:10 2017-06-14 13:02:10.000000 0.000000 AtStation
294 136 1 South 22 BLO2 2017-06-14 12:55:07 2017-06-14 13:02:51.336654 7.738944 Delayed
295 136 1 South 22 BLO2 2017-06-14 13:03:04 2017-06-14 13:03:04.000000 0.000000 AtStation
296 136 1 South 21 WEL2 2017-06-14 13:02:09 2017-06-14 13:03:08.584362 0.993073 Arriving
297 136 1 South 21 WEL2 2017-06-14 12:55:07 2017-06-14 13:03:50.921016 8.732017 Delayed
298 136 1 South 21 WEL2 2017-06-14 13:03:04 2017-06-14 13:04:03.584362 0.993073 Arriving
299 136 1 South 22 BLO2 2017-06-14 13:04:07 2017-06-14 13:04:07.000000 0.000000 Delayed
300 136 1 South 20 COL2 2017-06-14 12:55:07 2017-06-14 13:04:48.932730 9.698879 Delayed
301 136 1 South 21 WEL2 2017-06-14 13:04:07 2017-06-14 13:05:06.584362 0.993073 Delayed
302 136 1 South 21 WEL2 2017-06-14 13:05:13 2017-06-14 13:05:13.000000 0.000000 AtStation
303 136 1 South 19 DUN2 2017-06-14 12:55:07 2017-06-14 13:05:42.977079 10.599618 Delayed
304 136 1 South 20 COL2 2017-06-14 13:04:07 2017-06-14 13:06:04.596076 1.959935 Delayed
305 136 1 South 20 COL2 2017-06-14 13:05:13 2017-06-14 13:06:11.011713 0.966862 Arriving
306 136 1 South 20 COL2 2017-06-14 13:06:05 2017-06-14 13:06:14.042731 0.150712 Arriving
307 136 1 South 18 QUN2 2017-06-14 12:55:07 2017-06-14 13:06:30.051734 11.384196 Delayed
308 136 1 South 19 DUN2 2017-06-14 13:04:07 2017-06-14 13:06:58.640425 2.860674 Delayed
309 136 1 South 17 KNG2 2017-06-14 12:55:07 2017-06-14 13:07:21.486916 12.241449 Delayed
... ... ... ... ... ... ... ... ... ...
322 136 1 South 14 OSG2 2017-06-14 13:04:07 2017-06-14 13:11:08.417736 7.023629 Delayed
323 136 1 South 13 STP2 2017-06-14 13:04:07 2017-06-14 13:11:55.313681 7.805228 Delayed
324 136 1 South 15 STA2 2017-06-14 13:11:08 2017-06-14 13:12:05.833003 0.963883 Arriving
325 136 1 South 16 UNI2 2017-06-14 13:12:07 2017-06-14 13:12:07.000000 0.000000 AtStation
326 136 1 South 15 STA2 2017-06-14 13:12:07 2017-06-14 13:13:04.833003 0.963883 Arriving
327 136 1 South 15 STA2 2017-06-14 13:13:08 2017-06-14 13:13:48.031062 0.667184 Arriving
328 136 1 South 15 STA2 2017-06-14 13:14:06 2017-06-14 13:14:06.000000 0.000000 AtStation
329 136 1 South 14 OSG2 2017-06-14 13:14:06 2017-06-14 13:14:49.178775 0.719646 Arriving
330 136 1 North 9 SPA1 2017-06-14 13:17:09 2017-06-14 13:17:56.215208 0.786920 Arriving
331 136 1 North 9 SPA1 2017-06-14 13:18:09 2017-06-14 13:18:09.000000 0.000000 AtStation
332 136 1 North 10 SGU1 2017-06-14 13:19:07 2017-06-14 13:19:40.686853 0.561448 Arriving
333 136 1 North 10 SGU1 2017-06-14 13:20:06 2017-06-14 13:20:06.000000 0.000000 AtStation
334 136 1 North 10 SGU1 2017-06-14 13:21:06 2017-06-14 13:21:06.000000 0.000000 AtStation
335 136 1 North 10 SGU1 2017-06-14 13:22:05 2017-06-14 13:22:05.000000 0.000000 Delayed
336 136 1 North 11 MUS1 2017-06-14 13:23:07 2017-06-14 13:23:14.094791 0.118247 Arriving
337 136 1 North 11 MUS1 2017-06-14 13:22:05 2017-06-14 13:23:20.266173 1.254436 Delayed
338 136 1 North 12 QPK1 2017-06-14 13:22:05 2017-06-14 13:24:39.365678 2.572761 Delayed
339 136 1 North 12 QPK1 2017-06-14 13:24:07 2017-06-14 13:24:49.425777 0.707096 Arriving
340 136 1 North 12 QPK1 2017-06-14 13:25:08 2017-06-14 13:25:08.000000 0.000000 AtStation
341 136 1 North 13 STP1 2017-06-14 13:22:05 2017-06-14 13:25:51.388263 3.773138 Delayed
342 136 1 North 13 STP1 2017-06-14 13:26:04 2017-06-14 13:26:04.000000 0.000000 AtStation
343 136 1 North 14 OSG1 2017-06-14 13:22:05 2017-06-14 13:26:38.105498 4.551758 Delayed
344 136 1 North 14 OSG1 2017-06-14 13:26:04 2017-06-14 13:26:50.717235 0.778621 Arriving
345 136 1 North 15 STA1 2017-06-14 13:22:05 2017-06-14 13:27:30.058939 5.417649 Delayed
346 136 1 North 15 STA1 2017-06-14 13:27:08 2017-06-14 13:27:44.653441 0.610891 Arriving
347 136 1 North 15 STA1 2017-06-14 13:28:08 2017-06-14 13:28:08.000000 0.000000 AtStation
348 136 1 North 16 UNI1 2017-06-14 13:22:05 2017-06-14 13:28:26.640972 6.360683 Delayed
349 136 1 North 16 UNI1 2017-06-14 13:28:08 2017-06-14 13:29:04.582033 0.943034 Arriving
350 136 1 North 17 KNG1 2017-06-14 13:22:05 2017-06-14 13:29:16.128214 7.185470 Delayed
351 136 1 North 16 UNI1 2017-06-14 13:29:06 2017-06-14 13:29:23.165105 0.286085 Arriving

72 rows × 9 columns

It seems like we could actually be fine if we just filtered out observations with Delayed and timint <1. The delayed records could be useful to store in a separate table for their own analysis, but they don't appear to really fill in the gaps here


In [13]:
train_136[(train_136['create_date'] > datetime.datetime(2017, 6, 14, 12, 50)) 
          & (train_136['create_date'] < datetime.datetime(2017, 6, 14, 13, 30))
          & ((train_136['train_message'] != 'Delayed') | (train_136['timint'] < 1.0 ))]


Out[13]:
trainid lineid traindirection stationid station_char create_date expected_arrival timint train_message
280 136 1 South 28 LAW2 2017-06-14 12:50:07 2017-06-14 12:50:07.000000 0.000000 AtStation
281 136 1 South 27 EGL2 2017-06-14 12:53:08 2017-06-14 12:53:08.000000 0.000000 AtStation
282 136 1 South 27 EGL2 2017-06-14 12:54:08 2017-06-14 12:54:08.000000 0.000000 AtStation
283 136 1 South 27 EGL2 2017-06-14 12:55:07 2017-06-14 12:55:07.000000 0.000000 Delayed
284 136 1 South 26 DAV2 2017-06-14 12:56:06 2017-06-14 12:56:21.333326 0.255555 Arriving
286 136 1 South 25 STC2 2017-06-14 12:58:07 2017-06-14 12:58:07.000000 0.000000 AtStation
288 136 1 South 24 SUM2 2017-06-14 12:59:06 2017-06-14 12:59:49.855458 0.730924 Arriving
290 136 1 South 23 ROS2 2017-06-14 13:00:06 2017-06-14 13:00:54.358952 0.805983 Arriving
291 136 1 South 23 ROS2 2017-06-14 13:01:07 2017-06-14 13:01:07.000000 0.000000 AtStation
293 136 1 South 22 BLO2 2017-06-14 13:02:10 2017-06-14 13:02:10.000000 0.000000 AtStation
295 136 1 South 22 BLO2 2017-06-14 13:03:04 2017-06-14 13:03:04.000000 0.000000 AtStation
296 136 1 South 21 WEL2 2017-06-14 13:02:09 2017-06-14 13:03:08.584362 0.993073 Arriving
298 136 1 South 21 WEL2 2017-06-14 13:03:04 2017-06-14 13:04:03.584362 0.993073 Arriving
299 136 1 South 22 BLO2 2017-06-14 13:04:07 2017-06-14 13:04:07.000000 0.000000 Delayed
301 136 1 South 21 WEL2 2017-06-14 13:04:07 2017-06-14 13:05:06.584362 0.993073 Delayed
302 136 1 South 21 WEL2 2017-06-14 13:05:13 2017-06-14 13:05:13.000000 0.000000 AtStation
305 136 1 South 20 COL2 2017-06-14 13:05:13 2017-06-14 13:06:11.011713 0.966862 Arriving
306 136 1 South 20 COL2 2017-06-14 13:06:05 2017-06-14 13:06:14.042731 0.150712 Arriving
310 136 1 South 19 DUN2 2017-06-14 13:07:07 2017-06-14 13:07:31.947930 0.415799 Arriving
312 136 1 South 19 DUN2 2017-06-14 13:08:06 2017-06-14 13:08:06.000000 0.000000 AtStation
314 136 1 South 18 QUN2 2017-06-14 13:08:06 2017-06-14 13:08:53.074655 0.784578 Arriving
315 136 1 South 18 QUN2 2017-06-14 13:09:07 2017-06-14 13:09:07.000000 0.000000 AtStation
317 136 1 South 17 KNG2 2017-06-14 13:09:07 2017-06-14 13:09:58.435182 0.857253 Arriving
318 136 1 South 17 KNG2 2017-06-14 13:10:07 2017-06-14 13:10:07.000000 0.000000 AtStation
320 136 1 South 16 UNI2 2017-06-14 13:10:07 2017-06-14 13:10:57.255695 0.837595 Arriving
321 136 1 South 16 UNI2 2017-06-14 13:11:08 2017-06-14 13:11:08.000000 0.000000 AtStation
324 136 1 South 15 STA2 2017-06-14 13:11:08 2017-06-14 13:12:05.833003 0.963883 Arriving
325 136 1 South 16 UNI2 2017-06-14 13:12:07 2017-06-14 13:12:07.000000 0.000000 AtStation
326 136 1 South 15 STA2 2017-06-14 13:12:07 2017-06-14 13:13:04.833003 0.963883 Arriving
327 136 1 South 15 STA2 2017-06-14 13:13:08 2017-06-14 13:13:48.031062 0.667184 Arriving
328 136 1 South 15 STA2 2017-06-14 13:14:06 2017-06-14 13:14:06.000000 0.000000 AtStation
329 136 1 South 14 OSG2 2017-06-14 13:14:06 2017-06-14 13:14:49.178775 0.719646 Arriving
330 136 1 North 9 SPA1 2017-06-14 13:17:09 2017-06-14 13:17:56.215208 0.786920 Arriving
331 136 1 North 9 SPA1 2017-06-14 13:18:09 2017-06-14 13:18:09.000000 0.000000 AtStation
332 136 1 North 10 SGU1 2017-06-14 13:19:07 2017-06-14 13:19:40.686853 0.561448 Arriving
333 136 1 North 10 SGU1 2017-06-14 13:20:06 2017-06-14 13:20:06.000000 0.000000 AtStation
334 136 1 North 10 SGU1 2017-06-14 13:21:06 2017-06-14 13:21:06.000000 0.000000 AtStation
335 136 1 North 10 SGU1 2017-06-14 13:22:05 2017-06-14 13:22:05.000000 0.000000 Delayed
336 136 1 North 11 MUS1 2017-06-14 13:23:07 2017-06-14 13:23:14.094791 0.118247 Arriving
339 136 1 North 12 QPK1 2017-06-14 13:24:07 2017-06-14 13:24:49.425777 0.707096 Arriving
340 136 1 North 12 QPK1 2017-06-14 13:25:08 2017-06-14 13:25:08.000000 0.000000 AtStation
342 136 1 North 13 STP1 2017-06-14 13:26:04 2017-06-14 13:26:04.000000 0.000000 AtStation
344 136 1 North 14 OSG1 2017-06-14 13:26:04 2017-06-14 13:26:50.717235 0.778621 Arriving
346 136 1 North 15 STA1 2017-06-14 13:27:08 2017-06-14 13:27:44.653441 0.610891 Arriving
347 136 1 North 15 STA1 2017-06-14 13:28:08 2017-06-14 13:28:08.000000 0.000000 AtStation
349 136 1 North 16 UNI1 2017-06-14 13:28:08 2017-06-14 13:29:04.582033 0.943034 Arriving
351 136 1 North 16 UNI1 2017-06-14 13:29:06 2017-06-14 13:29:23.165105 0.286085 Arriving

Coincidentally, this period of time also features a short-turn, at 13:15, and we want to identify distinct trips (where trains turn around, either at the end of the usual run, or early). This should be relatively easy to implement with the traindirection column


In [9]:
split_trips = '''CREATE SEQUENCE IF NOT EXISTS trip_ids;
                CREATE MATERIALIZED VIEW test_day_w_trips AS
                SELECT trainid, lineid, traindirection, stationid, station_char, create_date, create_date + timint * interval '1 minute' AS expected_arrival, timint, train_message,
                CASE traindirection WHEN lag(traindirection) OVER w THEN currval('trip_ids') ELSE nextval('trip_ids') END AS trip_id
        FROM test_day
        WHERE (timint < 1 OR train_message = 'AtStation') 
        WINDOW w AS (PARTITION BY lineid, trainid ORDER BY create_date + timint * interval '1 minute') 
        '''
with con:
    with con.cursor() as cur:
        cur.execute(split_trips)

A final step is to group together multiple observations at a same station, during a same trip, to get an approximation of arrival and "departure" time.


In [10]:
final_step = ''' DROP MATERIALIZED VIEW IF EXISTS test_day_final;
CREATE MATERIALIZED VIEW test_day_final AS 
SELECT trainid, lineid, traindirection, stationid, station_char, trip_id,
 MIN(expected_arrival) AS estimated_arrival, MAX(expected_arrival) AS estimated_departure,
 CASE (ARRAY_AGG(train_message ORDER BY expected_arrival))[1] WHEN 'AtStation' THEN 1 ELSE 0 END AS exact_arr, 
CASE (ARRAY_AGG(train_message ORDER BY expected_arrival DESC))[1] WHEN 'AtStation' THEN 1 ELSE 0 END AS exact_dep
 FROM test_day_w_trips 
 GROUP BY trainid, lineid, traindirection, stationid, station_char, trip_id 
'''
with con:
    with con.cursor() as cur:
        cur.execute(final_step)

Woo! Now to test how well this process did


In [11]:
cnt = '''SELECT COUNT(*) FROM test_day_final'''
with con.cursor() as cur:
    cur.execute(cnt)
    print('The number of station stops made is', cur.fetchone()[0])


The number of station stops made is 53958

Huh. 5k higher than the scheduled number of station stops


In [12]:
sql = ''' WITH observed_trips AS 
    (SELECT lineid::TEXT, COUNT(DISTINCT trip_id) AS "Number of observed trips"
     FROM test_day_final
     GROUP BY lineid)
     , unique_trips AS(SELECT route_short_name AS lineid, COUNT(DISTINCT trip_id) AS "Number of scheduled trips"
            FROM gtfs.routes -- ON lineid::TEXT = route_short_name
            INNER JOIN gtfs.trips USING (route_id)
            INNER JOIN gtfs.calendar USING (service_id)
            WHERE monday AND route_type = 1 AND route_short_name != '3'
            GROUP BY route_short_name)
            
    SELECT *
        FROM observed_trips
        INNER JOIN unique_trips USING (lineid)
        ORDER BY lineid'''
pandasql.read_sql(sql, con)


Out[12]:
lineid Number of observed trips Number of scheduled trips
0 1 1910 747
1 2 1410 704
2 4 1103 457

Welp. Seems like that trip identification code is a little too basic. Let's try to investigate what might be the problem:


In [13]:
sql = '''WITH trips AS(SELECT lineid, trip_id, EXTRACT('minutes' FROM MAX(estimated_arrival) - MIN(estimated_arrival)) as trip_duration
FROM test_day_final
GROUP BY lineid, trip_id)
SELECT *
FROM trips
ORDER BY trip_duration
LIMIT 10'''
pandasql.read_sql(sql, con)


Out[13]:
lineid trip_id trip_duration
0 4 15100 0.0
1 4 14892 0.0
2 4 14912 0.0
3 4 15188 0.0
4 2 13313 0.0
5 4 15017 0.0
6 4 15121 0.0
7 4 15219 0.0
8 2 13827 0.0
9 4 14714 0.0

In [18]:
sql = '''WITH one_stop_trips AS(SELECT lineid, trip_id, ARRAY_AGG(station_char) AS stations
FROM test_day_final
GROUP BY lineid, trip_id
HAVING COUNT(1) = 1)
SELECT lineid, unnest(stations) as station_char, COUNT(1) "Number of Trips"
FROM one_stop_trips
GROUP BY lineid, station_char
ORDER BY lineid, "Number of Trips" DESC'''

pandasql.read_sql(sql, con)


Out[18]:
lineid station_char Number of Trips
0 1 FIN2 139
1 1 FIN1 139
2 1 YIE1 127
3 1 SHW1 126
4 1 YIE2 124
5 1 SHW2 111
6 1 BSP2 5
7 1 YNG2 4
8 1 YNG1 3
9 1 SPA1 2
10 1 SPA2 1
11 1 SGL2 1
12 1 STC2 1
13 1 QUN2 1
14 1 WIL1 1
15 2 KEN2 122
16 2 KIP1 121
17 2 KEN1 120
18 2 KIP2 118
19 2 BLO2 40
20 2 BLO1 18
21 2 SPA1 4
22 2 SPA2 4
23 2 SGU1 3
24 2 SGU2 3
25 2 KEL2 1
26 2 OSS1 1
27 2 CHE1 1
28 4 SHP2 335
29 4 SHP1 332
30 4 YIE1 63
31 4 YIE2 61
32 4 DML2 25
33 4 DML1 24

So for the most part we seem to have issues with identifying trip start/end at the termini.

Line One stop Trips at Termini
1 766
2 481
4 791

So approximately half of "extra trips" are one stop trips at termini. Let's see the overall distribution of number of stops for each trip we've inferred and compare that to the ideal.


In [14]:
sql = '''WITH inferred_trips AS(SELECT lineid, trip_id, COUNT(1) as stops
FROM test_day_final
GROUP BY lineid, trip_id
),
inferred_trip_length AS( SELECT lineid, stops, COUNT(trip_id) as obs_trips
FROM inferred_trips
GROUP BY lineid, stops)
,
gtfs_trip_lengths AS(SELECT route_short_name::INT AS lineid, trip_id, COUNT(1) as stops
            FROM gtfs.stop_times 
            INNER JOIN gtfs.trips USING (trip_id)
            INNER JOIN gtfs.routes USING (route_id)
            INNER JOIN gtfs.calendar USING (service_id)
            WHERE monday AND route_type = 1 AND route_short_name != '3'
            GROUP BY route_short_name, trip_id
)
,gtfs_trip_length_distro AS (SELECT lineid, stops, COUNT(trip_id) as num_trips
FROM gtfs_trip_lengths
GROUP BY lineid, stops)

SELECT lineid, stops, COALESCE(num_trips,0) as scheduled, COUNT(inferred_trips.trip_id) as observed 
FROM inferred_trips
FULL OUTER JOIN gtfs_trip_length_distro USING (lineid, stops)
GROUP BY lineid, stops, num_trips
ORDER BY lineid, stops
'''
trip_lengths = pandasql.read_sql(sql, con)

In [15]:
line_one = trip_lengths[trip_lengths['lineid'] == 1]
fig, ax = plt.subplots(figsize=(16,9))
line_one.plot(x='stops', y='scheduled', kind='bar', ax=ax,position=0, color='red')
line_one.plot(x='stops', y='observed', sharey=True, sharex=True, kind='bar', ax=ax, position=1, color='blue')
ax.set_title('Line 1 Distribution of Trip Lengths')
ax.yaxis.set_label('Number of trips')


So we are certainly getting 1, 2, and 3 stop trips that shouldn't exist and undercounting the more appropriate number of trips. The one-stop trips are primarily at termini. What is happening with the 2, 3 stop trips...?


In [16]:
sql_2_3 = '''SELECT stops, COUNT(1)
             FROM (
                 SELECT array_agg(station_char ORDER BY estimated_arrival) AS stops
                 FROM test_day_final
                 WHERE lineid = 1
                 GROUP BY trip_id 
                 HAVING COUNT(1) =2 OR COUNT(1) = 3
                 )grouped_trips
             GROUP BY stops
             ORDER BY COUNT(1) DESC
             LIMIT 10
             '''

pandasql.read_sql(sql_2_3, con)


Out[16]:
stops count
0 [YNG2, SGL2, BSP2] 343
1 [BSP1, SGL1, YNG1] 334
2 [YIE1, YIE2] 65
3 [YIE2, YIE1] 22
4 [VMC1, VMC2] 16
5 [VMC2, VMC1] 13
6 [STA1, UNI1] 7
7 [FIN2, FIN1] 7
8 [FIN1, FIN2] 7
9 [EGL2, DAV2] 5

The top "trips" are from Bloor-Spadina to Yonge via St. George and vice-versa, but these are the stations on line 2...

Trying to group arrival-departure times first

Assuming a maximum time for any given train to dwell at a station, we will test this procedure on lines 1-2 first.


In [ ]:
sql = ''' CREATE MATERIALIZED VIEW test_day_stop_arrival AS
                SELECT trainid, lineid, traindirection, stationid, station_char, 
                MIN(create_date + timint * interval '1 minute') AS expected_arrival, timint, train_message,
                
        FROM test_day
        WHERE (timint < 1 OR train_message = 'AtStation')
        GROUP BY trainid, lineid, traindirection, stationid, station_char,  '''