Exploratory Data Analysis with Python

We will explore the NYC MTA turnstile data set. These data files come from the New York City subway: they track the cumulative entries and exits registered at each turnstile, recorded at regular audit intervals (typically every four hours) throughout the day.

Here is an example of what you could do with the data: James Kao investigates how subway ridership is affected by the incidence of rain.

Exercise 1

  • Download at least two weeks' worth of MTA turnstile data (you can do this manually or via Python)
  • Open a file, use the csv module to read it, and build a Python dict with a key for each (C/A, UNIT, SCP, STATION) tuple -- these are the first four columns. The value for each key should be a list of lists, where each inner list holds the remaining columns of a row. For example, one key-value pair should look like
    {    ('A002','R051','02-00-00','LEXINGTON AVE'):    
         [
           ['NQR456', 'BMT', '01/03/2015', '03:00:00', 'REGULAR', '0004945474', '0001675324'],          
             ['NQR456', 'BMT', '01/03/2015', '07:00:00', 'REGULAR', '0004945478', '0001675333'],  
            ['NQR456', 'BMT', '01/03/2015', '11:00:00', 'REGULAR', '0004945515', '0001675364'],
          ...   
     ] 
    }

Store all the weeks in a data structure of your choosing
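
The download step can be automated. A minimal sketch, assuming the MTA's weekly file-naming convention turnstile_YYMMDD.txt (one file per week, named by its Saturday end date; weekly_urls is a hypothetical helper -- verify the URL pattern on the MTA developer data page before relying on it):

```python
from datetime import date, timedelta

# URL pattern observed for the weekly MTA turnstile files (an assumption --
# check the MTA developer data page before relying on it)
BASE = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_{:%y%m%d}.txt'

def weekly_urls(last_saturday, n_weeks=2):
    """Build the URLs of the n most recent weekly files, newest first."""
    return [BASE.format(last_saturday - timedelta(weeks=i))
            for i in range(n_weeks)]

urls = weekly_urls(date(2017, 8, 26), n_weeks=2)
# each URL can then be fetched with urllib.request.urlretrieve(url, local_path)
```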


In [2]:
import csv
import os

Field Description

  • C/A = Control Area (A002)
  • UNIT = Remote Unit for a station (R051)
  • SCP = Subunit Channel Position represents a specific address for a device (02-00-00)
  • STATION = Represents the station name the device is located at
  • LINENAME = Represents all train lines that can be boarded at this station. Normally lines are represented by one character. LINENAME 456NQR represents service for the 4, 5, 6, N, Q, and R trains.
  • DIVISION = Represents the division the station originally belonged to: BMT, IRT, or IND
  • DATE = Represents the date (MM/DD/YYYY)
  • TIME = Represents the time (hh:mm:ss) for a scheduled audit event
  • DESC = Represents the "REGULAR" scheduled audit event (normally occurs every 4 hours)
    1. Audits may occur at intervals other than 4 hours due to planning or troubleshooting activities.
    2. Additionally, there may be a "RECOVR AUD" entry: This refers to a missed audit that was recovered.
  • ENTRIES = The cumulative entry register value for a device
  • EXITS = The cumulative exit register value for a device
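
Putting the field list together with a sample row: a small sketch that zips the column names onto one parsed line (the sample values are taken from the output further below; the FIELDS list is an assumption matching the descriptions above).

```python
import csv
from io import StringIO

# column names in file order, per the field descriptions above
FIELDS = ['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION',
          'DATE', 'TIME', 'DESC', 'ENTRIES', 'EXITS']

sample = ('A002,R051,02-00-00,59 ST,NQR456W,BMT,'
          '08/12/2017,00:00:00,REGULAR,0006287419,0002130235')
record = dict(zip(FIELDS, next(csv.reader(StringIO(sample)))))

record['STATION']       # -> '59 ST'
int(record['ENTRIES'])  # register values are strings; cast before arithmetic
```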

In [3]:
turnstile = {}

# looping through all files in data dir starting with MTA_Turnstile
for filename in os.listdir('data'):
    if filename.startswith('MTA_Turnstile'): 
        # reading file and writing each row in a dict
        with open(os.path.join('data', filename), newline='') as csvfile:
            mtareader = csv.reader(csvfile, delimiter=',')
            next(mtareader)
            for row in mtareader:
                key = (row[0], row[1], row[2], row[3])
                value = [row[4], row[5], row[6], row[7], row[8], row[9], row[10].rstrip()]
                if key in turnstile:
                    turnstile[key].append(value)
                else:
                    turnstile[key] = [value]

In [4]:
# test value for dict
test = ('A002','R051','02-00-00','59 ST')

In [5]:
turnstile[test]#[:2]


Out[5]:
[['NQR456W',
  'BMT',
  '08/12/2017',
  '00:00:00',
  'REGULAR',
  '0006287419',
  '0002130235'],
 ['NQR456W',
  'BMT',
  '08/12/2017',
  '04:00:00',
  'REGULAR',
  '0006287451',
  '0002130238'],
 ['NQR456W',
  'BMT',
  '08/12/2017',
  '08:00:00',
  'REGULAR',
  '0006287469',
  '0002130269'],
 ['NQR456W',
  'BMT',
  '08/12/2017',
  '12:00:00',
  'REGULAR',
  '0006287551',
  '0002130364'],
 ['NQR456W',
  'BMT',
  '08/12/2017',
  '16:00:00',
  'REGULAR',
  '0006287762',
  '0002130410'],
 ['NQR456W',
  'BMT',
  '08/12/2017',
  '20:00:00',
  'REGULAR',
  '0006287984',
  '0002130445'],
 ['NQR456W',
  'BMT',
  '08/13/2017',
  '00:00:00',
  'REGULAR',
  '0006288108',
  '0002130468'],
 ['NQR456W',
  'BMT',
  '08/13/2017',
  '04:00:00',
  'REGULAR',
  '0006288126',
  '0002130468'],
 ['NQR456W',
  'BMT',
  '08/13/2017',
  '08:00:00',
  'REGULAR',
  '0006288143',
  '0002130485'],
 ['NQR456W',
  'BMT',
  '08/13/2017',
  '12:00:00',
  'REGULAR',
  '0006288201',
  '0002130551'],
 ['NQR456W',
  'BMT',
  '08/13/2017',
  '16:00:00',
  'REGULAR',
  '0006288330',
  '0002130599'],
 ['NQR456W',
  'BMT',
  '08/13/2017',
  '20:00:00',
  'REGULAR',
  '0006288513',
  '0002130630'],
 ['NQR456W',
  'BMT',
  '08/14/2017',
  '00:00:00',
  'REGULAR',
  '0006288610',
  '0002130640'],
 ['NQR456W',
  'BMT',
  '08/14/2017',
  '04:00:00',
  'REGULAR',
  '0006288624',
  '0002130643'],
 ['NQR456W',
  'BMT',
  '08/14/2017',
  '08:00:00',
  'REGULAR',
  '0006288662',
  '0002130722'],
 ['NQR456W',
  'BMT',
  '08/14/2017',
  '12:00:00',
  'REGULAR',
  '0006288818',
  '0002130967'],
 ['NQR456W',
  'BMT',
  '08/14/2017',
  '16:00:00',
  'REGULAR',
  '0006289079',
  '0002131031'],
 ['NQR456W',
  'BMT',
  '08/14/2017',
  '20:00:00',
  'REGULAR',
  '0006289737',
  '0002131080'],
 ['NQR456W',
  'BMT',
  '08/15/2017',
  '00:00:00',
  'REGULAR',
  '0006289919',
  '0002131093'],
 ['NQR456W',
  'BMT',
  '08/15/2017',
  '04:00:00',
  'REGULAR',
  '0006289941',
  '0002131093'],
 ['NQR456W',
  'BMT',
  '08/15/2017',
  '08:00:00',
  'REGULAR',
  '0006289964',
  '0002131200'],
 ['NQR456W',
  'BMT',
  '08/15/2017',
  '12:00:00',
  'REGULAR',
  '0006290105',
  '0002131433'],
 ['NQR456W',
  'BMT',
  '08/15/2017',
  '16:00:00',
  'REGULAR',
  '0006290395',
  '0002131518'],
 ['NQR456W',
  'BMT',
  '08/15/2017',
  '20:00:00',
  'REGULAR',
  '0006291088',
  '0002131599'],
 ['NQR456W',
  'BMT',
  '08/16/2017',
  '00:00:00',
  'REGULAR',
  '0006291302',
  '0002131624'],
 ['NQR456W',
  'BMT',
  '08/16/2017',
  '04:00:00',
  'REGULAR',
  '0006291317',
  '0002131624'],
 ['NQR456W',
  'BMT',
  '08/16/2017',
  '08:00:00',
  'REGULAR',
  '0006291353',
  '0002131714'],
 ['NQR456W',
  'BMT',
  '08/16/2017',
  '12:00:00',
  'REGULAR',
  '0006291506',
  '0002131952'],
 ['NQR456W',
  'BMT',
  '08/16/2017',
  '16:00:00',
  'REGULAR',
  '0006291775',
  '0002132014'],
 ['NQR456W',
  'BMT',
  '08/16/2017',
  '20:00:00',
  'REGULAR',
  '0006292487',
  '0002132075'],
 ['NQR456W',
  'BMT',
  '08/17/2017',
  '00:00:00',
  'REGULAR',
  '0006292715',
  '0002132097'],
 ['NQR456W',
  'BMT',
  '08/17/2017',
  '04:00:00',
  'REGULAR',
  '0006292734',
  '0002132098'],
 ['NQR456W',
  'BMT',
  '08/17/2017',
  '08:00:00',
  'REGULAR',
  '0006292772',
  '0002132171'],
 ['NQR456W',
  'BMT',
  '08/17/2017',
  '12:00:00',
  'REGULAR',
  '0006292908',
  '0002132279'],
 ['NQR456W',
  'BMT',
  '08/17/2017',
  '16:00:00',
  'REGULAR',
  '0006293190',
  '0002132294'],
 ['NQR456W',
  'BMT',
  '08/17/2017',
  '20:00:00',
  'REGULAR',
  '0006293931',
  '0002132313'],
 ['NQR456W',
  'BMT',
  '08/18/2017',
  '00:00:00',
  'REGULAR',
  '0006294144',
  '0002132318'],
 ['NQR456W',
  'BMT',
  '08/18/2017',
  '04:00:00',
  'REGULAR',
  '0006294165',
  '0002132318'],
 ['NQR456W',
  'BMT',
  '08/18/2017',
  '08:00:00',
  'REGULAR',
  '0006294191',
  '0002132349'],
 ['NQR456W',
  'BMT',
  '08/18/2017',
  '12:00:00',
  'REGULAR',
  '0006294343',
  '0002132397'],
 ['NQR456W',
  'BMT',
  '08/18/2017',
  '16:00:00',
  'REGULAR',
  '0006294622',
  '0002132424'],
 ['NQR456W',
  'BMT',
  '08/18/2017',
  '20:00:00',
  'REGULAR',
  '0006295244',
  '0002132442'],
 ['NQR456W',
  'BMT',
  '08/19/2017',
  '00:00:00',
  'REGULAR',
  '0006295409',
  '0002132450'],
 ['NQR456W',
  'BMT',
  '08/19/2017',
  '04:00:00',
  'REGULAR',
  '0006295425',
  '0002132451'],
 ['NQR456W',
  'BMT',
  '08/19/2017',
  '08:00:00',
  'REGULAR',
  '0006295438',
  '0002132457'],
 ['NQR456W',
  'BMT',
  '08/19/2017',
  '12:00:00',
  'REGULAR',
  '0006295513',
  '0002132471'],
 ['NQR456W',
  'BMT',
  '08/19/2017',
  '16:00:00',
  'REGULAR',
  '0006295684',
  '0002132482'],
 ['NQR456W',
  'BMT',
  '08/19/2017',
  '20:00:00',
  'REGULAR',
  '0006295904',
  '0002132493'],
 ['NQR456W',
  'BMT',
  '08/20/2017',
  '00:00:00',
  'REGULAR',
  '0006296039',
  '0002132503'],
 ['NQR456W',
  'BMT',
  '08/20/2017',
  '04:00:00',
  'REGULAR',
  '0006296053',
  '0002132503'],
 ['NQR456W',
  'BMT',
  '08/20/2017',
  '08:00:00',
  'REGULAR',
  '0006296064',
  '0002132509'],
 ['NQR456W',
  'BMT',
  '08/20/2017',
  '12:00:00',
  'REGULAR',
  '0006296143',
  '0002132521'],
 ['NQR456W',
  'BMT',
  '08/20/2017',
  '16:00:00',
  'REGULAR',
  '0006296272',
  '0002132529'],
 ['NQR456W',
  'BMT',
  '08/20/2017',
  '20:00:00',
  'REGULAR',
  '0006296418',
  '0002132541'],
 ['NQR456W',
  'BMT',
  '08/21/2017',
  '00:00:00',
  'REGULAR',
  '0006296534',
  '0002132542'],
 ['NQR456W',
  'BMT',
  '08/21/2017',
  '04:00:00',
  'REGULAR',
  '0006296543',
  '0002132542'],
 ['NQR456W',
  'BMT',
  '08/21/2017',
  '08:00:00',
  'REGULAR',
  '0006296575',
  '0002132564'],
 ['NQR456W',
  'BMT',
  '08/21/2017',
  '12:00:00',
  'REGULAR',
  '0006296729',
  '0002132613'],
 ['NQR456W',
  'BMT',
  '08/21/2017',
  '16:00:00',
  'REGULAR',
  '0006296965',
  '0002132625'],
 ['NQR456W',
  'BMT',
  '08/21/2017',
  '20:00:00',
  'REGULAR',
  '0006297665',
  '0002132638'],
 ['NQR456W',
  'BMT',
  '08/22/2017',
  '00:00:00',
  'REGULAR',
  '0006297828',
  '0002132643'],
 ['NQR456W',
  'BMT',
  '08/22/2017',
  '04:00:00',
  'REGULAR',
  '0006297837',
  '0002132643'],
 ['NQR456W',
  'BMT',
  '08/22/2017',
  '08:00:00',
  'REGULAR',
  '0006297863',
  '0002132661'],
 ['NQR456W',
  'BMT',
  '08/22/2017',
  '12:00:00',
  'REGULAR',
  '0006297978',
  '0002132719'],
 ['NQR456W',
  'BMT',
  '08/22/2017',
  '16:00:00',
  'REGULAR',
  '0006298217',
  '0002132730'],
 ['NQR456W',
  'BMT',
  '08/22/2017',
  '20:00:00',
  'REGULAR',
  '0006298845',
  '0002132741'],
 ['NQR456W',
  'BMT',
  '08/23/2017',
  '00:00:00',
  'REGULAR',
  '0006299015',
  '0002132756'],
 ['NQR456W',
  'BMT',
  '08/23/2017',
  '04:00:00',
  'REGULAR',
  '0006299021',
  '0002132757'],
 ['NQR456W',
  'BMT',
  '08/23/2017',
  '08:00:00',
  'REGULAR',
  '0006299064',
  '0002132776'],
 ['NQR456W',
  'BMT',
  '08/23/2017',
  '12:00:00',
  'REGULAR',
  '0006299225',
  '0002132815'],
 ['NQR456W',
  'BMT',
  '08/23/2017',
  '16:00:00',
  'REGULAR',
  '0006299485',
  '0002132834'],
 ['NQR456W',
  'BMT',
  '08/23/2017',
  '20:00:00',
  'REGULAR',
  '0006300157',
  '0002132848'],
 ['NQR456W',
  'BMT',
  '08/24/2017',
  '00:00:00',
  'REGULAR',
  '0006300372',
  '0002132854'],
 ['NQR456W',
  'BMT',
  '08/24/2017',
  '04:00:00',
  'REGULAR',
  '0006300395',
  '0002132854'],
 ['NQR456W',
  'BMT',
  '08/24/2017',
  '08:00:00',
  'REGULAR',
  '0006300438',
  '0002132941'],
 ['NQR456W',
  'BMT',
  '08/24/2017',
  '12:00:00',
  'REGULAR',
  '0006300594',
  '0002133154'],
 ['NQR456W',
  'BMT',
  '08/24/2017',
  '16:00:00',
  'REGULAR',
  '0006300885',
  '0002133214'],
 ['NQR456W',
  'BMT',
  '08/24/2017',
  '20:00:00',
  'REGULAR',
  '0006301655',
  '0002133274'],
 ['NQR456W',
  'BMT',
  '08/25/2017',
  '00:00:00',
  'REGULAR',
  '0006301850',
  '0002133292'],
 ['NQR456W',
  'BMT',
  '08/25/2017',
  '04:00:00',
  'REGULAR',
  '0006301868',
  '0002133293'],
 ['NQR456W',
  'BMT',
  '08/25/2017',
  '08:00:00',
  'REGULAR',
  '0006301903',
  '0002133378'],
 ['NQR456W',
  'BMT',
  '08/25/2017',
  '12:00:00',
  'REGULAR',
  '0006302032',
  '0002133601'],
 ['NQR456W',
  'BMT',
  '08/25/2017',
  '16:00:00',
  'REGULAR',
  '0006302352',
  '0002133657'],
 ['NQR456W',
  'BMT',
  '08/25/2017',
  '20:00:00',
  'REGULAR',
  '0006303007',
  '0002133711']]

Exercise 2

  • Let's turn this into a time series.

    For each key (basically the control area, unit, device address and station of a specific turnstile), keep a list again, but let each element of the list hold just the point in time and the cumulative count of entries.

This basically means keeping only the date, time, and entries fields in each list. You can convert the date and time into datetime objects -- that is a Python class that represents a point in time. You can combine the date and time fields into a single string and use the dateutil module to convert it into a datetime object.

Your new dict should look something like

{    ('A002','R051','02-00-00','LEXINGTON AVE'):    
         [
            [datetime.datetime(2013, 3, 2, 3, 0), 3788],
            [datetime.datetime(2013, 3, 2, 7, 0), 2585],
            [datetime.datetime(2013, 3, 2, 12, 0), 10653],
            [datetime.datetime(2013, 3, 2, 17, 0), 11016],
            [datetime.datetime(2013, 3, 2, 23, 0), 10666],
            [datetime.datetime(2013, 3, 3, 3, 0), 10814],
            [datetime.datetime(2013, 3, 3, 7, 0), 10229],
            ...
          ],
 ....
 }
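
dateutil's parse handles the combined string as described; since the format here is fixed, the standard library's datetime.strptime with an explicit format string is an equally workable (and typically faster) alternative. A sketch:

```python
from datetime import datetime

# DATE and TIME fields from one row of the data set
d, t = '08/12/2017', '04:00:00'
ts = datetime.strptime(d + ' ' + t, '%m/%d/%Y %H:%M:%S')
# ts == datetime.datetime(2017, 8, 12, 4, 0)
```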

In [6]:
import numpy as np
import datetime
from dateutil.parser import parse

In [7]:
# Unlike the reference solutions, I converted the cumulative entries into the
# number of entries in each period. That should be fine, since it is required below anyway.

turnstile_timeseries = {}

# loop through each key, parse the timestamp, and take the difference
# between the previous and current cumulative counts
for key in turnstile:
    prev = np.nan  # no previous count exists for the first reading, so its difference is nan
    value = []
    for el in turnstile[key]:
        value.append([parse(el[2] + ' ' + el[3]), int(el[5]) - prev])
        prev = int(el[5])
    turnstile_timeseries[key] = value  # keys are unique here, so a plain assignment suffices

In [8]:
turnstile_timeseries[test]#[:5]
# ('R305', 'R206', '01-00-00','125 ST')


Out[8]:
[[datetime.datetime(2017, 8, 12, 0, 0), nan],
 [datetime.datetime(2017, 8, 12, 4, 0), 32],
 [datetime.datetime(2017, 8, 12, 8, 0), 18],
 [datetime.datetime(2017, 8, 12, 12, 0), 82],
 [datetime.datetime(2017, 8, 12, 16, 0), 211],
 [datetime.datetime(2017, 8, 12, 20, 0), 222],
 [datetime.datetime(2017, 8, 13, 0, 0), 124],
 [datetime.datetime(2017, 8, 13, 4, 0), 18],
 [datetime.datetime(2017, 8, 13, 8, 0), 17],
 [datetime.datetime(2017, 8, 13, 12, 0), 58],
 [datetime.datetime(2017, 8, 13, 16, 0), 129],
 [datetime.datetime(2017, 8, 13, 20, 0), 183],
 [datetime.datetime(2017, 8, 14, 0, 0), 97],
 [datetime.datetime(2017, 8, 14, 4, 0), 14],
 [datetime.datetime(2017, 8, 14, 8, 0), 38],
 [datetime.datetime(2017, 8, 14, 12, 0), 156],
 [datetime.datetime(2017, 8, 14, 16, 0), 261],
 [datetime.datetime(2017, 8, 14, 20, 0), 658],
 [datetime.datetime(2017, 8, 15, 0, 0), 182],
 [datetime.datetime(2017, 8, 15, 4, 0), 22],
 [datetime.datetime(2017, 8, 15, 8, 0), 23],
 [datetime.datetime(2017, 8, 15, 12, 0), 141],
 [datetime.datetime(2017, 8, 15, 16, 0), 290],
 [datetime.datetime(2017, 8, 15, 20, 0), 693],
 [datetime.datetime(2017, 8, 16, 0, 0), 214],
 [datetime.datetime(2017, 8, 16, 4, 0), 15],
 [datetime.datetime(2017, 8, 16, 8, 0), 36],
 [datetime.datetime(2017, 8, 16, 12, 0), 153],
 [datetime.datetime(2017, 8, 16, 16, 0), 269],
 [datetime.datetime(2017, 8, 16, 20, 0), 712],
 [datetime.datetime(2017, 8, 17, 0, 0), 228],
 [datetime.datetime(2017, 8, 17, 4, 0), 19],
 [datetime.datetime(2017, 8, 17, 8, 0), 38],
 [datetime.datetime(2017, 8, 17, 12, 0), 136],
 [datetime.datetime(2017, 8, 17, 16, 0), 282],
 [datetime.datetime(2017, 8, 17, 20, 0), 741],
 [datetime.datetime(2017, 8, 18, 0, 0), 213],
 [datetime.datetime(2017, 8, 18, 4, 0), 21],
 [datetime.datetime(2017, 8, 18, 8, 0), 26],
 [datetime.datetime(2017, 8, 18, 12, 0), 152],
 [datetime.datetime(2017, 8, 18, 16, 0), 279],
 [datetime.datetime(2017, 8, 18, 20, 0), 622],
 [datetime.datetime(2017, 8, 19, 0, 0), 165],
 [datetime.datetime(2017, 8, 19, 4, 0), 16],
 [datetime.datetime(2017, 8, 19, 8, 0), 13],
 [datetime.datetime(2017, 8, 19, 12, 0), 75],
 [datetime.datetime(2017, 8, 19, 16, 0), 171],
 [datetime.datetime(2017, 8, 19, 20, 0), 220],
 [datetime.datetime(2017, 8, 20, 0, 0), 135],
 [datetime.datetime(2017, 8, 20, 4, 0), 14],
 [datetime.datetime(2017, 8, 20, 8, 0), 11],
 [datetime.datetime(2017, 8, 20, 12, 0), 79],
 [datetime.datetime(2017, 8, 20, 16, 0), 129],
 [datetime.datetime(2017, 8, 20, 20, 0), 146],
 [datetime.datetime(2017, 8, 21, 0, 0), 116],
 [datetime.datetime(2017, 8, 21, 4, 0), 9],
 [datetime.datetime(2017, 8, 21, 8, 0), 32],
 [datetime.datetime(2017, 8, 21, 12, 0), 154],
 [datetime.datetime(2017, 8, 21, 16, 0), 236],
 [datetime.datetime(2017, 8, 21, 20, 0), 700],
 [datetime.datetime(2017, 8, 22, 0, 0), 163],
 [datetime.datetime(2017, 8, 22, 4, 0), 9],
 [datetime.datetime(2017, 8, 22, 8, 0), 26],
 [datetime.datetime(2017, 8, 22, 12, 0), 115],
 [datetime.datetime(2017, 8, 22, 16, 0), 239],
 [datetime.datetime(2017, 8, 22, 20, 0), 628],
 [datetime.datetime(2017, 8, 23, 0, 0), 170],
 [datetime.datetime(2017, 8, 23, 4, 0), 6],
 [datetime.datetime(2017, 8, 23, 8, 0), 43],
 [datetime.datetime(2017, 8, 23, 12, 0), 161],
 [datetime.datetime(2017, 8, 23, 16, 0), 260],
 [datetime.datetime(2017, 8, 23, 20, 0), 672],
 [datetime.datetime(2017, 8, 24, 0, 0), 215],
 [datetime.datetime(2017, 8, 24, 4, 0), 23],
 [datetime.datetime(2017, 8, 24, 8, 0), 43],
 [datetime.datetime(2017, 8, 24, 12, 0), 156],
 [datetime.datetime(2017, 8, 24, 16, 0), 291],
 [datetime.datetime(2017, 8, 24, 20, 0), 770],
 [datetime.datetime(2017, 8, 25, 0, 0), 195],
 [datetime.datetime(2017, 8, 25, 4, 0), 18],
 [datetime.datetime(2017, 8, 25, 8, 0), 35],
 [datetime.datetime(2017, 8, 25, 12, 0), 129],
 [datetime.datetime(2017, 8, 25, 16, 0), 320],
 [datetime.datetime(2017, 8, 25, 20, 0), 655]]

Exercise 3

  • These counts are cumulative every n hours. We want total daily entries.

Now build a dict with the same keys, but with a single value per day: not the cumulative count, but the total number of passengers that entered through this turnstile on that day.
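
One way to sketch the daily grouping is with itertools.groupby, which works here because the readings are already in chronological order. Note that this simple version credits every reading to its own calendar date, whereas the cell below deliberately credits the 00:00 reading to the previous day. The data here is a hypothetical miniature series:

```python
from datetime import datetime, date
from itertools import groupby

# hypothetical per-period entry counts for one turnstile, in chronological order
series = [
    (datetime(2017, 8, 12, 4, 0), 32),
    (datetime(2017, 8, 12, 8, 0), 18),
    (datetime(2017, 8, 13, 4, 0), 18),
    (datetime(2017, 8, 13, 8, 0), 17),
]

# group consecutive readings by calendar date and sum each day's counts
daily = [(day, sum(count for _, count in group))
         for day, group in groupby(series, key=lambda el: el[0].date())]
# daily == [(date(2017, 8, 12), 50), (date(2017, 8, 13), 35)]
```

groupby only merges consecutive runs, so the input must be sorted by date -- which the turnstile files already are.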


In [9]:
# The reference solutions include a check for abnormal values; I added it in the
# exercises below, because I only found out about the problem later in the analysis

turnstile_daily = {}

# loop through each key in the time series, tracking when the date changes
# while accumulating the partial counts
for key in turnstile_timeseries:
    value = []
    prev_date = ''
    daily_entries = 0
    for el in turnstile_timeseries[key]:
        curr_date = el[0].date()
        daily_entries += el[1]
        # when the current date differs from the previous one, write the total to the
        # dict and reset the accumulator (the 00:00 reading is credited to the previous
        # day, since it covers entries made before midnight);
        # checking that the date isn't empty avoids writing a bogus initial value for each key
        if prev_date != curr_date:
            if prev_date != '':
                value.append([prev_date, daily_entries])
            daily_entries = 0
            prev_date = curr_date
    # write the last accumulated value in any case; this is the closing value of the period
    value.append([prev_date, daily_entries])
    turnstile_daily[key] = value  # keys are unique here, so a plain assignment suffices

In [10]:
turnstile_daily[test]


Out[10]:
[[datetime.date(2017, 8, 12), 689],
 [datetime.date(2017, 8, 13), 502],
 [datetime.date(2017, 8, 14), 1309],
 [datetime.date(2017, 8, 15), 1383],
 [datetime.date(2017, 8, 16), 1413],
 [datetime.date(2017, 8, 17), 1429],
 [datetime.date(2017, 8, 18), 1265],
 [datetime.date(2017, 8, 19), 630],
 [datetime.date(2017, 8, 20), 495],
 [datetime.date(2017, 8, 21), 1294],
 [datetime.date(2017, 8, 22), 1187],
 [datetime.date(2017, 8, 23), 1357],
 [datetime.date(2017, 8, 24), 1478],
 [datetime.date(2017, 8, 25), 1157]]

Exercise 4

  • We will plot the daily time series for a turnstile.

In an IPython notebook, add this to the beginning of your next cell:

%matplotlib inline

This will make your matplotlib graphs integrate nicely with the notebook. To plot the time series, import matplotlib with

import matplotlib.pyplot as plt

Take the list of [(date1, count1), (date2, count2), ...] for the turnstile and turn it into two lists: dates and counts. This should plot it:

plt.figure(figsize=(10,3))
plt.plot(dates,counts)

In [11]:
import matplotlib.pyplot as plt
%matplotlib inline

In [12]:
# unpacking with list comprehensions; zip(*turnstile_daily[test]) would also work
dates = [el[0] for el in turnstile_daily[test]]
counts = [el[1] for el in turnstile_daily[test]]

fig = plt.figure(figsize=(14, 5))
ax = plt.axes()
ax.plot(dates, counts)
plt.grid(True);


Exercise 5

  • So far we've been operating at the single-turnstile level; let's combine turnstiles in the same ControlArea/Unit/Station combo. Some ControlArea/Unit/Station groups have a single turnstile, but most have multiple turnstiles -- the same value for the C/A, UNIT and STATION columns, different values for the SCP column.

We want to combine the numbers together -- for each ControlArea/UNIT/STATION combo, for each day, add the counts from each turnstile belonging to that combo.
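
The aggregation across turnstiles can be sketched compactly with collections.defaultdict, dropping SCP from the key; the input dict here is hypothetical toy data in the shape built above:

```python
from collections import defaultdict
from datetime import date

# hypothetical per-turnstile daily totals, keyed by (C/A, UNIT, SCP, STATION)
turnstile_daily = {
    ('A002', 'R051', '02-00-00', '59 ST'): [(date(2017, 8, 12), 689)],
    ('A002', 'R051', '02-00-01', '59 ST'): [(date(2017, 8, 12), 711)],
}

combined = defaultdict(int)
for (ca, unit, scp, stat), days in turnstile_daily.items():
    for day, count in days:
        combined[(ca, unit, stat, day)] += count  # SCP dropped from the key
# combined[('A002', 'R051', '59 ST', date(2017, 8, 12))] == 1400
```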


In [68]:
temp = {}

# for each key, form the new key (dropping SCP) and check whether it's already in the temp dict;
# appending the date to the key in this temp dict makes summing the values easier;
# the next cell regroups it into a dict with the required keys
for key in turnstile_daily:
    new_key = list(key[0:2]) + list(key[-1:])
    for el in turnstile_daily[key]:
        # set negative values to 0:
        #   likely cause: oddities in the data, such as totals that decrease over time
        # also set implausibly large values (over 10,000,000) to 0:
        #   likely cause: counter resets during data recovery
        value = np.int64(el[1])
        if value < 0 or value > 10000000:
            value = 0  # maybe nan would be a better choice...
        if tuple(new_key + [el[0]]) in temp:
            temp[tuple(new_key + [el[0]])] += value
        else:
            temp[tuple(new_key + [el[0]])] = value

In [69]:
ca_unit_station = {}

for key in temp:
    new_key = key[0:3]
    date = key[-1]
    if new_key in ca_unit_station:
        ca_unit_station[new_key].append([date, temp[key]])
    else:
        ca_unit_station[new_key] = [[date, temp[key]]]

In [70]:
ca_unit_station[('R305', 'R206', '125 ST')]


Out[70]:
[[datetime.date(2017, 8, 12), 2439],
 [datetime.date(2017, 8, 13), 2010],
 [datetime.date(2017, 8, 14), 3105],
 [datetime.date(2017, 8, 15), 0],
 [datetime.date(2017, 8, 16), 5344],
 [datetime.date(2017, 8, 17), 5422],
 [datetime.date(2017, 8, 18), 4987],
 [datetime.date(2017, 8, 19), 3873],
 [datetime.date(2017, 8, 20), 3148],
 [datetime.date(2017, 8, 21), 4758],
 [datetime.date(2017, 8, 22), 4933],
 [datetime.date(2017, 8, 23), 5013],
 [datetime.date(2017, 8, 24), 5254],
 [datetime.date(2017, 8, 25), 4678]]

Exercise 6

  • Similarly, combine everything within each station: come up with a [(date1, count1), (date2, count2), ...] type of time series for each STATION, by adding up all the turnstiles in a station.

In [71]:
temp = {}

# same approach as Exercise 5, but the new key is just the station name;
# appending the date to the key in this temp dict makes summing the values easier;
# the next cell regroups it into a dict with the required keys
for key in turnstile_daily:
    new_key = key[-1]
    for el in turnstile_daily[key]:
        # set negative values to 0:
        #   likely cause: oddities in the data, such as totals that decrease over time
        # also set implausibly large values (over 10,000,000) to 0:
        #   likely cause: counter resets during data recovery
        value = np.int64(el[1])
        if value < 0 or value > 10000000:
            value = 0
        if (new_key, el[0]) in temp:
            temp[(new_key, el[0])] += value
        else:
            temp[(new_key, el[0])] = value

In [72]:
station = {}

for key in temp:
    new_key = key[0]
    date = key[-1]
    if new_key in station:
        station[new_key].append([date, temp[key]])
    else:
        station[new_key] = [[date, temp[key]]]

In [84]:
station['59 ST']


Out[84]:
[[datetime.date(2017, 8, 12), 37121],
 [datetime.date(2017, 8, 13), 29765],
 [datetime.date(2017, 8, 14), 67938],
 [datetime.date(2017, 8, 15), 71742],
 [datetime.date(2017, 8, 16), 72568],
 [datetime.date(2017, 8, 17), 71766],
 [datetime.date(2017, 8, 18), 63868],
 [datetime.date(2017, 8, 19), 41776],
 [datetime.date(2017, 8, 20), 32361],
 [datetime.date(2017, 8, 21), 63113],
 [datetime.date(2017, 8, 22), 67567],
 [datetime.date(2017, 8, 23), 69994],
 [datetime.date(2017, 8, 24), 69327],
 [datetime.date(2017, 8, 25), 59005]]

Exercise 7

  • Plot the time series for a station

In [76]:
test_station = '59 ST'

dates = [el[0] for el in station[test_station]]
counts = [el[1] for el in station[test_station]]

fig = plt.figure(figsize=(14, 5))
ax = plt.axes()
ax.plot(dates, counts)
plt.grid(True);


Exercise 8

  • Make one list of counts for one week for one station: Monday's count, Tuesday's count, etc., so it's a list of 7 counts. Make the same list for another week, and another, and another. plt.plot(week_count_list) for every week_count_list you created this way. You should get a rainbow plot of weekly commute numbers on top of each other.
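
The week-splitting in the cell below boils down to chunking a flat list of daily counts into runs of seven. A minimal sketch on synthetic counts:

```python
counts = list(range(1, 22))  # 21 hypothetical daily counts (three full weeks)

# slice off one 7-day chunk per full week, discarding any trailing partial week
weeks = [counts[i:i + 7] for i in range(0, len(counts) - len(counts) % 7, 7)]
# weeks[0] == [1, 2, 3, 4, 5, 6, 7]
```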

In [77]:
fig = plt.figure(figsize=(16, 6))
ax = plt.axes()
n = len(station[test_station])
# creating a list with all the counts for the station
all_counts = [el[1] for el in station[test_station]]
# splitting counts every 7 values to get weekly data
for i in range(int(np.floor(n/7))):
    ax.plot(all_counts[i*7: 7 + i*7])
ax.set_xticklabels(['', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])
plt.grid(True);


Exercise 9

  • Over multiple weeks, sum total ridership for each station and sort them, so you can find out the stations with the highest traffic during the time you investigate
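
Summing and ranking can also be done in one pass with a dict comprehension and sorted with a lambda key (equivalent to the operator.itemgetter approach used below); the station dict here is toy data:

```python
# hypothetical station -> [(date, count), ...] time series
station = {
    '59 ST':      [('2017-08-12', 100), ('2017-08-13', 150)],
    '34 ST-PENN': [('2017-08-12', 300), ('2017-08-13', 250)],
}

# total ridership per station, then sort descending by total
totals = {name: sum(c for _, c in days) for name, days in station.items()}
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
# ranked == [('34 ST-PENN', 550), ('59 ST', 250)]
```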

In [78]:
total_ridership = {}

# just looping through keys and summing all elements inside the dict
for key in station:
    for el in station[key]:
        if key in total_ridership:
            total_ridership[key] += el[1]
        else:
            total_ridership[key] = el[1]

In [99]:
import operator
sorted(total_ridership.items(), key=operator.itemgetter(1), reverse=True)


Out[99]:
[('34 ST-PENN STA', 1767161),
 ('GRD CNTRL-42 ST', 1604500),
 ('34 ST-HERALD SQ', 1320762),
 ('TIMES SQ-42 ST', 1207986),
 ('14 ST-UNION SQ', 1205867),
 ('23 ST', 1192051),
 ('42 ST-PORT AUTH', 1077606),
 ('FULTON ST', 1065607),
 ('86 ST', 938611),
 ('CANAL ST', 848954),
 ('125 ST', 840526),
 ('59 ST', 817911),
 ('59 ST COLUMBUS', 775840),
 ('FLUSHING-MAIN', 710137),
 ('96 ST', 679140),
 ('47-50 STS ROCK', 652054),
 ('CHAMBERS ST', 629776),
 ('ATL AV-BARCLAY', 626845),
 ('14 ST', 613190),
 ('PATH NEW WTC', 601406),
 ('JKSN HT-ROOSVLT', 600527),
 ('50 ST', 549029),
 ('28 ST', 539319),
 ('72 ST', 513946),
 ('42 ST-BRYANT PK', 487990),
 ('WALL ST', 466430),
 ('W 4 ST-WASH SQ', 465351),
 ('JAY ST-METROTEC', 448077),
 ('8 AV', 441008),
 ('7 AV', 423605),
 ('145 ST', 417455),
 ('CHURCH AV', 413452),
 ('JAMAICA CENTER', 405501),
 ('LEXINGTON AV/53', 401562),
 ('GRAND ST', 385133),
 ('116 ST', 384039),
 ('BOROUGH HALL', 379706),
 ('57 ST-7 AV', 360354),
 ('BEDFORD AV', 358225),
 ('BOWLING GREEN', 356110),
 ('DEKALB AV', 354855),
 ('77 ST', 354208),
 ('103 ST', 343680),
 ('KINGS HWY', 321850),
 ('33 ST', 321168),
 ('49 ST', 319614),
 ('JOURNAL SQUARE', 318448),
 ('161/YANKEE STAD', 309016),
 ('FOREST HILLS 71', 308408),
 ('MYRTLE-WYCKOFF', 306254),
 ('72 ST-2 AVE', 305036),
 ('THIRTY THIRD ST', 301367),
 ('CROWN HTS-UTICA', 299400),
 ('WOODHAVEN BLVD', 297684),
 ('KEW GARDENS', 285312),
 ('DELANCEY/ESSEX', 279095),
 ('168 ST', 276156),
 ('SUTPHIN-ARCHER', 275942),
 ('86 ST-2 AVE', 273252),
 ('FORDHAM RD', 272331),
 ('5 AV/53 ST', 266750),
 ('SPRING ST', 264114),
 ('JUNCTION BLVD', 263416),
 ('CONEY IS-STILLW', 258556),
 ('NOSTRAND AV', 258412),
 ('181 ST', 250222),
 ("B'WAY-LAFAYETTE", 249345),
 ('1 AV', 242736),
 ('79 ST', 242139),
 ('103 ST-CORONA', 240481),
 ('GROVE STREET', 239978),
 ('CORTLANDT ST', 236391),
 ('135 ST', 236143),
 ('JAMAICA 179 ST', 235830),
 ('3 AV-149 ST', 232079),
 ('WHITEHALL S-FRY', 227900),
 ('167 ST', 227203),
 ('QUEENSBORO PLZ', 226757),
 ('51 ST', 222935),
 ('PAVONIA/NEWPORT', 220772),
 ('CATHEDRAL PKWY', 215041),
 ('36 ST', 207992),
 ('68ST-HUNTER CO', 206105),
 ('BROOKLYN BRIDGE', 204815),
 ('96 ST-2 AVE', 203440),
 ('61 ST WOODSIDE', 201991),
 ('LEXINGTON AV/63', 201760),
 ('66 ST-LINCOLN', 201584),
 ('NEWARK BM BW', 201283),
 ('FLATBUSH AV-B.C', 200936),
 ('90 ST-ELMHURST', 200424),
 ('111 ST', 200268),
 ('CITY / BUS', 199919),
 ('5 AV/59 ST', 197827),
 ('SOUTH FERRY', 196337),
 ('KINGSBRIDGE RD', 195189),
 ('UTICA AV', 192000),
 ('BAY PKWY', 191186),
 ('2 AV', 188975),
 ('EXCHANGE PLACE', 188411),
 ('RECTOR ST', 188095),
 ('82 ST-JACKSON H', 187155),
 ('HALSEY ST', 186871),
 ('PARKCHESTER', 182841),
 ('VERNON-JACKSON', 182217),
 ('63 DR-REGO PARK', 181942),
 ('BLEECKER ST', 181705),
 ('FRANKLIN AV', 181560),
 ('170 ST', 180874),
 ('GRAND-NEWTOWN', 180727),
 ('WORLD TRADE CTR', 178290),
 ('MYRTLE AV', 177278),
 ('ASTOR PL', 175620),
 ('CHRISTOPHER ST', 175337),
 ('STEINWAY ST', 173057),
 ('81 ST-MUSEUM', 171084),
 ('8 ST-NYU', 166461),
 ('EAST BROADWAY', 163212),
 ('PRINCE ST', 162904),
 ('DYCKMAN ST', 161409),
 ('BROADWAY', 160431),
 ('18 AV', 159838),
 ('46 ST BLISS ST', 159726),
 ('BERGEN ST', 158496),
 ('SHEEPSHEAD BAY', 158171),
 ('HOUSTON ST', 158039),
 ('MARCY AV', 155840),
 ('175 ST', 155113),
 ('COURT SQ', 151546),
 ('ST. GEORGE', 150243),
 ('ASTORIA DITMARS', 150027),
 ('57 ST', 148689),
 ('5 AVE', 148195),
 ('BRIGHTON BEACH', 147979),
 ('137 ST CITY COL', 147632),
 ('FT HAMILTON PKY', 147388),
 ('ELMHURST AV', 141492),
 ('CANARSIE-ROCKAW', 140664),
 ('149/GRAND CONC', 134972),
 ('FLUSHING AV', 133795),
 ('METS-WILLETS PT', 132738),
 ('GUN HILL RD', 131673),
 ('AVENUE U', 130118),
 ('BEDFORD PK BLVD', 129752),
 ('PROSPECT PARK', 128738),
 ('34 ST-HUDSON YD', 127949),
 ('30 AV', 127755),
 ('YORK ST', 126320),
 ('PELHAM PKWY', 125370),
 ('LORIMER ST', 124831),
 ('116 ST-COLUMBIA', 124598),
 ('110 ST', 122947),
 ('157 ST', 122787),
 ('NEVINS ST', 121998),
 ('QUEENS PLAZA', 121949),
 ('HOYT-SCHER', 121754),
 ('ROCKAWAY AV', 121715),
 ('EUCLID AV', 121714),
 ('BURNSIDE AV', 119947),
 ('NEWKIRK PLAZA', 119659),
 ('JFK JAMAICA CT1', 119342),
 ('HIGH ST', 118706),
 ('THIRTY ST', 117749),
 ('GRAHAM AV', 117256),
 ('21 ST-QNSBRIDGE', 116807),
 ('4AV-9 ST', 116555),
 ('40 ST LOWERY ST', 116329),
 ('HUNTS POINT AV', 115726),
 ('E 180 ST', 115565),
 ('CARROLL ST', 115329),
 ('CLINTON-WASH AV', 114158),
 ('TWENTY THIRD ST', 113748),
 ('TREMONT AV', 113136),
 ('ASTORIA BLVD', 111650),
 ('METROPOLITAN AV', 110102),
 ('14TH STREET', 108915),
 ('INWOOD-207 ST', 108695),
 ('BROADWAY JCT', 107483),
 ('GREENPOINT AV', 107363),
 ('SIMPSON ST', 106492),
 ('231 ST', 104930),
 ('CENTRAL PK N110', 104863),
 ('NASSAU AV', 101890),
 ('HARRISON', 101096),
 ('JEFFERSON ST', 99839),
 ('NORWOOD 205 ST', 99087),
 ('MOSHOLU PKWY', 98330),
 ('BEDFORD-NOSTRAN', 97890),
 ('46 ST', 97173),
 ('MORGAN AV', 96365),
 ('ROOSEVELT ISLND', 96111),
 ('COURT SQ-23 ST', 95811),
 ('45 ST', 93293),
 ('33 ST-RAWSON ST', 92913),
 ('67 AV', 91908),
 ('169 ST', 91568),
 ('191 ST', 91466),
 ('20 AV', 91191),
 ('3 AV 138 ST', 89605),
 ('ROCKAWAY BLVD', 89599),
 ('OZONE PK LEFFRT', 88845),
 ('18 ST', 88127),
 ('WOODLAWN', 87498),
 ('WEST FARMS SQ', 87350),
 ('GATES AV', 86913),
 ('V.CORTLANDT PK', 86204),
 ('PELHAM BAY PARK', 85814),
 ('PROSPECT AV', 85021),
 ('NEWKIRK AV', 84446),
 ('MONTROSE AV', 83451),
 ('KOSCIUSZKO ST', 81843),
 ('CITY HALL', 81418),
 ('3 AV', 79766),
 ('NORTHERN BLVD', 78501),
 ('MORISN AV/SNDVW', 77508),
 ('BROAD ST', 77201),
 ('KINGSTON-THROOP', 76957),
 ('PARKSIDE AV', 76589),
 ('GRAND ARMY PLAZ', 76577),
 ('52 ST', 76126),
 ('ELDER AV', 75420),
 ('207 ST', 75380),
 ('183 ST', 74488),
 ('155 ST', 74358),
 ('NEWARK C', 73806),
 ('CASTLE HILL AV', 73599),
 ('CORTELYOU RD', 72779),
 ('SUTTER AV-RUTLD', 72423),
 ('GRANT AV', 72316),
 ('176 ST', 71099),
 ('NEW LOTS AV', 71059),
 ('HOYT ST', 70982),
 ('PARSONS BLVD', 70342),
 ('LACKAWANNA', 70101),
 ('VAN SICLEN AV', 69399),
 ('RALPH AV', 69175),
 ('WINTHROP ST', 68836),
 ('71 ST', 68650),
 ('WESTCHESTER SQ', 68420),
 ('NEWARK HW BMEBE', 68260),
 ('BROOK AV', 67359),
 ('AVENUE J', 67357),
 ('174 ST', 66545),
 ('UNION ST', 65002),
 ('9 AV', 64796),
 ('MYRTLE-WILLOUGH', 64753),
 ('BAY RIDGE-95 ST', 64520),
 ('STERLING ST', 64422),
 ('MT EDEN AV', 63703),
 ('174-175 STS', 63161),
 ('SARATOGA AV', 62301),
 ('JACKSON AV', 61262),
 ('25 AV', 61070),
 ('FRANKLIN ST', 60963),
 ('AVENUE M', 60911),
 ('HUNTERS PT AV', 60282),
 ('69 ST', 59705),
 ('CRESCENT ST', 59373),
 ('MARBLE HILL-225', 59246),
 ('NEW LOTS', 59049),
 ('15 ST-PROSPECT', 58678),
 ('36 AV', 58420),
 ('JAMAICA VAN WK', 58121),
 ('25 ST', 58017),
 ('FAR ROCKAWAY', 57974),
 ('9TH STREET', 57337),
 ('LAFAYETTE AV', 57103),
 ('HOWARD BCH JFK', 56819),
 ('WAKEFIELD/241', 55549),
 ('ALLERTON AV', 54952),
 ('FREEMAN ST', 54924),
 ('190 ST', 54776),
 ('CLASSON AV', 54373),
 ('E 149 ST', 54348),
 ('SMITH-9 ST', 54113),
 ('WILSON AV', 54017),
 ('PARK PLACE', 53068),
 ('233 ST', 52791),
 ('163 ST-AMSTERDM', 52296),
 ('KINGSTON AV', 52267),
 ('EASTN PKWY-MUSM', 52000),
 ('CLARK ST', 51581),
 ('BOWERY', 51394),
 ('NECK RD', 51173),
 ('BRIARWOOD', 50748),
 ('80 ST', 50285),
 ('NEW UTRECHT AV', 50056),
 ('BEVERLY RD', 49604),
 ('182-183 STS', 49066),
 ('ST LAWRENCE AV', 48314),
 ('SUTPHIN BLVD', 47833),
 ('SUTTER AV', 46342),
 ('CYPRESS AV', 44954),
 ('238 ST', 44937),
 ('4 AV-9 ST', 44257),
 ('EAST 105 ST', 44220),
 ('HARLEM 148 ST', 43905),
 ('PENNSYLVANIA AV', 42979),
 ('225 ST', 42920),
 ('EASTCHSTER/DYRE', 41921),
 ('RIT-MANHATTAN', 41902),
 ('W 8 ST-AQUARIUM', 41648),
 ('CHAUNCEY ST', 41260),
 ('RIT-ROOSEVELT', 41001),
 ('PRESIDENT ST', 40876),
 ('74 ST-BROADWAY', 40693),
 ('65 ST', 40511),
 ('BUHRE AV', 40215),
 ('85 ST-FOREST PK', 40052),
 ('LONGWOOD AV', 39916),
 ('DITMAS AV', 39808),
 ('OCEAN PKWY', 39654),
 ('SHEPHERD AV', 38985),
 ('6 AV', 38820),
 ('138/GRAND CONC', 38704),
 ('NORWOOD AV', 38688),
 ('BEVERLEY ROAD', 37825),
 ('VAN SICLEN AVE', 37433),
 ('BEACH 67 ST', 37317),
 ('LIVONIA AV', 37232),
 ('INTERVALE AV', 36667),
 ('NEREID AV', 36299),
 ('PATH WTC 2', 36297),
 ('CLEVELAND ST', 35531),
 ('75 ST-ELDERTS', 35277),
 ('219 ST', 35227),
 ('BURKE AV', 35053),
 ('BAYCHESTER AV', 34144),
 ('75 AV', 34071),
 ('88 ST', 32940),
 ('AVENUE H', 32829),
 ('104 ST', 31955),
 ('LIBERTY AV', 31407),
 ('HEWES ST', 31291),
 ('AVENUE X', 31034),
 ('BAY 50 ST', 30537),
 ('ALABAMA AV', 28760),
 ('ZEREGA AV', 28618),
 ('BEACH 60 ST', 28573),
 ('39 AV', 27729),
 ('AQUEDUCT RACETR', 26488),
 ('BRONX PARK EAST', 26099),
 ('BUSHWICK AV', 24907),
 ('MORRIS PARK', 24801),
 ('21 ST', 24180),
 ('215 ST', 22588),
 ('55 ST', 22402),
 ('MIDDLETOWN RD', 22063),
 ('BEACH 25 ST', 21631),
 ('ATLANTIC AV', 21180),
 ('WHITLOCK AV', 20814),
 ('JUNIUS ST', 20705),
 ('AVENUE N', 19666),
 ('BEACH 90 ST', 19479),
 ('NEPTUNE AV', 16143),
 ('AVENUE P', 15702),
 ('CYPRESS HILLS', 14635),
 ('ROCKAWAY PARK B', 13309),
 ('BOTANIC GARDEN', 13295),
 ('AQUEDUCT N.COND', 11888),
 ('BEACH 36 ST', 11562),
 ('BEACH 98 ST', 10482),
 ('TOMPKINSVILLE', 9441),
 ("E 143/ST MARY'S", 9371),
 ('121 ST', 9032),
 ('BEACH 44 ST', 8480),
 ('AVENUE I', 8036),
 ('BEACH 105 ST', 6450),
 ('ORCHARD BEACH', 3827),
 ('BROAD CHANNEL', 3536),
 ('NEWARK HM HE', 321),
 ('KNICKERBOCKER', 2),
 ('CENTRAL AV', 1),
 ('SENECA AVE', 1),
 ('FOREST AVE', 0),
 ('FRESH POND RD', 0)]
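A ranking like the one above can be produced by sorting a station → total dict by its values. Here is a minimal sketch with a hypothetical three-station sample (the notebook uses the full `total_ridership` dict built in the earlier exercises):

```python
# hypothetical station -> total entries mapping (tiny sample, not the real data)
station_totals = {
    'FOREST AVE': 0,
    'GRAHAM AV': 117256,
    'BROAD CHANNEL': 3536,
}

# sort (station, total) pairs by total, busiest first
ranked = sorted(station_totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])   # → ('GRAHAM AV', 117256)
```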

Exercise 10

  • Make a single list of these total ridership values and plot it with plt.hist(total_ridership_counts) to get an idea of the distribution of total ridership across stations.
    This should show you that most stations have low traffic, and that the histogram bins for large traffic volumes have short bars.

Additional Hint:
If you want to see which stations carry the bulk of the traffic, you can sort the total ridership counts and make a bar chart with plt.bar. For this you need two lists: the index of each bar and the bar's value. The indices can simply be 0, 1, 2, 3, ..., so you can do

indices = range(len(total_ridership_values))
plt.bar(indices, total_ridership_values)
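The data-prep part of the hint above can be sketched with hypothetical totals; the `plt.bar` call is left as a comment so the snippet stands alone without a plotting backend:

```python
# hypothetical totals; in the notebook these come from total_ridership.values()
total_ridership_values = sorted([117256, 0, 3536, 1000000])

# one bar index per station
indices = list(range(len(total_ridership_values)))

# plt.bar(indices, total_ridership_values) would then draw the sorted bar chart
print(indices)                  # → [0, 1, 2, 3]
print(total_ridership_values)   # → [0, 3536, 117256, 1000000]
```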

In [105]:
fig = plt.figure(figsize=(16, 10))
ax = plt.axes()
# Histogram of per-station totals: most of the mass sits in the low-traffic bins
ax.hist(list(total_ridership.values()));



In [108]:
fig = plt.figure(figsize=(16, 10))
ax = plt.axes()
# Sorted bar chart: a handful of stations dominate total ridership
ax.bar(range(len(total_ridership)), sorted(total_ridership.values()));