Exploratory Data Analysis with Python

We will explore the NYC MTA turnstile data set. These data files are from the New York Subway. It tracks the hourly entries and exits to turnstiles (UNIT) by day in the subway system.

Here is an example of what you could do with the data. James Kao investigates how subway ridership is affected by incidence of rain.


NOTE:
This notebook uses code found in the k2datascience.nyc_mta package. To execute all the cells do one of the following items:

  • Install the k2datascience package to the active Python interpreter.
  • Add k2datascience/k2datascience to the PYTHON_PATH system variable.
  • Create a link to the nyc_mta.py file in the same directory as this notebook.


Imports


In [1]:
from collections import defaultdict
import csv
import os
import os.path as osp

from dateutil.parser import parse
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from k2datascience import nyc_mta

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

Download Data

Would you like to download New York City MTA Turnstile data?

  • Each file is for a week of data and is approximately 24 Megabytes in size.

In [2]:
download = False
file_quantity = 2

Scrape MTA Turnstile Web Page to extract all available data files.


In [3]:
d = nyc_mta.TurnstileData()
if download:
    d.write_data_files(qty=file_quantity)
    print(f'\n\nThe raw data files were written out to:\n\n{d.data_dir}')

Exercise 1

  • Download at least 2 weeks worth of MTA turnstile data (You can do this manually or via Python)
  • Open up a file, use csv reader to read it, make a python dict where there is a key for each (C/A, UNIT, SCP, STATION). These are the first four columns. The value for this key should be a list of lists. Each list in the list is the rest of the columns in a row. For example, one key-value pair should look like
    {    ('A002','R051','02-00-00','LEXINGTON AVE'):    
         [
           ['NQR456', 'BMT', '01/03/2015', '03:00:00', 'REGULAR', '0004945474', '0001675324'],          
             ['NQR456', 'BMT', '01/03/2015', '07:00:00', 'REGULAR', '0004945478', '0001675333'],  
            ['NQR456', 'BMT', '01/03/2015', '11:00:00', 'REGULAR', '0004945515', '0001675364'],
          ...   
     ] 
    }

Store all the weeks in a data structure of your choosing

Data File Path


In [4]:
data_file = '170401.txt'
data_dir = osp.join('..', 'data', 'nyc_mta_turnstile')
data_file_path = osp.join(data_dir, data_file)

Create Excersize 1 Dictionary


In [5]:
turnstile = defaultdict(list)
with open(data_file_path, 'r') as f:
    reader = csv.reader(f)
    initial_row = True
    for row in reader:
        if not initial_row:
            turnstile[tuple(row[:4])].append([x.strip() for x in row[4:]])
        else:
            header = [x.strip() for x in row]
            initial_row = False
  • C/A: Control Area (A002)
  • UNIT: Remote Unit for a station (R051)
  • SCP: Subunit Channel Position represents an specific address for a device (02-00-00)
  • STATION: Represents the station name the device is located at
  • LINENAME: Represents all train lines that can be boarded at this station
    • Normally lines are represented by one character.
    • LINENAME 456NQR represents train server for 4, 5, 6, N, Q, and R trains.
  • DIVISION: Represents the Line originally the station belonged to BMT, IRT, or IND
  • DATE: Represents the date (MM-DD-YY)
  • TIME: Represents the time (hh:mm:ss) for a scheduled audit event
  • DESc: Represent the "REGULAR" scheduled audit event (Normally occurs every 4 hours)
    • Audits may occur more that 4 hours due to planning, or troubleshooting activities.
    • Additionally, there may be a "RECOVR AUD" entry: This refers to a missed audit that was recovered.
  • ENTRIES: The comulative entry register value for a device
  • EXIST: The cumulative exit register value for a device

In [6]:
header


Out[6]:
['C/A',
 'UNIT',
 'SCP',
 'STATION',
 'LINENAME',
 'DIVISION',
 'DATE',
 'TIME',
 'DESC',
 'ENTRIES',
 'EXITS']

Example Entry in Turnstile Dictionary


In [7]:
turnstile[('A002', 'R051', '02-00-00', '59 ST')][:3]


Out[7]:
[['NQR456W',
  'BMT',
  '03/25/2017',
  '00:00:00',
  'REGULAR',
  '0006108321',
  '0002069313'],
 ['NQR456W',
  'BMT',
  '03/25/2017',
  '04:00:00',
  'REGULAR',
  '0006108343',
  '0002069319'],
 ['NQR456W',
  'BMT',
  '03/25/2017',
  '08:00:00',
  'REGULAR',
  '0006108360',
  '0002069347']]

Create Pandas DataFrame


In [8]:
d.get_data()
d.data.shape
d.data.head()


Out[8]:
(396013, 11)
Out[8]:
c_a unit scp station linename division date time desc entries exits
0 A002 R051 02-00-00 59 ST NQR456W BMT 03/25/2017 00:00:00 REGULAR 6108321 2069313
1 A002 R051 02-00-00 59 ST NQR456W BMT 03/25/2017 04:00:00 REGULAR 6108343 2069319
2 A002 R051 02-00-00 59 ST NQR456W BMT 03/25/2017 08:00:00 REGULAR 6108360 2069347
3 A002 R051 02-00-00 59 ST NQR456W BMT 03/25/2017 12:00:00 REGULAR 6108462 2069456
4 A002 R051 02-00-00 59 ST NQR456W BMT 03/25/2017 16:00:00 REGULAR 6108689 2069525

Exercise 2

  • Let's turn this into a time series.

    For each key (basically the control area, unit, device address and station of a specific turnstile), have a list again, but let the list be comprised of just the point in time and the cumulative count of entries.

This basically means keeping only the date, time, and entries fields in each list. You can convert the date and time into datetime objects -- That is a python class that represents a point in time. You can combine the date and time fields into a string and use the dateutil package to convert it into a datetime object.

Your new dict should look something like

{    ('A002','R051','02-00-00','LEXINGTON AVE'):    
         [
            [datetime.datetime(2013, 3, 2, 3, 0), 3788],
            [datetime.datetime(2013, 3, 2, 7, 0), 2585],
            [datetime.datetime(2013, 3, 2, 12, 0), 10653],
            [datetime.datetime(2013, 3, 2, 17, 0), 11016],
            [datetime.datetime(2013, 3, 2, 23, 0), 10666],
            [datetime.datetime(2013, 3, 3, 3, 0), 10814],
            [datetime.datetime(2013, 3, 3, 7, 0), 10229],
            ...
          ],
 ....
 }

Create Exersize 2 Time Series Dictionary

Note: The extended computational time is due to the dateutil operation.


In [9]:
turnstile_ts = {}
for k, v in turnstile.items():
    turnstile_ts[k] = [[parse(f'{x[2]} {x[3]}'), int(x[-2])] for x in v]

Example Entry in Turnstile Time Series Dictionary


In [10]:
turnstile_ts[('A002', 'R051', '02-00-00', '59 ST')][:10]


Out[10]:
[[datetime.datetime(2017, 3, 25, 0, 0), 6108321],
 [datetime.datetime(2017, 3, 25, 4, 0), 6108343],
 [datetime.datetime(2017, 3, 25, 8, 0), 6108360],
 [datetime.datetime(2017, 3, 25, 12, 0), 6108462],
 [datetime.datetime(2017, 3, 25, 16, 0), 6108689],
 [datetime.datetime(2017, 3, 25, 20, 0), 6108983],
 [datetime.datetime(2017, 3, 26, 0, 0), 6109170],
 [datetime.datetime(2017, 3, 26, 4, 0), 6109213],
 [datetime.datetime(2017, 3, 26, 8, 0), 6109224],
 [datetime.datetime(2017, 3, 26, 12, 0), 6109284]]

Add Time Stamp Series to Pandas DataFrame


In [11]:
d.get_time_stamp()
d.data.shape
d.data.head()


Out[11]:
(396013, 12)
Out[11]:
c_a unit scp station linename division date time desc entries exits time_stamp
0 A002 R051 02-00-00 59 ST NQR456W BMT 03/25/2017 00:00:00 REGULAR 6108321 2069313 2017-03-25 00:00:00
1 A002 R051 02-00-00 59 ST NQR456W BMT 03/25/2017 04:00:00 REGULAR 6108343 2069319 2017-03-25 04:00:00
2 A002 R051 02-00-00 59 ST NQR456W BMT 03/25/2017 08:00:00 REGULAR 6108360 2069347 2017-03-25 08:00:00
3 A002 R051 02-00-00 59 ST NQR456W BMT 03/25/2017 12:00:00 REGULAR 6108462 2069456 2017-03-25 12:00:00
4 A002 R051 02-00-00 59 ST NQR456W BMT 03/25/2017 16:00:00 REGULAR 6108689 2069525 2017-03-25 16:00:00

Exercise 3

  • These counts are cumulative every n hours. We want total daily entries.

Now make it that we again have the same keys, but now we have a single value for a single day, which is not cumulative counts but the total number of passengers that entered through this turnstile on this day.


In [12]:
daily_total = defaultdict(list)
for k, v in turnstile_ts.items():
    days = set([x[0].date() for x in v])
    for day in sorted(days):
        daily_total[k].append([day, sum([x[1] for x in v if x[0].date() == day])])

Example Entry in Turnstile Time Series Dictionary


In [13]:
daily_total[('A002', 'R051', '02-00-00', '59 ST')]


Out[13]:
[[datetime.date(2017, 3, 25), 36651158],
 [datetime.date(2017, 3, 26), 36656063],
 [datetime.date(2017, 3, 27), 36660942],
 [datetime.date(2017, 3, 28), 36670134],
 [datetime.date(2017, 3, 29), 36679266],
 [datetime.date(2017, 3, 30), 36689181],
 [datetime.date(2017, 3, 31), 36698829]]

Return Daily Entry Totals Using Pandas


In [14]:
d.turnstile_daily.head(10)
d.turnstile_daily.tail(10)


Out[14]:
entries exits
c_a unit station scp date
A002 R051 59 ST 02-00-00 2017-03-25 36651158 12416530
2017-03-26 36656063 12418103
2017-03-27 36660942 12420525
2017-03-28 36670134 12424098
2017-03-29 36679266 12427620
2017-03-30 36689181 12431192
2017-03-31 36698829 12434815
2017-04-01 36706941 12437604
2017-04-02 36712675 12439341
2017-04-03 36718244 12441609
Out[14]:
entries exits
c_a unit station scp date
TRAM2 R469 RIT-ROOSEVELT 00-05-01 2017-03-29 33324 1764
2017-03-30 33324 1764
2017-03-31 33324 1764
2017-04-01 33324 1764
2017-04-02 33324 1764
2017-04-03 16662 882
2017-04-04 33324 1764
2017-04-05 33324 1767
2017-04-06 33324 1770
2017-04-07 38878 2065

Exercise 4

  • We will plot the daily time series for a turnstile.

In ipython notebook, add this to the beginning of your next cell:

%matplotlib inline

This will make your matplotlib graphs integrate nicely with the notebook. To plot the time series, import matplotlib with

import matplotlib.pyplot as plt

Take the list of [(date1, count1), (date2, count2), ...], for the turnstile and turn it into two lists: dates and counts. This should plot it:

plt.figure(figsize=(10,3))
plt.plot(dates,counts)

In [15]:
label_size = 14

fig = plt.figure('Station 59 ST: Daily Turnstile Entries', figsize=(10, 3),
                 facecolor='white', edgecolor='black')
ax1 = plt.subplot2grid((1, 1), (0, 0))

dt = daily_total[('A002', 'R051', '02-00-00', '59 ST')]
dates = [x[0] for x in dt]
entries = [x[1] for x in dt]

ax1.plot_date(dates, entries, '^k-')

plt.suptitle('Station: 59 ST', fontsize=24, y=1.16);
plt.title('Control Area: A002 | Unit: R051 | Subunit Channel Position: 02-00-00',
          fontsize=18, y=1.10);
ax1.set_xlabel('Date', fontsize=label_size)
ax1.set_ylabel('Turnstile Entries', fontsize=label_size)
fig.autofmt_xdate();


Pandas Plot


In [16]:
label_size = 14
marker_size = 5

fig = plt.figure('Station 59 ST: Daily Turnstile Entries', figsize=(10, 7),
                 facecolor='white', edgecolor='black')
rows, cols = (2, 1)
ax1 = plt.subplot2grid((rows, cols), (0, 0))
ax2 = plt.subplot2grid((rows, cols), (1, 0), sharex=ax1)

dt = d.turnstile_daily.query(('c_a == "A002"'
                              '& unit == "R051"'
                              '& scp == "02-00-00"'
                              '& station == "59 ST"'))
dt.plot(x=dt.index.levels[4], y='entries', color='IndianRed', legend=None,
        markersize=marker_size, marker='o', ax=ax1)

ax1.set_title('Control Area: A002 | Unit: R051 | Subunit Channel Position: 02-00-00',
              fontsize=18, y=1.10)
ax1.set_ylabel('Turnstile Entries', fontsize=label_size)

dt.plot(x=dt.index.levels[4], y='exits', color='black', legend=None,
        markersize=marker_size, marker='d', ax=ax2)

ax2.set_xlabel('Date', fontsize=label_size)
ax2.set_ylabel('Turnstile Exits', fontsize=label_size)

plt.suptitle('Station: 59 ST', fontsize=24, y=1.04);
plt.tight_layout()
fig.autofmt_xdate();



Exercise 5

  • So far we've been operating on a single turnstile level, let's combine turnstiles in the same ControlArea/Unit/Station combo. There are some ControlArea/Unit/Station groups that have a single turnstile, but most have multiple turnstilea-- same value for the C/A, UNIT and STATION columns, different values for the SCP column.

We want to combine the numbers together -- for each ControlArea/UNIT/STATION combo, for each day, add the counts from each turnstile belonging to that combo.

Pandas Return Total Passengers Filtered By Control Area, Unit, Station and Date


In [17]:
d.get_station_daily(control_area=True, unit=True)
station_daily_all = d._station_daily
station_daily_all.head(10)
station_daily_all.tail(10)


Out[17]:
entries exits
c_a unit station date week weekday
A002 R051 59 ST 2017-03-25 12 5 325026630 152589940
2017-03-26 12 6 325074736 152617301
2017-03-27 13 0 325123649 152653387
2017-03-28 13 1 325208230 152706088
2017-03-29 13 2 325293469 152760234
2017-03-30 13 3 325384470 152815615
2017-03-31 13 4 325473896 152869893
2017-04-01 13 5 325550538 152913131
2017-04-02 13 6 325603548 152943561
2017-04-03 14 0 325658809 152982749
Out[17]:
entries exits
c_a unit station date week weekday
TRAM2 R469 RIT-ROOSEVELT 2017-03-29 13 2 53124295 2824056
2017-03-30 13 3 53144162 2824410
2017-03-31 13 4 53161031 2824641
2017-04-01 13 5 53173541 2825021
2017-04-02 13 6 53192221 2825501
2017-04-03 14 0 26604761 1412902
2017-04-04 14 1 53234340 2826194
2017-04-05 14 2 53253030 2826522
2017-04-06 14 3 53273010 2826839
2017-04-07 14 4 62171498 3298232

Exercise 6

  • Similarly, combine everything in each station, and come up with a time series of [(date1, count1),(date2,count2),...] type of time series for each STATION, by adding up all the turnstiles in a station.

Pandas Return Total Passengers Filtered By Station and Date


In [18]:
station_daily = d.station_daily
station_daily.query('station == "59 ST"')


Out[18]:
entries exits
station date week weekday
59 ST 2017-03-25 12 5 12079560867 10083394396
2017-03-26 12 6 12079768829 10083556718
2017-03-27 13 0 15672795310 13104307168
2017-03-28 13 1 13876841143 11594370646
2017-03-29 13 2 13877291288 11594713332
2017-03-30 13 3 12079842584 10080971796
2017-03-31 13 4 12081807216 10085128441
2017-04-01 13 5 12082179709 10085383640
2017-04-02 13 6 10285999779 8575253070
2017-04-03 14 0 12082692498 10085775094
2017-04-04 14 1 12083137197 10086117226
2017-04-05 14 2 12083592333 10086463996
2017-04-06 14 3 12131879938 10125401894
2017-04-07 14 4 12084508296 10087163888

Exercise 7

  • Plot the time series for a station

In [19]:
label_size = 14

fig = plt.figure('Station 59 ST: Total Passengers', figsize=(12, 4),
                 facecolor='white', edgecolor='black')
ax1 = plt.subplot2grid((1, 1), (0, 0))

dt = station_daily.query('station == "59 ST"')
dt.plot(kind='bar', x=dt.index.levels[1], alpha=0.5, ax=ax1)

ax1.set_xlabel('Date', fontsize=label_size)
ax1.set_ylabel('Passengers', fontsize=label_size)

plt.suptitle('Station: 59 ST', fontsize=24, y=1.16);
plt.title('Total Passengers', fontsize=18, y=1.10);
fig.autofmt_xdate();



Exercise 8

  • Make one list of counts for one week for one station. Monday's count, Tuesday's count, etc. so it's a list of 7 counts. Make the same list for another week, and another week, and another week. plt.plot(week_count_list) for every week_count_list you created this way. You should get a rainbow plot of weekly commute numbers on top of each other.

In [20]:
week_59st = station_daily.query('station == "59 ST"').reset_index()
week_59st


Out[20]:
station date week weekday entries exits
0 59 ST 2017-03-25 12 5 12079560867 10083394396
1 59 ST 2017-03-26 12 6 12079768829 10083556718
2 59 ST 2017-03-27 13 0 15672795310 13104307168
3 59 ST 2017-03-28 13 1 13876841143 11594370646
4 59 ST 2017-03-29 13 2 13877291288 11594713332
5 59 ST 2017-03-30 13 3 12079842584 10080971796
6 59 ST 2017-03-31 13 4 12081807216 10085128441
7 59 ST 2017-04-01 13 5 12082179709 10085383640
8 59 ST 2017-04-02 13 6 10285999779 8575253070
9 59 ST 2017-04-03 14 0 12082692498 10085775094
10 59 ST 2017-04-04 14 1 12083137197 10086117226
11 59 ST 2017-04-05 14 2 12083592333 10086463996
12 59 ST 2017-04-06 14 3 12131879938 10125401894
13 59 ST 2017-04-07 14 4 12084508296 10087163888

In [21]:
label_size = 14

fig = plt.figure('Station 59 ST: Weekly Passengers', figsize=(12, 4),
                 facecolor='white', edgecolor='black')
ax1 = plt.subplot2grid((1, 1), (0, 0))

for w in week_59st.week.unique():
    mask = f'station == "59 ST" & week == {w}' 
    dt = station_daily.query(mask).reset_index()
    dt.plot(kind='area', x=dt.weekday, y='entries', alpha=0.5, label=f'Week: {w}', ax=ax1)

ax1.set_xlabel('Weekday', fontsize=label_size)
ax1.set_ylabel('Passengers', fontsize=label_size)
ax1.set_xticklabels(['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                     'Friday', 'Saturday', 'Sunday'])
x_min, x_max, y_min, y_max = ax1.axis()
ax1.axis((x_min, x_max, 0, 2e10))

plt.suptitle('Station: 59 ST', fontsize=24, y=1.16);
plt.title('Weekly Passengers', fontsize=18, y=1.10);
fig.autofmt_xdate();



Exercise 9

  • Over multiple weeks, sum total ridership for each station and sort them, so you can find out the stations with the highest traffic during the time you investigate

In [22]:
mask = ['station',
        pd.Series([x.week for x in d.data.time_stamp], name='week')]
station_weekly = d.data.groupby(mask)['entries', 'exits'].sum()

In [23]:
station_weekly.sort_values('entries', ascending=False)


Out[23]:
entries exits
station week
42 ST-PORT AUTH 13 319124309256 262435830256
125 ST 13 287895323372 164547715629
23 ST 13 277882381357 256229309479
CANAL ST 13 253324718914 255730969063
TIMES SQ-42 ST 13 243078609523 190566915810
42 ST-PORT AUTH 14 233193143429 191833295202
CHAMBERS ST 13 213061631997 188965251698
125 ST 14 209843368775 119566654740
23 ST 14 208665500718 191553604149
104 ST 13 198074369856 145353196485
34 ST-HERALD SQ 13 187164081772 218402299840
CANAL ST 14 184769238450 181888225014
57 ST-7 AV 13 176508770956 129526831847
FULTON ST 13 172839788886 178272857072
TIMES SQ-42 ST 14 172668839539 135188839034
14 ST 13 165818476717 120019044754
CHAMBERS ST 14 149998118277 135474224395
104 ST 14 145335605636 105669325115
59 ST COLUMBUS 13 145185563642 128107345560
34 ST-HERALD SQ 14 133651390572 155871859993
57 ST-7 AV 14 130209347655 95581664156
FLATBUSH AV-B.C 13 125164305357 91614009199
FULTON ST 14 123359594230 127200499013
72 ST 13 122209093719 124382012931
BROOKLYN BRIDGE 13 122106255392 73244065364
14 ST 14 121493871959 88910945188
34 ST-PENN STA 13 118770113598 67734867359
GRD CNTRL-42 ST 13 117237715788 31764219601
CORTLANDT ST 13 116472060669 39660290670
HIGH ST 13 115708371207 155777422744
... ... ... ...
RIT-MANHATTAN 12 48902917 600258
96 ST-2 AVE 13 48171817 39909132
14TH STREET 13 47230356 90512719
PATH NEW WTC 12 47215950 42962936
53 ST 13 47007529 25545470
PENNSYLVANIA AV 12 45849347 11983690
HARRISON 12 42524424 58839398
NEWARK C 12 39556171 5117601
TWENTY THIRD ST 12 37553291 22133458
96 ST-2 AVE 14 37313463 30873031
9TH STREET 12 36797038 36351025
BROAD CHANNEL 14 36293548 4709165
21 ST 12 35065139 34232732
JFK JAMAICA CT1 13 34952031 32331724
BEACH 44 ST 12 32479720 41452948
LACKAWANNA 12 29667137 38852195
NEWARK HM HE 13 27980895 286038147
14TH STREET 14 27241594 65042867
ATLANTIC AV 12 27215765 11242191
NEWARK HW BMEBE 12 23024534 20328977
JFK JAMAICA CT1 14 21902206 19611669
NEWARK HM HE 14 20084451 207116098
BROAD CHANNEL 12 14482752 1875154
96 ST-2 AVE 12 13336275 11077748
14TH STREET 12 12730721 26104012
JFK JAMAICA CT1 12 12475271 11929603
NEWARK HM HE 12 8056877 81902722
PATH WTC 2 13 3272204 9062403
14 1416386 5490050
12 1309408 3107090

1124 rows × 2 columns


Exercise 10

  • Make a single list of these total ridership values and plot it with plt.hist(total_ridership_counts) to get an idea about the distribution of total ridership among different stations.
    This should show you that most stations have a small traffic, and the histogram bins for large traffic volumes have small bars.

Additional Hint:
If you want to see which stations take the meat of the traffic, you can sort the total ridership counts and make a plt.bar graph. For this, you want to have two lists: the indices of each bar, and the values. The indices can just be 0,1,2,3,..., so you can do

indices = range(len(total_ridership_values))
plt.bar(indices, total_ridership_values)

In [24]:
station_group = d.data.groupby('station')
station_entries = station_group['entries'].sum()
station_entries.tail()


Out[24]:
station
WOODHAVEN BLVD     14359016536
WOODLAWN            2510281023
WORLD TRADE CTR     3213463710
YORK ST             2925643268
ZEREGA AV           4464846535
Name: entries, dtype: int64

In [25]:
label_size = 14
suptitle_size = 24
title_size = 18

bins = 50

fig = plt.figure('', figsize=(10, 8),
                 facecolor='white', edgecolor='black')
rows, cols = (2, 1)
ax1 = plt.subplot2grid((rows, cols), (0, 0))
ax2 = plt.subplot2grid((rows, cols), (1, 0))

station_entries.sort_values().plot(kind='bar', ax=ax1)

ax1.set_title('Total Passengers', fontsize=title_size);
ax1.set_xlabel('Stations', fontsize=label_size)
ax1.set_ylabel('Passengers', fontsize=label_size)
ax1.set_xticklabels('')

station_entries.plot(kind='hist', alpha=0.5, bins=bins,
                     edgecolor='black', label='_nolegend_', ax=ax2)
ax2.axvline(station_entries.mean(), color='crimson',
            label='Mean', linestyle='--')
ax2.axvline(station_entries.median(), color='black',
            label='Median', linestyle='-.')

ax2.legend()
ax2.set_xlabel('Total Passengers', fontsize=label_size)
ax2.set_ylabel('Count', fontsize=label_size)

plt.suptitle('All NYC MTA Stations', fontsize=suptitle_size, y=1.03);
plt.tight_layout();