We will explore the NYC MTA turnstile data set. These data files are from the New York Subway. It tracks the hourly entries and exits to turnstiles (UNIT) by day in the subway system.
Here is an example of what you could do with the data. James Kao investigates how subway ridership is affected by incidence of rain.
NOTE:
This notebook uses code found in the
k2datascience.nyc_mta package.
To execute all the cells do one of the following items:
In [1]:
from collections import defaultdict
import csv
import os
import os.path as osp
from dateutil.parser import parse
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from k2datascience import nyc_mta
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
In [2]:
download = False
file_quantity = 2
Scrape MTA Turnstile Web Page to extract all available data files.
In [3]:
d = nyc_mta.TurnstileData()
if download:
d.write_data_files(qty=file_quantity)
print(f'\n\nThe raw data files were written out to:\n\n{d.data_dir}')
{ ('A002','R051','02-00-00','LEXINGTON AVE'):
[
['NQR456', 'BMT', '01/03/2015', '03:00:00', 'REGULAR', '0004945474', '0001675324'],
['NQR456', 'BMT', '01/03/2015', '07:00:00', 'REGULAR', '0004945478', '0001675333'],
['NQR456', 'BMT', '01/03/2015', '11:00:00', 'REGULAR', '0004945515', '0001675364'],
...
]
}
Store all the weeks in a data structure of your choosing
In [4]:
data_file = '170401.txt'
data_dir = osp.join('..', 'data', 'nyc_mta_turnstile')
data_file_path = osp.join(data_dir, data_file)
In [5]:
turnstile = defaultdict(list)
with open(data_file_path, 'r') as f:
reader = csv.reader(f)
initial_row = True
for row in reader:
if not initial_row:
turnstile[tuple(row[:4])].append([x.strip() for x in row[4:]])
else:
header = [x.strip() for x in row]
initial_row = False
In [6]:
header
Out[6]:
In [7]:
turnstile[('A002', 'R051', '02-00-00', '59 ST')][:3]
Out[7]:
In [8]:
d.get_data()
d.data.shape
d.data.head()
Out[8]:
Out[8]:
Let's turn this into a time series.
For each key (basically the control area, unit, device address and station of a specific turnstile), have a list again, but let the list be comprised of just the point in time and the cumulative count of entries.
This basically means keeping only the date, time, and entries fields in each list. You can convert the date and time into datetime objects -- That is a python class that represents a point in time. You can combine the date and time fields into a string and use the dateutil package to convert it into a datetime object.
Your new dict should look something like
{ ('A002','R051','02-00-00','LEXINGTON AVE'):
[
[datetime.datetime(2013, 3, 2, 3, 0), 3788],
[datetime.datetime(2013, 3, 2, 7, 0), 2585],
[datetime.datetime(2013, 3, 2, 12, 0), 10653],
[datetime.datetime(2013, 3, 2, 17, 0), 11016],
[datetime.datetime(2013, 3, 2, 23, 0), 10666],
[datetime.datetime(2013, 3, 3, 3, 0), 10814],
[datetime.datetime(2013, 3, 3, 7, 0), 10229],
...
],
....
}
In [9]:
turnstile_ts = {}
for k, v in turnstile.items():
turnstile_ts[k] = [[parse(f'{x[2]} {x[3]}'), int(x[-2])] for x in v]
In [10]:
turnstile_ts[('A002', 'R051', '02-00-00', '59 ST')][:10]
Out[10]:
In [11]:
d.get_time_stamp()
d.data.shape
d.data.head()
Out[11]:
Out[11]:
In [12]:
daily_total = defaultdict(list)
for k, v in turnstile_ts.items():
days = set([x[0].date() for x in v])
for day in sorted(days):
daily_total[k].append([day, sum([x[1] for x in v if x[0].date() == day])])
In [13]:
daily_total[('A002', 'R051', '02-00-00', '59 ST')]
Out[13]:
In [14]:
d.turnstile_daily.head(10)
d.turnstile_daily.tail(10)
Out[14]:
Out[14]:
In ipython notebook, add this to the beginning of your next cell:
%matplotlib inline
This will make your matplotlib graphs integrate nicely with the notebook. To plot the time series, import matplotlib with
import matplotlib.pyplot as plt
Take the list of [(date1, count1), (date2, count2), ...], for the turnstile and turn it into two lists: dates and counts. This should plot it:
plt.figure(figsize=(10,3))
plt.plot(dates,counts)
In [15]:
label_size = 14
fig = plt.figure('Station 59 ST: Daily Turnstile Entries', figsize=(10, 3),
facecolor='white', edgecolor='black')
ax1 = plt.subplot2grid((1, 1), (0, 0))
dt = daily_total[('A002', 'R051', '02-00-00', '59 ST')]
dates = [x[0] for x in dt]
entries = [x[1] for x in dt]
ax1.plot_date(dates, entries, '^k-')
plt.suptitle('Station: 59 ST', fontsize=24, y=1.16);
plt.title('Control Area: A002 | Unit: R051 | Subunit Channel Position: 02-00-00',
fontsize=18, y=1.10);
ax1.set_xlabel('Date', fontsize=label_size)
ax1.set_ylabel('Turnstile Entries', fontsize=label_size)
fig.autofmt_xdate();
In [16]:
label_size = 14
marker_size = 5
fig = plt.figure('Station 59 ST: Daily Turnstile Entries', figsize=(10, 7),
facecolor='white', edgecolor='black')
rows, cols = (2, 1)
ax1 = plt.subplot2grid((rows, cols), (0, 0))
ax2 = plt.subplot2grid((rows, cols), (1, 0), sharex=ax1)
dt = d.turnstile_daily.query(('c_a == "A002"'
'& unit == "R051"'
'& scp == "02-00-00"'
'& station == "59 ST"'))
dt.plot(x=dt.index.levels[4], y='entries', color='IndianRed', legend=None,
markersize=marker_size, marker='o', ax=ax1)
ax1.set_title('Control Area: A002 | Unit: R051 | Subunit Channel Position: 02-00-00',
fontsize=18, y=1.10)
ax1.set_ylabel('Turnstile Entries', fontsize=label_size)
dt.plot(x=dt.index.levels[4], y='exits', color='black', legend=None,
markersize=marker_size, marker='d', ax=ax2)
ax2.set_xlabel('Date', fontsize=label_size)
ax2.set_ylabel('Turnstile Exits', fontsize=label_size)
plt.suptitle('Station: 59 ST', fontsize=24, y=1.04);
plt.tight_layout()
fig.autofmt_xdate();
We want to combine the numbers together -- for each ControlArea/UNIT/STATION combo, for each day, add the counts from each turnstile belonging to that combo.
In [17]:
d.get_station_daily(control_area=True, unit=True)
station_daily_all = d._station_daily
station_daily_all.head(10)
station_daily_all.tail(10)
Out[17]:
Out[17]:
In [18]:
station_daily = d.station_daily
station_daily.query('station == "59 ST"')
Out[18]:
In [19]:
label_size = 14
fig = plt.figure('Station 59 ST: Total Passengers', figsize=(12, 4),
facecolor='white', edgecolor='black')
ax1 = plt.subplot2grid((1, 1), (0, 0))
dt = station_daily.query('station == "59 ST"')
dt.plot(kind='bar', x=dt.index.levels[1], alpha=0.5, ax=ax1)
ax1.set_xlabel('Date', fontsize=label_size)
ax1.set_ylabel('Passengers', fontsize=label_size)
plt.suptitle('Station: 59 ST', fontsize=24, y=1.16);
plt.title('Total Passengers', fontsize=18, y=1.10);
fig.autofmt_xdate();
plt.plot(week_count_list)
for every week_count_list
you created this way. You should get a rainbow plot of weekly commute numbers on top of each other.
In [20]:
week_59st = station_daily.query('station == "59 ST"').reset_index()
week_59st
Out[20]:
In [21]:
label_size = 14
fig = plt.figure('Station 59 ST: Weekly Passengers', figsize=(12, 4),
facecolor='white', edgecolor='black')
ax1 = plt.subplot2grid((1, 1), (0, 0))
for w in week_59st.week.unique():
mask = f'station == "59 ST" & week == {w}'
dt = station_daily.query(mask).reset_index()
dt.plot(kind='area', x=dt.weekday, y='entries', alpha=0.5, label=f'Week: {w}', ax=ax1)
ax1.set_xlabel('Weekday', fontsize=label_size)
ax1.set_ylabel('Passengers', fontsize=label_size)
ax1.set_xticklabels(['Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday'])
x_min, x_max, y_min, y_max = ax1.axis()
ax1.axis((x_min, x_max, 0, 2e10))
plt.suptitle('Station: 59 ST', fontsize=24, y=1.16);
plt.title('Weekly Passengers', fontsize=18, y=1.10);
fig.autofmt_xdate();
In [22]:
mask = ['station',
pd.Series([x.week for x in d.data.time_stamp], name='week')]
station_weekly = d.data.groupby(mask)['entries', 'exits'].sum()
In [23]:
station_weekly.sort_values('entries', ascending=False)
Out[23]:
plt.hist(total_ridership_counts)
to get an idea about the distribution of total ridership among different stations.Additional Hint:
If you want to see which stations take the meat of the traffic, you can sort the total ridership counts and make a plt.bar
graph. For this, you want to have two lists: the indices of each bar, and the values. The indices can just be 0,1,2,3,...
, so you can do
indices = range(len(total_ridership_values))
plt.bar(indices, total_ridership_values)
In [24]:
station_group = d.data.groupby('station')
station_entries = station_group['entries'].sum()
station_entries.tail()
Out[24]:
In [25]:
label_size = 14
suptitle_size = 24
title_size = 18
bins = 50
fig = plt.figure('', figsize=(10, 8),
facecolor='white', edgecolor='black')
rows, cols = (2, 1)
ax1 = plt.subplot2grid((rows, cols), (0, 0))
ax2 = plt.subplot2grid((rows, cols), (1, 0))
station_entries.sort_values().plot(kind='bar', ax=ax1)
ax1.set_title('Total Passengers', fontsize=title_size);
ax1.set_xlabel('Stations', fontsize=label_size)
ax1.set_ylabel('Passengers', fontsize=label_size)
ax1.set_xticklabels('')
station_entries.plot(kind='hist', alpha=0.5, bins=bins,
edgecolor='black', label='_nolegend_', ax=ax2)
ax2.axvline(station_entries.mean(), color='crimson',
label='Mean', linestyle='--')
ax2.axvline(station_entries.median(), color='black',
label='Median', linestyle='-.')
ax2.legend()
ax2.set_xlabel('Total Passengers', fontsize=label_size)
ax2.set_ylabel('Count', fontsize=label_size)
plt.suptitle('All NYC MTA Stations', fontsize=suptitle_size, y=1.03);
plt.tight_layout();