We will explore the NYC MTA turnstile data set. These data files come from the New York City subway: they track the cumulative entry and exit counts of each turnstile (grouped by control area and UNIT), sampled several times a day.
Here is an example of what you could do with the data: James Kao investigates how subway ridership is affected by the incidence of rain.
{ ('A002','R051','02-00-00','LEXINGTON AVE'):
[
['NQR456', 'BMT', '01/03/2015', '03:00:00', 'REGULAR', '0004945474', '0001675324'],
['NQR456', 'BMT', '01/03/2015', '07:00:00', 'REGULAR', '0004945478', '0001675333'],
['NQR456', 'BMT', '01/03/2015', '11:00:00', 'REGULAR', '0004945515', '0001675364'],
...
]
}
Store all the weeks in a data structure of your choosing
In [2]:
import csv
import os
Field description: each row contains the fields C/A, UNIT, SCP, STATION, LINENAME, DIVISION, DATE, TIME, DESC, ENTRIES, EXITS; the first four identify an individual turnstile (control area, remote unit, device address, station) and the last two are cumulative register counts.
In [3]:
turnstile = {}
# looping through all files in the data dir starting with MTA_Turnstile
for filename in os.listdir('data'):
    if filename.startswith('MTA_Turnstile'):
        # reading the file and writing each row into a dict keyed by (C/A, UNIT, SCP, STATION)
        with open(os.path.join('data', filename), newline='') as csvfile:
            mtareader = csv.reader(csvfile, delimiter=',')
            next(mtareader)  # skip the header row
            for row in mtareader:
                key = (row[0], row[1], row[2], row[3])
                value = [row[4], row[5], row[6], row[7], row[8], row[9], row[10].rstrip()]
                if key in turnstile:
                    turnstile[key].append(value)
                else:
                    turnstile[key] = [value]
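As an aside, the if/else insertion pattern used here (and again in later cells) could also be written with collections.defaultdict, which supplies the empty list automatically. The snippet below is just a sketch of that alternative (turnstile_dd is a made-up name) and is not used in the rest of the notebook:
from collections import defaultdict

turnstile_dd = defaultdict(list)  # missing keys start out as an empty list
for filename in os.listdir('data'):
    if filename.startswith('MTA_Turnstile'):
        with open(os.path.join('data', filename), newline='') as csvfile:
            mtareader = csv.reader(csvfile, delimiter=',')
            next(mtareader)  # skip the header row
            for row in mtareader:
                # key: (C/A, UNIT, SCP, STATION); value: the remaining fields
                turnstile_dd[tuple(row[:4])].append([f.rstrip() for f in row[4:]])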
In [4]:
# test value for dict
test = ('A002','R051','02-00-00','59 ST')
In [5]:
turnstile[test]#[:2]
Out[5]:
Let's turn this into a time series.
For each key (basically the control area, unit, device address, and station of a specific turnstile), keep a list again, but let each list element consist of just the point in time and the cumulative count of entries.
This basically means keeping only the date, time, and entries fields in each list. You can convert the date and time into datetime objects -- datetime is a Python class that represents a point in time. Combine the date and time fields into a single string and use the dateutil module to parse it into a datetime object.
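For example, a minimal sketch of that conversion, using one of the sample timestamps above:
from dateutil.parser import parse
parse('01/03/2015' + ' ' + '03:00:00')   # -> datetime.datetime(2015, 1, 3, 3, 0)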
Your new dict should look something like
{ ('A002','R051','02-00-00','LEXINGTON AVE'):
[
[datetime.datetime(2013, 3, 2, 3, 0), 3788],
[datetime.datetime(2013, 3, 2, 7, 0), 2585],
[datetime.datetime(2013, 3, 2, 12, 0), 10653],
[datetime.datetime(2013, 3, 2, 17, 0), 11016],
[datetime.datetime(2013, 3, 2, 23, 0), 10666],
[datetime.datetime(2013, 3, 3, 3, 0), 10814],
[datetime.datetime(2013, 3, 3, 7, 0), 10229],
...
],
....
}
In [6]:
import numpy as np
import datetime
from dateutil.parser import parse
In [7]:
# Unlike the solutions, I convert the cumulative entries into the number of entries per period here;
# that should be fine, since it is required below anyway...
turnstile_timeseries = {}
# looping through each key in the dict, parsing the date and calculating the difference between the previous and current count
for key in turnstile:
    prev = np.nan  # the first reading of each series has no previous count, so its difference is NaN
    value = []
    for el in turnstile[key]:
        value.append([parse(el[2] + ' ' + el[3]), int(el[5]) - prev])
        prev = int(el[5])
    if key in turnstile_timeseries:
        turnstile_timeseries[key].append(value)
    else:
        turnstile_timeseries[key] = value
In [8]:
turnstile_timeseries[test]#[:5]
# ('R305', 'R206', '01-00-00','125 ST')
Out[8]:
In [9]:
# The solutions include a check for abnormal values; I add it in the exercises below,
# because I only found out about the problem later in the analysis
turnstile_daily = {}
# looping through each key in the time series, accumulating partial counts and tracking when the date changes
for key in turnstile_timeseries:
    value = []
    prev_date = ''
    daily_entries = 0
    for el in turnstile_timeseries[key]:
        curr_date = el[0].date()
        daily_entries += el[1]
        # when the current date differs from the previous one, write the value to the dict and reset the counters;
        # checking that the date isn't empty avoids writing the initial values for each key
        if prev_date != curr_date:
            if prev_date != '':
                value.append([prev_date, daily_entries])
            daily_entries = 0
            prev_date = curr_date
    # always write the last value of the loop: it is the closing value of the period
    value.append([prev_date, daily_entries])
    if key in turnstile_daily:
        turnstile_daily[key].append(value)
    else:
        turnstile_daily[key] = value
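As an aside, the same day-by-day roll-up could be sketched with itertools.groupby, grouping each (already chronological) series by calendar date. Note that this variant assigns every reading to the day of its own timestamp, so its day boundaries differ slightly from the date-change logic above, and turnstile_daily_alt is a made-up name:
import math
from itertools import groupby

turnstile_daily_alt = {}
for key, readings in turnstile_timeseries.items():
    daily = []
    for day, group in groupby(readings, key=lambda el: el[0].date()):
        # sum the per-period counts of that day, skipping the NaN of the very first reading
        total = sum(c for _, c in group if not (isinstance(c, float) and math.isnan(c)))
        daily.append([day, total])
    turnstile_daily_alt[key] = daily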
In [10]:
turnstile_daily[test]
Out[10]:
In the IPython notebook, add this to the beginning of your next cell:
%matplotlib inline
This will make your matplotlib graphs integrate nicely with the notebook. To plot the time series, import matplotlib with
import matplotlib.pyplot as plt
Take the list of [(date1, count1), (date2, count2), ...] for the turnstile and turn it into two lists: dates and counts. This should plot it:
plt.figure(figsize=(10,3))
plt.plot(dates,counts)
In [11]:
import matplotlib.pyplot as plt
%matplotlib inline
In [12]:
# using list comprehension, there are other ways such as dict.keys() and dict.items()
dates = [el[0] for el in turnstile_daily[test]]
counts = [el[1] for el in turnstile_daily[test]]
fig = plt.figure(figsize=(14, 5))
ax = plt.axes()
ax.plot(dates, counts)
plt.grid(True);
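As the comment above says, there are other ways to split the pairs; for instance, zip can unpack the same list in one step (a tuple-returning sketch, equivalent for plotting):
dates, counts = zip(*turnstile_daily[test])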
We want to combine the numbers together -- for each ControlArea/UNIT/STATION combo, for each day, add the counts from each turnstile belonging to that combo.
In [68]:
temp = {}
# for each key, form the new (C/A, UNIT, STATION) key and check whether it is already in the new dict;
# the date is appended to the key in this temp dict to make it easier to sum the values,
# then a second pass builds a new dict with the required keys
for key in turnstile_daily:
    new_key = list(key[0:2]) + list(key[-1:])
    for el in turnstile_daily[key]:
        # set single negative values to 0; possible cause: strange things in the data,
        # such as cumulative totals that decrease from one reading to the next
        # also set single values over 10,000,000 to 0 so they don't distort the totals;
        # possible cause: data recovery
        value = np.int64(el[1])
        if value < 0 or value > 10000000:
            value = 0  # maybe NaN would be a better choice...
        if tuple(new_key + [el[0]]) in temp:
            temp[tuple(new_key + [el[0]])] += value
        else:
            temp[tuple(new_key + [el[0]])] = value
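As an aside, the summation into temp could also be written with collections.Counter, which starts missing keys at zero. This is only a sketch with a made-up name (combo_day_totals), keeping the same abnormal-value filter as above:
from collections import Counter

combo_day_totals = Counter()
for key in turnstile_daily:
    ca, unit, scp, stat = key
    for day, count in turnstile_daily[key]:
        value = np.int64(count)
        if value < 0 or value > 10000000:
            value = 0
        combo_day_totals[(ca, unit, stat, day)] += value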
In [69]:
ca_unit_station = {}
for key in temp:
    new_key = key[0:3]
    date = key[-1]
    if new_key in ca_unit_station:
        ca_unit_station[new_key].append([date, temp[key]])
    else:
        ca_unit_station[new_key] = [[date, temp[key]]]
In [70]:
ca_unit_station[('R305', 'R206', '125 ST')]
Out[70]:
In [71]:
temp = {}
# same as above, but the new key is just the station:
# the date is appended to the key in this temp dict to make it easier to sum the values,
# then a second pass builds a new dict with the required keys
for key in turnstile_daily:
    new_key = key[-1]
    for el in turnstile_daily[key]:
        # same abnormal-value handling as above: negative values and values over 10,000,000 are set to 0
        value = np.int64(el[1])
        if value < 0 or value > 10000000:
            value = 0
        if (new_key, el[0]) in temp:
            temp[(new_key, el[0])] += value
        else:
            temp[(new_key, el[0])] = value
In [72]:
station = {}
for key in temp:
    new_key = key[0]
    date = key[-1]
    if new_key in station:
        station[new_key].append([date, temp[key]])
    else:
        station[new_key] = [[date, temp[key]]]
In [84]:
station['59 ST']
Out[84]:
In [76]:
test_station = '59 ST'
dates = [el[0] for el in station[test_station]]
counts = [el[1] for el in station[test_station]]
fig = plt.figure(figsize=(14, 5))
ax = plt.axes()
ax.plot(dates, counts)
plt.grid(True);
Plot
plt.plot(week_count_list)
for every week_count_list you created this way. You should get a rainbow plot of the weekly commute numbers on top of each other.
In [77]:
fig = plt.figure(figsize=(16, 6))
ax = plt.axes()
n = len(station[test_station])
# creating a list with all the counts for the station
all_counts = [el[1] for el in station[test_station]]
# splitting counts every 7 values to get weekly data
for i in range(int(np.floor(n/7))):
ax.plot(all_counts[i*7: 7 + i*7])
ax.set_xticklabels(['', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])
plt.grid('on');
In [78]:
total_ridership = {}
# just looping through keys and summing all elements inside the dict
for key in station:
for el in station[key]:
if key in total_ridership:
total_ridership[key] += el[1]
else:
total_ridership[key] = el[1]
In [99]:
import operator
sorted(total_ridership.items(), key=operator.itemgetter(1), reverse=True)
Out[99]:
Plot
plt.hist(total_ridership_counts)
to get an idea of the distribution of total ridership among the different stations.
Additional hint: if you want to see which stations take the meat of the traffic, you can sort the total ridership counts and make a plt.bar graph. For this, you want two lists: the indices of each bar and the values. The indices can just be 0, 1, 2, 3, ..., so you can do
indices = range(len(total_ridership_values))
plt.bar(indices, total_ridership_values)
In [105]:
fig = plt.figure(figsize=(16, 10))
ax = plt.axes()
ax.hist(list(total_ridership.values()));
In [108]:
fig = plt.figure(figsize=(16, 10))
ax = plt.axes()
ax.bar(range(len(total_ridership)), sorted(list(total_ridership.values())));