Humberto Ortiz-Zuazaga humberto.ortiz@upr.edu
We got some data from "Autoridad Metropolitana de Autobuses", and I wanted to see what bus service to my neighborhood looks like. I picked a spot near my home, and grepped the data files for partial coordinates:
$ cat AMA20140[12345].txt | grep ",18370" | grep ",-66066" > parana-big.txt
The data is a csv file with vehicle number, date and time, latitude, longitude, speed (km/h), and a code (1 for in motion, 2 for stopped, 4 for off):
$ head parana.txt
823,20140107205422,18370866,-66066824,25,1
823,20140107205427,18370545,-66066850,24,1
823,20140107205432,18370246,-66066868,20,1
823,20140116154618,18370701,-66066830,18,1
823,20140116154623,18370535,-66066846,7,1
823,20140116154703,18370331,-66066865,19,1
823,20140116154708,18370140,-66066874,18,1
823,20140116161308,18370128,-66066767,5,1
823,20140116161318,18370310,-66066757,7,1
823,20140116161356,18370485,-66066739,12,1
This particular file has 16773 lines.
In [1]:
%matplotlib inline
In [2]:
import csv
from datetime import datetime, timedelta
def timeparse(ds):
timestamp = datetime.strptime(ds, "%Y%m%d%H%M%S")
#return "{0}-{1}-{2}T{3}:{4}:{5}Z".format(year, month, day, hh, mm, ss)
return timestamp
counts = [0] * 24
trips = []
times = []
last = datetime.fromordinal(1)
epsilon = timedelta(minutes=15) # it takes around 20 minutes for the bus to come back
lastbus = 0
with open('parana-big.txt', 'rb') as csvfile:
amareader = csv.reader(csvfile)
for row in amareader:
timestamp = timeparse(row[1])
bus = row[0]
tdiff = abs(timestamp - last)
# make sure it's a different bus, not the same stuck in traffic
# the data is sorted by vehicle number, then date
if tdiff > epsilon or bus != lastbus:
# print timestamp, last, row
lastbus = bus
last = timestamp
hour = timestamp.hour
counts[hour] += 1
trips.append(timestamp)
midnight = timestamp.replace(hour=0, minute=0, second=0)
diff = timestamp - midnight
times.append(diff.total_seconds())
In [3]:
counts
Out[3]:
My niece is always complaining there was no service between 11:00 AM and 2:00 PM. Let's just have a quick look to see if there really is no service:
In [4]:
from pylab import *
bar(range(24), counts)
xlim(0,24)
Out[4]:
That's actually not so bad. There is a lull in service at 1:00 PM, but busses seem to still be running. Let's dig a little deeper. We have the timestamp of each trip recorded, but they're not in order.
In [5]:
trips[:10]
Out[5]:
In [6]:
trips.sort() # sort by datetime
trips[:10]
Out[6]:
Now that the timestamps are sorted, we can check how long a person that just missed a bus will wait for the next one.
In [7]:
intervals = [0] * 24
counts = [0] * 24
for i in range(1, len(trips)):
wait = trips[i] - trips[i-1] # how long since the last bus?
hour = trips[i].hour
# count how many hours you waited if you missed the last bus
intervals[hour] += wait.seconds / 60.0**2
counts[hour] += 1
Now we can compute the average wait time per hour blocks.
In [8]:
for i in range(24):
if counts[i]: # don't divide by zero
intervals[i] = intervals[i] / counts[i]
intervals
Out[8]:
In [9]:
bar(range(24), intervals)
xlim(0,24)
Out[9]:
In [10]:
len(trips)
Out[10]:
In [11]:
import numpy as np
day = 24 * 60 * 60
bins = np.arange(day / (6 * 60) ,dtype = np.float_) * 6 * 60
plt = hist(times, bins)
xlim(0, 240 * 6* 60)
Out[11]:
In [12]:
trips[:20]
Out[12]:
In [13]:
trips[15:15+30]
Out[13]:
In [ ]: