In my last post I created probability distributions for all Divvy bike stations in Chicago minute by minute to predict if a Divvy bike would be available at the time I'd need it. In this post I dive into the data a bit deeper and begin to visualize changes over time.

You can read the previous article here:

http://www.databozo.com/2014/01/25/Predicting_Divvy_bike_availability_with_MRJob_and_Python.html

Getting my data into the right form

Let's start by finding the closest station to me.


In [2]:
import json
import math

results_filepath = '/Users/justin/Documents/Code/divvy_probability/results.txt'
my_longitude = -87.664362
my_latitude = 42.007758
closest_station = None
closest_station_distance = float('inf')

with open(results_filepath, 'r') as results_file:
    for line in results_file:
        station_data = json.loads(line)
        station_latitude = station_data['latitude']
        station_longitude = station_data['longitude']
        distance = math.sqrt(math.pow(station_latitude-my_latitude, 2)+math.pow(station_longitude-my_longitude, 2))
        if distance < closest_station_distance:
            closest_station = {'id': station_data['station_id'], 'name': station_data['station_name']}
            closest_station_distance = distance
        if distance < closest_station_distance:
            closest_station = {'id': station_data['station_id'], 'name': station_data['station_name']}
            closest_station_distance = distance
 
print closest_station


{'id': 294, 'name': u'Broadway & Berwyn Ave'}
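As an aside: plain Euclidean distance on raw latitude/longitude is only a rough proxy, since a degree of longitude shrinks as you move away from the equator. It's fine for ranking nearby stations, but if I wanted real distances I could swap in the haversine formula. A minimal sketch (the station coordinates here are made up for illustration):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometers.
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# My coordinates vs. a hypothetical station location
print(haversine_km(42.007758, -87.664362, 42.0026, -87.6598))
```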

Now, I bet we don't run out of bikes very often on weekends... but let's start with visualizing every minute of the weekend and see if my hypothesis holds any water.

First let me make sure I have a sample for every minute of every hour in a day.


In [3]:
station_minutes = []
with open(results_filepath, 'r') as results_file:
    for line in results_file:
        station_data = json.loads(line)
        if station_data['station_id'] == closest_station['id'] and not station_data['weekday']:
            station_minutes.append(station_data)
len(station_minutes)


Out[3]:
1440

In [3]:
1440/60.0


Out[3]:
24.0

Looks like I have exactly 1,440 one-minute bins, which is a full 24 hours. I wonder how I might visualize this. Maybe by animating every minute of the day as a histogram? To do that I can leverage the animation features of matplotlib.
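Before animating, each minute's bin of raw counts gets normalized into a probability distribution; that's what the `minute_frequencies` line in the next cell does. A tiny standalone sketch of that normalization, with made-up counts:

```python
import numpy as np

# Made-up counts: how many times we observed 0, 1, 2, or 3 bikes in this minute
counts = np.array([0, 3, 5, 2])
frequencies = counts / (1.0 * counts.sum())  # normalize so the bin sums to 1
print(frequencies)
```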


In [282]:
import matplotlib.pyplot as plt
import numpy
from matplotlib import animation

sorted_minutes = sorted(station_minutes, key=lambda x: x['hour']*60+x['minute'])
minute_frequencies = map(lambda x: numpy.array(x['available_bikes'])/(1.0*sum(numpy.array(x['available_bikes']))), sorted_minutes)
fig = plt.figure(figsize=(4, 4))

def animate(frame_index):
    plt.cla()
    plt.xlim(0, 15)
    plt.ylim(0, 1)
    plt.title('Time of day - %02d:%02d' % (frame_index/60, frame_index%60))
    plt.bar(range(0, 100), minute_frequencies[frame_index])

anim = animation.FuncAnimation(fig, animate, frames=1440)
anim.save('./demoanimation.gif', fps=4, codec='gif');


<matplotlib.figure.Figure at 0x10c9844d0>

OK, it errored out. I'm not really sure why, BUT the process still left me with all 1,440 images I need; it just didn't weave them together into an animation for me. I fought with matplotlib on this for a while and still don't get it. Instead, I'm going to move forward by using another tool to create an animation out of the images matplotlib created.

ImageMagick to the rescue! This converts the 1,440 separate images that I have into an animated gif:

convert -delay 0 _tmp*.png -loop 0 ../divvy_weekend_near_me_anim.gif

The problem now is that the resulting gif is huge and runs very slow. That also means that my hosting costs would probably go up if too many people read this. My next thought was that it'd be nice to convert it to a compressed video and just host it on YouTube. Here's how to do that.

I found a tool called "ffmpeg" and used the following command at Terminal:

ffmpeg -r 60 -pattern_type glob -i '_tmp*.png' -c:v libx264 -pix_fmt yuv420p divvy_weekend_video.mov

Let me quickly describe what's going on here.

-r 60

means that it should show 60 frames per second

-pattern_type glob -i '_tmp*.png'

is how I point the program at my individual image files

-c:v libx264 -pix_fmt yuv420p

is how I specify an encoding that both my computer and YouTube can play.

The final argument is just the name I want for my final movie file.
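As a quick sanity check on playback length, 1,440 frames at 60 frames per second works out to a 24-second video:

```python
frames = 1440   # one frame per minute of the day
fps = 60        # the -r 60 flag
print(frames / fps)  # seconds of video
```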

That gives me this as a final result:


In [42]:
from IPython.display import YouTubeVideo
YouTubeVideo('z6v6CVP0bwc')


Out[42]:

So that's the weekend data, now let's create a video of the weekday data and see how it differs.


In [5]:
station_minutes = []
with open(results_filepath, 'r') as results_file:
    for line in results_file:
        station_data = json.loads(line)
        if station_data['station_id'] == closest_station['id'] and station_data['weekday']:
            station_minutes.append(station_data)

import matplotlib.pyplot as plt
import numpy
from matplotlib import animation

sorted_minutes = sorted(station_minutes, key=lambda x: x['hour']*60+x['minute'])
minute_frequencies = map(lambda x: numpy.array(x['available_bikes'])/(1.0*sum(numpy.array(x['available_bikes']))), sorted_minutes)
fig = plt.figure(figsize=(4, 4))

def animate(frame_index):
    plt.cla()
    plt.xlim(0, 15)
    plt.ylim(0, 1)
    plt.title('Time of day - %02d:%02d' % (frame_index/60, frame_index%60))
    plt.bar(range(0, 100), minute_frequencies[frame_index])

#anim = animation.FuncAnimation(fig, animate, frames=1440)
#anim.save('./demoanimation.gif', fps=4, codec='gif');


<matplotlib.figure.Figure at 0x10cf14310>

In [44]:
from IPython.display import YouTubeVideo
YouTubeVideo('sXoL9XRUSZo')


Out[44]:

Visualizing animated data statically

While the above animation might be interesting (it definitely got my fiancée engaged), it's not as convenient to analyze as a static chart. One reason is that I have to watch for several seconds waiting for shifts to happen. I even tried speeding the video up to 240 frames per second, but it still wasn't ideal.

But now I'm thinking... A 2-D image that changes over time is really just a static 3-D image. I think it could very well turn out to be too abstract for a general audience but it's worth trying. So how can we get to a 3D image?

Again matplotlib has support for this feature. Here's my setup:


In [34]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(14,10))
ax = fig.add_subplot(111,projection='3d')

for z in np.arange(1440):
    xs = np.arange(14)
    ys = minute_frequencies[z][:14]
    ax.bar(xs, ys, zs=z, zdir='y', alpha=1)

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')

plt.show()


OOOF, that's terrible. It turns out that in order to do much better I'd have to create a mesh, and I have no clue what that's about. In my frustrated fervor to figure out how to finally plot this data in 3D, I stumbled on another fantastic idea: a heatmap. In the heatmap below I can visualize the histogram of bike availability for every minute, for both weekdays and weekends.


In [256]:
def load_data_for_dock_near_me():
    results_filepath = '/Users/justin/Documents/Code/divvy_probability/results.txt'
    weekday_station_minutes = []
    weekend_station_minutes = []
    with open(results_filepath, 'r') as results_file:
        for line in results_file:
            station_data = json.loads(line)
            if station_data['station_id'] == closest_station['id']:
                station_data['available_bikes'] = np.array(station_data['available_bikes'])
                station_data['available_docks'] = np.array(station_data['available_docks'])
                if station_data['weekday']:
                    weekday_station_minutes.append(station_data)
                else:
                    weekend_station_minutes.append(station_data)
    weekday_station = sorted(weekday_station_minutes, key=lambda x: x['hour']*60+x['minute'])
    weekend_station = sorted(weekend_station_minutes, key=lambda x: x['hour']*60+x['minute'])
    return weekday_station, weekend_station

weekday_station_minutes, weekend_station_minutes = load_data_for_dock_near_me()

In [257]:
cropped_weekday_minute_frequencies = np.array(map(lambda x: x['available_bikes'][:14]/(1.0*np.sum(x['available_bikes'])), weekday_station_minutes))
cropped_weekend_minute_frequencies = np.array(map(lambda x: x['available_bikes'][:14]/(1.0*np.sum(x['available_bikes'])), weekend_station_minutes))

fig = plt.figure(figsize=(4,20))
fig.suptitle('Bike availability likelihoods', fontsize=18)
ax = fig.add_subplot(121)
ax.set_ylim([0,1440])
ax.set_title('Weekday')
plt.xlabel('Bikes available')
plt.ylabel('Time of weekday')
weekday_heatmap = plt.pcolor(cropped_weekday_minute_frequencies, vmin=0, vmax=.4)
weekday_cbar = plt.colorbar(weekday_heatmap)
plt.yticks([0*180, 180, 2*180,3*180,4*180,5*180, 6*180, 7*180, 8*180], ['12:00a', '3:00a', '6:00a', '9:00a', '12:00p', '3:00p', '6:00p', '9:00p', '11:59p'])

bx = fig.add_subplot(122)
bx.set_ylim([0,1440])
bx.set_title('Weekend')
plt.xlabel('Bikes available')
plt.ylabel('')
weekend_heatmap = plt.pcolor(cropped_weekend_minute_frequencies, vmin=0, vmax=.4)
weekend_cbar = plt.colorbar(weekend_heatmap)
weekend_cbar.set_label('Likelihood of N bikes available', rotation=90)
plt.yticks([])#[0*180, 180, 2*180,3*180,4*180,5*180, 6*180, 7*180, 8*180], ['12:00a', '3:00a', '6:00a', '9:00a', '12:00p', '3:00p', '6:00p', '9:00p', '11:59p'])

fig.subplots_adjust(wspace=-0.15)
plt.subplots_adjust(top=.95)
plt.show()


This speaks to me pretty well! It's not perfect, but I can definitely get an idea of what's happening over the course of an average weekday and weekend (pretty much nothing). Now the other side of this Divvy coin is dock availability. Let's chart that too:


In [258]:
cropped_weekday_dock_frequencies = np.array(map(lambda x: x['available_docks'][:14]/(1.0*np.sum(x['available_docks'])), weekday_station_minutes))
cropped_weekend_dock_frequencies = np.array(map(lambda x: x['available_docks'][:14]/(1.0*np.sum(x['available_docks'])), weekend_station_minutes))


fig = plt.figure(figsize=(4,20))
fig.suptitle('Dock availability likelihoods', fontsize=18)
ax = fig.add_subplot(121)
ax.set_ylim([0,1440])
ax.set_title('Weekday')
plt.xlabel('Docks available')
plt.ylabel('Time of weekday')
weekday_dock_heatmap = plt.pcolor(cropped_weekday_dock_frequencies, vmin=0, vmax=.4)
weekday_dock_cbar = plt.colorbar(weekday_dock_heatmap)
plt.yticks([0*180, 180, 2*180,3*180,4*180,5*180, 6*180, 7*180, 8*180], ['12:00a', '3:00a', '6:00a', '9:00a', '12:00p', '3:00p', '6:00p', '9:00p', '11:59p'])

bx = fig.add_subplot(122)
bx.set_ylim([0,1440])
bx.set_title('Weekend')
plt.xlabel('Docks available')
plt.ylabel('')
weekend_dock_heatmap = plt.pcolor(cropped_weekend_dock_frequencies, vmin=0, vmax=.4)
weekend_dock_cbar = plt.colorbar(weekend_dock_heatmap)
weekend_dock_cbar.set_label('Likelihood of N docks available', rotation=90)
plt.yticks([])#[0*180, 180, 2*180,3*180,4*180,5*180, 6*180, 7*180, 8*180], ['12:00a', '3:00a', '6:00a', '9:00a', '12:00p', '3:00p', '6:00p', '9:00p', '11:59p'])

fig.subplots_adjust(wspace=-0.15)
plt.subplots_adjust(top=.95)
plt.show()


Vetting data via exploration

Now let me graph bike availability at the Divvy station by city hall and the Daley Center in downtown Chicago:


In [192]:
weekday_daley_station_minutes = []
weekend_daley_station_minutes = []
with open(results_filepath, 'r') as results_file:
    for line in results_file:
        station_data = json.loads(line)
        if station_data['station_id'] == 81:
            if station_data['weekday']:
                weekday_daley_station_minutes.append(np.array(station_data['available_bikes']))
            else:
                weekend_daley_station_minutes.append(np.array(station_data['available_bikes']))

Some quick sanity checking by looking for exactly 1440 minutes of data in both the weekday and weekend frequencies...


In [196]:
print "Weekday minutes: " + str(len(weekday_daley_station_minutes))
print "Weekend minutes: " + str(len(weekend_daley_station_minutes))


Weekday minutes: 4307
Weekend minutes: 3437

Whoa. I bet something changed and caused my aggregation job to split up this station. Let's explore it a bit more. Below there should be only two keys and each key should have 1440 counts.


In [208]:
import pprint

station_info = {}
with open(results_filepath, 'r') as results_file:
    for line in results_file:
        station_data = json.loads(line)
        if station_data['station_id'] == 81:
            key = (station_data['station_id'], station_data['weekday'], station_data['longitude'], station_data['latitude'])
            if not key in station_info:
                station_info[key] = 0
            station_info[key] += 1
            
pprint.pprint(station_info)


{(81, False, -87.630183, 41.884337): 1440,
 (81, False, -87.62971344, 41.8833135): 1997,
 (81, True, -87.630183, 41.884337): 1440,
 (81, True, -87.62971344, 41.8833135): 2867}

Ah. The GPS coordinates have shifted slightly. For the purposes of this article I'm going to leave this discovery in here and plow ahead to just visualize some different data. I will be correcting this reporting bug though for future analysis purposes.

This is a key reason to always perform sanity checks on your data.
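The counting pattern in the cell above — key each record by (station, weekday, coordinates) and tally — can also be written with `collections.Counter`. A sketch over made-up records shaped like the real station data:

```python
from collections import Counter

# Hypothetical records mimicking the shape of the real station data
records = [
    {'station_id': 81, 'weekday': True, 'longitude': -87.630183},
    {'station_id': 81, 'weekday': True, 'longitude': -87.630183},
    {'station_id': 81, 'weekday': True, 'longitude': -87.629713},
]
counts = Counter((r['station_id'], r['weekday'], r['longitude']) for r in records)
print(counts.most_common())
```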

So since there is a matching set of data that has exactly 1440 minutes in it, I'm going to plot that.


In [209]:
weekday_daley_station_minutes = []
weekend_daley_station_minutes = []
with open(results_filepath, 'r') as results_file:
    for line in results_file:
        station_data = json.loads(line)
        if station_data['station_id'] == 81 and station_data['longitude'] == -87.630183:
            if station_data['weekday']:
                weekday_daley_station_minutes.append(station_data)
            else:
                weekend_daley_station_minutes.append(station_data)
# Sort each day's records into minute order, then keep just the availability counts.
weekday_daley_station_minutes = [np.array(x['available_bikes']) for x in sorted(weekday_daley_station_minutes, key=lambda x: x['hour']*60+x['minute'])]
weekend_daley_station_minutes = [np.array(x['available_bikes']) for x in sorted(weekend_daley_station_minutes, key=lambda x: x['hour']*60+x['minute'])]

And sanity check our data one more time...


In [210]:
print "Weekday minutes: " + str(len(weekday_daley_station_minutes))
print "Weekend minutes: " + str(len(weekend_daley_station_minutes))


Weekday minutes: 1440
Weekend minutes: 1440

OK. Good. Moving along...


In [212]:
cropped_weekday_daley_minute_frequencies = np.array(map(lambda x: x[:40]/(1.0*np.sum(x)), weekday_daley_station_minutes))
cropped_weekend_daley_minute_frequencies = np.array(map(lambda x: x[:40]/(1.0*np.sum(x)), weekend_daley_station_minutes))

fig = plt.figure(figsize=(4,20))
fig.suptitle('Daley Center bike availability likelihoods', fontsize=18)
ax = fig.add_subplot(121)
ax.set_ylim([0,1440])
ax.set_title('Weekday')
plt.xlabel('Bikes available')
plt.ylabel('Time of weekday')
weekday_daley_heatmap = plt.pcolor(cropped_weekday_daley_minute_frequencies, vmin=0, vmax=.4)
weekday_daley_cbar = plt.colorbar(weekday_daley_heatmap)
plt.yticks([0*180, 180, 2*180,3*180,4*180,5*180, 6*180, 7*180, 8*180], ['12:00a', '3:00a', '6:00a', '9:00a', '12:00p', '3:00p', '6:00p', '9:00p', '11:59p'])

bx = fig.add_subplot(122)
bx.set_ylim([0,1440])
bx.set_title('Weekend')
plt.xlabel('Bikes available')
plt.ylabel('')
weekend_daley_heatmap = plt.pcolor(cropped_weekend_daley_minute_frequencies, vmin=0, vmax=.4)
weekend_daley_cbar = plt.colorbar(weekend_daley_heatmap)
weekend_daley_cbar.set_label('Likelihood of N bikes available', rotation=90)
plt.yticks([])#[0*180, 180, 2*180,3*180,4*180,5*180, 6*180, 7*180, 8*180], ['12:00a', '3:00a', '6:00a', '9:00a', '12:00p', '3:00p', '6:00p', '9:00p', '11:59p'])

fig.subplots_adjust(wspace=-0.15)
plt.subplots_adjust(top=.95)
plt.show()


Note the abrupt shifts in color. There's further evidence of data corruption here, and even for this article I can't ignore it. I'm going to rerun my MRJob data report and see what this looks like afterwards (refer back to the previous article if you're curious what's involved: http://www.databozo.com/2014/01/25/Predicting_Divvy_bike_availability_with_MRJob_and_Python.html)

Data's done cooking! Let's taste!!

I finished rerunning my report against my data and stripped out location data as well as the total dock count. Let's try the same technique I did above:


In [240]:
def load_data_for_dock_near_daley_center():
    revised_results_filepath = '/Users/justin/Documents/Code/divvy_probability/revised_results.txt'
    weekday_daley_station_minutes = []
    weekend_daley_station_minutes = []
    with open(revised_results_filepath, 'r') as revised_results_file:
        for line in revised_results_file:
            station_data = json.loads(line)
            if station_data['station_id'] == 81:
                if station_data['weekday']:
                    weekday_daley_station_minutes.append(station_data)
                else:
                    weekend_daley_station_minutes.append(station_data)
    weekday_daley_station = sorted(weekday_daley_station_minutes, key=lambda x: x['hour']*60+x['minute'])
    weekend_daley_station = sorted(weekend_daley_station_minutes, key=lambda x: x['hour']*60+x['minute'])
    return {'weekday': weekday_daley_station, 'weekend': weekend_daley_station}

In [241]:
station_data = load_data_for_dock_near_daley_center()
print "Weekday minutes: " + str(len(station_data['weekday']))
print "Weekend minutes: " + str(len(station_data['weekend']))


Weekday minutes: 1440
Weekend minutes: 1440

BOOM! The error is fixed. Now let's plot this good data:


In [259]:
station_data = load_data_for_dock_near_daley_center()

In [249]:
weekday_daley_station_minutes = np.array(map(lambda x: x['available_bikes'], station_data['weekday']))
weekend_daley_station_minutes = np.array(map(lambda x: x['available_bikes'], station_data['weekend']))

cropped_weekday_daley_minute_frequencies = np.array(map(lambda x: x[:50]/(1.0*np.sum(x)), weekday_daley_station_minutes))
cropped_weekend_daley_minute_frequencies = np.array(map(lambda x: x[:50]/(1.0*np.sum(x)), weekend_daley_station_minutes))

fig = plt.figure(figsize=(4,20))
fig.suptitle('Daley Center bike availability likelihoods', fontsize=18)
ax = fig.add_subplot(121)
ax.set_ylim([0,1440])
ax.set_title('Weekday')
plt.xlabel('Bikes available')
plt.ylabel('Time of weekday')
weekday_daley_heatmap = plt.pcolor(cropped_weekday_daley_minute_frequencies, vmin=0, vmax=.15)
weekday_daley_cbar = plt.colorbar(weekday_daley_heatmap)
plt.yticks([0*3*60, 3*60, 2*3*60,3*3*60,4*3*60,5*3*60, 6*3*60, 7*3*60, 8*3*60], ['12:00a', '3:00a', '6:00a', '9:00a', '12:00p', '3:00p', '6:00p', '9:00p', '11:59p'])

bx = fig.add_subplot(122)
bx.set_ylim([0,1440])
bx.set_title('Weekend')
plt.xlabel('Bikes available')
plt.ylabel('')
weekend_daley_heatmap = plt.pcolor(cropped_weekend_daley_minute_frequencies, vmin=0, vmax=.15)
weekend_daley_cbar = plt.colorbar(weekend_daley_heatmap)
weekend_daley_cbar.set_label('Likelihood of N bikes available', rotation=90)
plt.yticks([])#[0*180, 180, 2*180,3*180,4*180,5*180, 6*180, 7*180, 8*180], ['12:00a', '3:00a', '6:00a', '9:00a', '12:00p', '3:00p', '6:00p', '9:00p', '11:59p'])

fig.subplots_adjust(wspace=-0.15)
plt.subplots_adjust(top=.95)
plt.show()


Whooa! Now THAT looks like the kind of plot I'd expect. Keep in mind these plots are essentially probability distributions over a 1,440 minute (24-hour) period and the warmer portions of the plot correspond to a higher likelihood of that many bikes being available at the given time.
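Since each row of the heatmap is a probability distribution, one practical summary is the chance of finding at least one bike at a given minute: simply one minus the probability of the zero-bike bin. A sketch with a made-up distribution:

```python
import numpy as np

# Made-up distribution: P(0 bikes), P(1 bike), ... for one minute of the day
probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
p_at_least_one = 1.0 - probs[0]
print(p_at_least_one)
```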

On weekdays, prior to 9am you can see a very tight bunching of availability centered somewhere around 5-10 bikes. It stays pretty consistent until around 8-9am, when the spread of the colors swells until just after 6pm. This seems to be related to people commuting into downtown and staying until the end of their work day. It also suggests that the most common work day is 9a-6p.
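One way to quantify the "tight bunching" versus "swelling spread" I'm describing is to compute the mean and standard deviation of each minute's distribution; mornings should show a small spread and afternoons a larger one. A sketch on a single made-up minute:

```python
import numpy as np

# Made-up P(n bikes available) for n = 0..5 at one minute of the day
probs = np.array([0.05, 0.15, 0.4, 0.25, 0.1, 0.05])
n = np.arange(len(probs))
mean = (n * probs).sum()                           # expected number of bikes
spread = np.sqrt(((n - mean) ** 2 * probs).sum())  # standard deviation
print(mean, spread)
```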

It also seems that during the work day you're most likely to find the most bikes available, thanks to commuters bringing bikes in from the surrounding neighborhoods. Not surprising in hindsight, but it's always nice to see your intuition validated with data.

I want to quickly vet that I have the plot labelled right, i.e. that 12:00a is at the bottom. To do that I'll pull data from the beginning (since midnight matches both the beginning AND the end of the day), then check that 6a looks like it should and that 6p looks quite different.


In [270]:
print '12:00a ' + str(station_data['weekday'][0]['available_bikes'])


12:00a [0, 2, 2, 5, 4, 9, 8, 6, 7, 4, 4, 7, 6, 4, 4, 3, 2, 3, 0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [271]:
print '6:00a ' + str(station_data['weekday'][6*60]['available_bikes'])


6:00a [1, 2, 1, 3, 3, 7, 10, 7, 6, 4, 5, 4, 8, 3, 7, 4, 2, 1, 2, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [272]:
print '6:00p ' + str(station_data['weekday'][18*60]['available_bikes'])


6:00p [2, 1, 1, 2, 1, 0, 2, 0, 1, 0, 1, 2, 6, 4, 4, 2, 4, 3, 1, 3, 6, 4, 6, 3, 6, 2, 1, 1, 2, 2, 2, 0, 4, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Looks about right let's also plot it to be sure:


In [280]:
plt.plot(station_data['weekday'][0]['available_bikes'], color = (1, 0, 0, 1))
plt.plot(station_data['weekday'][6*60]['available_bikes'], color = (0, 1, 0, 1))
plt.plot(station_data['weekday'][18*60]['available_bikes'], color = (0, 0, 0, 1))


Out[280]:
[<matplotlib.lines.Line2D at 0x11507fb50>]

The green and red lines seem to line up pretty well. The black line looks pretty different from both of them. So I think everything is lined up correctly.

Reviewing what happened

This is another of those exploratory posts where I walk you through things as I'm discovering them. That can make for some meandering, so I appreciate you making it this far. If you want the Divvy data my friend Andrew Slotnick and I have been gathering and curating, just email me and we'll hook you up. I would open up public downloads, but I REALLY don't feel like paying one of those hosting bills this month. :)