In this notebook I do some initial data exploration with the metro delay data.


In [58]:
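# __future__ imports must be the first statement when this runs as a script
from __future__ import division
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from geopy.geocoders import Nominatim
# vincenty was removed in geopy 2.0, so this import needs geopy < 2.0
from geopy.distance import vincenty
matplotlib.style.use('ggplot')
%matplotlib inline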

Metro delays are from http://www.opendatadc.org/dataset/wmata-disruption-reports, an open file of daily disruption reports from WMATA's API.


In [52]:
metro_delays = pd.read_csv("We'll be Moving Momentarily - Incidents.csv")

In [53]:
metro_delays.head()


Out[53]:
   Date        Time        Incident                                          Line    Direction          Cause                   Delay
0  12/31/2012  5:00 a.m.   A Fort Totten-bound Yellow Line train at Brad...  Yellow  Fort Totten        an operational problem  13.0
1  12/31/2012  10:30 a.m.  A Grosvenor-bound Red Line train at Gallery P...  Red     Grosvenor          an equipment problem    6.0
2  12/31/2012  4:11 p.m.   A Greenbelt-bound Green Line train at Branch ...  Green   Greenbelt          an operational problem  4.0
3  12/31/2012  4:50 p.m.   A Greenbelt-bound Green Line train at Shaw-Ho...  Green   Greenbelt          an equipment problem    3.0
4  12/31/2012  5:55 p.m.   A Largo Town Center-bound Blue Line train at ...  Blue    Largo Town Center  a brake problem         9.0

How many delays lasted 30 minutes or longer?


In [54]:
long_delays = metro_delays[metro_delays['Delay'] >= 30]
len(long_delays)


Out[54]:
461
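
The filter above uses `>=`, so a delay of exactly 30 minutes is counted. As a minimal sketch of how the boundary and missing values behave, here is the same boolean-mask filter on a small made-up frame (the `toy` data and its values are assumptions for illustration, not rows from the incidents file):

```python
import pandas as pd

# Toy data standing in for the real incidents file (values are made up).
toy = pd.DataFrame({
    "Line": ["Red", "Green", "Blue", "Yellow"],
    "Delay": [6.0, 30.0, 45.0, None],
})

# ">=" keeps the row with a delay of exactly 30 minutes; ">" drops it.
at_least_30 = toy[toy["Delay"] >= 30]
more_than_30 = toy[toy["Delay"] > 30]

# Comparisons against NaN are False, so the row with a missing delay
# is excluded from both results.
n_at_least = len(at_least_30)
n_more = len(more_than_30)
print(n_at_least, n_more)
```

If the real 'Delay' column has missing values, they are silently left out of the count in the same way.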

In [55]:
long_delays.iloc[0]["Date"]


Out[55]:
'12/10/2012'

I need to combine the 'Date' and 'Time' columns into a single timestamp for each event.


In [56]:
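# Strip the periods from "a.m."/"p.m." so %p can parse them; regex=False treats '.' literally
metro_delays['Time'] = metro_delays['Time'].str.replace('.', '', regex=False)
metro_delays['time_stamp'] = metro_delays['Date'] + ' ' + metro_delays['Time']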

In [57]:
# Overwrites 'Time' with the parsed datetime; the cleaned string stays in 'time_stamp'
metro_delays['Time'] = pd.to_datetime(metro_delays['time_stamp'], format='%m/%d/%Y %I:%M %p')
metro_delays.head()


Out[57]:
   Date        Time                 Incident                                          Line    Direction          Cause                   Delay  time_stamp
0  12/31/2012  2012-12-31 05:00:00  A Fort Totten-bound Yellow Line train at Brad...  Yellow  Fort Totten        an operational problem  13.0   12/31/2012 5:00 am
1  12/31/2012  2012-12-31 10:30:00  A Grosvenor-bound Red Line train at Gallery P...  Red     Grosvenor          an equipment problem    6.0    12/31/2012 10:30 am
2  12/31/2012  2012-12-31 16:11:00  A Greenbelt-bound Green Line train at Branch ...  Green   Greenbelt          an operational problem  4.0    12/31/2012 4:11 pm
3  12/31/2012  2012-12-31 16:50:00  A Greenbelt-bound Green Line train at Shaw-Ho...  Green   Greenbelt          an equipment problem    3.0    12/31/2012 4:50 pm
4  12/31/2012  2012-12-31 17:55:00  A Largo Town Center-bound Blue Line train at ...  Blue    Largo Town Center  a brake problem         9.0    12/31/2012 5:55 pm
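
To sanity-check the format string on a single value, the same two steps (strip the periods, then parse) can be sketched with the standard library's `datetime.strptime`. The variable names here are illustrative, and `%p` matching "am"/"pm" depends on the locale, so this assumes an English or C locale:

```python
from datetime import datetime

# One raw Date/Time pair in the same shape as the incidents file.
raw_date, raw_time = "12/31/2012", "4:11 p.m."

# Same cleaning step as above: drop the periods so "p.m." becomes "pm".
cleaned = raw_time.replace(".", "")

# %I is the 12-hour clock and %p is the AM/PM marker.
stamp = datetime.strptime(raw_date + " " + cleaned, "%m/%d/%Y %I:%M %p")
print(stamp)
```

The parsed value should agree with the 'Time' column in row 2 of the output above.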

Export the DataFrame with pickle.


In [59]:
# Use a context manager so the file handle is closed once the dump finishes
with open("metro_delays.p", "wb") as f:
    pickle.dump(metro_delays, f)
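
Before relying on the pickle in a later notebook, it may be worth checking that a frame survives the round trip. This is a minimal sketch on a made-up frame written to a temporary directory (the `toy` data and the `toy_delays.p` file name are assumptions for illustration):

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame with a float column and a datetime column (values are made up).
toy = pd.DataFrame({
    "Delay": [13.0, 6.0],
    "Time": pd.to_datetime(["2012-12-31 05:00", "2012-12-31 10:30"]),
})

# Write to a temporary directory so nothing in the working folder is touched.
path = os.path.join(tempfile.mkdtemp(), "toy_delays.p")
with open(path, "wb") as f:
    pickle.dump(toy, f)

# Read it back and compare values and dtypes with the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.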
