In [1]:
import pandas as pd
data = pd.read_csv("data.csv", index_col='Date', parse_dates=True)
In [2]:
# Checking the data
data.head()
Out[2]:
In [3]:
# Visualizing our data
%matplotlib inline
data.plot()
Out[3]:
In [4]:
# The plot above is too dense to read at the raw resolution,
# so we resample to weekly totals instead
data.resample('W').sum().plot()
Out[4]:
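To make the weekly resampling concrete, here is a minimal sketch on synthetic data. The `hourly` frame, its dates, and its constant counts are made up for illustration and are not the bridge data; only `resample('W').sum()` comes from the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the counter data: two full weeks of hourly
# counts, starting on Monday 2021-01-04.
index = pd.date_range("2021-01-04", periods=14 * 24, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({"West": 1, "East": 2}, index=index)

# 'W' groups the rows into weeks ending on Sunday and sums each group.
weekly = hourly.resample("W").sum()
print(weekly)
```

Each week holds 168 hourly rows, so the 336 rows collapse to two weekly totals per column, which is why the weekly plot is so much easier to read.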
We will now go deeper into this visualization for further analysis. First we will change the visualization style.
In [5]:
import matplotlib.pyplot as plt
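# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6;
# on older versions, use 'seaborn' instead
plt.style.use('seaborn-v0_8')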
In [6]:
data.columns = ['West', 'East']
data.resample('W').sum().plot()
Out[6]:
One thing you might want to check is whether there is an annual trend in the number of riders: any growth or decline in ridership from year to year. We can look for this with a rolling window. Let's resample to daily totals and take a rolling sum over 365 days. What comes out is the annual trend: each point is the sum of rides over the previous 365 days.
In [7]:
data.resample('D').sum().rolling(365).sum().plot()
Out[7]:
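The behavior of the rolling sum can be sketched on a small synthetic series. The `daily` series below, with a constant 10 rides per day, is a hypothetical stand-in rather than the real counts.

```python
import pandas as pd

# Hypothetical daily totals: 400 days with exactly 10 rides each.
index = pd.date_range("2020-01-01", periods=400, freq="D")
daily = pd.Series(10, index=index, name="West")

# Each point is the sum over a window of 365 rows ending at that day.
annual = daily.rolling(365).sum()

print(annual.isna().sum())  # days that do not yet have a full window
print(annual.iloc[-1])      # 365 days * 10 rides
```

The first 364 days have no complete window, so they come out as NaN. That is why the rolling curve only starts about a year after the first observation.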
We see something interesting: on the west side, ridership increased and then decreased. The axis limits are a little suspect, though, because they don't go all the way down to 0. It is better to set the lower limit of the y-axis to 0 and keep the current maximum as the upper limit.
In [8]:
ax = data.resample('D').sum().rolling(365).sum().plot()
ax.set_ylim(0, None)
Out[8]:
Now the change does not look as dramatic, but there is still some change, and there seems to be an offset between the west sidewalk and the east sidewalk. Another thing we can do is add a new column to the data, with Total equal to West plus East, and then plot again.
In [9]:
data['Total'] = data['West'] + data['East']
ax = data.resample('D').sum().rolling(365).sum().plot()
ax.set_ylim(0, None)
Out[9]:
We can see that the east and west sides of the bridge have traded places somewhat: their trends move in opposite directions, so the total count of bikes across the bridge hovers around one million per year. That has been pretty consistent for the past few years, plus or minus a few percent.
Another thing we can do is look at the trend within individual days. We will use a "group by" here: group by the time of day, take the mean, and plot it to see what that looks like.
In [10]:
data.groupby(data.index.time).mean().plot()
Out[10]:
The plot above shows the average number of crossings at each time of day, taken over all the days in the data set, and we see some interesting patterns. First, the east sidewalk seems to peak in the afternoon and the west sidewalk peaks in the morning. These two peaks are indicative of a commute pattern: people generally go into the city on the west sidewalk in the morning and out of the city on the east sidewalk in the afternoon.
This average is nice, but it would also be nice to see the whole data set in this way. One way to do that is with a "pivot table". So let's make a pivoted data set using the data's pivot_table method.
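As a rough illustration of the time-of-day grouping, the sketch below builds three days of synthetic hourly counts with an artificial morning peak on West and an afternoon peak on East. The `counts` frame and its values are assumptions made for this example only.

```python
from datetime import time

import pandas as pd

# Hypothetical data: three days of hourly counts, all zero to start.
index = pd.date_range("2021-01-04", periods=3 * 24, freq=pd.Timedelta(hours=1))
counts = pd.DataFrame({"West": 0, "East": 0}, index=index)

# Artificial commute pattern: West is busy at 08:00, East at 17:00.
counts.loc[index.hour == 8, "West"] = 100
counts.loc[index.hour == 17, "East"] = 100

# Rows sharing the same clock time are averaged across all days.
by_time = counts.groupby(counts.index.time).mean()

print(by_time.loc[time(8, 0)])
print(by_time.loc[time(17, 0)])
```

The result has one row per clock time (24 here), no matter how many days went in, which is what lets the plot show a single "average day".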
In [11]:
pivoted = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
If we look at just the first five-by-five block of this pivoted data, we can see what we've done: we now have a two-dimensional data frame where each column is a day in the data set and each row corresponds to an hour during that day.
In [12]:
pivoted.iloc[:5, :5]
Out[12]:
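The reshaping done by the pivot table is easier to see on a tiny example. The sketch below uses three days of synthetic hourly values; the `counts` frame is hypothetical, while the `pivot_table` call mirrors the one above.

```python
import pandas as pd

# Hypothetical data: three days of hourly values 0, 1, 2, ..., 71.
index = pd.date_range("2021-01-04", periods=3 * 24, freq=pd.Timedelta(hours=1))
counts = pd.DataFrame({"Total": range(len(index))}, index=index)

# One row per time of day, one column per calendar date.
pivoted = counts.pivot_table("Total", index=counts.index.time, columns=counts.index.date)

print(pivoted.shape)
print(pivoted.iloc[:3, :3])
```

Because each (time, date) cell here contains exactly one observation, the default mean aggregation simply returns that value. With 72 hourly rows, the result is a 24 by 3 table.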
Let's plot that data. We don't want a legend, so we pass legend=False and take a look at what comes out.
In [13]:
pivoted.plot(legend=False)
Out[13]:
What we see here is a line for each day in the four years of data, which is a little hard to read, so let's try alpha=0.01. This sets the transparency, so we plot a whole bunch of nearly transparent lines on top of each other to get an idea of how the trend in crossings over the day changes throughout this four-year period.
In [14]:
pivoted.plot(legend=False, alpha=0.01)
Out[14]:
We can see that a bunch of days have this bimodal commute pattern, but there are also a bunch of days that don't: they rise to a peak somewhere around midday and then fall off during the rest of the day. Hence, the best hypothesis here is that the commute days are weekdays and the broad-usage days are weekends or holidays.
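One possible way to check this hypothesis is to average the hourly profile separately for weekdays and weekends. The sketch below does this on synthetic data in which the two patterns are built in by construction, so it only shows the mechanics; the `counts` frame, the `day_type` labels, and the peak values are all assumptions, and holidays are not handled.

```python
from datetime import time

import numpy as np
import pandas as pd

# Hypothetical data: two weeks of hourly totals starting on a Monday.
index = pd.date_range("2021-01-04", periods=14 * 24, freq=pd.Timedelta(hours=1))
is_weekend = index.dayofweek >= 5  # Saturday is 5, Sunday is 6

# Built-in patterns: weekdays peak at 08:00 and 17:00, weekends at 13:00.
weekday_counts = np.where((index.hour == 8) | (index.hour == 17), 100, 10)
weekend_counts = np.where(index.hour == 13, 80, 10)
counts = pd.DataFrame({"Total": np.where(is_weekend, weekend_counts, weekday_counts)}, index=index)

# Average by (day type, time of day), then put the day types in columns.
day_type = np.where(is_weekend, "weekend", "weekday")
profile = counts.groupby([day_type, counts.index.time])["Total"].mean().unstack(level=0)

print(profile.loc[[time(8, 0), time(13, 0), time(17, 0)]])
```

On the real data, plotting `profile` would show whether the bimodal days are in fact the weekdays.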