This project endeavors to understand usership trends amongst Citi Bike riders in New York City.
CitiBike collects cumulative data about its riders, including the number of rentals each day, the total distance per ride (measured as distance between pick-up station and drop-off station), and the number of long-term rentals on any given day. Our analysis examines trends among the number of rentals by day and by month, as well as average distances. With these trends, we can better understand how and when people use CitiBike. This information has the potential to make a huge impact on CitiBike's advertising and marketing campaigns, as well as its internal operations.
In [79]:
import sys # system module
import pandas as pd # data package
import matplotlib as mpl # graphics package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np # foundation for pandas
import csv # package for converting csv
from collections import defaultdict # will be used to convert dates
import seaborn as sns # advanced graphics
import urllib.request # package to read url
%matplotlib inline
First, we must import the data from CitiBike's website. The data is accessed through the 'Get the data' link at the bottom left corner of the following page: http://datawrapper.dwcdn.net/33zqP/6/. This data is updated in near-real-time. When we ran our analysis, the available data covered January 1, 2017, through March 31, 2017. Because of the length of the link address for the data file, it is not readable directly by the .read_csv() function, so we use urllib.request as shown below to access the source directly through Python.
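As a rough illustration of the approach described above, the sketch below builds a tiny `data:` URL by hand and reads it back through `urllib.request`. The two-row CSV, its column names, and the `sample_url` variable are made up for the example and are not the real CitiBike feed.

```python
import io
import urllib.parse
import urllib.request

import pandas as pd

# Hypothetical two-row CSV standing in for the real download
csv_text = "Date,Trips,Miles\n1/1/17,100,250\n1/2/17,200,400\n"

# A data: URL carries the file contents inside the address itself
sample_url = "data:text/csv;charset=utf-8," + urllib.parse.quote(csv_text)

# urlopen understands the data: scheme and returns a file-like response
with urllib.request.urlopen(sample_url) as response:
    raw_text = response.read().decode("utf-8")

# Wrap the decoded text so read_csv can treat it like a file
sample = pd.read_csv(io.StringIO(raw_text))
print(sample)
```

Decoding to text first and wrapping it in `io.StringIO` is one possible design choice; it keeps the download step and the parsing step separate, which makes each easier to check.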
In [80]:
url = "data:application/octet-stream;charset=utf-8,Date%2CTrips%20over%20the%20past%2024-hours%20(midnight%20to%2011%3A59pm)%2CMiles%20traveled%20today%20(midnight%20to%2011%3A59%20pm)%2CTotal%20Annual%20Members%20(All%20Time)%2C24-Hour%20Passes%20Purchased%20(midnight%20to%2011%3A59%20pm)%2C3-Day%20Passes%20Purchased%20(midnight%20to%2011%3A59%20pm
data_file = urllib.request.urlopen(url) # this code allows python to access the information directly from the source website
CitiBike = pd.read_csv(data_file)
print ('Variable dtypes:\n', CitiBike.dtypes)
CitiBike.head()
Out[80]:
We see that this data has much more information than we need. For example, it includes total annual membership and the number of passes purchased, which we do not need for this analysis. Thus, we remove these columns (positions 3, 4, and 5) to prioritize the data that will most impact daily and monthly usership.
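Dropping columns by position works as long as the column order never changes. A minimal sketch of the same step using column labels instead is shown here; the frame and its shortened column names are hypothetical stand-ins for the real headers.

```python
import pandas as pd

# Hypothetical frame with shortened stand-in column names
sample = pd.DataFrame({
    "Date": ["1/1/17", "1/2/17"],
    "Trips": [100, 200],
    "Miles": [250, 400],
    "Annual Members": [5000, 5010],
    "24-Hour Passes": [10, 20],
    "3-Day Passes": [1, 2],
})

# Dropping by label keeps working even if the column order changes;
# without inplace=True the original frame is left untouched
trimmed = sample.drop(columns=["Annual Members", "24-Hour Passes", "3-Day Passes"])
print(trimmed.columns.tolist())
```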
In [81]:
CitiBike.drop(CitiBike.columns[[3,4,5]], axis = 1, inplace = True)
CitiBike.head()
Out[81]:
In order to manipulate and sort the data based on day of the week and month, we must convert the date information from the text strings read out of the .csv file to Python datetime format.
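To show what the conversion buys us, here is a small sketch on made-up date strings. The month/day/two-digit-year format is an assumption for the example; the real file's date format may differ.

```python
import pandas as pd

# Hypothetical date strings; the real file's format may differ
dates_as_text = pd.Series(["1/1/17", "1/2/17", "3/31/17"])

# An explicit format avoids ambiguity between month-first and day-first dates
dates = pd.to_datetime(dates_as_text, format="%m/%d/%y")

print(dates.dtype)                # a datetime64 dtype instead of object
span = dates.max() - dates.min()  # date arithmetic now works
print(span)
```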
In [82]:
CitiBike['Date'] = pd.to_datetime(CitiBike['Date'])
CitiBike.head ()
Out[82]:
In [83]:
CitiBike.dtypes
Out[83]:
Now that Python recognizes the data in the Date column as calendar dates, we can add a column to classify each data point by day of the week and by month. This will ultimately allow us to compare usage on Monday vs. Tuesday, e.g., or January vs. February.
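A minimal sketch of the `.dt` accessor on three hand-picked dates from the first quarter of 2017 (the `labels` frame is illustrative only):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2017-01-01", "2017-02-14", "2017-03-31"]))

labels = pd.DataFrame({
    "Date": dates,
    "Day of Week": dates.dt.day_name(),  # full English weekday name
    "Month": dates.dt.month,             # integer month, 1 through 12
})
print(labels)
```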
In [84]:
CitiBike['Day of Week'] = CitiBike['Date'].dt.day_name() # .dt.weekday_name was removed in pandas 1.0
CitiBike.head()
Out[84]:
In [85]:
CitiBike['Month'] = CitiBike['Date'].dt.month
CitiBike.head()
Out[85]:
To get a sense of how much data we are working with, we check the shape of the dataframe, which tells us how many data points (rows) and variables (columns) we have.
In [86]:
print ("The number of rows and columns are ", CitiBike.shape, "respectively")
We now have all the useful data columns, but the index column needs to be replaced. We want to analyze this data by date, so we need to make the date column the index.
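One practical benefit of a date index, sketched on a made-up three-row frame, is that rows can be selected with a partial date string:

```python
import pandas as pd

# Hypothetical rows straddling the end of January
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-30", "2017-01-31", "2017-02-01"]),
    "Trips": [100, 200, 300],
})
by_date = sample.set_index("Date")

# With a DatetimeIndex, .loc accepts a partial date string such as a month
january = by_date.loc["2017-01"]
print(january)
```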
In [87]:
CitiBike = CitiBike.set_index ('Date')
CitiBike.head()
Out[87]:
Next, we will retitle each column so that it's easier to understand what we're looking at.
In [88]:
titles = ['Total Trips', 'Total Miles', 'Day of Week', 'Month']
CitiBike.columns = titles
CitiBike.head()
Out[88]:
To begin our analysis, we will add a column that shows the average mileage per trip each day. This can be done using a formula that divides the total miles for each day by the number of corresponding trips for each day, to derive an average trip length for each day.
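The division is element-wise, so each day's miles are divided by that same day's trips. The sketch below uses made-up numbers and includes a zero-trip day purely to show how such a row would behave; whether the real data contains one is not known.

```python
import pandas as pd

# Hypothetical daily totals; the third day has no trips at all
sample = pd.DataFrame({
    "Total Trips": [100, 200, 0],
    "Total Miles": [250.0, 400.0, 0.0],
})

# Row-by-row division; 0 miles / 0 trips yields NaN rather than an error
sample["Average Miles per Trip"] = sample["Total Miles"] / sample["Total Trips"]
print(sample)

# mean() skips NaN by default, so the empty day does not distort the average
overall = sample["Average Miles per Trip"].mean()
print(overall)
```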
In [89]:
CitiBike['Average Miles per Trip'] = CitiBike['Total Miles'] / CitiBike['Total Trips']
CitiBike.head()
Out[89]:
In [90]:
CitiBike.shape
Out[90]:
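To finalize the daily average comparisons, we filter the dataframe for each day of the week, and likewise for January and February. Note that these filtered views are displayed rather than stored, and only the last expression in the cell produces output.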
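As an alternative to writing one filter per day, `groupby` can split the data in a single step. This is a sketch on made-up values, not the method used in the cells of this notebook.

```python
import pandas as pd

# Hypothetical rows: two Mondays and two Tuesdays
sample = pd.DataFrame({
    "Day of Week": ["Monday", "Tuesday", "Monday", "Tuesday"],
    "Average Miles per Trip": [2.0, 1.5, 3.0, 2.5],
})

# One sub-dataframe per day, keyed by the day name
per_day = {day: rows for day, rows in sample.groupby("Day of Week")}
print(per_day["Monday"])

# Or go straight to the per-day mean without storing the pieces
daily_mean = sample.groupby("Day of Week")["Average Miles per Trip"].mean()
print(daily_mean)
```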
In [91]:
CitiBike [CitiBike['Day of Week'] == 'Sunday']
CitiBike [CitiBike['Day of Week'] == 'Monday']
CitiBike [CitiBike['Day of Week'] == 'Tuesday']
CitiBike [CitiBike['Day of Week'] == 'Wednesday']
CitiBike [CitiBike['Day of Week'] == 'Thursday']
CitiBike [CitiBike['Day of Week'] == 'Friday']
CitiBike [CitiBike['Day of Week'] == 'Saturday']
CitiBike [CitiBike['Month'] == 1]
CitiBike [CitiBike['Month'] == 2].head ()
Out[91]:
Now that we have individual dataframes for each day of the week, we can create larger dataframes for week days and weekends.
In [92]:
Weekend_List = ['Saturday', 'Sunday']
Weekday_List = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
CitiBike [CitiBike ['Day of Week'].isin (Weekend_List)].head ()
Out[92]:
In [93]:
CitiBike [CitiBike ['Day of Week'].isin (Weekday_List)].head ()
Out[93]:
Now that we have these dataframes compiled, we can start to pull some insights. For instance, we can calculate the average number of miles a rider travels on a weekday vs. a weekend.
In [94]:
Weekend_Chart = CitiBike [CitiBike ['Day of Week'].isin (Weekend_List)]
Weekend_Average = Weekend_Chart[['Average Miles per Trip']].mean ()
Weekday_Chart = CitiBike [CitiBike ['Day of Week'].isin (Weekday_List)] # use every weekday row; a trailing .head () would average only the first five
Weekday_Average = Weekday_Chart[['Average Miles per Trip']].mean ()
print ("The average miles riders cover on the weekend are", Weekend_Average)
print ("The average miles riders cover on weekdays are", Weekday_Average)
From this comparison, we can see that riders typically travel about 50% farther per trip on weekends than on weekdays.
We will build on this insight in later graphs.
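The "percent farther" figure is simply the ratio of the two averages minus one. A worked sketch on made-up numbers, chosen so that the answer comes out to 50%, follows; none of these values are taken from the CitiBike data.

```python
import pandas as pd

# Hypothetical daily totals for two weekdays and two weekend days
sample = pd.DataFrame({
    "Day of Week": ["Monday", "Tuesday", "Saturday", "Sunday"],
    "Total Trips": [100, 200, 50, 50],
    "Total Miles": [250.0, 300.0, 150.0, 150.0],
})
sample["Average Miles per Trip"] = sample["Total Miles"] / sample["Total Trips"]

is_weekend = sample["Day of Week"].isin(["Saturday", "Sunday"])

# Mean of the daily averages: every day counts equally
weekend_avg = sample.loc[is_weekend, "Average Miles per Trip"].mean()
weekday_avg = sample.loc[~is_weekend, "Average Miles per Trip"].mean()
pct_farther = (weekend_avg / weekday_avg - 1) * 100
print(pct_farther)

# Alternative: total miles over total trips, so every trip counts equally
weekday_weighted = (sample.loc[~is_weekend, "Total Miles"].sum()
                    / sample.loc[~is_weekend, "Total Trips"].sum())
print(weekday_weighted)
```

The two weekday figures differ because busy days carry more weight in the second calculation. Which one is appropriate depends on whether the question is about a typical day or a typical trip.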
In [95]:
Average_Mileage = pd.DataFrame ({'Weekdays' : Weekday_Average, 'Weekends' : Weekend_Average})
Average_Mileage = Average_Mileage [['Weekdays', 'Weekends']]
print (Average_Mileage)
Based on the averages calculated above, we can plot how far riders travel on weekend rides vs. weekday rides.
This comparison shows a clear trend: riders travel about 50% farther on weekend rides than on weekday rides. This makes sense to us, since the motivation for renting a CitiBike would be very different on a weekday (likely for a commute) than on a weekend (likely to visit a place of interest).
In [96]:
fig, ax = plt.subplots(1)
Average_Mileage.plot(ax=ax, kind = 'bar', title = 'Average Miles on Weekends vs. Weekdays Q1')
ax.legend(['Weekdays', 'Weekends'], loc = 'best')
ax.set_ylabel('Miles')
ax.set_ylim (0,3.5)
Out[96]:
Another interesting comparison is between months. We would like to examine and compare the total number of miles traveled by CitiBike users in January, February, and March. A higher number of miles traveled in a given month would indicate more rentals, more miles traveled per rental, or both. Either way, it points to heavier bike usage.
Our hypothesis is that riders will ride their bikes more in the beginning of the year, because New Year’s resolutions will push people to ride a bike to work instead of taking the train or a cab. We also need to factor in the poor weather during this time of year, which may deter bike riders, but we think that there will be a spike in January and then a downward trend month to month.
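When comparing monthly totals, it may help to keep in mind that the months differ in length. The sketch below uses a made-up constant of 10 miles per day across the first quarter of 2017 to show that equal daily usage still produces a smaller February total; the `Miles per Day` column is an illustrative addition, not part of the original analysis.

```python
import pandas as pd

# Hypothetical data: exactly 10 miles on every day of Q1 2017
days = pd.date_range("2017-01-01", "2017-03-31", freq="D")
sample = pd.DataFrame({"Total Miles": 10.0}, index=days)

# Sum and count the daily rows within each calendar month
monthly = sample.groupby(sample.index.month)["Total Miles"].agg(["sum", "count"])

# Dividing by the number of days puts the months on an equal footing
monthly["Miles per Day"] = monthly["sum"] / monthly["count"]
print(monthly)
```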
In [97]:
January_Miles = CitiBike [CitiBike['Month'] == 1]
January_Miles_Total = January_Miles [['Total Miles']].sum ()
February_Miles = CitiBike [CitiBike['Month'] == 2]
February_Miles_Total = February_Miles [['Total Miles']].sum ()
March_Miles = CitiBike [CitiBike['Month'] == 3]
March_Miles_Total = March_Miles [['Total Miles']].sum ()
print (January_Miles_Total)
print (February_Miles_Total)
print (March_Miles_Total)
In [98]:
Total_Mileage = pd.DataFrame ({'January' : January_Miles_Total,
'February' : February_Miles_Total,
'March' : March_Miles_Total})
Total_Mileage = Total_Mileage[['January', 'February', 'March']]
print (Total_Mileage)
In [99]:
fig, ax = plt.subplots(1)
Total_Mileage.plot(ax=ax, kind = 'bar', title = 'Total Miles Covered per Month Q1')
ax.legend(['JAN', 'FEB', 'MAR'], loc='best')
ax.set_xlabel('Month')
ax.set_ylabel('Total Miles')
ax.set_ylim (0,2100000)
Out[99]:
Based on the analysis, total miles traveled was actually highest in February, contradicting our original hypothesis. One theory for why this may be the case is that riders are on vacation in the beginning of January and, therefore, are not commuting to work. Alternatively, blizzards and poor weather may have kept them on the train and in cabs, or working from home. Finally, it could be the case that February had more opportunities for bike rides (perhaps this was popular on Valentine’s Day weekend as couples sought out activities to do together), or CitiBike ran a promotion for part of the month to encourage bike rentals.
Though we can’t claim this as a long-term trend to be expected every February, we recommend that CitiBike work to convert this spike in miles traveled to other months of the year. If this spike represents a flurry of one-time users, CitiBike has an opportunity to convert those users to longer-term users. For instance, they could offer 2 weeks of unlimited use free with the first rental, in order to demonstrate the benefit of CitiBike to users who enjoy riding bikes. This may help create a slightly stickier service for consumers, and would translate a spike in interest into longer term business benefits.
We want to get a bit more granular in our analysis and look into which days are most popular for CitiBike in New York City. In order to do this, we will first create individual data frames for each day of the week in order to average usage.
In [100]:
# Monday
Monday_Data = CitiBike [CitiBike['Day of Week'] == 'Monday']
Monday_Miles = Monday_Data[['Average Miles per Trip']].mean ()
# Tuesday
Tuesday_Data = CitiBike [CitiBike['Day of Week'] == 'Tuesday']
Tuesday_Miles = Tuesday_Data[['Average Miles per Trip']].mean ()
# Wednesday
Wednesday_Data = CitiBike [CitiBike['Day of Week'] == 'Wednesday']
Wednesday_Miles = Wednesday_Data[['Average Miles per Trip']].mean ()
# Thursday
Thursday_Data = CitiBike [CitiBike['Day of Week'] == 'Thursday']
Thursday_Miles = Thursday_Data[['Average Miles per Trip']].mean ()
# Friday
Friday_Data = CitiBike [CitiBike['Day of Week'] == 'Friday']
Friday_Miles = Friday_Data[['Average Miles per Trip']].mean ()
# Saturday
Saturday_Data = CitiBike [CitiBike['Day of Week'] == 'Saturday']
Saturday_Miles = Saturday_Data[['Average Miles per Trip']].mean ()
# Sunday
Sunday_Data = CitiBike [CitiBike['Day of Week'] == 'Sunday']
Sunday_Miles = Sunday_Data[['Average Miles per Trip']].mean ()
print (Monday_Miles) # to confirm that code is working as intended and returning desired results
We now use the values above to create a dataframe that we can use to plot daily average miles by the day of the week.
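A compact way to get a plot-ready table of per-day means in calendar order is to group once and then reindex. This is a sketch on made-up values; `day_order` and the sample rows are assumptions for illustration.

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical rows; not every day of the week is present
sample = pd.DataFrame({
    "Day of Week": ["Sunday", "Monday", "Friday", "Monday"],
    "Average Miles per Trip": [3.0, 2.0, 2.6, 2.4],
})

# groupby sorts days alphabetically; reindex restores calendar order
# and leaves NaN for any day that has no rows
daily = (sample.groupby("Day of Week")["Average Miles per Trip"]
               .mean()
               .reindex(day_order))

weekday_part = daily.loc[day_order[:5]]  # Monday through Friday
weekend_part = daily.loc[day_order[5:]]  # Saturday and Sunday
print(daily)
```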
In [101]:
Weekday_Daily_Mileage = pd.DataFrame ({'Monday' : Monday_Miles,
'Tuesday' : Tuesday_Miles,
'Wednesday' : Wednesday_Miles,
'Thursday' : Thursday_Miles,
'Friday' : Friday_Miles})
Weekend_Daily_Mileage = pd.DataFrame ({'Saturday' : Saturday_Miles,
'Sunday' : Sunday_Miles})
Weekday_Daily_Mileage = Weekday_Daily_Mileage[['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']]
Weekend_Daily_Mileage = Weekend_Daily_Mileage[['Saturday', 'Sunday']]
print (Weekday_Daily_Mileage)
print (Weekend_Daily_Mileage)
Weekday_Daily_Mileage.head ()
Out[101]:
The analysis shows a steady downward trend from Monday through Thursday: the average trip is noticeably longer on Monday than on Thursday. This is fairly logical. It’s easy to imagine that riders are energized on Monday after a relaxing weekend, and tired, busy, and distracted later in the week. This means that they have more energy and are more willing to take a longer ride to work at the beginning of the week. Later in the week, only those who are going short distances want to ride a bike.
However, there is a considerable spike on Fridays, up to more than 2.5 miles per trip on average. This could represent a number of things, but it is likely related to the impending weekend. Riders are energized by the end of the week, and may bike more at the end of the day because they have the free time. Alternatively, riders may be working from home and going for longer rides during a break in the middle of the day.
In [102]:
fig, ax = plt.subplots(1)
Weekday_Daily_Mileage.plot(ax=ax, kind = 'bar', title = 'Daily Weekday Average Miles per Trip Q1')
ax.legend(['MON', 'TUE', 'WED', 'THU', 'FRI'], loc='best')
ax.set_xlabel('Days of the week (Weekday)')
ax.set_ylabel('Average number of miles')
ax.set_ylim (0,3.0)
Out[102]:
We already know that weekend riders travel much farther than weekday riders, but we anticipated that there would be some difference in Saturday vs. Sunday usage. Instead, data on average miles per trip was almost identical between the two days. Sunday is marginally higher, but it is fair to conclude that weekend travel distance is split evenly across Saturday and Sunday.
In [103]:
fig, ax = plt.subplots(1)
Weekend_Daily_Mileage.plot(ax=ax,
kind = 'barh',
title = 'Daily Weekend Average Miles per Trip Q1')
ax.legend(['SAT', 'SUN'], loc='best')
ax.set_ylabel('Day')
ax.set_xlabel('Average number of miles')
ax.set_xlim (0,3.5)
Out[103]: