In this study I will use the dataset provided on Kaggle for the challenge entitled "Cycle Share Dataset". The dataset was released by Pronto, Seattle's cycle sharing system, which operates 500 bikes and 58 stations located throughout the city. The data covers every bike trip from 10-13-2014 to 8-31-2016 and is composed of 3 tables:
It is worth noticing that Pronto clients are split into two kinds of users: either annual members or short-term pass holders. Short-term pass holders can own two kinds of tickets: either a 24-hour pass or a 3-day pass. One can notice that Pronto does not provide any user ID, hence it will be impossible to conduct this analysis in terms of number of users; it will be conducted in terms of number of trips instead.
Furthermore, the final aim of this project is to try to predict accurately the bike demand at each station in Seattle.
This project falls into six main parts, detailed below:
N.B.: The IPython notebooks attached to this project must be run in the same order as in the report.
Following a classical data science workflow, we must first load the raw data and do some data cleaning. Throughout this project we will use the convenient pandas DataFrame structure to represent our datasets. One can observe below the shapes and some extracts of the raw DataFrames we will use in the following.
In [1]:
%run acquisition_raw_data.ipynb
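The cell above hides the loading details behind %run; purely for reference, a minimal sketch of such an acquisition step, assuming the usual trip.csv, station.csv and weather.csv files from the Kaggle archive (the file names are an assumption, not taken from the notebook), could look like this:

import pandas as pd

# Assumed file names from the Kaggle "Cycle Share Dataset" archive
trips = pd.read_csv('trip.csv')        # one row per bike trip
stations = pd.read_csv('station.csv')  # one row per station
weather = pd.read_csv('weather.csv')   # one row of weather observations per day

for name, df in [('trips', trips), ('stations', stations), ('weather', weather)]:
    print(name, df.shape)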
This part mainly relies on the data exploration step. The idea is to analyze the trip data in order to identify first time patterns and then user behaviors with respect to their age or type (annual member or short-term pass holder). As said in the introduction, the frequency of use will be quantified by the number of bike trips.
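As an illustration of the kind of aggregation performed in the cells of this part, the daily trip count could be obtained roughly as follows (a sketch assuming the trips DataFrame from the acquisition step and its 'starttime' column):

import pandas as pd

# Sketch: count the number of trips per day
trips['starttime'] = pd.to_datetime(trips['starttime'])
daily_counts = (trips.set_index('starttime')
                     .resample('D')
                     .size()
                     .rename('trip_count'))
print(daily_counts.head())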
In [2]:
%run daily_time_analysis.ipynb
As expected, the temperature curve is strongly correlated with the daily count of trips. Moreover, one can notice the influence of the seasons on the number of trips: more during summer, fewer during winter.
In [3]:
%run hourly_time_analysis.ipynb
Indeed, during weekdays we can clearly observe the influence of office hours, whereas during weekend days the distribution is centered around 14:00, which is an ideal time for a bike ride.
In [4]:
%run age_distribution.ipynb
In [5]:
%run type_user.ipynb
As one could have imagined, annual members are more likely to use Pronto bikes to commute to work, whereas short-term pass holders are more likely to go cycling occasionally during weekends as a leisure activity.
The first data exploitation tool we would like to use is a Principal Component Analysis (PCA). This machine learning algorithm will allow us to reduce the dimensionality of our data in order to ease structure recognition.
To perform PCA, we decided to represent our trip data as 1-day data points. To do so, we computed the hourly number of trips for every day. The aim is to identify patterns in this time distribution which could be explained either by the weather or by the day of the week; both candidate explanations are motivated by the data exploration done above.
To these 24 dimensions (one per hour of the day) we add the mean temperature. This leaves us with 25 dimensions to be reduced.
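A condensed sketch of how such a daily matrix might be assembled is given below; the weather column names ('Date', 'Mean_Temperature_F') are assumptions about the Kaggle tables, and the actual construction is done in building_PCA_data.ipynb:

import pandas as pd

# Sketch: one row per day, 24 hourly trip counts plus the mean temperature
trips['starttime'] = pd.to_datetime(trips['starttime'])
trips['date'] = trips['starttime'].dt.date
trips['hour'] = trips['starttime'].dt.hour
hourly = pd.crosstab(trips['date'], trips['hour'])  # daily counts per hour of the day
weather['date'] = pd.to_datetime(weather['Date']).dt.date
pca_data = hourly.join(weather.set_index('date')['Mean_Temperature_F'])
print(pca_data.shape)  # expected: (number of days, 25)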
One can find below an extract of the DataFrame we will use to perform PCA. As said before, we have 689 data points and 25 dimensions, which will be reduced to 2 dimensions for further analysis.
The remaining 4 columns (precipitation, day of the week, events, total_trips) will only be used as labels to interpret the PCA results.
Regarding the column 'Events', the numerization is as follows:
In [6]:
%run building_PCA_data.ipynb
In [7]:
%run applying_PCA.ipynb
The results are highly interesting since PCA almost perfectly distinguishes weekends from weekdays. In other words, the two PCA components allow us to explain the variance within our dataset in terms of weekdays vs. weekends. This is consistent with Part 2.
Furthermore, dry and hot days tend to have higher PCA1 values, whereas wet and cold days tend to have lower PCA1 values.
Hence one can argue that PCA2 is related to the weekday/weekend feature, whereas PCA1 is related to the total number of trips as well as the temperature.
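For reference, the PCA step itself could be sketched as below with scikit-learn; standardising the 25 dimensions before the projection is my assumption about what applying_PCA.ipynb does:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sketch: reduce the 25 standardised dimensions to 2 principal components
X = StandardScaler().fit_transform(pca_data.values)
pca = PCA(n_components=2)
components = pca.fit_transform(X)     # shape: (number of days, 2)
print(pca.explained_variance_ratio_)  # variance explained by PCA1 and PCA2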
The final aim of this project is to predict the demand at a given station. Such a purpose would require knowing the number of available bikes at each station at a given time. We would then be able to run a regression algorithm on this number of available bikes.
Unfortunately, this bike availability information is not provided in the dataset. I e-mailed Pronto to ask whether it was possible to obtain it, but they did not answer. Consequently, we now have to find a way to compute it.
In [8]:
%run station_popularity.ipynb
First of all, some stations are definitely more popular than others.
One can observe that many stations do not reach an equilibrium between the number of starts and the number of stops. This means that Pronto is most certainly rebalancing the distribution of bikes regularly. Obviously this intervention of Pronto in the bike availability is not quantified in our raw data. Consequently it might be impossible to compute the real number of available bikes at a given station.
However, we might try to infer the bike availability at a station which has as many start as stop trips. Hopefully there will be very few interventions of Pronto there, and the total bike variation will be very close to the bike availability. For this test we chose the station SLU-01, which has both enough data and almost the same number of starts and stops.
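The start/stop balance used to pick SLU-01 can be checked with a simple count on the trips table; a sketch, assuming the 'from_station_id' and 'to_station_id' columns of the Kaggle trip table:

import pandas as pd

# Sketch: number of trips starting and stopping at each station
starts = trips['from_station_id'].value_counts()
stops = trips['to_station_id'].value_counts()
balance = pd.DataFrame({'starts': starts, 'stops': stops}).fillna(0)
balance['imbalance'] = balance['stops'] - balance['starts']
print(balance.loc['SLU-01'])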
Hence let's compute the total variation at SLU-01 over the entire period of our data, to see whether we can infer the number of available bikes.
To do so, a new column called 'incrementation' was added, valued +1 for a stop and -1 for a start. This column then allowed us to compute the total variation by taking the cumulative sum, as you can observe below.
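A minimal sketch of that computation for SLU-01, again assuming the column names of the Kaggle trip table:

import pandas as pd

# Sketch: +1 for each stop at SLU-01, -1 for each start, then the cumulative sum
slu_starts = (trips.loc[trips['from_station_id'] == 'SLU-01', ['starttime']]
                   .assign(incrementation=-1)
                   .rename(columns={'starttime': 'time'}))
slu_stops = (trips.loc[trips['to_station_id'] == 'SLU-01', ['stoptime']]
                  .assign(incrementation=+1)
                  .rename(columns={'stoptime': 'time'}))
slu = pd.concat([slu_starts, slu_stops])
slu['time'] = pd.to_datetime(slu['time'])
slu = slu.sort_values('time')
slu['total_variation'] = slu['incrementation'].cumsum()
print(slu[['time', 'incrementation', 'total_variation']].tail())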
In [9]:
%run instanteanous_variation_total.ipynb
It appears that the total variation of bikes is often much greater than the number of docks at SLU-01 (18). Hence we cannot directly infer the bike availability. On reflection, however, the daily bike variation might be an effective indicator of the demand at a station.
In [10]:
%run instanteanous_variation_daily.ipynb
To perform an efficient regression on the daily variation of bikes, we need to resample it on a regular time basis. Indeed, we must exploit the fact that between consecutive data points above the variation is constant! Consequently, let's sample our dataset every 15 minutes, as you can notice below:
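The resampling itself is straightforward with pandas once the variation carries a datetime index; a sketch continuing from the SLU-01 example above:

# Sketch: piecewise-constant variation sampled every 15 minutes
sampled = (slu.set_index('time')['total_variation']
              .resample('15min')
              .last()    # value at the end of each 15-minute bin
              .ffill())  # carry the last known value forward between trips
print(sampled.head())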
In [11]:
%run sampled_variation.ipynb
Now that we know how to compute the sampled daily variation for one station, let's build a single DataFrame gathering the variations of all stations.
For computational reasons, we reduced the amount of data by keeping only the data points verifying the following criteria:
Even with these reductions, the script below is extremely time consuming (about 30 minutes), which is why we provide below the resulting matrix used for the regression. This matrix, 'Xtot_to_keep_all', can be loaded further down so that you do not need to run the next 3 IPython scripts (especially the next one).
In [98]:
#ATTENTION: the script below takes about 30 minutes to run on my computer
#You can instead directly load the resulting matrix later in the report (default)
%run create_sampled_data_each_station.ipynb
print('Sampled Variation database shape: {}'.format(sampled_variation.shape))
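Conceptually, the heavy script above boils down to repeating the SLU-01 computation for every station and stacking the results; a much-condensed illustration (ignoring the filtering criteria, which are handled by create_sampled_data_each_station.ipynb) might look like this:

import pandas as pd

# Sketch: one sampled variation series per station, stacked column-wise
def sampled_variation_for(station_id):
    starts = trips.loc[trips['from_station_id'] == station_id, 'starttime']
    stops = trips.loc[trips['to_station_id'] == station_id, 'stoptime']
    events = pd.concat([
        pd.DataFrame({'time': pd.to_datetime(starts), 'incrementation': -1}),
        pd.DataFrame({'time': pd.to_datetime(stops), 'incrementation': +1}),
    ]).sort_values('time')
    variation = events.set_index('time')['incrementation'].cumsum()
    return variation.resample('15min').last().ffill().rename(station_id)

all_stations = pd.concat(
    [sampled_variation_for(s) for s in trips['from_station_id'].unique()],
    axis=1)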
Now that we have managed to quantify the demand at each station, we can tackle a new data exploitation step. Our goal is to perform a regression on the daily variation of bikes: knowing a dataset X, we would like to predict the daily variation y such that f(X) = y.
This regression problem will cover the following steps:
Regressions will be run using the scikit-learn package.
As we have already computed the daily variation y, only the choice of the features to add to the DataFrame remains. The features most likely to explain the daily variation are listed below:
One can observe below the final structure of the DataFrame we will use for the regression:
In [101]:
%run building_final_dataframe_regression.ipynb
In [100]:
%run numerization_final_dataframe.ipynb
We first tried to perform a regression directly on the entire dataset, with the 45 stations coming from the numerization.
Moreover, we decided to split our dataset equally between a training set and a test set in order to obtain accurate enough predictions.
For the regression evaluation we will use two metrics:
Many different regressor models were tested with default hyperparameters; for each one we report (MSE, MAE):
It appeared that the random forest (RF) regressor outperformed the other regressors by far. Hence we decided to use only the RF regressor in the following.
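A minimal version of the retained setup, assuming the feature matrix Xtot and target ytot produced by the previous scripts (or loaded a few cells below) and default hyperparameters:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Sketch: 50/50 train/test split, then MSE and MAE on the held-out half
X_train, X_test, y_train, y_test = train_test_split(Xtot, ytot,
                                                    test_size=0.5, random_state=0)
rf = RandomForestRegressor()  # default hyperparameters
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print('MSE: {:.2f}'.format(mean_squared_error(y_test, y_pred)))
print('MAE: {:.2f}'.format(mean_absolute_error(y_test, y_pred)))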
One can now wonder what the optimal number of trees to use in the regression is.
In [12]:
#Load data (only if you did not run the previous 3 scripts)
import numpy as np

Xtot = np.load('Xtot_to_keep_all.npy')
ytot = np.load('ytot_to_keep_all.npy')
int_to_station = np.load('int_to_station_all.npy')
int_to_station = int_to_station[()]  # the .npy file stores a dict: extract it from the 0-d object array
In [13]:
#Finding the optimal number of trees (warning: this takes a long time!)
%run regression_onemodel_all.ipynb
As a result, one can notice that the higher the number of trees, the more precise the RF regression. Moreover, it seems that the vast majority of predictions were less than one unit away from the truth, which is extremely encouraging!
In the following, we will consider that 20 trees are precise enough to perform the RF regression on our dataset.
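The sweep over the number of trees could be sketched as follows, reusing the split and the scikit-learn imports from the earlier sketch (the grid of values is my assumption):

# Sketch: compare the RF test MSE for an increasing number of trees
for n_trees in [1, 5, 10, 20, 50, 100]:
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    rf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, rf.predict(X_test))
    print('{:>3} trees -> test MSE: {:.2f}'.format(n_trees, mse))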
In [20]:
%run regression_onemodel_each.ipynb
Comparing the MSE values on the graph above with the value of 0.51 found earlier in the single-model case, it appears that having one model per station provides more precise regressions (lower MSE) for a great majority of stations.
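Fitting one model per station amounts to repeating the previous regression on each station's subset of rows, roughly as below; 'final_df', 'station' and 'daily_variation' are hypothetical names standing in for the actual DataFrame and columns built earlier, and the scikit-learn imports from the sketches above are reused:

# Sketch: one random forest per station, each evaluated on its own test split
per_station_mse = {}
for station, df_s in final_df.groupby('station'):
    X_s = df_s.drop(columns=['station', 'daily_variation']).values
    y_s = df_s['daily_variation'].values
    X_tr, X_te, y_tr, y_te = train_test_split(X_s, y_s, test_size=0.5, random_state=0)
    rf = RandomForestRegressor(n_estimators=20, random_state=0)
    rf.fit(X_tr, y_tr)
    per_station_mse[station] = mean_squared_error(y_te, rf.predict(X_te))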
In [15]:
station_id='SLU-01'
In [16]:
%run numerization_final_dataframe_1station.ipynb
In [17]:
from datetime import date, timedelta

date_start = date(2016,6,6) # day 0 of the week (a Monday)
date_end = date(2016,6,12)
week = [date_start + timedelta(days=x) for x in range((date_end-date_start).days + 1)]
In [18]:
%run regression_1station_partially.ipynb
In [19]:
%run regression_1station_not.ipynb
To conclude, the results look promising when we consider the model whose training set contains some data points from the week we want to predict. However, performance worsens considerably when the training set contains no data point from the week under study.
Nevertheless, the predictions for Monday (June 6th) in the second regression seem almost acceptable. In fact, one can notice a decline in precision over the course of the week.
This project on the cycle share system provided by Pronto allowed us to go through the entire data science process.
From the acquisition of the Kaggle dataset and its cleaning, we moved on to the exploration of user behavior in terms of time trends, age and user type.
Then a PCA allowed us to clearly identify the correlation between the number of trips and the weather features. This data exploitation also recovered some patterns already identified during exploration, such as the distinction between weekdays and weekends.
Despite the large amount of data, we also managed to compute the daily bike variation for almost all the stations. Once this ground truth of the daily variation was computed, we were able to run many different regressions on it to best fit our data. The random forest regressor was by far the most precise one found. Moreover, we tested different fitting strategies: either one model for all stations or one model per station. Finally, the regression results for the optimal number of trees were highly encouraging.