In this study I will use the dataset provided on Kaggle for the challenge entitled "Cycle Share Dataset". The dataset was released by Pronto, Seattle's cycle sharing system, which operates 500 bikes and 58 stations located throughout the city. The data covers every bike trip from 10-13-2014 to 8-31-2016 and is composed of 3 tables:
It is worth noticing that Pronto clients are split into two kinds of users: either annual members or short-term pass holders. Short-term pass holders can own two kinds of tickets: either a 24-hour pass or a 3-day pass. One can notice that Pronto does not provide any user ID, hence it will be impossible to conduct this analysis in terms of number of users; it will be conducted in terms of number of trips instead.
Furthermore, the final aim of this project is to try to predict accurately the bike demand at each station in Seattle.
This project falls into six main parts, detailed below:
N.B.: The IPython notebooks attached to this project must be run in the same order as in the report.
Following a classical data science workflow, we must first load the raw data and do some data cleaning. Throughout this project we will use the convenient pandas DataFrame structure to represent our datasets. One can observe below the shapes and some extracts of the raw DataFrames we will use in the following.
In [1]:
%run acquisition_raw_data.ipynb
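The cell above hides the loading details behind %run; purely for reference, a minimal sketch of such an acquisition step, assuming the usual trip.csv, station.csv and weather.csv files from the Kaggle archive (the file names are an assumption, not taken from the notebook), could look like this:

import pandas as pd

# Assumed file names from the Kaggle "Cycle Share Dataset" archive
trips = pd.read_csv('trip.csv')        # one row per bike trip
stations = pd.read_csv('station.csv')  # one row per station
weather = pd.read_csv('weather.csv')   # one row of weather observations per day

for name, df in [('trips', trips), ('stations', stations), ('weather', weather)]:
    print(name, df.shape)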
This part mainly relies on the data exploration step. The idea is to analyze the trip data in order to identify first time patterns and then user behaviors with respect to their age or type (annual member or short-term pass holder). As said in the introduction, the frequency of use will be quantified by the number of bike trips.
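As an illustration of the kind of aggregation performed in the cells of this part, the daily trip count could be obtained roughly as follows (a sketch assuming the trips DataFrame from the acquisition step and its 'starttime' column):

import pandas as pd

# Sketch: count the number of trips per day
trips['starttime'] = pd.to_datetime(trips['starttime'])
daily_counts = (trips.set_index('starttime')
                     .resample('D')
                     .size()
                     .rename('trip_count'))
print(daily_counts.head())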
In [2]:
%run daily_time_analysis.ipynb
As expected, the temperature curve is strongly correlated with the daily count of trips. Moreover, one can notice the influence of the seasons on the number of trips: more during summer, fewer during winter.
In [3]:
%run hourly_time_analysis.ipynb
Indeed, during weekdays we can clearly observe the influence of office hours, whereas during weekend days the distribution is centered around 14:00, which is an ideal time for a bike ride.
In [4]:
%run age_distribution.ipynb
In [5]:
%run type_user.ipynb
As one could have imagined, annual members are more likely to use Pronto bikes to commute to work, whereas short-term pass holders are more likely to go cycling occasionally during weekends as a leisure activity.
The first data exploitation tool we would like to use is a Principal Component Analysis (PCA). This machine learning algorithm will allow us to reduce the dimensionality of our data in order to ease structure recognition.
To perform PCA, we decided to represent our trip data as 1-day data points. To do so, we computed the hourly number of trips for every day. The aim is to identify patterns in this time distribution which could be explained either by the weather or by the day of the week; both candidate explanations are motivated by the data exploration done above.
To these 24 dimensions (one per hour of the day) we add the mean temperature. This leaves us with 25 dimensions to be reduced.
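A condensed sketch of how such a daily matrix might be assembled is given below; the weather column names ('Date', 'Mean_Temperature_F') are assumptions about the Kaggle tables, and the actual construction is done in building_PCA_data.ipynb:

import pandas as pd

# Sketch: one row per day, 24 hourly trip counts plus the mean temperature
trips['starttime'] = pd.to_datetime(trips['starttime'])
trips['date'] = trips['starttime'].dt.date
trips['hour'] = trips['starttime'].dt.hour
hourly = pd.crosstab(trips['date'], trips['hour'])  # daily counts per hour of the day
weather['date'] = pd.to_datetime(weather['Date']).dt.date
pca_data = hourly.join(weather.set_index('date')['Mean_Temperature_F'])
print(pca_data.shape)  # expected: (number of days, 25)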
One can find below an extract of the DataFrame we will use to perform PCA. As said before, we have 689 data points and 25 dimensions, which will be reduced to 2 dimensions for further analysis.
The remaining 4 columns (precipitation, day of the week, events, total_trips) will only be used as labels to interpret the PCA results.
Regarding the column 'Events', the numerization is as follows:
In [6]:
%run building_PCA_data.ipynb
In [7]:
%run applying_PCA.ipynb
The results are highly interesting since PCA almost perfectly distinguishes weekends from weekdays. In other words, the two PCA components allow us to explain the variance within our dataset in terms of weekdays vs. weekends. This is consistent with Part 2.
Furthermore, dry and hot days tend to have higher PCA1 values, whereas wet and cold days tend to have lower PCA1 values.
Hence one can argue that PCA2 is related to the weekday/weekend feature, whereas PCA1 is related to the total number of trips as well as the temperature.
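For reference, the PCA step itself could be sketched as below with scikit-learn; standardising the 25 dimensions before the projection is my assumption about what applying_PCA.ipynb does:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sketch: reduce the 25 standardised dimensions to 2 principal components
X = StandardScaler().fit_transform(pca_data.values)
pca = PCA(n_components=2)
components = pca.fit_transform(X)     # shape: (number of days, 2)
print(pca.explained_variance_ratio_)  # variance explained by PCA1 and PCA2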
The final aim of this project is to predict the demand at a given station. Such a purpose would require knowing the number of available bikes at each station at a given time. We would then be able to run a regression algorithm on this number of available bikes.
Unfortunately, this bike availability information is not provided in the dataset. I e-mailed Pronto to ask whether it was possible to obtain it, but they did not answer. Consequently, we now have to find a way to compute it.
In [8]:
%run station_popularity.ipynb
First of all, some stations are definitely more popular than others.
One can observe that many stations do not reach an equilibrium between the number of starts and the number of stops. This means that Pronto is most certainly rebalancing the distribution of bikes regularly. Obviously this intervention of Pronto in the bike availability is not quantified in our raw data. Consequently it might be impossible to compute the real number of available bikes at a given station.
However, we might try to infer the bike availability at a station which has as many start as stop trips. Hopefully there will be very few interventions of Pronto there, and the total bike variation will be very close to the bike availability. For this test we chose the station SLU-01, which has both enough data and almost the same number of starts and stops.
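The start/stop balance used to pick SLU-01 can be checked with a simple count on the trips table; a sketch, assuming the 'from_station_id' and 'to_station_id' columns of the Kaggle trip table:

import pandas as pd

# Sketch: number of trips starting and stopping at each station
starts = trips['from_station_id'].value_counts()
stops = trips['to_station_id'].value_counts()
balance = pd.DataFrame({'starts': starts, 'stops': stops}).fillna(0)
balance['imbalance'] = balance['stops'] - balance['starts']
print(balance.loc['SLU-01'])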
Hence let's compute the total variation at SLU-01 over the entire period of our data, to see whether we can infer the number of available bikes.
To do so, a new column called 'incrementation' was added, valued +1 for a stop and -1 for a start. This column then allowed us to compute the total variation by taking the cumulative sum, as you can observe below.
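A minimal sketch of that computation for SLU-01, again assuming the column names of the Kaggle trip table:

import pandas as pd

# Sketch: +1 for each stop at SLU-01, -1 for each start, then the cumulative sum
slu_starts = (trips.loc[trips['from_station_id'] == 'SLU-01', ['starttime']]
                   .assign(incrementation=-1)
                   .rename(columns={'starttime': 'time'}))
slu_stops = (trips.loc[trips['to_station_id'] == 'SLU-01', ['stoptime']]
                  .assign(incrementation=+1)
                  .rename(columns={'stoptime': 'time'}))
slu = pd.concat([slu_starts, slu_stops])
slu['time'] = pd.to_datetime(slu['time'])
slu = slu.sort_values('time')
slu['total_variation'] = slu['incrementation'].cumsum()
print(slu[['time', 'incrementation', 'total_variation']].tail())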
In [9]:
%run instanteanous_variation_total.ipynb
It appears that the total variation of bikes is often much greater than the number of docks at SLU-01 (18). Hence we cannot directly infer the bike availability. On reflection, however, the daily bike variation might be an effective indicator of the demand at a station.
In [10]:
%run instanteanous_variation_daily.ipynb
To perform an efficient regression on the daily variation of bikes, we need to resample it on a regular time basis. Indeed, we must exploit the fact that between consecutive data points above the variation is constant! Consequently, let's sample our dataset every 15 minutes, as you can notice below:
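The resampling itself is straightforward with pandas once the variation carries a datetime index; a sketch continuing from the SLU-01 example above:

# Sketch: piecewise-constant variation sampled every 15 minutes
sampled = (slu.set_index('time')['total_variation']
              .resample('15min')
              .last()    # value at the end of each 15-minute bin
              .ffill())  # carry the last known value forward between trips
print(sampled.head())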
In [11]:
%run sampled_variation.ipynb
Now that we know how to compute the sampled daily variation for one station, let's build a single DataFrame gathering the variations of all stations.
For computational reasons, we reduced the amount of data by keeping only the data points verifying the following criteria:
Even with these reductions, the script below is extremely time consuming (about 30 minutes), which is why we provide below the resulting matrix used for the regression. This matrix, 'Xtot_to_keep_all', can be loaded further down so that you do not need to run the next 3 IPython scripts (especially the next one).
In [98]:
#ATTENTION: the script below takes about 30 minutes to run on my computer
#You can instead directly load the resulting matrix later in the report (default)
%run create_sampled_data_each_station.ipynb
print('Sampled Variation database shape: {}'.format(sampled_variation.shape))
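Conceptually, the heavy script above boils down to repeating the SLU-01 computation for every station and stacking the results; a much-condensed illustration (ignoring the filtering criteria, which are handled by create_sampled_data_each_station.ipynb) might look like this:

import pandas as pd

# Sketch: one sampled variation series per station, stacked column-wise
def sampled_variation_for(station_id):
    starts = trips.loc[trips['from_station_id'] == station_id, 'starttime']
    stops = trips.loc[trips['to_station_id'] == station_id, 'stoptime']
    events = pd.concat([
        pd.DataFrame({'time': pd.to_datetime(starts), 'incrementation': -1}),
        pd.DataFrame({'time': pd.to_datetime(stops), 'incrementation': +1}),
    ]).sort_values('time')
    variation = events.set_index('time')['incrementation'].cumsum()
    return variation.resample('15min').last().ffill().rename(station_id)

all_stations = pd.concat(
    [sampled_variation_for(s) for s in trips['from_station_id'].unique()],
    axis=1)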
Now that we have managed to quantify the demand at each station, we can tackle a new data exploitation step. Our goal is to perform a regression on the daily variation of bikes: knowing a dataset X, we would like to predict the daily variation y such that f(X) = y.
This regression problem will cover the following steps:
Regressions will be run using the scikit-learn package.
As we have already computed the daily variation y, only the choice of the features to add to the DataFrame remains. The features most likely to explain the daily variation are listed below:
One can observe below the final structure of the DataFrame we will use for the regression:
In [101]:
%run building_final_dataframe_regression.ipynb
In [100]:
%run numerization_final_dataframe.ipynb
We first tried to perform a regression directly on the entire dataset, with the 45 stations coming from the numerization.
Moreover, we decided to split our dataset equally between a training set and a test set in order to obtain accurate enough predictions.
For the regression evaluation we will use two metrics:
Many different regressor models were tested with default hyperparameters; for each one we report (MSE, MAE):
It appeared that the random forest (RF) regressor outperformed the other regressors by far. Hence we decided to use only the RF regressor in the following.
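A minimal version of the retained setup, assuming the feature matrix Xtot and target ytot produced by the previous scripts (or loaded a few cells below) and default hyperparameters:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Sketch: 50/50 train/test split, then MSE and MAE on the held-out half
X_train, X_test, y_train, y_test = train_test_split(Xtot, ytot,
                                                    test_size=0.5, random_state=0)
rf = RandomForestRegressor()  # default hyperparameters
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print('MSE: {:.2f}'.format(mean_squared_error(y_test, y_pred)))
print('MAE: {:.2f}'.format(mean_absolute_error(y_test, y_pred)))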
One can now wonder what the optimal number of trees to use in the regression is.
In [12]:
#Load data (only if you did not run the previous 3 scripts)
import numpy as np

Xtot = np.load('Xtot_to_keep_all.npy')
ytot = np.load('ytot_to_keep_all.npy')
int_to_station = np.load('int_to_station_all.npy')
int_to_station = int_to_station[()]  # the .npy file stores a dict: extract it from the 0-d object array
In [13]:
#Finding the optimal number of trees (warning: this takes a long time!)
%run regression_onemodel_all.ipynb
As a result, one can notice that the higher the number of trees, the more precise the RF regression. Moreover, it seems that the vast majority of predictions were less than one unit away from the truth, which is extremely encouraging!
In the following, we will consider that 20 trees are precise enough to perform the RF regression on our dataset.
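The sweep over the number of trees could be sketched as follows, reusing the split and the scikit-learn imports from the earlier sketch (the grid of values is my assumption):

# Sketch: compare the RF test MSE for an increasing number of trees
for n_trees in [1, 5, 10, 20, 50, 100]:
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    rf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, rf.predict(X_test))
    print('{:>3} trees -> test MSE: {:.2f}'.format(n_trees, mse))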
In [20]:
%run regression_onemodel_each.ipynb
Comparing the MSE values on the graph above with the value of 0.51 found earlier in the single-model case, it appears that having one model per station provides more precise regressions (lower MSE) for a great majority of stations.
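Fitting one model per station amounts to repeating the previous regression on each station's subset of rows, roughly as below; 'final_df', 'station' and 'daily_variation' are hypothetical names standing in for the actual DataFrame and columns built earlier, and the scikit-learn imports from the sketches above are reused:

# Sketch: one random forest per station, each evaluated on its own test split
per_station_mse = {}
for station, df_s in final_df.groupby('station'):
    X_s = df_s.drop(columns=['station', 'daily_variation']).values
    y_s = df_s['daily_variation'].values
    X_tr, X_te, y_tr, y_te = train_test_split(X_s, y_s, test_size=0.5, random_state=0)
    rf = RandomForestRegressor(n_estimators=20, random_state=0)
    rf.fit(X_tr, y_tr)
    per_station_mse[station] = mean_squared_error(y_te, rf.predict(X_te))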
In [15]:
station_id='SLU-01'
In [16]:
%run numerization_final_dataframe_1station.ipynb
In [17]:
from datetime import date, timedelta

date_start = date(2016,6,6) # day 0 of the week (a Monday)
date_end = date(2016,6,12)
week = [date_start + timedelta(days=x) for x in range((date_end-date_start).days + 1)]
In [18]:
%run regression_1station_partially.ipynb
In [19]:
%run regression_1station_not.ipynb
To conclude, the results look promising when we consider the model whose training set contains some data points from the week we want to predict. However, performance worsens considerably when the training set contains no data point from the week under study.
Nevertheless, the predictions for Monday (June 6th) in the second regression seem almost acceptable. In fact, one can notice a decline in precision over the course of the week.
This project on the cycle share system provided by Pronto allowed us to go through the entire data science process.
From the acquisition of the Kaggle dataset and its cleaning, we moved on to the exploration of user behavior in terms of time trends, age and user type.
Then a PCA allowed us to clearly identify the correlation between the number of trips and the weather features. This data exploitation also recovered some patterns already identified during exploration, such as the distinction between weekdays and weekends.
Despite the large amount of data, we also managed to compute the daily bike variation for almost all the stations. Once this ground truth of the daily variation was computed, we were able to run many different regressions on it to best fit our data. The random forest regressor was by far the most precise one found. Moreover, we tested different fitting strategies: either one model for all stations or one model per station. Finally, the regression results for the optimal number of trees were highly encouraging.