Bogazici-SWE546-Spring2016 Final Project - Mustafa Atik
I am just a beginner at Kaggle competitions. After reading a few posts dealing with the famous Titanic data, I felt confident enough to try to solve a problem on my own. The bike sharing problem is also famous, but this time I will not read the articles about it. After I am done with this problem, I will surely take a look at them.
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
In this problem, you are asked to predict bike rental demand in Washington, D.C. using the historical data. Since the demand you are trying to predict is a continuous number, this is a regression problem, not a classification problem. In the Titanic competition, you are asked to predict whether a given person survives or not; that is obviously a classification task.
I assume you are familiar with Python and popular libraries such as pandas, seaborn, pyplot, numpy, and even sklearn.
In [20]:
from IPython.core import display
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, cross_validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.grid_search import GridSearchCV
%matplotlib inline
In [2]:
def readCsv(name):
    # load the csv and derive calendar features from the datetime column
    df = pd.read_csv("data/bike-sharing/{}.csv".format(name), parse_dates=["datetime"])
    df["year"] = pd.DatetimeIndex(df['datetime']).year
    df["month"] = pd.DatetimeIndex(df['datetime']).month
    df["hour"] = pd.DatetimeIndex(df['datetime']).hour
    df["dayofweek"] = pd.DatetimeIndex(df['datetime']).dayofweek
    if "count" in df.columns:
        # the target is right-skewed, so train on log(count + 1)
        df["count"] = np.log(df["count"] + 1)
    return df

df = readCsv("train")
df.sample(5)
Out[2]:
datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather -
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals
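One detail in readCsv worth calling out: the count target is log-transformed because rental counts are heavily right-skewed and the competition scores predictions on a log scale, so any predictions must be transformed back before submitting. A minimal round-trip sketch using numpy's log1p/expm1, which are equivalent to the expressions in readCsv:

counts = np.array([1., 10., 100., 1000.])
logged = np.log1p(counts)    # same as np.log(counts + 1)
restored = np.expm1(logged)  # inverse transform: exp(x) - 1
print np.allclose(counts, restored)  # True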
In [3]:
df.info()
In [4]:
def exploreUnivariate(feature):
    # print summary statistics and plot the distribution of a single feature
    print "\n-----------------------\nFEATURE: {}\n".format(feature.name)
    print feature.describe()
    sns.distplot(feature)
    plt.show()
In [5]:
for c in df.columns:
    if c == "datetime":
        continue
    exploreUnivariate(df[c])
In [6]:
# excluded features: "days_in_month", "dayofweek"
cols = [
    "hour", "workingday", "holiday", "month",
    "weather", "season", "temp", "humidity", "windspeed"]
for i in cols:
    sns.boxplot(x=i, y="count", data=df)
    plt.show()
First, the data is split into a training set and a test set.
In [7]:
X = df[cols]
y = df["count"]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2, random_state=0)
print("training set: ", X_train.shape, y_train.shape)
print("test set: ", X_test.shape, y_test.shape)
In [8]:
def calcError(clf, x_test, y_test):
    # sum of absolute errors on the log scale
    # (the square root of a squared difference is its absolute value)
    y_hat = clf.predict(x_test)
    return np.sum(np.abs(np.asarray(y_test) - np.asarray(y_hat)))
To evaluate the proposed regressors, the R² score (the coefficient of determination) is used: http://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination

$$ R^2(y, \hat{y}) = 1 - \frac{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } $$

where $n$ is the number of samples and $\bar{y}$ is the mean of the observed values.
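As a quick sanity check on the formula, here is a minimal sketch with made-up numbers; sklearn's r2_score computes the same quantity:

from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# R^2 from the definition above
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
print 1 - ss_res / ss_tot       # 0.9486...
print r2_score(y_true, y_pred)  # same value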
In [9]:
clf = svm.SVR(kernel="rbf", C=6, gamma=.02)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
Out[9]:
Figure: the ID3 decision tree algorithm -- Michael Crawford, http://www.saedsayad.com/decision_tree.htm
In [10]:
X = df[cols]
y = df["count"]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = RandomForestRegressor()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
Out[10]:
We can compare the random forest regressor (RFR) and the SVM regressor by looking at their scores. The random forest regressor performs far better than the SVM regressor, so I choose the RFR.
To fine-tune our random forest regressor, we search for the optimum parameters with a grid search. For each combination of parameters, a new instance of the random forest regressor is created with that parameter set and its cross-validated score is calculated automatically. After the grid search, it is easy to pick the best parameter combination to pass to our regressor.
In [11]:
param_grid = [
    {"n_estimators": [2, 5, 10, 15, 20, 100], "bootstrap": [False, True]}
]
clf = GridSearchCV(
    RandomForestRegressor(), param_grid=param_grid, cv=5).fit(X_train, y_train)
In [12]:
clf.grid_scores_
Out[12]:
As shown above, the optimum parameter set is {n_estimators: 100, bootstrap: True}.
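Instead of scanning grid_scores_ by eye, the fitted GridSearchCV object also exposes the winner directly; a quick check on the clf fitted above:

print clf.best_params_  # the combination with the best mean CV score
print clf.best_score_   # that combination's mean cross-validated R^2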
In [15]:
clf = RandomForestRegressor(n_estimators=100, bootstrap=True, max_depth=20)
# note: fitting on all of X means X_test is inside the training data,
# so this score is optimistic compared to the held-out scores above
clf.fit(X, y)
print "score: ", clf.score(X_test, y_test)
i = 22
y_hat = clf.predict(X_train[i:i+1])
print("actual:{} , predicted:{}".format(int(y_train[i:i+1]), y_hat))
In [19]:
sns.regplot(x=y_test, y=clf.predict(X_test))
Out[19]:
In [67]:
test_df = readCsv("test")
result = pd.concat([
    test_df["datetime"],
    # invert the log(count + 1) transform with expm1 (np.exp alone would be off by one)
    pd.DataFrame(np.expm1(clf.predict(test_df[cols])))
], axis=1, join='inner')
result.columns = ["datetime", "count"]
result.head(10)
Out[67]:
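Kaggle expects exactly these two columns in a CSV, so writing the submission is one line; the file name below is arbitrary:

result.to_csv("submission.csv", index=False)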
In [64]:
uniqueMonths = test_df["month"].unique()
uniqueYears = test_df["year"].unique()
predictions = []
for year in uniqueYears:
    for m in uniqueMonths:
        print "month ", year, m
        # fit one regressor per (year, month) slice of the training data
        date_condition = (df["month"] == m) & (df["year"] == year)
        X = df[date_condition][cols]
        y = df[date_condition]["count"]
        date_condition = (test_df["month"] == m) & (test_df["year"] == year)
        q = test_df[date_condition][cols]
        clf = RandomForestRegressor(n_estimators=200)
        clf.fit(X, y)
        predictions.append(clf.predict(q))
In [619]:
result = pd.concat([
    test_df["datetime"],
    # stitch the per-month predictions together in test_df's row order
    # and invert the log(count + 1) transform
    pd.DataFrame(np.expm1(np.concatenate(predictions)))
], axis=1, join='inner')
result.columns = ["datetime", "count"]
result.head(10)
Out[619]:
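This stitching relies on the loop order (year outer, month inner) matching test_df's chronological row order, which holds for this test set. A cheap sanity check under that assumption:

# every test row must receive exactly one prediction
assert sum(len(p) for p in predictions) == len(test_df)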