Bogazici-SWE546-Spring2016 Final Project - Mustafa Atik
I am just a beginner at Kaggle competitions. After reading a few posts dealing with the famous Titanic data, I felt confident enough to try to solve a problem on my own. The bike sharing problem is also famous, but this time I will not read the articles about it. After I am done with this problem, I will surely take a look at them.
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
In this problem, you are asked to predict bike rental demand in Washington, D.C. using the historical data. Since the demand you are trying to predict is a continuous number, this is a regression problem, not a classification problem. In the Titanic competition, you are asked to predict whether a given person survives or not; that is obviously a classification task.
I assume you are familiar with Python and popular libraries such as pandas, seaborn, pyplot, numpy, and even sklearn.
In [20]:
from IPython.core import display
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, cross_validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.grid_search import GridSearchCV
%matplotlib inline
In [2]:
def readCsv(name):
    # load the csv and derive calendar features from the datetime column
    df = pd.read_csv("data/bike-sharing/{}.csv".format(name), parse_dates=["datetime"])
    df["year"] = pd.DatetimeIndex(df['datetime']).year
    df["month"] = pd.DatetimeIndex(df['datetime']).month
    df["hour"] = pd.DatetimeIndex(df['datetime']).hour
    df["dayofweek"] = pd.DatetimeIndex(df['datetime']).dayofweek
    if "count" in df.columns:
        # the target is right-skewed, so train on log(count + 1)
        df["count"] = np.log(df["count"] + 1)
    return df

df = readCsv("train")
df.sample(5)
Out[2]:
datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather -
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals
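One detail in readCsv worth calling out: the count target is log-transformed because rental counts are heavily right-skewed and the competition scores predictions on a log scale, so any predictions must be transformed back before submitting. A minimal round-trip sketch using numpy's log1p/expm1, which are equivalent to the expressions in readCsv:

counts = np.array([1., 10., 100., 1000.])
logged = np.log1p(counts)    # same as np.log(counts + 1)
restored = np.expm1(logged)  # inverse transform: exp(x) - 1
print np.allclose(counts, restored)  # True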
In [3]:
df.info()
In [4]:
def exploreUnivariate(feature):
    # print summary statistics and plot the distribution of a single feature
    print "\n-----------------------\nFEATURE: {}\n".format(feature.name)
    print feature.describe()
    sns.distplot(feature)
    plt.show()
In [5]:
for c in df.columns:
    if c == "datetime":
        continue
    exploreUnivariate(df[c])
In [6]:
# excluded features: "days_in_month", "dayofweek"
cols = [
    "hour", "workingday", "holiday", "month",
    "weather", "season", "temp", "humidity", "windspeed"]
for i in cols:
    sns.boxplot(x=i, y="count", data=df)
    plt.show()
First, the data is split into a training set and a test set.
In [7]:
X = df[cols]
y = df["count"]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2, random_state=0)
print("training set: ", X_train.shape, y_train.shape)
print("test set: ", X_test.shape, y_test.shape)
In [8]:
def calcError(clf, x_test, y_test):
    # sum of absolute errors on the log scale
    # (the square root of a squared difference is its absolute value)
    y_hat = clf.predict(x_test)
    return np.sum(np.abs(np.asarray(y_test) - np.asarray(y_hat)))
To evaluate the proposed regressors, the R² score (the coefficient of determination) is used: http://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination

$$ R^2(y, \hat{y}) = 1 - \frac{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } $$

where $n$ is the number of samples and $\bar{y}$ is the mean of the observed values.
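As a quick sanity check on the formula, here is a minimal sketch with made-up numbers; sklearn's r2_score computes the same quantity:

from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# R^2 from the definition above
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
print 1 - ss_res / ss_tot       # 0.9486...
print r2_score(y_true, y_pred)  # same value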
In [9]:
clf = svm.SVR(kernel="rbf", C=6, gamma=.02)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
Out[9]:
Figure: the ID3 decision tree algorithm -- Michael Crawford, http://www.saedsayad.com/decision_tree.htm
In [10]:
X = df[cols]
y = df["count"]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = RandomForestRegressor()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
Out[10]:
We can compare the random forest regressor (RFR) and the SVM regressor by looking at their scores. The random forest regressor performs far better than the SVM regressor, so I choose the RFR.
To fine-tune our random forest regressor, we search for the optimum parameters with a grid search. For each combination of parameters, a new instance of the random forest regressor is created with that parameter set and its cross-validated score is calculated automatically. After the grid search, it is easy to pick the best parameter combination to pass to our regressor.
In [11]:
param_grid = [
    {"n_estimators": [2, 5, 10, 15, 20, 100], "bootstrap": [False, True]}
]
clf = GridSearchCV(
    RandomForestRegressor(), param_grid=param_grid, cv=5).fit(X_train, y_train)
In [12]:
clf.grid_scores_
Out[12]:
As shown above, the optimum parameter set is {n_estimators: 100, bootstrap: True}.
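Instead of scanning grid_scores_ by eye, the fitted GridSearchCV object also exposes the winner directly; a quick check on the clf fitted above:

print clf.best_params_  # the combination with the best mean CV score
print clf.best_score_   # that combination's mean cross-validated R^2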
In [15]:
clf = RandomForestRegressor(n_estimators=100, bootstrap=True, max_depth=20)
# note: fitting on all of X means X_test is inside the training data,
# so this score is optimistic compared to the held-out scores above
clf.fit(X, y)
print "score: ", clf.score(X_test, y_test)
i = 22
y_hat = clf.predict(X_train[i:i+1])
print("actual:{} , predicted:{}".format(int(y_train[i:i+1]), y_hat))
In [19]:
sns.regplot(x=y_test, y=clf.predict(X_test))
Out[19]:
In [67]:
test_df = readCsv("test")
result = pd.concat([
    test_df["datetime"],
    # invert the log(count + 1) transform with expm1 (np.exp alone would be off by one)
    pd.DataFrame(np.expm1(clf.predict(test_df[cols])))
], axis=1, join='inner')
result.columns = ["datetime", "count"]
result.head(10)
Out[67]:
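Kaggle expects exactly these two columns in a CSV, so writing the submission is one line; the file name below is arbitrary:

result.to_csv("submission.csv", index=False)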
In [64]:
uniqueMonths = test_df["month"].unique()
uniqueYears = test_df["year"].unique()
predictions = []
for year in uniqueYears:
    for m in uniqueMonths:
        print "month ", year, m
        # fit one regressor per (year, month) slice of the training data
        date_condition = (df["month"] == m) & (df["year"] == year)
        X = df[date_condition][cols]
        y = df[date_condition]["count"]
        date_condition = (test_df["month"] == m) & (test_df["year"] == year)
        q = test_df[date_condition][cols]
        clf = RandomForestRegressor(n_estimators=200)
        clf.fit(X, y)
        predictions.append(clf.predict(q))
In [619]:
result = pd.concat([
    test_df["datetime"],
    # stitch the per-month predictions together in test_df's row order
    # and invert the log(count + 1) transform
    pd.DataFrame(np.expm1(np.concatenate(predictions)))
], axis=1, join='inner')
result.columns = ["datetime", "count"]
result.head(10)
Out[619]:
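This stitching relies on the loop order (year outer, month inner) matching test_df's chronological row order, which holds for this test set. A cheap sanity check under that assumption:

# every test row must receive exactly one prediction
assert sum(len(p) for p in predictions) == len(test_df)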