Linear methods


In [1]:
# !pip install -U kaggle
# register the token in your kaggle profile & save it to ~/.kaggle/kaggle.json
# !kaggle competitions download -c bike-sharing-demand

In [19]:
import pandas as pd
from sklearn import linear_model
from scipy import stats
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer
import numpy as np
from dateutil.parser import parse
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import cross_val_score
%matplotlib inline

In [3]:
# to prevent pandas' SettingWithCopyWarning raised inside the sklearn pipeline code
pd.options.mode.chained_assignment = None

In [4]:
df = pd.read_csv("train.csv")
df.shape


Out[4]:
(10886, 12)

In [5]:
df.head()


Out[5]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1

First, we need the custom score function described in the task: https://www.kaggle.com/c/bike-sharing-demand/overview/evaluation

  • Why do we need the +1 in the score function?

In [6]:
def rmsle(y_true, y_pred):
    # negative predictions are clipped to zero: log1p is undefined below -1
    y_pred_clipped = np.clip(y_pred, 0., None)
    # RMSLE is just RMSE computed in log1p space
    return mean_squared_error(np.log1p(y_true), np.log1p(y_pred_clipped)) ** .5
  • What happens without np.clip? (See the quick check below.)
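
Both questions can be checked directly (an illustrative snippet, not part of the original notebook flow):

import numpy as np

np.log(0.)                             # -inf: a plain log blows up on zero counts
np.log1p(0.)                           # 0.0: the +1 shift keeps zeros finite

y_pred = np.array([-2., 0., 5.])       # a linear model can predict negative counts
np.log1p(np.clip(y_pred, 0., None))    # [0., 0., 1.792]; without the clip,
                                       # log1p(-2.) is nan and the metric breaks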

Let's start with the existing features and simple linear regression.

The purpose of all these feature extractors and the grid search will become clearer later.


In [7]:
class SimpleFeatureExtractor(BaseEstimator, TransformerMixin):
    """Pass through the raw feature columns unchanged."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[["holiday", "workingday", "season", "weather", "temp", "atemp", "humidity", "windspeed"]].values

In [8]:
extractor = SimpleFeatureExtractor()
clf = Pipeline([
    ("extractor", extractor),
    ("regression", linear_model.LinearRegression()),
])
param_grid = {}
scorer = make_scorer(rmsle, greater_is_better=False)
researcher = GridSearchCV(clf, param_grid, scoring=scorer, cv=5, n_jobs=4, verbose=1, refit=False)
researcher.fit(df, df["count"].values)


Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    2.0s finished
Out[8]:
GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('extractor', SimpleFeatureExtractor()), ('regression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False))]),
       fit_params=None, iid='warn', n_jobs=4, param_grid={},
       pre_dispatch='2*n_jobs', refit=False, return_train_score='warn',
       scoring=make_scorer(rmsle, greater_is_better=False), verbose=1)

GridSearchCV always maximizes the score function, so when a metric should be minimized, make_scorer simply negates it. That is why best_score_ below is negative.
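
In code (a small illustration reusing rmsle from above):

scorer = make_scorer(rmsle, greater_is_better=False)
# for any fitted estimator: scorer(estimator, X, y) == -rmsle(y, estimator.predict(X)),
# so GridSearchCV keeps maximizing while we effectively minimize RMSLE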


In [9]:
researcher.best_score_


Out[9]:
-1.4687746950336822

Add regularization and grid-search the hyperparameters

Now it's clearer why we have the grid searcher ;-)


In [10]:
extractor = SimpleFeatureExtractor()
clf = Pipeline([
    ("extractor", extractor),
    ("regression", linear_model.ElasticNet()),
])

In [11]:
param_grid = {
    "regression__alpha": np.logspace(-3, 2, 10),
    "regression__l1_ratio": np.linspace(0, 1, 10)
}
scorer = make_scorer(rmsle, greater_is_better=False)
researcher = GridSearchCV(clf, param_grid, scoring=scorer, cv=5, n_jobs=4, verbose=1, refit=False)
researcher.fit(df, df["count"].values)


Fitting 5 folds for each of 100 candidates, totalling 500 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  76 tasks      | elapsed:    1.8s
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed:    6.8s finished
Out[11]:
GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('extractor', SimpleFeatureExtractor()), ('regression', ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'regression__alpha': array([1.00000e-03, 3.59381e-03, 1.29155e-02, 4.64159e-02, 1.66810e-01,
       5.99484e-01, 2.15443e+00, 7.74264e+00, 2.78256e+01, 1.00000e+02]), 'regression__l1_ratio': array([0.     , 0.11111, 0.22222, 0.33333, 0.44444, 0.55556, 0.66667,
       0.77778, 0.88889, 1.     ])},
       pre_dispatch='2*n_jobs', refit=False, return_train_score='warn',
       scoring=make_scorer(rmsle, greater_is_better=False), verbose=1)

In [12]:
researcher.best_score_


Out[12]:
-1.4196641068278841

In [13]:
researcher.best_params_


Out[13]:
{'regression__alpha': 27.825594022071257,
 'regression__l1_ratio': 0.3333333333333333}

Try to add some custom features


In [14]:
class FeatureExtractor(BaseEstimator, TransformerMixin):

    # note: these are class-level attributes, so every instance shares
    # the same encoder and scaler
    ohe = OneHotEncoder(categories='auto', sparse=False)
    scaler = StandardScaler()

    categorical_columns = ["week_day", "hour", "season", "weather"]
    numerical_columns = ["temp", "atemp", "humidity", "windspeed"]

    def _add_features(self, X):
        # derive day-of-week and hour from the datetime string
        X["week_day"] = X.datetime.apply(lambda dttm: parse(dttm).weekday())
        X["hour"] = X.datetime.apply(lambda dttm: parse(dttm).hour)

    def _combine(self, *feature_groups):
        return np.hstack(feature_groups)

    def collect_stats(self, X):
        # fit the encoder and scaler once, up front; fit() below is a no-op,
        # so cross-validation reuses these statistics instead of refitting them
        self._add_features(X)
        self.ohe.fit(X[self.categorical_columns])
        self.scaler.fit(X[self.numerical_columns])

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        self._add_features(X)
        custom_binary_features = self.ohe.transform(X[self.categorical_columns])
        scaled_features = self.scaler.transform(X[self.numerical_columns])
        return self._combine(
            custom_binary_features,
            scaled_features,
            X[["holiday", "workingday"]].values
        )

In [15]:
extractor = FeatureExtractor()
extractor.collect_stats(df)
clf = Pipeline([
    ("extractor", extractor),
    ("regression", linear_model.ElasticNet()),
])


/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py:645: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)

In [16]:
param_grid = {
    "regression__alpha": np.logspace(-3, 2, 10),
    "regression__l1_ratio": np.linspace(0, 1, 10)
}
pd.options.mode.chained_assignment = None
scorer = make_scorer(rmsle, greater_is_better=False)
researcher = GridSearchCV(clf, param_grid, scoring=scorer, cv=5, n_jobs=4, verbose=1, refit=False)
researcher.fit(df, df["count"].values)


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 5 folds for each of 100 candidates, totalling 500 fits
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  1.2min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:  5.2min
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed: 11.6min
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed: 12.9min finished
Out[16]:
GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('extractor', FeatureExtractor()), ('regression', ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'regression__alpha': array([1.00000e-03, 3.59381e-03, 1.29155e-02, 4.64159e-02, 1.66810e-01,
       5.99484e-01, 2.15443e+00, 7.74264e+00, 2.78256e+01, 1.00000e+02]), 'regression__l1_ratio': array([0.     , 0.11111, 0.22222, 0.33333, 0.44444, 0.55556, 0.66667,
       0.77778, 0.88889, 1.     ])},
       pre_dispatch='2*n_jobs', refit=False, return_train_score='warn',
       scoring=make_scorer(rmsle, greater_is_better=False), verbose=1)

In [17]:
researcher.best_score_


Out[17]:
-1.1171565953466736

In [18]:
researcher.best_params_


Out[18]:
{'regression__alpha': 0.046415888336127795,
 'regression__l1_ratio': 0.7777777777777777}

In [31]:
scorer = make_scorer(mean_squared_error, greater_is_better=False)
scores = cross_val_score(clf, df, df["count"].values, cv=5, n_jobs=4, scoring=scorer)
# scores are negated MSE, so flip the sign and take the square root to get RMSE
np.mean((-np.array(scores)) ** .5)


Out[31]:
156.66559636027856

What we could theoretically get if we optimized RMSE directly


In [34]:
param_grid = {
    "regression__alpha": np.logspace(-3, 2, 10),
    "regression__l1_ratio": np.linspace(0, 1, 10)
}
pd.options.mode.chained_assignment = None

def rmse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred) ** .5

scorer = make_scorer(rmse, greater_is_better=False)
researcher = GridSearchCV(clf, param_grid, scoring=scorer, cv=5, n_jobs=4, verbose=1, refit=False)
researcher.fit(df, df["count"].values)


Fitting 5 folds for each of 100 candidates, totalling 500 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  1.2min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:  5.5min
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed: 11.8min
[Parallel(n_jobs=4)]: Done 500 out of 500 | elapsed: 13.5min finished
Out[34]:
GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('extractor', FeatureExtractor()), ('regression', ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'regression__alpha': array([1.00000e-03, 3.59381e-03, 1.29155e-02, 4.64159e-02, 1.66810e-01,
       5.99484e-01, 2.15443e+00, 7.74264e+00, 2.78256e+01, 1.00000e+02]), 'regression__l1_ratio': array([0.     , 0.11111, 0.22222, 0.33333, 0.44444, 0.55556, 0.66667,
       0.77778, 0.88889, 1.     ])},
       pre_dispatch='2*n_jobs', refit=False, return_train_score='warn',
       scoring=make_scorer(rmse, greater_is_better=False), verbose=1)

In [35]:
researcher.best_score_


Out[35]:
-129.4953879296039

In [36]:
researcher.best_params_


Out[36]:
{'regression__alpha': 0.01291549665014884,
 'regression__l1_ratio': 0.4444444444444444}
  • 11 min!!! Now the FeatureExtractor also runs on every fit and the pipeline becomes heavier. Why? Can you speed it up? (One possible approach is sketched below.)
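
One possible answer (a sketch, assuming the extractor's statistics were already collected): the row-by-row dateutil.parse calls inside transform dominate the runtime and are re-run for every fold of every candidate. Precomputing the feature matrix once and searching only over the regressor removes that repeated work:

# transform once: the expensive datetime parsing happens a single time
X_features = extractor.transform(df)
fast_researcher = GridSearchCV(
    linear_model.ElasticNet(),
    {"alpha": np.logspace(-3, 2, 10), "l1_ratio": np.linspace(0, 1, 10)},
    scoring=make_scorer(rmsle, greater_is_better=False),
    cv=5, n_jobs=4, refit=False,
)
fast_researcher.fit(X_features, df["count"].values)
# vectorizing also helps: pd.to_datetime(X.datetime).dt.weekday and .dt.hour
# instead of parsing each row with dateutil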

What was the point about Maximum Likelihood?

The process is better described by the Poisson distribution

https://en.wikipedia.org/wiki/Poisson_distribution

In probability theory and statistics, the Poisson distribution (French pronunciation: [pwasɔ̃]; in English often rendered /ˈpwɑːsɒn/), named after French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

Another point of view: we have 200 people, each with a 3% probability of picking up a bike.
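
A quick illustration of that view (an illustrative sketch; the parameters 200 and 3% come from the sentence above): the Binomial(200, 0.03) distribution is closely approximated by a Poisson with $\lambda = 200 \cdot 0.03 = 6$.

import numpy as np
from scipy import stats

n, p = 200, 0.03
k = np.arange(15)
binom_pmf = stats.binom.pmf(k, n, p)
poisson_pmf = stats.poisson.pmf(k, n * p)
np.abs(binom_pmf - poisson_pmf).max()    # the pmfs agree to within a few thousandths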

What about the CLT? It works when $n \rightarrow \infty$. For the binomial distribution there is a special case of the CLT called the De Moivre–Laplace theorem.

The list of different kinds of Generalized Linear Regression methods in sklearn: https://scikit-learn.org/stable/modules/linear_model.html

And there is no Poisson regression there.

So let's write a probabilistic model for the Poisson distribution and optimize the maximum likelihood.

Homework: try to do it.

Hint:

Start from the assumption $\hat{y} = \exp\langle x, \theta \rangle$, take the derivative of the log-likelihood with respect to $\theta$, set it to zero, and check the sign of the second derivative.
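
For reference, the first steps of that derivation (standard Poisson MLE algebra, not the full solution):

$$\ell(\theta) = \sum_i \log \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!} = \sum_i \left( y_i \langle x_i, \theta \rangle - e^{\langle x_i, \theta \rangle} - \log y_i! \right), \qquad \lambda_i = e^{\langle x_i, \theta \rangle},$$

$$\nabla_\theta \ell(\theta) = \sum_i \left( y_i - e^{\langle x_i, \theta \rangle} \right) x_i .$$

The Hessian $-\sum_i e^{\langle x_i, \theta \rangle} x_i x_i^\top$ is negative semi-definite, so the log-likelihood is concave and a stationary point is a maximum.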

The conclusion: we can simulate Poisson regression with a simple wrapper.

Poisson hierarchical regression

Check whether np.log causes issues where count == 0 (the empty result below shows the training set has no zero counts, so the log transform is safe)


In [37]:
df[df["count"] == 0]


Out[37]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count week_day hour

In [38]:
np.log(0)


/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: RuntimeWarning: divide by zero encountered in log
  """Entry point for launching an IPython kernel.
Out[38]:
-inf

In [39]:
class PoissonRegression(linear_model.ElasticNet):
    """ElasticNet fitted in log space: train on log(y), predict exp(...).

    This simulates a Poisson GLM with a log link; the loss is squared
    error in log space rather than the exact Poisson likelihood.
    """

    def __init__(self, alpha=1.0, l1_ratio=0.5, fit_intercept=True,
                 normalize=False, precompute=False, max_iter=1000,
                 copy_X=True, tol=1e-4, warm_start=False, positive=False,
                 random_state=None, selection='cyclic'):
        super().__init__(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=fit_intercept,
                         normalize=normalize, precompute=precompute, max_iter=max_iter,
                         copy_X=copy_X, tol=tol, warm_start=warm_start, positive=positive,
                         random_state=random_state, selection=selection)

    def fit(self, X, y, *args):
        # requires y > 0: log(0) = -inf (we checked above that count is never 0)
        return super().fit(X, np.log(y), *args)

    def predict(self, X):
        return np.exp(super().predict(X))

In [40]:
extractor = FeatureExtractor()
extractor.collect_stats(df)
clf = Pipeline([
    ("extractor", extractor),
    ("regression", PoissonRegression()),
])
param_grid = {
    "regression__alpha": np.logspace(-5, 1, 20),
    "regression__l1_ratio": np.linspace(0, 1, 10)
}
pd.options.mode.chained_assignment = None
scorer = make_scorer(rmsle, greater_is_better=False)
researcher = GridSearchCV(clf, param_grid, scoring=scorer, cv=5, n_jobs=4, verbose=1, refit=False)
researcher.fit(df, df["count"].values)


Fitting 5 folds for each of 200 candidates, totalling 1000 fits
/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py:645: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:  5.7min
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed: 12.8min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed: 23.4min
[Parallel(n_jobs=4)]: Done 1000 out of 1000 | elapsed: 30.3min finished
Out[40]:
GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('extractor', FeatureExtractor()), ('regression', PoissonRegression(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
         max_iter=1000, normalize=False, positive=False, precompute=False,
         random_state=None, selection='cyclic', tol=0.0001,
         warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'regression__alpha': array([1.00000e-05, 2.06914e-05, 4.28133e-05, 8.85867e-05, 1.83298e-04,
       3.79269e-04, 7.84760e-04, 1.62378e-03, 3.35982e-03, 6.95193e-03,
       1.43845e-02, 2.97635e-02, 6.15848e-02, 1.27427e-01, 2.63665e-01,
       5.45559e-01, 1.12884e+00, 2.33572e+00, 4.83293e+00, 1.00000e+01]), 'regression__l1_ratio': array([0.     , 0.11111, 0.22222, 0.33333, 0.44444, 0.55556, 0.66667,
       0.77778, 0.88889, 1.     ])},
       pre_dispatch='2*n_jobs', refit=False, return_train_score='warn',
       scoring=make_scorer(rmsle, greater_is_better=False), verbose=1)

In [41]:
researcher.best_params_


Out[41]:
{'regression__alpha': 0.001623776739188721,
 'regression__l1_ratio': 0.1111111111111111}

In [42]:
researcher.best_score_


Out[42]:
-0.7326059519929896

In terms of MSE the score is worse (see the next cell), but that doesn't mean MSE is the most relevant metric here. At least Poisson regression never predicts negative values.

  • When would you expect Poisson regression to have a better MSE score?

In [43]:
scorer = make_scorer(mean_squared_error, greater_is_better=False)
scores = cross_val_score(clf, df, df["count"].values, cv=5, n_jobs=4, scoring=scorer)
np.mean((-np.array(scores)) ** .5)


Out[43]:
193.30622303311725

Skill vs Education

When you need to predict counts, try Poisson regression. Experience can get you good enough results, but skills alone won't carry you when you face a new type of task: the more complicated the task, the less your previous experience helps.

The key to success is a good enough education. With education you can do research.


In [52]:
df_test = pd.read_csv("test.csv")
cols = df_test.columns

# collect the encoder/scaler statistics over train + test together,
# so they also cover whatever appears in the test set
all_data = pd.concat([df[cols], df_test[cols]])

In [63]:
extractor = FeatureExtractor()
extractor.collect_stats(all_data)
clf = Pipeline([
    ("extractor", extractor),
    ("regression", PoissonRegression(alpha=0.001623776739188721, l1_ratio=0.1111111111111111)),
])
clf.fit(df, df["count"].values)
df_test["count"] = clf.predict(df_test)


/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py:645: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:27: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:27: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

In [68]:
df_test[["datetime","count"]].set_index("datetime").to_csv("linear.csv")

In [71]:
# !kaggle competitions submit -f linear.csv -m "linear regression" bike-sharing-demand
# score 0.64265

Further steps: use a Random Forest Regressor & CatBoost to get into the top 10%. A minimal sketch of the first of those is below.
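
A minimal sketch of that next step (the hyperparameters here are illustrative, not tuned):

from sklearn.ensemble import RandomForestRegressor

extractor = FeatureExtractor()
extractor.collect_stats(all_data)
clf = Pipeline([
    ("extractor", extractor),
    ("regression", RandomForestRegressor(n_estimators=200, n_jobs=4)),
])
scores = cross_val_score(clf, df, df["count"].values, cv=5,
                         scoring=make_scorer(rmsle, greater_is_better=False))
-scores.mean()                           # mean cross-validated RMSLE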