We have in hand a regression problem with 5000 observations and 254 features, whose names are not known. The number of features is fairly large and we are expected to report on the important features. Therefore we it will be sensible to use a model that appropriatly regularize the outcome and automatically perform 'feature-selection'. Lasso is perfect for that as it uses an L1 norm penalty on the coefficient vector and promotes sparsity.
In addition to Lasso, we performed random forest regression, which outperformed Lasso on the training set. To compute the optimal regularization parameter for Lasso, we empleyed a cross validation approach. Similarly, we determined optimal number of trees and number of features for the random forrest regresssor via a small scale grid search powered by cross validation.
Random forrest classifer also gives importance score for each feature in terms of average reduction in the variance for each cut in that feature. We also report on the importantance of features as determined by random forrest regressor, and compare those with the ones selected by Lasso.
Since we have fairly good amount of data we also measured the performance of both models with unseen data. For that we set side 20% of the data as a test set. Perhaps surprisingly Lasso outperformed random forest regressor on the test set. one could try to increase the number of trees in the random forest model to overcome its overfitting issues with this particular problem. Nevertheless we already tried up to 30 trees and maybe Lasso is just a better choice.
Other feature selection methods such as forward selection where features are added one at a time could have been used. We will omit more complex feature engineering methods such as considering higher order terms and correlation analysis to add new features to incorporate interactions.
Some comments on the qualty of the data set and the data cleaning methods that were employed are given next.
In [3]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
A quick visual investigation of the 'codetest.train.text' file reveals that some of the features are categorical. Also there are missing entries that needs to be taken care of.
Let's do the initial analysis using Pandas.
In [4]:
#load the given data set
import os
import pandas as pd
train_data_path = os.path.join('data', 'codetest_train.txt')
test_data_path = os.path.join('data', 'codetest_test.txt')
df_train = pd.read_csv(train_data_path, sep = '\t')
df_test = pd.read_csv(test_data_path, sep = '\t')
Let's list the categorical features and their unique values:
In [5]:
categorical_keys = []
for key,dt in zip(df_train.keys(), df_train.dtypes):
if dt != 'float64':
categorical_keys.append(key)
print('{}: {}'.format(key,dt))
print(df_train[key].unique())
print(df_train[key].describe())
print('')
Let's impute missing categorical features with 'top' values, and regular features with 'mean'.
In [6]:
for key in df_train.keys():
if key in categorical_keys:
df_train[key] = df_train[key].fillna(df_train[key].mode()[0])
else:
df_train[key] = df_train[key].fillna(df_train[key].mean())
for key in df_test.keys():
if key in categorical_keys:
df_test[key] = df_test[key].fillna(df_train[key].mode()[0])
else:
df_test[key] = df_test[key].fillna(df_train[key].mean())
As a sanity check let's make sure we imputed the 'nan' values from categorical features.
In [7]:
for key,dt in zip(df_train.keys(), df_train.dtypes):
if dt != 'float64':
categorical_keys.append(key)
print('{}: {}'.format(key,dt))
print(df_train[key].unique())
print(df_train[key].describe())
print('')
Let's make sure the features have unit standard deviation since we'll be using Lasso.
In [8]:
for key in df_train.keys():
if key not in categorical_keys and key != 'target':
std = df_train[key].std()
df_train[key] = df_train[key] / std
df_test[key] = df_test[key] / std
Convert the Pandas data frame to Numpy arrays.
In [9]:
# I learned this trick of going from a data frame to a python dictionary from the following post:
# http://stackoverflow.com/questions/20024584/vectorizing-a-pandas-dataframe-for-scikit-learn
# note the first index in the df.keys() is the 'target' vector
X_train = df_train[list(df_train.keys())[1:]].T.to_dict().values()
X_test = df_test.T.to_dict().values()
Y_train = df_train['target'].values
Now we'll use scikit-learn's DictVectorizer method to encode the categorical variables using one-hot encoding.
In [10]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
vec.fit(X_train)
X_train = vec.transform(X_train).toarray()
X_test = vec.transform(X_test).toarray()
Now we have the the data in the format scikit-learn expects.
Compute basic statistics of the target vector and plot its boxplot and histogram.
In [11]:
print('Statistics for the target vector in training data:')
print('--------------------------------')
print('98th percentile : {:.2f}'.format(np.percentile(Y_train, 98)))
print('75th percentile : {:.2f}'.format(np.percentile(Y_train, 75)))
print('2th percentile : {:.2f}'.format(np.percentile(Y_train, 5)))
print('median : {:.2f}'.format(np.median(Y_train)))
print('mean : {:.2f}'.format(np.mean(Y_train)))
print('std : {:.2f}'.format(np.std(Y_train)))
In [12]:
fig = plt.figure(figsize = (7,3))
ax1 = plt.subplot(121)
ax1.boxplot(Y_train)
ax1.set_ylabel('target value')
ax2 = plt.subplot(122)
_, _,_ = ax2.hist(Y_train,bins = 30)
ax2.set_xlabel('target value')
plt.show()
One can argue that values outside the interval [-20,20] are outliers. Yet they could contain useful information as well. Note that the number of data points is much larger than number of features, and we'll be using a regularized fit. So let's keep the larger observations in the data set.
Since we decided to leave feature engineering to Lasso, the only crucial element in the model selection procedure is the tuning of the regularization parameter. We will leave this task to scikit-learn's LassoCV method which automatically finds the best value using cross validation. We chose to use 10-fold cross validation. Specifically we need to pick an optimal $\alpha$ for the following minimization problem:
$\min_{ \beta \in \mathbb{R}^p } \left\{ \frac{1}{2N} \left\| Y - X \beta \right\|_2^2 + \alpha \| \beta \|_1 \right\}$,
where $\beta$ is the unknown parameter vector, N is the number of samples, and $X$ and $Y$ are training data and target vector repectively.
In [41]:
#The code in this cell is partially borrowed from here: http://bit.ly/1JPHF2x
# LassoCV:
from sklearn.linear_model import LassoCV
model = LassoCV(cv=5).fit(X_train, Y_train)
# list of alphas that were tried-out
m_log_alphas = -np.log10(model.alphas_)
plt.figure()
plt.plot(m_log_alphas, model.mse_path_, ':')
plt.plot(m_log_alphas, model.mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='optimal alpha')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold')
plt.axis('tight')
plt.show()
from sklearn.metrics import mean_squared_error
Y_predict = model.predict(X_train)
r2 = model.score(X_train, Y_train)
mse = mean_squared_error(Y_predict, Y_train)
print('Optimal alpha: {:.2f}'.format(model.alpha_))
print('R2 with Lasso: {:.6f}'.format(r2))
print('MSE with Lasso: {:.6f}'.format(mse))
The above plot nicely demonstrates the MSE error as a funciton of the regularization parameter $\alpha$. The optimal value is picked right where the average (over all cross validation errors) MSE is minimum (rigth before it increases again due to over-regularization).
Let's visualize given target vector with our prediction for the first 100 data instances.
In [42]:
fig = plt.figure(figsize=(15,3))
plt.plot(Y_train[:100], 'o', label = 'data')
plt.plot(Y_predict[:100], '-o',color = 'g', label = 'prediction')
plt.legend()
plt.xlabel('data index')
plt.ylabel('target value')
title = ['Training performance of the Lasso model for the first 100 data points -' +
' Overall R2:' + str(round(r2,2)) + ' - MSE:' + str(round(mse,2))]
plt.title(title[0])
plt.ylim([-15,15])
plt.show()
Let's find out which parameters are set to zero by the L1 regularizer:
In [43]:
beta = model.coef_
feature_indexes = []
feature_names = []
feature_coeffs = []
for i,b in enumerate(beta):
if b>0:
feature_indexes.append(i)
feature_names.append(vec.feature_names_[i])
feature_coeffs.append(model.coef_[i])
#print('{:5s}: {:.4}'.format(vec.feature_names_[i],b))
print('{} features with nonzero coefficients:'.format(len(feature_names)))
feature_coeffs, feature_names = zip(*(sorted(zip(feature_coeffs, feature_names),reverse = True)))
Note that only 37 features were selected to be important by Lasso. Next, let's take a look at the coefficient values of selected features.
In [44]:
width = 0.6
fig = plt.figure(figsize = (12,3))
ax = plt.subplot(111)
ax.bar(range(len(feature_coeffs)), feature_coeffs, width = width)
ax.set_xticks(np.arange(len(feature_coeffs)) + width/2)
ax.set_xticklabels(feature_names, rotation = 90)
ax.set_title('Features and their fitted coefficients predicted by Lasso')
ax.set_ylabel('coefficient value')
plt.show()
As observed from the above bar plot the most effective features are f_175 and f_205 within the framework of our model. More specifically, a unit standard deviation (computed using the training data) increase in the features shown above will result in an increase in the target vector given by the coefficients given above. Note that this insight is only valid for our linear model and one should avoid comments on causality as the correlation, independance or interaction between features are not considered.
We conclude the feature analysis here. The important features selected by the Lasso regression is given above. Their predictive power could further be investigated via a more detailed analysis of the coefficient values and perhaps their p-values.
Below we try out two more methods:
Let's create a new training set with only the important features and see if we'll improveon the R2 performance using a ordinary least squares fit:
In [45]:
#manually add the missing categorical features for f_237 and f_61"
for key in vec.vocabulary_.keys():
if 'f_237' in key or 'f_61' in key:
feature_indexes.append(vec.vocabulary_[key])
feature_indexes = set(feature_indexes)
In [46]:
X_train2 = X_train[:,list(feature_indexes)]
X_test2 = X_test[:,list(feature_indexes)]
In [47]:
from sklearn.linear_model import LinearRegression
olr = LinearRegression()
olr.fit(X_train2,Y_train)
Y_predict2 = olr.predict(X_train2)
mse2 = mean_squared_error(Y_predict2, Y_train)
r22 = olr.score(X_train2, Y_train)
print('R2 with ordinary linear regression with selected features: {:.6f}'.format(r22))
print('MSE with ordinary linear regression with selected features: {:.6f}'.format(mse2))
Random forrests are good as they control overtiffing by averaging the results from lots of trees fitted into data. Let's devise a quick grid search random forrest regressor and see how we'll do.
In [37]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import grid_search
parameters = {'max_features': ('sqrt', 'log2'), 'n_estimators': [5,10,15,20,25,30]}
rf = RandomForestRegressor()
model_rf = grid_search.GridSearchCV(rf, parameters, cv = 5)
model_rf.fit(X_train, Y_train)
Y_predict = model_rf.predict(X_train)
mse3 = mean_squared_error(Y_predict, Y_train)
r23 = model_rf.score(X_train, Y_train)
print('Random Forest R2: {}'.format(r23))
print('Random Forest MSE: {}'.format(mse3))
print('\nParameters Selected by Grid Search:')
print(model_rf.best_params_)
We can access the importance of the features according to the random foorrest classifier as well. feature_importances_
atrribute of the random forrest classfier will give an importance of each feature according the average reduction in the variance due to all the cuts in that feature.
In [48]:
#first generate a rf regressor with determinded paremeters
model_rf = RandomForestRegressor(max_features = params['max_features'], n_estimators = params['n_estimators'])
model_rf.fit(X_train, Y_train)
Y_predict = model_rf.predict(X_train)
mse3 = mean_squared_error(Y_predict, Y_train)
r23 = model_rf.score(X_train, Y_train)
print('Random Forest R2: {}'.format(r23))
print('Random Forest MSE: {}'.format(mse3))
In [49]:
feature_names_rf = []
for i, v in enumerate(model_rf.feature_importances_):
feature_names_rf.append(vec.feature_names_[i])
std = np.mean([tree.feature_importances_ for tree in model_rf.estimators_], axis = 0)
feature_names_rf, feature_importance, std = zip(*sorted(
zip(feature_names_rf, model_rf.feature_importances_, std), key = lambda x: x[1], reverse = True))
In [50]:
width = 0.6
fig = plt.figure(figsize = (12,3))
ax = plt.subplot(111)
ax.bar(range(37), feature_importance[:37], width = width, yerr = std[:37])
ax.set_xticks(np.arange(37) + width/2)
ax.set_xticklabels(feature_names_rf[:37], rotation = 90)
ax.set_title('Features and their fitted coefficients predicted by Random Forests')
ax.set_ylabel('Importance')
width = 0.6
fig = plt.figure(figsize = (12,3))
ax = plt.subplot(111)
ax.bar(range(len(feature_coeffs)), feature_coeffs, width = width)
ax.set_xticks(np.arange(len(feature_coeffs)) + width/2)
ax.set_xticklabels(feature_names, rotation = 90)
ax.set_title('Features and their fitted coefficients predicted by Lasso')
ax.set_ylabel('coefficient value')
plt.show()
plt.show()
As shown in the figures above two most important fetures determined random forrest and Lasso are f_175
and f_205
. Let's look at the common features among the most important 15 features returned by each model:
In [51]:
for name in set(feature_names_rf[:15]) & set(feature_names[:15]):
print(name)
Cool, eight among most important fifteen features for each model are common. We can now be more confident and say that the features above are the most important ones.
Although our grid search did not cover high number of combinations of hyperparameters we may want to make sure that the good training set error is not achieved by chance. One solution is to set aside a portion of the training set and perform a final test on that portion of the training set which has not been seen by the model before.:
In [39]:
from sklearn.cross_validation import train_test_split
X_train_train, X_train_test, Y_train_train, Y_train_test = train_test_split(X_train, Y_train, test_size = 0.2)
model_rf2 = grid_search.GridSearchCV(rf, parameters, cv = 4)
model_rf2.fit(X_train_train, Y_train_train)
Y_predict = model_rf2.predict(X_train_train)
mse3 = mean_squared_error(Y_predict, Y_train_train)
r23 = model_rf2.score(X_train_train, Y_train_train)
print('Random Forest train R2: {}'.format(r23))
print('Random Forest train MSE: {}'.format(mse3))
print('\nParameters Selected by Grid Search:')
print(model_rf2.best_params_)
Y_predict = model_rf2.predict(X_train_test)
mse3 = mean_squared_error(Y_predict, Y_train_test)
r23 = model_rf2.score(X_train_test, Y_train_test)
print('\nRandom Forest test R2: {}'.format(r23))
print('Random Forest test MSE: {}'.format(mse3))
Now, we notice that the performance on the test much worse than the cross-validation score on the training set.
In [53]:
#The code in this cell is partially borrowed from here: http://bit.ly/1JPHF2x
# LassoCV:
model = LassoCV(cv=5).fit(X_train_train, Y_train_train)
# list of alphas that were tried-out
m_log_alphas = -np.log10(model.alphas_)
plt.figure()
plt.plot(m_log_alphas, model.mse_path_, ':')
plt.plot(m_log_alphas, model.mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='optimal alpha')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold')
plt.axis('tight')
plt.show()
Y_predict = model.predict(X_train_train)
r2 = model.score(X_train_train, Y_train_train)
mse = mean_squared_error(Y_predict, Y_train_train)
print('Optimal alpha: {:.2f}'.format(model.alpha_))
print('Training R2 with Lasso: {:.6f}'.format(r2))
print('Training MSE with Lasso: {:.6f}'.format(mse))
Y_predict = model.predict(X_train_test)
r2 = model.score(X_train_test, Y_train_test)
mse = mean_squared_error(Y_predict, Y_train_test)
print('\nTest R2 with Lasso: {:.6f}'.format(r2))
print('Test MSE with Lasso: {:.6f}'.format(mse))
It turns out Lasso performs better when tested with unseen data! This example well illustrates the importance of setting apart a test set when possible to access the performance of different models!
We used Lasso regresssion to automatically select important features where 37 features were indentified to be important.
Random forrest classifier performed significantly better on the trainin set, where the performance was measured using 4-fold cross validation. Here are the measn square errors (MSE) and $R^2$ scores for three methods we tried for the whole of the data set:
Train | Lasso | OLR with Selected Features | Random Forrest |
---|---|---|---|
MSE | 11.29 | 11.18 | 2.86 |
R2 | 0.59 | 0.60 | 0.897 |
We also split the training data into test and training part with the dation of 4 to 1. Trained our models with the training set. Here are the performances for the unseen test data for Lasso and random forest regressors:
Test | Lasso | Random Forrest |
---|---|---|
MSE | 12.67 | 19.33 |
R2 | 0.56 | 0.33 |
As a result we chose to deploy the Lasso regression model. One thing to note is that the performance of the random forrest regressor could be enhanced by increasing the number of trees.
It would have been interesting to see if performance could be enhanced by a more detailed feature engineering that incorporates higher order terms and interactions between features. Also, trying an elastic net regularizatiom, which is a mixture of L1 and L2 type penalties could be interesting.
Let's perform prediction on the test data and save it to a file. We will use a random forrest model as stated above:
In [54]:
Y_test_rf = model_rf2.predict(X_test)
Y_test = model.predict(X_test)
In [55]:
with open('codetest_part1_out.txt','w') as f:
for y in Y_test:
f.write(str(y)+'\n')
Sanity check to make sure Lasso and random forrest result in similar estimations for the test data set
In [56]:
fig = plt.figure(figsize=(15,3))
plt.plot(Y_test[:100], '-o', label = 'lasso')
plt.plot(Y_test_rf[:100], '-o',color = 'g', label = 'random forrest')
plt.legend()
plt.show()
In [58]:
!ipython nbconvert regression_wo_knowing_feature_names.ipynb