This notebook applies different regression methods to the finalized dataset in search of the best regression model.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
dataset_1min = pd.read_csv('dataset-1min.csv')
print(dataset_1min.shape)
dataset_1min.head(3)
Out[2]:
In [3]:
# Delete duplicate rows in the dataset
dataset_1min = dataset_1min.drop_duplicates()
print(dataset_1min.shape)
dataset_1min.head(3)
Out[3]:
In [4]:
# Subset the features needed
names = ['temperature', 'humidity', 'co2', 'light', 'noise', 'bluetooth_devices','occupancy_count']
df = dataset_1min[names]
df.head(3)
Out[4]:
In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import scale
In [6]:
# Set occupancy_count as the dependent variable and others as independent variables
data = df.iloc[:,0:-1]
target = df.iloc[:,-1]
# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(data, target)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
In [7]:
# Standardize data
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
X_train = standard_scaler.fit_transform(X_train)
X_test = standard_scaler.transform(X_test)
In [8]:
# Select the optimal alpha using Yellowbrick
from yellowbrick.regressor import AlphaSelection
from yellowbrick.regressor import PredictionError, ResidualsPlot
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
model = AlphaSelection(LassoCV())
model.fit(X_train, y_train)
model.poof()
In [9]:
lasso = Lasso(alpha = 0.078)
y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
print("Test set R^2: %.4f"
% lasso.score(X_test, y_test))
print("Mean squared error: %.4f"
% np.mean((y_test - y_pred_lasso) ** 2))
In [10]:
# Coefficients for each feature
pd.DataFrame(lasso.coef_, index=names[0:6], columns=['coefficient'])
Out[10]:
In [11]:
# Plot Regressor evaluation for Lasso
visualizer_res = ResidualsPlot(lasso)
visualizer_res.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer_res.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer_res.poof()
In [12]:
# Plot prediction error plot for Lasso
visualizer_pre = PredictionError(lasso)
visualizer_pre.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer_pre.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer_pre.poof()
"A residuals plot shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate." (http://www.scikit-yb.org/en/latest/examples/methods.html#regressor-evaluation)
Based on the residual plot for the Lasso model, a linear model might not be appropriate for our data.
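As a quick sanity check alongside the Yellowbrick plot, the residuals can also be plotted directly from the predictions computed above (a minimal sketch reusing y_pred_lasso); visible structure such as curvature or a funnel shape would confirm that a linear fit is missing something.
# Sketch: plot test-set residuals against predictions for the Lasso model
residuals = y_test - y_pred_lasso
plt.scatter(y_pred_lasso, residuals, alpha=0.3)
plt.axhline(0, color='k', linestyle='--')
plt.xlabel('Predicted occupancy_count')
plt.ylabel('Residual')
plt.title('Lasso residuals vs. predictions (manual check)')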
Elastic net regression is a hybrid approach that blends penalization of both the L2 and L1 norms. Specifically, elastic net regression minimizes the following:
$$\|y - X\beta\|^2 + \lambda\left[(1-\alpha)\|\beta\|_2^2 + \alpha\|\beta\|_1\right]$$
The α hyper-parameter lies between 0 and 1 and controls how much L2 or L1 penalization is used (0 is ridge, 1 is lasso). The aggressiveness of the penalty for overfitting is controlled by the λ parameter. The usual approach to optimizing λ is cross-validation, minimizing the cross-validated mean squared prediction error, but in elastic net regression the optimal λ also depends heavily on the α hyper-parameter. (http://www.onthelambda.com/2015/08/19/kickin-it-with-elastic-net-regression/)
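Since the optimal λ shifts with α, it can help to cross-validate the two jointly rather than fixing α at scikit-learn's default of 0.5 (a minimal sketch with an illustrative grid only; ElasticNetCV's l1_ratio plays the role of α above and its alpha plays the role of λ):
# Sketch: cross-validate lambda (alpha in scikit-learn) and alpha (l1_ratio) jointly
from sklearn.linear_model import ElasticNetCV
enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5)
enet_cv.fit(X_train, y_train)
print("best alpha (lambda): %.4f, best l1_ratio (alpha): %.2f"
      % (enet_cv.alpha_, enet_cv.l1_ratio_))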
In [13]:
from sklearn.linear_model import ElasticNetCV
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from yellowbrick.regressor import AlphaSelection
from yellowbrick.regressor import PredictionError, ResidualsPlot
In [14]:
# Hyperparameter tuning with yellowbrick
model = AlphaSelection(ElasticNetCV())
model.fit(X_train, y_train)
model.poof()
In [15]:
# Fit Elastic Net model using standardized data
elastic = ElasticNet(alpha = 0.022)
y_pred_elastic = elastic.fit(X_train, y_train).predict(X_test)
print("Test set R^2: %.4f"
% elastic.score(X_test, y_test))
print("Mean squared error: %.4f"
% np.mean((y_test - y_pred_elastic) ** 2))
In [16]:
# Coefficients for each feature
pd.DataFrame(elastic.coef_, index=names[0:6], columns=['coefficient'])
Out[16]:
In [17]:
# Regressor evaluation using Yellowbrick
visualizer_res = ResidualsPlot(elastic)
visualizer_res.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer_res.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer_res.poof()
In [18]:
# Instantiate the visualizer and fit
visualizer = PredictionError(elastic)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()
In [21]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from yellowbrick.regressor import PredictionError, ResidualsPlot
In [31]:
clf = GradientBoostingRegressor(learning_rate = 0.1, random_state = 0)
parameters = {'max_depth': [7,8,9],'n_estimators':[50,100,150]}
gs = GridSearchCV(clf, parameters, cv=5)
gs.fit(X_train, y_train)
gs.cv_results_['mean_test_score']  # mean cross-validated score for each parameter combination
Out[31]:
In [32]:
gs.best_params_
Out[32]:
In [23]:
# Refit using the best parameters from the grid search above
gbr = GradientBoostingRegressor(learning_rate=0.1, max_depth=9, n_estimators=150, random_state=0)
y_pred_gbr = gbr.fit(X_train, y_train).predict(X_test)
print("Test set R^2: %.4f"
% gbr.score(X_test, y_test))
print("Mean squared error: %.4f"
% np.mean((y_test - y_pred_gbr) ** 2))
In [35]:
# Plot feature importance
params = names[0:6]
feature_importance = gbr.feature_importances_
sorted_features=sorted(zip(feature_importance,params))
importances,params_sorted=zip(*sorted_features)
plt.barh(range(len(params)), importances, align='center', alpha=0.6, color='g')
plt.tick_params(axis='y', which='both', labelleft=False, labelright=True)
plt.yticks(range(len(params)), params_sorted, fontsize=12)
plt.xlabel('Mean Importance', fontsize=12)
plt.title('Mean feature importances\n for gradient boosting regressor')
In [25]:
# Plot Regressor evaluation for GradientBoostingRegressor
visualizer_res = ResidualsPlot(gbr)
visualizer_res.fit(X_train, y_train)
visualizer_res.score(X_test, y_test)
g = visualizer_res.poof()
In [26]:
# Plot prediction error plot for GradientBoostingRegressor
visualizer_pre = PredictionError(gbr)
visualizer_pre.fit(X_train, y_train)
visualizer_pre.score(X_test, y_test)
g = visualizer_pre.poof()
In [27]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from yellowbrick.regressor import PredictionError, ResidualsPlot
In [33]:
rf = RandomForestRegressor(max_features = 'log2')
y_pred_rf = rf.fit(X_train, y_train).predict(X_test)
print("Test set R^2: %.4f"
% rf.score(X_test, y_test))
print("Mean squared error: %.4f"
% np.mean((y_test - y_pred_rf) ** 2))
In [37]:
# Plot feature importance
params = names[0:6]
feature_importance = rf.feature_importances_
sorted_features=sorted(zip(feature_importance,params))
importances,params_sorted=zip(*sorted_features)
plt.barh(range(len(params)), importances, align='center', alpha=0.6, color='g')
plt.tick_params(axis='y', which='both', labelleft=False, labelright=True)
plt.yticks(range(len(params)), params_sorted, fontsize=12)
plt.xlabel('Mean Importance', fontsize=12)
plt.title('Mean feature importances\n for random forest regressor')
Out[37]:
In [30]:
# Plot regressor evaluation (residuals) for RandomForestRegressor using Yellowbrick
visualizer_res = ResidualsPlot(rf)
visualizer_res.fit(X_train, y_train)
visualizer_res.score(X_test, y_test)
g = visualizer_res.poof()
In [36]:
# Plot prediction error plot for RandomForestRegressor
visualizer_pre = PredictionError(rf)
visualizer_pre.fit(X_train, y_train)
visualizer_pre.score(X_test, y_test)
g = visualizer_pre.poof()
Based on the residual plots, linear models such as Lasso and ElasticNet are not the best fit for our data (poor performance on both the training and test data). Non-linear models (Random Forest, Gradient Boosting Regression) perform better (at least they describe the training data well).
Based on the R² score and mean squared error, Gradient Boosting Regression has the best predictive performance among the four models tested.
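A compact way to see this comparison side by side (a minimal sketch assuming the four fitted estimators above, lasso, elastic, gbr and rf, are still in scope):
# Sketch: compare test-set R^2 and MSE across the four fitted models
from sklearn.metrics import mean_squared_error
models = {'Lasso': lasso, 'ElasticNet': elastic, 'GradientBoosting': gbr, 'RandomForest': rf}
for name, model in models.items():
    pred = model.predict(X_test)
    print("%-16s R^2: %.4f  MSE: %.4f"
          % (name, model.score(X_test, y_test), mean_squared_error(y_test, pred)))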
CO2 shows the highest importance in Lasso, ElasticNet and the Random Forest Regressor. For Gradient Boosting Regression, though, light is the most important feature. Noise and temperature have little predictive value in our case, largely due to restrictions in the data collection process (stable temperature in the classroom, low variability in noise).
I'm surprised by how high the prediction scores are for all the models tested above (a score above 70% usually indicates a very good model). My guess is that certain features (e.g. CO2) are highly correlated with occupancy and give the model clear "hints" about occupancy. (What do you think?)
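One quick way to probe that guess is to look at the raw correlations between each feature and occupancy_count in the subset built earlier (a minimal sketch reusing df from above):
# Sketch: correlation of each feature with occupancy_count
print(df.corr()['occupancy_count'].sort_values(ascending=False))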
In [ ]: