Heating load and cooling load are good indicators of building energy efficiency. In this notebook, we take the Energy Efficiency Data Set from the UCI Machine Learning Repository and train two machine learning models, SVR and linear regression, on it. Our goal is to find a pattern between building shape and energy efficiency, and to analyze the predictions in order to improve the models.
The dataset was generated by performing energy analyses of 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. The dataset comprises 768 samples and 8 features, and the task is to predict two real-valued responses. It can also be treated as a multi-class classification problem if the responses are rounded to the nearest integer.
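For reference, the UCI page documents the column meanings as follows; keeping the mapping at hand makes the later coefficient analysis easier to read.
In [ ]:
# Column meanings as documented on the UCI repository page (the CSV header
# itself only uses the short names X1-X8, Y1, Y2).
column_names = {
    'X1': 'Relative Compactness',
    'X2': 'Surface Area',
    'X3': 'Wall Area',
    'X4': 'Roof Area',
    'X5': 'Overall Height',
    'X6': 'Orientation',
    'X7': 'Glazing Area',
    'X8': 'Glazing Area Distribution',
    'Y1': 'Heating Load',
    'Y2': 'Cooling Load',
}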
In [4]:
import pandas as pd
import numpy as np
import scipy
from scipy import stats
import matplotlib.pyplot as plt
In [8]:
df = pd.read_csv('ENB2012_data.csv', na_filter=False)
df = df.drop(['Unnamed: 10', 'Unnamed: 11'], axis=1)
# Coerce every column to numeric; malformed entries become NaN and are dropped.
df = df.apply(pd.to_numeric, errors='coerce')
df = df.dropna()
print(df.dtypes)
print(df.head())
# Quick look at the two targets: Y1 (heating load) and Y2 (cooling load).
plt.plot(df['Y1'].values)
plt.show()
plt.plot(df['Y2'].values)
plt.show()
plt.close()
In [9]:
# Heating load vs. cooling load for each sample.
plt.scatter(df['Y1'], df['Y2'])
plt.xlabel('Y1 (Heating Load)')
plt.ylabel('Y2 (Cooling Load)')
plt.show()
plt.close()
In this problem, we train two different machine learning models on this dataset and compare their results. First, we fit a basic linear regression. Next, we fit an SVR (Support Vector Regression) model to see how it differs from the linear regression model. Finally, we plot the predictions against the true labels to see whether each model's assumptions are robust.
The method of Support Vector Classification can be extended to solve regression problems; this extension is called Support Vector Regression.
A (linear) support vector machine (SVM) solves the canonical machine learning optimization problem using a hinge loss and a linear hypothesis class, plus an additional regularization term.
Unlike least squares, which has a closed-form solution, these optimization problems are typically solved iteratively, for example by (sub)gradient descent on the loss function.
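Concretely, for a linear kernel, SVR minimizes a regularized ε-insensitive loss: residuals smaller than ε are ignored, and larger deviations are penalized linearly,

$$\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\max\bigl(0,\ \lvert y_i - (w^\top x_i + b)\rvert - \varepsilon\bigr),$$

where C trades off regularization against training error. Note that scikit-learn's SVR defaults to an RBF kernel with ε = 0.1, so the models fitted below are in fact non-linear in the inputs.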
In [10]:
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn import linear_model
We perform a simple holdout validation by splitting the dataset into a training set (70%) and a validation set (30%). We drop the target columns from the features and keep them as the label vector. The validation set is sorted by label value so that the predictions can be compared against the true labels in a plot. We then fit both models and predict on the validation set.
In [11]:
# Holdout split: 70% training, 30% validation.
train, test = train_test_split(df, test_size=0.3)
X_tr = train.drop(['Y1', 'Y2'], axis=1)
y_tr = train['Y1']
# Sort the validation set by the target so predictions are easy to compare in a plot.
test = test.sort_values('Y1')
X_te = test.drop(['Y1', 'Y2'], axis=1)
y_te = test['Y1']
reg_svr = svm.SVR()
reg_svr.fit(X_tr, y_tr)
reg_lin = linear_model.LinearRegression()
reg_lin.fit(X_tr, y_tr)
y_pre_svr = reg_svr.predict(X_te)
y_pre_lin = reg_lin.predict(X_te)
print("Coefficient R^2 of the SVR prediction on the validation set: " + str(reg_svr.score(X_te, y_te)))
print("Coefficient R^2 of the Linear Regression prediction on the validation set: " + str(reg_lin.score(X_te, y_te)))
In [12]:
plt.plot(y_pre_svr, label="Prediction for SVR")
plt.plot(y_te.values, label="Heating Load")
plt.plot(y_pre_lin, label="Prediction for linear")
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
           ncol=2, mode="expand", borderaxespad=0.)
plt.show()
In [13]:
# Repeat the same procedure for Y2 (cooling load).
train, test = train_test_split(df, test_size=0.3)
X_tr = train.drop(['Y1', 'Y2'], axis=1)
y_tr = train['Y2']
test = test.sort_values('Y2')
X_te = test.drop(['Y1', 'Y2'], axis=1)
y_te = test['Y2']
reg_svr = svm.SVR()
reg_svr.fit(X_tr, y_tr)
reg_lin = linear_model.LinearRegression()
reg_lin.fit(X_tr, y_tr)
y_pre_svr = reg_svr.predict(X_te)
y_pre_lin = reg_lin.predict(X_te)
print("Coefficient R^2 of the SVR prediction on the validation set: " + str(reg_svr.score(X_te, y_te)))
print("Coefficient R^2 of the Linear Regression prediction on the validation set: " + str(reg_lin.score(X_te, y_te)))
In [14]:
plt.plot(y_pre_svr, label="Prediction for SVR")
plt.plot(y_te.values, label="Cooling Load")
plt.plot(y_pre_lin, label="Prediction for linear")
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
           ncol=2, mode="expand", borderaxespad=0.)
plt.show()
In [18]:
# Coefficients of the linear model (the most recent fit, i.e. the cooling-load
# model), paired with their feature names.
print(pd.Series(reg_lin.coef_, index=X_tr.columns))
The results for both predictions are quite good, but both models share the same weakness: they cannot predict the high energy loads very well. One reason may be that the underlying problem is not linear, so a non-linear model could fit it better. Another is that the dataset itself is small; more training data (on the order of 10,000 samples) would likely yield a better result.
The coefficients of the linear model show that X1 (Relative Compactness) and X7 (Glazing Area) are the dominant features: the higher the relative compactness, the lower the energy load, while the glazing area works the other way around.
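A natural next step, given the weakness at high loads, would be to tune the SVR hyperparameters instead of using the defaults. Below is a minimal sketch with GridSearchCV; the grid values are illustrative assumptions, not tuned choices.
In [ ]:
# Hypothetical hyperparameter search for the cooling-load SVR.
# The grid values are illustrative assumptions, not tuned choices.
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [1, 10, 100], 'gamma': ['scale', 0.01, 0.1], 'epsilon': [0.1, 0.5]}
search = GridSearchCV(svm.SVR(), param_grid, cv=5, scoring='r2')
search.fit(X_tr, y_tr)
print(search.best_params_)
print("Tuned SVR R^2 on the validation set: " + str(search.best_estimator_.score(X_te, y_te)))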