Import libraries:
In [1]:
import sklearn
sklearn.__version__
Out[1]:
In [2]:
import pandas as pd
import numpy as np
import time
import datetime as dt
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import train_test_split
%matplotlib inline
Importing the data and the column names from the codebook:
In [3]:
column_names = pd.read_excel('langevincodebook.xlsx', sheet_name='Sheet2')
data_names = column_names['Description'].values
data = pd.read_csv('LANGEVIN_DATA.txt',sep=' ',names = data_names,index_col =False)
The time in the data is recorded as a MATLAB serial date number (datenum), so we convert it to a meaningful timestamp:
In [4]:
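def toTimestamp(datenum):
    # MATLAB datenums count days from year 0; Python ordinals start at year 1, hence the 366-day offset.
    python_datetime = dt.datetime.fromordinal(int(datenum) - 366) + dt.timedelta(days=datenum%1)
    return python_datetime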
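As a sanity check on the conversion above, here is a minimal sketch that compares the scalar arithmetic with a vectorized variant. The helper names and the sample datenums are hypothetical and not part of the dataset; the constant 719529 is the MATLAB datenum of 1970-01-01, which lets pandas treat the shifted values as days since the Unix epoch.

```python
import datetime as dt

import numpy as np
import pandas as pd

# MATLAB datenum of 1970-01-01, used to shift datenums onto the Unix epoch.
MATLAB_DATENUM_UNIX_EPOCH = 719529


def datenum_to_datetime(datenum):
    """Scalar conversion, same arithmetic as toTimestamp in the notebook."""
    return dt.datetime.fromordinal(int(datenum) - 366) + dt.timedelta(days=datenum % 1)


def datenums_to_datetimes(datenums):
    """Hypothetical vectorized variant: converts a whole column in one call."""
    days_since_epoch = np.asarray(datenums, dtype=float) - MATLAB_DATENUM_UNIX_EPOCH
    return pd.to_datetime(days_since_epoch, unit="D")


# Illustrative values only; 730486 is 2000-01-01 in MATLAB's numbering.
sample = [730486.0, 730486.5, 736330.25]
scalar_result = [datenum_to_datetime(d) for d in sample]
vector_result = datenums_to_datetimes(sample)

print(scalar_result[1])
print(vector_result)
```

Both routes should agree; the vectorized one avoids a Python-level loop over every row.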
In [5]:
data['Timestamp'] = data['Time'].apply(toTimestamp)
data['Occupant Number']=data['Occupant Number'].astype(int)
data['Hour'] = data['Timestamp'].dt.hour
data['Month'] = data['Timestamp'].dt.month
Now we need to remove any rows where no thermal comfort measurement was taken and select our uncontrollable variables:
In [6]:
data = data[pd.notnull(data['General Thermal Comfort (right now)'])]
uncont_data = data[['Occupant Number','Timestamp','Gender','Age','General Thermal Comfort (right now)','INDOOR Ambient Temp.','INDOOR Relative Humidity','OUTDOOR Ambient Temp.','OUTDOOR Relative Humidity','INDOOR Air Velocity','OUTDOOR Air Velocity','Hour','Month']]
Check whether any null values remain and, if so, where they are:
In [7]:
print(uncont_data.isnull().sum())
o = uncont_data.isnull().values
plt.imshow(o,aspect = 'auto',interpolation = 'nearest')
Out[7]:
Now to remove those rows with nulls:
In [8]:
uncont_data = uncont_data.dropna()
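To make the effect of the null check and `dropna()` concrete, here is a small sketch on a toy frame. The column names and values are made up for illustration and do not come from the Langevin data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; two of the four rows contain a null.
toy = pd.DataFrame({
    "comfort": [3.0, 4.0, np.nan, 2.0],
    "temp": [22.5, np.nan, 23.0, 21.0],
    "humidity": [40.0, 42.0, 41.0, 39.0],
})

null_counts = toy.isnull().sum()                 # nulls per column
rows_with_nulls = toy.isnull().any(axis=1).sum()  # rows with at least one null
cleaned = toy.dropna()                           # drops every such row

print(null_counts)
print(f"Dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```

Comparing the row count before and after is a quick way to see how much data the cleaning step costs.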
In [9]:
uncont_data.head(10)
Out[9]:
With the data cleaned we move on to the analysis.
We thought it would be interesting to look at how much the indoor and outdoor measurements differ, to get an idea of how climate-controlled the building is.
In [23]:
temp_plot = uncont_data.plot('Timestamp',['INDOOR Ambient Temp.','OUTDOOR Ambient Temp.'],figsize = (8,5))
print('The average absolute difference in temperature is: ' + str(np.average(abs(uncont_data['INDOOR Ambient Temp.']-uncont_data['OUTDOOR Ambient Temp.'])))+' Celsius')
In [24]:
hum_plot = uncont_data.plot('Timestamp',['INDOOR Relative Humidity','OUTDOOR Relative Humidity'],figsize = (10,8))
print('The average absolute difference in relative humidity is: ' + str(np.average(abs(uncont_data['INDOOR Relative Humidity']-uncont_data['OUTDOOR Relative Humidity']))))
In [25]:
velo_plot = uncont_data.plot('Timestamp',['INDOOR Air Velocity','OUTDOOR Air Velocity'],figsize = (8,5))
print('The average absolute difference in air velocity is: ' + str(np.average(abs(uncont_data['INDOOR Air Velocity']-uncont_data['OUTDOOR Air Velocity']))))
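The three cells above compute the same statistic, the mean absolute difference between an indoor and an outdoor column. A minimal sketch of a shared helper follows; `mean_abs_diff` and the toy values are hypothetical, not part of the notebook.

```python
import pandas as pd


def mean_abs_diff(frame, col_a, col_b):
    """Mean of |col_a - col_b| over all rows of the frame."""
    return (frame[col_a] - frame[col_b]).abs().mean()


# Made-up temperatures in Celsius: differences are 12, -7 and 0.
toy = pd.DataFrame({"indoor": [22.0, 23.0, 21.0], "outdoor": [10.0, 30.0, 21.0]})
result = mean_abs_diff(toy, "indoor", "outdoor")
print(result)
```

With the real data this could be called as `mean_abs_diff(uncont_data, 'INDOOR Ambient Temp.', 'OUTDOOR Ambient Temp.')`.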
Next we looked at how general thermal comfort relates to some of the uncontrollable variables.
In [46]:
y = uncont_data['General Thermal Comfort (right now)']
x1 = uncont_data['INDOOR Ambient Temp.']
x2 = uncont_data['INDOOR Relative Humidity']
x3 = uncont_data['INDOOR Air Velocity']
fig = plt.figure(figsize=(15,10))
ax1 = plt.subplot(311)
plt.scatter(x1,y)
ax1.set_xlabel("Indoor Ambient Temperature")
ax1.set_ylabel('General Thermal Comfort')
ax1.set_ylim(1,6)
ax2 = plt.subplot(312,sharey = ax1)
plt.scatter(x2,y)
ax2.set_xlabel('Indoor Relative Humidity')
ax3 = plt.subplot(313,sharey = ax1)
plt.scatter(x3,y)
ax3.set_xlabel("Indoor Air Velocity")
plt.tight_layout()
plt.savefig('comfort_subplots.png')
# Correlation Values for each plot
z = uncont_data['General Thermal Comfort (right now)']
print('Correlation coefficient for Thermal Comfort and Temp: ',z.corr(uncont_data['INDOOR Ambient Temp.']))
print('Correlation coefficient for Thermal Comfort and Relative Humidity: ',z.corr(uncont_data['INDOOR Relative Humidity']))
print('Correlation coefficient for Thermal Comfort and Air Velocity: ',z.corr(uncont_data['INDOOR Air Velocity']))
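`Series.corr` defaults to the Pearson coefficient, which measures linear association only. Assuming the comfort votes sit on a discrete ordinal scale, a rank-based (Spearman) coefficient may be a useful cross-check. The sketch below uses made-up values and computes Spearman as Pearson on ranks.

```python
import pandas as pd

# Made-up data: y increases with x, but not linearly.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

pearson = x.corr(y)                 # linear association
spearman = x.rank().corr(y.rank())  # Pearson on ranks, i.e. Spearman

print("Pearson: ", pearson)
print("Spearman:", spearman)
```

A perfectly monotonic relationship gives a Spearman coefficient of 1 even when the Pearson value is lower.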
Now we test how effectively the uncontrollable variables can predict the general comfort level. We'll run two regression trees: one with our environmental and time variables, and one that splits only on occupant number. If occupant number plays a significant role in predicting the general thermal comfort level, then personal preference tells us more about general thermal comfort than the uncontrollable variables do.
First let's set up our features and response.
In [11]:
X1 = uncont_data[['INDOOR Ambient Temp.','INDOOR Relative Humidity','INDOOR Air Velocity']]
Y = uncont_data['General Thermal Comfort (right now)']
The first regression tree will test our environmental and time variables.
In [27]:
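# Re-split and refit on every iteration; scoring one fitted tree 5000 times returns the same value each time.
scores1 = []
for i in range(5000):
    X1_train, X1_test, Y_train, Y_test = train_test_split(X1, Y, test_size=0.3)
    reg = tree.DecisionTreeRegressor().fit(X1_train, Y_train)
    scores1.append(reg.score(X1_test, Y_test))
r2_score_avg1 = np.average(scores1)
print('R^2 value: ', r2_score_avg1)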
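A single fitted tree scored on a fixed test set always returns the same R^2, so an average is only informative when each score comes from a fresh random split. One way to do that, sketched here on synthetic data rather than the notebook's, is scikit-learn's `ShuffleSplit` with `cross_val_score`. The data, the number of splits, and the seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: the target depends strongly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Each of the 20 iterations draws a new random 70/30 split, refits, and scores.
splitter = ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)
scores = cross_val_score(DecisionTreeRegressor(random_state=0), X, y, cv=splitter)

print("mean R^2:", scores.mean())
print("std of R^2 across splits:", scores.std())
```

Reporting the spread alongside the mean shows how sensitive the score is to the particular split.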
The second regression tree just checks the predictive power of occupant number alone.
In [16]:
# Select with a list of columns to get the 2-D feature matrix scikit-learn expects.
X2 = uncont_data[['Occupant Number']]
In [28]:
# Re-split and refit on every iteration so the average covers many random splits.
scores2 = []
for i in range(5000):
    X2_train, X2_test, Y_train, Y_test = train_test_split(X2, Y, test_size=0.3)
    reg = tree.DecisionTreeRegressor().fit(X2_train, Y_train)
    scores2.append(reg.score(X2_test, Y_test))
r2_score_avg2 = np.average(scores2)
print('R^2 value: ', r2_score_avg2)
Our first regression tree had essentially no predictive power, suggesting that environmental factors and time of day/year had little to no effect on general thermal comfort. This could be due to the relative stability of indoor temperature, air velocity, and humidity year-round. The fact that occupant number alone has some predictive power suggests that general thermal comfort is more likely driven by personal preference.
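One way to read the occupant-only model: an unrestricted regression tree on a single ID column ends up predicting each occupant's mean comfort vote from the training data, which is why it can act as a proxy for personal preference. The sketch below illustrates this on synthetic data. The occupants, comfort values, and noise level are invented, and the equivalence assumes default tree settings with no depth limit.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Five hypothetical occupants, each with their own baseline comfort level.
rng = np.random.default_rng(1)
occupant = rng.integers(1, 6, size=300)
comfort = 3.0 + 0.5 * occupant + rng.normal(scale=0.3, size=300)
toy = pd.DataFrame({"occupant": occupant, "comfort": comfort})

# Fully grown tree on the occupant ID alone.
tree_model = DecisionTreeRegressor().fit(toy[["occupant"]], toy["comfort"])
tree_pred = tree_model.predict(toy[["occupant"]])

# Per-occupant mean, broadcast back to every row.
group_mean_pred = toy.groupby("occupant")["comfort"].transform("mean").to_numpy()

matches = np.allclose(tree_pred, group_mean_pred)
print("Tree predictions equal per-occupant means:", matches)
```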
In [31]:
fig1 = temp_plot.get_figure()
fig1.savefig('temp_plot.png')
fig2 = hum_plot.get_figure()
fig2.savefig('humidity_plot.png')
fig3 = velo_plot.get_figure()
fig3.savefig('velocity_plot.png')
In [29]:
uncont_data.describe()
Out[29]: