12-752: Data-Driven Building Energy Management

Fall 2016, Carnegie Mellon University

Assignment #3

We will begin by unpickling the dataset containing the loadCurves that we had used in Assignment #2.

But first, as always, we will load most modules we will be using:


In [114]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pickle

%matplotlib inline

Now let's load the load curve dataset:


In [115]:
pkf = open('../../lectures/data/loadCurves.pkl','rb')
data,loadCurves = pickle.load(pkf)
pkf.close()

Because the dataset we have is not sufficient to run an interesting regression task, we will also load a temperature dataset which shows 5-minute interval temperatures measured on the rooftop of Margaret Morrison at CMU.

You will find this dataset on BlackBoard. Download it and then load it into a new DataFrame called temperature.

Task #1 [10%]

Create a new DataFrame with the temperature data, and set the index to be the Timestamp.


In [116]:
# Your code goes here

f = open('../../lectures/data/temperature.csv')
temperature = pd.read_csv(f,sep=',', header='infer', parse_dates=[0],index_col=0)
f.close()
temperature.index
data = data.set_index(['Time'])

In [117]:
temperature


Out[117]:
IW.Weather.Kpit56.Csv.Temp
Timestamp
2013-11-10 05:00:00 54.430
2013-11-10 05:05:00 54.691
2013-11-10 05:10:00 54.604
2013-11-10 05:15:00 54.648
2013-11-10 05:20:00 54.648
2013-11-10 05:25:00 54.734
2013-11-10 05:30:00 54.734
2013-11-10 05:35:00 54.734
2013-11-10 05:40:00 54.604
2013-11-10 05:45:00 54.516
2013-11-10 05:50:00 54.387
2013-11-10 05:55:00 54.561
2013-11-10 06:00:00 54.212
2013-11-10 06:05:00 54.255
2013-11-10 06:10:00 54.082
2013-11-10 06:15:00 53.865
2013-11-10 06:20:00 53.776
2013-11-10 06:25:00 53.602
2013-11-10 06:30:00 53.341
2013-11-10 06:35:00 53.515
2013-11-10 06:40:00 53.559
2013-11-10 06:45:00 53.515
2013-11-10 06:50:00 53.602
2013-11-10 06:55:00 53.515
2013-11-10 07:00:00 53.472
2013-11-10 07:05:00 53.298
2013-11-10 07:10:00 53.166
2013-11-10 07:15:00 53.121
2013-11-10 07:20:00 53.078
2013-11-10 07:25:00 52.817
... ...
2014-11-11 02:35:00 57.070
2014-11-11 02:40:00 56.820
2014-11-11 02:45:00 56.890
2014-11-11 02:50:00 56.730
2014-11-11 02:55:00 56.610
2014-11-11 03:00:00 56.610
2014-11-11 03:05:00 56.610
2014-11-11 03:10:00 56.730
2014-11-11 03:15:00 56.730
2014-11-11 03:20:00 56.770
2014-11-11 03:25:00 56.430
2014-11-11 03:30:00 56.300
2014-11-11 03:35:00 56.390
2014-11-11 03:40:00 56.520
2014-11-11 03:45:00 56.610
2014-11-11 03:50:00 56.390
2014-11-11 03:55:00 56.210
2014-11-11 04:00:00 56.120
2014-11-11 04:05:00 55.960
2014-11-11 04:10:00 55.870
2014-11-11 04:15:00 55.780
2014-11-11 04:20:00 55.690
2014-11-11 04:25:00 55.625
2014-11-11 04:30:00 55.560
2014-11-11 04:35:00 55.390
2014-11-11 04:40:00 55.220
2014-11-11 04:45:00 55.085
2014-11-11 04:50:00 54.950
2014-11-11 04:55:00 54.860
2014-11-11 05:00:00 54.640

105409 rows × 1 columns

Since the power consumption time series has a different sampling frequency, we need to resample the datasets so that they have the same frequency.

Task #2 [10%]:

Resample the original data and temperature DataFrames to one sample every five minutes. You can use any resampling technique, though an averaging method is probably the most appropriate here.


In [118]:
# Your code goes here...
power = data.resample('5T').mean()
temperature = temperature.resample('5T').mean()

It may be easier to join both DataFrames into one. Let's do that.

Task #3 [5%]:

Create a new DataFrame called alldata which is an inner join of the temperature and power datasets (i.e. the rows of aldata will be those with a timestamp that is present in both the electric power time series and the temperature time series).

For good measure, rename the columns of the resulting DataFrame to make them more descriptive and concise.


In [119]:
# Your code goes here.

alldata = power.join(temperature,how='inner')
alldata.rename(columns={'IW.Weather.Kpit56.Csv.Temp': 'Temperature', 'Value': 'Power'},inplace=True)

Depending on how you answered the previous question, there is a chance that some of the entries of your DataFrame contain null values. There is a convenient method for DataFrames called isnull() which you can use to check this.

Task #4 [5%]:

Find out how many null values there are in the DataFrame.

Using your method of choice (e.g., interpolation) fill those values so that they don't exist anymore and make a copy of the resulting DataFrame.

Comment on why these null values showed up.


In [120]:
# Your code goes here...

originaldata = alldata
alldata.interpolate(inplace=True)

# This also means that all of the derived columns need to be recomputed as they make no sense anymore (i.e., the Hour, DayOfYear and Weekday cannot be reasonably interpolated)

alldata['DayOfYear'] = alldata.index.dayofyear
alldata['Weekday'] = alldata.index.dayofweek
alldata['Hour'] = alldata.index.hour

# To check the results
## plt.plot(originaldata[alldata.isnull().apply(np.any,axis=1)],'rx')
# alldata.isnull().any().any()

Now let's turn our attention to the task of fitting a model for regression. To start, let's visualize the time series and the relationship between them.

Task #5 [10%]:

Create a line plot with two separate vertical axes showing the Power and Temperature time series (e.g., one vertical axis for temperature and the other one for power, against a single time axis).

Then create a scatter plot for Power (y-axis) versus Temperature (x-axis).


In [121]:
# Your code goes here...

fig, ax1 = plt.subplots()
ax1.plot(alldata['Power'], 'b-')
ax1.set_ylabel('Power', color='b')
for tl in ax1.get_yticklabels():
    tl.set_color('b')

ax2 = ax1.twinx()
ax2.plot(alldata['Temperature'], 'r-')
ax2.set_ylabel('Temperature', color='r')
for t1 in ax2.get_yticklabels():
    t1.set_color('r')
    
plt.show()

fig2 = plt.figure()
plt.plot(alldata['Temperature'],alldata['Power'],'.')


Out[121]:
[<matplotlib.lines.Line2D at 0x1285dfbe0>]

Task #6 [40%]:

Separate the data into the features (X) and the response values (Y). Assume, to start, that the features are 'Hour','Weekday' and 'Temperature'; and that the response values are 'Power'.

(a: 10%) Fit a regression tree to the whole dataset. In other words, your fitted tree should be able to predict the power consumption for campus when given the hour of the day, day of the week and temperature for a specific time. Calculate the score of your fit (i.e., the R^2 value for your prediction).

(b: 20%) Now, rearrange X and Y such that your response values are no longer scalars, but instead a vector of 1 hour periods (i.e., 12 continguous values at the 5-minute resolution). Using this arrangement, fit another regression tree (multi-output)

(c: 5%) Using 10-fold cross-validation, figure out how to best optimize the performance (score) of your model on the dataset, by tweaking the way you are configuring the features X.

(d: 5%) Download the 2015 dataset for the campus power consumption (and temperature), and test the results of your trained model (the best one you found in the previous step on that dataset.


In [169]:
X = alldata[['Hour','Weekday','Temperature']]
Y = alldata['Power']

In [170]:
from sklearn import tree
reg = tree.DecisionTreeRegressor(max_depth=15)
reg.fit(X,Y)


Out[170]:
DecisionTreeRegressor(criterion='mse', max_depth=15, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

Let's see how much power consumption the fitted tree predicts for a Wednesday at 2am if it is 40F:


In [171]:
reg.predict(np.array([2,3,40]).reshape(1,-1))


Out[171]:
array([ 8065.45206226])

Now let's see, overall, how well the tree did in the regression task. This is the coefficient of determination (R2):


In [172]:
reg.score(X,Y)


Out[172]:
0.83273088957901376

In [173]:
reg.predict(X)


Out[173]:
array([ 7994.220675  ,  7994.220675  ,  7994.220675  , ...,  3451.249075  ,
        3451.249075  ,  7030.83438033])

Let's see how that looks:


In [174]:
plt.plot(reg.predict(X),'--r')
plt.plot(alldata['Power'].values,'-b')


Out[174]:
[<matplotlib.lines.Line2D at 0x1377dd550>]

In [175]:
# Task 6.b
Y_multi = []
for i in range(len(alldata)):
    Y_multi.append(np.array(alldata[i:i+12]['Power']))

In [176]:
reg_multi = tree.DecisionTreeRegressor(max_depth=15)
# X = X[0:-12:12]
# X = X.resample('1H').mean()[0:-1]

reg_multi.fit(X[0:-12:12],Y[0:-12:12])


Out[176]:
DecisionTreeRegressor(criterion='mse', max_depth=15, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In [177]:
reg_multi.predict(np.array([2,3,40]).reshape(1,-1))


Out[177]:
array([ 8129.22879583])

In [178]:
reg_multi.score(X[0:-12:12],Y_multi[0:-12:12])


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-178-2812e436f4e7> in <module>()
----> 1 reg_multi.score(X[0:-12:12],Y_multi[0:-12:12])

/Users/mberges/anaconda3/lib/python3.5/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    346         from .metrics import r2_score
    347         return r2_score(y, self.predict(X), sample_weight=sample_weight,
--> 348                         multioutput='variance_weighted')
    349 
    350 

/Users/mberges/anaconda3/lib/python3.5/site-packages/sklearn/metrics/regression.py in r2_score(y_true, y_pred, sample_weight, multioutput)
    442     """
    443     y_type, y_true, y_pred, multioutput = _check_reg_targets(
--> 444         y_true, y_pred, multioutput)
    445 
    446     if sample_weight is not None:

/Users/mberges/anaconda3/lib/python3.5/site-packages/sklearn/metrics/regression.py in _check_reg_targets(y_true, y_pred, multioutput)
     84     if y_true.shape[1] != y_pred.shape[1]:
     85         raise ValueError("y_true and y_pred have different number of output "
---> 86                          "({0}!={1})".format(y_true.shape[1], y_pred.shape[1]))
     87 
     88     n_outputs = y_true.shape[1]

ValueError: y_true and y_pred have different number of output (12!=1)

In [ ]:
plt.plot(reg_multi.predict(X[0:-12:12]).reshape(-1,1),'--r')
plt.plot(alldata['Power'].values,'-b')

In [72]:



Out[72]:
Hour Weekday Temperature
2013-11-10 05:00:00 5 6 54.607583
2013-11-10 06:00:00 6 6 53.736583
2013-11-10 07:00:00 7 6 52.801167
2013-11-10 08:00:00 8 6 51.805000
2013-11-10 09:00:00 9 6 51.072250
2013-11-10 10:00:00 10 6 49.692250
2013-11-10 11:00:00 11 6 48.170917
2013-11-10 12:00:00 12 6 47.515917
2013-11-10 13:00:00 13 6 46.866750
2013-11-10 14:00:00 14 6 46.413500
2013-11-10 15:00:00 15 6 46.173500
2013-11-10 16:00:00 16 6 46.286000
2013-11-10 17:00:00 17 6 45.352917
2013-11-10 18:00:00 18 6 45.191250
2013-11-10 19:00:00 19 6 44.960750
2013-11-10 20:00:00 20 6 44.673167
2013-11-10 21:00:00 21 6 44.366167
2013-11-10 22:00:00 22 6 44.502583
2013-11-10 23:00:00 23 6 44.373833
2013-11-11 00:00:00 0 0 43.853667
2013-11-11 01:00:00 1 0 43.308500
2013-11-11 02:00:00 2 0 42.612417
2013-11-11 03:00:00 3 0 41.878250
2013-11-11 04:00:00 4 0 40.764750
2013-11-11 05:00:00 5 0 40.040583
2013-11-11 06:00:00 6 0 39.407583
2013-11-11 07:00:00 7 0 37.872583
2013-11-11 08:00:00 8 0 37.396583
2013-11-11 09:00:00 9 0 36.425833
2013-11-11 10:00:00 10 0 35.503417
... ... ... ...
2014-11-09 17:00:00 17 6 47.126250
2014-11-09 18:00:00 18 6 48.884167
2014-11-09 19:00:00 19 6 50.559583
2014-11-09 20:00:00 20 6 51.020833
2014-11-09 21:00:00 21 6 50.233333
2014-11-09 22:00:00 22 6 48.404167
2014-11-09 23:00:00 23 6 47.989167
2014-11-10 00:00:00 0 0 47.797500
2014-11-10 01:00:00 1 0 48.281667
2014-11-10 02:00:00 2 0 47.917500
2014-11-10 03:00:00 3 0 47.670000
2014-11-10 04:00:00 4 0 47.642500
2014-11-10 05:00:00 5 0 47.500000
2014-11-10 06:00:00 6 0 47.235000
2014-11-10 07:00:00 7 0 46.403333
2014-11-10 08:00:00 8 0 45.560000
2014-11-10 09:00:00 9 0 45.375000
2014-11-10 10:00:00 10 0 44.737083
2014-11-10 11:00:00 11 0 43.673334
2014-11-10 12:00:00 12 0 44.982500
2014-11-10 13:00:00 13 0 48.499167
2014-11-10 14:00:00 14 0 51.720000
2014-11-10 15:00:00 15 0 55.223750
2014-11-10 16:00:00 16 0 58.567500
2014-11-10 17:00:00 17 0 60.640000
2014-11-10 18:00:00 18 0 63.654167
2014-11-10 19:00:00 19 0 65.135833
2014-11-10 20:00:00 20 0 66.253333
2014-11-10 21:00:00 21 0 65.109999
2014-11-10 22:00:00 22 0 59.866818

8778 rows × 3 columns


In [ ]: