This notebook outlines the process for obtaining EV presence predictions from a pretrained pylearn2 neural network. We have formulated this as a binary classification problem: given examples of total usage signals (such as the total usage for a home) with and without an EV, can we correctly assign labels to previously unseen examples? For example, can we correctly predict whether a home has an EV?
The neural network used here was created using a library called pylearn2, a deep neural network library made by researchers for researchers. Unfortunately, since it is still under rapid development, the library is a little tricky to use (and the version we use cannot simply be pip installed). It allows for relatively quick training of neural nets and high extensibility, but at the cost of ease of use. This notebook tries to alleviate that problem by introducing the library, its dependencies, and how to use it with the rich Pecan Street WikiEnergy dataset.
Note: This notebook is an interactive Python (IPython) notebook. Cells can be executed by clicking the "run" triangle in the toolbar or by typing <Shift><Enter>.
The module versions on which this library was trained are as follows:
Theano==0.6.0
numpy==1.8.1
-e git://github.com/lisa-lab/pylearn2.git@2c6196fce42ccb39b43df0026780e4370dca25a4#egg=pylearn2-master
scipy==0.14.0
The Theano, NumPy, and SciPy versions are those included in the Anaconda distribution. The pylearn2 version was the bleeding-edge version at the time of training.
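To verify that your environment matches, you can check the installed versions directly (each of these packages exposes a standard __version__ attribute):

import theano, numpy, scipy
print theano.__version__  # expect 0.6.0
print numpy.__version__   # expect 1.8.1
print scipy.__version__   # expect 0.14.0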
Where did this neural net come from?
The data were drawn from the following shared dataset tables:
validated_01_2014
validated_02_2014
validated_03_2014
validated_04_2014
validated_05_2014
The data were filtered to use only dataids for which data were available across all five months.
The 'use' columns from the following dataids were used as examples of the "EV present" dataset:
[ 624, 661, 1714, 1782, 1953, 2470, 2638, 2769, 2814, 3192,
3367, 3482, 3723, 3795, 4135, 4505, 4526, 4641, 4767, 4957,
4998, 5109, 5357, 6139, 6836, 6910, 6941, 7850, 7863, 7875,
7940, 8046, 8142, 8197, 8645, 8669, 9484, 9609, 9729, 9830,
9932, 9934]
The 'use' columns from the following dataids were used as examples of the "EV not present" dataset (note that there are ~3x as many of these):
[ 86, 93, 94, 410, 484, 585, 739, 744, 821, 871,
936, 1167, 1283, 1334, 1632, 1718, 1790, 1800, 1994, 2094,
2129, 2156, 2158, 2171, 2233, 2242, 2337, 2449, 2575, 2606,
2829, 2864, 2945, 2953, 2974, 3092, 3221, 3263, 3394, 3456,
3504, 3544, 3649, 3652, 3736, 3778, 3893, 3918, 4031, 4154,
4298, 4313, 4447, 4732, 4874, 4922, 4956, 5026, 5209, 5218,
5262, 5275, 5395, 5545, 5568, 5677, 5785, 5814, 5874, 5938,
5949, 5972, 6412, 6636, 6673, 6730, 7062, 7319, 7390, 7531,
7536, 7617, 7731, 7769, 7788, 7800, 7951, 8079, 8084, 8292,
8317, 8342, 8419, 8467, 8741, 8829, 8852, 8956, 9019, 9036,
9121, 9160, 9343, 9356, 9555, 9578, 9643, 9654, 9701, 9737,
9771, 9875, 9915, 9922, 9926, 9937, 9938, 9939, 9982, 9983]
Each of these datasets was further broken down into three subsets: a random* quarter of the ids was used for testing, another random quarter for validation, and the remaining half for training. These subsets are shown below.
EV present:
Training:
[1782, 1714, 6139, 9830, 4641, 7875, 4957, 8669, 8046, 5357,
5109, 8197, 7850, 8645, 7940, 8142, 9729, 1953, 4135, 3367,
9934]
Validation:
[6836, 6941, 9484, 4998, 4767, 6910, 2638, 7863, 3795, 2769]
Testing:
[9932, 661, 4526, 624, 4505, 2470, 3482, 3192, 2814, 3723,
9609]
EV not present:
Training:
[9036, 8467, 4874, 9356, 9019, 6730, 8292, 4732, 3736, 5218,
585, 1790, 8342, 1632, 5209, 2953, 6636, 2606, 5785, 3092,
9939, 7788, 2864, 5275, 9737, 2094, 4313, 4031, 8084, 7531,
93, 8852, 3649, 4298, 2575, 3504, 9578, 9982, 1800, 9875,
7390, 5938, 6673, 1994, 484, 3778, 4956, 3456, 3221, 9926,
2129, 9555, 5262, 7769, 7617, 9983, 8419, 1167, 5545, 7800]
Validation:
[1283, 94, 8829, 9771, 9160, 739, 86, 9654, 5677, 4922,
7319, 9121, 3893, 5395, 9922, 8317, 8956, 7951, 936, 2974,
2945, 821, 3394, 9701, 3263, 2449, 2171, 5814, 871, 2158]
Testing:
[8741, 9343, 744, 8079, 2242, 9938, 5568, 1718, 7731, 3544,
7536, 4447, 2337, 7062, 3652, 2233, 5874, 9915, 5026, 4154,
2156, 5949, 410, 5972, 2829, 6412, 9643, 3918, 9937, 1334]
*pseudo-random: np.random.seed(1); np.random.shuffle(indices)
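As a minimal illustration of how that seeded shuffle produces the half/quarter/quarter split (using 8 dummy indices here rather than the real dataids):

import numpy as np
np.random.seed(1)
indices = np.arange(8)
np.random.shuffle(indices)
# first half for training, next quarter for validation, the rest for testing
print indices[:4], indices[4:6], indices[6:]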
Each dataset is stored as a pair of arrays: for n examples, an array X of training examples of shape (n, 10) and an array y of one-hot labels of shape (n, 2).
Pylearn2 organizes neural networks into yaml files which fully specify the structure of the network.
The file used to train this network is printed below.
In [20]:
# basic configurable parameters
params = {'data_dir': 'data',
          'dataset_prefix': 'dataset',
          'saved_model_prefix': 'saved_model'}

# open and read the network specification, filling in the parameters
with open('./ev_conv_nn_2_layer_32_stride_2.yaml', 'r') as f:
    print f.read() % params
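For reference, pylearn2 training YAML files generally follow the top-level structure sketched below. This is a schematic only, not the contents of the file above; the layers and hyperparameters are placeholders, though the class names are real pylearn2 components.

!obj:pylearn2.train.Train {
    # training dataset, loaded from a pickle path built from the parameters
    dataset: !pkl: "%(data_dir)s/%(dataset_prefix)s_train.pkl",
    # the model itself, e.g. an MLP containing convolutional layers
    model: !obj:pylearn2.models.mlp.MLP { ... },
    # the training algorithm, which monitors the validation set
    algorithm: !obj:pylearn2.training_algorithms.sgd.SGD { ... },
    # extensions such as saving the best model seen so far ("early stopping")
    extensions: [
        !obj:pylearn2.train_extensions.best_params.MonitorBasedSaveBest { ... },
    ],
}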
Neural networks are notoriously difficult to interpret - they're kind of a "black box". So, why did we use this one? It was a trade-off between performance and interpretability - in this case, we value the ability to interpret the features the model uses less than the ability to obtain an accurate probability of EV presence. Convolutional neural networks have some properties which make them particularly applicable to the task of EV detection.
Neural networks with "convolutional" layers have an inherent ability to learn translation-invariant features. We know that the input is a time series, and that we are looking for a particular signal to appear. Importantly, we don't care when the EV signal appears, just that it appears: a translation-invariant feature recognizes the EV signal anywhere it occurs. This intuition that we have about the nature of the problem is directly coded into the network itself. Additionally, neural networks in general have the ability to learn non-linearities, and deep neural networks, like this one, can learn both simple and complex functions.
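As a toy illustration of this property (separate from the notebook's pipeline, using a hand-built filter rather than a learned one): a filter matched to a pattern produces the same peak response no matter where the pattern occurs in the input.

import numpy as np
# a synthetic "charging bump", and the same bump shifted in time
signal = np.zeros(20)
signal[5:8] = [1.0, 2.0, 1.0]
shifted = np.roll(signal, 7)
# a filter matched to the bump's shape
kernel = np.array([1.0, 2.0, 1.0])
# the peak response of the sliding dot product is identical for both
print np.correlate(signal, kernel).max()  # 6.0
print np.correlate(shifted, kernel).max() # 6.0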
Because neural networks are so flexible, we must be very careful to avoid overfitting. In other words, we don't want to tune the network to recognize attributes of the signals which are due simply to random sampling noise and, consequently, do not generalize well to other signals. To alleviate this problem, we have employed a technique called "early stopping", which is why we have three different datasets - one each for training, validation, and testing. The training dataset is used to 'fit' the model, iteratively adjusting all of the weights in the network to produce better outputs for each of the labeled sample inputs. The validation dataset is used to select the best set of weights learned on the training set: once performance on the validation set starts getting worse (which indicates overfitting!), we stop "early" (i.e. before maximizing performance on the training set) and go back to the weights that performed best on the validation set. The testing set is then used to give an unbiased evaluation of that final selection of weights.
In [21]:
from pylearn2.utils import serial
import theano

def model_to_function(model):
    '''
    Returns an executable function to give input to a trained model
    and obtain output.
    '''
    # symbolic input batch matching the model's expected input space
    X = model.get_input_space().make_theano_batch()
    # symbolic forward propagation through the network
    y = model.fprop(X)
    # compile the symbolic graph into a callable function
    f = theano.function([X], y, allow_input_downcast=True)
    return f

path_to_saved_model = "models/ev_conv_2_32_2_live2_best.pkl"
model = serial.load(path_to_saved_model)
trained_nn = model_to_function(model)
That's it! Now you can use this model to generate predictions for previously unseen examples!
Now we can load data from the database to test on. Let's grab a house from the test set. This will involve the following steps:
1. setting up the PecanStreetDatasetAdapter and supplying appropriate credentials
2. querying for the 'use' column of a particular dataid
3. formatting the resulting trace into windows the network accepts
In [22]:
import sys
import os
# add the path of the 'disaggregator' module, which contains functions for accessing the pecan street database
sys.path.append(os.path.join(os.pardir,os.pardir))
from disaggregator import PecanStreetDatasetAdapter as psda
# set database credentials
db_url = "postgresql://USERNAME:PASSWORD@db.wiki-energy.org:5432/postgres"
psda.set_url(db_url)
Now we can get an array of data from a particular house.
In [24]:
import numpy

def get_nn_input(schema, table, dataid):
    '''
    Returns a numpy array formatted appropriately for passing through
    the example trained neural network.
    '''
    # get an appliance trace of the 'use' column sampled at 15 minutes
    trace = psda.generate_appliance_trace(schema, table, 'use', dataid, '15T')
    # break it into week-long windows stepping forward one day at a time
    window_length = 24 * 4 * 7  # one week of 15-minute samples
    window_step = 24 * 4        # one day of 15-minute samples
    windows = trace.get_windows(window_length, window_step)
    # add two additional dimensions for the neural network
    return windows[:, :, numpy.newaxis, numpy.newaxis]
# Which ids to load?
schema = 'shared'
table = 'validated_01_2014'
ev_present_dataid = 1782
ev_not_present_dataid = 9036
ev_present_input = get_nn_input(schema,table,ev_present_dataid)
ev_not_present_input = get_nn_input(schema,table,ev_not_present_dataid)
Now that we have properly formatted inputs and a function which takes these inputs, we can obtain new predictions.
In [26]:
ev_present_output = trained_nn(ev_present_input)
ev_not_present_output = trained_nn(ev_not_present_input)
These outputs need to be aggregated to obtain a single result.
In [27]:
def prediction_from_outputs(outputs, threshold):
    '''
    Takes all outputs for a particular signal and returns an aggregate prediction.
    '''
    # average the (n_windows, 2) outputs over all windows; column 1
    # corresponds to the one-hot "EV present" label
    prediction_means = numpy.mean(outputs, axis=0)
    prediction = prediction_means[1] > threshold
    return prediction, prediction_means[1]

# configurable threshold
threshold = 0.384

ev_present, present_mean = \
    prediction_from_outputs(ev_present_output, threshold)
ev_not_present, not_present_mean = \
    prediction_from_outputs(ev_not_present_output, threshold)

print ev_present, ev_not_present
print present_mean, not_present_mean
In [7]:
def predict_ev(schema, tables, dataids, threshold, model):
    '''
    Prints and returns EV presence predictions for each dataid,
    aggregating monthly means across the given tables.
    '''
    prediction_function = model_to_function(model)
    all_predictions = []
    all_means = []
    for dataid in dataids:
        predictions = []
        means = []
        for table in tables:
            # query for and format inputs
            present_input = get_nn_input(schema, table, dataid)
            # get raw outputs
            present_output = prediction_function(present_input)
            # process outputs using threshold
            present_prediction, present_mean = \
                prediction_from_outputs(present_output, threshold)
            predictions.append(present_prediction)
            means.append(present_mean)
        total_mean = numpy.mean(means)
        total_prediction = total_mean > threshold
        all_predictions.append(total_prediction)
        all_means.append(total_mean)
        print "Predictions for dataid {}:".format(dataid)
        print "   Monthly predictions: {}".format(predictions)
        print "   Monthly means: {}".format(means)
        print "   Final prediction of electric vehicle presence: {}".format(total_prediction)
        print "   Final mean: {}".format(total_mean)
        print
    return all_predictions
schema = 'shared'
tables = ['validated_01_2014','validated_02_2014','validated_03_2014','validated_04_2014','validated_05_2014']
# dataids drawn from test set
present = [1782, 1714, 6139, 9830, 4641, 7875, 4957, 8669, 8046, 5357]
not_present = [9036, 8467, 4874, 9356, 9019, 6730, 8292, 4732, 3736, 5218]
dataids = present + not_present
threshold = 0.384
predict_ev(schema,tables,dataids,threshold,model)
Out[7]:
Choosing a reasonable threshold for detection is quite important - it determines the type of prediction errors which occur. The threshold chosen above (0.384, below one half) tends to overestimate the likelihood of EV presence, valuing recall over precision. A higher threshold would instead underestimate it, valuing precision over recall. Outcomes are quite sensitive to threshold choice, so it is a good idea to test threshold choices on test data, as above.
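A minimal sketch of such a test is below. The precision_recall helper and the per-house mean scores are hypothetical; in practice the means would come from running the predict_ev function above over the test-set dataids.

def precision_recall(ev_means, non_ev_means, threshold):
    '''
    Hypothetical helper: treats houses whose mean score exceeds the
    threshold as positives and computes precision and recall.
    '''
    tp = sum(1 for m in ev_means if m > threshold)      # EV houses flagged
    fn = len(ev_means) - tp                             # EV houses missed
    fp = sum(1 for m in non_ev_means if m > threshold)  # non-EV houses flagged
    precision = float(tp) / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = float(tp) / (tp + fn)
    return precision, recall

# hypothetical per-house mean scores, for illustration only
ev_means = [0.71, 0.45, 0.62, 0.39, 0.55]
non_ev_means = [0.12, 0.41, 0.08, 0.33, 0.25]
for t in [0.3, 0.384, 0.5]:
    p, r = precision_recall(ev_means, non_ev_means, t)
    print "threshold={}: precision={:.2f}, recall={:.2f}".format(t, p, r)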
In Austin, TX, the trickiest months for detecting EV presence are the summer months, because the AC signal can be mistaken for the EV signal. We advise using total use from winter, fall, and spring months as the most accurate inputs to the EV detection network.
So what do you do when you want to change the model, or the dataset it was trained on?
The following will train a new neural network. But beware! Running this code on a machine without a CUDA GPU could take a week or longer!
from pylearn2.config import yaml_parse

with open('/path/to/network_spec.yaml', 'r') as f:
    nn_yaml = f.read()

# datasets should be saved in the format prefix_train.pkl, prefix_valid.pkl, prefix_test.pkl
hyper_params = {"data_dir": '/path/to/data_directory',
                "dataset_prefix": 'prefix_to_dataset',
                "saved_model_prefix": 'prefix_of_saved_model'}

train = yaml_parse.load(nn_yaml % hyper_params)
train.main_loop()
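Once training finishes, the best saved model can be wrapped for prediction exactly as before. (This assumes the YAML saves the best model under the configured prefix with a "_best.pkl" suffix, as the filename of the pretrained model above suggests.)

from pylearn2.utils import serial
model = serial.load('prefix_of_saved_model_best.pkl')  # assumed save path
trained_nn = model_to_function(model)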
The following creates and pickles a new dataset:
import disaggregator as da
import disaggregator.PecanStreetDatasetAdapter as psda
import pickle
import numpy as np
import pylearn2
import pylearn2.datasets as ds
# tables to select from
schema = 'shared'
tables = [u'validated_01_2014',
          u'validated_02_2014',
          u'validated_03_2014',
          u'validated_04_2014',
          u'validated_05_2014']
# query for all ids
all_car_ids = []
all_use_ids = []
for table in tables:
    all_car_ids.append(psda.get_dataids_with_real_values(schema, table, 'car1'))
    all_use_ids.append(psda.get_dataids_with_real_values(schema, table, 'use'))
# find common ids between all five tables
common_car_ids = sorted(da.utils.get_common_ids(all_car_ids))
common_use_ids = sorted(da.utils.get_common_ids(all_use_ids))
non_car_ids = list(set(common_use_ids) - set(common_car_ids))
n_cars = len(common_car_ids)
n_non_cars = len(non_car_ids)
# for experimental repeatability
np.random.seed(1)
def get_train_valid_test_indices(n):
    '''
    Returns a random permutation of indices broken into training,
    validation, and testing sets.
    '''
    indices = np.arange(n)
    np.random.shuffle(indices)
    n_train = n / 2
    n_valid = n / 4
    n_test = n - n_train - n_valid
    assert(n == n_train + n_valid + n_test)
    return (indices[:n_train],
            indices[n_train : n_train + n_valid],
            indices[n_train + n_valid:])
def get_training_arrays(schema, table, ids, column, sample_rate,
                        window_length, window_step, label):
    '''
    Returns X, y arrays for a list of ids and window steps. Applies a particular
    one-hot label (label should be given in one-hot form) to each example.
    '''
    training_array = []
    for id_ in ids:
        trace = psda.generate_appliance_trace(
            schema, table, column, id_, sample_rate)
        id_array_chunk = trace.get_windows(window_length, window_step)
        training_array.append(id_array_chunk)
    training_array = np.concatenate(training_array, axis=0)
    label_array = np.array([label for _ in xrange(training_array.shape[0])])
    return training_array, label_array
# randomly pick indices
car_train_i, car_valid_i, car_test_i = get_train_valid_test_indices(n_cars)
non_car_train_i, non_car_valid_i, non_car_test_i = \
    get_train_valid_test_indices(n_non_cars)
# turn these into sets of ids
car_train_ids = [common_car_ids[i] for i in car_train_i]
car_valid_ids = [common_car_ids[i] for i in car_valid_i]
car_test_ids = [common_car_ids[i] for i in car_test_i]
non_car_train_ids = [non_car_ids[i] for i in non_car_train_i]
non_car_valid_ids = [non_car_ids[i] for i in non_car_valid_i]
non_car_test_ids = [non_car_ids[i] for i in non_car_test_i]
# make arrays and labels
sample_rate = '15T'
window = 24 * 4 * 7
stride = 24 * 4
column = 'use'
car_label = [0,1] # one-hot
non_car_label = [1,0] # one-hot
prefix = 'dataset'
for i, table in enumerate(tables):
    car_train_X, car_train_y = get_training_arrays(
        schema, table, car_train_ids, column,
        sample_rate, window, stride, car_label)
    car_valid_X, car_valid_y = get_training_arrays(
        schema, table, car_valid_ids, column,
        sample_rate, window, stride, car_label)
    car_test_X, car_test_y = get_training_arrays(
        schema, table, car_test_ids, column,
        sample_rate, window, stride, car_label)
    non_car_train_X, non_car_train_y = get_training_arrays(
        schema, table, non_car_train_ids, column,
        sample_rate, window, stride, non_car_label)
    non_car_valid_X, non_car_valid_y = get_training_arrays(
        schema, table, non_car_valid_ids, column,
        sample_rate, window, stride, non_car_label)
    non_car_test_X, non_car_test_y = get_training_arrays(
        schema, table, non_car_test_ids, column,
        sample_rate, window, stride, non_car_label)

    # concatenate the EV and non-EV examples
    train_X = np.concatenate((car_train_X, non_car_train_X), axis=0)
    train_y = np.concatenate((car_train_y, non_car_train_y), axis=0)
    valid_X = np.concatenate((car_valid_X, non_car_valid_X), axis=0)
    valid_y = np.concatenate((car_valid_y, non_car_valid_y), axis=0)
    test_X = np.concatenate((car_test_X, non_car_test_X), axis=0)
    test_y = np.concatenate((car_test_y, non_car_test_y), axis=0)

    # make pylearn2 datasets
    train_set = ds.DenseDesignMatrix(X=train_X, y=train_y)
    valid_set = ds.DenseDesignMatrix(X=valid_X, y=valid_y)
    test_set = ds.DenseDesignMatrix(X=test_X, y=test_y)

    # pickle the datasets
    with open('/path/to/data/{}_{:02d}_train.pkl'.format(prefix, i), 'w') as f:
        pickle.dump(train_set, f)
    with open('/path/to/data/{}_{:02d}_valid.pkl'.format(prefix, i), 'w') as f:
        pickle.dump(valid_set, f)
    with open('/path/to/data/{}_{:02d}_test.pkl'.format(prefix, i), 'w') as f:
        pickle.dump(test_set, f)
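As a quick sanity check, each pickled dataset can be loaded back to verify its shapes. (A sketch: the path must match wherever the datasets were actually saved.)

import pickle
with open('/path/to/data/dataset_00_train.pkl') as f:
    train_set = pickle.load(f)
# X: (n_examples, window_length), y: (n_examples, 2) one-hot labels
print train_set.X.shape, train_set.y.shape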