EV Detection Example

This notebook outlines the process for obtaining EV presence predictions from a pretrained pylearn2 neural network. We have formulated this as a binary classification problem: given examples of total usage signals (such as the total usage for a home) with and without an EV, can we correctly assign labels to previously unseen examples? For example, can we correctly predict whether a home has an EV or not?

The neural network used here was created with pylearn2, a deep neural network library made by researchers for researchers. Unfortunately, since it is still under rapid development, the library is a little tricky to use (and the version we use cannot simply be pip installed). It allows relatively quick training of neural nets and is highly extensible, but at the cost of ease of use. This notebook tries to alleviate that problem by introducing the library, its dependencies, and how to use it with the rich Pecan Street WikiEnergy dataset.

Note: This notebook is an interactive Python (IPython) notebook. Cells can be executed by clicking the "run" triangle in the toolbar or by typing <Shift>+<Enter>.

Outline

  • Dependencies
  • Background
  • Setup
  • Loading a model
  • Loading data from the database
  • Obtaining predictions from the model
  • Next steps (training new neural networks)

Dependencies

The module versions on which this library was trained are as follows:

Theano==0.6.0
numpy==1.8.1
-e git://github.com/lisa-lab/pylearn2.git@2c6196fce42ccb39b43df0026780e4370dca25a4#egg=pylearn2-master
scipy==0.14.0

The Theano, NumPy, and SciPy versions are those included in the Anaconda distribution. The pylearn2 version was the bleeding-edge version at the time of training.

Background

Where did this neural net come from?

Data Source

The data were drawn from the following shared dataset tables:

validated_01_2014
validated_02_2014
validated_03_2014
validated_04_2014
validated_05_2014

The data were filtered to use only dataids for which data was available across all five months.
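This filtering amounts to intersecting the sets of available dataids across the monthly tables. A minimal sketch, using made-up id lists in place of real query results:

```python
# Hypothetical per-table id lists standing in for real query results.
ids_by_table = [
    [624, 661, 1714, 1782, 86],   # e.g. validated_01_2014
    [624, 1714, 1782, 93, 86],    # e.g. validated_02_2014
    [624, 1714, 1782, 86, 410],   # e.g. validated_03_2014
]

# Keep only dataids present in every table.
common_ids = sorted(set.intersection(*map(set, ids_by_table)))
print(common_ids)  # -> [86, 624, 1714, 1782]
```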

The use columns from the following dataids were used as examples of the "EV present" dataset.

[ 624,  661, 1714, 1782, 1953, 2470, 2638, 2769, 2814, 3192,
 3367, 3482, 3723, 3795, 4135, 4505, 4526, 4641, 4767, 4957,
 4998, 5109, 5357, 6139, 6836, 6910, 6941, 7850, 7863, 7875,
 7940, 8046, 8142, 8197, 8645, 8669, 9484, 9609, 9729, 9830,
 9932, 9934]

The use columns from the following dataids were used as examples of the "EV not present" dataset. (Note that there are ~3x as many of these)

[  86,   93,   94,  410,  484,  585,  739,  744,  821,  871,
  936, 1167, 1283, 1334, 1632, 1718, 1790, 1800, 1994, 2094,
 2129, 2156, 2158, 2171, 2233, 2242, 2337, 2449, 2575, 2606,
 2829, 2864, 2945, 2953, 2974, 3092, 3221, 3263, 3394, 3456,
 3504, 3544, 3649, 3652, 3736, 3778, 3893, 3918, 4031, 4154,
 4298, 4313, 4447, 4732, 4874, 4922, 4956, 5026, 5209, 5218,
 5262, 5275, 5395, 5545, 5568, 5677, 5785, 5814, 5874, 5938,
 5949, 5972, 6412, 6636, 6673, 6730, 7062, 7319, 7390, 7531,
 7536, 7617, 7731, 7769, 7788, 7800, 7951, 8079, 8084, 8292,
 8317, 8342, 8419, 8467, 8741, 8829, 8852, 8956, 9019, 9036,
 9121, 9160, 9343, 9356, 9555, 9578, 9643, 9654, 9701, 9737,
 9771, 9875, 9915, 9922, 9926, 9937, 9938, 9939, 9982, 9983]

Each of these datasets was further broken down into three subsets. A random* quarter of the ids was used for testing, a random quarter for validation, and the remaining half for training. These subsets are shown below.

EV present:

Training:

[1782, 1714, 6139, 9830, 4641, 7875, 4957, 8669, 8046, 5357,
 5109, 8197, 7850, 8645, 7940, 8142, 9729, 1953, 4135, 3367,
 9934]

Validation:

[6836, 6941, 9484, 4998, 4767, 6910, 2638, 7863, 3795, 2769]

Testing:

[9932,  661, 4526,  624, 4505, 2470, 3482, 3192, 2814, 3723,
 9609]



EV not present:

Training:

[9036, 8467, 4874, 9356, 9019, 6730, 8292, 4732, 3736, 5218,
  585, 1790, 8342, 1632, 5209, 2953, 6636, 2606, 5785, 3092,
 9939, 7788, 2864, 5275, 9737, 2094, 4313, 4031, 8084, 7531,
   93, 8852, 3649, 4298, 2575, 3504, 9578, 9982, 1800, 9875,
 7390, 5938, 6673, 1994,  484, 3778, 4956, 3456, 3221, 9926,
 2129, 9555, 5262, 7769, 7617, 9983, 8419, 1167, 5545, 7800]

Validation:

[1283,   94, 8829, 9771, 9160,  739,   86, 9654, 5677, 4922,
 7319, 9121, 3893, 5395, 9922, 8317, 8956, 7951,  936, 2974,
 2945,  821, 3394, 9701, 3263, 2449, 2171, 5814,  871, 2158]

Testing:

[8741, 9343,  744, 8079, 2242, 9938, 5568, 1718, 7731, 3544,
 7536, 4447, 2337, 7062, 3652, 2233, 5874, 9915, 5026, 4154,
 2156, 5949,  410, 5972, 2829, 6412, 9643, 3918, 9937, 1334]


*pseudo-random: np.random.seed(1); np.random.shuffle(indices)
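The split can be reproduced with the seeded shuffle noted above. A sketch follows; the exact slice sizes and the order in which slices were assigned to test/validation/training are assumptions, so the result may differ slightly from the original split.

```python
import numpy as np

def train_valid_test_split(ids, seed=1):
    """Seeded shuffle, then quarter/quarter/half split (slice order is an assumption)."""
    indices = np.array(ids)
    np.random.seed(seed)
    np.random.shuffle(indices)
    n = len(indices)
    n_test, n_valid = n // 4, n // 4
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test

train, valid, test = train_valid_test_split(range(42))  # 42 "EV present" ids
sizes = (len(train), len(valid), len(test))
print(sizes)  # -> (22, 10, 10)
```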

Preprocessing steps

  • NaNs were replaced with zeros.
  • Signals were aggregated into 15 minute intervals (summing up from one minute intervals).
  • Week-long windows of data were taken in day-long strides (24 hrs/day $\cdot$ 4 samples/hr $\cdot$ 7 days/week = 672 samples/week, taken in strides of 24 $\cdot$ 4 = 96 samples/day).
  • 10-point windows in the bottom 10th percentile of EV usage in the "EV present" dataset were dropped. This was to avoid training on signals labeled as having an EV, but in which the EV signal was not actually present.
  • Samples originating from a house in which EV was present in all tables were all labeled 1
  • Samples originating from a house in which EV was not present in any tables were all labeled 0
  • Labels were converted into a corresponding "one-hot" binary representation:
    • Example for our case:
      • $0 \rightarrow [1,0]$
      • $1 \rightarrow [0,1]$
    • Another example (less relevant but more illustrative):
      • $0 \rightarrow [1,0,0,0]$
      • $1 \rightarrow [0,1,0,0]$
      • $2 \rightarrow [0,0,1,0]$
      • $3 \rightarrow [0,0,0,1]$
  • Training sets and one-hot labels for EV and non EV were concatenated and dumped as python pickles.
    • An example of this is shown below, but two numpy arrays are required. For n examples: an array X of training examples of shape (n, 672) and an array y of labels of shape (n, 2)
  • Same for testing and validation sets.
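The windowing and one-hot steps above can be sketched as follows. This is a simplified illustration on a synthetic signal; the real pipeline also handles NaN replacement, 15-minute aggregation, and the EV-usage percentile filter.

```python
import numpy as np

def window(signal, length, step):
    """Slide a window of `length` samples over `signal` in strides of `step`."""
    starts = range(0, len(signal) - length + 1, step)
    return np.array([signal[s:s + length] for s in starts])

def one_hot(labels, n_classes=2):
    """Map integer labels to one-hot rows, e.g. 0 -> [1, 0], 1 -> [0, 1]."""
    out = np.zeros((len(labels), n_classes))
    out[np.arange(len(labels)), labels] = 1
    return out

signal = np.arange(672 + 96 * 3)         # 10 days of 15-minute samples
X = window(signal, length=672, step=96)  # week-long windows in day-long strides
y = one_hot([1] * len(X))                # all from an "EV present" home
print(X.shape)  # -> (4, 672)
print(y.shape)  # -> (4, 2)
```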

Network Specification

Pylearn2 organizes neural networks into yaml files which fully specify the structure of the network.

The file used to train this network is printed below. It has the following attributes:

  • A single-channel input space corresponding to the shape of the week-long input vectors defined above (could be extended to include features like time of day, outside temperature, etc.)
  • An output vector space representing class probabilities (classes are "EV present" and "EV not present"), calculated using a softmax regression
  • 3 hidden layers:
    • 2 identical convolutional layers with a 1D kernel of length 10, with "max pooling" of outputs (taking the highest value of each pair of adjacent neurons)
    • 1 fully connected rectified linear layer with 32 neurons.

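A quick way to sanity-check this architecture is to trace the length of the signal through each layer. The sketch below assumes 'valid' convolution and non-overlapping pooling with floor rounding; pylearn2's exact pooling arithmetic may round differently.

```python
def conv_out(n, kernel):
    # 'valid' convolution with stride 1
    return n - kernel + 1

def pool_out(n, pool, stride):
    # non-overlapping max pooling, floor rounding (an assumption)
    return (n - pool) // stride + 1

n = 672                              # samples per week-long window
n = pool_out(conv_out(n, 10), 2, 2)  # h0: conv kernel 10, pool 2, stride 2
print(n)  # -> 331
n = pool_out(conv_out(n, 10), 2, 2)  # h1: same shape
print(n)  # -> 161
```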
In [20]:
# basic configurable parameters
params = {'data_dir':'data',
          'dataset_prefix':'dataset',
          'saved_model_prefix':'saved_model'}

# open and read network
with open('./ev_conv_nn_2_layer_32_stride_2.yaml','r') as f:
    print f.read() % params


!obj:pylearn2.train.Train {
    dataset: &train !pkl: "data/dataset_train.pkl",
    model: !obj:pylearn2.models.mlp.MLP {
        batch_size: 100,
        input_space: !obj:pylearn2.space.Conv2DSpace {
            shape: [672,1],
            num_channels: 1
        },
        layers: [ !obj:pylearn2.models.mlp.ConvRectifiedLinear {
                     layer_name: 'h0',
                     output_channels: 32,
                     irange: .05,
                     kernel_shape: [10, 1],
                     pool_shape: [2, 1],
                     pool_stride: [2, 1],
                     max_kernel_norm: 1.9365
                 }, !obj:pylearn2.models.mlp.ConvRectifiedLinear {
                     layer_name: 'h1',
                     output_channels: 32,
                     irange: .05,
                     kernel_shape: [10, 1],
                     pool_shape: [2, 1],
                     pool_stride: [2, 1],
                     max_kernel_norm: 1.9365
                 }, !obj:pylearn2.models.mlp.RectifiedLinear {
                     layer_name: 'h2',
                     dim: 32,
                     sparse_init: 15
                 }, !obj:pylearn2.models.mlp.Softmax {
                     max_col_norm: 1.9365,
                     layer_name: 'y',
                     n_classes: 2,
                     istdev: .05
                 }
                ],
    },
    algorithm: !obj:pylearn2.training_algorithms.bgd.BGD {
        batch_size: 100,
        line_search_mode: 'exhaustive',
        conjugate: 1,
        monitoring_dataset:
        {
            'train' : *train,
            'valid' : !pkl: "data/dataset_valid.pkl",
            'test'  : !pkl: "data/dataset_test.pkl",
        },
        termination_criterion: !obj:pylearn2.termination_criteria.MonitorBased {
            channel_name: "valid_y_misclass"
        }
    },
    extensions: [
        !obj:pylearn2.train_extensions.best_params.MonitorBasedSaveBest {
             channel_name: 'valid_y_misclass',
             save_path: "models/saved_model_best.pkl"
        },
    ],
    save_path: "models/saved_model.pkl",
    save_freq: 1
}

Why this network? Why not something simpler?

Neural networks are notoriously difficult to interpret - they're something of a "black box". So, why did we use this one? It was a trade-off between performance and interpretability: in this case, we value the ability to interpret the features the model uses less than the ability to obtain an accurate probability of EV presence. Convolutional neural networks also have some advantages that make them particularly applicable to the task of EV detection.

Neural networks with "convolutional" layers have an inherent ability to learn translation invariant features. We know that the input layer is a time series, and that we are looking for a particular signal to appear. Importantly, we don't care when the EV signal appears, just that it appears. Each convolutional layer learns features which are translation invariant. That means that we have the same ability to recognize the EV signal anywhere it appears. This intuition that we have about the nature of the problem is directly coded into the network itself. Additionally, in general, neural networks have the ability to learn non-linearities. Deep neural networks, like this one, can learn both simple and complex functions.
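A toy illustration of this translation invariance: cross-correlating a signal with a kernel matched to a pattern yields the same peak response wherever the pattern occurs. This is not the trained network, just numpy's correlate applied to a synthetic bump.

```python
import numpy as np

pattern = np.array([1.0, 3.0, 1.0])  # a synthetic "EV-like" bump

def max_response(position, length=20):
    """Place the bump at `position` and return the peak filter response."""
    signal = np.zeros(length)
    signal[position:position + 3] = pattern
    return np.max(np.correlate(signal, pattern, mode='valid'))

# The peak response is identical regardless of where the bump sits.
print(max_response(2))   # -> 11.0
print(max_response(11))  # -> 11.0
```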

Caveats

Because neural networks are so flexible, we must be very careful to avoid overfitting. In other words, we don't want to tune the network to recognize attributes of the signals which are due simply to random sampling noise, and, consequently, do not generalize well to other signals. To alleviate this problem, we have employed a technique called "early stopping", which is why we have three different datasets - one each for training, validation, and testing. The training dataset is used to 'fit' the model, iteratively adjusting all of the weights in the network to produce better outputs for each of the labeled sample inputs. The validation dataset is used to select the best set of weights learned on the training set: once the outputs on the validation set start getting worse (which indicates overfitting!) we stop "early" (i.e. before maximizing the performance on the training set) and keep the set of weights that performed best on the validation set. The testing set is then used to evaluate that final selection of weights.
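In pseudocode terms, the early-stopping logic looks roughly like this. It is a simplified sketch with a synthetic validation curve, not the monitor-based BGD loop pylearn2 actually runs.

```python
def early_stop(valid_errors, patience=2):
    """Return (epoch, error) of the best validation error, stopping once it
    fails to improve for `patience` consecutive epochs (a sketch)."""
    best_epoch, best_err, waited = 0, float('inf'), 0
    for epoch, err in enumerate(valid_errors):
        if err < best_err:
            best_epoch, best_err, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation error is climbing: overfitting
    return best_epoch, best_err

# Synthetic validation misclassification curve: improves, then overfits.
curve = [0.40, 0.31, 0.25, 0.24, 0.26, 0.29, 0.33]
print(early_stop(curve))  # -> (3, 0.24)
```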

Loading a trained model

Here's how to load one of the models we have already trained:

(This step requires the dependencies listed above)


In [21]:
from pylearn2.utils import serial
import theano

def model_to_function(model):
    '''
    Returns an executable function to give input to a trained model
    and obtain output.
    '''
    X = model.get_input_space().make_theano_batch()
    y = model.fprop(X)
    f = theano.function([X],y,allow_input_downcast=True)
    return f

path_to_saved_model = "models/ev_conv_2_32_2_live2_best.pkl"
model = serial.load(path_to_saved_model)
trained_nn = model_to_function(model)

That's it! Now you can use this model to generate predictions for previously unseen examples!

Loading data from the database

Now we can load data from the database to test on. Let's grab a house from the test set. First, we set up access to the database:


In [22]:
import sys
import os

# add the path of the 'disaggregator' module, which contains functions for accessing the pecan street database
sys.path.append(os.path.join(os.pardir,os.pardir))

from disaggregator import PecanStreetDatasetAdapter as psda

# set database credentials
db_url = "postgresql://USERNAME:PASSWORD@db.wiki-energy.org:5432/postgres"
psda.set_url(db_url)

Now we can get an array of data from a particular house.


In [24]:
import numpy

def get_nn_input(schema,table,dataid):
    '''
    Returns a numpy array formatted appropriately for passing through
    the example trained neural network.
    '''
    #Get an appliance trace of the use column sampled at 15 minutes
    trace = psda.generate_appliance_trace(schema, table, 'use', dataid, '15T')
    
    # break it into windows
    window_length = 24 * 4 * 7
    window_step = 24 * 4
    windows = trace.get_windows(window_length, window_step)
    
    # add two additional dimensions for the neural network
    return windows[:,:,numpy.newaxis,numpy.newaxis]
    
# Which ids to load?
schema = 'shared'
table = 'validated_01_2014'
ev_present_dataid = 1782
ev_not_present_dataid = 9036

ev_present_input = get_nn_input(schema,table,ev_present_dataid)
ev_not_present_input = get_nn_input(schema,table,ev_not_present_dataid)


select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=1782
select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=9036

Now that we have properly formatted inputs and a function which takes these inputs, we can obtain new predictions.


In [26]:
ev_present_output = trained_nn(ev_present_input)
ev_not_present_output = trained_nn(ev_not_present_input)

These outputs need to be aggregated to obtain a single result.


In [27]:
def prediction_from_outputs(outputs,threshold):
    '''
    Takes all outputs for a particular signal and returns an aggregate prediction
    '''
    prediction_means = numpy.mean(outputs, axis=0)
    prediction = prediction_means[1] > threshold
    return prediction, prediction_means[1]

# configurable threshold
threshold = 0.384

ev_present, present_mean = \
    prediction_from_outputs(ev_present_output,threshold)
ev_not_present ,not_present_mean = \
    prediction_from_outputs(ev_not_present_output,threshold)

print ev_present, ev_not_present
print present_mean, not_present_mean


True False
0.384443294214 0.382134946338

Repeat in bulk


In [7]:
def predict_ev(schema,tables,dataids,threshold,model):
    prediction_function = model_to_function(model)
    
    all_predictions = []
    all_means = []
    for dataid in dataids:
        predictions = []
        means = []
        for table in tables:
            # query for and format inputs
            present_input = get_nn_input(schema,table,dataid)

            # get raw outputs
            present_output = prediction_function(present_input)

            # process outputs using threshold
            present_prediction, present_mean = prediction_from_outputs(present_output,threshold)

            predictions.append(present_prediction)
            means.append(present_mean)
        total_mean = numpy.mean(means)
        total_prediction = total_mean > threshold
        all_predictions.append(total_prediction)
        all_means.append(total_mean)
        print "Predictions for dataid {}:".format(dataid)
        print "  Monthly predictions: {}".format(predictions)
        print "  Monthly means: {}".format(means)
        print "  Final prediction of electric vehicle presence: {}".format(total_prediction)
        print "  Final mean: {}".format(total_mean)
        print
    return all_predictions

schema = 'shared'

tables = ['validated_01_2014','validated_02_2014','validated_03_2014','validated_04_2014','validated_05_2014']

# dataids drawn from test set
present = [1782, 1714, 6139, 9830, 4641, 7875, 4957, 8669, 8046, 5357]
not_present = [9036, 8467, 4874, 9356, 9019, 6730, 8292, 4732, 3736, 5218]
dataids = present + not_present

threshold = 0.384

predict_ev(schema,tables,dataids,threshold,model)


select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=1782
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=1782
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=1782
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=1782
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=1782
Predictions for dataid 1782:
  Monthly predictions: [True, True, True, True, True]
  Monthly means: [0.38444329421442142, 0.38701898888067393, 0.38663993657953255, 0.39029276304348359, 0.39774443144044963]
  Final prediction of electric vehicle presence: True
  Final mean: 0.389227882832

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=1714
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=1714
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=1714
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=1714
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=1714
Predictions for dataid 1714:
  Monthly predictions: [True, True, True, True, True]
  Monthly means: [0.40697162620898503, 0.40549931727968447, 0.40604094016776104, 0.40895474621879763, 0.40698137955378538]
  Final prediction of electric vehicle presence: True
  Final mean: 0.406889601886

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=6139
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=6139
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=6139
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=6139
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=6139
Predictions for dataid 6139:
  Monthly predictions: [True, True, True, True, True]
  Monthly means: [0.39956848794238481, 0.40223090954717683, 0.39617006659167886, 0.40090321991140199, 0.40252308051787866]
  Final prediction of electric vehicle presence: True
  Final mean: 0.400279152902

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=9830
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=9830
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=9830
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=9830
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=9830
Predictions for dataid 9830:
  Monthly predictions: [True, True, True, True, True]
  Monthly means: [0.39982160551351581, 0.40314928875577294, 0.40841650916953598, 0.41235157884815637, 0.40113256072966907]
  Final prediction of electric vehicle presence: True
  Final mean: 0.404974308603

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=4641
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=4641
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=4641
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=4641
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=4641
Predictions for dataid 4641:
  Monthly predictions: [True, True, True, True, True]
  Monthly means: [0.40709765075026905, 0.42192307497141013, 0.41737530399755468, 0.41717556880386436, 0.40812096320940083]
  Final prediction of electric vehicle presence: True
  Final mean: 0.414338512346

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=7875
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=7875
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=7875
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=7875
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=7875
Predictions for dataid 7875:
  Monthly predictions: [True, True, True, True, True]
  Monthly means: [0.38691026336344542, 0.38891957568857338, 0.39608331868321345, 0.4000495470269127, 0.40657082771386238]
  Final prediction of electric vehicle presence: True
  Final mean: 0.395706706495

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=4957
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=4957
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=4957
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=4957
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=4957
Predictions for dataid 4957:
  Monthly predictions: [True, True, True, True, True]
  Monthly means: [0.39327705411545222, 0.39936817346393144, 0.39122566656342328, 0.39367894988495383, 0.39283948696464499]
  Final prediction of electric vehicle presence: True
  Final mean: 0.394077866198

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=8669
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=8669
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=8669
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=8669
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=8669
Predictions for dataid 8669:
  Monthly predictions: [True, True, True, True, True]
  Monthly means: [0.40489001305117522, 0.40136708251406911, 0.420328569062423, 0.40589821646089946, 0.40352868095228722]
  Final prediction of electric vehicle presence: True
  Final mean: 0.407202512408

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=8046
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=8046
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=8046
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=8046
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=8046
Predictions for dataid 8046:
  Monthly predictions: [True, True, True, True, True]
  Monthly means: [0.39613893223304614, 0.39277741359188245, 0.40359571195214883, 0.40178333283497919, 0.39344997015083055]
  Final prediction of electric vehicle presence: True
  Final mean: 0.397549072153

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=5357
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=5357
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=5357
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=5357
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=5357
Predictions for dataid 5357:
  Monthly predictions: [True, True, True, True, True]
  Monthly means: [0.4216663748715313, 0.42153314616797505, 0.42241386944790138, 0.4124196817359102, 0.39564997583923928]
  Final prediction of electric vehicle presence: True
  Final mean: 0.414736609613

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=9036
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=9036
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=9036
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=9036
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=9036
Predictions for dataid 9036:
  Monthly predictions: [False, False, True, False, False]
  Monthly means: [0.38213494633799056, 0.38159029020508584, 0.38491431067403992, 0.37777233620539935, 0.36927261589381283]
  Final prediction of electric vehicle presence: False
  Final mean: 0.379136899863

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=8467
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=8467
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=8467
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=8467
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=8467
Predictions for dataid 8467:
  Monthly predictions: [False, False, False, False, False]
  Monthly means: [0.36489798333056694, 0.36408505064580993, 0.37356533595590496, 0.37383959632792674, 0.37775680716217491]
  Final prediction of electric vehicle presence: False
  Final mean: 0.370828954684

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=4874
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=4874
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=4874
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=4874
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=4874
Predictions for dataid 4874:
  Monthly predictions: [False, False, False, False, False]
  Monthly means: [0.37328280637875882, 0.37590873723774743, 0.37254323804333739, 0.37394555378269478, 0.37162627345676624]
  Final prediction of electric vehicle presence: False
  Final mean: 0.37346132178

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=9356
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=9356
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=9356
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=9356
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=9356
Predictions for dataid 9356:
  Monthly predictions: [False, False, False, False, False]
  Monthly means: [0.31970449683987673, 0.33523846800291068, 0.34168340616258386, 0.35368725017352076, 0.34830844301772929]
  Final prediction of electric vehicle presence: False
  Final mean: 0.339724412839

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=9019
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=9019
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=9019
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=9019
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=9019
Predictions for dataid 9019:
  Monthly predictions: [False, False, False, False, False]
  Monthly means: [0.38211554710330448, 0.38163290245389159, 0.38186312217057489, 0.38326468116898643, 0.38221990809802797]
  Final prediction of electric vehicle presence: False
  Final mean: 0.382219232199

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=6730
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=6730
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=6730
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=6730
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=6730
Predictions for dataid 6730:
  Monthly predictions: [False, False, False, False, False]
  Monthly means: [0.38318294046031992, 0.38161190831307823, 0.37770841002124966, 0.37889882824169929, 0.38241471867008314]
  Final prediction of electric vehicle presence: False
  Final mean: 0.380763361141

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=8292
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=8292
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=8292
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=8292
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=8292
Predictions for dataid 8292:
  Monthly predictions: [False, False, False, False, False]
  Monthly means: [0.38229218583563385, 0.37650696095304614, 0.38193758454642163, 0.3809077844621665, 0.37714453082851518]
  Final prediction of electric vehicle presence: False
  Final mean: 0.379757809325

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=4732
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=4732
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=4732
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=4732
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=4732
Predictions for dataid 4732:
  Monthly predictions: [False, True, True, True, True]
  Monthly means: [0.38362216025664059, 0.38457195190381388, 0.3853031226321747, 0.38622792381043569, 0.38713265354499049]
  Final prediction of electric vehicle presence: True
  Final mean: 0.38537156243

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=3736
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=3736
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=3736
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=3736
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=3736
Predictions for dataid 3736:
  Monthly predictions: [False, False, False, False, False]
  Monthly means: [0.37960245478766347, 0.37824023849174021, 0.37595875729300465, 0.37865985067972757, 0.37572167430398418]
  Final prediction of electric vehicle presence: False
  Final mean: 0.377636595111

select use,localminute from "PecanStreet_SharedData".validated_01_2014 where dataid=5218
select use,localminute from "PecanStreet_SharedData".validated_02_2014 where dataid=5218
select use,localminute from "PecanStreet_SharedData".validated_03_2014 where dataid=5218
select use,localminute from "PecanStreet_SharedData".validated_04_2014 where dataid=5218
select use,localminute from "PecanStreet_SharedData".validated_05_2014 where dataid=5218
Predictions for dataid 5218:
  Monthly predictions: [False, True, False, True, True]
  Monthly means: [0.37490044767523895, 0.38400758705447469, 0.37841594306115495, 0.38525791354737471, 0.38435735832485995]
  Final prediction of electric vehicle presence: False
  Final mean: 0.381387849933

Out[7]:
[True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False]

Choosing a good threshold

Choosing a reasonable threshold for detection is quite important - it affects the type of prediction errors which occur. The threshold chosen above tends to overestimate the likelihood of EV presence (valuing recall over precision); a higher threshold would underestimate it. Outcomes are quite sensitive to threshold choice, so it is a good idea to test threshold choices on test data, as above.
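One way to pick a threshold is to sweep candidate values over labeled test outputs and compare precision and recall at each. A sketch, using hypothetical aggregate means (echoing the scale of the outputs above) in place of real network outputs:

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of `scores > threshold` against boolean labels."""
    preds = [s > threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(l and not p for p, l in zip(preds, labels))
    precision = tp / float(tp + fp) if tp + fp else 1.0
    recall = tp / float(tp + fn) if tp + fn else 1.0
    return precision, recall

# Hypothetical mean scores (higher = more EV-like) and true labels.
scores = [0.389, 0.407, 0.400, 0.379, 0.371, 0.385]
labels = [True, True, True, False, False, False]

print(precision_recall(scores, labels, 0.384))  # -> (0.75, 1.0)
print(precision_recall(scores, labels, 0.390))  # higher threshold: fewer positives
```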

Note about time of year

In Austin, TX, the trickiest months for detecting EV presence are the summer months, because the AC signal can be mistaken for the EV signal. We advise using total use from winter, fall and spring months as the most accurate inputs to the EV detection network.

Next steps

So what do you do when you want to change the model, or the dataset it was trained on?

Training a new model

The following will train a new neural network. But beware! Running this code on a machine without a CUDA GPU could take a week or longer!

from pylearn2.config import yaml_parse

with open('/path/to/network_spec.yaml','r') as f:
    nn_yaml = f.read()

hyper_params = {"data_dir": '/path/to/data_directory',
                "dataset_prefix": 'prefix_to_dataset', # datasets should be saved in the format prefix_train.pkl, prefix_valid.pkl, prefix_test.pkl
                "saved_model_prefix": 'prefix_of_saved_model'}

train = yaml_parse.load(nn_yaml % hyper_params)
train.main_loop()
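The `%` substitution above fills Python string-formatting placeholders inside the YAML file with the values in `hyper_params`. A minimal, hypothetical sketch of what such placeholders look like in `network_spec.yaml` (the actual spec will contain many more fields):

```yaml
!obj:pylearn2.train.Train {
    dataset: !pkl: "%(data_dir)s/%(dataset_prefix)s_train.pkl",
    # ... model, algorithm, and extensions sections go here ...
}
```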

Obtaining a new dataset

The following creates and pickles a new dataset:

import disaggregator as da
import disaggregator.PecanStreetDatasetAdapter as psda
import pickle
import numpy as np
import pylearn2
import pylearn2.datasets as ds

# tables to select from
schema = 'shared'
tables = [u'validated_01_2014',
          u'validated_02_2014',
          u'validated_03_2014',
          u'validated_04_2014',
          u'validated_05_2014',]

# query for all ids
all_car_ids = []
all_use_ids = []
for table in tables:
    all_car_ids.append(psda.get_dataids_with_real_values(schema,table,'car1'))
    all_use_ids.append(psda.get_dataids_with_real_values(schema,table,'use'))

# find common ids between all five tables
common_car_ids = sorted(da.utils.get_common_ids(all_car_ids))
common_use_ids = sorted(da.utils.get_common_ids(all_use_ids))
non_car_ids = list(set(common_use_ids) - set(common_car_ids))

n_cars = len(common_car_ids)
n_non_cars = len(non_car_ids)

# for experimental repeatability
np.random.seed(1)

def get_train_valid_test_indices(n):
    '''
    Returns a random permutation of indices broken into training, validation, and testing sets
    '''
    indices = np.arange(n)
    np.random.shuffle(indices)
    n_train = n // 2                 # half for training
    n_valid = n // 4                 # quarter for validation
    n_test = n - n_train - n_valid   # remainder for testing
    assert(n == n_train + n_valid + n_test)
    return (indices[:n_train],
           indices[n_train : n_train+n_valid],
           indices[n_train+n_valid:])

def get_training_arrays(schema, table, ids, column, sample_rate,
        window_length, window_step, label):
    '''
    Returns X, y arrays for a list of ids and window steps. Applies a particular
    one-hot label (label should be given in one-hot form) to each example. 
    '''
    training_array = []
    for id_ in ids:
        trace = psda.generate_appliance_trace(
            schema, table, column, id_, sample_rate)
        id_array_chunk = trace.get_windows(window_length,window_step)
        training_array.append(id_array_chunk)
    training_array = np.concatenate(training_array,axis=0)
    label_array = np.array([label for _ in xrange(training_array.shape[0])])
    return training_array,label_array

# randomly pick indices
car_train_i, car_valid_i, car_test_i = get_train_valid_test_indices(n_cars)
non_car_train_i, non_car_valid_i, non_car_test_i =\
    get_train_valid_test_indices(n_non_cars)

# turn these into sets of ids
car_train_ids = [common_car_ids[i] for i in car_train_i]
car_valid_ids = [common_car_ids[i] for i in car_valid_i]
car_test_ids = [common_car_ids[i] for i in car_test_i]
non_car_train_ids = [non_car_ids[i] for i in non_car_train_i]
non_car_valid_ids = [non_car_ids[i] for i in non_car_valid_i]
non_car_test_ids = [non_car_ids[i] for i in non_car_test_i]

# make arrays and labels

sample_rate = '15T'
window = 24 * 4 * 7
stride = 24 * 4
column = 'use'
car_label = [0,1] # one-hot
non_car_label = [1,0] # one-hot
prefix = 'dataset'
for i,table in enumerate(tables):
    car_train_X, car_train_y = get_training_arrays(
            schema, table, car_train_ids, column,
            sample_rate, window, stride, car_label)
    car_valid_X, car_valid_y = get_training_arrays(
            schema, table, car_valid_ids, column,
            sample_rate, window, stride, car_label)
    car_test_X, car_test_y = get_training_arrays(
            schema, table, car_test_ids, column,
            sample_rate, window, stride, car_label)

    non_car_train_X, non_car_train_y = get_training_arrays(
            schema, table, non_car_train_ids, column,
            sample_rate, window, stride, non_car_label)
    non_car_valid_X, non_car_valid_y = get_training_arrays(
            schema, table, non_car_valid_ids, column,
            sample_rate, window, stride, non_car_label)
    non_car_test_X, non_car_test_y = get_training_arrays(
            schema, table, non_car_test_ids, column,
            sample_rate, window, stride, non_car_label)

    #concatenate
    train_X = np.concatenate((car_train_X,non_car_train_X),axis=0)
    train_y = np.concatenate((car_train_y,non_car_train_y),axis=0)

    valid_X = np.concatenate((car_valid_X,non_car_valid_X),axis=0)
    valid_y = np.concatenate((car_valid_y,non_car_valid_y),axis=0)

    test_X = np.concatenate((car_test_X,non_car_test_X),axis=0)
    test_y = np.concatenate((car_test_y,non_car_test_y),axis=0)


    # make pylearn2 datasets
    train_set = ds.DenseDesignMatrix(X=train_X,y=train_y)
    valid_set = ds.DenseDesignMatrix(X=valid_X,y=valid_y)
    test_set = ds.DenseDesignMatrix(X=test_X,y=test_y)


    # pickle the datasets
    # pickle the datasets (binary mode, for portability)
    with open('/path/to/data/{}_{:02d}_train.pkl'.format(prefix,i),'wb') as f:
        pickle.dump(train_set,f)
    with open('/path/to/data/{}_{:02d}_valid.pkl'.format(prefix,i),'wb') as f:
        pickle.dump(valid_set,f)
    with open('/path/to/data/{}_{:02d}_test.pkl'.format(prefix,i),'wb') as f:
        pickle.dump(test_set,f)
