Getting started with the eemeter library

This jupyter notebook is an interactive tutorial. It walks through loading data, running the CalTRACK methods, and plotting results. You'll run all the code yourself. Cells can be executed with <shift><enter>. If you feel so inspired, make edits to the code in these cells and dig deeper.

Note on tutorial scope

This tutorial assumes the reader has properly installed python and the eemeter package (pip install eemeter) and has a basic working knowledge of python syntax and usage.

Outline

This tutorial is a self-paced walkthrough of how to use the eemeter package. We'll cover the following:

  • Background - why this library
  • Loading data
  • Plotting and visualization
  • Filtering data to baseline or reporting periods
  • Creating design matrix datasets
  • Fitting baseline (and reporting) models
  • Using fitted models for prediction
  • Computing CalTRACK metered savings

The tutorial is focused on demonstrating how to use the package to run the CalTRACK Hourly, Daily, and Billing methods on hourly, daily, and billing meter data.

Background and Cautions

At time of writing (Sept 2018), the OpenEEmeter, as implemented in the eemeter package, contains the most complete open source implementation of the CalTRACK methods, which specify a way of calculating avoided energy use at a single meter. However, using the OpenEEmeter to calculate avoided energy use does not in itself guarantee compliance with the CalTRACK method specification. Nor is using the OpenEEmeter a requirement of the CalTRACK methods. The eemeter package is a toolkit that may help with implementing a CalTRACK compliant analysis: it provides a particular implementation of the CalTRACK methods consisting of a set of functions, parameters, and classes that can be configured to run the CalTRACK methods and variants. Please keep in mind while using the package that the eemeter assumes certain data cleaning tasks specified in the CalTRACK methods have been performed prior to use. The package will create warnings to expose errors of this nature where possible.

The eemeter package is built for flexibility and modularity. While this is generally helpful and makes it easier to use the package, one potential consequence for users is that without carefully following both the eemeter documentation and the guidance provided in the CalTRACK methods, it is very possible to use the eemeter in a way that does not comply with the CalTRACK methods. For example, while the CalTRACK methods set specific hard limits for the purpose of standardization and consistency, the eemeter can be configured to edit or entirely ignore those limits. The main reason for this flexibility is that the eemeter package is used not only to comply with the CalTRACK methods, but also to develop, test, and propose potential changes to those methods.

Rather than providing a single method that directly calculates avoided energy use from the required inputs, the eemeter library provides a series of modular functions that can be strung together in a variety of ways. The tutorial below describes common usage and sequencing of these functions, especially when it might not otherwise be apparent from the API documentation.

Some new users have assumed that the eemeter package constitutes an entire application suitable for running metering analytics at scale. This is not necessarily the case. It is designed instead to be embedded within other applications or to be used in one-off analyses. The eemeter is a toolbox that leaves to the user decisions about when to use or how to embed the provided tools within other applications. This limitation is an important consequence of the decision to make the methods and implementation as open and accessible as possible.

As you dive in, remember that this is a work in progress and that we welcome feedback and contributions. To contribute, please open an issue or a pull request on github.

Jupyter housekeeping

Note: these Jupyter cell magics enable some useful special features but are unrelated to eemeter.


In [1]:
# inline plotting
%matplotlib inline

# allow live package editing
%load_ext autoreload
%autoreload 2

Importing the eemeter library

Once the eemeter has been installed, it can be imported as shown below. The package exposes most of its API at the top level of the library, so this single import is generally sufficient.


In [2]:
import eemeter

This tutorial requires eemeter version 2.x.x. You can verify the version you have installed with the command below.


In [3]:
eemeter.get_version()


Out[3]:
'2.8.4'

Loading data

The three essential inputs to eemeter library functions are the following:

  1. Meter data
  2. Temperature data from a nearby weather station
  3. Project or intervention dates

Users of the library are responsible for obtaining and formatting this data (to get weather data, see eeweather, which helps perform site to weather station matching and can pull and cache temperature data directly from public (US) data sources). Some samples come loaded with the library and we'll load these first to save you the trouble of loading in your own data. The simulated sample data additionally has the useful property that we can load the same underlying data in three different frequencies: hourly, daily, and billing data.

We directly use pandas DataFrame and Series objects to hold the input meter and temperature time series data, which allows us to easily take advantage of the powerful methods provided by the pandas package. Using pandas has the added advantage of being familiar to pythonistas who frequently work with data of this nature in python. These formats are discussed in more detail below. If working with your own data instead of these samples, please refer directly to the excellent pandas documentation for instructions for loading data (e.g., pandas.read_csv). For some common cases, eemeter does come packaged with loading methods, but these will only work for particular data formats.
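For example, a CSV with one row per metering period could be loaded into the expected shape roughly like this. This is a minimal sketch, not an eemeter API: the inline CSV and its "start"/"value" column names are assumptions about your raw data.

```python
import io

import pandas as pd

# Hypothetical CSV with one row per metering period; the column names
# "start" and "value" are an assumption about your raw data.
raw_csv = io.StringIO(
    "start,value\n"
    "2017-01-01T00:00:00Z,25.3\n"
    "2017-01-02T00:00:00Z,21.8\n"
    "2017-01-03T00:00:00Z,\n"  # an empty reading parses to NaN
)

# Parse into the convention used throughout this tutorial: a tz-aware
# DatetimeIndex named 'start' and a single 'value' column.
meter_data = pd.read_csv(raw_csv, parse_dates=["start"], index_col="start")
print(meter_data)
```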

Remember: the sample data is simulated, not real!


In [4]:
meter_data_hourly, temperature_data_hourly, metadata_hourly = \
    eemeter.load_sample('il-electricity-cdd-hdd-hourly')

meter_data_daily, temperature_data_daily, metadata_daily = \
    eemeter.load_sample('il-electricity-cdd-hdd-daily')

meter_data_billing, temperature_data_billing, metadata_billing = \
    eemeter.load_sample('il-electricity-cdd-hdd-billing_monthly')

The metadata includes project start and end dates that we can use to determine a baseline period. All three of these samples have the same project dates, so we'll just use one of them. All we are using this for is to define the baseline end date and the reporting period start date.


In [5]:
baseline_end_date = metadata_billing['blackout_start_date']
baseline_end_date


Out[5]:
datetime.datetime(2016, 12, 26, 0, 0, tzinfo=<UTC>)

More about the meter data structure

The convention for formatting meter data is to create a pandas DataFrame with a DatetimeIndex called start and a single column of meter readings called value. The index datetime values represent the start dates of each metering period. The end of each period is the start of the next period, even for data with variable period lengths like billing data. The end date of the last period can be supplied by appending an extra period with the final end date and a NaN value. Missing data is represented by one or more periods of value NaN. Data should be sorted by time and deduplicated prior to use with eemeter. Timestamps must be timezone aware.

Data is formatted like this as a convenience to avoid the need to store a start and an end period for each data point. However, the convention that uses start dates as timestamps can be a bit confusing. If you are starting with billing data, which is sometimes keyed primarily by period end dates, make sure the transformation is done properly so that the meter data ends up with start dates as timestamps.
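To make the convention concrete, here is a minimal hand-built example with made-up values (not sample data from the library):

```python
import pandas as pd

# Billing-style meter data under the eemeter convention: tz-aware start
# timestamps, one 'value' column, and a trailing NaN row whose timestamp
# carries the end date of the final period.
index = pd.DatetimeIndex(
    ["2017-01-01", "2017-02-02", "2017-03-01", "2017-04-03"],
    tz="UTC", name="start",
)
meter_data = pd.DataFrame(
    {"value": [730.1, 655.2, 710.5, float("nan")]}, index=index
)

# Period lengths are implied: each period ends where the next one begins.
period_lengths = meter_data.index.to_series().diff().shift(-1)
print(period_lengths.dt.days.tolist())
```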

Take a look at the hourly, daily, and billing data we just loaded. It follows the conventions described above. Notice that the format is identical but the timestamps and values are different.


In [6]:
meter_data_hourly.head()  # pandas.DataFrame.head filters to just the first 5 rows


Out[6]:
value
start
2015-11-22 06:00:00+00:00 0.29
2015-11-22 07:00:00+00:00 1.47
2015-11-22 08:00:00+00:00 0.58
2015-11-22 09:00:00+00:00 0.28
2015-11-22 10:00:00+00:00 1.25

In [7]:
meter_data_daily.head()


Out[7]:
value
start
2015-11-22 00:00:00+00:00 32.34
2015-11-23 00:00:00+00:00 23.80
2015-11-24 00:00:00+00:00 26.26
2015-11-25 00:00:00+00:00 21.32
2015-11-26 00:00:00+00:00 6.70

In [8]:
meter_data_billing.tail()  # last 5 rows


Out[8]:
value
start
2017-09-26 06:00:00+00:00 526.25
2017-10-27 06:00:00+00:00 649.80
2017-11-25 06:00:00+00:00 650.52
2017-12-22 06:00:00+00:00 1393.40
2018-01-20 06:00:00+00:00 NaN

The convention for formatting temperature data is as a pandas Series, also with a DatetimeIndex. These three versions are all exactly the same. That is because we always start with hourly temperature data. This is necessary even for daily and billing analyses because we must be able to aggregate the temperatures in different ways over different time series - including dates in many different timezones, whose local midnights don't always align with the UTC midnights provided in preaggregated daily data.
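A hand-built illustration of this convention, and of why hourly input matters for local-midnight aggregation. The values are made up and the US/Central timezone is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# Hourly, tz-aware temperature Series (degrees F), per the eemeter convention.
index = pd.date_range("2017-01-01", periods=24, freq="h", tz="UTC", name="dt")
temperature_data = pd.Series(np.linspace(20.0, 35.0, 24), index=index, name="tempF")

# Starting from hourly data lets us aggregate to *local* midnights; here the
# 24 UTC hours span two US/Central calendar days, not one.
daily_mean = temperature_data.tz_convert("US/Central").resample("D").mean()
print(len(daily_mean))
```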


In [9]:
temperature_data_hourly.head()


Out[9]:
dt
2015-11-22 06:00:00+00:00    21.01
2015-11-22 07:00:00+00:00    20.35
2015-11-22 08:00:00+00:00    19.38
2015-11-22 09:00:00+00:00    19.02
2015-11-22 10:00:00+00:00    17.82
Freq: H, Name: tempF, dtype: float64

In [10]:
temperature_data_daily.head()


Out[10]:
dt
2015-11-22 06:00:00+00:00    21.01
2015-11-22 07:00:00+00:00    20.35
2015-11-22 08:00:00+00:00    19.38
2015-11-22 09:00:00+00:00    19.02
2015-11-22 10:00:00+00:00    17.82
Freq: H, Name: tempF, dtype: float64

In [11]:
temperature_data_billing.head()


Out[11]:
dt
2015-11-22 06:00:00+00:00    21.01
2015-11-22 07:00:00+00:00    20.35
2015-11-22 08:00:00+00:00    19.38
2015-11-22 09:00:00+00:00    19.02
2015-11-22 10:00:00+00:00    17.82
Freq: H, Name: tempF, dtype: float64

Plotting (part 1)

The eemeter plotting functions allow visual exploration of meter and temperature data.

Time series plots

Plotting in time series, we see the difference in the frequency of the data more clearly.


In [12]:
eemeter.plot_time_series(meter_data_hourly, temperature_data_hourly, figsize=(16, 4))


Out[12]:
(<matplotlib.axes._subplots.AxesSubplot at 0x7fb67c481240>,
 <matplotlib.axes._subplots.AxesSubplot at 0x7fb67a44a630>)

In [13]:
eemeter.plot_time_series(meter_data_daily, temperature_data_daily, figsize=(16, 4))


Out[13]:
(<matplotlib.axes._subplots.AxesSubplot at 0x7fb679f11cf8>,
 <matplotlib.axes._subplots.AxesSubplot at 0x7fb679edacf8>)

In [14]:
eemeter.plot_time_series(meter_data_billing, temperature_data_billing, figsize=(16, 4))


Out[14]:
(<matplotlib.axes._subplots.AxesSubplot at 0x7fb679f1a320>,
 <matplotlib.axes._subplots.AxesSubplot at 0x7fb67a364470>)

Energy signature plots

The following stacks the three versions of the data - hourly, daily, and billing - right on top of each other in energy signature form. This shows the dependence of usage on outdoor temperature. These plots convert the meter data to "usage per day", which normalizes things and makes usage patterns appear roughly comparable at different sampling intervals.

Remember, this data is simulated. If these correlations look too good to be true, they are!
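The "usage per day" normalization these plots apply is simple to sketch by hand. This is an illustration with made-up billing values, not the eemeter implementation:

```python
import pandas as pd

# Convert billing-period totals to average usage per day so that billing,
# daily, and hourly data plot on comparable scales.
index = pd.DatetimeIndex(
    ["2017-01-01", "2017-02-02", "2017-03-01"], tz="UTC", name="start"
)
billing = pd.DataFrame({"value": [640.0, 540.0, float("nan")]}, index=index)

# Each period's length in days is the gap to the next start timestamp.
days = billing.index.to_series().diff().shift(-1).dt.total_seconds() / 86400
usage_per_day = billing["value"] / days
print(usage_per_day.tolist())
```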


In [15]:
ax = eemeter.plot_energy_signature(meter_data_hourly, temperature_data_hourly, figsize=(14, 8))
eemeter.plot_energy_signature(meter_data_daily, temperature_data_daily, ax=ax)
eemeter.plot_energy_signature(meter_data_billing, temperature_data_billing, ax=ax)
ax.legend(labels=['hourly', 'daily', 'billing'])


Out[15]:
<matplotlib.legend.Legend at 0x7fb67a3e3ac8>

Filtering to baseline data

The CalTRACK methods require building a model of the usage during the baseline period and then projecting that forward into the reporting period. Before we can build the baseline model, we need to isolate up to 365 days of meter data as close to the end of the baseline period as we can get. The following function performs this filtering for us and returns a new dataset containing only baseline data.


In [16]:
baseline_meter_data_hourly, baseline_warnings_hourly = eemeter.get_baseline_data(
    meter_data_hourly, end=baseline_end_date, max_days=365)

baseline_meter_data_daily, baseline_warnings_daily = eemeter.get_baseline_data(
    meter_data_daily, end=baseline_end_date, max_days=365)

baseline_meter_data_billing, baseline_warnings_billing = eemeter.get_baseline_data(
    meter_data_billing, end=baseline_end_date, max_days=365)

To give you a sense for what this data looks like, let's tail the data again. Remember that we had a baseline end date of 2016-12-26 - this data goes up to that date but no further, as we specified above with the end argument. It's also no more than 365 days long, as we specified above with the max_days argument. Notice that the billing data is a bit shorter because of the unevenness of billing periods. Billing periods that fall across (rather than exactly at) the boundaries are removed by this method.


In [17]:
baseline_meter_data_hourly.tail()


Out[17]:
value
start
2016-12-25 20:00:00+00:00 0.14
2016-12-25 21:00:00+00:00 0.10
2016-12-25 22:00:00+00:00 0.19
2016-12-25 23:00:00+00:00 0.06
2016-12-26 00:00:00+00:00 NaN

In [18]:
baseline_meter_data_daily.tail()


Out[18]:
value
start
2016-12-22 00:00:00+00:00 30.79
2016-12-23 00:00:00+00:00 28.74
2016-12-24 00:00:00+00:00 31.63
2016-12-25 00:00:00+00:00 26.45
2016-12-26 00:00:00+00:00 NaN

In [19]:
baseline_meter_data_billing.tail()


Out[19]:
value
start
2016-08-22 06:00:00+00:00 942.15
2016-09-22 06:00:00+00:00 632.31
2016-10-24 06:00:00+00:00 538.24
2016-11-23 06:00:00+00:00 921.55
2016-12-19 06:00:00+00:00 NaN

If there had been any issues (e.g., unexpected gaps in the data) in filtering the data to the baseline period, some warnings would have been reported. This time we got off easy, but that will not always be the case in real-life datasets.


In [20]:
baseline_warnings_hourly, baseline_warnings_daily, baseline_warnings_billing


Out[20]:
([], [], [])

Clean CalTRACK Daily/Billing meter data

CalTRACK defines certain changes to the meter data such as:

  • CalTRACK 2.2.3.4: Off-cycle reads (spanning less than 25 days) should be dropped from analysis
  • CalTRACK 2.2.3.5: For pseudo-monthly billing cycles, periods spanning more than 35 days should be dropped from analysis. For bi-monthly billing cycles, periods spanning more than 70 days should be dropped from the analysis.
  • CalTRACK 2.2.2.1: If summing to daily usage from higher frequency interval data, no more than 50% of high-frequency values should be missing. Missing values should be filled in with average of non-missing values (e.g., for hourly data, 24 * average hourly usage).
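The period-length rules above (2.2.3.4 and 2.2.3.5, for pseudo-monthly cycles) can be sketched roughly as follows. This is an illustration with made-up data, not the eemeter implementation:

```python
import pandas as pd

# Billing data with one off-cycle read (14 days) and one overlong period
# (46 days); the trailing NaN row carries the final end date.
index = pd.DatetimeIndex(
    ["2017-01-01", "2017-01-15", "2017-02-14", "2017-04-01", "2017-05-01"],
    tz="UTC", name="start",
)
billing = pd.DataFrame(
    {"value": [100.0, 310.0, 460.0, 305.0, float("nan")]}, index=index
)

# Keep only periods spanning 25-35 days (pseudo-monthly limits).
lengths = billing.index.to_series().diff().shift(-1).dt.days
cleaned = billing.loc[(lengths >= 25) & (lengths <= 35)]
print(cleaned)
```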

A helper function, clean_caltrack_billing_daily_data, was created to handle these cases.


In [24]:
baseline_meter_data_billing = eemeter.clean_caltrack_billing_daily_data(baseline_meter_data_billing, 'billing')

baseline_meter_data_daily = eemeter.clean_caltrack_billing_daily_data(baseline_meter_data_daily, 'daily')

baseline_meter_data_daily_from_hourly = eemeter.clean_caltrack_billing_daily_data(baseline_meter_data_hourly, 'hourly')

Creating CalTRACK Daily/Billing Method datasets

The CalTRACK daily and billing methods specify a way of modeling the energy signature we plotted a few cells above. We need to select a model which fits the data as well as possible. The parameters in the model are heating and cooling balance points (i.e., the temperatures at which heating/cooling related energy use tends to kick in), and the heating and cooling beta parameters, which define the slope of the energy response to incremental differences between outdoor temperature and the balance point. We'll do a grid search over possible heating and cooling balance points and fit models to the heating and cooling degree days defined by the outdoor temperatures and each of those balance points. To do this, we precompute the heating and cooling degree days using the methods below before we feed them into the modeling routines.

To make this dataset, we need to merge the meter data and temperature data into a single DataFrame. The compute_usage_per_day_feature function transforms the meter data into usage per day. The compute_temperature_features function lets us create a bunch of heating and cooling degree day values if we specify balance points to use. In this case, we'll use the wide balance point ranges recommended in the CalTRACK spec. Then we can combine the two using merge_features.
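Under the hood, a daily degree-day feature for balance point b is the daily average of max(b - T, 0) (heating) or max(T - b, 0) (cooling) over the hourly temperatures T. A minimal sketch with made-up temperatures (not the eemeter implementation):

```python
import numpy as np
import pandas as pd

# Two days of made-up hourly temperatures (degrees F).
index = pd.date_range("2017-01-01", periods=48, freq="h", tz="UTC")
temps = pd.Series(np.tile(np.linspace(30.0, 80.0, 24), 2), index=index)

# For each candidate balance point, compute average degree days per day:
# sum the hourly heating/cooling degrees and divide by 24.
balance_points = range(55, 61)
features = {}
for bp in balance_points:
    features[f"hdd_{bp}"] = (bp - temps).clip(lower=0).resample("D").sum() / 24
    features[f"cdd_{bp}"] = (temps - bp).clip(lower=0).resample("D").sum() / 24

degree_days = pd.DataFrame(features)
print(degree_days.shape)
```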


In [25]:
design_matrix_daily = eemeter.create_caltrack_daily_design_matrix(
    baseline_meter_data_daily, temperature_data_daily,
)

A preview of this dataset is shown below:


In [26]:
design_matrix_daily.tail()


Out[26]:
meter_value temperature_not_null temperature_null temperature_mean n_days_kept n_days_dropped cdd_30 cdd_31 cdd_32 cdd_33 ... hdd_81 hdd_82 hdd_83 hdd_84 hdd_85 hdd_86 hdd_87 hdd_88 hdd_89 hdd_90
start
2016-12-22 00:00:00+00:00 30.79 24.0 0.0 37.755833 1.0 0.0 7.755833 6.755833 5.755833 4.755833 ... 43.244167 44.244167 45.244167 46.244167 47.244167 48.244167 49.244167 50.244167 51.244167 52.244167
2016-12-23 00:00:00+00:00 28.74 24.0 0.0 34.483333 1.0 0.0 4.483333 3.483333 2.483333 1.483333 ... 46.516667 47.516667 48.516667 49.516667 50.516667 51.516667 52.516667 53.516667 54.516667 55.516667
2016-12-24 00:00:00+00:00 31.63 24.0 0.0 40.632500 1.0 0.0 10.632500 9.632500 8.632500 7.632500 ... 40.367500 41.367500 42.367500 43.367500 44.367500 45.367500 46.367500 47.367500 48.367500 49.367500
2016-12-25 00:00:00+00:00 26.45 24.0 0.0 39.362500 1.0 0.0 9.362500 8.362500 7.362500 6.362500 ... 41.637500 42.637500 43.637500 44.637500 45.637500 46.637500 47.637500 48.637500 49.637500 50.637500
2016-12-26 00:00:00+00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 128 columns


In [27]:
design_matrix_daily.index.min(), design_matrix_daily.index.max()


Out[27]:
(Timestamp('2015-12-27 00:00:00+0000', tz='UTC', freq='D'),
 Timestamp('2016-12-26 00:00:00+0000', tz='UTC', freq='D'))

We can do roughly the same thing for the billing data, adding a tolerance as specified in the CalTRACK methods.


In [28]:
design_matrix_billing = eemeter.create_caltrack_billing_design_matrix(
    baseline_meter_data_billing, temperature_data_billing,
)

You'll notice that this billing data shares the structure used above for the daily data. Notice, however, that the magnitude of the meter value column is significantly smaller than it was before calling compute_usage_per_day_feature - that is because the values are returned as average usage per day, as specified by the CalTRACK methods, not as totals per period, as they are represented in the inputs. The heating/cooling degree days returned by compute_temperature_features are also average heating/cooling degree days per day, and not total heating/cooling degree days per period. This averaging behavior can be modified with the use_mean_daily_values parameter, which is set to True by default.


In [29]:
design_matrix_billing.tail()


Out[29]:
meter_value temperature_not_null temperature_null temperature_mean n_days_kept n_days_dropped cdd_30 cdd_31 cdd_32 cdd_33 ... hdd_81 hdd_82 hdd_83 hdd_84 hdd_85 hdd_86 hdd_87 hdd_88 hdd_89 hdd_90
start
2016-08-22 06:00:00+00:00 30.391935 744.0 0.0 74.667392 31.0 0.0 44.667392 43.667392 42.667392 41.667392 ... 6.476290 7.411774 8.350968 9.332608 10.332608 11.332608 12.332608 13.332608 14.332608 15.332608
2016-09-22 06:00:00+00:00 19.759687 768.0 0.0 64.546667 32.0 0.0 34.546667 33.546667 32.546667 31.546667 ... 16.453333 17.453333 18.453333 19.453333 20.453333 21.453333 22.453333 23.453333 24.453333 25.453333
2016-10-24 06:00:00+00:00 17.941333 720.0 0.0 53.130458 30.0 0.0 23.130458 22.130458 21.143861 20.177194 ... 27.869542 28.869542 29.869542 30.869542 31.869542 32.869542 33.869542 34.869542 35.869542 36.869542
2016-11-23 06:00:00+00:00 35.444231 624.0 0.0 32.976378 26.0 0.0 6.240080 5.624696 5.040913 4.463990 ... 48.023622 49.023622 50.023622 51.023622 52.023622 53.023622 54.023622 55.023622 56.023622 57.023622
2016-12-19 06:00:00+00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 128 columns


In [30]:
design_matrix_billing.index.min(), design_matrix_billing.index.max()


Out[30]:
(Timestamp('2016-01-22 06:00:00+0000', tz='UTC'),
 Timestamp('2016-12-19 06:00:00+0000', tz='UTC'))

If you are not running the CalTRACK hourly methods, at this point you should skip down to the section called "Running the CalTRACK Billing/Daily Methods".

Creating CalTRACK Hourly Method datasets

The hourly methods require a multi-stage dataset creation process which is a bit more involved than the daily/billing dataset creation process above. There are two primary reasons for this extra complexity. First, unlike the daily/billing methods, the hourly methods build separate models for each calendar month, which adds a few extra steps. Second, also unlike the billing and daily methods, there are two features of the dataset creation which must themselves be fitted to a preliminary dataset -- the occupancy feature and the temperature bin features.

Creating a preliminary dataset

The preliminary dataset has some simple time and temperature features. These features do not vary by segment and are precursors to other features (See below for a better explanation of segmentation). This step looks a lot like the daily/billing dataset creation. These features are used subsequently to fit the occupancy and temperature bin features.


In [31]:
preliminary_design_matrix_hourly = eemeter.create_caltrack_hourly_preliminary_design_matrix(
    baseline_meter_data_hourly, temperature_data_hourly,
)

Let's take a peek at this data. This time, we have only two fixed heating and cooling degree day columns - these are used to fit the occupancy model. But we also have an hour of week column, which is a categorical variable indicating the hour of the week using an integer from 0 to 167 (i.e., 7 * 24 = 168 hours in a week).


In [32]:
preliminary_design_matrix_hourly.tail()


Out[32]:
meter_value temperature_mean cdd_65 hdd_50 n_hours_dropped n_hours_kept hour_of_week
start
2016-12-25 20:00:00+00:00 0.14 42.30 0.0 7.70 0.0 1.0 164.0
2016-12-25 21:00:00+00:00 0.10 42.99 0.0 7.01 0.0 1.0 165.0
2016-12-25 22:00:00+00:00 0.19 43.65 0.0 6.35 0.0 1.0 166.0
2016-12-25 23:00:00+00:00 0.06 44.64 0.0 5.36 0.0 1.0 167.0
2016-12-26 00:00:00+00:00 NaN NaN NaN NaN NaN NaN NaN

Segmentation

To handle creating multiple independent models on a shared dataset (as is required for CalTRACK hourly), we have introduced a concept which we are calling segmentation. Segmentation breaks a dataset into $n$ named and weighted subsets.

Before we can move on to the next steps of creating the CalTRACK hourly dataset, we need to create a segmentation for the hourly data. We will use this to create 12 independent hourly models - one for each month of the calendar year. The eemeter function for creating these weights is called segment_time_series and it takes a DatetimeIndex as input.

This segmentation matrix contains 1 column for each segment (12 in all), each of which contains the segmentation weights for that segment. The segmentation scheme we use here is to have one segment for each month, which contains a single fully weighted calendar month and two half-weighted neighboring calendar months. The eemeter name for this segmentation scheme is 'three_month_weighted' (there are also all, one_month, and three_month schemes, each of which behaves a bit differently).

We are creating this segmentation over the time index of the baseline period that is represented in the preliminary hourly design matrix.
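The weighting scheme is easy to sketch by hand. This is an illustration of the idea only; three_month_weights is a hypothetical helper, not part of eemeter:

```python
import pandas as pd

def three_month_weights(index, target_month):
    """Weight 1.0 for the target calendar month, 0.5 for its two neighbors."""
    months = pd.Series(index.month, index=index)
    prev_month = (target_month - 2) % 12 + 1   # wraps Dec <-> Jan
    next_month = target_month % 12 + 1
    weights = pd.Series(0.0, index=index)
    weights[months == target_month] = 1.0
    weights[months.isin([prev_month, next_month])] = 0.5
    return weights

# Four days straddling a year boundary: December hours get half weight in
# the January segment, January hours get full weight.
index = pd.date_range("2016-12-30", periods=96, freq="h", tz="UTC")
jan_weights = three_month_weights(index, target_month=1)
print(jan_weights.value_counts().to_dict())
```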


In [33]:
segmentation_hourly = eemeter.segment_time_series(
    preliminary_design_matrix_hourly.index,
    'three_month_weighted'
)
segmentation_hourly.head()


Out[33]:
dec-jan-feb-weighted jan-feb-mar-weighted feb-mar-apr-weighted mar-apr-may-weighted apr-may-jun-weighted may-jun-jul-weighted jun-jul-aug-weighted jul-aug-sep-weighted aug-sep-oct-weighted sep-oct-nov-weighted oct-nov-dec-weighted nov-dec-jan-weighted
start
2015-12-27 00:00:00+00:00 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 1.0
2015-12-27 01:00:00+00:00 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 1.0
2015-12-27 02:00:00+00:00 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 1.0
2015-12-27 03:00:00+00:00 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 1.0
2015-12-27 04:00:00+00:00 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 1.0

These segments are probably a bit easier to understand when plotted visually. The areas in the following chart represent the weights assigned to the data at particular hours. A weight of 1 is full weight; a weight of 0 indicates that the data is ignored for that segment. These segments look like 3-month-long tetris blocks and indicate half-weight/full-weight/half-weight for the three months they cover. For instance, the dec-jan-feb-weighted segment (which will eventually be used to estimate usage for January) includes a fully weighted January but also half-weighted December and February. These weights wrap around the calendar year, so data from both ends of the baseline period (e.g., December 2015 and December 2016) can end up in the same segment.


In [34]:
# example segmentation weights
segmentation_hourly[[
    'dec-jan-feb-weighted',
    'apr-may-jun-weighted',
    'jun-jul-aug-weighted'
]].plot.area(stacked=False, alpha=0.3, figsize=(15, 2.5))


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb679f5f198>

Fitting segmented occupancy lookups

Occupancy is estimated by building a simple model from the preliminary design matrix hdd_50 and cdd_65 columns. This is done for each segment independently, so results are returned as a dataframe with one segment of results per column. The segmentation argument indicates that the analysis should be done once per segment. Occupancy is determined by hour of week category. A value of True for a particular hour indicates an "occupied" mode, and a value of False indicates an "unoccupied" mode. These modes are determined by the tendency of the hdd_50/cdd_65 model to over- or under-predict usage for that hour, given a particular threshold between 0 and 1: if the fraction of under-predictions (by count) is lower than that threshold, the mode is "unoccupied"; otherwise it is "occupied".
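The thresholding idea can be sketched with toy residuals. This illustrates the logic only, not the eemeter implementation, and the Monday usage shift is fabricated:

```python
import numpy as np
import pandas as pd

# Eight weeks of hourly timestamps starting on a Monday.
rng = np.random.default_rng(0)
index = pd.date_range("2017-01-02", periods=24 * 7 * 8, freq="h", tz="UTC")
hour_of_week = pd.Series(index.dayofweek * 24 + index.hour, index=index)

# Fake residuals (actual - predicted) from a temperature-only model; shift
# Monday hours upward to mimic extra "occupied" usage the model misses.
residuals = pd.Series(rng.normal(0.0, 1.0, len(index)), index=index)
residuals[hour_of_week < 24] += 5.0

# An hour of week is "occupied" if the model under-predicts (positive
# residual) more often than the threshold fraction of the time.
threshold = 0.65
underpredict_fraction = (residuals > 0).groupby(hour_of_week).mean()
occupancy = underpredict_fraction > threshold
print(int(occupancy.sum()))
```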


In [35]:
occupancy_lookup_hourly = eemeter.estimate_hour_of_week_occupancy(
    preliminary_design_matrix_hourly,
    segmentation=segmentation_hourly,
    # threshold=0.65  # default
)

The occupancy lookup is organized by hour of week (rows) and model segment (columns).


In [36]:
occupancy_lookup_hourly.head()


Out[36]:
dec-jan-feb-weighted jan-feb-mar-weighted feb-mar-apr-weighted mar-apr-may-weighted apr-may-jun-weighted may-jun-jul-weighted jun-jul-aug-weighted jul-aug-sep-weighted aug-sep-oct-weighted sep-oct-nov-weighted oct-nov-dec-weighted nov-dec-jan-weighted
hour_of_week
0 False False False False False False False True True True False False
1 False False False False False False False False False False False False
2 False False False False False False False True False False False False
3 False False True True False False False False False False False False
4 False False False True True True False False False False False False

Fitting segmented temperature bins

Temperature bins are fit for each segment such that each bin has a sufficient number of temperature readings. Bins are defined by starting with a proposed set of bins (see the default_bins argument) and systematically dropping bin endpoints. Bins themselves are not dropped but are effectively combined with neighboring bins. Except for the fact that zero-weighted times are dropped, segment weights are not considered when fitting temperature bins.

  • eemeter.fit_temperature_bins: Fit temperature bins to data, dropping bin endpoints for bins that do not meet the minimum temperature count such that remaining bins meet the minimum count.
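A simplified sketch of the endpoint-dropping idea, with made-up temperatures. This heuristic is an illustration only, not the eemeter algorithm:

```python
import numpy as np
import pandas as pd

# Made-up temperatures: plenty of cold and mild readings, very few hot ones.
temps = pd.Series(
    np.concatenate([np.full(200, 40.0), np.full(200, 60.0), np.full(5, 80.0)])
)

endpoints = [30, 45, 55, 65, 75, 90]   # proposed bin endpoints
min_count = 20

def bin_counts(endpoints):
    # Count readings per bin, with open-ended outer bins.
    edges = [-np.inf] + endpoints + [np.inf]
    return pd.cut(temps, bins=edges).value_counts(sort=False)

# Repeatedly merge the sparsest bin into a neighbor by dropping one of its
# endpoints, until every remaining bin meets the minimum count.
while len(endpoints) > 1:
    counts = bin_counts(endpoints)
    if counts.min() >= min_count:
        break
    i = int(counts.values.argmin())                 # sparsest bin
    endpoints.pop(i if i < len(endpoints) else -1)  # drop a bounding endpoint

print(endpoints)
```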

In [37]:
temperature_bins_hourly = eemeter.fit_temperature_bins(
    preliminary_design_matrix_hourly,
    segmentation=segmentation_hourly,
    # default_bins=[30, 45, 55, 65, 75, 90],  # default
    # min_temperature_count=20  # default
)

Because bin fitting and validation is done independently for each segment, results are returned as a dataframe with one segment of results per column. The contents of the dataframe are boolean indicators of whether the bin endpoint should be used for temperatures in that segment. Some bin endpoints are dropped because of insufficient reading counts. The bin endpoints that are dropped for each segment are given a value of False. You'll notice in this dataset that the winter months tend to have combined high temperature bins and the summer months tend to have combined low temperature bins.


In [38]:
temperature_bins_hourly


Out[38]:
dec-jan-feb-weighted jan-feb-mar-weighted feb-mar-apr-weighted mar-apr-may-weighted apr-may-jun-weighted may-jun-jul-weighted jun-jul-aug-weighted jul-aug-sep-weighted aug-sep-oct-weighted sep-oct-nov-weighted oct-nov-dec-weighted nov-dec-jan-weighted
bin_endpoints
30 True True True True False False False False False True True True
45 True True True True True False False False True True True True
55 True True True True True True False False True True True True
65 False True True True True True True True True True True True
75 False False True True True True True True True True True False
90 False False False False True True True True True False False False

With these in hand, we can now combine them into a segmented dataset using the helper function iterate_segmented_dataset and a prefabricated feature processor, caltrack_hourly_fit_feature_processor, which is provided to assist in creating the segmented dataset given a preliminary design matrix of the form created above. The feature processor transforms each segment of the dataset using the occupancy lookup and temperature bins created above. We are creating a python dict of pandas DataFrames - one for each time series segment encountered in the baseline data.


In [39]:
segmented_design_matrices_hourly = eemeter.create_caltrack_hourly_segmented_design_matrices(
    preliminary_design_matrix_hourly,
    segmentation_hourly,
    occupancy_lookup_hourly,
    temperature_bins_hourly,
)

The keys of the dict are segment names. The values are DataFrame objects containing the exact data needed to fit a CalTRACK hourly model.


In [40]:
print(segmented_design_matrices_hourly.keys())
segmented_design_matrices_hourly['dec-jan-feb-weighted'].head()


dict_keys(['dec-jan-feb-weighted', 'jan-feb-mar-weighted', 'feb-mar-apr-weighted', 'mar-apr-may-weighted', 'apr-may-jun-weighted', 'may-jun-jul-weighted', 'jun-jul-aug-weighted', 'jul-aug-sep-weighted', 'aug-sep-oct-weighted', 'sep-oct-nov-weighted', 'oct-nov-dec-weighted', 'nov-dec-jan-weighted'])
Out[40]:
meter_value hour_of_week occupancy bin_0 bin_1 bin_2 bin_3 weight
start
2015-12-27 00:00:00+00:00 0.46 144 False 30.0 15.0 4.98 0.0 0.5
2015-12-27 01:00:00+00:00 0.95 145 False 30.0 15.0 3.77 0.0 0.5
2015-12-27 02:00:00+00:00 0.73 146 False 30.0 15.0 2.33 0.0 0.5
2015-12-27 03:00:00+00:00 0.23 147 False 30.0 15.0 1.07 0.0 0.5
2015-12-27 04:00:00+00:00 2.52 148 False 30.0 15.0 0.33 0.0 0.5

Running the CalTRACK Billing/Daily Methods

The following cells use the design matrix datasets that we created in the previous steps to fit CalTRACK models. This gives us a baseline model, which is the key predictive component of the metered savings calculation.


In [41]:
baseline_model_results_daily = eemeter.fit_caltrack_usage_per_day_model(
    design_matrix_daily,
)

In [42]:
baseline_model_results_billing = eemeter.fit_caltrack_usage_per_day_model(
    design_matrix_billing,
    use_billing_presets=True,
    weights_col='n_days_kept',
)

Running the CalTRACK Hourly Methods


In [43]:
baseline_segmented_model_hourly = eemeter.fit_caltrack_hourly_model(
    segmented_design_matrices_hourly,
    occupancy_lookup_hourly,
    temperature_bins_hourly,
)

Plotting (part 2)

Daily and billing method plots

To see what this model fit looks like, we can plot the result against the energy signature.


In [44]:
ax = eemeter.plot_energy_signature(meter_data_daily, temperature_data_daily)
baseline_model_results_daily.plot(ax=ax, temp_range=(-5, 88))


Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb6703a0a90>

In [45]:
ax = eemeter.plot_energy_signature(meter_data_billing, temperature_data_billing)
baseline_model_results_billing.plot(ax=ax, temp_range=(18, 80))


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb6704b0048>

We can also compare the two models and see that there is a slight, but not drastic, difference between them.


In [46]:
ax = baseline_model_results_daily.model.plot(color='C0', best=True, label='daily')
ax = baseline_model_results_billing.model.plot(ax=ax, color='C1', best=True, label='billing')
ax.legend()


Out[46]:
<matplotlib.legend.Legend at 0x7fb6704c65c0>

Reporting period metered savings

The moment of truth. Calculating CalTRACK metered savings.

First, we need a reporting period. The following gets the reporting period start date.


In [47]:
reporting_start_date = metadata_billing['blackout_start_date']

Now we get the first year of data for that period.


In [48]:
reporting_meter_data_hourly, warnings = eemeter.get_reporting_data(
    meter_data_hourly, start=reporting_start_date, max_days=365)

reporting_meter_data_daily, warnings = eemeter.get_reporting_data(
    meter_data_daily, start=reporting_start_date, max_days=365)

reporting_meter_data_billing, warnings = eemeter.get_reporting_data(
    meter_data_billing, start=reporting_start_date, max_days=365)
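eemeter.get_reporting_data also handles warnings and edge cases, but its core behavior can be approximated with plain pandas slicing. This is a sketch on toy daily data, not the library's actual implementation:

```python
import pandas as pd

# Toy daily meter data spanning two years.
meter_data = pd.DataFrame(
    {"value": range(730)},
    index=pd.date_range("2016-01-01", periods=730, freq="D", tz="UTC"),
)

# Approximate get_reporting_data(start=..., max_days=365) with a slice:
# keep rows at or after the start date, up to 365 days later.
start = pd.Timestamp("2016-12-26", tz="UTC")
end = start + pd.Timedelta(days=365)
reporting = meter_data[(meter_data.index >= start) & (meter_data.index < end)]
print(len(reporting), reporting.index.min(), reporting.index.max())
```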

The eemeter.metered_savings method performs the logic of estimating counterfactual baseline usage during the reporting period. For this, it requires the fitted baseline model, the reporting period meter data (for its index, so that it can be properly joined later), and corresponding temperature data. Note that this method can return results either as aggregated usage or disaggregated into base load, heating load, and cooling load (via with_disaggregated=True). We do both here for demonstration purposes.


In [49]:
metered_savings_hourly, error_bands = eemeter.metered_savings(
    baseline_segmented_model_hourly, reporting_meter_data_hourly,
    temperature_data_hourly
)

metered_savings_daily, error_bands = eemeter.metered_savings(
    baseline_model_results_daily, reporting_meter_data_daily,
    temperature_data_daily, with_disaggregated=True
)

metered_savings_billing, error_bands = eemeter.metered_savings(
    baseline_model_results_billing, reporting_meter_data_billing,
    temperature_data_billing, with_disaggregated=True
)

In [50]:
metered_savings_hourly.head()


Out[50]:
reporting_observed counterfactual_usage metered_savings
start
2016-12-26 00:00:00+00:00 0.71 1.331161 0.621161
2016-12-26 01:00:00+00:00 0.54 0.271275 -0.268725
2016-12-26 02:00:00+00:00 1.12 -0.067412 -1.187412
2016-12-26 03:00:00+00:00 0.66 0.204523 -0.455477
2016-12-26 04:00:00+00:00 1.01 0.357236 -0.652764
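Each row of these results satisfies metered_savings = counterfactual_usage - reporting_observed; positive values mean energy was saved. A minimal arithmetic check, using the values from the first hourly row above:

```python
# metered_savings is the counterfactual (modeled baseline) usage minus the
# observed reporting-period usage; positive values indicate savings.
reporting_observed = 0.71
counterfactual_usage = 1.331161

metered_savings = counterfactual_usage - reporting_observed
print(round(metered_savings, 6))  # matches the first row of the hourly output
```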

In [51]:
metered_savings_daily.head()


Out[51]:
reporting_observed counterfactual_usage metered_savings counterfactual_base_load counterfactual_heating_load counterfactual_cooling_load
start
2016-12-26 00:00:00+00:00 12.75 13.472386 0.722386 13.472386 0.000000 0.0
2016-12-27 00:00:00+00:00 22.68 32.476405 9.796405 13.472386 19.004020 0.0
2016-12-28 00:00:00+00:00 28.52 32.419400 3.899400 13.472386 18.947015 0.0
2016-12-29 00:00:00+00:00 36.85 29.588147 -7.261853 13.472386 16.115761 0.0
2016-12-30 00:00:00+00:00 26.87 33.227204 6.357204 13.472386 19.754819 0.0

In [52]:
metered_savings_billing.head()


Out[52]:
reporting_observed counterfactual_usage metered_savings counterfactual_base_load counterfactual_heating_load counterfactual_cooling_load
start
2017-01-21 06:00:00+00:00 717.58 871.909019 154.329019 416.022903 455.886117 0.000000
2017-02-21 06:00:00+00:00 682.58 767.491490 84.911490 416.022903 351.468587 0.000000
2017-03-24 06:00:00+00:00 503.42 521.984343 18.564343 429.442996 46.631350 45.909996
2017-04-25 06:00:00+00:00 467.65 597.360739 129.710739 429.442996 22.066013 145.851730
2017-05-27 06:00:00+00:00 929.39 955.441124 26.051124 389.182715 0.000000 566.258409

In [53]:
columns = ["reporting_observed", "counterfactual_usage", "metered_savings"]

In [54]:
metered_savings_hourly[columns].resample('MS').sum().plot(figsize=(10, 6), drawstyle="steps-post")


Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb6704dc198>

In [55]:
metered_savings_daily[columns].resample('MS').sum().plot(figsize=(10, 6), drawstyle="steps-post")


Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb6703f1710>

In [56]:
metered_savings_billing[columns].plot(figsize=(10, 6), drawstyle="steps-post")


Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb6701fe7f0>
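The .resample('MS').sum() calls above bucket a time series into calendar months, labeled by month start. A toy example of the same pandas operation, unrelated to eemeter itself:

```python
import pandas as pd

# Two months of toy daily savings values (1.0 kWh saved per day).
daily = pd.Series(
    1.0,
    index=pd.date_range("2017-01-01", periods=59, freq="D", tz="UTC"),
)

# 'MS' labels each bucket by the month start; sum() totals each month.
monthly = daily.resample("MS").sum()
print(monthly)
```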

These can be easily aggregated.


In [57]:
total_savings_hourly = metered_savings_hourly.metered_savings.sum()
percent_savings_hourly = total_savings_hourly / metered_savings_hourly.counterfactual_usage.sum() * 100
print('Hourly: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_hourly, percent_savings_hourly))

total_savings_daily = metered_savings_daily.metered_savings.sum()
percent_savings_daily = total_savings_daily / metered_savings_daily.counterfactual_usage.sum() * 100
print('Daily: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_daily, percent_savings_daily))

total_savings_billing = metered_savings_billing.metered_savings.sum()
percent_savings_billing = total_savings_billing / metered_savings_billing.counterfactual_usage.sum() * 100
print('Billing: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_billing, percent_savings_billing))


Hourly: Saved 1084.9 kWh in first year (11.0%)
Daily: Saved 814.5 kWh in first year (8.5%)
Billing: Saved 795.7 kWh in first year (9.3%)

NOTE: These results differ somewhat due to the different lengths of the reporting periods - the billing version of the reporting period was a bit shorter because the billing periods over which we computed savings didn't exactly align with the 365-day period we requested, as the daily reporting period data did.

Annual weather-normalized modeled savings

If we want to compute annual weather-normalized modeled savings, we'll need a reporting period model. The following code repeats for the reporting period what we did for the baseline period.


In [58]:
reporting_preliminary_design_matrix_hourly = eemeter.create_caltrack_hourly_preliminary_design_matrix(
    reporting_meter_data_hourly, temperature_data_hourly,
)
reporting_segmentation_hourly = eemeter.segment_time_series(
    reporting_preliminary_design_matrix_hourly.index,
    'three_month_weighted'
)
reporting_occupancy_lookup_hourly = eemeter.estimate_hour_of_week_occupancy(
    reporting_preliminary_design_matrix_hourly,
    segmentation=reporting_segmentation_hourly,
)
reporting_temperature_bins_hourly = eemeter.fit_temperature_bins(
    reporting_preliminary_design_matrix_hourly,
    segmentation=reporting_segmentation_hourly,
)
reporting_segmentation_design_matrices_hourly = eemeter.create_caltrack_hourly_segmented_design_matrices(
    reporting_preliminary_design_matrix_hourly,
    reporting_segmentation_hourly,
    reporting_occupancy_lookup_hourly,
    reporting_temperature_bins_hourly
)

In [59]:
reporting_design_matrix_daily = eemeter.create_caltrack_daily_design_matrix(
    reporting_meter_data_daily, temperature_data_daily,
)

In [60]:
reporting_design_matrix_billing = eemeter.create_caltrack_billing_design_matrix(
    reporting_meter_data_billing, temperature_data_billing,
)

In [61]:
reporting_segmented_model_hourly = eemeter.fit_caltrack_hourly_model(
    reporting_segmentation_design_matrices_hourly,
    reporting_occupancy_lookup_hourly,
    reporting_temperature_bins_hourly
)

In [62]:
reporting_model_results_daily = eemeter.fit_caltrack_usage_per_day_model(
    reporting_design_matrix_daily,
)

In [63]:
reporting_model_results_billing = eemeter.fit_caltrack_usage_per_day_model(
    reporting_design_matrix_billing,
    use_billing_presets=True,
    weights_col='n_days_kept',
)

In [64]:
ax = eemeter.plot_energy_signature(meter_data_daily, temperature_data_daily)
ax = baseline_model_results_daily.model.plot(ax=ax, color='C1', best=True, label='baseline', temp_range=(-5, 88))
ax = reporting_model_results_daily.model.plot(ax=ax, color='C2', best=True, label='reporting', temp_range=(-5, 88))
ax.legend()


Out[64]:
<matplotlib.legend.Legend at 0x7fb66bc1fa90>

In [65]:
ax = eemeter.plot_energy_signature(meter_data_billing, temperature_data_billing)
ax = baseline_model_results_billing.model.plot(ax=ax, color='C1', best=True, label='baseline', temp_range=(18, 80))
ax = reporting_model_results_billing.model.plot(ax=ax, color='C2', best=True, label='reporting', temp_range=(18, 80))
ax.legend()


Out[65]:
<matplotlib.legend.Legend at 0x7fb66bc2eb38>

The last thing we need to do before obtaining annualized and weather-normalized results is to obtain normal year temperature data. For simplicity, let's just call 2017 our "normal year". To be completely clear, this is not something you would do in practice, but this demonstrates the functionality. To use real temperature normals, check out the EEWeather package.


In [66]:
import pandas as pd
normal_year_temperatures = temperature_data_daily[temperature_data_daily.index.year == 2017]
result_index = pd.date_range('2017-01-01', periods=365, freq='D', tz='UTC')

Now we are ready to obtain our annualized savings.


In [69]:
annualized_savings_hourly, annualized_savings_warnings_hourly = eemeter.modeled_savings(
    baseline_segmented_model_hourly, reporting_segmented_model_hourly,
    result_index, normal_year_temperatures, with_disaggregated=True
)

annualized_savings_daily, annualized_savings_warnings_daily = eemeter.modeled_savings(
    baseline_model_results_daily, reporting_model_results_daily,
    result_index, normal_year_temperatures, with_disaggregated=True
)

annualized_savings_billing, annualized_savings_warnings_billing = eemeter.modeled_savings(
    baseline_model_results_billing, reporting_model_results_billing,
    result_index, normal_year_temperatures, with_disaggregated=True
)

In [70]:
annualized_savings_hourly.head()


Out[70]:
modeled_baseline_usage modeled_reporting_usage modeled_savings
2017-01-01 00:00:00+00:00 1.154961 0.908344 0.246617
2017-01-02 00:00:00+00:00 0.826994 0.911777 -0.084783
2017-01-03 00:00:00+00:00 0.870990 0.599395 0.271594
2017-01-04 00:00:00+00:00 2.365750 1.616250 0.749500
2017-01-05 00:00:00+00:00 1.333611 1.569218 -0.235607

In [71]:
annualized_savings_daily.head()


Out[71]:
modeled_baseline_usage modeled_reporting_usage modeled_savings modeled_baseline_base_load modeled_baseline_heating_load modeled_baseline_cooling_load modeled_reporting_base_load modeled_reporting_heating_load modeled_reporting_cooling_load modeled_base_load_savings modeled_heating_load_savings modeled_cooling_load_savings
2017-01-01 00:00:00+00:00 39.323043 35.088762 4.234281 13.472386 25.850658 0.0 12.155685 22.933077 0.0 1.316701 2.917580 0.0
2017-01-02 00:00:00+00:00 28.253393 24.826921 3.426472 13.472386 14.781007 0.0 12.155685 12.671236 0.0 1.316701 2.109771 0.0
2017-01-03 00:00:00+00:00 26.005167 22.742760 3.262407 13.472386 12.532781 0.0 12.155685 10.587075 0.0 1.316701 1.945706 0.0
2017-01-04 00:00:00+00:00 49.620112 44.634401 4.985711 13.472386 36.147727 0.0 12.155685 32.478716 0.0 1.316701 3.669011 0.0
2017-01-05 00:00:00+00:00 55.221907 49.827404 5.394503 13.472386 41.749521 0.0 12.155685 37.671719 0.0 1.316701 4.077802 0.0

In [72]:
annualized_savings_billing.head()


Out[72]:
modeled_baseline_usage modeled_reporting_usage modeled_savings modeled_baseline_base_load modeled_baseline_heating_load modeled_baseline_cooling_load modeled_reporting_base_load modeled_reporting_heating_load modeled_reporting_cooling_load modeled_base_load_savings modeled_heating_load_savings modeled_cooling_load_savings
2017-01-01 00:00:00+00:00 38.822906 32.898922 5.923984 13.420094 25.402812 0.0 6.544231 26.354691 0.0 6.875863 -0.951879 0.0
2017-01-02 00:00:00+00:00 28.393771 24.763770 3.630002 13.420094 14.973678 0.0 6.544231 18.219539 0.0 6.875863 -3.245861 0.0
2017-01-03 00:00:00+00:00 26.275633 23.111535 3.164098 13.420094 12.855540 0.0 6.544231 16.567304 0.0 6.875863 -3.711765 0.0
2017-01-04 00:00:00+00:00 48.524162 40.466299 8.057862 13.420094 35.104068 0.0 6.544231 33.922069 0.0 6.875863 1.182000 0.0
2017-01-05 00:00:00+00:00 53.801823 44.583092 9.218732 13.420094 40.381729 0.0 6.544231 38.038861 0.0 6.875863 2.342869 0.0

The following plots demonstrate that in this case, the billing model represents most of the modeled savings as base load savings. This reflects the behavior seen in the model comparison above.


In [73]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 1, figsize=(10, 4))

annualized_savings_hourly[[
    'modeled_baseline_usage',
    'modeled_reporting_usage',
    'modeled_savings',
]].plot(ax=axes)
axes.set_title('Total normalized/annualized savings')
plt.show()



In [74]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(4, 1, figsize=(10, 16))

annualized_savings_daily[[
    'modeled_baseline_usage',
    'modeled_reporting_usage',
    'modeled_savings',
]].plot(ax=axes[0])
axes[0].set_title('Total normalized/annualized savings')

annualized_savings_daily[[
    'modeled_baseline_cooling_load',
    'modeled_reporting_cooling_load',
    'modeled_cooling_load_savings',
]].plot(ax=axes[1])
axes[1].set_title('Modeled cooling load savings')

annualized_savings_daily[[
    'modeled_baseline_heating_load',
    'modeled_reporting_heating_load',
    'modeled_heating_load_savings',
]].plot(ax=axes[2])
axes[2].set_title('Modeled heating load savings')

ax = annualized_savings_daily[[
    'modeled_baseline_base_load',
    'modeled_reporting_base_load',
    'modeled_base_load_savings',
]].plot(ax=axes[3])
axes[3].set_title('Modeled base load savings')
lim = axes[3].set_ylim((0, None))

plt.show()



In [75]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(4, 1, figsize=(10, 16))

annualized_savings_billing[[
    'modeled_baseline_usage',
    'modeled_reporting_usage',
    'modeled_savings',
]].plot(ax=axes[0])
axes[0].set_title('Total normalized/annualized savings')

annualized_savings_billing[[
    'modeled_baseline_cooling_load',
    'modeled_reporting_cooling_load',
    'modeled_cooling_load_savings',
]].plot(ax=axes[1])
axes[1].set_title('Modeled cooling load savings')

annualized_savings_billing[[
    'modeled_baseline_heating_load',
    'modeled_reporting_heating_load',
    'modeled_heating_load_savings',
]].plot(ax=axes[2])
axes[2].set_title('Modeled heating load savings')

ax = annualized_savings_billing[[
    'modeled_baseline_base_load',
    'modeled_reporting_base_load',
    'modeled_base_load_savings',
]].plot(ax=axes[3])
axes[3].set_title('Modeled base load savings')
lim = axes[3].set_ylim((0, None))

plt.show()


In this case, as totals, the annualized savings look pretty similar to the metered savings.


In [76]:
total_savings_hourly = annualized_savings_hourly.modeled_savings.sum()
percent_savings_hourly = total_savings_hourly / annualized_savings_hourly.modeled_baseline_usage.sum() * 100
print('Hourly: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_hourly, percent_savings_hourly))

total_savings_daily = annualized_savings_daily.modeled_savings.sum()
percent_savings_daily = total_savings_daily / annualized_savings_daily.modeled_baseline_usage.sum() * 100
print('Daily: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_daily, percent_savings_daily))

total_savings_billing = annualized_savings_billing.modeled_savings.sum()
percent_savings_billing = total_savings_billing / annualized_savings_billing.modeled_baseline_usage.sum() * 100
print('Billing: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_billing, percent_savings_billing))


Hourly: Saved 33.4 kWh in first year (6.9%)
Daily: Saved 824.1 kWh in first year (8.5%)
Billing: Saved 983.5 kWh in first year (10.1%)

If we're interested in seeing more about the models the CalTRACK method tried, we can plot all the model candidates as well. There are a ton of these, so the reduced alpha makes it a bit easier to see what's going on. Each faint line represents a model that was tried and bested by the (orange) selected model, which had the highest r-squared. Candidates appear green if QUALIFIED and red if DISQUALIFIED. A model might be disqualified if it had unphysical (i.e., negative) coefficients.
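The selection logic can be sketched on hypothetical candidate records - plain dicts standing in for eemeter's candidate model objects: among QUALIFIED candidates, the one with the highest adjusted r-squared wins.

```python
# Hypothetical candidate records; the formulas and scores are illustrative,
# not the real fitted candidates from the cells above.
candidates = [
    {"formula": "meter_value ~ 1", "status": "QUALIFIED", "r_squared_adj": 0.0},
    {"formula": "meter_value ~ hdd_53", "status": "QUALIFIED", "r_squared_adj": 0.61},
    {"formula": "meter_value ~ hdd_67", "status": "DISQUALIFIED", "r_squared_adj": -0.11},
    {"formula": "meter_value ~ cdd_68 + hdd_53", "status": "QUALIFIED", "r_squared_adj": 0.77},
]

# Disqualified candidates never win, regardless of their fit quality.
qualified = [c for c in candidates if c["status"] == "QUALIFIED"]
best = max(qualified, key=lambda c: c["r_squared_adj"])
print(best["formula"])
```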


In [77]:
ax = eemeter.plot_energy_signature(meter_data_daily, temperature_data_daily)
baseline_model_results_daily.plot(
    ax=ax,
    candidate_alpha=0.02,
    with_candidates=True,
    temp_range=(-5, 88)
)


Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb66b0b2278>

In [78]:
ax = eemeter.plot_energy_signature(meter_data_billing, temperature_data_billing)
baseline_model_results_billing.plot(
    ax=ax,
    candidate_alpha=0.02,
    with_candidates=True,
    temp_range=(18, 80)
)


Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb669ec10b8>

CalTRACK Hourly plots

Coming soon.

Inspecting CalTRACKUsagePerDayModelResults objects: status, candidates, warnings, json

In addition to being plottable, the model results object is an instance of the class eemeter.CalTRACKUsagePerDayModelResults and contains a bunch of interesting information about the modeling process.

For instance, there's a status. This status is one of the following:

  • 'SUCCESS': qualified model selected.
  • 'NO MODEL': no candidate models qualified.
  • 'NO DATA': no data was given.
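Code that consumes these results might branch on the status string. A hypothetical sketch - the describe_status helper is not part of the eemeter API:

```python
# Hypothetical handling of the three possible status values described above.
def describe_status(status):
    messages = {
        "SUCCESS": "A qualified model was selected.",
        "NO MODEL": "No candidate models qualified.",
        "NO DATA": "No data was given.",
    }
    return messages.get(status, "Unknown status: {}".format(status))

print(describe_status("SUCCESS"))
```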

In [79]:
baseline_model_results_billing.status, baseline_model_results_daily.status


Out[79]:
('SUCCESS', 'SUCCESS')

There is also a "best" candidate model:


In [80]:
baseline_model_results_billing.model, baseline_model_results_daily.model


Out[80]:
(CalTRACKUsagePerDayCandidateModel(model_type='cdd_hdd', formula='meter_value ~ cdd_67 + hdd_54', status='QUALIFIED', r_squared_adj=0.994),
 CalTRACKUsagePerDayCandidateModel(model_type='cdd_hdd', formula='meter_value ~ cdd_68 + hdd_53', status='QUALIFIED', r_squared_adj=0.768))

And a list of all candidate models that were tried, many of which have (much) lower r-squared than the best model.


In [81]:
baseline_model_results_billing.candidates[:5]  # (there are a lot, so only showing the first 5)


Out[81]:
[CalTRACKUsagePerDayCandidateModel(model_type='intercept_only', formula='meter_value ~ 1', status='QUALIFIED', r_squared_adj=0),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_30', status='QUALIFIED', r_squared_adj=0.081),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_31', status='QUALIFIED', r_squared_adj=0.081),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_32', status='QUALIFIED', r_squared_adj=0.079),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_33', status='QUALIFIED', r_squared_adj=0.076)]

In [82]:
baseline_model_results_daily.candidates[:5]


Out[82]:
[CalTRACKUsagePerDayCandidateModel(model_type='intercept_only', formula='meter_value ~ 1', status='QUALIFIED', r_squared_adj=0),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_30', status='QUALIFIED', r_squared_adj=0.235),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_31', status='QUALIFIED', r_squared_adj=0.241),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_32', status='QUALIFIED', r_squared_adj=0.248),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_33', status='QUALIFIED', r_squared_adj=0.253)]

Any associated warnings live on both the model results object and the best candidate model object:


In [83]:
baseline_model_results_billing.warnings, baseline_model_results_billing.model.warnings


Out[83]:
([], [])

In [84]:
baseline_model_results_daily.warnings, baseline_model_results_daily.model.warnings


Out[84]:
([], [])

The selected models don't appear to have any issues, but some of the billing model's candidates did (see the faint red lines in the chart above).


In [85]:
disqualified_candidates = [
    candidate
    for candidate in baseline_model_results_billing.candidates
    if candidate.status == 'DISQUALIFIED'
]  # this is a python list comprehension
disqualified_candidates[:10]


Out[85]:
[CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_67', status='DISQUALIFIED', r_squared_adj=-0.111),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_68', status='DISQUALIFIED', r_squared_adj=-0.111),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_69', status='DISQUALIFIED', r_squared_adj=-0.109),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_70', status='DISQUALIFIED', r_squared_adj=-0.108),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_71', status='DISQUALIFIED', r_squared_adj=-0.105),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_72', status='DISQUALIFIED', r_squared_adj=-0.102),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_73', status='DISQUALIFIED', r_squared_adj=-0.099),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_74', status='DISQUALIFIED', r_squared_adj=-0.096),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_75', status='DISQUALIFIED', r_squared_adj=-0.092),
 CalTRACKUsagePerDayCandidateModel(model_type='hdd_only', formula='meter_value ~ hdd_76', status='DISQUALIFIED', r_squared_adj=-0.089)]

The warnings associated with the disqualified candidates are a bit more interesting. For instance, this one was disqualified because its 'beta_hdd' parameter was negative, which is unphysical behavior that the CalTRACK working group considered to be evidence of overfitting:


In [86]:
import json  # for nice indentation
warning = disqualified_candidates[0].warnings[0]
print(json.dumps(warning.json(), indent=2))


{
  "qualified_name": "eemeter.caltrack_daily.hdd_only.beta_hdd_negative",
  "description": "Model fit beta_hdd parameter is negative. Candidate model rejected.",
  "data": {
    "intercept": 27.403096292465268,
    "beta_hdd": -0.0002667900820258473,
    "heating_balance_point": 67
  }
}

The whole model results object can be serialized. The .json(with_candidates=True) flag will also serialize all candidate model results:


In [87]:
print(json.dumps(baseline_model_results_billing.json(), indent=2))


{
  "status": "SUCCESS",
  "method_name": "caltrack_usage_per_day",
  "interval": "billing",
  "model": {
    "model_type": "cdd_hdd",
    "formula": "meter_value ~ cdd_67 + hdd_54",
    "status": "QUALIFIED",
    "model_params": {
      "intercept": 13.420093629452865,
      "beta_cdd": 2.2578686654124094,
      "beta_hdd": 1.0479347638717031,
      "cooling_balance_point": 67,
      "heating_balance_point": 54
    },
    "r_squared_adj": 0.9944443281461993,
    "warnings": []
  },
  "r_squared_adj": 0.9944443281461993,
  "warnings": [],
  "metadata": {},
  "settings": {
    "fit_cdd": true,
    "minimum_non_zero_cdd": 0,
    "minimum_non_zero_hdd": 0,
    "minimum_total_cdd": 20,
    "minimum_total_hdd": 20,
    "beta_cdd_maximum_p_value": 1,
    "beta_hdd_maximum_p_value": 1
  },
  "totals_metrics": {
    "observed_length": 11.0,
    "predicted_length": 11.0,
    "merged_length": 11.0,
    "num_parameters": 2.0,
    "observed_mean": 826.9845454545452,
    "predicted_mean": 826.9845454545463,
    "observed_variance": 60832.673242975194,
    "predicted_variance": 60990.36005665033,
    "observed_skew": 0.025472367564941884,
    "predicted_skew": 0.03399907556814557,
    "observed_kurtosis": -1.9013092894472514,
    "predicted_kurtosis": -1.8624351662191883,
    "observed_cvstd": 0.3128004720412377,
    "predicted_cvstd": 0.31320562098895544,
    "r_squared": 0.9955097659655124,
    "r_squared_adj": 0.9943872074568906,
    "rmse": 16.550423225663966,
    "rmse_adj": 18.297181320370296,
    "cvrmse": 0.020012977650709498,
    "cvrmse_adj": 0.02212517926839056,
    "mape": 0.018162928799010797,
    "mape_no_zeros": 0.018162928799010797,
    "num_meter_zeros": 0.0,
    "nmae": 0.016581222523252066,
    "nmbe": 7.560934613659672e-16,
    "autocorr_resid": -0.2741205787861062
  },
  "avgs_metrics": {
    "observed_length": 11.0,
    "predicted_length": 11.0,
    "merged_length": 11.0,
    "num_parameters": 2.0,
    "observed_mean": 27.55257630879596,
    "predicted_mean": 27.544736275955856,
    "observed_variance": 68.89622790853522,
    "predicted_variance": 68.4631692541485,
    "observed_skew": -0.20540921890838723,
    "predicted_skew": -0.21655924382853398,
    "observed_kurtosis": -2.088545941643638,
    "predicted_kurtosis": -2.0567169959498837,
    "observed_cvstd": 0.31595981517065225,
    "predicted_cvstd": 0.3150548892009618,
    "r_squared": 0.995372710958323,
    "r_squared_adj": 0.9942158886979038,
    "rmse": 0.5647228136472847,
    "rmse_adj": 0.6243245611406124,
    "cvrmse": 0.020496189079313106,
    "cvrmse_adj": 0.02265938960275382,
    "mape": 0.0181629287990108,
    "mape_no_zeros": 0.0181629287990108,
    "num_meter_zeros": 0.0,
    "nmae": 0.01676862002322526,
    "nmbe": -0.0002845480855305531,
    "autocorr_resid": -0.28481353582058017
  },
  "candidates": null
}

In [88]:
print(json.dumps(baseline_model_results_daily.json(), indent=2))


{
  "status": "SUCCESS",
  "method_name": "caltrack_usage_per_day",
  "interval": "daily",
  "model": {
    "model_type": "cdd_hdd",
    "formula": "meter_value ~ cdd_68 + hdd_53",
    "status": "QUALIFIED",
    "model_params": {
      "intercept": 13.472385591795213,
      "beta_cdd": 2.516255026490126,
      "beta_hdd": 1.1122947857443488,
      "cooling_balance_point": 68,
      "heating_balance_point": 53
    },
    "r_squared_adj": 0.7684732992831353,
    "warnings": []
  },
  "r_squared_adj": 0.7684732992831353,
  "warnings": [],
  "metadata": {},
  "settings": {
    "fit_cdd": true,
    "minimum_non_zero_cdd": 10,
    "minimum_non_zero_hdd": 10,
    "minimum_total_cdd": 20,
    "minimum_total_hdd": 20,
    "beta_cdd_maximum_p_value": 1,
    "beta_hdd_maximum_p_value": 1
  },
  "totals_metrics": {
    "observed_length": 365.0,
    "predicted_length": 365.0,
    "merged_length": 365.0,
    "num_parameters": 2.0,
    "observed_mean": 28.515205479452053,
    "predicted_mean": 28.515205479452053,
    "observed_variance": 227.14091865640836,
    "predicted_variance": 174.8406827417312,
    "observed_skew": 0.8886811331096771,
    "predicted_skew": 0.5455712763914577,
    "observed_kurtosis": 0.8614282249213239,
    "predicted_kurtosis": -0.6097327958678194,
    "observed_cvstd": 0.5292573816587004,
    "predicted_cvstd": 0.4643446881483751,
    "r_squared": 0.7697454240123474,
    "r_squared_adj": 0.7684732992831338,
    "rmse": 7.2318902034445225,
    "rmse_adj": 7.25178539977947,
    "cvrmse": 0.2536152232413613,
    "cvrmse_adj": 0.25431292806236583,
    "mape": 0.24130173784949416,
    "mape_no_zeros": 0.24130173784949416,
    "num_meter_zeros": 0.0,
    "nmae": 0.19518739588748232,
    "nmbe": -1.2151806242792632e-16,
    "autocorr_resid": -0.014795668094989898
  },
  "avgs_metrics": {
    "observed_length": 365.0,
    "predicted_length": 365.0,
    "merged_length": 365.0,
    "num_parameters": 2.0,
    "observed_mean": 28.515205479452053,
    "predicted_mean": 28.515205479452053,
    "observed_variance": 227.14091865640836,
    "predicted_variance": 174.8406827417312,
    "observed_skew": 0.8886811331096771,
    "predicted_skew": 0.5455712763914577,
    "observed_kurtosis": 0.8614282249213239,
    "predicted_kurtosis": -0.6097327958678194,
    "observed_cvstd": 0.5292573816587004,
    "predicted_cvstd": 0.4643446881483751,
    "r_squared": 0.7697454240123474,
    "r_squared_adj": 0.7684732992831338,
    "rmse": 7.2318902034445225,
    "rmse_adj": 7.25178539977947,
    "cvrmse": 0.2536152232413613,
    "cvrmse_adj": 0.25431292806236583,
    "mape": 0.24130173784949416,
    "mape_no_zeros": 0.24130173784949416,
    "num_meter_zeros": 0.0,
    "nmae": 0.19518739588748232,
    "nmbe": -1.2151806242792632e-16,
    "autocorr_resid": -0.014795668094989898
  },
  "candidates": null
}

CalTRACK Data Sufficiency

Another important part of the CalTRACK methods is the data sufficiency requirements. We can check the data sufficiency of our baseline data. Note that we include the requested end date to indicate that the intended extent of the period stops at the baseline end date.
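One of the core checks is daily coverage: the fraction of days with valid meter and temperature data must meet the minimum coverage threshold (0.9 in the CalTRACK settings). A sketch of the arithmetic with illustrative counts, not eemeter's actual implementation:

```python
# Illustrative sufficiency check: fraction of valid days versus the minimum
# daily coverage threshold. The day counts here are made up for illustration.
min_fraction_daily_coverage = 0.9

n_valid_days = 332
n_days_total = 338

coverage = n_valid_days / n_days_total
print(round(coverage, 3), coverage >= min_fraction_daily_coverage)
```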


In [89]:
baseline_data_sufficiency_billing = eemeter.caltrack_sufficiency_criteria(
    design_matrix_billing, requested_start=None, requested_end=baseline_end_date)

baseline_data_sufficiency_daily = eemeter.caltrack_sufficiency_criteria(
    design_matrix_daily, requested_start=None, requested_end=baseline_end_date)

In [90]:
baseline_data_sufficiency_billing.warnings


Out[90]:
[EEMeterWarning(qualified_name=eemeter.caltrack_sufficiency_criteria.incorrect_number_of_total_days)]

In [91]:
baseline_data_sufficiency_daily.warnings


Out[91]:
[EEMeterWarning(qualified_name=eemeter.caltrack_sufficiency_criteria.extreme_values_detected)]

These warnings carry useful information about the extent of the data sufficiency issues:


In [92]:
print(json.dumps(baseline_data_sufficiency_billing.json(), indent=2))


{
  "status": "FAIL",
  "criteria_name": "caltrack_sufficiency_criteria",
  "warnings": [
    {
      "qualified_name": "eemeter.caltrack_sufficiency_criteria.incorrect_number_of_total_days",
      "description": "Total data span does not match the required value.",
      "data": {
        "num_days": 365,
        "n_days_total": 338
      }
    }
  ],
  "data": {
    "extra_data_after_requested_end_date": {
      "requested_end": "2016-12-26T00:00:00+00:00",
      "data_end": "2016-12-19T06:00:00+00:00",
      "n_days_end_gap": 6
    },
    "extra_data_before_requested_start_date": {
      "requested_start": null,
      "data_start": "2016-01-22T06:00:00+00:00",
      "n_days_start_gap": 0
    },
    "negative_meter_values": {
      "n_negative_meter_values": 0
    },
    "incorrect_number_of_total_days": {
      "num_days": 365,
      "n_days_total": 338
    },
    "too_many_days_with_missing_data": {
      "n_valid_days": 332,
      "n_days_total": 338
    },
    "too_many_days_with_missing_meter_data": {
      "n_valid_meter_data_days": 332,
      "n_days_total": 338
    },
    "too_many_days_with_missing_temperature_data": {
      "n_valid_temperature_data_days": 332,
      "n_days_total": 338
    },
    "extreme_values_detected": {
      "n_extreme_values": 0,
      "median": 30.391935483870967,
      "upper_quantile": 35.59730056980057,
      "lower_quantile": 19.650705818965516,
      "extreme_value_limit": 78.23171973637612,
      "max_value": 36.68096774193548
    }
  },
  "settings": {
    "num_days": 365,
    "min_fraction_daily_coverage": 0.9,
    "min_fraction_hourly_temperature_coverage_per_period": 0.9
  }
}

In [93]:
print(json.dumps(baseline_data_sufficiency_daily.json(), indent=2))


{
  "status": "PASS",
  "criteria_name": "caltrack_sufficiency_criteria",
  "warnings": [
    {
      "qualified_name": "eemeter.caltrack_sufficiency_criteria.extreme_values_detected",
      "description": "Extreme values (greater than (median + (3 * IQR)), must be flagged for manual review.",
      "data": {
        "n_extreme_values": 1,
        "median": 26.34,
        "upper_quantile": 37.34,
        "lower_quantile": 16.53,
        "extreme_value_limit": 88.77000000000001,
        "max_value": 96.39
      }
    }
  ],
  "data": {
    "extra_data_after_requested_end_date": {
      "requested_end": "2016-12-26T00:00:00+00:00",
      "data_end": "2016-12-26T00:00:00+00:00",
      "n_days_end_gap": 0
    },
    "extra_data_before_requested_start_date": {
      "requested_start": null,
      "data_start": "2015-12-27T00:00:00+00:00",
      "n_days_start_gap": 0
    },
    "negative_meter_values": {
      "n_negative_meter_values": 0
    },
    "incorrect_number_of_total_days": {
      "num_days": 365,
      "n_days_total": 365
    },
    "too_many_days_with_missing_data": {
      "n_valid_days": 365,
      "n_days_total": 365
    },
    "too_many_days_with_missing_meter_data": {
      "n_valid_meter_data_days": 365,
      "n_days_total": 365
    },
    "too_many_days_with_missing_temperature_data": {
      "n_valid_temperature_data_days": 365,
      "n_days_total": 365
    },
    "extreme_values_detected": {
      "n_extreme_values": 1,
      "median": 26.34,
      "upper_quantile": 37.34,
      "lower_quantile": 16.53,
      "extreme_value_limit": 88.77000000000001,
      "max_value": 96.39
    }
  },
  "settings": {
    "num_days": 365,
    "min_fraction_daily_coverage": 0.9,
    "min_fraction_hourly_temperature_coverage_per_period": 0.9
  }
}
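
Note how the `extreme_value_limit` in the daily output is derived: it is the median plus three times the interquartile range (IQR), computed from the quantiles reported alongside it. The arithmetic can be verified directly with the values from the output above:

```python
# Reproducing the extreme_value_limit from the daily sufficiency output:
# limit = median + 3 * IQR, where IQR = upper_quantile - lower_quantile.
median = 26.34
upper_quantile = 37.34
lower_quantile = 16.53
max_value = 96.39

iqr = upper_quantile - lower_quantile       # 20.81
extreme_value_limit = median + 3 * iqr      # 88.77 (matches the output)

# max_value exceeds the limit, which is why one extreme value was flagged.
print(max_value > extreme_value_limit)      # → True
```

This is why the daily result still reports `"status": "PASS"` but carries an `extreme_values_detected` warning: extreme values do not fail the sufficiency check, they are flagged for manual review.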

Next steps

Congrats! You've finished the basic tutorial. The following are all highly recommended ways to learn more about open energy efficiency metering.

One good next step is to rerun this notebook with a different sample dataset. The following cell prints the names of the other available samples:


In [94]:
eemeter.samples()


Out[94]:
['il-electricity-cdd-hdd-billing_bimonthly',
 'il-electricity-cdd-hdd-billing_monthly',
 'il-electricity-cdd-hdd-daily',
 'il-electricity-cdd-hdd-hourly',
 'il-electricity-cdd-only-billing_bimonthly',
 'il-electricity-cdd-only-billing_monthly',
 'il-electricity-cdd-only-daily',
 'il-electricity-cdd-only-hourly',
 'il-gas-hdd-only-billing_bimonthly',
 'il-gas-hdd-only-billing_monthly',
 'il-gas-hdd-only-daily',
 'il-gas-hdd-only-hourly',
 'il-gas-intercept-only-billing_bimonthly',
 'il-gas-intercept-only-billing_monthly',
 'il-gas-intercept-only-daily',
 'il-gas-intercept-only-hourly']