This Jupyter notebook is an interactive tutorial. It walks through loading data, running the CalTRACK methods, and plotting results. You'll run all the code yourself. Cells can be executed with <shift><enter>. If you feel so inspired, make edits to the code in these cells and dig deeper.
This tutorial assumes the reader has properly installed Python and the eemeter package (pip install eemeter) and has a basic working knowledge of Python syntax and usage.
This tutorial is a self-paced walkthrough of how to use the eemeter package. It focuses on demonstrating how to run the CalTRACK Hourly, Daily, and Billing methods on hourly, daily, and billing meter data.
At the time of writing (Sept 2018), the OpenEEmeter, as implemented in the eemeter package, contains the most complete open source implementation of the CalTRACK methods, which specify a way of calculating avoided energy use at a single meter. However, using the OpenEEmeter to calculate avoided energy use does not in itself guarantee compliance with the CalTRACK method specification, nor is using the OpenEEmeter a requirement of the CalTRACK methods. The eemeter package is a toolkit that may help with implementing a CalTRACK compliant analysis: it provides a particular implementation of the CalTRACK methods, consisting of a set of functions, parameters, and classes that can be configured to run the CalTRACK methods and their variants. Please keep in mind while using the package that the eemeter assumes certain data cleaning tasks specified in the CalTRACK methods have occurred prior to usage. The package will create warnings to expose errors of this nature where possible.
The eemeter package is built for flexibility and modularity. While this is generally helpful and makes the package easier to use, one consequence is that without carefully following both the eemeter documentation and the guidance provided in the CalTRACK methods, it is very possible to use the eemeter in a way that does not comply with the CalTRACK methods. For example, while the CalTRACK methods set specific hard limits for the purpose of standardization and consistency, the eemeter can be configured to edit or entirely ignore those limits. The main reason for this flexibility is that the eemeter package is used not only to comply with the CalTRACK methods, but also to develop, test, and propose potential changes to those methods.
Rather than providing a single method that directly calculates avoided energy use from the required inputs, the eemeter library provides a series of modular functions that can be strung together in a variety of ways. The tutorial below describes common usage and sequencing of these functions, especially when it might not otherwise be apparent from the API documentation.
Some new users have assumed that the eemeter package constitutes an entire application suitable for running metering analytics at scale. This is not necessarily the case. It is designed instead to be embedded within other applications or to be used in one-off analyses. The eemeter is a toolbox that leaves to the user decisions about when to use or how to embed the provided tools within other applications. This limitation is an important consequence of the decision to make the methods and implementation as open and accessible as possible.
As you dive in, remember that this is a work in progress and that we welcome feedback and contributions. To contribute, please open an issue or a pull request on GitHub.
Note: these Jupyter cell magics enable some useful special features but are unrelated to eemeter.
In [1]:
# inline plotting
%matplotlib inline
# allow live package editing
%load_ext autoreload
%autoreload 2
In [2]:
import eemeter
This tutorial requires eemeter version 2.x.x. You can verify the version you have installed with the command below.
In [3]:
eemeter.get_version()
Out[3]:
The three essential inputs to eemeter library functions are meter data, temperature data, and project dates (which define the baseline and reporting periods).
Users of the library are responsible for obtaining and formatting this data (to get weather data, see eeweather, which helps perform site to weather station matching and can pull and cache temperature data directly from public (US) data sources). Some samples come loaded with the library, and we'll load these first to save you the trouble of loading your own data. The simulated sample data additionally has the useful property that we can load the same underlying data at three different frequencies: hourly, daily, and billing.
We directly use pandas DataFrame and Series objects to hold the input meter and temperature time series data, which allows us to easily take advantage of the powerful methods provided by the pandas package. Using pandas has the added advantage of being familiar to pythonistas who work frequently with data of this nature. These formats are discussed in more detail below. If working with your own data instead of these samples, please refer directly to the excellent pandas documentation for instructions on loading data (e.g., pandas.read_csv). For some common cases, eemeter does come packaged with loading methods, but these will only work for particular data formats.
Useful eemeter methods for loading and manipulating data:
eemeter.meter_data_from_csv: Load meter data from CSV.
eemeter.temperature_data_from_csv: Load temperature data from CSV.
eemeter.meter_data_from_json: Load meter data from JSON.
eemeter.temperature_data_from_json: Load temperature data from JSON.
eemeter.samples: Return a list of sample data names.
eemeter.load_sample: Load sample data by name.
eemeter.as_freq: Coerce meter data into a different frequency.
Remember: the sample data is simulated, not real!
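For readers loading their own data with pandas directly, a minimal sketch of reading meter data from CSV might look like the following. The inline CSV and its values are illustrative; the start/value column convention used by eemeter is described below.

```python
import io

import pandas as pd

# A hypothetical CSV in the conventional eemeter format: a "start"
# timestamp column and a "value" column of meter readings.
csv = io.StringIO(
    "start,value\n"
    "2017-01-01T00:00:00+00:00,25.0\n"
    "2017-01-02T00:00:00+00:00,30.1\n"
    "2017-01-03T00:00:00+00:00,28.7\n"
)

meter_data = pd.read_csv(csv, parse_dates=["start"], index_col="start")
print(meter_data.index.tz)  # timestamps parsed with offsets are tz-aware
```

Timestamps carrying a UTC offset parse into a timezone-aware index; naive timestamps would need an explicit tz_localize.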
In [4]:
meter_data_hourly, temperature_data_hourly, metadata_hourly = \
eemeter.load_sample('il-electricity-cdd-hdd-hourly')
meter_data_daily, temperature_data_daily, metadata_daily = \
eemeter.load_sample('il-electricity-cdd-hdd-daily')
meter_data_billing, temperature_data_billing, metadata_billing = \
eemeter.load_sample('il-electricity-cdd-hdd-billing_monthly')
The metadata contains project start and end dates that we can use to determine a baseline period. All three samples have the same project dates, so we'll just use one of them. All we are using this for is to define the baseline end date and the reporting period start date.
In [5]:
baseline_end_date = metadata_billing['blackout_start_date']
baseline_end_date
Out[5]:
The convention for formatting meter data is to create a pandas DataFrame with a DatetimeIndex called start and a single column of meter readings called value. The index datetime values represent the start dates of each metering period. The end of each period is the start of the next period, even for data with variable period lengths like billing data. The end date of the last period can be supplied by appending an extra period with the final end date and a NaN value. Missing data is represented by one or more periods with value NaN. Data should be sorted by time and deduplicated prior to use with eemeter. Timestamps must be timezone aware.
Data is formatted like this as a convenience, to avoid the need to store both a start and an end timestamp for each data point. However, the convention of using start dates as timestamps can be a bit confusing. If you are starting with billing data, which is sometimes defined primarily by period end dates, make sure the transformation is done properly so that the meter data ends up with start dates as timestamps.
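As a concrete sketch of that transformation (with made-up billing values), converting end-date-keyed billing records into the start-date convention might look like this:

```python
import pandas as pd

# Hypothetical billing records keyed by period *end* dates.
first_period_start = pd.Timestamp("2017-01-01", tz="UTC")
period_ends = pd.DatetimeIndex(["2017-02-01", "2017-03-01", "2017-04-01"], tz="UTC")
values = [540.0, 480.0, 510.0]

# Each period's start is the previous period's end. The final end date is
# carried by an extra trailing row with a NaN value, per the convention.
starts = pd.DatetimeIndex([first_period_start]).append(period_ends)
meter_data = pd.DataFrame({"value": values + [None]}, index=starts)
meter_data.index.name = "start"
print(meter_data)
```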
Take a look at the hourly, daily, and billing data we just loaded. It follows the conventions described above. Notice that the format is identical but the timestamps and values are different.
In [6]:
meter_data_hourly.head() # pandas.DataFrame.head filters to just the first 5 rows
Out[6]:
In [7]:
meter_data_daily.head()
Out[7]:
In [8]:
meter_data_billing.tail() # last 5 rows
Out[8]:
The convention for formatting temperature data is as a pandas Series, also with a DatetimeIndex. These three versions are all exactly the same because we always start with hourly temperature data. This is necessary even for daily and billing analyses because we must be able to aggregate the temperatures in different ways over different time series, including dates in many different time zones, whose local midnights don't always align with the UTC midnights provided in preaggregated daily data.
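A minimal sketch of the expected temperature format (hypothetical values, degrees Fahrenheit assumed):

```python
import pandas as pd

# Hourly temperature data as a pandas Series with a tz-aware DatetimeIndex.
index = pd.date_range("2017-01-01", periods=24, freq="h", tz="UTC")
temperatures = pd.Series([28.5 + 0.5 * hour for hour in range(24)], index=index)
print(temperatures.head())
```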
In [9]:
temperature_data_hourly.head()
Out[9]:
In [10]:
temperature_data_daily.head()
Out[10]:
In [11]:
temperature_data_billing.head()
Out[11]:
The eemeter plotting functions allow visual exploration of meter and temperature data.
Plotting in time series, we see the difference in the frequency of the data more clearly.
eemeter.plot_time_series: Plot meter and temperature data in time series.
In [12]:
eemeter.plot_time_series(meter_data_hourly, temperature_data_hourly, figsize=(16, 4))
Out[12]:
In [13]:
eemeter.plot_time_series(meter_data_daily, temperature_data_daily, figsize=(16, 4))
Out[13]:
In [14]:
eemeter.plot_time_series(meter_data_billing, temperature_data_billing, figsize=(16, 4))
Out[14]:
The following stacks the three versions of the data (hourly, daily, and billing) on top of each other in energy signature form, which shows the dependence of usage on outdoor temperature. These plots convert the meter data to "usage per day", which normalizes the values and makes usage patterns roughly comparable across sampling frequencies.
eemeter.plot_energy_signature: Plot meter and temperature data as an energy signature.
Remember, this data is simulated. If these correlations look too good to be true, they are!
In [15]:
ax = eemeter.plot_energy_signature(meter_data_hourly, temperature_data_hourly, figsize=(14, 8))
eemeter.plot_energy_signature(meter_data_daily, temperature_data_daily, ax=ax)
eemeter.plot_energy_signature(meter_data_billing, temperature_data_billing, ax=ax)
ax.legend(labels=['hourly', 'daily', 'billing'])
Out[15]:
The CalTRACK methods require building a model of the usage during the baseline period and then projecting that forward into the reporting period. Before we can build the baseline model, we need to isolate 365 days of meter data as immediately prior to the end of the baseline period as we can. The following function performs this filtering for us and returns a new dataset with only baseline data.
eemeter.get_baseline_data: Filter a dataset to baseline period data.
In [16]:
baseline_meter_data_hourly, baseline_warnings_hourly = eemeter.get_baseline_data(
meter_data_hourly, end=baseline_end_date, max_days=365)
baseline_meter_data_daily, baseline_warnings_daily = eemeter.get_baseline_data(
meter_data_daily, end=baseline_end_date, max_days=365)
baseline_meter_data_billing, baseline_warnings_billing = eemeter.get_baseline_data(
meter_data_billing, end=baseline_end_date, max_days=365)
To give you a sense for what this data looks like, let's tail the data again. Remember that we had a baseline end date of 2016-12-26, so this data goes up to that date but no further, as we specified above with the end argument. It's also no more than 365 days long, as we specified above with the max_days argument. Notice that the billing data is a bit shorter because of the unevenness of billing periods. Billing periods that fall across (rather than exactly at) the boundaries are removed by this method.
In [17]:
baseline_meter_data_hourly.tail()
Out[17]:
In [18]:
baseline_meter_data_daily.tail()
Out[18]:
In [19]:
baseline_meter_data_billing.tail()
Out[19]:
If there had been any issues (e.g., unexpected gaps in the data) in filtering the data to the baseline period, some warnings would have been reported. This time we got off easy, but that will not always be the case in real-life datasets.
In [20]:
baseline_warnings_hourly, baseline_warnings_daily, baseline_warnings_billing
Out[20]:
CalTRACK defines certain adjustments to the meter data. A helper function, clean_caltrack_billing_daily_data, was created to handle these cases.
In [24]:
baseline_meter_data_billing = eemeter.clean_caltrack_billing_daily_data(baseline_meter_data_billing, 'billing')
baseline_meter_data_daily = eemeter.clean_caltrack_billing_daily_data(baseline_meter_data_daily, 'daily')
baseline_meter_data_daily_from_hourly = eemeter.clean_caltrack_billing_daily_data(baseline_meter_data_hourly, 'hourly')
The CalTRACK daily and billing methods specify a way of modeling the energy signature we plotted a few cells above. We need to select a model which fits the data as well as possible. The parameters in the model are heating and cooling balance points (i.e., the temperatures at which heating or cooling related energy use tends to kick in), and the heating and cooling beta parameters, which define the slope of the energy response to incremental differences between outdoor temperature and the balance point. We'll do a grid search over possible heating and cooling balance points and fit models to the heating and cooling degree days defined by the outdoor temperatures and each of those balance points. To do this, we precompute the heating and cooling degree days using the methods below before we feed them into the modeling routines.
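To make the degree day computation concrete, here is an illustrative sketch of computing average degree days per day from hourly temperatures at a single balance point. This is not the eemeter implementation, and the temperature values are made up.

```python
import pandas as pd

# Two days of hypothetical hourly temperatures (degrees F).
index = pd.date_range("2017-01-01", periods=48, freq="h", tz="UTC")
temperatures = pd.Series([20.0] * 24 + [75.0] * 24, index=index)

def degree_days(temps, balance_point):
    # Clip hourly exceedances of the balance point, then average the
    # degree-hours into degree days: 24 hours at 10 degrees over the
    # balance point equals 10 cooling degree days for that day.
    cdh = (temps - balance_point).clip(lower=0)
    hdh = (balance_point - temps).clip(lower=0)
    return pd.DataFrame({"cdd": cdh, "hdd": hdh}).resample("D").sum() / 24.0

dd = degree_days(temperatures, balance_point=65)
print(dd)  # day one is all heating (hdd=45.0), day two all cooling (cdd=10.0)
```

The grid search repeats this computation for each candidate balance point and fits a model to each result.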
To make this dataset, we need to merge the meter data and temperature data into a single DataFrame. The compute_usage_per_day_feature function transforms the meter data into usage per day. The compute_temperature_features function lets us create a range of heating and cooling degree day values if we specify balance points to use. In this case, we'll use the wide balance point ranges recommended in the CalTRACK spec. Then we can combine the two using merge_features.
eemeter.create_caltrack_daily_design_matrix: Create a design matrix for CalTRACK daily methods.
eemeter.create_caltrack_billing_design_matrix: Create a design matrix for CalTRACK billing methods.
eemeter.compute_usage_per_day_feature: Transform meter data into usage per day.
eemeter.compute_temperature_features: Compute heating and cooling degree days and other useful temperature features.
eemeter.merge_features: Combine a list of DataFrame or Series objects which share an index into a single DataFrame.
In [25]:
design_matrix_daily = eemeter.create_caltrack_daily_design_matrix(
baseline_meter_data_daily, temperature_data_daily,
)
A preview of this dataset is shown below:
In [26]:
design_matrix_daily.tail()
Out[26]:
In [27]:
design_matrix_daily.index.min(), design_matrix_daily.index.max()
Out[27]:
We can do roughly the same thing for the billing data, adding a tolerance as specified in the CalTRACK methods.
In [28]:
design_matrix_billing = eemeter.create_caltrack_billing_design_matrix(
baseline_meter_data_billing, temperature_data_billing,
)
You'll notice that this billing data shares the structure used above for the daily data. Notice however that the magnitude of the meter value column is significantly smaller than it was before calling compute_usage_per_day_feature. That is because the values are returned as average usage per day, as specified by the CalTRACK methods, not as totals per period, as they are represented in the inputs. The heating/cooling degree days returned by compute_temperature_features are likewise average heating/cooling degree days per day, not total heating/cooling degree days per period. This averaging behavior can be modified with the use_mean_daily_values parameter, which is set to True by default.
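The conversion from period totals to per-day averages amounts to dividing each total by the length of its period in days. An illustrative sketch with made-up billing values (not the eemeter implementation):

```python
import pandas as pd

# Billing-style meter data: start-date timestamps, period totals, and a
# trailing NaN row carrying the final end date (hypothetical values).
starts = pd.DatetimeIndex(["2017-01-01", "2017-01-31", "2017-03-02"], tz="UTC")
meter_data = pd.DataFrame({"value": [600.0, 900.0, None]}, index=starts)

# Each period's length in days is the gap to the next timestamp.
days = meter_data.index.to_series().diff().shift(-1).dt.days
usage_per_day = meter_data["value"] / days
print(usage_per_day)  # 600 over 30 days -> 20.0/day; 900 over 30 days -> 30.0/day
```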
In [29]:
design_matrix_billing.tail()
Out[29]:
In [30]:
design_matrix_billing.index.min(), design_matrix_billing.index.max()
Out[30]:
If you are not running the CalTRACK hourly methods, at this point you should skip down to the section called "Running the CalTRACK Billing/Daily Methods".
The hourly methods require a multi-stage dataset creation process which is a bit more involved than the daily/billing dataset creation process above. There are two primary reasons for this extra complexity. First, unlike the daily/billing methods, the hourly methods build separate models for each calendar month, which adds a few extra steps. Second, also unlike the billing and daily methods, there are two features of the dataset creation which must themselves be fitted to a preliminary dataset -- the occupancy feature and the temperature bin features.
The preliminary dataset has some simple time and temperature features. These features do not vary by segment and are precursors to other features (segmentation is explained below). This step looks a lot like the daily/billing dataset creation. These features are subsequently used to fit the occupancy and temperature bin features.
eemeter.create_caltrack_hourly_preliminary_design_matrix: Create a design matrix for the first stage of CalTRACK hourly.
eemeter.compute_time_features: Create a time feature for the index (time_of_week).
eemeter.compute_temperature_features: Compute heating and cooling degree days and other useful temperature features.
eemeter.merge_features: Combine a list of DataFrame or Series objects which share an index into a single DataFrame.
In [31]:
preliminary_design_matrix_hourly = eemeter.create_caltrack_hourly_preliminary_design_matrix(
baseline_meter_data_hourly, temperature_data_hourly,
)
Let's take a peek at this data. This time, we have only two fixed heating and cooling degree day columns; these are used to fit the occupancy model. But we also have an hour of week column, which is a categorical variable indicating the hour of the week using an integer from 1 to 168 (i.e., 7*24).
In [32]:
preliminary_design_matrix_hourly.tail()
Out[32]:
To handle creating multiple independent models on a shared dataset (as is required for CalTRACK hourly), we have introduced a concept which we are calling segmentation. Segmentation breaks a dataset into $n$ named and weighted subsets.
Before we can move on to the next steps of creating the CalTRACK hourly dataset, we need to create a segmentation for the hourly data. We will use this to create 12 independent hourly models, one for each month of the calendar year. The eemeter function for creating these weights is segment_time_series, and it takes a DatetimeIndex as input.
This segmentation matrix contains one column for each segment (12 in all), each of which contains the segmentation weights for that segment. The segmentation scheme we use here gives each segment a single fully weighted calendar month and two half-weighted neighboring calendar months. The eemeter name for this segmentation scheme is 'three_month_weighted' (there are also 'all', 'one_month', and 'three_month', each of which behaves a bit differently).
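The weighting rule behind 'three_month_weighted' can be sketched as follows (an illustrative reimplementation of the scheme as described above, not the eemeter code):

```python
import pandas as pd

def three_month_weight(month, target_month):
    # Full weight for the target calendar month, half weight for its two
    # neighbors, wrapping around the calendar year (December neighbors January).
    if month == target_month:
        return 1.0
    previous_month = target_month - 1 or 12
    next_month = target_month % 12 + 1
    return 0.5 if month in (previous_month, next_month) else 0.0

months = pd.date_range("2017-01-01", periods=12, freq="MS", tz="UTC")
january_segment = [three_month_weight(ts.month, target_month=1) for ts in months]
print(january_segment)  # 1.0 for Jan, 0.5 for Feb and (wrapped) Dec, else 0.0
```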
We are creating this segmentation over the time index of the baseline period that is represented in the preliminary hourly design matrix.
eemeter.segment_time_series: Create a segmentation using the specified scheme.
In [33]:
segmentation_hourly = eemeter.segment_time_series(
preliminary_design_matrix_hourly.index,
'three_month_weighted'
)
segmentation_hourly.head()
Out[33]:
These segments are probably easier to understand when plotted. The areas in the following chart represent the weights assigned to the data at particular hours. A weight of 1 is full weight; a weight of 0 indicates that the data is ignored for that segment. These segments look like 3-month-long tetris blocks and indicate half-weight/full-weight/half-weight for the three months they cover. For instance, the dec-jan-feb-weighted segment (which will eventually be used to estimate usage for January) includes a fully weighted January but also half-weighted December and February. These weights wrap around the calendar year, so both January and December of 2017 might end up in the same dataset.
In [34]:
# example segmentation weights
segmentation_hourly[[
'dec-jan-feb-weighted',
'apr-may-jun-weighted',
'jun-jul-aug-weighted'
]].plot.area(stacked=False, alpha=0.3, figsize=(15, 2.5))
Out[34]:
Occupancy is estimated by building a simple model from the preliminary design matrix hdd_50 and cdd_65 columns. This is done for each segment independently, so results are returned as a dataframe with one segment of results per column. The segmentation argument indicates that the analysis should be done once per segment. Occupancy is determined by hour of week category. A value of 1 for a particular hour indicates an "occupied" mode, and a value of 0 indicates "unoccupied" mode. These modes are determined by the tendency of the hdd_50/cdd_65 model to over- or under-predict usage for that hour, given a particular threshold between 0 and 1: if the fraction of underpredictions (by count) is lower than that threshold, the mode is "unoccupied"; otherwise the mode is "occupied".
eemeter.estimate_hour_of_week_occupancy: Estimate occupancy by time of week for each segment.
In [35]:
occupancy_lookup_hourly = eemeter.estimate_hour_of_week_occupancy(
preliminary_design_matrix_hourly,
segmentation=segmentation_hourly,
# threshold=0.65 # default
)
The occupancy lookup is organized by hour of week (rows) and model segment (columns).
In [36]:
occupancy_lookup_hourly.head()
Out[36]:
Temperature bins are fit for each segment such that each bin has a sufficient number of temperature readings. Bins are defined by starting with a proposed set of bins (see the default_bins argument) and systematically dropping bin endpoints. Bins themselves are not dropped but are effectively combined with neighboring bins. Except for the fact that zero-weighted times are dropped, segment weights are not considered when fitting temperature bins.
eemeter.fit_temperature_bins: Fit temperature bins to data, dropping bin endpoints for bins that do not meet the minimum temperature count, such that the remaining bins meet the minimum count.
In [37]:
temperature_bins_hourly = eemeter.fit_temperature_bins(
preliminary_design_matrix_hourly,
segmentation=segmentation_hourly,
# default_bins=[30, 45, 55, 65, 75, 90], # default
# min_temperature_count=20 # default
)
Because bin fitting and validation is done independently for each segment, results are returned as a dataframe with one segment of results per column. The contents of the dataframe are boolean indicators of whether the bin endpoint should be used for temperatures in that segment. Some bin endpoints are dropped because of insufficient reading counts; the bin endpoints that are dropped for each segment are given a value of False. You'll notice in this dataset that the winter months tend to have combined high temperature bins and the summer months tend to have combined low temperature bins.
In [38]:
temperature_bins_hourly
Out[38]:
With these in hand, we can now combine them into a segmented dataset using the helper function iterate_segmented_dataset and a prefabricated feature processor, caltrack_hourly_fit_feature_processor, which is provided to assist in creating the segmented dataset given a preliminary design matrix of the form created above. The feature processor transforms each segment of the dataset using the occupancy lookup and temperature bins created above. We are creating a python dict of pandas DataFrames, one for each time series segment encountered in the baseline data.
In [39]:
segmented_design_matrices_hourly = eemeter.create_caltrack_hourly_segmented_design_matrices(
preliminary_design_matrix_hourly,
segmentation_hourly,
occupancy_lookup_hourly,
temperature_bins_hourly,
)
The keys of the dict are segment names. The values are DataFrame objects containing the exact data needed to fit a CalTRACK hourly model.
In [40]:
print(segmented_design_matrices_hourly.keys())
segmented_design_matrices_hourly['dec-jan-feb-weighted'].head()
Out[40]:
In [41]:
baseline_model_results_daily = eemeter.fit_caltrack_usage_per_day_model(
design_matrix_daily,
)
In [42]:
baseline_model_results_billing = eemeter.fit_caltrack_usage_per_day_model(
design_matrix_billing,
use_billing_presets=True,
weights_col='n_days_kept',
)
In [43]:
baseline_segmented_model_hourly = eemeter.fit_caltrack_hourly_model(
segmented_design_matrices_hourly,
occupancy_lookup_hourly,
temperature_bins_hourly,
)
In [44]:
ax = eemeter.plot_energy_signature(meter_data_daily, temperature_data_daily)
baseline_model_results_daily.plot(ax=ax, temp_range=(-5, 88))
Out[44]:
In [45]:
ax = eemeter.plot_energy_signature(meter_data_billing, temperature_data_billing)
baseline_model_results_billing.plot(ax=ax, temp_range=(18, 80))
Out[45]:
We can also compare the two models and see that there is a slight, but not drastic, difference between them.
In [46]:
ax = baseline_model_results_daily.model.plot(color='C0', best=True, label='daily')
ax = baseline_model_results_billing.model.plot(ax=ax, color='C1', best=True, label='billing')
ax.legend()
Out[46]:
In [47]:
reporting_start_date = metadata_billing['blackout_end_date']
Now we get the first year of data for that period.
In [48]:
reporting_meter_data_hourly, warnings = eemeter.get_reporting_data(
meter_data_hourly, start=reporting_start_date, max_days=365)
reporting_meter_data_daily, warnings = eemeter.get_reporting_data(
meter_data_daily, start=reporting_start_date, max_days=365)
reporting_meter_data_billing, warnings = eemeter.get_reporting_data(
meter_data_billing, start=reporting_start_date, max_days=365)
The eemeter.metered_savings method performs the logic of estimating counterfactual baseline reporting period usage. For this, it requires the fitted baseline model, the reporting period meter data (for its index, so that it can be properly joined later), and corresponding temperature data. Note that this method can return results either disaggregated into base load, cooling load, and heating load, or as aggregated usage. We do both here for demonstration purposes.
In [49]:
metered_savings_hourly, error_bands = eemeter.metered_savings(
baseline_segmented_model_hourly, reporting_meter_data_hourly,
temperature_data_hourly
)
metered_savings_daily, error_bands = eemeter.metered_savings(
baseline_model_results_daily, reporting_meter_data_daily,
temperature_data_daily, with_disaggregated=True
)
metered_savings_billing, error_bands = eemeter.metered_savings(
baseline_model_results_billing, reporting_meter_data_billing,
temperature_data_billing, with_disaggregated=True
)
In [50]:
metered_savings_hourly.head()
Out[50]:
In [51]:
metered_savings_daily.head()
Out[51]:
In [52]:
metered_savings_billing.head()
Out[52]:
In [53]:
columns = ["reporting_observed", "counterfactual_usage", "metered_savings"]
In [54]:
metered_savings_hourly[columns].resample('MS').sum().plot(figsize=(10, 6), drawstyle="steps-post")
Out[54]:
In [55]:
metered_savings_daily[columns].resample('MS').sum().plot(figsize=(10, 6), drawstyle="steps-post")
Out[55]:
In [56]:
metered_savings_billing[columns].plot(figsize=(10, 6), drawstyle="steps-post")
Out[56]:
These can be easily aggregated:
In [57]:
total_savings_hourly = metered_savings_hourly.metered_savings.sum()
percent_savings_hourly = total_savings_hourly / metered_savings_hourly.counterfactual_usage.sum() * 100
print('Hourly: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_hourly, percent_savings_hourly))
total_savings_daily = metered_savings_daily.metered_savings.sum()
percent_savings_daily = total_savings_daily / metered_savings_daily.counterfactual_usage.sum() * 100
print('Daily: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_daily, percent_savings_daily))
total_savings_billing = metered_savings_billing.metered_savings.sum()
percent_savings_billing = total_savings_billing / metered_savings_billing.counterfactual_usage.sum() * 100
print('Billing: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_billing, percent_savings_billing))
NOTE: These results differ somewhat due to the different lengths of the reporting periods. The billing version of the reporting period was a bit shorter because the billing periods over which we computed savings didn't exactly align with the 365-day period we requested, as the daily reporting period data did.
In [58]:
reporting_preliminary_design_matrix_hourly = eemeter.create_caltrack_hourly_preliminary_design_matrix(
reporting_meter_data_hourly, temperature_data_hourly,
)
reporting_segmentation_hourly = eemeter.segment_time_series(
reporting_preliminary_design_matrix_hourly.index,
'three_month_weighted'
)
reporting_occupancy_lookup_hourly = eemeter.estimate_hour_of_week_occupancy(
reporting_preliminary_design_matrix_hourly,
segmentation=reporting_segmentation_hourly,
)
reporting_temperature_bins_hourly = eemeter.fit_temperature_bins(
reporting_preliminary_design_matrix_hourly,
segmentation=reporting_segmentation_hourly,
)
reporting_segmentation_design_matrices_hourly = eemeter.create_caltrack_hourly_segmented_design_matrices(
reporting_preliminary_design_matrix_hourly,
reporting_segmentation_hourly,
reporting_occupancy_lookup_hourly,
reporting_temperature_bins_hourly
)
In [59]:
reporting_design_matrix_daily = eemeter.create_caltrack_daily_design_matrix(
reporting_meter_data_daily, temperature_data_daily,
)
In [60]:
reporting_design_matrix_billing = eemeter.create_caltrack_billing_design_matrix(
reporting_meter_data_billing, temperature_data_billing,
)
In [61]:
reporting_segmented_model_hourly = eemeter.fit_caltrack_hourly_model(
reporting_segmentation_design_matrices_hourly,
reporting_occupancy_lookup_hourly,
reporting_temperature_bins_hourly
)
In [62]:
reporting_model_results_daily = eemeter.fit_caltrack_usage_per_day_model(
reporting_design_matrix_daily,
)
In [63]:
reporting_model_results_billing = eemeter.fit_caltrack_usage_per_day_model(
reporting_design_matrix_billing,
use_billing_presets=True,
weights_col='n_days_kept',
)
In [64]:
ax = eemeter.plot_energy_signature(meter_data_daily, temperature_data_daily)
ax = baseline_model_results_daily.model.plot(ax=ax, color='C1', best=True, label='baseline', temp_range=(-5, 88))
ax = reporting_model_results_daily.model.plot(ax=ax, color='C2', best=True, label='reporting', temp_range=(-5, 88))
ax.legend()
Out[64]:
In [65]:
ax = eemeter.plot_energy_signature(meter_data_billing, temperature_data_billing)
ax = baseline_model_results_billing.model.plot(ax=ax, color='C1', best=True, label='baseline', temp_range=(18, 80))
ax = reporting_model_results_billing.model.plot(ax=ax, color='C2', best=True, label='reporting', temp_range=(18, 80))
ax.legend()
Out[65]:
The last thing we need to do before obtaining annualized, weather-normalized results is to obtain normal year temperature data. For simplicity, let's just call 2017 our "normal year". To be completely clear, this is not something you would do in practice, but it demonstrates the functionality. To use real temperature normals, check out the eeweather package.
In [66]:
import pandas as pd
normal_year_temperatures = temperature_data_daily[temperature_data_daily.index.year == 2017]
result_index = pd.date_range('2017-01-01', periods=365, freq='D', tz='UTC')
Now we are ready to obtain our annualized savings.
In [69]:
annualized_savings_hourly, annualized_savings_warnings_hourly = eemeter.modeled_savings(
baseline_segmented_model_hourly, reporting_segmented_model_hourly,
result_index, normal_year_temperatures, with_disaggregated=True
)
annualized_savings_daily, annualized_savings_warnings_daily = eemeter.modeled_savings(
baseline_model_results_daily, reporting_model_results_daily,
result_index, normal_year_temperatures, with_disaggregated=True
)
annualized_savings_billing, annualized_savings_warnings_billing = eemeter.modeled_savings(
baseline_model_results_billing, reporting_model_results_billing,
result_index, normal_year_temperatures, with_disaggregated=True
)
In [70]:
annualized_savings_hourly.head()
Out[70]:
In [71]:
annualized_savings_daily.head()
Out[71]:
In [72]:
annualized_savings_billing.head()
Out[72]:
The following plot demonstrates that in this case, the billing model represents most of the modeled savings as base load savings. This reflects the behavior seen in the model comparison above.
In [73]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 1, figsize=(10, 4))
annualized_savings_hourly[[
'modeled_baseline_usage',
'modeled_reporting_usage',
'modeled_savings',
]].plot(ax=axes)
axes.set_title('Total normalized/annualized savings')
plt.show()
In [74]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(4, 1, figsize=(10, 16))
annualized_savings_daily[[
'modeled_baseline_usage',
'modeled_reporting_usage',
'modeled_savings',
]].plot(ax=axes[0])
axes[0].set_title('Total normalized/annualized savings')
annualized_savings_daily[[
'modeled_baseline_cooling_load',
'modeled_reporting_cooling_load',
'modeled_cooling_load_savings',
]].plot(ax=axes[1])
axes[1].set_title('Modeled cooling load savings')
annualized_savings_daily[[
'modeled_baseline_heating_load',
'modeled_reporting_heating_load',
'modeled_heating_load_savings',
]].plot(ax=axes[2])
axes[2].set_title('Modeled heating load savings')
ax = annualized_savings_daily[[
'modeled_baseline_base_load',
'modeled_reporting_base_load',
'modeled_base_load_savings',
]].plot(ax=axes[3])
axes[3].set_title('Modeled base load savings')
lim = axes[3].set_ylim((0, None))
plt.show()
In [75]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(4, 1, figsize=(10, 16))
annualized_savings_billing[[
'modeled_baseline_usage',
'modeled_reporting_usage',
'modeled_savings',
]].plot(ax=axes[0])
axes[0].set_title('Total normalized/annualized savings')
annualized_savings_billing[[
'modeled_baseline_cooling_load',
'modeled_reporting_cooling_load',
'modeled_cooling_load_savings',
]].plot(ax=axes[1])
axes[1].set_title('Modeled cooling load savings')
annualized_savings_billing[[
'modeled_baseline_heating_load',
'modeled_reporting_heating_load',
'modeled_heating_load_savings',
]].plot(ax=axes[2])
axes[2].set_title('Modeled heating load savings')
ax = annualized_savings_billing[[
'modeled_baseline_base_load',
'modeled_reporting_base_load',
'modeled_base_load_savings',
]].plot(ax=axes[3])
axes[3].set_title('Modeled base load savings')
lim = axes[3].set_ylim((0, None))
plt.show()
In this case, as totals, the annualized savings look pretty similar to the metered savings.
In [76]:
total_savings_hourly = annualized_savings_hourly.modeled_savings.sum()
percent_savings_hourly = total_savings_hourly / annualized_savings_hourly.modeled_baseline_usage.sum() * 100
print('Hourly: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_hourly, percent_savings_hourly))
total_savings_daily = annualized_savings_daily.modeled_savings.sum()
percent_savings_daily = total_savings_daily / annualized_savings_daily.modeled_baseline_usage.sum() * 100
print('Daily: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_daily, percent_savings_daily))
total_savings_billing = annualized_savings_billing.modeled_savings.sum()
percent_savings_billing = total_savings_billing / annualized_savings_billing.modeled_baseline_usage.sum() * 100
print('Billing: Saved {:.1f} kWh in first year ({:.1f}%)'.format(total_savings_billing, percent_savings_billing))
If we're interested in seeing more about the models the CalTRACK method tried, we can plot all of the candidate models as well. There are a ton of these, so the reduced alpha makes it a bit easier to see what's going on. Each faint line represents a model that was tried and bested by the (orange) selected model, which had the highest r-squared. Candidates appear green if QUALIFIED and red if DISQUALIFIED. A model might be disqualified if it had unphysical (i.e., negative) coefficients.
In [77]:
ax = eemeter.plot_energy_signature(meter_data_daily, temperature_data_daily)
baseline_model_results_daily.plot(
ax=ax,
candidate_alpha=0.02,
with_candidates=True,
temp_range=(-5, 88)
)
Out[77]:
In [78]:
ax = eemeter.plot_energy_signature(meter_data_billing, temperature_data_billing)
baseline_model_results_billing.plot(
ax=ax,
candidate_alpha=0.02,
with_candidates=True,
temp_range=(18, 80)
)
Out[78]:
In addition to being plottable, these model results objects (e.g., baseline_model_results_daily) are instances of the class eemeter.ModelFit and contain a lot of interesting information about the modeling process.
For instance, there's a status, which is one of the following:
'SUCCESS': a qualified model was selected.
'NO MODEL': no candidate models qualified.
'NO DATA': no data was given.
In [79]:
baseline_model_results_billing.status, baseline_model_results_daily.status
Out[79]:
There is also a "best" candidate model:
In [80]:
baseline_model_results_billing.model, baseline_model_results_daily.model
Out[80]:
And a list of all candidate models that were tried, many of which have (much) lower r-squared than the best model.
In [81]:
baseline_model_results_billing.candidates[:5] # (there are a lot, so only showing the first 5)
Out[81]:
In [82]:
baseline_model_results_daily.candidates[:5]
Out[82]:
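The selection rule described earlier (highest r-squared among qualified candidates) can be sketched with plain dictionaries standing in for candidate model objects; the names and field values below are illustrative only:

```python
# Sketch of the selection rule: among QUALIFIED candidates, pick the one
# with the highest r-squared. Plain dicts stand in for candidate models.
candidates = [
    {'name': 'intercept_only', 'status': 'QUALIFIED', 'r_squared': 0.05},
    {'name': 'hdd_only', 'status': 'DISQUALIFIED', 'r_squared': 0.95},
    {'name': 'cdd_hdd', 'status': 'QUALIFIED', 'r_squared': 0.78},
]
best = max(
    (c for c in candidates if c['status'] == 'QUALIFIED'),
    key=lambda c: c['r_squared'],
)
print(best['name'])  # cdd_hdd
```

Note that the disqualified candidate with the highest r-squared is never considered, which is exactly why disqualification matters.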
There are also warnings associated with both the model results object and the best candidate model object:
In [83]:
baseline_model_results_billing.warnings, baseline_model_results_billing.model.warnings
Out[83]:
In [84]:
baseline_model_results_daily.warnings, baseline_model_results_daily.model.warnings
Out[84]:
The best models don't appear to have any issues, but the billing model fit did produce some disqualified candidates (see the faint red lines in the chart above).
In [85]:
disqualified_candidates = [
candidate
for candidate in baseline_model_results_billing.candidates
if candidate.status == 'DISQUALIFIED'
] # this is a python list comprehension
disqualified_candidates[:10]
Out[85]:
The warnings associated with the disqualified candidates are a bit more interesting. For instance, this candidate was disqualified because its 'beta_hdd' parameter was negative, which is unphysical behavior that the CalTRACK working group considered to be evidence of overfitting:
In [86]:
import json # for nice indentation
warning = disqualified_candidates[0].warnings[0]
print(json.dumps(warning.json(), indent=2))
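When a fit produces many disqualified candidates, it can be handy to tally warnings by name. Here is a sketch using plain dictionaries shaped like the warning.json() output; the qualified_name values are made up for illustration and are not real eemeter warning names:

```python
from collections import Counter

# Hypothetical warning payloads shaped like warning.json() output.
warning_dicts = [
    {'qualified_name': 'example.negative_beta_hdd'},
    {'qualified_name': 'example.negative_beta_hdd'},
    {'qualified_name': 'example.negative_beta_cdd'},
]

# Count how often each disqualification reason occurred.
reason_counts = Counter(w['qualified_name'] for w in warning_dicts)
print(reason_counts.most_common(1))  # [('example.negative_beta_hdd', 2)]
```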
The whole set of model results can be serialized. Passing the with_candidates=True flag to the .json() method will also serialize all candidate model results:
In [87]:
print(json.dumps(baseline_model_results_billing.json(), indent=2))
In [88]:
print(json.dumps(baseline_model_results_daily.json(), indent=2))
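Since the output is plain JSON, results can be written to disk and reloaded later for auditing. Here is a sketch; `results_dict` is a made-up stand-in for the output of a .json() call above:

```python
import json
import os
import tempfile

# Stand-in for the dictionary returned by a .json() call; the keys and
# values here are illustrative only.
results_dict = {'status': 'SUCCESS', 'method_name': 'caltrack_billing'}

# Round-trip the serialized results through a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'model_results.json')
with open(path, 'w') as f:
    json.dump(results_dict, f, indent=2)
with open(path) as f:
    loaded = json.load(f)

print(loaded == results_dict)  # True
```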
Another important part of the CalTRACK methods is the set of data sufficiency requirements. We can check the data sufficiency of our baseline data. Note that we include a requested end date to indicate that the intended extent of the period stops at the baseline end date.
In [89]:
baseline_data_sufficiency_billing = eemeter.caltrack_sufficiency_criteria(
design_matrix_billing, requested_start=None, requested_end=baseline_end_date)
baseline_data_sufficiency_daily = eemeter.caltrack_sufficiency_criteria(
design_matrix_daily, requested_start=None, requested_end=baseline_end_date)
In [90]:
baseline_data_sufficiency_billing.warnings
Out[90]:
In [91]:
baseline_data_sufficiency_daily.warnings
Out[91]:
These warnings carry useful information about the extent of the data sufficiency issues:
In [92]:
print(json.dumps(baseline_data_sufficiency_billing.json(), indent=2))
In [93]:
print(json.dumps(baseline_data_sufficiency_daily.json(), indent=2))
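To build intuition for what the sufficiency criteria check, here is a simplified sketch of a coverage calculation. The actual CalTRACK rules are more detailed (they also cover temperature data, period length, and more), so treat the numbers below as illustrative only:

```python
from datetime import date, timedelta

# Simulate a 365-day baseline period with a 40-day gap of missing data,
# then compute the fraction of days present -- the kind of quantity the
# sufficiency criteria evaluate.
baseline_days = [date(2017, 1, 1) + timedelta(days=i) for i in range(365)]
missing = set(baseline_days[10:50])  # a 40-day gap
present = [d for d in baseline_days if d not in missing]

coverage = len(present) / len(baseline_days)
print('coverage: {:.1%}'.format(coverage))  # coverage: 89.0%
```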
Congrats! You've finished the basic tutorial. The following are all highly recommended as ways to learn more about open energy efficiency metering:
The following prints the names of the other samples to try out with this notebook, if interested:
In [94]:
eemeter.samples()
Out[94]: