Looking at the Fairbanks script, these are the fields that are necessary to do the benchmarking:
'Site ID': This was a 4-letter code identifying the particular facility. It doesn't have to be a 4-letter code for ARIS, just some unique ID.
'From': The start date of the billing period. Any date format works.
'Thru': The end date of the billing period. This could be multiple months after the start date in the case of oil or propane.
'Service Name': This was ['Oil #1', 'Oil #2', 'Electricity', 'Steam', 'Natural Gas', 'Water', 'Sewer', 'Refuse'] for FNSB.
'Item Description': This was the description of the particular line item on the bill, like "Customer Charge", "Cost Adjustments", or "Demand Charge", plus a very wide variety of other types of charges. In the script, I combined all the miscellaneous charges into "Other Charge"; these were billing line items that had no Usage associated with them.
'Usage': For usage-based charges, like kWh, kW, gallons, etc., this contains the usage. For "Other Charges" (fixed charges), this is blank.
'Cost': $ cost of the line item.
'Units': For usage-based charges, the units. These are the units that appeared in the Fairbanks data: 'Gallons', 'kWh', 'klbs', 'lbs', 'MMBtu', 'kVAR', 'kW', 'CCF', 'kVARh', 'Loads', 'Cgallons', 'kGal', 'Tons'.
So if we can get the ARIS data into this format, and if we can keep Service Name, Units, and some of the Item Descriptions the same, the script should run. As far as Item Descriptions that need to stay the same, I think it is just peak electric demand that needs to be labeled like Fairbanks; it can be any one of the following: 'KW Charge', 'On peak demand', 'Demand Charge'.
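For concreteness, here is a minimal sketch of two records in that target format; the Site ID, dates, and dollar amounts are made-up values for illustration only:

import pandas as pd

# Hypothetical example records in the Fairbanks billing format.
sample = pd.DataFrame([
    {'Site ID': 44, 'From': '2017-01-15', 'Thru': '2017-02-15',
     'Service Name': 'Electricity', 'Item Description': 'Energy',
     'Usage': 12500.0, 'Cost': 1875.50, 'Units': 'kWh'},
    {'Site ID': 44, 'From': '2017-01-15', 'Thru': '2017-02-15',
     'Service Name': 'Electricity', 'Item Description': 'Demand Charge',
     'Usage': 45.0, 'Cost': 310.00, 'Units': 'kW'},
])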
benchmark.py script, including creation of a new Template.
In [1]:
import time
from datetime import timedelta
import pandas as pd
import numpy as np
import requests
###################### Matplotlib #######################
# import matplotlib pyplot commands
from matplotlib.pyplot import *
# Show Plots in the Notebook
%matplotlib inline
# use the 'bmh' style sheet to style the plots
style.use('bmh')
In [2]:
rcParams['figure.figsize'] = (10, 8)   # set chart size
rcParams['font.size'] = 14             # set font size in charts
In [3]:
# URLs and Passwords
my_username = 'buildingenergyapp'
my_password = 'buildingenergyappTest1!'
base_url = 'http://arisapi.test.ahfc.us/api/buildingenergy'
building_list = '/GetBuildingList'
building_energy_detail = '/GetBuildingEnergyDetail'
my_params = {'username': my_username,
             'password': my_password}
building_list_url = base_url + building_list
building_energy_url = base_url + building_energy_detail
In [4]:
results = requests.post(building_list_url, params=my_params).json()
df_bldgs = pd.DataFrame(results)
df_bldgs.head()
Out[4]:
In [5]:
# Add a Degree-Day Site column by looking up via zip code
df_zip_to_dd = pd.read_excel('../data/Zip_to_DD_Site.xlsx', skiprows=4)
df_zip_to_dd['zip_code'] = df_zip_to_dd.zip_code.astype(str)
print(df_zip_to_dd.head())
zip_to_dd = dict(zip(df_zip_to_dd.zip_code, df_zip_to_dd.dd_site))
zip_to_dd.get('99645')
Out[5]:
In [6]:
df_bldgs['dd_site'] = df_bldgs.BuildingZip.map(zip_to_dd)
df_bldgs.head()
Out[6]:
In [7]:
# Check that all zip codes produced a degree-day site; NO
print(sum(df_bldgs.BuildingZip.isna()), sum(df_bldgs.dd_site.isna()))
# So need to find the zip codes that don't map to a Degree-Day site
# 'dd_site != dd_site' is a hack for finding NaN values.
df_no_map = df_bldgs.query('(BuildingZip > "") and (dd_site != dd_site)')
df_no_map.BuildingZip.unique()
Out[7]:
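As an aside, the `dd_site != dd_site` test works because NaN compares unequal to itself. An equivalent filter using .isna() directly, shown here as a sketch rather than part of the original run, would be:

# Same selection with explicit NaN tests instead of the NaN != NaN trick
df_no_map = df_bldgs[df_bldgs.BuildingZip.notna()
                     & (df_bldgs.BuildingZip != '')
                     & df_bldgs.dd_site.isna()]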
In [8]:
col_map = [
    ('BuildingId', 'site_id'),
    ('BuildingName', 'site_name'),
    ('BuildingOwnerName', 'site_category'),
    ('BuildingStreet', 'address'),
    ('BuildingCity', 'city'),
    ('BuildingUsageName', 'primary_func'),
    ('YearBuilt', 'year_built'),
    ('SquareFeet', 'sq_ft'),
    ('dd_site', 'dd_site'),
]
old_cols, new_cols = zip(*col_map)
df_bldgs2 = df_bldgs[list(old_cols)].copy()
df_bldgs2.columns = new_cols
df_bldgs2['onsite_gen'] = '' # not used
df_bldgs2.to_excel('data/Buildings.xlsx', startrow=3, index=False)
df_bldgs2.head()
Out[8]:
In [9]:
# Now work on the detailed records
print(len(df_bldgs2.site_id.unique()))
my_data = {'username': my_username,
           'password': my_password,
           'buildingId': 44}
detail = requests.post(building_energy_url, data=my_data).json()
df_detail = pd.DataFrame(detail['BuildingEnergyDetailList'])
df_detail['UsageDate'] = pd.to_datetime(df_detail.UsageDate)
df_detail.head()
Out[9]:
In [86]:
my_data = {'username': my_username,
           'password': my_password,
           'buildingId': None}
dfd = None
next_prn = time.time()
for bldg_id in df_bldgs2.site_id.unique():
    my_data['buildingId'] = bldg_id
    detail = requests.post(building_energy_url, data=my_data).json()
    if len(detail['BuildingEnergyDetailList']):
        df_detail = pd.DataFrame(detail['BuildingEnergyDetailList'])
        if dfd is not None:
            dfd = pd.concat([dfd, df_detail], ignore_index=True)
        else:
            dfd = df_detail.copy()
    if time.time() > next_prn:
        print('{:,} records fetched'.format(len(dfd)))
        next_prn += 10.0    # schedule the next progress print 10 seconds out
dfd = dfd.apply(pd.to_numeric, errors='ignore')
dfd[['UsageDate', 'MeterReadDate']] = dfd[['UsageDate', 'MeterReadDate']].apply(pd.to_datetime)
dfd.to_pickle('dfd.pkl')
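As a side note, concatenating onto a DataFrame inside a loop copies the accumulated data on every iteration. A more efficient pattern, sketched here under the same assumptions rather than being what the notebook actually ran, is to collect the per-building frames in a list and concatenate once at the end:

frames = []
for bldg_id in df_bldgs2.site_id.unique():
    my_data['buildingId'] = bldg_id
    detail = requests.post(building_energy_url, data=my_data).json()
    if len(detail['BuildingEnergyDetailList']):
        frames.append(pd.DataFrame(detail['BuildingEnergyDetailList']))
dfd = pd.concat(frames, ignore_index=True)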
In [11]:
dfd.head()
Out[11]:
In [12]:
len(dfd)
Out[12]:
In [40]:
dfd.to_pickle('dfd.pkl')
dfd.head()
Out[40]:
In [41]:
dfd.dtypes
Out[41]:
In [15]:
# All of the columns in the FNSB CSV file:
fnsb_cols = ["Site ID","Site Name","Vendor Code","Vendor Name","Account Number","Bill Date","Due Date","Entry Date","Invoice #","Voucher #","From","Thru","Service Name","Item Description","Meter Number","Usage","Cost","Units","Account Financial Code","Site Financial Code"]
fnsb_cols
Out[15]:
In [3]:
dfd = pd.read_pickle('dfd.pkl')
dfd = dfd.apply(pd.to_numeric, errors='ignore')
dfd[['UsageDate', 'MeterReadDate']] = dfd[['UsageDate', 'MeterReadDate']].apply(pd.to_datetime)
dfd.head()
Out[3]:
In [76]:
# Get rid of unneeded columns
dfd.drop(columns=['EnergyTypeId', 'EnergyUnitId', 'UsageYear'], inplace=True)
dfd.head()
Out[76]:
In [5]:
dfd.query('EnergyTypeName=="Coal"').BuildingId.unique()
Out[5]:
In [6]:
# The dictionary that renames the columns to names needed
# by the benchmarking script
col_map = {
    'BuildingId': 'Site ID',
    'EnergyTypeName': 'Service Name',
    'EnergyUnitTypeName': 'Units',
    'EnergyQuantity': 'Usage',
    'DollarCost': 'Cost',
}

def add_to_final(df_to_add):
    global df_final
    df_add = df_to_add.copy()
    df_add.rename(columns=col_map, inplace=True)
    df_add.drop(columns=['DemandUse', 'DemandCost', 'UsageDate', 'MeterReadDate'], inplace=True)
    df_final = pd.concat([df_final, df_add], ignore_index=True)
In [7]:
# For the usage end date, 'Thru', use the MeterReadDate if available, otherwise
# use the middle of the UsageDate month.
def thru_date(row):
    if pd.isnull(row.MeterReadDate):
        return row.UsageDate.replace(day=15)
    else:
        return row.MeterReadDate

dfd['Thru'] = dfd.apply(thru_date, axis=1)
dfd.head()
Out[7]:
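Row-wise apply is slow on large DataFrames. A vectorized equivalent of thru_date, offered as a sketch and not part of the original notebook, fills missing MeterReadDates with the 15th of the UsageDate month:

# 15th of each UsageDate month: first of the month plus 14 days
mid_month = dfd.UsageDate.dt.to_period('M').dt.to_timestamp() + pd.Timedelta(days=14)
dfd['Thru'] = dfd.MeterReadDate.fillna(mid_month)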
In [8]:
dfd.query('MeterReadDate > "2000-01-01"').head()
Out[8]:
In [9]:
dfd.EnergyTypeName.value_counts()
Out[9]:
In [10]:
dfd.query('EnergyTypeName=="Demand - Electric"')
Out[10]:
In [11]:
# Change these to 'Electric'
dfd.loc[dfd.EnergyTypeName == 'Demand - Electric', 'EnergyTypeName'] = 'Electric'
dfd.EnergyTypeName.value_counts()
Out[11]:
In [12]:
# There are a number of records where the EnergyQuantity is 0 or NaN,
# which probably occurs because someone doesn't have the bill for that
# month or there was no fuel fill-up in that month. We will eliminate
# those records, because they distort the period over which fuel usage
# occurred for sporadically bought fuels like oil and wood. For
# monthly-billed fuels, we will later in the code make sure that the
# From - Thru billing period only covers 1 month.
# Start by converting 0s to NaN to make future tests easier.
dfd.loc[dfd.EnergyQuantity == 0.0, 'EnergyQuantity'] = np.nan
dfd.loc[dfd.DemandUse == 0.0, 'DemandUse'] = np.nan
# Also found that there were a bunch of -1.0 values for DemandUse that
# are very likely not valid.
dfd.loc[dfd.DemandUse == -1.0, 'DemandUse'] = np.nan
dfd.query('(EnergyQuantity == 0.0) or (DemandUse == 0.0) or (DemandUse == -1.0)') # should be no records
Out[12]:
In [13]:
# Now filter down to just the records where we have a number for
# either EnergyQuantity or DemandUse.
print(len(dfd))
mask = ~(dfd.EnergyQuantity.isnull() & dfd.DemandUse.isnull())
dfd = dfd[mask].copy()
print(len(dfd))
In [14]:
# These have Demand but no energy
dfd.query('EnergyQuantity != EnergyQuantity').describe()
Out[14]:
In [27]:
# Fill out the From date by using the Thru date from the prior bill
# for the building and for the particular fuel type
df_final = None
for gp, recs in dfd.groupby(['BuildingId', 'EnergyTypeName']):
    recs = recs.sort_values(['Thru']).copy()
    # Start date comes from the prior record
    recs['From'] = recs.Thru.shift(1)
    recs['Item Description'] = 'Energy'
    if df_final is None:
        df_final = recs.copy()
    else:
        df_final = pd.concat([df_final, recs], ignore_index=True)
len(df_final)
Out[27]:
In [49]:
# For the services that are normally billed on a monthly basis, fill out
# any missing From dates (e.g. the first bill for a building) with a value
# 30 days prior to Thru. Also, restrict the Thru - From difference to 25 to 35 days.
# If it is outside that range, set to Thru - 30 days.
# Fuel types that are normally billed on a monthly basis
mo_fuels = ['Electric', 'Natural Gas', 'Steam District Ht', 'Hot Wtr District Ht']
mask_mo = df_final.EnergyTypeName.isin(mo_fuels)
# Find records of that type that have NaT for From date and
# set to 30 days prior to Thru
df_final.loc[mask_mo & df_final.From.isnull(), 'From'] = df_final.Thru - timedelta(days=30)
In [45]:
# Now find any records where Thru - From is outside 25 - 35 window and fix those.
# Perhaps they are buildings where there are two separate electric bills.
bill_len = df_final.Thru - df_final.From
mask2 = mask_mo & ((bill_len < timedelta(days=25)) | (bill_len > timedelta(days=35)))
df_final.loc[mask2, 'From'] = df_final.Thru - timedelta(days=30)
In [47]:
print(len(df_final[mask2]))
df_final[mask2].head()
Out[47]:
In [57]:
# Now work on the fuel types that are not billed monthly. Some of these records
# have NaT for the From date because they were the first record for the building
# and a particular fuel type. We will ultimately delete these. In this step
# find sporadically billed records that have a billing length of greater than 450
# days and put a NaT in for From, so that deleting all From==NaT records will catch
# them as well. A billing period of more than 450 days probably indicates that
# a fuel fill was missed, making the record invalid.
mask_sporadic = ~mask_mo
mask3 = mask_sporadic & (bill_len > timedelta(days=450))
df_final.loc[mask3, 'From'] = pd.NaT
len(df_final)
Out[57]:
In [59]:
# Now eliminate all the sporadically billed records that have a From
# with a NaT
mask_elim = (mask_sporadic & df_final.From.isnull())
df_final = df_final[~mask_elim].copy()
len(df_final)
Out[59]:
In [60]:
# Now add the Electric Demand Charge records. The From-Thru dates on these
# have already been set. The demand quantity and cost
# appear in separate, dedicated columns, but we will move them to the 'EnergyQuantity'
# and 'DollarCost' columns.
df_demand = df_final.query('DemandUse > 0 and EnergyTypeName=="Electric"').copy()
df_demand['EnergyQuantity'] = df_demand.DemandUse
df_demand['DollarCost'] = df_demand.DemandCost
df_demand['EnergyUnitTypeName'] = 'kW'
df_demand['Item Description'] = 'Demand Charge'
# add these to the final DataFrame
df_final = pd.concat([df_final, df_demand], ignore_index=True)
len(df_final)
Out[60]:
In [61]:
df_final.columns
Out[61]:
In [62]:
# Eliminate the columns that are not needed
df_final.drop(columns=['DemandCost', 'DemandUse', 'MeterReadDate', 'UsageDate'], inplace=True)
df_final.head()
Out[62]:
In [65]:
df_final.info()
In [66]:
col_map = {
    'BuildingId': 'Site ID',
    'EnergyTypeName': 'Service Name',
    'EnergyUnitTypeName': 'Units',
    'EnergyQuantity': 'Usage',
    'DollarCost': 'Cost',
}
df_final.rename(col_map, axis=1, inplace=True)
df_final.head()
Out[66]:
In [67]:
df_final['Units'].value_counts()
Out[67]:
In [68]:
df_final['Service Name'].value_counts()
Out[68]:
In [69]:
df_final.dtypes
Out[69]:
In [70]:
set(zip(df_final['Service Name'], df_final['Units']))
Out[70]:
In [71]:
df_final['Item Description'].value_counts()
Out[71]:
In [72]:
# These fields are used in the report summarizing vendors.
df_final['Account Number'] = ''
df_final['Vendor Name'] = ''
In [73]:
df_final.groupby('Service Name').sum(numeric_only=True)
Out[73]:
In [74]:
df_final.to_pickle('df_final.pkl')
df_final.to_csv('df_final.csv')
In [78]:
# FNSB Processed Data
dfu3 = pd.read_pickle('../df_processed.pkl')
dfu3.head()
Out[78]:
In [79]:
dfu3.service_type.value_counts()
Out[79]:
In [82]:
dfu3.units.value_counts()
Out[82]:
In [91]:
df_raw = pd.read_pickle('../df_raw.pkl')
df_raw[df_raw['Service Name']=='Electricity'].head()
Out[91]:
In [2]:
df_mo = pd.read_pickle('../df_mo.pkl')
df_mo.head()
Out[2]:
In [18]:
dfmf = None
for gp, recs in df_mo.groupby(['BuildingId', 'EnergyTypeName']):
    recs = recs.query('(DollarCost > 0) or (EnergyQuantity > 0)').copy()
    if len(recs) == 0:
        continue
    recs.sort_values(['Thru'], inplace=True)
    # Start date comes from the prior record
    recs['From'] = recs.Thru.shift(1)
    recs['Item Description'] = 'Energy'
    recs.loc[recs.From.isnull(), 'From'] = recs.Thru - timedelta(days=30)
    # If any of the billing periods are longer than 45 days, a prior bill is
    # probably missing. Just make those periods 30 days long.
    mask = ((recs.Thru - recs.From) > timedelta(days=45))
    recs.loc[mask, 'From'] = recs.Thru - timedelta(days=30)
    if dfmf is None:
        dfmf = recs.copy()
    else:
        dfmf = pd.concat([dfmf, recs], ignore_index=True)
In [19]:
(dfmf.Thru - dfmf.From).value_counts()
Out[19]:
In [22]:
dfmf.query('EnergyUnitTypeName == "kW"')
dfmf.head()
Out[22]: