Processing of ARIS data for Benchmarking Script

Looking at the Fairbanks script, these are the fields that are necessary to do the benchmarking:

'Site ID': This was a 4-letter code identifying the particular facility. For ARIS it doesn't have to be a 4-letter code, just some unique ID.
'From': The start date of the billing period. Any date format works.
'Thru': The end date of the billing period. This could be multiple months after the start date in the case of oil or propane.
'Service Name': This was ['Oil #1', 'Oil #2', 'Electricity', 'Steam', 'Natural Gas', 'Water', 'Sewer', 'Refuse'] for FNSB.
'Item Description': This was the description of the particular line item on the bill, like "Customer Charge", "Cost Adjustments", "Demand Charge" plus a very wide variety of other types of charges. In the script, I combined all the miscellaneous charges into "Other Charge". These were billing line items that had no Usage associated with them.
'Usage': For usage-based charges, like kWh, kW, gallons, etc., this contains the usage. For "Other Charges" (fixed charges), this is blank.
'Cost': $ cost of the line item.
'Units': For usage-based charges, the units. These are the units that appeared in the Fairbanks data: 'Gallons', 'kWh', 'klbs', 'lbs', 'MMBtu', 'kVAR', 'kW', 'CCF', 'kVARh', 'Loads', 'Cgallons', 'kGal', 'Tons'.

So if we can get the ARIS data into this format, keeping Service Name, Units, and some of the Item Descriptions the same, the script should run. As for Item Descriptions that need to stay the same, I think only peak electric demand needs to be labeled like Fairbanks; it can be any one of the following: 'KW Charge', 'On peak demand', 'Demand Charge'.
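As a concrete sketch of the target format, here are a few hypothetical rows. The Site ID, dates, and amounts are invented for illustration; only the column names and value conventions follow the description above:

```python
import pandas as pd

# Hypothetical example rows showing the minimum fields the benchmarking
# script needs.  Note the Oil #1 row spans multiple months (a fill-up),
# and the Demand Charge row uses the 'kW' units the script looks for.
records = [
    {'Site ID': 'LIB1', 'From': '2017-01-15', 'Thru': '2017-02-15',
     'Service Name': 'Electricity', 'Item Description': 'Energy charge',
     'Usage': 4120.0, 'Cost': 512.30, 'Units': 'kWh'},
    {'Site ID': 'LIB1', 'From': '2017-01-15', 'Thru': '2017-02-15',
     'Service Name': 'Electricity', 'Item Description': 'Demand Charge',
     'Usage': 21.5, 'Cost': 180.00, 'Units': 'kW'},
    {'Site ID': 'LIB1', 'From': '2016-11-02', 'Thru': '2017-02-10',
     'Service Name': 'Oil #1', 'Item Description': 'Energy',
     'Usage': 850.0, 'Cost': 2105.00, 'Units': 'Gallons'},
]
df = pd.DataFrame(records)
# Any parseable date format works for From/Thru
df[['From', 'Thru']] = df[['From', 'Thru']].apply(pd.to_datetime)
print(df)
```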

To Do

  • Standardize values in each field to match FNSB values.
    • To be safe, change the spelling of units to match (may not really be necessary, although there are some specific tests for CCF).
    • Map 'Service Type' to categories and then only do processing based on those categories.
  • Think more about how long of a billing period is too long.
  • Document the Zip-to-DD_site file. Tell people that Zip code is needed to determine degree-day site. Have them look in Buildings.xlsx file to see if a Degree-day site was assigned.
  • Find degree-day sites for the missing zip codes.
  • Add a README.md in the Data directory to explain what the files are.
  • Update the main README.md.
  • Add a section to the Settings File to accommodate ARIS passwords/URL and other ARIS-specific settings.
  • Incorporate new graphs into the benchmark.py script, including creation of new Template.
  • Labels in Existing Template Files
    • 'FNSB' appears in Energy Index Comparison.
    • No Propane or Wood present in headings.
    • Oil #2 and #1 probably should be combined and labeled "Fuel Oil", "Heating Oil", or "Oil".
    • Hot Water District Heat and Steam should probably be combined.
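A possible sketch for the first two To Do items (standardizing spellings and mapping service types). The ARIS-side labels are taken from the value_counts() output later in this notebook, and the target spellings from the FNSB lists above; folding 'Hot Wtr District Ht' into 'Steam' follows the note about combining district heat, but both that and the specific target labels are only assumptions here:

```python
# Assumed mapping from ARIS EnergyTypeName values to FNSB Service Names.
service_map = {
    '#1 Fuel Oil': 'Oil #1',
    '#2 Fuel Oil': 'Oil #2',
    'Electric': 'Electricity',
    'Natural Gas': 'Natural Gas',
    'Steam District Ht': 'Steam',
    'Hot Wtr District Ht': 'Steam',   # assumption: combine district heat
    'Propane': 'Propane',
    'Spruce Wood': 'Wood',
    'Coal': 'Coal',
}

# Assumed respelling of ARIS units to match the FNSB unit strings
# (e.g. the script has specific tests for some units).
unit_map = {
    'Gallon': 'Gallons',
    'Ton': 'Tons',
    'thousand pounds': 'klbs',
}

def standardize(service, unit):
    """Return (service, unit) spelled the way the FNSB script expects;
    values with no mapping pass through unchanged."""
    return service_map.get(service, service), unit_map.get(unit, unit)

print(standardize('#1 Fuel Oil', 'Gallon'))   # → ('Oil #1', 'Gallons')
```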

In [1]:
import time
from datetime import timedelta
import pandas as pd
import numpy as np
import requests

###################### Matplotlib #######################
# import matplotlib pyplot commands
from matplotlib.pyplot import *
 
# Show Plots in the Notebook
%matplotlib inline
 
# 'style' the plot like fivethirtyeight.com website
style.use('bmh')

In [2]:
rcParams['figure.figsize']= (10, 8)   # set Chart Size
rcParams['font.size'] = 14            # set Font size in Chart

Read Data from ARIS API


In [3]:
# URLs and Passwords
my_username = 'buildingenergyapp'
my_password = 'buildingenergyappTest1!'
base_url = 'http://arisapi.test.ahfc.us/api/buildingenergy'
building_list = '/GetBuildingList'
building_energy_detail = '/GetBuildingEnergyDetail'
my_params = {'username': my_username,
             'password':my_password}
building_list_url = base_url + building_list
building_energy_url = base_url + building_energy_detail

In [4]:
results = requests.post(building_list_url, params=my_params).json()
df_bldgs = pd.DataFrame(results)
df_bldgs.head()


Out[4]:
BuildingAddress BuildingCity BuildingEnergyDetailList BuildingId BuildingName BuildingOwnerId BuildingOwnerName BuildingState BuildingStreet BuildingUsageName BuildingZip OrganizationId OrganizationName SquareFeet YearBuilt
0 207 Spruce Dr PORT LIONS AK 99615 PORT LIONS None 44 Library 44 City of Port Lions AK 207 Spruce Dr Office 99615 2.0 Municipal Government/Subdivision 2000.0 1988.0
1 441 Bayview PORT LIONS AK 99550 PORT LIONS None 45 Public Safety Building 45 City of Port Lions AK 441 Bayview Office 99550 2.0 Municipal Government/Subdivision 1700.0 1980.0
2 443 Bayview Dr PORT LIONS AK 99550 PORT LIONS None 46 Water Building 46 City of Port Lions AK 443 Bayview Dr Other 99550 2.0 Municipal Government/Subdivision 1200.0 2008.0
3 177 N. Birch SOLDOTNA AK 99669 SOLDOTNA None 47 City Hall 47 City of Soldotna AK 177 N. Birch Office 99669 2.0 Municipal Government/Subdivision 11676.0 1986.0
4 235 N. Binkley SOLDOTNA AK 99669 SOLDOTNA None 48 Public Library 48 City of Soldotna AK 235 N. Binkley Other 99669 2.0 Municipal Government/Subdivision 10727.0 1972.0

In [5]:
# Add a Degree-Day Site column by looking up via zip code
df_zip_to_dd = pd.read_excel('../data/Zip_to_DD_Site.xlsx', skiprows=4)
df_zip_to_dd['zip_code'] = df_zip_to_dd.zip_code.astype(str)
print(df_zip_to_dd.head())
zip_to_dd = dict(zip(df_zip_to_dd.zip_code, df_zip_to_dd.dd_site))
zip_to_dd.get('99645')


  zip_code dd_site
0    99553    PADU
1    99571    PACD
2    99583    PACD
3    99612    PACD
4    99661    PASD
Out[5]:
'PAAQ'

In [6]:
df_bldgs['dd_site'] = df_bldgs.BuildingZip.map(zip_to_dd)
df_bldgs.head()


Out[6]:
BuildingAddress BuildingCity BuildingEnergyDetailList BuildingId BuildingName BuildingOwnerId BuildingOwnerName BuildingState BuildingStreet BuildingUsageName BuildingZip OrganizationId OrganizationName SquareFeet YearBuilt dd_site
0 207 Spruce Dr PORT LIONS AK 99615 PORT LIONS None 44 Library 44 City of Port Lions AK 207 Spruce Dr Office 99615 2.0 Municipal Government/Subdivision 2000.0 1988.0 PADQ
1 441 Bayview PORT LIONS AK 99550 PORT LIONS None 45 Public Safety Building 45 City of Port Lions AK 441 Bayview Office 99550 2.0 Municipal Government/Subdivision 1700.0 1980.0 PADQ
2 443 Bayview Dr PORT LIONS AK 99550 PORT LIONS None 46 Water Building 46 City of Port Lions AK 443 Bayview Dr Other 99550 2.0 Municipal Government/Subdivision 1200.0 2008.0 PADQ
3 177 N. Birch SOLDOTNA AK 99669 SOLDOTNA None 47 City Hall 47 City of Soldotna AK 177 N. Birch Office 99669 2.0 Municipal Government/Subdivision 11676.0 1986.0 PAEN
4 235 N. Binkley SOLDOTNA AK 99669 SOLDOTNA None 48 Public Library 48 City of Soldotna AK 235 N. Binkley Other 99669 2.0 Municipal Government/Subdivision 10727.0 1972.0 PAEN

In [7]:
# Check that all zip codes produced a degree-day site; NO
print(sum(df_bldgs.BuildingZip.isna()), sum(df_bldgs.dd_site.isna()))

# So need to find the zip codes that don't map to a Degree-Day site.
# 'dd_site != dd_site' is a hack for finding NaN values (NaN != NaN is True).
df_no_map = df_bldgs.query('(BuildingZip > "") and (dd_site != dd_site)')
df_no_map.BuildingZip.unique()


174 248
Out[7]:
array(['99577', '99506', '99567', '99587', '99505', '99642', '99779',
       '     ', '99783', '99707', '99109', '99725', '99373'], dtype=object)

In [8]:
col_map = [
    ('BuildingId', 'site_id'),
    ('BuildingName', 'site_name'),
    ('BuildingOwnerName', 'site_category'),
    ('BuildingStreet', 'address'),
    ('BuildingCity', 'city'),
    ('BuildingUsageName', 'primary_func'),
    ('YearBuilt', 'year_built'),
    ('SquareFeet', 'sq_ft'),
    ('dd_site', 'dd_site')
]
old_cols, new_cols = zip(*col_map)
df_bldgs2 = df_bldgs[list(old_cols)].copy()
df_bldgs2.columns = new_cols
df_bldgs2['onsite_gen'] = ''    # not used
df_bldgs2.to_excel('data/Buildings.xlsx', startrow=3, index=False)
df_bldgs2.head()


Out[8]:
site_id site_name site_category address city primary_func year_built sq_ft dd_site onsite_gen
0 44 Library City of Port Lions 207 Spruce Dr PORT LIONS Office 1988.0 2000.0 PADQ
1 45 Public Safety Building City of Port Lions 441 Bayview PORT LIONS Office 1980.0 1700.0 PADQ
2 46 Water Building City of Port Lions 443 Bayview Dr PORT LIONS Other 2008.0 1200.0 PADQ
3 47 City Hall City of Soldotna 177 N. Birch SOLDOTNA Office 1986.0 11676.0 PAEN
4 48 Public Library City of Soldotna 235 N. Binkley SOLDOTNA Other 1972.0 10727.0 PAEN

In [9]:
# Now work on the detailed records
print(len(df_bldgs2.site_id.unique()))
my_data = {'username': my_username,
           'password':my_password,
           'buildingId':44}

detail = requests.post(building_energy_url, data=my_data).json()
df_detail = pd.DataFrame(detail['BuildingEnergyDetailList'])
df_detail['UsageDate'] = pd.to_datetime(df_detail.UsageDate)
df_detail.head()


2767
Out[9]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeId EnergyTypeName EnergyUnitId EnergyUnitTypeName MeterReadDate UsageDate UsageYear
0 44 None None 81.0 412.0 1 Electric 1 kWh None 2009-01-01 2009
1 44 None None 99.0 538.0 1 Electric 1 kWh None 2009-02-01 2009
2 44 None None 81.0 432.0 1 Electric 1 kWh None 2009-03-01 2009
3 44 None None 78.0 406.0 1 Electric 1 kWh None 2009-04-01 2009
4 44 None None 116.0 771.0 1 Electric 1 kWh None 2009-05-01 2009

In [86]:
my_data = {'username': my_username,
           'password': my_password,
           'buildingId': None}
dfd = None
next_prn = time.time()
for bldg_id in df_bldgs2.site_id.unique():
    my_data['buildingId'] = bldg_id
    detail = requests.post(building_energy_url, data=my_data).json()
    if len(detail['BuildingEnergyDetailList']):
        df_detail = pd.DataFrame(detail['BuildingEnergyDetailList'])
        if dfd is not None:
            # DataFrame.append was removed in pandas 2.x; use pd.concat there.
            dfd = dfd.append(df_detail, ignore_index=True)
        else:
            dfd = df_detail.copy()
        if time.time() > next_prn:
            print('{:,} records fetched'.format(len(dfd)))
            next_prn += 10.0   # print progress at most every 10 seconds
dfd = dfd.apply(pd.to_numeric, errors='ignore')
dfd[['UsageDate', 'MeterReadDate']] = dfd[['UsageDate', 'MeterReadDate']].apply(pd.to_datetime)
dfd.to_pickle('dfd.pkl')


48 records fetched
2,520 records fetched
4,920 records fetched
6,886 records fetched
8,743 records fetched
[KeyboardInterrupt traceback through requests/urllib3/http.client omitted — the fetch loop was manually interrupted partway through.]

In [11]:
dfd.head()


Out[11]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeId EnergyTypeName EnergyUnitId EnergyUnitTypeName MeterReadDate UsageDate UsageYear
0 44 NaN NaN 81.0 412.0 1 Electric 1 kWh NaT 2009-01-01 2009
1 44 NaN NaN 99.0 538.0 1 Electric 1 kWh NaT 2009-02-01 2009
2 44 NaN NaN 81.0 432.0 1 Electric 1 kWh NaT 2009-03-01 2009
3 44 NaN NaN 78.0 406.0 1 Electric 1 kWh NaT 2009-04-01 2009
4 44 NaN NaN 116.0 771.0 1 Electric 1 kWh NaT 2009-05-01 2009

In [12]:
len(dfd)


Out[12]:
67787

In [40]:
dfd.to_pickle('dfd.pkl')
dfd.head()


Out[40]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeId EnergyTypeName EnergyUnitId EnergyUnitTypeName MeterReadDate UsageDate UsageYear
0 44 NaN NaN 81.0 412.0 1 Electric 1 kWh NaT 2009-01-01 2009
1 44 NaN NaN 99.0 538.0 1 Electric 1 kWh NaT 2009-02-01 2009
2 44 NaN NaN 81.0 432.0 1 Electric 1 kWh NaT 2009-03-01 2009
3 44 NaN NaN 78.0 406.0 1 Electric 1 kWh NaT 2009-04-01 2009
4 44 NaN NaN 116.0 771.0 1 Electric 1 kWh NaT 2009-05-01 2009

In [41]:
dfd.dtypes


Out[41]:
BuildingId                     int64
DemandCost                   float64
DemandUse                    float64
DollarCost                   float64
EnergyQuantity               float64
EnergyTypeId                   int64
EnergyTypeName                object
EnergyUnitId                   int64
EnergyUnitTypeName            object
MeterReadDate         datetime64[ns]
UsageDate             datetime64[ns]
UsageYear                      int64
dtype: object

Process ARIS API Data into a Form Usable by the FNSB Script


In [15]:
# All of the columns in the FNSB CSV file:
fnsb_cols = ["Site ID","Site Name","Vendor Code","Vendor Name","Account Number","Bill Date","Due Date","Entry Date","Invoice #","Voucher #","From","Thru","Service Name","Item Description","Meter Number","Usage","Cost","Units","Account Financial Code","Site Financial Code"]
fnsb_cols


Out[15]:
['Site ID',
 'Site Name',
 'Vendor Code',
 'Vendor Name',
 'Account Number',
 'Bill Date',
 'Due Date',
 'Entry Date',
 'Invoice #',
 'Voucher #',
 'From',
 'Thru',
 'Service Name',
 'Item Description',
 'Meter Number',
 'Usage',
 'Cost',
 'Units',
 'Account Financial Code',
 'Site Financial Code']

In [3]:
dfd = pd.read_pickle('dfd.pkl')
dfd = dfd.apply(pd.to_numeric, errors='ignore')
dfd[['UsageDate', 'MeterReadDate']] = dfd[['UsageDate', 'MeterReadDate']].apply(pd.to_datetime)

dfd.head()


Out[3]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeId EnergyTypeName EnergyUnitId EnergyUnitTypeName MeterReadDate UsageDate UsageYear
0 44 NaN NaN 81.0 412.0 1 Electric 1 kWh NaT 2009-01-01 2009
1 44 NaN NaN 99.0 538.0 1 Electric 1 kWh NaT 2009-02-01 2009
2 44 NaN NaN 81.0 432.0 1 Electric 1 kWh NaT 2009-03-01 2009
3 44 NaN NaN 78.0 406.0 1 Electric 1 kWh NaT 2009-04-01 2009
4 44 NaN NaN 116.0 771.0 1 Electric 1 kWh NaT 2009-05-01 2009

In [76]:
# Get rid of unneeded columns
dfd.drop(columns=['EnergyTypeId', 'EnergyUnitId', 'UsageYear'], inplace=True)
dfd.head()


Out[76]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeName EnergyUnitTypeName MeterReadDate UsageDate
0 44 NaN NaN 81.0 412.0 Electric kWh NaT 2009-01-01
1 44 NaN NaN 99.0 538.0 Electric kWh NaT 2009-02-01
2 44 NaN NaN 81.0 432.0 Electric kWh NaT 2009-03-01
3 44 NaN NaN 78.0 406.0 Electric kWh NaT 2009-04-01
4 44 NaN NaN 116.0 771.0 Electric kWh NaT 2009-05-01

In [5]:
dfd.query('EnergyTypeName=="Coal"').BuildingId.unique()


Out[5]:
array([ 1026,  1125,   423,     4,  1240,  1241, 13805], dtype=int64)

In [6]:
# The dictionary that renames the columns to names needed 
# by the benchmarking script
col_map = {
    'BuildingId': 'Site ID',
    'EnergyTypeName': 'Service Name',
    'EnergyUnitTypeName': 'Units',
    'EnergyQuantity': 'Usage',
    'DollarCost': 'Cost',
}
def add_to_final(df_to_add):
    """Rename the columns of df_to_add to the benchmarking-script names,
    drop the unneeded columns, and append the result to the global df_final."""
    global df_final
    df_add = df_to_add.copy()
    df_add.rename(columns=col_map, inplace=True)
    df_add.drop(columns=['DemandUse', 'DemandCost', 'UsageDate', 'MeterReadDate'], inplace=True)
    df_final = df_final.append(df_add, ignore_index=True)

In [7]:
# For the usage end date, 'Thru', use the MeterReadDate if available, otherwise
# use the middle of the UsageDate month.
def thru_date(row):
    if pd.isnull(row.MeterReadDate):
        return row.UsageDate.replace(day=15)
    else:
        return row.MeterReadDate
dfd['Thru'] = dfd.apply(thru_date, axis=1)
dfd.head()


Out[7]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeName EnergyUnitTypeName MeterReadDate UsageDate Thru
0 44 NaN NaN 81.0 412.0 Electric kWh NaT 2009-01-01 2009-01-15
1 44 NaN NaN 99.0 538.0 Electric kWh NaT 2009-02-01 2009-02-15
2 44 NaN NaN 81.0 432.0 Electric kWh NaT 2009-03-01 2009-03-15
3 44 NaN NaN 78.0 406.0 Electric kWh NaT 2009-04-01 2009-04-15
4 44 NaN NaN 116.0 771.0 Electric kWh NaT 2009-05-01 2009-05-15

In [8]:
dfd.query('MeterReadDate > "2000-01-01"').head()


Out[8]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeName EnergyUnitTypeName MeterReadDate UsageDate Thru
21255 1 0.00 0.00 3909.87 4085.0 Natural Gas CCF 2014-01-15 2014-01-01 2014-01-15
21256 1 3752.91 221.28 3959.78 82000.0 Electric kWh 2014-01-21 2014-01-01 2014-01-21
21257 1 0.00 0.00 2716.99 2712.0 Natural Gas CCF 2014-02-15 2014-02-01 2014-02-15
21258 1 3752.91 221.28 3921.15 81200.0 Electric kWh 2014-02-21 2014-02-01 2014-02-21
21259 1 0.00 0.00 3223.50 3295.0 Natural Gas CCF 2014-03-15 2014-03-01 2014-03-15

In [9]:
dfd.EnergyTypeName.value_counts()


Out[9]:
Electric               36813
#1 Fuel Oil            14547
Natural Gas            10815
Propane                 1560
#2 Fuel Oil             1553
Steam District Ht       1542
Hot Wtr District Ht      657
Coal                     168
Spruce Wood              120
Demand - Electric         12
Name: EnergyTypeName, dtype: int64

In [10]:
dfd.query('EnergyTypeName=="Demand - Electric"')


Out[10]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeName EnergyUnitTypeName MeterReadDate UsageDate Thru
45174 13650 0.0 0.0 0.00 0.0 Demand - Electric kWh NaT 2016-07-01 2016-07-15
45175 13650 0.0 0.0 0.00 0.0 Demand - Electric kWh NaT 2016-08-01 2016-08-15
45176 13650 0.0 0.0 0.00 0.0 Demand - Electric kWh NaT 2016-09-01 2016-09-15
45177 13650 0.0 0.0 0.00 0.0 Demand - Electric kWh NaT 2016-10-01 2016-10-15
45178 13650 0.0 0.0 0.00 0.0 Demand - Electric kWh NaT 2016-11-01 2016-11-15
45179 13650 0.0 0.0 0.00 0.0 Demand - Electric kWh NaT 2016-12-01 2016-12-15
45180 13650 0.0 0.0 440.79 39039.0 Demand - Electric kWh NaT 2016-01-01 2016-01-15
45181 13650 0.0 0.0 0.00 0.0 Demand - Electric kWh NaT 2016-02-01 2016-02-15
45182 13650 0.0 0.0 0.00 0.0 Demand - Electric kWh NaT 2016-03-01 2016-03-15
45183 13650 0.0 0.0 0.00 0.0 Demand - Electric kWh NaT 2016-04-01 2016-04-15
45184 13650 0.0 0.0 0.00 0.0 Demand - Electric kWh NaT 2016-05-01 2016-05-15
45185 13650 0.0 0.0 0.00 0.0 Demand - Electric kWh NaT 2016-06-01 2016-06-15

In [11]:
# Change these to 'Electric'
dfd.loc[dfd.EnergyTypeName == 'Demand - Electric', 'EnergyTypeName'] = 'Electric'
dfd.EnergyTypeName.value_counts()


Out[11]:
Electric               36825
#1 Fuel Oil            14547
Natural Gas            10815
Propane                 1560
#2 Fuel Oil             1553
Steam District Ht       1542
Hot Wtr District Ht      657
Coal                     168
Spruce Wood              120
Name: EnergyTypeName, dtype: int64

In [12]:
# There are a number of records where the EnergyQuantity is 0 or NaN,
# which probably occurs because someone doesn't have the bill for that 
# month or there was no fuel fill-up in that month.  We will eliminate
# those records, because they distort the period over which fuel usage
# occurred for sporadically bought fuels like oil and wood.  For 
# monthly-billed fuels, we will later in the code make sure that the 
# From - Thru billing period only covers 1 month.

# Start by converting 0s to NaN to make future tests easier.
dfd.loc[dfd.EnergyQuantity == 0.0, 'EnergyQuantity'] = np.NaN
dfd.loc[dfd.DemandUse == 0.0, 'DemandUse'] = np.NaN

# Also found that there were a bunch of -1.0 values for DemandUse that
# are very likely not valid.
dfd.loc[dfd.DemandUse == -1.0, 'DemandUse'] = np.NaN

dfd.query('(EnergyQuantity == 0.0) or (DemandUse == 0.0) or (DemandUse == -1.0)')   # should be no records


Out[12]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeName EnergyUnitTypeName MeterReadDate UsageDate Thru

In [13]:
# Now filter down to just the records where we have a number for
# either EnergyQuantity or DemandUse.  
print(len(dfd))
mask = ~(dfd.EnergyQuantity.isnull() & dfd.DemandUse.isnull())
dfd = dfd[mask].copy()
print(len(dfd))


67787
56873

In [14]:
# These have Demand but no energy
dfd.query('EnergyQuantity != EnergyQuantity').describe()


Out[14]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity
count 53.000000 53.000000 53.000000 53.000000 0.0
mean 813.698113 177.547925 4.778302 43.115094 NaN
std 2.197769 64.323284 2.938123 6.566051 NaN
min 798.000000 1.510000 0.240000 15.180000 NaN
25% 814.000000 182.510000 2.130000 38.380000 NaN
50% 814.000000 182.510000 6.800000 42.420000 NaN
75% 814.000000 237.360000 8.000000 48.730000 NaN
max 814.000000 237.360000 8.000000 52.760000 NaN

In [27]:
# Fill out the From date by using the Thru date from the prior bill 
# for the building and for the particular fuel type
df_final = None
for gp, recs in dfd.groupby(['BuildingId', 'EnergyTypeName']):
    recs = recs.sort_values(['Thru']).copy()
    # Start date comes from prior record
    recs['From'] = recs.Thru.shift(1)
    recs['Item Description'] =  'Energy'
    if df_final is None:
        df_final = recs.copy()
    else:
        df_final = df_final.append(recs, ignore_index=True)
len(df_final)


Out[27]:
56873

In [49]:
# For the services that are normally billed on a monthly basis, fill out
# any missing From dates (e.g. the first bill for a building) with a value
# 30 days prior to Thru.  Also, restrict the Thru - From difference to 25 to 35 days.
# If it is outside that range, set to Thru - 30 days.

# Fuel types that are normally billed on a monthly basis
mo_fuels = ['Electric', 'Natural Gas', 'Steam District Ht', 'Hot Wtr District Ht']
mask_mo = df_final.EnergyTypeName.isin(mo_fuels)

# Find records of that type that have NaT for From date and 
# set to 30 days prior to Thru
df_final.loc[mask_mo & df_final.From.isnull(), 'From'] = df_final.Thru - timedelta(days=30)

In [45]:
# Now find any records where Thru - From falls outside the 25 - 35 day window and fix those.
# Perhaps they are buildings where there are two separate electric bills.
bill_len = df_final.Thru - df_final.From
mask2 = mask_mo & ((bill_len < timedelta(days=25)) | (bill_len > timedelta(days=35)))
df_final.loc[mask2, 'From'] = df_final.Thru - timedelta(days=30)

In [47]:
print(len(df_final[mask2]))
df_final[mask2].head()


2243
Out[47]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeName EnergyUnitTypeName MeterReadDate UsageDate Thru From Item Description
24 1 NaN NaN 10561.95 91200.0 Electric kWh NaT 2011-03-01 2011-03-15 2011-02-13 Energy
34 1 2973.58 NaN 4814.98 133600.0 Electric kWh NaT 2012-03-01 2012-03-15 2012-02-14 Energy
56 1 3752.91 221.28 3959.78 82000.0 Electric kWh 2014-01-21 2014-01-01 2014-01-21 2013-12-22 Energy
128 1 NaN NaN 653.77 395.0 Natural Gas CCF NaT 2012-04-01 2012-04-15 2012-03-16 Energy
164 1 0.00 NaN 4043.89 4145.0 Natural Gas CCF 2015-04-17 2015-04-01 2015-04-17 2015-03-18 Energy

In [57]:
# Now work on the fuel types that are not billed monthly. Some of these records
# have NaT for the From date because they were the first record for the building
# and a particular fuel type.  We will ultimately delete these.  In this step
# find sporadically billed records that have a billing length of greater than 450
# days and put a NaT in for From, so that deleting all From==NaT records will catch
# them as well. A billing period of more than 450 days probably indicates that a fuel
# fill was missed, making the record invalid.
mask_sporadic = ~mask_mo
mask3 = mask_sporadic & (bill_len > timedelta(days=450))
df_final.loc[mask3, 'From'] = pd.NaT
len(df_final)


Out[57]:
56873

In [59]:
# Now eliminate all the sporadically billed records that have a From
# with a NaT
mask_elim = (mask_sporadic & df_final.From.isnull())
df_final = df_final[~mask_elim].copy()
len(df_final)


Out[59]:
56085

In [60]:
# Now add the Electric Demand Charge records.  The From-Thru dates on these
# have already been set. The demand quantity and cost
# appear in separate, dedicated columns, but we will move them to the 'EnergyQuantity'
# and 'DollarCost' columns.
df_demand = df_final.query('DemandUse > 0 and EnergyTypeName=="Electric"').copy()
df_demand['EnergyQuantity'] = df_demand.DemandUse
df_demand['DollarCost'] =  df_demand.DemandCost
df_demand['EnergyUnitTypeName'] = 'kW'
df_demand['Item Description'] = 'Demand Charge'

# add these to the final DataFrame
df_final = df_final.append(df_demand, ignore_index=True)
len(df_final)


Out[60]:
59965

In [61]:
df_final.columns


Out[61]:
Index(['BuildingId', 'DemandCost', 'DemandUse', 'DollarCost', 'EnergyQuantity',
       'EnergyTypeName', 'EnergyUnitTypeName', 'MeterReadDate', 'UsageDate',
       'Thru', 'From', 'Item Description'],
      dtype='object')

In [62]:
# Eliminate the columns that are not needed
df_final.drop(columns=['DemandCost', 'DemandUse', 'MeterReadDate', 'UsageDate'], inplace=True)
df_final.head()


Out[62]:
BuildingId DollarCost EnergyQuantity EnergyTypeName EnergyUnitTypeName Thru From Item Description
0 1 10472.0 129800.0 Electric kWh 2009-01-15 2008-12-16 Energy
1 1 9768.0 117600.0 Electric kWh 2009-02-15 2009-01-15 Energy
2 1 11883.0 116000.0 Electric kWh 2009-03-15 2009-02-15 Energy
3 1 15421.0 158800.0 Electric kWh 2009-04-15 2009-03-15 Energy
4 1 13802.0 137400.0 Electric kWh 2009-05-15 2009-04-15 Energy

In [65]:
df_final.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59965 entries, 0 to 59964
Data columns (total 8 columns):
BuildingId            59965 non-null int64
DollarCost            58769 non-null float64
EnergyQuantity        59912 non-null float64
EnergyTypeName        59965 non-null object
EnergyUnitTypeName    59965 non-null object
Thru                  59965 non-null datetime64[ns]
From                  59965 non-null datetime64[ns]
Item Description      59965 non-null object
dtypes: datetime64[ns](2), float64(2), int64(1), object(3)
memory usage: 3.7+ MB

In [66]:
col_map = {
    'BuildingId': 'Site ID',
    'EnergyTypeName': 'Service Name',
    'EnergyUnitTypeName': 'Units',
    'EnergyQuantity': 'Usage',
    'DollarCost': 'Cost',
}
df_final.rename(col_map, axis=1, inplace=True)
df_final.head()


Out[66]:
Site ID Cost Usage Service Name Units Thru From Item Description
0 1 10472.0 129800.0 Electric kWh 2009-01-15 2008-12-16 Energy
1 1 9768.0 117600.0 Electric kWh 2009-02-15 2009-01-15 Energy
2 1 11883.0 116000.0 Electric kWh 2009-03-15 2009-02-15 Energy
3 1 15421.0 158800.0 Electric kWh 2009-04-15 2009-03-15 Energy
4 1 13802.0 137400.0 Electric kWh 2009-05-15 2009-04-15 Energy

In [67]:
df_final['Units'].value_counts()


Out[67]:
kWh                33976
Gallon             10275
CCF                 9660
kW                  3880
thousand pounds     1377
Btu                  609
Ton                   96
Cord                  92
Name: Units, dtype: int64

In [68]:
df_final['Service Name'].value_counts()


Out[68]:
Electric               37856
Natural Gas             9660
#1 Fuel Oil             8918
Steam District Ht       1377
#2 Fuel Oil              900
Hot Wtr District Ht      609
Propane                  457
Coal                      96
Spruce Wood               92
Name: Service Name, dtype: int64

In [69]:
df_final.dtypes


Out[69]:
Site ID                      int64
Cost                       float64
Usage                      float64
Service Name                object
Units                       object
Thru                datetime64[ns]
From                datetime64[ns]
Item Description            object
dtype: object

In [70]:
set(zip(df_final['Service Name'], df_final['Units']))


Out[70]:
{('#1 Fuel Oil', 'Gallon'),
 ('#2 Fuel Oil', 'Gallon'),
 ('Coal', 'Ton'),
 ('Electric', 'kW'),
 ('Electric', 'kWh'),
 ('Hot Wtr District Ht', 'Btu'),
 ('Natural Gas', 'CCF'),
 ('Propane', 'Gallon'),
 ('Spruce Wood', 'Cord'),
 ('Steam District Ht', 'thousand pounds')}

In [71]:
df_final['Item Description'].value_counts()


Out[71]:
Energy           56085
Demand Charge     3880
Name: Item Description, dtype: int64

In [72]:
# These fields are used in the report summarizing vendors.
df_final['Account Number'] = ''
df_final['Vendor Name'] = ''

In [73]:
df_final.groupby('Service Name').sum()


Out[73]:
Site ID Cost Usage
Service Name
#1 Fuel Oil 61030333 4.069266e+07 1.402547e+07
#2 Fuel Oil 5628457 3.815718e+06 1.701116e+06
Coal 381679 9.760000e+03 7.705000e+02
Electric 225508046 1.660733e+08 1.105300e+09
Hot Wtr District Ht 2494589 1.556914e+06 2.750714e+10
Natural Gas 28425111 4.985783e+07 5.107488e+07
Propane 1830040 4.876821e+05 2.641949e+05
Spruce Wood 376717 -2.300000e+01 6.200000e+02
Steam District Ht 13588686 6.975902e+06 8.634288e+07

In [74]:
df_final.to_pickle('df_final.pkl')
df_final.to_csv('df_final.csv')

In [78]:
# FNSB Processed Data
dfu3 = pd.read_pickle('../df_processed.pkl')
dfu3.head()


Out[78]:
site_id service_type cal_year cal_mo item_desc units cost days_served usage fiscal_year fiscal_mo mmbtu
0 03 Electricity 2005 12 Electricity charge kWh 1904.657880 49.5 14790.748577 2006 6 50.466034
1 03 Electricity 2006 1 Electricity charge kWh 5430.493797 93.0 42665.790911 2006 7 145.575679
2 03 Electricity 2006 2 Electricity charge kWh 5764.406730 84.0 45010.439348 2006 8 153.575619
3 03 Electricity 2006 3 Electricity charge kWh 6349.255299 93.0 46311.547557 2006 9 158.015000
4 03 Electricity 2006 4 Electricity charge kWh 5529.385224 90.0 40392.812893 2006 10 137.820278

In [79]:
dfu3.service_type.value_counts()


Out[79]:
Electricity    24817
Water          16480
Oil #1         10533
Sewer          10175
Refuse          7266
Natural Gas     3113
Steam           1882
Oil #2             2
Name: service_type, dtype: int64

In [82]:
dfu3.units.value_counts()


Out[82]:
-           36661
Gallons     16831
kWh         12369
Loads        3215
kW           2237
CCF          1146
klbs          768
MMBtu         499
kVAR          214
lbs           195
kGal           93
Tons           31
kVARh           6
Cgallons        3
Name: units, dtype: int64

In [91]:
df_raw = pd.read_pickle('../df_raw.pkl')
df_raw[df_raw['Service Name']=='Electricity'].head()


Out[91]:
Site ID Site Name Vendor Code Vendor Name Account Number Bill Date Due Date Entry Date Invoice # Voucher # From Thru Service Name Item Description Meter Number Usage Cost Units Account Financial Code Site Financial Code
22 CLXGP2 FNSB-CLRA River Park General Area VU797000 Army Corps of Engineers 01 River Park 08/01/2006 08/01/2006 11/14/2008 NaN NaN 2006-07-01 2006-07-31 Electricity Energy charge NaN 110.0 31.65 kWh 61831.0 NaN
23 CLXGP2 FNSB-CLRA River Park General Area VU797000 Army Corps of Engineers 01 River Park 08/31/2006 08/31/2006 11/14/2008 NaN NaN 2006-07-31 2006-08-31 Electricity Energy charge NaN 600.0 172.65 kWh 61831.0 NaN
24 CLXGP2 FNSB-CLRA River Park General Area VU797000 Army Corps of Engineers 01 River Park 09/30/2006 09/30/2006 11/14/2008 NaN NaN 2006-08-31 2006-09-30 Electricity Energy charge NaN 930.0 267.60 kWh 61831.0 NaN
25 CLXGP2 FNSB-CLRA River Park General Area VU797000 Army Corps of Engineers 01 River Park 10/31/2006 10/31/2006 11/14/2008 NaN NaN 2006-09-30 2006-10-31 Electricity Energy charge NaN 0.0 0.00 NaN 61831.0 NaN
26 CLXGP2 FNSB-CLRA River Park General Area VU797000 Army Corps of Engineers 01 River Park 11/30/2006 11/30/2006 11/14/2008 NaN NaN 2006-10-31 2006-11-30 Electricity Energy charge NaN 0.0 0.00 NaN 61831.0 NaN

In [2]:
df_mo = pd.read_pickle('../df_mo.pkl')
df_mo.head()


Out[2]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeName EnergyUnitTypeName MeterReadDate UsageDate Thru
0 44 NaN NaN 81.0 412.0 Electric kWh NaT 2009-01-01 2009-01-15
1 44 NaN NaN 99.0 538.0 Electric kWh NaT 2009-02-01 2009-02-15
2 44 NaN NaN 81.0 432.0 Electric kWh NaT 2009-03-01 2009-03-15
3 44 NaN NaN 78.0 406.0 Electric kWh NaT 2009-04-01 2009-04-15
4 44 NaN NaN 116.0 771.0 Electric kWh NaT 2009-05-01 2009-05-15

In [18]:
frames = []
for gp, recs in df_mo.groupby(['BuildingId', 'EnergyTypeName']):
    # Drop records with neither cost nor usage
    recs = recs.query('(DollarCost > 0) or (EnergyQuantity > 0)').copy()
    if len(recs) == 0:
        continue
    recs.sort_values(['Thru'], inplace=True)
    
    # Start date comes from prior record
    recs['From'] = recs.Thru.shift(1)
    recs['Item Description'] = 'Energy'
    recs.loc[recs.From.isnull(), 'From'] = recs.Thru - timedelta(days=30)

    # If any of the billing periods are longer than 45 days, then probably
    # a missing prior bill.  Just make those periods 30 days long.
    mask = ((recs.Thru - recs.From) > timedelta(days=45))
    recs.loc[mask, 'From'] = recs.Thru - timedelta(days=30)
    
    frames.append(recs)

# Accumulate the pieces in a list and concatenate once; growing a DataFrame
# with append() in a loop is quadratic and deprecated in newer pandas.
dfmf = pd.concat(frames, ignore_index=True)
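The same From-date logic can also be expressed without the Python-level loop, using a grouped `shift`. A sketch on a stand-in frame (column names from `df_mo`, data made up):

```python
import pandas as pd

# Stand-in for df_mo with one building/fuel; values are made up
df = pd.DataFrame({
    'BuildingId': [1, 1, 1],
    'EnergyTypeName': ['Electric'] * 3,
    'DollarCost': [100.0, 110.0, 95.0],
    'EnergyQuantity': [800.0, 900.0, 700.0],
    'Thru': pd.to_datetime(['2009-01-15', '2009-02-15', '2009-04-20']),
})

df = (df.query('(DollarCost > 0) or (EnergyQuantity > 0)')
        .sort_values(['BuildingId', 'EnergyTypeName', 'Thru']))

# Prior bill's Thru becomes this bill's From, per building and fuel
df['From'] = df.groupby(['BuildingId', 'EnergyTypeName'])['Thru'].shift(1)

# First bill in a group, or a gap over 45 days (likely a missing prior
# bill): fall back to a 30-day period
fix = df['From'].isnull() | ((df['Thru'] - df['From']) > pd.Timedelta(days=45))
df.loc[fix, 'From'] = df['Thru'] - pd.Timedelta(days=30)
df['Item Description'] = 'Energy'
```

Note the third row's gap (2009-02-15 to 2009-04-20, 64 days) trips the 45-day rule, so its From is reset to 30 days before Thru, matching the loop's behavior.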

In [19]:
(dfmf.Thru - dfmf.From).value_counts()


Out[19]:
31 days    23881
30 days    16182
28 days     3257
0 days      1872
29 days      379
32 days       35
35 days       19
33 days        9
25 days        6
27 days        6
2 days         4
22 days        3
38 days        2
37 days        2
34 days        1
26 days        1
36 days        1
5 days         1
45 days        1
3 days         1
dtype: int64
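The 1,872 zero-day periods above come from consecutive records in a group sharing the same Thru date; those probably deserve either merging or the same 30-day fallback as the over-45-day gaps. A sketch of flagging them (`dfmf` mocked up):

```python
import pandas as pd

# Stand-in for dfmf; the middle row is a zero-length period
dfmf = pd.DataFrame({
    'From': pd.to_datetime(['2009-01-15', '2009-02-15', '2009-02-15']),
    'Thru': pd.to_datetime(['2009-02-15', '2009-02-15', '2009-03-15']),
})

# Zero-length billing periods: From == Thru
zero_len = (dfmf['Thru'] - dfmf['From']) == pd.Timedelta(0)
n_zero = int(zero_len.sum())
```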

In [22]:
# Note: the query result is discarded since it isn't the last expression
# in the cell; only dfmf.head() below is displayed.
dfmf.query('EnergyUnitTypeName == "kW"')
dfmf.head()


Out[22]:
BuildingId DemandCost DemandUse DollarCost EnergyQuantity EnergyTypeName EnergyUnitTypeName MeterReadDate UsageDate Thru From Item Description
0 1 NaN 233.8 10472.0 129800.0 Electric kWh NaT 2009-01-01 2009-01-15 2008-12-16 Energy
1 1 NaN 277.8 9768.0 117600.0 Electric kWh NaT 2009-02-01 2009-02-15 2009-01-15 Energy
2 1 NaN 275.6 11883.0 116000.0 Electric kWh NaT 2009-03-01 2009-03-15 2009-02-15 Energy
3 1 NaN 313.6 15421.0 158800.0 Electric kWh NaT 2009-04-01 2009-04-15 2009-03-15 Energy
4 1 NaN 308.8 13802.0 137400.0 Electric kWh NaT 2009-05-01 2009-05-15 2009-04-15 Energy