Extract facility generation and fuel use data

This notebook creates dataframes with monthly facility generation and fuel use data, merges them, and exports the results. The code assumes that you have already downloaded an ELEC.txt file from EIA's bulk download website.



In [1]:

    
import json
import pandas as pd
import os
from os.path import join
import numpy as np
from joblib import Parallel, delayed
import sys

cwd = os.getcwd()
data_path = join(cwd, '..', 'Data storage')

Date string for filenames

This will be inserted into all filenames (reading and writing)



In [ ]:

    
file_date = '2018-03-06'



In [2]:

    
%load_ext watermark



In [3]:

    
%watermark -iv -v









    



json        2.0.9
pandas      0.22.0
numpy       1.14.2
CPython 3.6.4
IPython 6.2.1



In [4]:

    
# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1



In [5]:

    
# add the 'src' directory as one where we can import modules
src_dir = join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)



In [6]:

    
%aimport Data.data_extraction
from Data.data_extraction import facility_line_to_df

%aimport Analysis.index
from Analysis.index import add_datetime, add_quarter

Read `ELEC.txt` file

Download the most current file from EIA's bulk download site. Save it to \Data storage\Raw data. I've tried to do this via the requests library, but the data file often gets corrupted.



In [7]:

    
path = join(data_path, 'Raw EIA bulk', '{} ELEC.txt'.format(file_date))
with open(path, 'r') as f:
    raw_txt = f.readlines()

Filter lines to only include facility generation

Include ELEC.PLANT in the series_id
Include "All" as the only allowable prime mover
- Some facilities have incorrect data at the individual prime mover level
Do not include "All" as a fuel code
Only monthly frequency



In [8]:

    
gen_rows = [row for row in raw_txt if 'ELEC.PLANT.GEN' in row 
            and 'series_id' in row 
            and 'ALL.M' in row 
            and 'ALL-' not in row]
total_fuel_rows = [row for row in raw_txt if 'ELEC.PLANT.CONS_TOT_BTU' in row 
                   and 'series_id' in row 
                   and 'ALL.M' in row 
                   and 'ALL-' not in row]
eg_fuel_rows = [row for row in raw_txt if 'ELEC.PLANT.CONS_EG_BTU' in row 
                and 'series_id' in row 
                and 'ALL.M' in row 
                and 'ALL-' not in row]

Combine generation into one large dataframe



In [9]:

    
if __name__ == '__main__':
    exception_list = []
    facility_gen = pd.concat(Parallel(n_jobs=-1)(delayed(facility_line_to_df)(json.loads(row)) for row in gen_rows))
    facility_gen.reset_index(drop=True, inplace=True)
    facility_gen.rename({'value':'generation (MWh)'}, axis=1, inplace=True)



In [10]:

    
facility_gen.loc[:,'lat'] = facility_gen.loc[:,'lat'].astype(float)
facility_gen.loc[:,'lon'] = facility_gen.loc[:,'lon'].astype(float)
facility_gen.loc[:, 'plant id'] = facility_gen.loc[:, 'plant id'].astype(int)



In [11]:

    
#drop
facility_gen.tail()









    Out[11]:







  
    
      
      f
      fuel
      geography
      last_updated
      lat
      lon
      month
      plant id
      prime mover
      series_id
      units
      generation (MWh)
      year
    
  
  
    
      1645711
      M
      SC
      USA-FL
      2016-07-07T17:18:42-04:00
      29.733056
      -81.632778
      5
      136
      ALL
      ELEC.PLANT.GEN.136-SC-ALL.M
      megawatthours
      111269.0
      2003
    
    
      1645712
      M
      SC
      USA-FL
      2016-07-07T17:18:42-04:00
      29.733056
      -81.632778
      4
      136
      ALL
      ELEC.PLANT.GEN.136-SC-ALL.M
      megawatthours
      0.0
      2003
    
    
      1645713
      M
      SC
      USA-FL
      2016-07-07T17:18:42-04:00
      29.733056
      -81.632778
      3
      136
      ALL
      ELEC.PLANT.GEN.136-SC-ALL.M
      megawatthours
      131711.0
      2003
    
    
      1645714
      M
      SC
      USA-FL
      2016-07-07T17:18:42-04:00
      29.733056
      -81.632778
      2
      136
      ALL
      ELEC.PLANT.GEN.136-SC-ALL.M
      megawatthours
      127918.0
      2003
    
    
      1645715
      M
      SC
      USA-FL
      2016-07-07T17:18:42-04:00
      29.733056
      -81.632778
      1
      136
      ALL
      ELEC.PLANT.GEN.136-SC-ALL.M
      megawatthours
      112533.0
      2003

Combine total fuel use into one large dataframe



In [12]:

    
if __name__ == '__main__':
    exception_list = []
    facility_all_fuel = pd.concat(Parallel(n_jobs=-1)(delayed(facility_line_to_df)(json.loads(row)) for row in total_fuel_rows))
    facility_all_fuel.reset_index(drop=True, inplace=True)
    facility_all_fuel.rename({'value':'total fuel (mmbtu)'}, axis=1, inplace=True)



In [13]:

    
facility_all_fuel.loc[:,'lat'] = facility_all_fuel.loc[:,'lat'].astype(float)
facility_all_fuel.loc[:,'lon'] = facility_all_fuel.loc[:,'lon'].astype(float)
facility_all_fuel.loc[:,'plant id'] = facility_all_fuel.loc[:,'plant id'].astype(int)

Combine total fuel use for electricity into one large dataframe



In [14]:

    
if __name__ == '__main__':
    exception_list = []
    facility_eg_fuel = pd.concat(Parallel(n_jobs=-1)(delayed(facility_line_to_df)(json.loads(row)) for row in eg_fuel_rows))
    facility_eg_fuel.reset_index(drop=True, inplace=True)
    facility_eg_fuel.rename({'value':'elec fuel (mmbtu)'}, axis=1, inplace=True)



In [15]:

    
facility_eg_fuel.loc[:,'lat'] = facility_eg_fuel.loc[:,'lat'].astype(float)
facility_eg_fuel.loc[:,'lon'] = facility_eg_fuel.loc[:,'lon'].astype(float)
facility_eg_fuel.loc[:,'plant id'] = facility_eg_fuel.loc[:,'plant id'].astype(int)

Merge dataframes

Need to be careful here because there are fuel/prime mover combinations that have generation but no fuel use (e.g. the steam cycle of a combined cycle system - but only in some cases).



In [16]:

    
keep_cols = ['fuel', 'generation (MWh)', 'month', 'plant id', 'prime mover', 'year',
             'geography', 'lat', 'lon', 'last_updated']
merge_cols = ['fuel', 'month', 'plant id', 'year']
gen_total_fuel = facility_all_fuel.merge(facility_gen.loc[:,keep_cols], 
                                    how='outer', on=merge_cols)

Fill in missing values from the first merge



In [17]:

    
def fill_missing(df):
    cols = [col[:-2] for col in df.columns if '_x' in col]
    
    # Create new column from the _x version, fill missing values from the _y version
    for col in cols:
        df[col] = df.loc[:, col + '_x']
        df.loc[df[col].isnull(), col] = df.loc[df[col].isnull(), col + '_y']
        
        df.drop([col+'_x', col+'_y'], axis=1, inplace=True)



In [18]:

    
fill_missing(gen_total_fuel)



In [19]:

    
keep_cols = ['fuel', 'elec fuel (mmbtu)', 'month', 'plant id', 'prime mover', 'year',
             'geography', 'lat', 'lon', 'last_updated']
all_facility_data = gen_total_fuel.merge(facility_eg_fuel.loc[:,keep_cols], 
                                         how='outer', on=merge_cols)

FIll in missing values from second merge and drop units/series_id columns



In [20]:

    
fill_missing(all_facility_data)



In [21]:

    
all_facility_data.drop(['units', 'series_id'], axis=1, inplace=True)
all_facility_data.head()









    Out[21]:







  
    
      
      f
      fuel
      month
      plant id
      total fuel (mmbtu)
      year
      generation (MWh)
      elec fuel (mmbtu)
      geography
      last_updated
      lat
      lon
      prime mover
    
  
  
    
      0
      M
      DFO
      12
      1001
      1116.0
      2017
      114.587
      1116.0
      USA-IN
      2018-02-28T02:03:13-05:00
      39.9242
      -87.4244
      ALL
    
    
      1
      M
      DFO
      11
      1001
      2772.0
      2017
      265.428
      2772.0
      USA-IN
      2018-02-28T02:03:13-05:00
      39.9242
      -87.4244
      ALL
    
    
      2
      M
      DFO
      10
      1001
      4819.0
      2017
      460.695
      4819.0
      USA-IN
      2018-02-28T02:03:13-05:00
      39.9242
      -87.4244
      ALL
    
    
      3
      M
      DFO
      9
      1001
      2720.0
      2017
      300.705
      2720.0
      USA-IN
      2018-02-28T02:03:13-05:00
      39.9242
      -87.4244
      ALL
    
    
      4
      M
      DFO
      8
      1001
      6273.0
      2017
      609.358
      6273.0
      USA-IN
      2018-02-28T02:03:13-05:00
      39.9242
      -87.4244
      ALL

Add datetime and quarter columns



In [22]:

    
add_quarter(all_facility_data)

Load emission factors

These are mostly EIA emission factors



In [23]:

    
path = join(data_path, 'Final emission factors.csv')
ef = pd.read_csv(path, index_col=0)

Apply factors to facility generation



In [24]:

    
fossil_factors = dict(zip(ef.index, ef['Fossil Factor']))
total_factors = dict(zip(ef.index, ef['Total Factor']))
fossil_factors, total_factors









    Out[24]:





({'AB': 0.0,
  'BFG': 274.32,
  'BIT': 93.3,
  'BLQ': 0.0,
  'DFO': 73.16,
  'GEO': 7.71,
  'JF': 70.9,
  'KER': 72.3,
  'LFG': 0.0,
  'LIG': 97.7,
  'MSB': 0.0,
  'MSN': 90.7,
  'MSW': 41.69,
  'NG': 53.07,
  'NUC': 0.0,
  'OBG': 0.0,
  'OBL': 0.0,
  'OBS': 0.0,
  'OG': 59.0,
  'OTH': 0.0,
  'PC': 102.1,
  'PG': 63.07,
  'PUR': 0.0,
  'RC': 93.3,
  'RFO': 78.79,
  'SC': 93.3,
  'SGC': 93.3,
  'SGP': 73.16,
  'SLW': 0.0,
  'SUB': 97.2,
  'SUN': 0.0,
  'TDF': 85.97,
  'WAT': 0.0,
  'WC': 93.3,
  'WDL': 0.0,
  'WDS': 0.0,
  'WH': 0.0,
  'WND': 0.0,
  'WO': 95.25},
 {'AB': 118.17,
  'BFG': 274.32,
  'BIT': 93.3,
  'BLQ': 94.4,
  'DFO': 73.16,
  'GEO': 7.71,
  'JF': 70.9,
  'KER': 72.3,
  'LFG': 52.17,
  'LIG': 97.7,
  'MSB': 90.7,
  'MSN': 90.7,
  'MSW': 41.69,
  'NG': 53.07,
  'NUC': 0.0,
  'OBG': 52.17,
  'OBL': 83.98,
  'OBS': 105.51,
  'OG': 59.0,
  'OTH': 0.0,
  'PC': 102.1,
  'PG': 63.07,
  'PUR': 0.0,
  'RC': 93.3,
  'RFO': 78.79,
  'SC': 93.3,
  'SGC': 93.3,
  'SGP': 73.16,
  'SLW': 83.98,
  'SUB': 97.2,
  'SUN': 0.0,
  'TDF': 85.97,
  'WAT': 0.0,
  'WC': 93.3,
  'WDL': 83.98,
  'WDS': 93.8,
  'WH': 0.0,
  'WND': 0.0,
  'WO': 95.25})

Apply emission factors

Fuel emission factor is kg/mmbtu



In [25]:

    
# Start with 0 emissions in all rows
# For fuels where we have an emission factor, replace the 0 with the calculated value
all_facility_data['all fuel fossil CO2 (kg)'] = 0
all_facility_data['elec fuel fossil CO2 (kg)'] = 0
all_facility_data['all fuel total CO2 (kg)'] = 0
all_facility_data['elec fuel total CO2 (kg)'] = 0
for fuel in total_factors.keys():
    # All fuel CO2 emissions
    all_facility_data.loc[all_facility_data['fuel']==fuel,'all fuel fossil CO2 (kg)'] = \
        all_facility_data.loc[all_facility_data['fuel']==fuel,'total fuel (mmbtu)'] * fossil_factors[fuel]      
    all_facility_data.loc[all_facility_data['fuel']==fuel,'all fuel total CO2 (kg)'] = \
        all_facility_data.loc[all_facility_data['fuel']==fuel,'total fuel (mmbtu)'] * total_factors[fuel]
    
    # Electric fuel CO2 emissions
    all_facility_data.loc[all_facility_data['fuel']==fuel,'elec fuel fossil CO2 (kg)'] = \
        all_facility_data.loc[all_facility_data['fuel']==fuel,'elec fuel (mmbtu)'] * fossil_factors[fuel]
    all_facility_data.loc[all_facility_data['fuel']==fuel,'elec fuel total CO2 (kg)'] = \
        all_facility_data.loc[all_facility_data['fuel']==fuel,'elec fuel (mmbtu)'] * total_factors[fuel]

Set nan and negative emissions to 0

When no fuel was used for electricity production, or when negative fuel is somehow reported by EIA, set the emissions to 0. This is implemented by filtering out all values that are greater than or equal to 0.



In [26]:

    
# Fossil CO2
all_facility_data.loc[~(all_facility_data['all fuel fossil CO2 (kg)']>=0),
                      'all fuel fossil CO2 (kg)'] = 0
all_facility_data.loc[~(all_facility_data['elec fuel fossil CO2 (kg)']>=0),
                      'elec fuel fossil CO2 (kg)'] = 0
# Total CO2
all_facility_data.loc[~(all_facility_data['all fuel total CO2 (kg)']>=0),
                      'all fuel total CO2 (kg)'] = 0
all_facility_data.loc[~(all_facility_data['elec fuel total CO2 (kg)']>=0),
                      'elec fuel total CO2 (kg)'] = 0

Export



In [27]:

    
path = join(data_path, 'Derived data',
            'Facility gen fuels and CO2 {}.csv'.format(file_date))
all_facility_data.to_csv(path, index=False)

	f	fuel	geography	last_updated	lat	lon	month	plant id	prime mover	series_id	units	generation (MWh)	year
1645711	M	SC	USA-FL	2016-07-07T17:18:42-04:00	29.733056	-81.632778	5	136	ALL	ELEC.PLANT.GEN.136-SC-ALL.M	megawatthours	111269.0	2003
1645712	M	SC	USA-FL	2016-07-07T17:18:42-04:00	29.733056	-81.632778	4	136	ALL	ELEC.PLANT.GEN.136-SC-ALL.M	megawatthours	0.0	2003
1645713	M	SC	USA-FL	2016-07-07T17:18:42-04:00	29.733056	-81.632778	3	136	ALL	ELEC.PLANT.GEN.136-SC-ALL.M	megawatthours	131711.0	2003
1645714	M	SC	USA-FL	2016-07-07T17:18:42-04:00	29.733056	-81.632778	2	136	ALL	ELEC.PLANT.GEN.136-SC-ALL.M	megawatthours	127918.0	2003
1645715	M	SC	USA-FL	2016-07-07T17:18:42-04:00	29.733056	-81.632778	1	136	ALL	ELEC.PLANT.GEN.136-SC-ALL.M	megawatthours	112533.0	2003

	f	fuel	month	plant id	total fuel (mmbtu)	year	generation (MWh)	elec fuel (mmbtu)	geography	last_updated	lat	lon	prime mover
0	M	DFO	12	1001	1116.0	2017	114.587	1116.0	USA-IN	2018-02-28T02:03:13-05:00	39.9242	-87.4244	ALL
1	M	DFO	11	1001	2772.0	2017	265.428	2772.0	USA-IN	2018-02-28T02:03:13-05:00	39.9242	-87.4244	ALL
2	M	DFO	10	1001	4819.0	2017	460.695	4819.0	USA-IN	2018-02-28T02:03:13-05:00	39.9242	-87.4244	ALL
3	M	DFO	9	1001	2720.0	2017	300.705	2720.0	USA-IN	2018-02-28T02:03:13-05:00	39.9242	-87.4244	ALL
4	M	DFO	8	1001	6273.0	2017	609.358	6273.0	USA-IN	2018-02-28T02:03:13-05:00	39.9242	-87.4244	ALL