Validation of FERC Form 1 Large Steam Plants

This notebook runs sanity checks on the FERC Form 1 large steam plants table (plants_steam_ferc1). These are the same checks that the PyTest-based plants_steam_ferc1 validation tests run. The notebook and its visualizations are meant to be used as a diagnostic tool, to help understand what's wrong when the PyTest-based data validations fail.



In [ ]:

    
%load_ext autoreload
%autoreload 2



In [ ]:

    
import sys
import pandas as pd
import sqlalchemy as sa
import pudl



In [ ]:

    
import warnings
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]



In [ ]:

    
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline



In [ ]:

    
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = (10,4)
mpl.rcParams['figure.dpi'] = 150
pd.options.display.max_columns = 56



In [ ]:

    
pudl_settings = pudl.workspace.setup.get_defaults()
ferc1_engine = sa.create_engine(pudl_settings['ferc1_db'])
pudl_engine = sa.create_engine(pudl_settings['pudl_db'])
pudl_settings

Pull `plants_steam_ferc1` and calculate some useful values

First we pull the original (post-ETL) FERC 1 large plants data out of the PUDL database using an output object. The FERC Form 1 data only exists at annual resolution, so there's no inter-frequency aggregation to think about.



In [ ]:

    
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine)
plants_steam_ferc1 = (
    pudl_out.plants_steam_ferc1().
    assign(
        water_limited_ratio=lambda x: x.water_limited_capacity_mw / x.capacity_mw,
        not_water_limited_ratio=lambda x: x.not_water_limited_capacity_mw / x.capacity_mw,
        peak_demand_ratio=lambda x: x.peak_demand_mw / x.capacity_mw,
        capability_ratio=lambda x: x.plant_capability_mw / x.capacity_mw,
    )
)

Validating Historical Distributions

As a sanity check of the testing process itself, we can check whether the entire historical distribution has attributes that place it within the extremes of a historical subsampling of that distribution. In this case, we sample each historical year, look at the range of values that a given quantile takes on across those years, and check whether the same quantile for the dataset as a whole fits within that range.



In [ ]:

    
pudl.validate.plot_vs_self(plants_steam_ferc1, pudl.validate.plants_steam_ferc1_self)

Validation Against Fixed Bounds

Some of the variables reported in this table have a fixed range of reasonable values, like the heat content per unit of a given fuel type. These variables can be tested for validity against external standards directly. In general we have two kinds of tests in this section:

Tails: Are the extreme values too extreme? Typically this is tested at the 5% and 95% levels, but depending on the distribution, other thresholds are sometimes used.
Middle: Is the central value of the distribution where it should be?
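The two kinds of tests described above can be sketched roughly as follows. This is an illustration under stated assumptions, not how pudl.validate implements them: the function check_tails_and_middle, its parameters, and the bounds used here are hypothetical, and the data is synthetic.

```python
import pandas as pd


def check_tails_and_middle(values, low_bound, high_bound, mid_low, mid_high,
                           low_q=0.05, high_q=0.95):
    """Compare the tails and the median of a Series against fixed bounds.

    Hypothetical helper for illustration only.
    """
    return {
        # Tails: the low quantile should not fall below the lower bound...
        "low_tail_ok": bool(values.quantile(low_q) >= low_bound),
        # ...and the high quantile should not exceed the upper bound.
        "high_tail_ok": bool(values.quantile(high_q) <= high_bound),
        # Middle: the median should land inside the expected central range.
        "middle_ok": bool(mid_low <= values.median() <= mid_high),
    }


# Synthetic values 1..100, not real plant data.
toy_values = pd.Series(range(1, 101), dtype=float)

passing = check_tails_and_middle(toy_values, low_bound=5, high_bound=96,
                                 mid_low=40, mid_high=60)
# Tightening the upper bound makes only the high-tail check fail.
failing = check_tails_and_middle(toy_values, low_bound=5, high_bound=90,
                                 mid_low=40, mid_high=60)
print(passing)
print(failing)
```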

Plant Capacities



In [ ]:

    
pudl.validate.plot_vs_bounds(plants_steam_ferc1, pudl.validate.plants_steam_ferc1_capacity)

CapEx & OpEx



In [ ]:

    
pudl.validate.plot_vs_bounds(plants_steam_ferc1, pudl.validate.plants_steam_ferc1_expenses)

Plant Capacity Ratios



In [ ]:

    
pudl.validate.plot_vs_bounds(plants_steam_ferc1, pudl.validate.plants_steam_ferc1_capacity_ratios)

Plant Connected Hours

Currently expected to fail: ~10% of all plants report more than 8760 hours, which is the number of hours in a non-leap year.



In [ ]:

    
pudl.validate.plot_vs_bounds(plants_steam_ferc1, pudl.validate.plants_steam_ferc1_connected_hours)
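To see how large the problem with connected hours is, one option is to compute the share of reported values that exceed the hours in a year. The sketch below uses toy numbers; the helper fraction_over_year is hypothetical. Against the real table it would presumably be applied to plants_steam_ferc1["plant_hours_connected_while_generating"].

```python
import pandas as pd

HOURS_PER_YEAR = 8760  # 365 days * 24 hours (a leap year has 8784)


def fraction_over_year(hours):
    """Return the share of non-null values greater than HOURS_PER_YEAR.

    Hypothetical helper for illustration only.
    """
    reported = hours.dropna()
    return float((reported > HOURS_PER_YEAR).mean())


# Toy values, not real FERC Form 1 data; None stands in for a missing report.
toy_hours = pd.Series(
    [8000.0, 8760.0, 9000.0, None, 4380.0, 17520.0,
     100.0, 8759.0, 8761.0, 2000.0, 6000.0]
)

share_over = fraction_over_year(toy_hours)
print(f"{share_over:.0%} of reported values exceed {HOURS_PER_YEAR} hours")
```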

Validate an Individual Column

If there's a particular column that is failing the validation, you can check several different validation cases with something like this cell:



In [ ]:

    
testcol = "plant_hours_connected_while_generating"
self_tests = [x for x in pudl.validate.plants_steam_ferc1_self if x["data_col"] == testcol]
pudl.validate.plot_vs_self(plants_steam_ferc1, self_tests)