This notebook runs sanity checks on the FERC Form 1 Fuel by Plant (fbp_ferc1) output. These are the same checks that the fbp_ferc1 validation tests run under PyTest. The notebook and its visualizations are meant to be used as a diagnostic tool, to help understand what's wrong when the PyTest-based data validations fail.
In [ ]:
%load_ext autoreload
%autoreload 2
In [ ]:
import sys
import pandas as pd
import numpy as np
import sqlalchemy as sa
import pudl
import pudl.validate as pv
In [ ]:
import warnings
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]
In [ ]:
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
In [ ]:
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = (10,4)
mpl.rcParams['figure.dpi'] = 150
pd.options.display.max_columns = 56
In [ ]:
pudl_settings = pudl.workspace.setup.get_defaults()
ferc1_engine = sa.create_engine(pudl_settings['ferc1_db'])
pudl_engine = sa.create_engine(pudl_settings['pudl_db'])
pudl_settings
In [ ]:
pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine)
fbp_ferc1 = pudl_out.fbp_ferc1()
In [ ]:
import seaborn as sns
plt.figure(figsize=(6, 6))
# Compare each fuel's share of heat content with its share of cost.
sns.regplot(
    x="coal_fraction_mmbtu", y="coal_fraction_cost", data=fbp_ferc1,
    scatter_kws={"alpha": 0.05}, label="coal", color="black",
)
sns.regplot(
    x="gas_fraction_mmbtu", y="gas_fraction_cost", data=fbp_ferc1,
    scatter_kws={"alpha": 0.05}, label="gas", color="blue",
)
plt.legend()  # Show the "coal" and "gas" labels set above.
plt.xlabel("Heat Content Fraction")
plt.ylabel("Cost Fraction");
In [ ]:
print(fbp_ferc1[["gas_fraction_mmbtu", "gas_fraction_cost"]].corr().iloc[0,1])
print(fbp_ferc1[["oil_fraction_mmbtu", "oil_fraction_cost"]].corr().iloc[0,1])
print(fbp_ferc1[["coal_fraction_mmbtu", "coal_fraction_cost"]].corr().iloc[0,1])
As a sanity check of the testing process itself, we can check whether the entire historical distribution has attributes that place it within the extremes of a historical subsampling of that distribution. In this case, we sample each historical year, look at the range of values taken on by some quantile across those years, and see whether the same quantile for the whole dataset falls within that range.
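The idea can be sketched with plain pandas on synthetic data. This is only an illustration under stated assumptions: the helper `quantile_within_yearly_range`, the `report_year` column name, and the toy values are hypothetical, and the sketch uses unweighted quantiles, which may differ from what `pudl.validate` actually computes.

```python
import numpy as np
import pandas as pd


def quantile_within_yearly_range(df, data_col, quantile, year_col="report_year"):
    """Check whether the whole-sample quantile lies within the per-year range.

    Hypothetical helper for illustration; it uses unweighted quantiles.
    """
    # The same quantile, computed separately for every year.
    yearly = df.groupby(year_col)[data_col].quantile(quantile)
    # The quantile computed over all years at once.
    overall = df[data_col].quantile(quantile)
    ok = bool(yearly.min() <= overall <= yearly.max())
    return ok, yearly, overall


# Synthetic data: 10 years with 200 made-up records each.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "report_year": np.repeat(np.arange(2010, 2020), 200),
    "gas_cost_per_mmbtu": rng.lognormal(mean=1.0, sigma=0.5, size=2000),
})

ok, yearly, overall = quantile_within_yearly_range(toy, "gas_cost_per_mmbtu", 0.5)
print(f"range of yearly medians: [{yearly.min():.3f}, {yearly.max():.3f}]")
print(f"overall median: {overall:.3f}, within range: {ok}")
```

If the whole-sample quantile falls outside the range of the yearly values, that points to a problem with the test setup rather than with any single year of data.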
In [ ]:
# Recompute each fuel's cost per MMBTU so it is back in the dataframe (used only for sanity checking):
for fuel in ["gas", "oil", "coal", "waste", "nuclear"]:
    fbp_ferc1[f"{fuel}_cost_per_mmbtu"] = (
        (fbp_ferc1[f"{fuel}_fraction_cost"] * fbp_ferc1["fuel_cost"])
        / (fbp_ferc1[f"{fuel}_fraction_mmbtu"] * fbp_ferc1["fuel_mmbtu"])
    )
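A small worked example may make the arithmetic above easier to follow. The two-row table below is made up for illustration. Replacing a zero denominator with NaN is an assumption added here so that plants burning none of a fuel come out as missing; it is not something the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical two-plant table; column names follow the pattern used above.
toy = pd.DataFrame({
    "fuel_cost": [1000.0, 500.0],      # total fuel cost per plant
    "fuel_mmbtu": [400.0, 100.0],      # total heat content per plant
    "gas_fraction_cost": [0.25, 0.0],  # share of cost attributable to gas
    "gas_fraction_mmbtu": [0.5, 0.0],  # share of heat content from gas
})

gas_cost = toy["gas_fraction_cost"] * toy["fuel_cost"]      # cost of gas alone
gas_mmbtu = toy["gas_fraction_mmbtu"] * toy["fuel_mmbtu"]   # heat content of gas alone

# Assumption for this sketch: a zero denominator means "no gas burned",
# so the cost per MMBTU is treated as missing rather than infinite.
toy["gas_cost_per_mmbtu"] = gas_cost / gas_mmbtu.replace(0, np.nan)
print(toy[["gas_cost_per_mmbtu"]])
```

For the first plant, gas accounts for 250 of the 1000 in cost and 200 of the 400 MMBTU, so the cost per MMBTU is 250 / 200. The second plant burns no gas, so its value is missing.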
In [ ]:
pv.plot_vs_self(fbp_ferc1, pv.fbp_ferc1_self)
Some of the variables reported in this table have a fixed range of reasonable values, like the heat content per unit of a given fuel type. These variables can be tested for validity against external standards directly. In general we have two kinds of tests in this section:
In [ ]:
pv.plot_vs_bounds(fbp_ferc1, pv.fbp_ferc1_gas_cost_per_mmbtu_bounds)
pv.plot_vs_bounds(fbp_ferc1, pv.fbp_ferc1_oil_cost_per_mmbtu_bounds)
pv.plot_vs_bounds(fbp_ferc1, pv.fbp_ferc1_coal_cost_per_mmbtu_bounds)
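As a rough illustration of testing a variable against fixed bounds, the sketch below checks whether a chosen quantile of a column falls between a lower and an upper limit. The helper `quantile_in_bounds`, the sample values, and the bounds are all hypothetical; they are not the bounds stored in `pudl.validate`, and the quantiles here are unweighted.

```python
import pandas as pd


def quantile_in_bounds(series, quantile, low=None, high=None):
    """Return (passed, value) for one quantile tested against optional bounds.

    Hypothetical helper for illustration; uses an unweighted quantile.
    """
    value = series.dropna().quantile(quantile)
    ok_low = low is None or value >= low
    ok_high = high is None or value <= high
    return bool(ok_low and ok_high), value


# Made-up costs per MMBTU, including one implausibly large record.
costs = pd.Series([2.0, 2.5, 3.0, 3.5, 4.0, 40.0], name="gas_cost_per_mmbtu")

# Is the median inside an illustrative band?
median_ok, median_value = quantile_in_bounds(costs, 0.5, low=2.0, high=4.0)
# Is the 95th percentile below an illustrative ceiling?
tail_ok, tail_value = quantile_in_bounds(costs, 0.95, high=10.0)

print(f"median={median_value}, ok={median_ok}")
print(f"95th percentile={tail_value}, ok={tail_ok}")
```

Here the median passes while the single outlier pushes the 95th percentile above the ceiling, which is the kind of failure the plots above help to diagnose.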
In [ ]: