This notebook lays out a standard format and some best practices for creating interactive / exploratory notebooks which can be relatively easily shared between different PUDL users, and turned into reusable Python modules for integration into our underlying Python packages.
Each notebook should start with a brief explanation (in Markdown) of the purpose of the analysis, outlining the different stages / steps taken to accomplish it. As the analysis develops, you can add new steps or details to this outline.
Insofar as possible, another PUDL user who has cloned the repository containing the notebook should be able to update their pudl-dev conda environment, open the notebook, and run all cells successfully.
If there are required data or other prerequisites that the notebook cannot manage on its own -- like a file that needs to be downloaded by hand and placed in a particular location -- those steps should be laid out clearly at the beginning of the notebook.
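For example, a short check at the top of the notebook can fail early with a helpful message when a manually downloaded file is missing. This is just a sketch; the file path and download instructions are hypothetical:
In [ ]:
import pathlib

# Hypothetical example: fail fast if a manually downloaded file is missing.
required_file = pathlib.Path("~/Downloads/manually_downloaded_data.csv").expanduser()
if not required_file.exists():
    raise FileNotFoundError(
        f"This notebook requires {required_file}, which must be downloaded "
        "by hand. See the instructions at the top of the notebook."
    )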
Most of our work is done in public GitHub repositories, so no authentication information -- passwords, API keys, tokens -- should ever appear in a notebook. Instead, store those values in environment variables on your local computer.
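For instance, a minimal sketch of pulling a (hypothetical) API key out of the environment rather than hard-coding it:
In [ ]:
import os

# MY_DATA_API_KEY is a hypothetical variable set in your shell, e.g. via
# `export MY_DATA_API_KEY=...` in ~/.bashrc -- never committed to the repo.
api_key = os.environ.get("MY_DATA_API_KEY")
if api_key is None:
    raise RuntimeError("Please set the MY_DATA_API_KEY environment variable.")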
If the analysis depends on particular choices of input values, those should be called out explicitly at the beginning of the notebook. (NB: we should explore ways to parameterize notebooks; papermill is one tool that does this.)
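As a sketch of what that parameterization might look like, papermill can inject values into a notebook cell tagged "parameters" and then execute the notebook. The notebook filenames and parameter values below are hypothetical:
In [ ]:
# Run from a separate driver script or notebook. The target notebook must
# contain a cell tagged "parameters" holding the default values.
import papermill as pm

pm.execute_notebook(
    "mcoe_analysis.ipynb",
    "mcoe_analysis_2018.ipynb",
    parameters={"mcoe_start_date": "2018-01-01", "mcoe_end_date": "2018-12-31"},
)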
If anyone else is going to be able to use the notebook, the files it uses will need to be stored somewhere that makes sense both on your computer and on theirs. We assume that anyone using this template has the PUDL package installed and a local PUDL data management environment set up, so data specific to a given notebook can live under:
<PUDL_IN>/data/local/<data_source>/
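As a sketch, that path can be built with pathlib relative to the PUDL workspace root; the workspace location and data source name here are hypothetical:
In [ ]:
import pathlib

# Hypothetical illustration of the local data path convention above.
pudl_in = pathlib.Path("~/pudl_workspace").expanduser()
data_source_dir = pudl_in / "data" / "local" / "my_data_source"
data_source_dir.mkdir(parents=True, exist_ok=True)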
Consider persisting to disk the results of computations that take more than a few minutes, if the outputs are small enough to be checked into the repository and shared with other users. Then the expensive computation only needs to be re-run when the pre-computed output is not available.
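A minimal caching sketch along those lines, where run_expensive_computation and the cache file name are hypothetical stand-ins:
In [ ]:
import pathlib
import pandas as pd

# Hypothetical caching pattern: reuse the saved output if it exists,
# otherwise run the slow computation and persist the result.
cache_path = pathlib.Path("expensive_results.parquet")
if cache_path.exists():
    results_df = pd.read_parquet(cache_path)
else:
    results_df = run_expensive_computation()  # hypothetical, takes minutes
    results_df.to_parquet(cache_path)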
Don't leave lots of additional code lying around, even commented out, "just in case" you want to look at it again later. Notebooks need to be relatively linear in the end, even though the thought processes and exploratory analyses that generate them may not be. Once you have a working analysis, either prune those branches, or encapsulate them as options within functions.
Only import libraries which are required by the notebook, to avoid unnecessary dependencies.
If your analysis requires a new library that isn't yet part of the shared pudl-dev environment, add it to the devtools/environment.yml file so that others will get it when they update their environment.
Many of these guidelines are taken directly from Emily Riederer's post, RMarkdown Driven Development. For a more in-depth explanation of the motivations behind this layout, go check it out!
In [1]:
%load_ext autoreload
%autoreload 2
In [2]:
# Standard libraries
import logging
import pathlib
import sys
# 3rd party libraries
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn as sns
import sqlalchemy as sa
# Local libraries
import pudl
In [3]:
sns.set()
%matplotlib inline
mpl.rcParams['figure.figsize'] = (10,4)
mpl.rcParams['figure.dpi'] = 150
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
In [4]:
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
logger.handlers = [handler]
In many cases, the eventual product of a notebook analysis is the creation of new, reusable functions that are integrated into the underlying PUDL code. Begin accumulating and organizing those functions as soon as you notice repeated patterns in your code. Once they stabilize, consider moving them into a module in a src directory in the same directory as the notebook. You will want to import this module at the top of the notebook, so that the autoreload extension picks up any changes you make to it.
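A minimal sketch of importing such a local module, assuming the notebook runs from the directory containing src (the module name is hypothetical):
In [ ]:
import sys

# Make the local src directory importable from the notebook.
sys.path.insert(0, "src")
# With %autoreload 2 active, edits to this (hypothetical) module are
# picked up automatically the next time a cell runs.
import mcoe_helpers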
In [5]:
def mcoe_by_fuel(mcoe_df, fuel_type=None):
    """
    Select only MCOE records pertaining to a particular fuel type.

    Args:
        mcoe_df (pandas.DataFrame): A PUDL MCOE dataframe.
        fuel_type (str or None): A string indicating what value of
            fuel_type_code_pudl should be selected from the input
            dataframe. If None, all fuels are retained.

    Returns:
        pandas.DataFrame: A dataframe containing MCOE records for only a
        single PUDL fuel type code.

    """
    out_df = mcoe_df
    if fuel_type is not None:
        out_df = mcoe_df[mcoe_df.fuel_type_code_pudl == fuel_type]
    return out_df

def finite_distplot(df, data_col, wt_col=None, nbins=100, max_val=np.inf):
    """
    Plot a weighted distribution of values less than a maximum value.

    Args:
        df (pandas.DataFrame): The dataframe containing the data and
            weights to plot.
        data_col (str): Label of the column containing the data.
        wt_col (str or None): Label of the column to use to weight the data.
            If None (the default) the data is not weighted.
        nbins (int): Number of histogram bins to use.
        max_val (float): Maximum data value to allow in the visualized data.

    Returns:
        matplotlib.axes.Axes: The axes containing the distribution plot.

    """
    # Comparing against max_val also drops NaN and infinite values.
    df = df[df[data_col] < max_val]
    weights = None
    if wt_col is not None:
        weights = df[wt_col]
    return sns.distplot(df[data_col], bins=nbins, hist_kws={"weights": weights})
If there are overarching parameters which determine the nature of the analysis -- which US states to look at, which utilities are of interest, a particular start and end date -- state those clearly at the beginning of the analysis, so that they can be referred to by the rest of the notebook and easily changed if need be. This is also a good place to set up the pudl_settings dictionary and connections to your local PUDL databases:
In [6]:
pudl_settings = pudl.workspace.setup.get_defaults()
display(pudl_settings)
ferc1_engine = sa.create_engine(pudl_settings['ferc1_db'])
display(ferc1_engine)
pudl_engine = sa.create_engine(pudl_settings['pudl_db'])
display(pudl_engine)
# What granularity should we aggregate MCOE data to?
mcoe_freq = "AS" # Annual
# What date range are we interested in here?
mcoe_start_date = "2015-01-01"
mcoe_end_date = "2018-12-31"
my_new_data_url = "https://mynewdata.website.gov/path/to/new/data.csv"
my_new_datadir = pathlib.Path(pudl_settings["data_dir"]) / "new_data_source"
Any new data the notebook downloads should end up under <PUDL_IN>/data/local/ so that other users can find it. If loading the data takes a while, produce some logging output so users can see that progress is being made. The %%time cell magic can also help users understand which pieces of work / data acquisition are expensive:
In [7]:
%%time
pudl_out = pudl.output.pudltabl.PudlTabl(
freq=mcoe_freq,
start_date=mcoe_start_date,
end_date=mcoe_end_date,
pudl_engine=pudl_engine,
)
mcoe_df = pudl_out.mcoe()
If there's any validation that can be done on the data which you've loaded, to flag if/when it is inappropriate for the analysis that follows, do it here. If you find the data is unusable, use assert statements or raise exceptions to stop the notebook from proceeding, and indicate what the problem is.
In [8]:
assert mcoe_df.capacity_factor.min() >= 0.0
assert mcoe_df.capacity_factor.max() <= 1.5
mean_hr = mcoe_df[np.isfinite(mcoe_df.heat_rate_mmbtu_mwh)].heat_rate_mmbtu_mwh.mean()
assert mean_hr > 5
assert mean_hr < 20
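When a bare assert wouldn't tell the user enough, raising a descriptive exception works too. A sketch using the mean heat rate check above:
In [ ]:
# Sketch: an explicit exception explains *why* the notebook stopped.
if not 5 < mean_hr < 20:
    raise ValueError(
        f"Mean heat rate of {mean_hr:.2f} mmBTU/MWh is outside the plausible "
        "5-20 range; the MCOE data may be incomplete or corrupted."
    )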
Once all of the data is loaded and looks like it's in good shape, do any initial wrangling that's specific to this particular analysis. This should mostly make use of the higher level functions which were defined above. If this step takes a while, don't be shy about producing logging output.
In [9]:
mcoe_coal = mcoe_by_fuel(mcoe_df, fuel_type="coal")
mcoe_gas = mcoe_by_fuel(mcoe_df, fuel_type="gas")
In [10]:
coal_ax = finite_distplot(mcoe_coal, "heat_rate_mmbtu_mwh", max_val=20)
plt.title("Coal heat rate distribution");
In [11]:
gas_ax = finite_distplot(mcoe_gas, "heat_rate_mmbtu_mwh", max_val=20)
plt.title("Gas heat rate distribution");