This report describes the process and results of a data science project to predict increases or decreases in electricity generation at fossil power plants in Texas. It begins with data collection and processing, and shows how the data are combined. The data fall into two general groups: 1) average operating characteristics of power plants over the course of a year, and 2) the state of the ERCOT grid at a given point in time ($t$) and the hour directly preceding $t$. We use the first set of data to group power plants into clusters, and the second set of data to predict the behavior of each cluster.
Most dispatchable electricity - electricity that can be produced on-demand - in the U.S. is generated at fossil power plants[1]. The plant, or plants, that increase generation in response to increased demand are called marginal generating units (MGUs). Models that predict MGUs generally fall into one of two categories: regression-based or unit-commitment economic dispatch[2]. Because they examine real-world behavior, regression-based models can account for effects (e.g. imperfect information) that dispatch models ignore. However, models that simply regress on historical behavior are ill-suited to making predictions about future grid conditions that differ from the data they were trained on. Applying machine learning techniques to energy sector analyses offers a pathway to improve predictions of MGU behavior and to understand the environmental impacts of energy transitions. This area of research is especially important as researchers and policy makers try to predict the economic and environmental impacts of vehicle electrification, demand-response management, and large-scale deployment of variable renewable generating sources like wind and solar.
The goal of this project is to build a model that predicts the type of fossil MGU that will provide electricity for additional demand, given the current set of grid conditions. This problem becomes much harder if the model must also predict grid conditions like demand or wind generation. We simplify the model by treating these conditions as exogenous inputs - the model does not need to know the time of day or the day of the year.
Predicting which individual power plant will provide marginal generation under a given set of grid conditions is also difficult. Individual facilities might go down for maintenance or otherwise not respond to changing grid conditions in an expected manner. We group individual fossil power plants based on their historical operating characteristics (average heat rate, capacity, capacity factor, and 95th percentile 1-hour ramp rate) using k-means clustering. The model treats each group as a single generating unit, and predicts its change in generation given a change in grid demand.
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
import glob
import re
import cPickle as pickle
import gzip
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR, LinearSVR
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
Our dataset covers 9 years (2007-2015) of operation in the ERCOT power grid, which covers most of Texas. Data sources include ERCOT hourly load and wind generation reports, EIA-923 and EIA-860 annual plant-level filings, EIA price series for natural gas and coal delivered to electric power customers in Texas, and hourly gross generation from the EPA air markets program.
All datasets were downloaded as Excel or CSV files. Combined, the raw data files are larger than 1 GB, and can be found in the GitHub repository for this project.
In [2]:
# Set up all the file paths, and column names to keep.
path = '../Raw Data/ERCOT/Hourly wind generation'
full_xls = os.path.join(path, '*.xls')
full_xlsx = os.path.join(path, '*.xlsx')
files = glob.glob(full_xls)
files.extend(glob.glob(full_xlsx))
In [3]:
cols = ['ERCOT Load, MW', 'Total Wind Installed, MW',
'Total Wind Output, MW', 'Wind Output, % of Installed',
'Wind Output, % of Load', '1-hr MW change', '1-hr % change']
Read the excel files, combine them into a single dataframe, and only keep the columns we need.
In [4]:
ercot = pd.concat([pd.read_excel(fn, sheetname='numbers', index_col=0) for fn in files])
ercot = ercot.loc[:,cols]
ercot.sort_index(inplace=True)
ercot = ercot.iloc[:-1,:]
Create a few plots to ensure the data looks good.
In [5]:
ercot.plot(y='Total Wind Installed, MW', use_index=True)
Out[5]:
In [6]:
ercot.plot(y='ERCOT Load, MW', use_index=True)
Out[6]:
In [7]:
ercot.head()
Out[7]:
In [8]:
ercot.loc[:,'Net Load (MW)'] = ercot.loc[:,'ERCOT Load, MW'] - ercot.loc[:,'Total Wind Output, MW']
ercot.loc[:,'Net Load Change (MW)'] = ercot.loc[:,'Net Load (MW)'].diff()
ercot.loc[:,'DATETIME'] = pd.to_datetime(ercot.index)
In [11]:
ercot.head()
Out[11]:
Export the dataframe to CSV and reload it to avoid re-running the pre-processing steps.
In [12]:
# filename = 'Ercot_Data.csv'
# fullpath = os.path.join(filename)
# ercot.to_csv(fullpath)
In [2]:
filename = 'Ercot_Data.csv'
fullpath = os.path.join(filename)
ercot = pd.read_csv(fullpath, index_col=0)
In [3]:
ercot.loc[:,'DATETIME'] = pd.to_datetime(ercot.index)
In [4]:
ercot.head()
Out[4]:
In [11]:
#Iterate through the directory to find all the files to import
path = os.path.join('../Raw Data/EIA Data', '923-No_Header')
full_path = os.path.join(path, '*.*')
eiaNames = os.listdir(path)
#Rename the keys for easier merging later
fileNameMap = {'EIA923 SCHEDULES 2_3_4_5 Final 2010.xls':2010,
'EIA923 SCHEDULES 2_3_4_5 M Final 2009 REVISED 05252011.XLS':2009,
'eia923December2008.xls':2008,
'EIA923_Schedules_2_3_4_5_2011_Final_Revision.xlsx':2011,
'EIA923_Schedules_2_3_4_5_2012_Final_Release_12.04.2013.xlsx':2012,
'EIA923_Schedules_2_3_4_5_2013_Final_Revision.xlsx':2013,
'EIA923_Schedules_2_3_4_5_M_12_2014_Final_Revision.xlsx':2014,
'EIA923_Schedules_2_3_4_5_M_12_2015_Final.xlsx':2015,
'f906920_2007.xls':2007}
#Load the files into data frames, one df per file
eiaDict = {fileNameMap[fn]:pd.read_excel(os.path.join(path, fn)) for fn in eiaNames}
eiaDict = {key:val[val["NERC Region"] == "TRE"] for key, val in eiaDict.iteritems()}
In [12]:
#Dict of values to replace to standardize column names across all dataframes
monthDict = {"JANUARY":"JAN",
"FEBRUARY":"FEB",
"MARCH":"MAR",
"APRIL":"APR",
"MAY":"MAY",
"JUNE":"JUN",
"JULY":"JUL",
"AUGUST":"AUG",
"SEPTEMBER":"SEP",
"OCTOBER":"OCT",
"NOVEMBER":"NOV",
"DECEMBER":"DEC"}
replaceDict = {"ELECTRIC":"ELEC",
"&":"AND",
"I.D.":"ID",
"MMBTUPER":"MMBTU_PER"}
for month in monthDict.values():
replaceDict["MMBTU"+month] = "MMBTU_" + month
#Replace the column name
def rename(col):
for old, new in monthDict.iteritems():
col = col.replace(old, new)
for old, new in replaceDict.iteritems():
col = col.replace(old, new)
col = col.replace("MMBTUS", "MMBTU")
return col
#Iterate through each column name of each dataframe to standardize
for key, df in eiaDict.iteritems():
colNames = [name.replace("\n", "_").replace(" ", "_").strip().upper() for name in df.columns]
colNames = [rename(col) for col in colNames]
eiaDict[key].columns = colNames
In [13]:
#Define the columns that are necessary but are not summable
allCols = eiaDict[fileNameMap.values()[0]].columns
nonSumCols = ["PLANT_ID", "PLANT_NAME", "YEAR"]
#Define the columns that contain the year's totals (Used to calc fuel type %)
yearCols = ["TOTAL_FUEL_CONSUMPTION_QUANTITY", "ELEC_FUEL_CONSUMPTION_QUANTITY",
"TOTAL_FUEL_CONSUMPTION_MMBTU", "ELEC_FUEL_CONSUMPTION_MMBTU",
"NET_GENERATION_(MEGAWATTHOURS)"]
#Define the columns that are necessary and summable
sumCols = []
sumCols.extend(yearCols)
# regex = re.compile(r"^ELEC_QUANTITY_.*")
# sumCols.extend([col for col in allCols if regex.search(col)])
regex = re.compile(r"^MMBTU_PER_UNIT_.*")
sumCols.extend([col for col in allCols if regex.search(col)])
regex = re.compile(r"^TOT_MMBTU_.*")
sumCols.extend([col for col in allCols if regex.search(col)])
regex = re.compile(r"^ELEC_MMBTUS_.*")
sumCols.extend([col for col in allCols if regex.search(col)])
regex = re.compile(r"^NETGEN_.*")
sumCols.extend([col for col in allCols if regex.search(col)])
In [14]:
#Get a list of all the different fuel type codes.
fuelTypes = []
fuelTypes.extend([fuelType for df in eiaDict.values() for fuelType in df["REPORTED_FUEL_TYPE_CODE"].tolist()])
fuelTypes = set(fuelTypes)
In [15]:
# EIA-923 reports electricity generated at a power plant by generating unit
# and fuel type. We aggregate everything to the facility level and assign a
# single fuel type to each plant. This fuel type is used to filter out
# non-fossil plants and for analysis of what types of power plants are in the clusters.
# The cell below has three parts to aggregate by facility and to calculate the
# % of each type of fuel. This will take a few minutes to run.
# The end result is aggEIADict.
#Actually calculate the % fuel type for each facility grouping
def calcPerc(group, aggGroup, fuelType, col):
#Check to see if the facility has a record for the fuel type, and if the total column > 0
if len(group[group["REPORTED_FUEL_TYPE_CODE"] == fuelType]) > 0 and aggGroup[col] > 0:
#summing fuel type because a facility may have multiple plants with the same fuel type
return float((group[group["REPORTED_FUEL_TYPE_CODE"] == fuelType][col]).sum())/aggGroup[col]
else:
return 0
#Perform the aggregation on facility level
def aggAndCalcPerc(group):
aggGroup = group.iloc[0][nonSumCols] #Get the non-agg columns
aggGroup = aggGroup.append(group[sumCols].sum()) #Aggregate the agg columns and append to non-agg
percCols = {col + " %" + fuelType:calcPerc(group, aggGroup, fuelType, col) for col in yearCols for fuelType in fuelTypes}
aggGroup = aggGroup.append(pd.Series(percCols))
return aggGroup
#Iterate through each dataframe to perform aggregation by facility
aggEIADict = dict()
for key, df in eiaDict.iteritems():
gb = df.groupby(by="PLANT_ID")
#aggGroup will be a list of panda series, each series representing a facility
aggGroup = [aggAndCalcPerc(gb.get_group(group)) for group in gb.groups]
aggEIADict[key] = pd.DataFrame(aggGroup)
In [16]:
# Save the aggEIADict file
filename = 'aggEIADict.pkl'
pickle.dump(aggEIADict, open(filename, 'wb'))
In [5]:
#Load the aggEIADict pickle file
aggEIADict = pickle.load(open('aggEIADict.pkl', 'rb'))
In [65]:
aggEIADict[2011].loc[:,['PLANT_ID',
'YEAR',
'ELEC_FUEL_CONSUMPTION_QUANTITY',
'NET_GENERATION_(MEGAWATTHOURS)']].describe()
Out[65]:
Combine all df's from the dict into one df
In [17]:
#Concat all dataframes, reset the index, determine the primary fuel type for
# each facility, filter to only include fossil power plants
all923 = pd.concat(aggEIADict)
all923.reset_index(drop=True, inplace=True)
In [18]:
def top_fuel(row):
#Fraction of largest fuel for electric heat input
try:
fuel = row.iloc[1:27].idxmax()[29:]
except:
return None
return fuel
In [19]:
all923['FUEL'] = all923.apply(top_fuel, axis=1)
Because the EPA data only includes power plants that burn fossil fuels, we are only keeping these facilities. The codes below correspond to: distillate fuel oil (DFO), lignite coal (LIG), natural gas (NG), petroleum coke (PC), and subbituminous coal (SUB).
In [20]:
fossil923 = all923.loc[all923['FUEL'].isin(['DFO', 'LIG', 'NG', 'PC', 'SUB'])]
In [21]:
fossil923.head()
Out[21]:
In [25]:
filename = 'Fossil_EIA_923.csv'
fullpath = os.path.join(filename)
fossil923.to_csv(fullpath, index=None)
In [7]:
## Reload data from File to avoid running the data pre-processing steps:
filename = 'Fossil_EIA_923.csv'
fullpath = os.path.join(filename)
fossil923 = pd.read_csv(fullpath)
In [8]:
fossil923.head()
Out[8]:
In [22]:
# Iterate through the directory to find all the files to import
path = os.path.join('../Raw Data/EIA Data', '860-No_Header')
full_path = os.path.join(path, '*.*')
eia860Names = os.listdir(path)
# Rename the keys for easier merging later
fileName860Map = { 'GenY07.xls':2007,
'GenY08.xls':2008,
'GeneratorY09.xls':2009,
'GeneratorsY2010.xls':2010,
'GeneratorY2011.xlsx':2011,
'GeneratorY2012.xlsx':2012,
'3_1_Generator_Y2013.xlsx':2013,
'3_1_Generator_Y2014.xlsx':2014,
'3_1_Generator_Y2015.xlsx':2015}
#Load the files into data frames, one df per file
eia860Dict = {fileName860Map[fn]:pd.read_excel(os.path.join(path, fn)) for fn in eia860Names}
In [23]:
#Dict of values to replace to standardize column names across all dataframes
renameDict = { "PLNTCODE":"PLANT_ID",
"PLANT_CODE":"PLANT_ID",
"Plant Code":"PLANT_ID",
"NAMEPLATE":"NAMEPLATE_CAPACITY(MW)",
"Nameplate Capacity (MW)":"NAMEPLATE_CAPACITY(MW)"}
#Replace the column name
def rename860(col):
for old, new in renameDict.iteritems():
col = col.replace(old, new)
return col
#Iterate through each column name of each dataframe to standardize and select columns 'PLANT_ID', 'NAMEPLATE_CAPACITY(MW)'
for key, df in eia860Dict.iteritems():
colNames = [rename860(col) for col in df.columns]
eia860Dict[key].columns = colNames
eia860Dict[key] = eia860Dict[key][["PLANT_ID", "NAMEPLATE_CAPACITY(MW)"]]
In [24]:
# Power plants can include multiple generating units. We are aggregating the
# total generating capacity of units in the EIA-860 database to the facility level.
# Iterate through each dataframe to perform aggregation by PLANT_ID
for key, df in eia860Dict.iteritems():
gb = df.groupby(by='PLANT_ID').apply(lambda x: x['NAMEPLATE_CAPACITY(MW)'].sum())
eia860Dict[key]['NAMEPLATE_CAPACITY(MW)'] = eia860Dict[key].PLANT_ID.apply(gb.get_value)
eia860Dict[key] = eia860Dict[key].drop_duplicates(subset=['PLANT_ID', 'NAMEPLATE_CAPACITY(MW)'])
eia860Dict[key] = eia860Dict[key].sort_values(by='PLANT_ID').reset_index(drop=True)
In [30]:
eia860Dict[2007].head()
Out[30]:
In [25]:
# Save the aggEIADict file
filename = 'eia860.pkl'
pickle.dump(eia860Dict, open(filename, 'wb'))
In [9]:
#Load the eia860Dict pickle file
eia860Dict = pickle.load(open('eia860.pkl', 'rb'))
In [10]:
eia860Dict[2010].head()
Out[10]:
EIA provides historical data on natural gas and coal prices for electric power customers in each state. We use data for power plants in Texas and assume that they are accurate for plants in ERCOT. Natural gas data is provided on a monthly basis. Coal data is available by type (lignite, subbituminous, and a combination of the two labeled “All coal”) on a quarterly basis, but only going back through 2008. Coal prices have not changed drastically over time so we make the assumption that prices in 2007 were equal to the average price in 2008.
In [31]:
#The natural gas data are easy to process. We load the Excel file and add
#columns for `Month` and `Year`.
path = os.path.join('../Raw Data/EIA data', 'Natural gas price', 'N3045TX3m.xls')
ng_price = pd.read_excel(path, sheetname='Data 1', header=2)
In [32]:
ng_price.rename(columns={"Texas Natural Gas Price Sold to Electric Power Consumers (Dollars per Thousand Cubic Feet)":
"NG Price ($/mcf)"}, inplace=True)
In [33]:
ng_price.loc[:,'Month'] = ng_price.loc[:,'Date'].apply(lambda x: x.month)
ng_price.loc[:,'Year'] = ng_price.loc[:,'Date'].apply(lambda x: x.year)
In [34]:
ng_price.tail()
Out[34]:
In [35]:
path = os.path.join('../Raw Data/EIA data', 'Coal prices', 'Texas electric sector coal price.xlsx')
coal_temp = pd.read_excel(path, sheetname='Sheet1')
In [36]:
coal_temp.loc[:,'Quarter'] = coal_temp.loc[:,'Date'].apply(lambda x: int(x[1]))
coal_temp.loc[:,'Year'] = coal_temp.loc[:,'Date'].apply(lambda x: int(x[-4:]))
In [37]:
coal_temp.head()
Out[37]:
In [38]:
coal_2007 = pd.DataFrame(columns=coal_temp.columns)
coal_2007['Quarter'] = [1,2,3,4]
coal_2007['Year'] = 2007
for coal in ['All coal', 'Lignite', 'Subbituminous']:
coal_2007[coal] = coal_temp.loc[coal_temp['Year']==2008,coal].mean()
In [39]:
coal_temp = coal_temp.append(coal_2007)
In [40]:
coal_temp.tail()
Out[40]:
In [41]:
# Create a time dataframe from 2007-2015
df = pd.DataFrame(pd.date_range('2007-1-15', periods=12*9, freq='M'), columns=['tempdate'])
df.loc[:,'Quarter'] = df.loc[:,'tempdate'].apply(lambda x: x.quarter)
df.loc[:,'Month'] = df.loc[:,'tempdate'].apply(lambda x: x.month)
df.loc[:,'Year'] = df.loc[:,'tempdate'].apply(lambda x: x.year)
In [42]:
coal_price = pd.merge(coal_temp, df, on=['Quarter', 'Year'])
In [43]:
coal_price.head()
Out[43]:
In [44]:
#Combine the natural gas and coal price dataframes.
fuel_price = pd.merge(ng_price, coal_price, how='right', on=['Month', 'Year'])
fuel_price.drop(['Date_x', 'Date_y', 'tempdate', 'Quarter'], axis=1, inplace=True)
In [48]:
fuel_price.head()
Out[48]:
In [46]:
#Save the dataframe to csv
filename = 'Fuel_Prices.csv'
fullpath = os.path.join(filename)
fuel_price.to_csv(fullpath, index=None)
In [11]:
## Reload data from File to avoid running the data pre-processing steps:
filename = 'Fuel_Prices.csv'
fullpath = os.path.join(filename)
fuel_price = pd.read_csv(fullpath)
In [16]:
fuel_price.head()
Out[16]:
In [38]:
#Read the EPA files into a dataframe
path2 = os.path.join('../Raw Data/EPA air markets')
epaNames = os.listdir(path2)
filePaths = {dn:os.path.join(path2, dn, "*.txt") for dn in epaNames}
filePaths = {dn:glob.glob(val) for dn, val in filePaths.iteritems()}
epaDict = {key:pd.read_csv(fp, index_col = False) for key, val in filePaths.iteritems() for fp in val}
The DataFrames in epaDict contain all power plants in Texas. We can filter on NERC REGION so that it only includes ERCOT.
In [39]:
#Rename the column names to remove the leading space.
for key, df in epaDict.iteritems():
colNames = [name.upper().strip() for name in df.columns]
colNames[colNames.index("FACILITY ID (ORISPL)")] = "PLANT_ID"
epaDict[key].columns = colNames
#Convert DATE to datetime object
#Add new column DATETIME with both date and hour
for key, df in epaDict.iteritems():
epaDict[key]["DATE"] = pd.to_datetime(df["DATE"])
epaDict[key]['DATETIME'] = df['DATE'] + pd.to_timedelta(df['HOUR'], unit='h')
In [40]:
#Boolean filter to only keep ERCOT plants
for key, df in epaDict.iteritems():
epaDict[key] = df[df["NERC REGION"] == "ERCOT"].reset_index(drop = True)
In [41]:
epaDict['2015 July-Dec'].head()
Out[41]:
In [34]:
allEPA = pd.concat(epaDict)
In [35]:
allEPA.fillna(0, inplace=True)
Out[35]:
In [36]:
def plant_gen_delta(df):
"""
For every plant in the input df, calculate the change in gross load (MW)
from the previous hour.
input:
df: dataframe of EPA clean air markets data
return:
df: concatenated list of dataframes
"""
df_list = []
for plant in df['PLANT_ID'].unique():
temp = df.loc[df['PLANT_ID'] == plant,:]
temp.loc[:,'Gen Change (MW)'] = temp.loc[:,'GROSS LOAD (MW)'].values - temp.loc[:,'GROSS LOAD (MW)'].shift(1).values
df_list.append(temp)
return pd.concat(df_list)
In [37]:
allEPA = plant_gen_delta(allEPA)
In [38]:
allEPA.reset_index(drop=True, inplace=True)
In [27]:
filename = 'ALL_EPA.gzip'
fullpath = os.path.join(filename)
allEPA.to_csv(fullpath, index=None, compression='gzip')
In [13]:
# ## Reload data from File to avoid running the data pre-processing steps:
filename = 'ALL_EPA.gzip'
fullpath = os.path.join(filename)
allEPA = pd.read_csv(fullpath, compression='gzip',
usecols=['DATETIME', 'GROSS LOAD (MW)', 'PLANT_ID', 'YEAR', 'Gen Change (MW)'])
allEPA.reset_index(drop=True, inplace=True)
In [14]:
allEPA.tail()
Out[14]:
The variables below - capacity, capacity factor, efficiency, 95th percentile ramp rate, and fuel type - are used to cluster the power plants.
In [36]:
clusterDict = dict()
for key, df in eia860Dict.iteritems():
clusterDict[key] = pd.merge(aggEIADict[key], eia860Dict[key], how='left', on='PLANT_ID')[['PLANT_ID', 'NAMEPLATE_CAPACITY(MW)']]
clusterDict[key].rename(columns={'NAMEPLATE_CAPACITY(MW)': 'capacity', 'PLANT_ID': 'plant_id'}, inplace=True)
In [37]:
fuel_cols = [col for col in aggEIADict[2008].columns if 'ELEC_FUEL_CONSUMPTION_MMBTU %' in col]
def top_fuel(row):
#Fraction of largest fuel for electric heat input
try:
fuel = row.idxmax()[29:]
except:
return None
return fuel
Calculate capacity factor, efficiency, and fuel type for each plant
In [38]:
for key, df in clusterDict.iteritems():
clusterDict[key]['year'] = key
clusterDict[key]['capacity_factor'] = aggEIADict[key]['NET_GENERATION_(MEGAWATTHOURS)'] / (8760*clusterDict[key]['capacity'])
clusterDict[key]['efficiency'] = (aggEIADict[key]['NET_GENERATION_(MEGAWATTHOURS)']*3.412)/(1.0*aggEIADict[key]['ELEC_FUEL_CONSUMPTION_MMBTU'])
clusterDict[key]['fuel_type'] = aggEIADict[key][fuel_cols].apply(top_fuel, axis=1)
clusterDict[key] = clusterDict[key][clusterDict[key]['fuel_type'].isin(['SUB',
'LIG',
'DFO',
'NG',
'PC'])]
Calculate the 95th percentile ramp rate for each plant in each year and merge it with the EIA data
In [39]:
# Add ramp rate using the allEPA df filtered by year (key)
# That should let us get rid of most of the code above.
# key is year
for key, df in clusterDict.iteritems():
temp_df = allEPA.loc[allEPA['YEAR']==key,['PLANT_ID', 'YEAR','Gen Change (MW)']]
ramp_list = []
for plant in temp_df.loc[:,'PLANT_ID'].unique():
ramp_95 = temp_df.loc[(temp_df['PLANT_ID']== plant),'Gen Change (MW)'].quantile(0.95, interpolation='nearest')
ramp_list.append([plant, key, ramp_95])
ramp_rate_df = pd.DataFrame(ramp_list, columns=['plant_id', 'year', 'ramp_rate'])
clusterDict[key] = pd.merge(clusterDict[key], ramp_rate_df, how='left', on=['plant_id', 'year'])
In [69]:
sns.pairplot(clusterDict[2012], hue='fuel_type', vars=['capacity', 'capacity_factor', 'efficiency', 'ramp_rate'])
Out[69]:
In [43]:
for key in clusterDict.keys():
print key, clusterDict[key].plant_id.count(), clusterDict[key].ramp_rate.count()
In [44]:
# re-arrange column order
columns = ['year', 'plant_id', 'capacity', 'capacity_factor', 'efficiency', 'ramp_rate', 'fuel_type']
filename = 'Cluster_Data_121216.csv'
fullpath = os.path.join(filename)
counter = 0
for key, df in clusterDict.iteritems():
# create the csv file
if counter == 0:
df[columns].sort_values(by='plant_id').to_csv(fullpath, sep=',', index = False)
counter += 1
# append to existing csv file
else:
df[columns].sort_values(by='plant_id').to_csv(fullpath, sep=',', index = False, header=False, mode = 'a')
The number of possible clusters can range from k=1 to k=n, where n is the number of power plants operating. The point of clustering power plants is to improve our ability to predict how a group of similar power plants will increase or decrease their generation in response to changes in grid conditions. This will be especially important when using the model to predict the behavior of power plants that have not yet been built. Reducing k improves our ability to predict changes in generation, but it reduces the usefulness of the model. When k=1, we can conclude (unhelpfully) that power plants will respond to meet changes in load. Ideally, we want to find a value of k that is small enough to improve predictions but large enough to differentiate power plants into helpful groups. Future work that focuses on predicting emissions from changes in generation might use a more specific loss function related to error or uncertainty in those predictions.
This section uses the class Clusters, which is defined in cluster.py.
In [47]:
from cluster_kplus import Clusters
Right now, Clusters reads the data in from a file.
In [48]:
# filename = 'Cluster_Data.csv'
filename = 'Cluster_Data_121216.csv'
# path = '../Clean data'
fullpath = os.path.join(filename)
cluster = Clusters(fullpath)
In [91]:
cluster.fossil_with_ramp.describe()
Out[91]:
The make_clusters() function uses KMeans from scikit-learn to produce clusters across a range of k values.
In [50]:
cluster.make_clusters(n_clusters=range(5,15))
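Because cluster.py is not reproduced in this report, the sketch below shows one plausible way make_clusters() could fit KMeans models over a range of k values on the standardized clustering features (capacity, capacity factor, efficiency, and ramp rate). The real implementation may differ in its details; here df stands in for the plant-level table seen in cluster.fossil_with_ramp above.
In [ ]:
# Plausible sketch only - the actual fitting step lives in cluster.py.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def make_clusters(df, n_clusters=range(5, 15)):
    # Standardize the four plant characteristics before clustering
    features = ['capacity', 'capacity_factor', 'efficiency', 'ramp_rate']
    scaled = StandardScaler().fit_transform(df[features].dropna())
    # Fit one KMeans model per candidate value of k
    models = {k: KMeans(n_clusters=k, random_state=0).fit(scaled) for k in n_clusters}
    return models, scaled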
evaluate_clusters() calculates and plots the Calinski-Harabasz score and silhouette score for each value of k.
In [51]:
cluster.evaluate_clusters()
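A minimal sketch of what evaluate_clusters() might compute is shown below, using the fitted KMeans models and scaled feature matrix from the previous sketch; again, the real method in cluster.py (which also produces the plots) may differ.
In [ ]:
# Score each candidate k with two internal clustering metrics.
from sklearn.metrics import silhouette_score, calinski_harabaz_score
# (spelled calinski_harabasz_score in newer scikit-learn releases)

def evaluate_clusters(models, scaled):
    scores = {}
    for k, model in sorted(models.items()):
        labels = model.labels_
        scores[k] = {'calinski_harabaz': calinski_harabaz_score(scaled, labels),
                     'silhouette': silhouette_score(scaled, labels)}
    return scores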
From the figures above, it is difficult to determine an optimal number of clusters. The silhouette score clearly shows that we need more than 5 clusters. Six looks like a good number, but we also examine other options. We are most concerned with how the power plants in a cluster change their generation in response to a change in ERCOT net load. Plots of these two variables for k=6 and k=12 are included below. The net change in load from the previous hour is on the x-axis, and the change in total generation from the clustered power plants is on the y-axis. Each row represents all hours of operation for a single cluster of power plants in 2007, 2011, and 2015. We do not include plots for other k values because they are large and do not add additional information.
In both of the cluster plots, there is always one cluster that looks like it should have two independent regression lines (cluster 1 with k=6, cluster 10 with k=12). Future work can determine whether these are separate plants, or plants that operate under two different modes. Either way, we expect that predictions of behavior for this group may suffer if our model doesn't include the features that lead to this behavior.
In [53]:
labeled_plants = cluster.label_and_export(k=6)
In [54]:
cluster.plot_clusters(ercot, allEPA, labeled_plants)
Out[54]:
In [55]:
labeled_plants = cluster.label_and_export(k=12)
In [56]:
cluster.plot_clusters(ercot, allEPA, labeled_plants)
Out[56]:
Now that we have imported/cleaned all of our data, put it on an hourly basis, selected an appropriate value of k for clustering power plants into groups, and labeled each plant, we can calculate the change in generation for each cluster. This historical change in generation will serve as the known y-vector for our model.
In [57]:
from MarginalUnit import Marginal_Unit
In [60]:
margUnit = Marginal_Unit(ercot, allEPA, labeled_plants, eia860Dict, fuel_price)
In [61]:
X = margUnit.getX()
y = margUnit.getY()
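MarginalUnit.py is not included in this report. As a rough sketch of what getY() produces, the per-cluster hourly generation change could be built from allEPA and labeled_plants roughly as follows, assuming labeled_plants maps plant_id to cluster (which matches how it is used in plot_clusters above).
In [ ]:
# Rough sketch only - Marginal_Unit.getY() is defined in MarginalUnit.py.
def cluster_gen_change(epa, labeled_plants):
    # Attach each plant's cluster label to its hourly records
    merged = epa.merge(labeled_plants[['plant_id', 'cluster']],
                       left_on='PLANT_ID', right_on='plant_id', how='inner')
    # Sum the hourly change in gross load over all plants in each cluster
    gen_change = (merged.groupby(['cluster', 'DATETIME'])['Gen Change (MW)']
                        .sum()
                        .reset_index())
    return gen_change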
Save dataframes to csv
In [8]:
x_fn = 'x_121216.gzip'
y_fn = 'y_121216.csv'
x_path = os.path.join(x_fn)
y_path = os.path.join(y_fn)
X.to_csv(x_path, index=None, compression='gzip')
y.to_csv(y_path, index=None)
In [3]:
#Load dataframes from csv
x_fn = 'x_121216.gzip'
y_fn = 'y_121216.csv'
x_path = os.path.join(x_fn)
y_path = os.path.join(y_fn)
X = pd.read_csv(x_path, compression='gzip')
y = pd.read_csv(y_path)
Each row of X describes one cluster in one hour. The columns include the cluster's aggregate nameplate capacity and gross load, ERCOT load, installed wind capacity and wind output, the change in net load from the previous hour, fuel prices (natural gas and the three coal series), the year, and a cluster identifier.
In [72]:
X.tail()
Out[72]:
In [73]:
y.tail()
Out[73]:
Because the coal and natural gas prices provided by EIA are nominal, they are not appropriate for modeling future scenarios. Rather than adjusting for inflation - which is also difficult to anticipate for the future - we divide the coal prices by the price of natural gas. This ratio contains the same information but is invariant to inflation.
In [74]:
for fuel in ['All coal', 'Lignite', 'Subbituminous']:
X.loc[:,fuel] = X.loc[:,fuel].values/X.loc[:,'NG Price ($/mcf)'].values
X.drop('NG Price ($/mcf)', axis=1, inplace=True)
In [75]:
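# One-hot encode the cluster id into indicator columns (cluster_0, cluster_1, ...)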
cluster_ids = X['cluster'].unique()
for cluster_id in cluster_ids:
X['cluster_{}'.format(cluster_id)] = np.eye(len(cluster_ids))[X['cluster'],cluster_id]
In [76]:
# Drop unnecessary columns and replace nan's with 0
X_cols = ['nameplate_capacity', 'GROSS LOAD (MW)', 'ERCOT Load, MW',
'Total Wind Installed, MW', 'Total Wind Output, MW', 'Net Load Change (MW)',
'All coal', 'Lignite', 'Subbituminous']
X_cluster_cols = ['cluster_{}'.format(cluster_id) for cluster_id in cluster_ids]
X_clean = X.loc[:,X_cols+X_cluster_cols]
X_clean.fillna(0, inplace=True)
y_clean = y.loc[:,'Gen Change (MW)']
y_clean.fillna(0, inplace=True)
In [77]:
print X_clean.shape
print y_clean.shape
We have 9 years (~79,000 hours) of data, representing 2007-2015. Of these 9 years, we use 5 for training (2007-2011), reserve 2 for validation (2012-2013), and 2 for final testing (2014-2015). Because most of the regression models perform better with scaled data, we also transform the features (centered at zero with unit standard deviation) using the scikit-learn StandardScaler, fit on the training set.
In [78]:
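# Chronological split: 2007-2011 for training, 2012-2013 for validation, 2014-2015 for testing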
X_train = X_clean.loc[(X['Year']<2012),:]
y_train = y_clean.loc[(X['Year']<2012)]
X_va = X_clean.loc[X['Year'].isin([2012, 2013]),:]
y_va = y_clean.loc[X['Year'].isin([2012, 2013])]
X_test = X_clean.loc[X['Year']>2013,:]
y_test = y_clean.loc[X['Year']>2013]
In [79]:
print X_va.shape, y_va.shape
In [80]:
# Fit the scaler on the training data only, then apply it to the validation and test sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_va_scaled = scaler.transform(X_va)
X_test_scaled = scaler.transform(X_test)
Check size of all arrays
In [81]:
print X_train_scaled.shape, y_train.shape
print X_va_scaled.shape, y_va.shape
print X_test_scaled.shape, y_test.shape
We start by fitting our data to a linear model that minimizes ordinary least squares loss. There are no hyperparameters to test on this type of model. According to the scikit-learn documentation[9]:
“The [score] is defined as (1 - u/v), where u is the regression sum of squares ((y_true - y_pred) ** 2).sum() and v is the residual sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.”
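Written out, this score is the coefficient of determination

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

where $\hat{y}_i$ are the predicted values and $\bar{y}$ is the mean of the observed values; a model that always predicts $\bar{y}$ scores exactly 0.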
In [82]:
lm = LinearRegression()
lm.fit(X_train_scaled, y_train)
Out[82]:
In [83]:
# for kMeans k=6 clusters
lm.score(X_va_scaled, y_va)
Out[83]:
In [214]:
# for kMeans k=10 clusters
lm.score(X_va_scaled, y_va)
Out[214]:
In [84]:
y_pr = lm.predict(X_va_scaled)
After fitting a model, it is helpful to plot the residuals against the predicted values. The residuals should be randomly distributed around 0, with equal error in both directions. In this plot we see that the quality of the predictions varies by cluster. Cluster 2 is especially skewed, which is not surprising given the behavior seen in the plot of generation change vs load change in the Clustering section. Some of the residuals are large compared to the predicted values - they reach +/-1,000 MW of generation change in a single hour, while the predicted values fall almost entirely within +/-400 MW.
In [98]:
y_lm_resids = pd.DataFrame(dict(zip(['Gen Change (MW)', 'y_pr', 'cluster'],
[y_va.values, y_pr, X.loc[X['Year'].isin([2012, 2013]),'cluster'].values])))
y_lm_resids.loc[:,'residuals'] = y_lm_resids.loc[:,'y_pr'] - y_lm_resids.loc[:,'Gen Change (MW)']
In [99]:
g = sns.FacetGrid(y_lm_resids, hue='cluster', col='cluster',
col_wrap=3)
g.map(plt.scatter, 'y_pr', 'residuals')
g.add_legend()
Out[99]:
This is a linear support vector regressor, which scales to large datasets better than an SVM with a radial basis function kernel. It still employs a regularization parameter (C), which we vary using GridSearchCV. From the validation curve plot below, it is apparent that the model performs poorly with small values of C (strong regularization). This makes sense: with strong regularization the model always predicts the average value of y, and a constant prediction of the mean yields a score of 0.
In [100]:
svm = LinearSVR()
In [101]:
parameters = {'C':np.logspace(-5, 3, num=15)}
In [104]:
lm = GridSearchCV(svm, parameters, n_jobs=-1)
Run the LinearSVR with gridsearch over the 15 parameter values. GridSearchCV does 3-fold CV by default.
In [105]:
results = lm.fit(X_train_scaled, y_train)
In [106]:
results.score(X_va_scaled, y_va)
Out[106]:
Validation Curve
Validation curves[10] are included to show how the training and cross-validation scores change as a hyperparameter is varied. For this LinearSVR model we vary "C", the regularization parameter. Low values of C correspond to a large regularization, which leads the model to predict the average value of y from the training data. High values of C can lead to over-fitting of data.
The learning curve plot (further below) shows that we get the best possible performance once 40-50% of the data is included in the model; performance does not improve beyond that point.
In [108]:
from sklearn.model_selection import validation_curve, learning_curve
In [109]:
param_range = np.logspace(-5, 3, num=15)
train_scores, valid_scores = validation_curve(LinearSVR(), X_train_scaled, y_train, "C", param_range)
In [110]:
#http://scikit-learn.org/stable/modules/learning_curve.html
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)
plt.title("Validation Curve - SVR")
plt.xlabel("C")
plt.ylabel("Score")
plt.ylim(0.0, .2)
lw = 2
plt.semilogx(param_range, train_scores_mean, label="Training score",
color="darkorange", lw=lw)
plt.fill_between(param_range, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2,
color="darkorange", lw=lw)
plt.semilogx(param_range, valid_scores_mean, label="Cross-validation score",
color="navy", lw=lw)
plt.fill_between(param_range, valid_scores_mean - valid_scores_std,
valid_scores_mean + valid_scores_std, alpha=0.2,
color="navy", lw=lw)
plt.legend(loc="best")
Out[110]:
Learning Curve
A learning curve[10] shows how the training and cross-validation scores change as more data is added to the model. If there is a large gap between the training and CV scores that does not close as more data is added, the model is probably overfitting. That isn't the case here. It appears that a LinearSVR model is unlikely to give good results for our data.
In [111]:
#http://scikit-learn.org/stable/modules/learning_curve.html
xLen = len(X_train_scaled)
tSize = [.1, .2, .3, .4, .5, .6, .7, .8, .9, 1]
train_sizes, train_scores, valid_scores = learning_curve(LinearSVR(), X_train_scaled, y_train, train_sizes = tSize)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)
plt.grid()
plt.title("Learning Curve - LinearSVR")
plt.fill_between(tSize, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="darkorange")
plt.fill_between(tSize, valid_scores_mean - valid_scores_std,
valid_scores_mean + valid_scores_std, alpha=0.1, color="navy")
plt.plot(tSize, train_scores_mean, 'o-', color="darkorange",
label="Training score")
plt.plot(tSize, valid_scores_mean, 'o-', color="navy",
label="Cross-validation score")
plt.legend(loc="best")
Out[111]:
In [69]:
xLen = len(X_train_scaled)
tSize = [.1, .2, .3, .4, .5]
train_sizes, train_scores, valid_scores = learning_curve(SVR(), X_train_scaled, y_train, train_sizes = tSize, n_jobs = -1, verbose = 3)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)
plt.grid()
plt.title("Validation Curve - SVR")
plt.fill_between(tSize, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="darkorange")
plt.fill_between(tSize, valid_scores_mean - valid_scores_std,
valid_scores_mean + valid_scores_std, alpha=0.1, color="navy")
plt.plot(tSize, train_scores_mean, 'o-', color="darkorange",
label="Training score")
plt.plot(tSize, valid_scores_mean, 'o-', color="navy",
label="Cross-validation score")
plt.legend(loc="best")
plt.show()
Out[69]:
Since none of the previous methods were able to predict the change in cluster generation with much accuracy, we move on to gradient boosted regression trees. Boosted regression trees do not require scaled data and are a very powerful modeling tool. The code below uses the XGBoost library - which is not included in the Anaconda Python distribution - for the learning and validation curves. The results can be matched reasonably well with GradientBoostingRegressor from scikit-learn. We use XGBoost here because it is much faster, which makes the validation and learning curves practical to compute.
In [113]:
gbr = GradientBoostingRegressor()
In [114]:
gbr.fit(X_train, y_train)
Out[114]:
In [115]:
gbr.score(X_va, y_va)
Out[115]:
In [116]:
from xgboost import XGBRegressor
We start by looking at a validation curve for the parameter n_estimators.
In [117]:
param_values = [50, 100, 200, 300]
train_scores, valid_scores = validation_curve(XGBRegressor(), X_train, y_train, "n_estimators", param_values,
n_jobs=-1)
In [118]:
#http://scikit-learn.org/stable/modules/learning_curve.html
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)
plt.title("Validation Curve with XGBoost")
plt.xlabel("n_estimators")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.plot(param_values, train_scores_mean, label="Training score",
color="darkorange", lw=lw)
plt.fill_between(param_values, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2,
color="darkorange", lw=lw)
plt.plot(param_values, valid_scores_mean, label="Cross-validation score",
color="navy", lw=lw)
plt.fill_between(param_values, valid_scores_mean - valid_scores_std,
valid_scores_mean + valid_scores_std, alpha=0.2,
color="navy", lw=lw)
plt.legend(loc="best")
Out[118]:
There isn't much improvement beyond 200 estimators. We will use 250 and look at the parameter max_depth.
In [122]:
param_values = [1,3,7]
train_scores, valid_scores = validation_curve(XGBRegressor(n_estimators=250), X_train, y_train, "max_depth", param_values,
n_jobs=-1)
In [123]:
#http://scikit-learn.org/stable/modules/learning_curve.html
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)
plt.title("Validation Curve with XGBoost")
plt.xlabel("max_depth")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.plot(param_values, train_scores_mean, label="Training score",
color="darkorange", lw=lw)
plt.fill_between(param_values, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2,
color="darkorange", lw=lw)
plt.plot(param_values, valid_scores_mean, label="Cross-validation score",
color="navy", lw=lw)
plt.fill_between(param_values, valid_scores_mean - valid_scores_std,
valid_scores_mean + valid_scores_std, alpha=0.2,
color="navy", lw=lw)
plt.legend(loc="best")
Out[123]:
Finally, we look at the learning curve for XGBoost.
In [124]:
train_sizes, train_scores, valid_scores = learning_curve(XGBRegressor(n_estimators=250), X_train, y_train,
n_jobs=-1)
In [125]:
#http://scikit-learn.org/stable/modules/learning_curve.html
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)
plt.title("Learning Curve with XGBoost")
plt.xlabel("Sample size")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.plot(train_sizes, train_scores_mean, label="Training score",
color="darkorange", lw=lw)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2,
color="darkorange", lw=lw)
plt.plot(train_sizes, valid_scores_mean, label="Cross-validation score",
color="navy", lw=lw)
plt.fill_between(train_sizes, valid_scores_mean - valid_scores_std,
valid_scores_mean + valid_scores_std, alpha=0.2,
color="navy", lw=lw)
plt.legend(loc="best")
Out[125]:
With the information from the validation and learning curves, it appears that more data and more features may help to improve the performance of our model.
In [126]:
gbr = GradientBoostingRegressor(n_estimators=250)
In [127]:
gbr.fit(X_train, y_train)
Out[127]:
In [128]:
gbr.score(X_va, y_va)
Out[128]:
In [137]:
y_pr = gbr.predict(X_va)
In [140]:
y_gbr_resids = pd.DataFrame(dict(zip(['Gen Change (MW)', 'y_pr', 'cluster'],
[y_va.values, y_pr, X.loc[X['Year'].isin([2012, 2013]),'cluster'].values])))
y_gbr_resids.loc[:,'residuals'] = y_gbr_resids.loc[:,'y_pr'] - y_gbr_resids.loc[:,'Gen Change (MW)']
In [141]:
g = sns.FacetGrid(y_gbr_resids, hue='cluster', col='cluster',
col_wrap=3)
g.map(plt.scatter, 'y_pr', 'residuals')
g.add_legend()
Out[141]:
Finally, we fit the model on both the training and validation data and score it on the held-out test set.
In [135]:
gbr.fit(np.concatenate((X_train, X_va)), np.concatenate((y_train, y_va)))
Out[135]:
In [136]:
gbr.score(X_test, y_test)
Out[136]:
This work built a gradient boosted regression tree model to predict how clusters of fossil power plants in ERCOT will change their generation in response to changes in net load. The final model captures some of the real-world behavior, but still has large errors in some predictions. Future work may include obtaining sub-hourly data for wind generation, more thoroughly exploring the clustering assignments at different levels of k, and expanding beyond ERCOT to look at a larger dataset. If the goal is to predict emissions from marginal generation, it will also be necessary to calculate the error and uncertainty in emissions from clustered power plants.
In [ ]: