Assign NERC labels to plants using 860 data and RandomForest

Instructions

Make sure the file_date parameter below is set to whatever value you would like appended to file names.

Change the most_recent_860_year parameter below to match the most up-to-date EIA-860 annual data file. As of March 2018 this is 2016.

EIA-860 (annual) excel files will need to be downloaded and unzipped to the EIA downloads folder. Make sure that all years from 2012 through the most recent data year are available. Also download the most recent EIA-860m to EIA downloads.

The most recent annual 860 file available (as of March 2018) represents 2016 data. When newer EIA-860 annual files are added the dictionary with pandas read_excel parameters will need to be updated. Note that EIA puts out an Early Release version of 860 with extra header rows and columns, so be sure to appropriately adjust the skiprows and usecols parameters if using an Early Release file.

The entire notebook can be run at once using Run All Cells



In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import os
from os.path import join
import pandas as pd
from sklearn import neighbors, metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from collections import Counter
from copy import deepcopy


cwd = os.getcwd()
data_path = join(cwd, '..', 'Data storage')

Date string for filenames

This will be inserted into all filenames (reading and writing)



In [2]:

    
file_date = '2018-03-06'
most_recent_860_year = 2016

Load data

This loads facility data that has been assembled from the EIA bulk data file, and EIA-860 excel files. The EIA-860 excel files need to be downloaded manually.

Load EIA facility data

Only need to keep the plant id, year (as a check that plants don't move between years), and lat/lon



In [204]:

    
path = os.path.join(data_path, 'Derived data',
                    'Facility gen fuels and CO2 {}.csv'.format(file_date))
facility_df = pd.read_csv(path)
facility_df['state'] = facility_df['geography'].str[-2:]



In [205]:

    
plants = facility_df.loc[:, ['plant id', 'year', 'lat', 'lon', 'state']]
plants.drop_duplicates(inplace=True)

Because the 2017 facility dataframe only includes annually reporting facilities I'm going to duplicate the plant id, lat/lon, and state information from 2016.



In [206]:

    
# make a copy of 2016 (or most recent annual data year) and change the year to 
plants_2017 = plants.loc[plants['year'] == most_recent_860_year, :].copy()
plants_2017.loc[:, 'year'] += 1

plants = pd.concat([plants.loc[plants['year']<=most_recent_860_year, :], plants_2017])



In [207]:

    
(set(plants.loc[plants.year==2016, 'plant id']) - set(plants.loc[plants.year==2017, 'plant id']))









    Out[207]:





set()

Load known NERC labels from EIA-860

Current NERCS go back to 2012. Use all annual 860 files from 2012 through the most recent available. Extend the dictionary of dictionaries below with any files available after 2016. io, skiprows, and usecols are all input parameters for the Pandas read_excel function.



In [5]:

    
eia_base_path = join(data_path, 'EIA downloads')
file_860_info = {
#     2011: {'io': join(eia_base_path, 'eia8602011', 'Plant.xlsx'),
#            'skiprows': 0,
#            'parse_cols': 'B,J'},
    2012: {'io': join(eia_base_path, 'eia8602012', 'PlantY2012.xlsx'),
           'skiprows': 0,
           'usecols': 'B,J'},
    2013: {'io': join(eia_base_path, 'eia8602013', '2___Plant_Y2013.xlsx'),
           'skiprows': 0,
           'usecols': 'C,L'},
    2014: {'io': join(eia_base_path, 'eia8602014', '2___Plant_Y2014.xlsx'),
           'skiprows': 0,
           'usecols': 'C,L'},
    2015: {'io': join(eia_base_path, 'eia8602015', '2___Plant_Y2015.xlsx'),
           'skiprows': 0,
           'usecols': 'C,L'},
    2016: {'io': join(eia_base_path, 'eia8602016', '2___Plant_Y2016.xlsx'),
           'skiprows': 0,
           'usecols': 'C,L'}
}



In [6]:

    
eia_nercs = {}
for key, args in file_860_info.items():
    eia_nercs[key] = pd.read_excel(**args)
    eia_nercs[key].columns = ['plant id', 'nerc']
    eia_nercs[key]['year'] = key

I want to assign NERC regions for every year. We have data for 2012 onward from the EIA-860 files. For the purpose of this analysis I'll assume that all years from 2001-2011 are the same NERC as 2012.

Also assume that values in 2017 are the same as in 2016. I'll fill in nerc values for plants that were built in 2017 below.



In [7]:

    
for year in range(2001, 2012):
    # the pandas .copy() method is deep by default but I'm not sure in this case
    df = deepcopy(eia_nercs[2012])
    df['year'] = year
    
    eia_nercs[year] = df
    
df = deepcopy(eia_nercs[2016])
df['year'] = 2017
eia_nercs[2017] = df



In [8]:

    
eia_nercs.keys()









    Out[8]:





dict_keys([2012, 2013, 2014, 2015, 2016, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2017])



In [9]:

    
eia_nercs[2001].head()



In [10]:

    
nercs = pd.concat(eia_nercs.values())
nercs.sort_values('year', inplace=True)



In [11]:

    
nercs.head()



In [13]:

    
(set(nercs.loc[(nercs.nerc == 'MRO') &
              (nercs.year == 2016), 'plant id'])
 - set(nercs.loc[(nercs.nerc == 'MRO') &
              (nercs.year == 2017), 'plant id']))









    Out[13]:





set()



In [12]:

    
nercs.year.unique()









    Out[12]:





array([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017])

Look for plants listed with different NERC labels

This may not matter anymore if I use NERC labels for each year

There are 30 plants duplicated. Five of them don't have a NERC label in one of the years. The largest move is from MRO to other regions (12), with most of those to SPP (7). After that, moves from RFC (5) to MRO (3) and SERC (2). There are also some moves from WECC and FRCC to HICC/ASCC - these might be diesel generators that get moved.

The plants that have duplicate NERC region labels represent a small fraction of national generation, but one that is growing over time. By 2016 they consist of 0.15% of national generation.



In [13]:

    
for df_ in list(eia_nercs.values()) + [nercs]:
    print('{} total records'.format(len(df_)))
    print('{} unique plants'.format(len(df_['plant id'].unique())))









    



7289 total records
7289 unique plants
8060 total records
8060 unique plants
8520 total records
8520 unique plants
8928 total records
8928 unique plants
9711 total records
9711 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
9711 total records
9711 unique plants
132398 total records
10068 unique plants



In [14]:

    
dup_plants = nercs.loc[nercs['plant id'].duplicated(keep=False), 'plant id'].unique()
dup_plants









    Out[14]:





array([56512,  1481,  1480, ..., 61001, 61003, 55160])



In [15]:

    
region_list = []
for plant in dup_plants:
    regions = nercs.loc[nercs['plant id'] == plant, 'nerc'].unique()
#     regions = regions.tolist()
    region_list.append(regions)
Counter(tuple(x) for x in region_list)









    Out[15]:





Counter({('ASCC', nan): 2,
         ('ASCC',): 130,
         ('FRCC', 'HICC'): 1,
         ('FRCC',): 205,
         ('HICC',): 52,
         ('MRO', 'RFC'): 2,
         ('MRO', 'SERC'): 1,
         ('MRO', 'SPP'): 7,
         ('MRO', 'WECC'): 3,
         ('MRO',): 1051,
         ('NPCC',): 1257,
         ('RFC', 'MRO'): 3,
         ('RFC', 'SERC'): 2,
         ('RFC',): 1626,
         ('SERC', 'SPP'): 1,
         ('SERC',): 1832,
         ('SPP', 'SERC'): 2,
         ('SPP', 'TRE'): 1,
         ('SPP',): 470,
         ('TRE',): 461,
         ('WECC', 'ASCC'): 2,
         ('WECC', 'HICC'): 1,
         ('WECC',): 2892,
         (nan, 'WECC', 'ASCC'): 3,
         (nan, 'WECC', 'HICC'): 1,
         (nan,): 35})



In [16]:

    
(facility_df.loc[facility_df['plant id'].isin(dup_plants), :]
            .groupby('year')['generation (MWh)'].sum()
 / facility_df.loc[:, :]
              .groupby('year')['generation (MWh)'].sum())









    Out[16]:





year
2001    0.993784
2002    0.994703
2003    0.993925
2004    0.996498
2005    0.997682
2006    0.998378
2007    0.998707
2008    0.998873
2009    0.999267
2010    0.999767
2011    0.999990
2012    1.000000
2013    1.000000
2014    1.000000
2015    1.000000
2016    1.000000
2017    0.999998
Name: generation (MWh), dtype: float64

Some plants in EIA-860 don't have NERC labels. Drop them now.

This is my training data. All of these plants should still be in my plants dataframe.



In [14]:

    
nan_plants = {}
all_nan = []
years = nercs.year.unique()
for year in years:
    nan_plants[year] = nercs.loc[(nercs.year == year) &
                                 (nercs.isnull().any(axis=1)), 'plant id'].tolist()
    all_nan.extend(nan_plants[year])

# number of plants that don't have a nerc in at least one year
len(all_nan)









    Out[14]:





232



In [15]:

    
# drop all the rows without a nerc value
nercs.dropna(inplace=True)



In [16]:

    
nan_plants[2017]

Load EIA-860m for some info on recent facilities

The EIA-860m (monthly) data file has an up-to-date list of all operating power plants and their associated balancing authority. It does not list the NERC region, so it can't be used to assign NERC labels for all plants. But in some cases knowing the state and balancing authority is enough to make a good guess about which NERC a plant is in.

Assigning NERC region labels has the lowest accuracy for plants in SPP and TRE. To compensate, I'm going to assume that anything in TX or OK and SWPP balancing authority is in SPP. On the flip side, if it's in TX and ERCOT I'll assign it to TRE.

Only do this for plants that come online since the most recent 860 annual data.

NOTE Because I'm looking at units that came online in 2017 some of the plant ids will already exist



In [127]:

    
path = join(data_path, 'EIA downloads', 'december_generator2017.xlsx')

# Check the excel file columns if there is a read error. They should match
# the plant id, plant state, operating year, and balancing authority code.
_m860 = pd.read_excel(path, sheet_name='Operating',skip_footer=1,
                     usecols='C,F,P,AE', skiprows=0)
_m860.columns = _m860.columns.str.lower()



In [128]:

    
# most_recent_860_year is defined at the top of this notebook
# The goal here is to only look at plants that started operating after
# the most recent annual data. So only include units starting after
# the last annual data and that don't have plant ids in the nercs
# dataframe

m860 = _m860.loc[(_m860['operating year'] > most_recent_860_year)].copy() #&
#                 (~_m860['plant id'].isin(nercs['plant id'].unique()))].copy()
m860.tail()









    Out[128]:







  
    
      
      plant id
      plant state
      operating year
      balancing authority code
    
  
  
    
      21373
      61632
      IA
      2017
      MISO
    
    
      21374
      61633
      MA
      2017
      ISNE
    
    
      21375
      61634
      MA
      2017
      ISNE
    
    
      21376
      61635
      MA
      2017
      ISNE
    
    
      21377
      61636
      MA
      2017
      ISNE



In [129]:

    
m860.loc[(m860['plant state'].isin(['TX', 'OK'])) &
         (m860['balancing authority code'] == 'SWPP'), 'nerc'] = 'SPP'

m860.loc[(m860['plant state'].isin(['TX'])) &
         (m860['balancing authority code'] == 'ERCO'), 'nerc'] = 'TRE'



In [130]:

    
# Drop all rows except the ones I've labeled as TRE or SPP
m860.dropna(inplace=True)
m860.head()









    Out[130]:







  
    
      
      plant id
      plant state
      operating year
      balancing authority code
      nerc
    
  
  
    
      343
      165
      OK
      2017
      SWPP
      SPP
    
    
      344
      165
      OK
      2017
      SWPP
      SPP
    
    
      4836
      2953
      OK
      2017
      SWPP
      SPP
    
    
      4837
      2953
      OK
      2017
      SWPP
      SPP
    
    
      4838
      2953
      OK
      2017
      SWPP
      SPP

Make lists of plant codes for SPP and TRE facilities



In [131]:

    
nercs.head()



In [132]:

    
# Create additional dataframes with 2017 SPP and TRE plants.
# Use these to fill in values for 2017 plants

m860_spp_plants = (m860.loc[m860['nerc'] == 'SPP', 'plant id']
                       .drop_duplicates()
                       .reset_index(drop=True))

additional_spp = pd.DataFrame(m860_spp_plants.copy())
# additional_spp['plant id'] = m860_spp_plants
additional_spp['nerc'] = 'SPP'
additional_spp['year'] = 2017

m860_tre_plants = (m860.loc[m860['nerc'] == 'TRE', 'plant id']
                       .drop_duplicates()
                       .reset_index(drop=True))

additional_tre = pd.DataFrame(m860_tre_plants)
# additional_tre['plant id'] = m860_tre_plants
additional_tre['nerc'] = 'TRE'
additional_tre['year'] = 2017



In [133]:

    
additional_spp



In [134]:

    
additional_tre

Append my 2017 SPP and TRE guesses to the full nerc dataframe



In [115]:

    
nercs = pd.concat([nercs, additional_spp, additional_tre])

Clean and prep data for KNN



In [116]:

    
plants.head()



In [117]:

    
nercs.tail()

Checked to make sure the type of merge doesn't matter once rows without nerc values are dropped



In [209]:

    
df = pd.merge(plants, nercs, on=['plant id', 'year'], how='left')



In [210]:

    
omitted = set(df['plant id'].unique()) - set(nercs['plant id'].unique())



In [213]:

    
df.head()



In [214]:

    
df.tail()









    Out[214]:







  
    
      
      plant id
      year
      lat
      lon
      state
      nerc
    
  
  
    
      100851
      59957
      2017
      41.226900
      -88.683600
      IL
      RFC
    
    
      100852
      59257
      2017
      37.305833
      -121.755000
      CA
      WECC
    
    
      100853
      59256
      2017
      38.393333
      -121.927778
      CA
      WECC
    
    
      100854
      10475
      2017
      41.683600
      -87.423300
      IN
      RFC
    
    
      100855
      55370
      2017
      39.851389
      -79.070556
      PA
      RFC



In [215]:

    
df.columns









    Out[215]:





Index(['plant id', 'year', 'lat', 'lon', 'state', 'nerc'], dtype='object')

Drop plants that don't have lat/lon data (using just lon to check), and then drop duplicates. If any plants have kept the same plant id but moved over time (maybe a diesel generator?) or switched NERC they will show up twice.



In [216]:

    
cols = ['plant id', 'lat', 'lon', 'nerc', 'state', 'year']
df_slim = (df.loc[:, cols].dropna(subset=['lon'])
             .drop_duplicates(subset=['plant id', 'year', 'nerc']))



In [217]:

    
df_slim.tail()









    Out[217]:







  
    
      
      plant id
      lat
      lon
      nerc
      state
      year
    
  
  
    
      100851
      59957
      41.226900
      -88.683600
      RFC
      IL
      2017
    
    
      100852
      59257
      37.305833
      -121.755000
      WECC
      CA
      2017
    
    
      100853
      59256
      38.393333
      -121.927778
      WECC
      CA
      2017
    
    
      100854
      10475
      41.683600
      -87.423300
      RFC
      IN
      2017
    
    
      100855
      55370
      39.851389
      -79.070556
      RFC
      PA
      2017

Separate out the list of plants where we don't have NERC labels from EIA-860.



In [218]:

    
unknown = df_slim.loc[df_slim.nerc.isnull()].copy()



In [219]:

    
print("{} plants/years don't have NERC labels\n".format(len(unknown)))
print(unknown.head())









    



1579 plants/years don't have NERC labels

       plant id    lat     lon nerc state  year
28205      3823  38.27 -78.035  NaN    VA  2009
28206      3823  38.27 -78.035  NaN    VA  2005
28207      3823  38.27 -78.035  NaN    VA  2004
28208      3823  38.27 -78.035  NaN    VA  2003
28209      3823  38.27 -78.035  NaN    VA  2002



In [220]:

    
unknown.tail()

Create X and y matricies

X is lat/lon and year

y is the NERC label

For both, I'm only using plants where we have all data (no NaNs). Not doing any transformation of the lat/lon at this time.



In [221]:

    
X = df_slim.loc[df_slim.notnull().all(axis=1), ['lat', 'lon', 'year']]
y = df_slim.loc[df_slim.notnull().all(axis=1), 'nerc']



In [222]:

    
len(X)









    Out[222]:





99127



In [223]:

    
# Make sure that unknown and X include all records from df_slim
len(X) + len(unknown) - len(df_slim)









    Out[223]:





0



In [224]:

    
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

GridSearch to find the best parameters in a RandomForest Classifier

I previously used k-nearest neighbors with just lat/lon as input features. The problem is that some facilities don't have lat/lon data. They do usually have a state geography label though. Categorical labels don't work well in KNN, but the combination of lat/lon and a state label will work well in a tree model. RandomForest is usually a quite effective tree model and my results are more accurate with this than they were with KNN.



In [225]:

    
from sklearn.ensemble import RandomForestClassifier



In [226]:

    
rf = RandomForestClassifier()
params = dict(
    n_estimators = [5, 10, 25, 50],
    min_samples_split = [2, 5, 10],
    min_samples_leaf = [1, 3, 5],
)

clf_rf = GridSearchCV(rf, params, n_jobs=-1, iid=False, verbose=1)
clf_rf.fit(X_train, y_train)









    



Fitting 3 folds for each of 36 candidates, totalling 108 fits






    



[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   32.3s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:  1.3min finished






    Out[226]:





GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=False, n_jobs=-1,
       param_grid={'n_estimators': [5, 10, 25, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)



In [227]:

    
clf_rf.best_estimator_, clf_rf.best_score_









    Out[227]:





(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False), 0.9797785431782634)



In [228]:

    
clf_rf.score(X_test, y_test)









    Out[228]:





0.9822695035460993



In [229]:

    
nerc_labels = nercs.nerc.dropna().unique()

Accuracy score by region



In [230]:

    
for region in nerc_labels:
    mask = y_test == region
    
    X_masked = X_test[mask]
    y_hat_masked = clf_rf.predict(X_masked)
    y_test_masked = y_test[mask]
    
    accuracy = metrics.accuracy_score(y_test_masked, y_hat_masked)
    print('{} : {}'.format(region, accuracy))









    



RFC : 0.9707198806415517
NPCC : 0.9907823209643111
WECC : 0.9987963672174198
SERC : 0.9690892364305428
SPP : 0.9328663164039697
TRE : 0.9903057419835943
MRO : 0.9853683148335015
FRCC : 0.9768707482993197
ASCC : 0.9965986394557823
HICC : 1.0

F1 score by region



In [231]:

    
y_hat = clf_rf.predict(X_test)

for region in nerc_labels:
    f1 = metrics.f1_score(y_test, y_hat, labels=[region], average='macro')
    print('{} : {}'.format(region, f1))









    



RFC : 0.9682820202771835
NPCC : 0.9916026020106445
WECC : 0.9985778361229624
SERC : 0.9707861026633492
SPP : 0.9461219656601539
TRE : 0.986627043090639
MRO : 0.9798068481123793
FRCC : 0.9835616438356164
ASCC : 0.9982964224872232
HICC : 1.0

Plants without lat/lon

Use just the state for plants that don't have lat/lon info. Less accurate, especially where NERC regions cross state lines, but better than nothing.

Need to start with the lon column so I can filter to only unknown facilities that also don't have lon



In [232]:

    
cols = ['plant id', 'nerc', 'state', 'year', 'lon']
df_state_slim = (df.loc[:, cols].dropna(subset=['state']).copy())



In [233]:

    
df_state_slim.head()



In [234]:

    
len(df_state_slim)









    Out[234]:





100831

Encode state names as numbers for use in sklearn



In [235]:

    
le = LabelEncoder()
df_state_slim.loc[:, 'enc state'] = le.fit_transform(df_state_slim.loc[:, 'state'].tolist())



In [236]:

    
len(df_state_slim)









    Out[236]:





100831



In [237]:

    
unknown_state = df_state_slim.loc[(df_state_slim.nerc.isnull()) &
                                  (df_state_slim.lon.isnull())].copy()



In [238]:

    
len(unknown_state), len(unknown)









    Out[238]:





(116, 1579)



In [239]:

    
X_state = df_state_slim.loc[df_state_slim.notnull().all(axis=1), ['enc state', 'year']].copy()
y_state = df_state_slim.loc[df_state_slim.notnull().all(axis=1), 'nerc'].copy()



In [240]:

    
X_state_train, X_state_test, y_state_train, y_state_test = train_test_split(
    X_state, y_state, test_size=0.33, random_state=42)



In [241]:

    
rf = RandomForestClassifier()
params = dict(
    n_estimators = [5, 10, 25, 50],
    min_samples_split = [2, 5, 10],
    min_samples_leaf = [1, 3, 5],
)

clf_rf_state = GridSearchCV(rf, params, n_jobs=-1, iid=False, verbose=1)
clf_rf_state.fit(X_state_train, y_state_train)









    



Fitting 3 folds for each of 36 candidates, totalling 108 fits






    



[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   17.1s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:   45.4s finished






    Out[241]:





GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=False, n_jobs=-1,
       param_grid={'n_estimators': [5, 10, 25, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)



In [242]:

    
clf_rf_state.best_estimator_, clf_rf_state.best_score_









    Out[242]:





(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=10,
             min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False), 0.9402882327958322)



In [243]:

    
clf_rf_state.score(X_state_test, y_state_test)









    Out[243]:





0.9387418230726906

Accuracy score by region



In [244]:

    
nerc_labels = nercs.nerc.dropna().unique()

for region in nerc_labels:
    mask = y_state_test == region
    
    X_state_masked = X_state_test[mask]
    y_state_hat_masked = clf_rf_state.predict(X_state_masked)
    y_state_test_masked = y_state_test[mask]
    
    accuracy = metrics.accuracy_score(y_state_test_masked, y_state_hat_masked)
    print('{} : {}'.format(region, accuracy))









    



RFC : 0.9243697478991597
NPCC : 0.9938519744620478
WECC : 0.9943912900032993
SERC : 0.8831431726168568
SPP : 0.5765453495089543
TRE : 1.0
MRO : 0.9614311088556204
FRCC : 0.9973992197659298
ASCC : 1.0
HICC : 1.0

F1 score by region



In [245]:

    
y_state_hat = clf_rf_state.predict(X_state_test)

for region in nerc_labels:
    f1 = metrics.f1_score(y_state_test, y_state_hat, labels=[region], average='macro')
    print('{} : {}'.format(region, f1))









    



RFC : 0.9088405397961994
NPCC : 0.9958535718516763
WECC : 0.9933534743202416
SERC : 0.8819259395387301
SPP : 0.7300658376005852
TRE : 0.8763183125599233
MRO : 0.9512929952297263
FRCC : 0.962962962962963
ASCC : 0.9991474850809888
HICC : 0.9975669099756691

Use best RandomForest parameters to predict NERC for unknown plants



In [246]:

    
unknown.loc[:, 'nerc'] = clf_rf.predict(unknown.loc[:, ['lat', 'lon', 'year']])
unknown_state.loc[:, 'nerc'] = clf_rf_state.predict(unknown_state.loc[:, ['enc state', 'year']])

Ensuring that no plants in Alaska or Hawaii are assigned to continental NERCs, or the other way around.



In [247]:

    
print(unknown.loc[unknown.state.isin(['AK', 'HI']), 'nerc'].unique())
print(unknown.loc[unknown.nerc.isin(['HICC', 'ASCC']), 'state'].unique())









    



['ASCC' 'HICC' 'WECC']
['AK' 'HI']



In [248]:

    
Counter(unknown['nerc'])









    Out[248]:





Counter({'ASCC': 44,
         'FRCC': 6,
         'HICC': 39,
         'MRO': 67,
         'NPCC': 264,
         'RFC': 202,
         'SERC': 349,
         'SPP': 99,
         'TRE': 145,
         'WECC': 364})



In [249]:

    
unknown.head()



In [250]:

    
unknown_state.head()

Export plants with lat/lon, state, and nerc



In [251]:

    
nercs.tail()



In [252]:

    
unknown.head()



In [253]:

    
unknown_state.tail()



In [267]:

    
len(unknown_state['plant id'].unique())









    Out[267]:





31



In [254]:

    
df_slim.head()



In [255]:

    
labeled = pd.concat([df_slim.loc[df_slim.notnull().all(axis=1)],
                     unknown,
                     unknown_state.loc[:, ['plant id', 'nerc', 'state', 'year']]])



In [256]:

    
labeled.tail()



In [257]:

    
labeled.loc[labeled.nerc.isnull()]









    Out[257]:







  
    
      
      lat
      lon
      nerc
      plant id
      state
      year

There are 7 facilities that don't show up in my labeled data.



In [258]:

    
facility_df.loc[~facility_df['plant id'].isin(labeled['plant id']), 'plant id'].unique()









    Out[258]:





array([57116, 57794, 58690, 58236, 58098, 57913, 57400, 57628, 61084,
       60539, 60540, 60991, 61079, 60383, 61172, 61020, 61221, 61021,
       61022, 60682, 60688, 61407, 60689, 61330, 61357, 60414, 60366,
       60983, 60989, 60372, 60658, 60581, 60583, 60901, 60902, 60905,
       60883, 60885, 60856, 60552, 60467, 60145, 60152, 59308, 59220,
       59309, 59315, 60237, 60306, 60303, 60340, 59665, 59666, 59684,
       59827, 59061, 60033, 59066, 60210, 60261, 61039, 61040, 61048,
       60258, 61050, 60346, 59689, 59690, 59691, 59764, 61261, 61268,
       61303, 59812, 59875, 60043, 59888, 61561, 60122, 59245, 59193,
       59004, 60655, 60987, 61512, 59206, 60436, 60217, 59712, 59940,
       60785, 61222, 61422, 55314, 55952,  7704,  6339, 55975, 55073,
       50728, 60569, 60570, 61197, 60690, 60506])



In [259]:

    
len(labeled), len(nercs)









    Out[259]:





(100822, 132244)



In [ ]:



In [260]:

    
nerc_labels









    Out[260]:





array(['RFC', 'NPCC', 'WECC', 'SERC', 'SPP', 'TRE', 'MRO', 'FRCC', 'ASCC',
       'HICC'], dtype=object)



In [261]:

    
mro_2016 = set(labeled.loc[(labeled.nerc == 'MRO') &
                                  (labeled.year == 2016), 'plant id'])
mro_2017 = set(labeled.loc[(labeled.nerc == 'MRO') &
                                  (labeled.year == 2017), 'plant id'])



In [ ]:



In [262]:

    
(set(nercs.loc[(nercs.nerc=='MRO') &
              (nercs.year==2017),'plant id'])
 - mro_2017)



In [263]:

    
for nerc in nerc_labels:
    l = len((set(labeled.loc[(labeled.nerc == nerc) &
              (labeled.year == 2016), 'plant id'])
     - set(labeled.loc[(labeled.nerc == nerc) &
              (labeled.year == 2017), 'plant id'])))
    
    print('{} plants dropped in {}'.format(l, nerc))









    



0 plants dropped in RFC
0 plants dropped in NPCC
0 plants dropped in WECC
0 plants dropped in SERC
0 plants dropped in SPP
0 plants dropped in TRE
0 plants dropped in MRO
0 plants dropped in FRCC
0 plants dropped in ASCC
0 plants dropped in HICC



In [264]:

    
(set(labeled.loc[(labeled.nerc == 'MRO') &
              (labeled.year == 2016), 'plant id'])
 - set(labeled.loc[(labeled.nerc == 'MRO') &
              (labeled.year == 2017), 'plant id']))









    Out[264]:





set()



In [265]:

    
path = join(data_path, 'Facility labels', 'Facility locations_RF.csv')
labeled.to_csv(path, index=False)



In [ ]:

	plant id	nerc	year
0	10867	SERC	2001
1	50903	RFC	2001
2	10671	SPP	2001
3	2527	NPCC	2001
4	3305	SERC	2001

	plant id	nerc	year
5465	56512	RFC	2001
4866	1481	NPCC	2001
4865	1480	NPCC	2001
4864	805	NPCC	2001
4863	54451	WECC	2001

	plant id	plant state	operating year	balancing authority code
21373	61632	IA	2017	MISO
21374	61633	MA	2017	ISNE
21375	61634	MA	2017	ISNE
21376	61635	MA	2017	ISNE
21377	61636	MA	2017	ISNE

	plant id	plant state	operating year	balancing authority code	nerc
343	165	OK	2017	SWPP	SPP
344	165	OK	2017	SWPP	SPP
4836	2953	OK	2017	SWPP	SPP
4837	2953	OK	2017	SWPP	SPP
4838	2953	OK	2017	SWPP	SPP

	plant id	nerc	year
0	56984	TRE	2017
1	59066	TRE	2017
2	59193	TRE	2017
3	59206	TRE	2017
4	59245	TRE	2017
5	59712	TRE	2017
6	59812	TRE	2017
7	60122	TRE	2017
8	60210	TRE	2017
9	60217	TRE	2017
10	60366	TRE	2017
11	60372	TRE	2017
12	60436	TRE	2017
13	60459	TRE	2017
14	60460	TRE	2017
15	60581	TRE	2017
16	60682	TRE	2017
17	60690	TRE	2017
18	60774	TRE	2017
19	60901	TRE	2017
20	60902	TRE	2017
21	60983	TRE	2017
22	60989	TRE	2017
23	61205	TRE	2017
24	61309	TRE	2017
25	61362	TRE	2017
26	61409	TRE	2017
27	61410	TRE	2017
28	61411	TRE	2017

	plant id	year	lat	lon	state
0	1001	2017	39.9242	-87.4244	IN
12	1001	2016	39.9242	-87.4244	IN
24	1001	2015	39.9242	-87.4244	IN
36	1001	2014	39.9242	-87.4244	IN
48	1001	2013	39.9242	-87.4244	IN

	plant id	year	lat	lon	state	nerc
100851	59957	2017	41.226900	-88.683600	IL	RFC
100852	59257	2017	37.305833	-121.755000	CA	WECC
100853	59256	2017	38.393333	-121.927778	CA	WECC
100854	10475	2017	41.683600	-87.423300	IN	RFC
100855	55370	2017	39.851389	-79.070556	PA	RFC

	plant id	lat	lon	nerc	state	year
100814	50712	37.734100	-121.6516	NaN	CA	2017
100817	55162	29.096900	-81.0689	NaN	FL	2017
100818	56171	34.037200	-117.5569	NaN	CA	2017
100822	7939	37.771400	-75.6342	NaN	VA	2017
100825	58916	34.248056	-116.0600	NaN	CA	2017

	plant id	lat	lon	nerc	state	year
28205	3823	38.27	-78.035	SERC	VA	2009
28206	3823	38.27	-78.035	SERC	VA	2005
28207	3823	38.27	-78.035	SERC	VA	2004
28208	3823	38.27	-78.035	SERC	VA	2003
28209	3823	38.27	-78.035	SERC	VA	2002

	plant id	nerc	state	year	lon	enc state
84867	10851	RFC	NJ	2006	NaN	31
84868	10851	RFC	NJ	2005	NaN	31
84869	10851	RFC	NJ	2004	NaN	31
84870	10851	RFC	NJ	2002	NaN	31
84871	10851	RFC	NJ	2001	NaN	31

	plant id	nerc	state	year	lon	enc state
92615	10257	WECC	CA	2004	NaN	4
92616	10257	WECC	CA	2003	NaN	4
92617	10257	WECC	CA	2002	NaN	4
92618	10257	WECC	CA	2001	NaN	4
92782	56508	WECC	CA	2007	NaN	4