Assign NERC labels to plants using EIA-860 data and a RandomForest classifier

Instructions

Make sure the file_date parameter below is set to whatever value you would like appended to file names.

Change the most_recent_860_year parameter below to match the most up-to-date EIA-860 annual data file. As of March 2018 this is 2016.

EIA-860 (annual) Excel files need to be downloaded and unzipped into the EIA downloads folder. Make sure that all years from 2012 through the most recent data year are available. Also download the most recent EIA-860m file to the same folder.

The most recent annual 860 file available (as of March 2018) represents 2016 data. When newer EIA-860 annual files are added, the dictionary of pandas read_excel parameters will need to be updated. Note that EIA puts out an Early Release version of 860 with extra header rows and columns, so be sure to adjust the skiprows and usecols parameters appropriately if using an Early Release file.

The entire notebook can be run at once using Run All Cells.


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import os
from os.path import join
import pandas as pd
from sklearn import neighbors, metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from collections import Counter
from copy import deepcopy


cwd = os.getcwd()
data_path = join(cwd, '..', 'Data storage')

Date string for filenames

This will be inserted into all filenames (reading and writing)


In [2]:
file_date = '2018-03-06'
most_recent_860_year = 2016

Load data

This loads facility data that was assembled from the EIA bulk data file and from EIA-860 Excel files. The EIA-860 Excel files need to be downloaded manually.

Load EIA facility data

Only the plant id, year (as a check that plants don't move between years), lat/lon, and state need to be kept.


In [204]:
path = os.path.join(data_path, 'Derived data',
                    'Facility gen fuels and CO2 {}.csv'.format(file_date))
facility_df = pd.read_csv(path)
facility_df['state'] = facility_df['geography'].str[-2:]

In [205]:
plants = facility_df.loc[:, ['plant id', 'year', 'lat', 'lon', 'state']]
plants.drop_duplicates(inplace=True)

Because the 2017 facility dataframe only includes annually reporting facilities I'm going to duplicate the plant id, lat/lon, and state information from 2016.


In [206]:
# make a copy of 2016 (or the most recent annual data year) and change the year to 2017
plants_2017 = plants.loc[plants['year'] == most_recent_860_year, :].copy()
plants_2017.loc[:, 'year'] += 1

plants = pd.concat([plants.loc[plants['year']<=most_recent_860_year, :], plants_2017])

In [207]:
# Check that no plant ids from 2016 are missing in the duplicated 2017 rows
(set(plants.loc[plants.year==2016, 'plant id']) - set(plants.loc[plants.year==2017, 'plant id']))


Out[207]:
set()

Load known NERC labels from EIA-860

Current NERC labels go back to 2012. Use all annual 860 files from 2012 through the most recent year available, and extend the dictionary of dictionaries below with any files released after 2016. io, skiprows, and usecols are all input parameters for the pandas read_excel function.


In [5]:
eia_base_path = join(data_path, 'EIA downloads')
file_860_info = {
#     2011: {'io': join(eia_base_path, 'eia8602011', 'Plant.xlsx'),
#            'skiprows': 0,
#            'parse_cols': 'B,J'},
    2012: {'io': join(eia_base_path, 'eia8602012', 'PlantY2012.xlsx'),
           'skiprows': 0,
           'usecols': 'B,J'},
    2013: {'io': join(eia_base_path, 'eia8602013', '2___Plant_Y2013.xlsx'),
           'skiprows': 0,
           'usecols': 'C,L'},
    2014: {'io': join(eia_base_path, 'eia8602014', '2___Plant_Y2014.xlsx'),
           'skiprows': 0,
           'usecols': 'C,L'},
    2015: {'io': join(eia_base_path, 'eia8602015', '2___Plant_Y2015.xlsx'),
           'skiprows': 0,
           'usecols': 'C,L'},
    2016: {'io': join(eia_base_path, 'eia8602016', '2___Plant_Y2016.xlsx'),
           'skiprows': 0,
           'usecols': 'C,L'}
}
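
When EIA releases the next annual file, the new entry might look like the following sketch. The folder name, file name, and column letters here are assumptions that must be verified against the actual download.


In [ ]:
# Hypothetical entry for a future EIA-860 release. Verify the path,
# sheet layout, and column letters against the actual file before use.
file_860_info[2017] = {'io': join(eia_base_path, 'eia8602017',
                                  '2___Plant_Y2017.xlsx'),
                       'skiprows': 0,
                       'usecols': 'C,L'}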

In [6]:
eia_nercs = {}
for key, args in file_860_info.items():
    eia_nercs[key] = pd.read_excel(**args)
    eia_nercs[key].columns = ['plant id', 'nerc']
    eia_nercs[key]['year'] = key

I want to assign NERC regions for every year. We have data for 2012 onward from the EIA-860 files. For the purpose of this analysis I'll assume that all years from 2001 through 2011 have the same NERC labels as 2012.

Also assume that values in 2017 are the same as in 2016. I'll fill in nerc values for plants that were built in 2017 below.


In [7]:
for year in range(2001, 2012):
    # DataFrame.copy() is deep by default, but use deepcopy to be explicit
    df = deepcopy(eia_nercs[2012])
    df['year'] = year
    
    eia_nercs[year] = df
    
df = deepcopy(eia_nercs[2016])
df['year'] = 2017
eia_nercs[2017] = df

In [8]:
eia_nercs.keys()


Out[8]:
dict_keys([2012, 2013, 2014, 2015, 2016, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2017])

In [9]:
eia_nercs[2001].head()


Out[9]:
plant id nerc year
0 10867 SERC 2001
1 50903 RFC 2001
2 10671 SPP 2001
3 2527 NPCC 2001
4 3305 SERC 2001

In [10]:
nercs = pd.concat(eia_nercs.values())
nercs.sort_values('year', inplace=True)

In [11]:
nercs.head()


Out[11]:
plant id nerc year
5465 56512 RFC 2001
4866 1481 NPCC 2001
4865 1480 NPCC 2001
4864 805 NPCC 2001
4863 54451 WECC 2001

In [13]:
# Check that no MRO plants from 2016 are missing in 2017
(set(nercs.loc[(nercs.nerc == 'MRO') &
              (nercs.year == 2016), 'plant id'])
 - set(nercs.loc[(nercs.nerc == 'MRO') &
              (nercs.year == 2017), 'plant id']))


Out[13]:
set()

In [12]:
nercs.year.unique()


Out[12]:
array([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017])

Look for plants listed with different NERC labels

This may not matter anymore if I use NERC labels for each year

There are 30 plants that appear with more than one NERC label. Five of them don't have a NERC label in one of the years. The largest set of moves is from MRO to other regions (12), with most of those to SPP (7). After that, moves from RFC (5) to MRO (3) and SERC (2). There are also some moves from WECC and FRCC to HICC/ASCC; these might be diesel generators that get relocated.

The plants with conflicting NERC labels represent a small fraction of national generation, though one that grows over time; by 2016 they account for about 0.15% of national generation. (Note that the ratio computed below covers every plant id that is duplicated in nercs, which, with one row per plant per year, is nearly all plants, so it is close to 1.)


In [13]:
# record and unique-plant counts for each annual dataframe, then for
# the combined nercs dataframe
for df_ in list(eia_nercs.values()) + [nercs]:
    print('{} total records'.format(len(df_)))
    print('{} unique plants'.format(len(df_['plant id'].unique())))


7289 total records
7289 unique plants
8060 total records
8060 unique plants
8520 total records
8520 unique plants
8928 total records
8928 unique plants
9711 total records
9711 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
7289 total records
7289 unique plants
9711 total records
9711 unique plants
132398 total records
10068 unique plants

In [14]:
dup_plants = nercs.loc[nercs['plant id'].duplicated(keep=False), 'plant id'].unique()
dup_plants


Out[14]:
array([56512,  1481,  1480, ..., 61001, 61003, 55160])

In [15]:
region_list = []
for plant in dup_plants:
    # unique NERC labels seen for this plant across all years
    regions = nercs.loc[nercs['plant id'] == plant, 'nerc'].unique()
    region_list.append(regions)
Counter(tuple(x) for x in region_list)


Out[15]:
Counter({('ASCC', nan): 2,
         ('ASCC',): 130,
         ('FRCC', 'HICC'): 1,
         ('FRCC',): 205,
         ('HICC',): 52,
         ('MRO', 'RFC'): 2,
         ('MRO', 'SERC'): 1,
         ('MRO', 'SPP'): 7,
         ('MRO', 'WECC'): 3,
         ('MRO',): 1051,
         ('NPCC',): 1257,
         ('RFC', 'MRO'): 3,
         ('RFC', 'SERC'): 2,
         ('RFC',): 1626,
         ('SERC', 'SPP'): 1,
         ('SERC',): 1832,
         ('SPP', 'SERC'): 2,
         ('SPP', 'TRE'): 1,
         ('SPP',): 470,
         ('TRE',): 461,
         ('WECC', 'ASCC'): 2,
         ('WECC', 'HICC'): 1,
         ('WECC',): 2892,
         (nan, 'WECC', 'ASCC'): 3,
         (nan, 'WECC', 'HICC'): 1,
         (nan,): 35})

In [16]:
(facility_df.loc[facility_df['plant id'].isin(dup_plants), :]
            .groupby('year')['generation (MWh)'].sum()
 / facility_df.loc[:, :]
              .groupby('year')['generation (MWh)'].sum())


Out[16]:
year
2001    0.993784
2002    0.994703
2003    0.993925
2004    0.996498
2005    0.997682
2006    0.998378
2007    0.998707
2008    0.998873
2009    0.999267
2010    0.999767
2011    0.999990
2012    1.000000
2013    1.000000
2014    1.000000
2015    1.000000
2016    1.000000
2017    0.999998
Name: generation (MWh), dtype: float64

Some plants in EIA-860 don't have NERC labels. Drop them now.

This is my training data. All of these plants should still be in my plants dataframe.
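
A quick sanity check of that claim might look like this sketch (not run here); labels for plants that never appear in the facility data would simply go unused in the merge below.


In [ ]:
# Plants with NERC labels that never appear in the facility data.
# These labels go unused when merging onto the plants dataframe.
len(set(nercs['plant id']) - set(plants['plant id']))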


In [14]:
nan_plants = {}
all_nan = []
years = nercs.year.unique()
for year in years:
    nan_plants[year] = nercs.loc[(nercs.year == year) &
                                 (nercs.isnull().any(axis=1)), 'plant id'].tolist()
    all_nan.extend(nan_plants[year])

# number of plant-year records that are missing a NERC label
len(all_nan)


Out[14]:
232

In [15]:
# drop all the rows without a nerc value
nercs.dropna(inplace=True)

In [16]:
nan_plants[2017]


Out[16]:
[58651,
 58656,
 58659,
 58662,
 58639,
 58640,
 58549,
 58684,
 58277,
 58380,
 58425,
 58405,
 60563,
 60588,
 60260,
 60250,
 60243,
 60244,
 60245,
 60328,
 61166,
 61172,
 61364,
 60814,
 61099,
 61101,
 61068,
 58989,
 58982,
 58977,
 58837,
 59035,
 60125,
 60024,
 66,
 70]

Load EIA-860m for some info on recent facilities

The EIA-860m (monthly) data file has an up-to-date list of all operating power plants and their associated balancing authority. It does not list the NERC region, so it can't be used to assign NERC labels for all plants. But in some cases knowing the state and balancing authority is enough to make a good guess about which NERC a plant is in.

Assigning NERC region labels has the lowest accuracy for plants in SPP and TRE. To compensate, I'm going to assume that anything in TX or OK with the SWPP balancing authority is in SPP, and that anything in TX with ERCOT is in TRE.

Only do this for plants that came online after the most recent annual 860 data year.

NOTE: Because I'm looking at units that came online in 2017, some of the plant ids will already exist in the NERC data (new units at existing plants); a sketch of that overlap check appears below.


In [127]:
path = join(data_path, 'EIA downloads', 'december_generator2017.xlsx')

# Check the excel file columns if there is a read error. They should match
# the plant id, plant state, operating year, and balancing authority code.
_m860 = pd.read_excel(path, sheet_name='Operating', skip_footer=1,
                      usecols='C,F,P,AE', skiprows=0)
_m860.columns = _m860.columns.str.lower()

In [128]:
# most_recent_860_year is defined at the top of this notebook.
# The goal is to only look at units that started operating after the
# most recent annual data year. The plant-id filter below is commented
# out because new units can come online at plants that already exist
# in the nercs dataframe.

m860 = _m860.loc[(_m860['operating year'] > most_recent_860_year)].copy() #&
#                 (~_m860['plant id'].isin(nercs['plant id'].unique()))].copy()
m860.tail()


Out[128]:
plant id plant state operating year balancing authority code
21373 61632 IA 2017 MISO
21374 61633 MA 2017 ISNE
21375 61634 MA 2017 ISNE
21376 61635 MA 2017 ISNE
21377 61636 MA 2017 ISNE
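
As the note above says, some of these units belong to plants that already have NERC labels. A sketch of that overlap check (not run here):


In [ ]:
# How many post-2016 units are at plants already in the nercs data?
# Overlap is expected: new units often come online at existing plants.
m860['plant id'].isin(nercs['plant id'].unique()).sum()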

In [129]:
m860.loc[(m860['plant state'].isin(['TX', 'OK'])) &
         (m860['balancing authority code'] == 'SWPP'), 'nerc'] = 'SPP'

m860.loc[(m860['plant state'].isin(['TX'])) &
         (m860['balancing authority code'] == 'ERCO'), 'nerc'] = 'TRE'

In [130]:
# Drop all rows except the ones I've labeled as TRE or SPP
m860.dropna(inplace=True)
m860.head()


Out[130]:
plant id plant state operating year balancing authority code nerc
343 165 OK 2017 SWPP SPP
344 165 OK 2017 SWPP SPP
4836 2953 OK 2017 SWPP SPP
4837 2953 OK 2017 SWPP SPP
4838 2953 OK 2017 SWPP SPP

Make lists of plant codes for SPP and TRE facilities


In [131]:
nercs.head()


Out[131]:
plant id nerc year
5465 56512 RFC 2001
4866 1481 NPCC 2001
4865 1480 NPCC 2001
4864 805 NPCC 2001
4863 54451 WECC 2001

In [132]:
# Create additional dataframes with 2017 SPP and TRE plants.
# Use these to fill in values for 2017 plants

m860_spp_plants = (m860.loc[m860['nerc'] == 'SPP', 'plant id']
                       .drop_duplicates()
                       .reset_index(drop=True))

additional_spp = pd.DataFrame(m860_spp_plants.copy())
additional_spp['nerc'] = 'SPP'
additional_spp['year'] = 2017

m860_tre_plants = (m860.loc[m860['nerc'] == 'TRE', 'plant id']
                       .drop_duplicates()
                       .reset_index(drop=True))

additional_tre = pd.DataFrame(m860_tre_plants)
additional_tre['nerc'] = 'TRE'
additional_tre['year'] = 2017

In [133]:
additional_spp


Out[133]:
plant id nerc year
0 165 SPP 2017
1 2953 SPP 2017
2 60414 SPP 2017
3 61221 SPP 2017
4 61261 SPP 2017
5 61614 SPP 2017
6 61615 SPP 2017
7 61616 SPP 2017
8 61617 SPP 2017
9 61618 SPP 2017

In [134]:
additional_tre


Out[134]:
plant id nerc year
0 56984 TRE 2017
1 59066 TRE 2017
2 59193 TRE 2017
3 59206 TRE 2017
4 59245 TRE 2017
5 59712 TRE 2017
6 59812 TRE 2017
7 60122 TRE 2017
8 60210 TRE 2017
9 60217 TRE 2017
10 60366 TRE 2017
11 60372 TRE 2017
12 60436 TRE 2017
13 60459 TRE 2017
14 60460 TRE 2017
15 60581 TRE 2017
16 60682 TRE 2017
17 60690 TRE 2017
18 60774 TRE 2017
19 60901 TRE 2017
20 60902 TRE 2017
21 60983 TRE 2017
22 60989 TRE 2017
23 61205 TRE 2017
24 61309 TRE 2017
25 61362 TRE 2017
26 61409 TRE 2017
27 61410 TRE 2017
28 61411 TRE 2017

Append my 2017 SPP and TRE guesses to the full nercs dataframe


In [115]:
nercs = pd.concat([nercs, additional_spp, additional_tre])

Clean and prep data for the RandomForest classifier


In [116]:
plants.head()


Out[116]:
plant id year lat lon state
0 1001 2017 39.9242 -87.4244 IN
12 1001 2016 39.9242 -87.4244 IN
24 1001 2015 39.9242 -87.4244 IN
36 1001 2014 39.9242 -87.4244 IN
48 1001 2013 39.9242 -87.4244 IN

In [117]:
nercs.tail()


Out[117]:
plant id nerc year
24 61309 TRE 2017
25 61362 TRE 2017
26 61409 TRE 2017
27 61410 TRE 2017
28 61411 TRE 2017

I checked to make sure the type of merge doesn't matter once rows without nerc values are dropped.
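
A minimal version of that check might look like this sketch (not run here): a left merge with missing-nerc rows dropped should match an inner merge.


In [ ]:
# If the row counts match, the merge type is irrelevant once rows
# without a nerc value are dropped.
left = (pd.merge(plants, nercs, on=['plant id', 'year'], how='left')
          .dropna(subset=['nerc']))
inner = pd.merge(plants, nercs, on=['plant id', 'year'], how='inner')
len(left) == len(inner)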


In [209]:
df = pd.merge(plants, nercs, on=['plant id', 'year'], how='left')

In [210]:
omitted = set(df['plant id'].unique()) - set(nercs['plant id'].unique())

In [213]:
df.head()


Out[213]:
plant id year lat lon state nerc
0 1001 2016 39.9242 -87.4244 IN RFC
1 1001 2015 39.9242 -87.4244 IN RFC
2 1001 2014 39.9242 -87.4244 IN RFC
3 1001 2013 39.9242 -87.4244 IN RFC
4 1001 2012 39.9242 -87.4244 IN RFC

In [214]:
df.tail()


Out[214]:
plant id year lat lon state nerc
100851 59957 2017 41.226900 -88.683600 IL RFC
100852 59257 2017 37.305833 -121.755000 CA WECC
100853 59256 2017 38.393333 -121.927778 CA WECC
100854 10475 2017 41.683600 -87.423300 IN RFC
100855 55370 2017 39.851389 -79.070556 PA RFC

In [215]:
df.columns


Out[215]:
Index(['plant id', 'year', 'lat', 'lon', 'state', 'nerc'], dtype='object')

Drop plants that don't have lat/lon data (using just lon to check), and then drop duplicates. If any plants kept the same plant id but moved over time (maybe a diesel generator?) or switched NERC, they will show up twice.


In [216]:
cols = ['plant id', 'lat', 'lon', 'nerc', 'state', 'year']
df_slim = (df.loc[:, cols].dropna(subset=['lon'])
             .drop_duplicates(subset=['plant id', 'year', 'nerc']))

In [217]:
df_slim.tail()


Out[217]:
plant id lat lon nerc state year
100851 59957 41.226900 -88.683600 RFC IL 2017
100852 59257 37.305833 -121.755000 WECC CA 2017
100853 59256 38.393333 -121.927778 WECC CA 2017
100854 10475 41.683600 -87.423300 RFC IN 2017
100855 55370 39.851389 -79.070556 RFC PA 2017

Separate out the list of plants where we don't have NERC labels from EIA-860.


In [218]:
unknown = df_slim.loc[df_slim.nerc.isnull()].copy()

In [219]:
print("{} plants/years don't have NERC labels\n".format(len(unknown)))
print(unknown.head())


1579 plants/years don't have NERC labels

       plant id    lat     lon nerc state  year
28205      3823  38.27 -78.035  NaN    VA  2009
28206      3823  38.27 -78.035  NaN    VA  2005
28207      3823  38.27 -78.035  NaN    VA  2004
28208      3823  38.27 -78.035  NaN    VA  2003
28209      3823  38.27 -78.035  NaN    VA  2002

In [220]:
unknown.tail()


Out[220]:
plant id lat lon nerc state year
100814 50712 37.734100 -121.6516 NaN CA 2017
100817 55162 29.096900 -81.0689 NaN FL 2017
100818 56171 34.037200 -117.5569 NaN CA 2017
100822 7939 37.771400 -75.6342 NaN VA 2017
100825 58916 34.248056 -116.0600 NaN CA 2017

Create X and y matrices

X is lat/lon and year

y is the NERC label

For both, I'm only using plants with complete data (no NaNs). I'm not doing any transformation of the lat/lon at this time.


In [221]:
X = df_slim.loc[df_slim.notnull().all(axis=1), ['lat', 'lon', 'year']]
y = df_slim.loc[df_slim.notnull().all(axis=1), 'nerc']

In [222]:
len(X)


Out[222]:
99127

In [223]:
# Make sure that unknown and X include all records from df_slim
len(X) + len(unknown) - len(df_slim)


Out[223]:
0

In [224]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

GridSearch to find the best parameters in a RandomForest Classifier

I previously used k-nearest neighbors with just lat/lon as input features. The problem is that some facilities don't have lat/lon data, though they usually do have a state label. Categorical labels don't work well in KNN, but the combination of lat/lon and a state label works well in a tree model. RandomForest is usually an effective tree model, and my results with it are more accurate than they were with KNN.
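
To see why categorical labels are a problem for KNN: LabelEncoder assigns integer codes alphabetically, so a distance-based method treats states with adjacent codes as similar even when they are geographically unrelated, while a tree only uses the codes as split points. A toy sketch, not part of the pipeline:


In [ ]:
# LabelEncoder gives AK and AL adjacent codes (0 and 1) even though
# they are thousands of miles apart; KNN would treat them as near
# neighbors, while a tree model just splits on the code values.
toy_states = ['AK', 'AL', 'WY']
toy_codes = LabelEncoder().fit_transform(toy_states)
print(dict(zip(toy_states, toy_codes)))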


In [225]:
from sklearn.ensemble import RandomForestClassifier

In [226]:
rf = RandomForestClassifier()
params = dict(
    n_estimators = [5, 10, 25, 50],
    min_samples_split = [2, 5, 10],
    min_samples_leaf = [1, 3, 5],
)

clf_rf = GridSearchCV(rf, params, n_jobs=-1, iid=False, verbose=1)
clf_rf.fit(X_train, y_train)


Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   32.3s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:  1.3min finished
Out[226]:
GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=False, n_jobs=-1,
       param_grid={'n_estimators': [5, 10, 25, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [227]:
clf_rf.best_estimator_, clf_rf.best_score_


Out[227]:
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False), 0.9797785431782634)

In [228]:
clf_rf.score(X_test, y_test)


Out[228]:
0.9822695035460993

In [229]:
nerc_labels = nercs.nerc.dropna().unique()

Accuracy score by region


In [230]:
for region in nerc_labels:
    mask = y_test == region
    
    X_masked = X_test[mask]
    y_hat_masked = clf_rf.predict(X_masked)
    y_test_masked = y_test[mask]
    
    accuracy = metrics.accuracy_score(y_test_masked, y_hat_masked)
    print('{} : {}'.format(region, accuracy))


RFC : 0.9707198806415517
NPCC : 0.9907823209643111
WECC : 0.9987963672174198
SERC : 0.9690892364305428
SPP : 0.9328663164039697
TRE : 0.9903057419835943
MRO : 0.9853683148335015
FRCC : 0.9768707482993197
ASCC : 0.9965986394557823
HICC : 1.0

F1 score by region


In [231]:
y_hat = clf_rf.predict(X_test)

for region in nerc_labels:
    f1 = metrics.f1_score(y_test, y_hat, labels=[region], average='macro')
    print('{} : {}'.format(region, f1))


RFC : 0.9682820202771835
NPCC : 0.9916026020106445
WECC : 0.9985778361229624
SERC : 0.9707861026633492
SPP : 0.9461219656601539
TRE : 0.986627043090639
MRO : 0.9798068481123793
FRCC : 0.9835616438356164
ASCC : 0.9982964224872232
HICC : 1.0

Plants without lat/lon

Use just the state for plants that don't have lat/lon info. This is less accurate, especially where NERC regions cross state lines, but better than nothing.

I need to keep the lon column so I can filter to unknown facilities that also don't have lon data.


In [232]:
cols = ['plant id', 'nerc', 'state', 'year', 'lon']
df_state_slim = (df.loc[:, cols].dropna(subset=['state']).copy())

In [233]:
df_state_slim.head()


Out[233]:
plant id nerc state year lon
0 1001 RFC IN 2016 -87.4244
1 1001 RFC IN 2015 -87.4244
2 1001 RFC IN 2014 -87.4244
3 1001 RFC IN 2013 -87.4244
4 1001 RFC IN 2012 -87.4244

In [234]:
len(df_state_slim)


Out[234]:
100831

Encode state names as numbers for use in sklearn


In [235]:
le = LabelEncoder()
df_state_slim.loc[:, 'enc state'] = le.fit_transform(df_state_slim.loc[:, 'state'].tolist())

In [236]:
len(df_state_slim)


Out[236]:
100831

In [237]:
unknown_state = df_state_slim.loc[(df_state_slim.nerc.isnull()) &
                                  (df_state_slim.lon.isnull())].copy()

In [238]:
len(unknown_state), len(unknown)


Out[238]:
(116, 1579)

In [239]:
X_state = df_state_slim.loc[df_state_slim.notnull().all(axis=1), ['enc state', 'year']].copy()
y_state = df_state_slim.loc[df_state_slim.notnull().all(axis=1), 'nerc'].copy()

In [240]:
X_state_train, X_state_test, y_state_train, y_state_test = train_test_split(
    X_state, y_state, test_size=0.33, random_state=42)

In [241]:
rf = RandomForestClassifier()
params = dict(
    n_estimators = [5, 10, 25, 50],
    min_samples_split = [2, 5, 10],
    min_samples_leaf = [1, 3, 5],
)

clf_rf_state = GridSearchCV(rf, params, n_jobs=-1, iid=False, verbose=1)
clf_rf_state.fit(X_state_train, y_state_train)


Fitting 3 folds for each of 36 candidates, totalling 108 fits
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   17.1s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:   45.4s finished
Out[241]:
GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=False, n_jobs=-1,
       param_grid={'n_estimators': [5, 10, 25, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [242]:
clf_rf_state.best_estimator_, clf_rf_state.best_score_


Out[242]:
(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=10,
             min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False), 0.9402882327958322)

In [243]:
clf_rf_state.score(X_state_test, y_state_test)


Out[243]:
0.9387418230726906

Accuracy score by region


In [244]:
nerc_labels = nercs.nerc.dropna().unique()

for region in nerc_labels:
    mask = y_state_test == region
    
    X_state_masked = X_state_test[mask]
    y_state_hat_masked = clf_rf_state.predict(X_state_masked)
    y_state_test_masked = y_state_test[mask]
    
    accuracy = metrics.accuracy_score(y_state_test_masked, y_state_hat_masked)
    print('{} : {}'.format(region, accuracy))


RFC : 0.9243697478991597
NPCC : 0.9938519744620478
WECC : 0.9943912900032993
SERC : 0.8831431726168568
SPP : 0.5765453495089543
TRE : 1.0
MRO : 0.9614311088556204
FRCC : 0.9973992197659298
ASCC : 1.0
HICC : 1.0

F1 score by region


In [245]:
y_state_hat = clf_rf_state.predict(X_state_test)

for region in nerc_labels:
    f1 = metrics.f1_score(y_state_test, y_state_hat, labels=[region], average='macro')
    print('{} : {}'.format(region, f1))


RFC : 0.9088405397961994
NPCC : 0.9958535718516763
WECC : 0.9933534743202416
SERC : 0.8819259395387301
SPP : 0.7300658376005852
TRE : 0.8763183125599233
MRO : 0.9512929952297263
FRCC : 0.962962962962963
ASCC : 0.9991474850809888
HICC : 0.9975669099756691

Use best RandomForest parameters to predict NERC for unknown plants


In [246]:
unknown.loc[:, 'nerc'] = clf_rf.predict(unknown.loc[:, ['lat', 'lon', 'year']])
unknown_state.loc[:, 'nerc'] = clf_rf_state.predict(unknown_state.loc[:, ['enc state', 'year']])

Check that no plants in Alaska or Hawaii are assigned to continental NERCs, or the other way around.


In [247]:
print(unknown.loc[unknown.state.isin(['AK', 'HI']), 'nerc'].unique())
print(unknown.loc[unknown.nerc.isin(['HICC', 'ASCC']), 'state'].unique())


['ASCC' 'HICC' 'WECC']
['AK' 'HI']

In [248]:
Counter(unknown['nerc'])


Out[248]:
Counter({'ASCC': 44,
         'FRCC': 6,
         'HICC': 39,
         'MRO': 67,
         'NPCC': 264,
         'RFC': 202,
         'SERC': 349,
         'SPP': 99,
         'TRE': 145,
         'WECC': 364})

In [249]:
unknown.head()


Out[249]:
plant id lat lon nerc state year
28205 3823 38.27 -78.035 SERC VA 2009
28206 3823 38.27 -78.035 SERC VA 2005
28207 3823 38.27 -78.035 SERC VA 2004
28208 3823 38.27 -78.035 SERC VA 2003
28209 3823 38.27 -78.035 SERC VA 2002

In [250]:
unknown_state.head()


Out[250]:
plant id nerc state year lon enc state
84867 10851 RFC NJ 2006 NaN 31
84868 10851 RFC NJ 2005 NaN 31
84869 10851 RFC NJ 2004 NaN 31
84870 10851 RFC NJ 2002 NaN 31
84871 10851 RFC NJ 2001 NaN 31

Export plants with lat/lon, state, and nerc


In [251]:
nercs.tail()


Out[251]:
plant id nerc year
24 61309 TRE 2017
25 61362 TRE 2017
26 61409 TRE 2017
27 61410 TRE 2017
28 61411 TRE 2017

In [252]:
unknown.head()


Out[252]:
plant id lat lon nerc state year
28205 3823 38.27 -78.035 SERC VA 2009
28206 3823 38.27 -78.035 SERC VA 2005
28207 3823 38.27 -78.035 SERC VA 2004
28208 3823 38.27 -78.035 SERC VA 2003
28209 3823 38.27 -78.035 SERC VA 2002

In [253]:
unknown_state.tail()


Out[253]:
plant id nerc state year lon enc state
92615 10257 WECC CA 2004 NaN 4
92616 10257 WECC CA 2003 NaN 4
92617 10257 WECC CA 2002 NaN 4
92618 10257 WECC CA 2001 NaN 4
92782 56508 WECC CA 2007 NaN 4

In [267]:
len(unknown_state['plant id'].unique())


Out[267]:
31

In [254]:
df_slim.head()


Out[254]:
plant id lat lon nerc state year
0 1001 39.9242 -87.4244 RFC IN 2016
1 1001 39.9242 -87.4244 RFC IN 2015
2 1001 39.9242 -87.4244 RFC IN 2014
3 1001 39.9242 -87.4244 RFC IN 2013
4 1001 39.9242 -87.4244 RFC IN 2012

In [255]:
labeled = pd.concat([df_slim.loc[df_slim.notnull().all(axis=1)],
                     unknown,
                     unknown_state.loc[:, ['plant id', 'nerc', 'state', 'year']]])

In [256]:
labeled.tail()


Out[256]:
lat lon nerc plant id state year
92615 NaN NaN WECC 10257 CA 2004
92616 NaN NaN WECC 10257 CA 2003
92617 NaN NaN WECC 10257 CA 2002
92618 NaN NaN WECC 10257 CA 2001
92782 NaN NaN WECC 56508 CA 2007

In [257]:
labeled.loc[labeled.nerc.isnull()]


Out[257]:
lat lon nerc plant id state year

Some facilities in facility_df don't show up in my labeled data.


In [258]:
facility_df.loc[~facility_df['plant id'].isin(labeled['plant id']), 'plant id'].unique()


Out[258]:
array([57116, 57794, 58690, 58236, 58098, 57913, 57400, 57628, 61084,
       60539, 60540, 60991, 61079, 60383, 61172, 61020, 61221, 61021,
       61022, 60682, 60688, 61407, 60689, 61330, 61357, 60414, 60366,
       60983, 60989, 60372, 60658, 60581, 60583, 60901, 60902, 60905,
       60883, 60885, 60856, 60552, 60467, 60145, 60152, 59308, 59220,
       59309, 59315, 60237, 60306, 60303, 60340, 59665, 59666, 59684,
       59827, 59061, 60033, 59066, 60210, 60261, 61039, 61040, 61048,
       60258, 61050, 60346, 59689, 59690, 59691, 59764, 61261, 61268,
       61303, 59812, 59875, 60043, 59888, 61561, 60122, 59245, 59193,
       59004, 60655, 60987, 61512, 59206, 60436, 60217, 59712, 59940,
       60785, 61222, 61422, 55314, 55952,  7704,  6339, 55975, 55073,
       50728, 60569, 60570, 61197, 60690, 60506])
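
To gauge whether these omissions matter, one could mirror the earlier generation-share calculation; a sketch, not run here:


In [ ]:
# Share of annual generation from facilities absent from `labeled`.
# Small values would mean the missing facilities are negligible.
missing = ~facility_df['plant id'].isin(labeled['plant id'])
(facility_df.loc[missing].groupby('year')['generation (MWh)'].sum()
 / facility_df.groupby('year')['generation (MWh)'].sum())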

In [259]:
len(labeled), len(nercs)


Out[259]:
(100822, 132244)


In [260]:
nerc_labels


Out[260]:
array(['RFC', 'NPCC', 'WECC', 'SERC', 'SPP', 'TRE', 'MRO', 'FRCC', 'ASCC',
       'HICC'], dtype=object)

In [261]:
mro_2016 = set(labeled.loc[(labeled.nerc == 'MRO') &
                                  (labeled.year == 2016), 'plant id'])
mro_2017 = set(labeled.loc[(labeled.nerc == 'MRO') &
                                  (labeled.year == 2017), 'plant id'])


In [262]:
(set(nercs.loc[(nercs.nerc=='MRO') &
              (nercs.year==2017),'plant id'])
 - mro_2017)


Out[262]:
{1052,
 1058,
 1175,
 1189,
 1218,
 1771,
 1889,
 1914,
 1918,
 1932,
 1960,
 1995,
 2008,
 2217,
 2791,
 2821,
 2822,
 3334,
 3343,
 3347,
 3348,
 3996,
 4048,
 4054,
 4062,
 4121,
 4140,
 7376,
 7377,
 7602,
 7706,
 7882,
 7956,
 8014,
 8025,
 8057,
 10476,
 54713,
 54930,
 55315,
 55638,
 55764,
 55834,
 56072,
 56183,
 56366,
 57048,
 57116,
 57255,
 57659,
 58236,
 58434,
 58903,
 59053,
 59197,
 59223,
 59224,
 59225,
 59226,
 59227,
 59228,
 59230,
 59231,
 59307,
 59684,
 59875,
 59902,
 59903,
 60066,
 60203,
 60254,
 60503,
 60504,
 60505,
 60519,
 60520,
 60521,
 60522,
 60523,
 60524,
 60525,
 60526,
 60527,
 60528,
 60529,
 60530,
 60531,
 60532,
 60533,
 60534,
 60564,
 60595,
 60631,
 60632,
 60647,
 60674,
 60694,
 60695,
 60711,
 60712,
 60713,
 60714,
 60715,
 60716,
 60717,
 60795,
 60823,
 60830,
 60832,
 60833,
 60834,
 60835,
 60836,
 60837,
 60838,
 60873,
 60887,
 60888,
 60889,
 60890,
 60891,
 60892,
 60893,
 60894,
 60895,
 60905,
 60934,
 60935,
 60936,
 60937,
 60938,
 60939,
 60940,
 60941,
 60942,
 60943,
 60944,
 60951,
 60955,
 60957,
 60958,
 60962,
 60966,
 60971,
 60977,
 61046,
 61047,
 61056,
 61057,
 61058,
 61059,
 61060,
 61070,
 61071,
 61072,
 61077,
 61078,
 61079,
 61138,
 61139,
 61140,
 61141,
 61142,
 61144,
 61174,
 61175,
 61176,
 61177,
 61178,
 61179,
 61180,
 61181,
 61182,
 61183,
 61203,
 61328,
 61329,
 61357,
 61363,
 61379,
 61380,
 61381,
 61382,
 61383,
 61384,
 61426,
 61427,
 61428}

In [263]:
for nerc in nerc_labels:
    l = len((set(labeled.loc[(labeled.nerc == nerc) &
              (labeled.year == 2016), 'plant id'])
     - set(labeled.loc[(labeled.nerc == nerc) &
              (labeled.year == 2017), 'plant id'])))
    
    print('{} plants dropped in {}'.format(l, nerc))


0 plants dropped in RFC
0 plants dropped in NPCC
0 plants dropped in WECC
0 plants dropped in SERC
0 plants dropped in SPP
0 plants dropped in TRE
0 plants dropped in MRO
0 plants dropped in FRCC
0 plants dropped in ASCC
0 plants dropped in HICC

In [264]:
(set(labeled.loc[(labeled.nerc == 'MRO') &
              (labeled.year == 2016), 'plant id'])
 - set(labeled.loc[(labeled.nerc == 'MRO') &
              (labeled.year == 2017), 'plant id']))


Out[264]:
set()

In [265]:
path = join(data_path, 'Facility labels', 'Facility locations_RF.csv')
labeled.to_csv(path, index=False)
