Bioclimatic Envelope Modeling with False Negative Masking

Demonstration of an alternative method for mapping bioclimate envelopes of tree species using Python.

Authors: Matthew Perry (mperry@ecotrust.org) and David Diaz (ddiaz@ecotrust.org)


Abstract

Several research studies have shown that purely statistical approaches to mapping species distributions in response to projected climate changes are commonly observed to over-predict climatic vulnerability and species’ range losses in comparison to process-based models (e.g., Morin & Thuiller 2009; Keenan et al. 2011; Cheaib et al. 2012).

Most bioclimatic envelope models are trained using observations (plots) of the presence and absence of a given species. The resulting model effectively considers all absences as being due to explantory climatic factors. However, there are many non-climatic factors which might explain the absence of a species in a forest inventory plot such as historical or recent management activity, competition from other species, disturbances, soils, sampling bias or sampling error.

For example, although it is well-known that Douglas-fir is capable of growing in the Willamette Valley of western Oregon, there is a high occurrence of plots with Douglas-fir absent in the Willamette Valley. The absence of Douglas-fir is likely due to human land-use, competition with other species, and similar factors that are not related to the climatic suitability. In other words, they are a false negatives. When these false negatives are used to train predictive models, if/when future climatic conditions in other areas begin to resemble those of the Willamette Valley (or other places with 'false negatives'), the statistical models will under-predict suitability for the species.

Here we present a general methodology for attenuating the influence of these false negatives. In essence, we adjust our interpretation of observed absences by using auxillary information (from process-based models or expert judgement) to classify observed absences as either:

  • Absent due to climate (true negative)
  • Absent due to non-climatic factors (false negative)

False negatives are then removed from the training data and are not considered when we develop our model. This yeilds a more robust model which more closely matches the fundamental climatic niche of the species and is therefore more appropriate for predicting the impact of climate change on future species distributions.

Case Study

In this exploratory notebook, we have closely followed the Random Forest-based approach used by Crookston et al. (2010) that currently drives the calculation of climatic suitability scores that are factored into Climate-FVS. We will attempt to replicate the results using publically available Forest Inventory Analysis (FIA) plot information. It is important to note that the spatial precision for public FIA plot data has been degraded to protect the privacy of land owners. In some cases, these plots might be shifted by as much as 1 mile. As Crookston et. al. had access to the original (non-degraded) spatial data, the results are not directly comparable.

We will demostrate the workflow using python along with the following packages to perform the analysis:

  • scikit-learn
  • pyimpute
  • geopandas
  • rasterstats

We can map the FIA plots with Douglas-fir absence (GRIDCODE==0)

Loading Data

Here we load FIA plots as a GeoPandas GeoDataFrame. The GeoDataFrame is effectively a 2D tabular data structure with support for geometric data types. The GRIDCODE column is a binary indication of the presence/absence of Douglas-fir.


In [1]:
import geopandas as gpd
fia = gpd.GeoDataFrame.from_file('data/DF_PNW.shp')
fia.head(10)


Out[1]:
GRIDCODE ID ORIG_FID geometry
0 0 1460 1459 POINT (-124.0197511098481442 46.8877685025714754)
1 0 1507 1506 POINT (-123.9449234034516110 46.8793112354689470)
2 0 1524 1523 POINT (-124.0971876684833006 46.8282819277014823)
3 0 1542 1541 POINT (-124.0320287706295090 46.8364627961699611)
4 1 1543 1542 POINT (-123.9764116831542111 46.8611791251101693)
5 0 1573 1572 POINT (-124.0827832002071460 46.8034326303698549)
6 1 1574 1573 POINT (-123.8096154469948402 46.8871164965156311)
7 0 1586 1585 POINT (-123.9452067437918430 46.8312956398651679)
8 1 1607 1606 POINT (-123.7516825442385482 46.8835410983715377)
9 0 1627 1626 POINT (-123.7179816778020722 46.8734355966121541)

In [2]:
print("Coordinate reference system is: {}".format(fia.crs))
print("{} observations with {} columns".format(*fia.shape))


Coordinate reference system is: {'init': u'epsg:4326'}
5455 observations with 4 columns

We can map the FIA plots with Douglas-fir absence (GRIDCODE==0)


In [3]:
%matplotlib inline

fia[fia.GRIDCODE == 0].plot()


Out[3]:
<matplotlib.axes.AxesSubplot at 0x7ff57949b390>

And map the FIA plots with Douglas-fir present (GRIDCODE==1)


In [4]:
fia[fia.GRIDCODE == 1].plot()


Out[4]:
<matplotlib.axes.AxesSubplot at 0x7ff576680ad0>

Next, we will load up the Douglas Fir (Pseudotsuga menziesii) map from Elbert Little (citation?) which we can use as "expert judgement" to define the current suitable range.


In [5]:
df_little = gpd.GeoDataFrame.from_file("data/pseumenz.shp")
df_little.head()


Out[5]:
AREA CODE PERIMETER PSEUMENZ_ PSEUMENZ_I geometry
0 97.25040 1 224.44530 2 2 POLYGON ((-117.0207083562550139 51.61863105638...
1 0.07841 1 1.05778 3 3 POLYGON ((-122.6526140227079225 54.88258224700...
2 0.10230 1 1.21148 4 1 POLYGON ((-128.0815430397526029 54.76402478209...
3 1.22695 1 14.13027 5 4 POLYGON ((-128.5175170898437500 53.43138122558...
4 0.00255 1 0.24220 6 4 POLYGON ((-128.8721949217128326 53.66104673837...

In [6]:
df_little.plot()


Out[6]:
<matplotlib.axes.AxesSubplot at 0x7ff56f458b50>

Finally, we will load the results of a process-model approach (3PG) employed by Coops et al. (2009; 2011). This is a raster dataset which can be accessed using rasterio


In [7]:
import rasterio
import numpy as np
with rasterio.open("data/df_mask_hist.img") as ds:
    meta_coops = ds.meta
    df_coops = np.ma.masked_equal(ds.read_band(1), meta_coops['nodata'])

np.histogram(df_coops.compressed())


Out[7]:
(array([1019292,  248435,  149139,  153060,  108555,   91491,   78427,
          76872,   65123,   96097]),
 array([-0.02848894,  0.06234146,  0.15317187,  0.24400227,  0.33483268,
         0.42566308,  0.51649348,  0.60732389,  0.69815429,  0.7889847 ,
         0.8798151 ]))

In [8]:
from pylab import plt
def plotit(x, title, cmap="Blues"):
    plt.imshow(x, cmap=cmap, interpolation='nearest')
    plt.colorbar()
    plt.title(title)
    plt.show()
    
plotit(df_coops, "Current Douglas-fir Suitability, Coops et al. ")


Classifying absences

Using the Little maps as "expert judgement" we will model the current climatic envelope of Douglas-fir. To do this, we can use a spatial join to determine which plots are inside or outside of Little's mapped Douglas-fir range.

First, let's apply a buffer in order to eliminate any edge effects along the coast and then select only those polygons where df is predicted to be viable (CODE == 1) and.

Note that the buffer operation on a GeoDataFrame currently returns a GeoSeries. This is a known bug. In order to work around it, we need to copy the old dataframe and set it's geometry explicity.


In [9]:
df_little_buff = df_little.copy()
df_little_buff['geometry'] = df_little.buffer(0.01)
df_little_buff = df_little_buff[df_little_buff.CODE == 1]
df_little_buff.plot()


Out[9]:
<matplotlib.axes.AxesSubplot at 0x7ff56e595190>

Now we can perform the spatial join (sjoin) between our plots and the Little maps. This will give us a GeoDataFrame with all fia plot point geometries, the GRIDCODE column indicating presence/abscence from the fia data, and the CODE column incidating if the point is within the range as defined by Little.


In [10]:
from geopandas.tools import sjoin
fia_little = sjoin(fia, df_little_buff, how="left")
fia_little.head()


Out[10]:
GRIDCODE ID ORIG_FID geometry AREA CODE PERIMETER PSEUMENZ_ PSEUMENZ_I
0 0 1460 1459 POINT (-124.0197511098481442 46.8877685025714754) 97.2504 1 224.4453 2 2
1 0 1507 1506 POINT (-123.9449234034516110 46.8793112354689470) 97.2504 1 224.4453 2 2
2 0 1524 1523 POINT (-124.0971876684833006 46.8282819277014823) 97.2504 1 224.4453 2 2
3 0 1542 1541 POINT (-124.0320287706295090 46.8364627961699611) 97.2504 1 224.4453 2 2
4 1 1543 1542 POINT (-123.9764116831542111 46.8611791251101693) 97.2504 1 224.4453 2 2

False Negative Masking

Our false negatives will be those points where the Douglas-fir is asbsent but the range is suitable according to Little. Specifically, we will keep only the points where GRIDCODE == 1 or CODE != 1 (either Douglas-fir is observerved or it's not in the suitable range)

GRIDCODE
1 0
CODE 1 keep *remove*
NA keep keep

In [11]:
orig_shape = fia_little.shape

# Masking: remove false negatives (rmfn) 
fia_little_rmfn = fia_little[(fia_little.GRIDCODE == 1) | (fia_little.CODE != 1)]
new_shape = fia_little_rmfn.shape

print("Removed false negatives from FIA plots\nWent from {} points to {}.".format(orig_shape[0], new_shape[0]))


Removed false negatives from FIA plots
Went from 5455 points to 4310.

Mapping current suitability

In this section we will use the fia plots and climate data provided by USGS TODO to predict a map of current climatic suitability for Douglas-fir. We will use a similar approach to Crookston et. al. 2010; using a Random Forest supervised classification to predict presence/absence based on explanatory climatic variables for which we have raster data for our study area.

First we will perform the analysis using the fia dataframe (the original point dataset before we filtered out false negatives) and then compare the results with the fia_little_rmfn (after removing false negatives).


In [12]:
import os

explanatory_fields = [
    'd100',
    'dd0',
    'dd5',
    'fday',
    'ffp',
    'gsdd5',
    'gsp',
    'map',
    'mat_tenths',
    'mmax_tenths',
    'mmindd0',
    'mmin_tenths',
    'mtcm_tenths',
    'mtwm_tenths',
    'sday']

explanatory_rasters = [os.path.join('data', 'raster', "current_" + r + ".img") for r in explanatory_fields]

We will pyimpute to prep the data and generate the raster maps of suitability.


In [13]:
from pyimpute import load_training_vector

train_xs, train_y = load_training_vector(fia, explanatory_rasters, response_field='GRIDCODE')
train_xs.shape, train_y.shape


Out[13]:
((5455, 15), (5455,))

And we will use scikit-learn to implement the Random Forest model itself.


In [14]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, n_jobs=1)
clf.fit(train_xs, train_y)


Out[14]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

Let's evaluate the accuracy of the classification


In [15]:
from pyimpute import evaluate_clf
evaluate_clf(clf, train_xs, train_y)


Accuracy Score: 0.821848

Classification report
             precision    recall  f1-score   support

          0       0.82      0.81      0.81      1318
          1       0.82      0.83      0.83      1410

avg / total       0.82      0.82      0.82      2728


Confussion matrix
[[1068  250]
 [ 236 1174]]

Feature importances
                   0: 3.2
                   1: 3.5
                   2: 4.7
                   3: 7.6
                   4: 10.9
                   5: 6.4
                   6: 6.3
                   7: 15.1
                   8: 3.2
                   9: 3.1
                  10: 7.6
                  11: 15.2
                  12: 3.5
                  13: 3.1
                  14: 6.5

Cross validation
[ 0.80242547  0.86338798  0.79300828  0.72529644]
4-fold Cross Validation Accuracy: 79.60 (+/- 9.79)

Finally, let's use the classifier to impute the current range of Douglas-fir (i.e. generate raster maps). Though the Random Forest model attempts to classify each pixel as suitable/unsuitable, it also provides a certainty estimate which is interpreted as a relative suitability score with 0 being completely unsuitable and 1 being completely suitable.


In [16]:
from pyimpute import impute, load_targets

target_xs, raster_info = load_targets(explanatory_rasters)

impute(target_xs, clf, raster_info, 
       outdir="_df_output_current", 
       class_prob=True, certainty=True)

The output consists of a responses.img raster which is the predicted class (0 or 1) and probability_1.img which is our current climatic suitability index.


In [17]:
with rasterio.open("_df_output_current/probability_1.img") as ds:
    current_df = ds.read_band(1)
    
plotit(current_df, "Douglas-fir Suitability (unmasked)", cmap="Greens")


Let's zoom in to the Willamette Valley and check the predicted suitability. We see that there is very low values as many of the FIA plots did not contain Douglas-fir due to management, agriculture, etc. By incorporating these false negatives we have effectively trained the model to consider the current climate of the Willamette valley as unsuitable for Douglas-fir when it is most likely that non-climatic factors are the likely cause of absence.


In [18]:
plotit(current_df[550:850, 600:800], "Douglas-fir Suitability, Willamette (unmasked)", cmap="Greens")


Now let's repeat the analysis but this time use the FIA plots that have been filtered to exclude false negatives based on expert judgement in the form of Elbert Little's maps. (fia_little_rmfn)


In [19]:
train_xs_rmfn, train_y_rmfn = load_training_vector(fia_little_rmfn, explanatory_rasters, response_field='GRIDCODE')
train_xs_rmfn.shape, train_y_rmfn.shape


Out[19]:
((4310, 15), (4310,))

In [20]:
clf_rmfn = RandomForestClassifier(n_estimators=10, n_jobs=3)
clf_rmfn.fit(train_xs_rmfn, train_y_rmfn)
evaluate_clf(clf, train_xs_rmfn, train_y_rmfn)


Accuracy Score: 0.945708

Classification report
             precision    recall  f1-score   support

          0       0.92      0.93      0.92       734
          1       0.96      0.96      0.96      1421

avg / total       0.95      0.95      0.95      2155


Confussion matrix
[[ 679   55]
 [  62 1359]]

Feature importances
                   0: 2.2
                   1: 1.5
                   2: 2.2
                   3: 7.2
                   4: 6.2
                   5: 3.5
                   6: 15.7
                   7: 29.0
                   8: 1.8
                   9: 3.5
                  10: 8.6
                  11: 12.0
                  12: 2.6
                  13: 2.3
                  14: 1.6

Cross validation
[ 0.97710667  0.94101124  0.92970123  0.82995595]
4-fold Cross Validation Accuracy: 91.94 (+/- 10.91)

In [21]:
target_xs_rmfn, raster_info_rmfn = load_targets(explanatory_rasters)

impute(target_xs, clf_rmfn, raster_info_rmfn, 
       outdir="_df_output_current_rmfn", 
       class_prob=True, certainty=True)

In [22]:
with rasterio.open("_df_output_current_rmfn/probability_1.img") as ds:
    current_df_rmfn = ds.read_band(1)

plotit(current_df_rmfn, "Douglas-fir Suitability (masked)", cmap="Greens")
plotit(current_df_rmfn[550:850, 600:800], "Douglas-fir Suitability, Willamette (masked)", cmap="Greens")


This masked model with false negatives removed is not as adept at showing variation in the current spatial distribution. But it is more appropriate for showing where the climate could support Douglas-fir.

Predicting suitability under climate change scenarios

The preceding plots demonstrate a subtle difference in interpretation and purpose. The first set (with all absences considered) is trying to map the realized niche, and is answering the question

Where does this species exist?

For mapping current species distributions, this approach may be valid. However, it is problematic to assume that this non-masked relationship between species and climate will hold true in different locations when projected onto future climate scenarios. In other words, projecting the non-masked model onto future climate scenarios assumes that the non-climatic factors (which partly explain the distribution) will migrate in step with the climate.

It is clear that we need to modify the approach to model the fundamental climatic niche of the species. By masking false negatives, we are answering the question

Where could this species exist considering only climatic factors?

When we project this climatic niche onto future climate scenarios, the resulting map is not one of species distribution (i.e. it does not define where Douglas-fir will exist) but of climatic suitability (i.e. where the climatic niche of douglas fir will exist).

If the soils, human managment, lack of seed sources or any other non-climatic factors prevent Douglas-fir from growing, any change in climatic suitability will hardly matter. Conversely if the non-climatic factors are just right for Douglas-fir but the area is shifting out of the climatic niche due to climate change, it may be vulnerable to increased mortality.

In this section we'll demonstrate the application of the models to map future climatic suitability and compare the results of non-masked and masked models.

Load future climate data

First, load the target raster data for the Ensemble of GCMs under the 6.0 representitive concentration pathway (moderate emmissions) for the year 2090:


In [23]:
# Swap out for future climate rasters
future_explanatory_rasters = [os.path.join('data', 'raster', "Ensemble_rcp60_y2090_{}.img".format(r)) 
                              for r in explanatory_fields]

target_xs, raster_info = load_targets(future_explanatory_rasters)

impute(target_xs, clf, raster_info, outdir="_df_output_Ensemble_rcp60_y2090",
       linechunk=400, class_prob=True, certainty=True)

In [24]:
with rasterio.open("_df_output_Ensemble_rcp60_y2090/probability_1.img") as ds:
    df_Ensemble_rcp60_y2090 = ds.read_band(1)
plotit(df_Ensemble_rcp60_y2090, "2090 Douglas-fir Suitability (unmasked)", cmap="Greens")


We can take the difference between current and future suitability to get a visual sense of the predicted change


In [25]:
loss = current_df - df_Ensemble_rcp60_y2090
loss[loss < 0] = 0
plotit(loss, "Loss of Douglas-fir suitability (unmasked)", cmap="Blues")


According to this interpretation, in the southern edge of it's current range, Douglas-fir would appear to become entirely unsuitable. In the western coastal ranges, suitability would decrease significantly.

Contrast this future projection of Douglas-fir suitability with the masked version.


In [26]:
impute(target_xs, clf_rmfn, raster_info, outdir="_df_output_Ensemble_rcp60_y2090_rmfn",
       linechunk=400, class_prob=True, certainty=True)

In [27]:
with rasterio.open("_df_output_Ensemble_rcp60_y2090_rmfn/probability_1.img") as ds:
    df_Ensemble_rcp60_y2090_rmfn = ds.read_band(1)
plotit(df_Ensemble_rcp60_y2090_rmfn, "2090 Douglas-fir Suitability (masked)", cmap="Greens")



In [28]:
loss_rmfn = current_df_rmfn - df_Ensemble_rcp60_y2090_rmfn
loss_rmfn[loss_rmfn < 0] = 0
plotit(loss_rmfn, "Loss of Douglas-fir suitability (masked)", cmap="Blues")


If we compare the non-masked to masked results for 2090, we can see where they over and under predict relative to one another. Of note is the fact that the masked plots predict sustaintially higher future suitability in Douglas-fir's current/historical range.


In [29]:
diff = df_Ensemble_rcp60_y2090_rmfn - df_Ensemble_rcp60_y2090

# Smoothing
from scipy import ndimage
diff_smooth = ndimage.filters.gaussian_filter(diff, 1.5, mode='nearest')

plotit(diff_smooth, "Diff in Douglas-fir Suitability (masked - unmasked)", cmap="jet")


Citations

Cheaib, A., Badeau, V., Boe, J., Chuine, I., Delire, C., Dufrêne, E., François, C., Gritti, E.S., Legay, M., Pagé, C., Thuiller, W., Viovy, N. & Leadley, P. (2012)

Climate change impacts on tree ranges: model intercomparison facilitates understanding and quantification of uncertainty: Understanding and quantification of uncertainties of climate change impacts on tree range. Ecology Letters, 15, 533–544.

Coops, N.C., Waring, R.H., Beier, C., Roy-Jauvin, R. & Wang, T. (2011) Modeling the occurrence of 15 coniferous tree species throughout the Pacific Northwest of North America using a hybrid approach of a generic process -based growth model and decision tree analysis: Modeling the occurrence of fifteen coniferous tree species. Applied Vegetation Science, 14, 402–414.

Coops, N.C., Waring, R.H. & Schroeder, T.A. (2009) Combining a generic process-based productivity model and a statistical classification method to predict the presence and absence of tree species in the Pacific Northwest, U.S.A. Ecological Modelling, 220, 1787–1796.

Crookston, N.L., Rehfeldt, G.E., Dixon, G.E. & Weiskittel, A.R. (2010) Addressing climate change in the forest vegetation simulator to assess impacts on landscape forest dynamics. Forest Ecology and Management, 260, 1198–1211.

Keenan, T., Maria Serra, J., Lloret, F., Ninyerola, M. & Sabate, S. (2011) Predicting the future of forests in the Mediterranean under climate change, with niche- and process-based models: CO2 matters! Global Change Biology, 17, 565–579.

Morin, X. & Thuiller, W. (2009) Comparing niche-and process-based models to reduce prediction uncertainty in species range shifts under climate change. Ecology, 90, 1301–1313.