Notebook for the deforestprob Python module

This notebook will show you how to use the deforestprob Python module. The module can be used to estimate the spatial probability of deforestation in the tropics depending on various spatial environmental variables. We use Madagascar as a case study.

We first import the deforestprob module and the other necessary modules to run the script.


In [1]:
# Import modules
%load_ext autoreload
%autoreload 2
import os
import numpy as np
import pandas as pd
from patsy import dmatrices
import deforestprob as dfp
import matplotlib.pyplot as plt

We also create a directory to hold the outputs with the help of the function .make_dir() from the deforestprob module.


In [3]:
# Make output directory
dfp.make_dir("output")

Downloading and preparing the data

To model the spatial probability of deforestation, we need a map of the past deforestation and maps of potential explicative environmental variables. Environmental variables can be derived from topography (altitude, slope, and aspect), accessibility (distance to roads, towns, and forest edge), deforestation history (distance to previous deforestation) or landscape policy (location inside a protected area) for example. In our example, we use the following variables :

Product Source Variable derived Unit Resolution (m)
Deforestation maps (1990-2000-2010) Vieilledent et al. 2018 distance to forest edge m 30
distance to past deforestation m 30
Digital Elevation Model SRTM v4.1 CSI-CGIAR altitude m 90
slope ° 90
Highways OSM - Geofabrik distance to roads m 150
Places distance to towns m 150
Waterways distance to river m 150
Protected areas Rebioma presence of protected area - 30
  1. http://bioscenemada.cirad.fr, deforestation maps derived from Harper et al. 2007 and Hansen et al. 2013
  2. http://srtm.csi.cgiar.org, SRTM 90m Digital Elevation Database v4.1
  3. http://www.geofabrik.de, data extracts from the OpenStreetMap project for Madagascar,
  4. http://rebioma.net, SAPM ("Système des Aires Protégées à Madagascar"), 20/12/2010 version

To obtain the data, we can either (i) execute a bash script which downloads the raw data from the sources and computes the variables, or (ii) directly download the derived variables from a repository on Zenodo (http://doi.org/10.5281/zenodo.259582). The bash script is available here.


In [6]:
# =====================================
# First option: execute the bash script
# =====================================

# os.system("wget https://github.com/ghislainv/deforestprob/blob/master/notebook/scripts/dataMada.sh")
# os.system("sh data_Mada.sh")

# =====================================
# Second option: download the data
# =====================================

# Make data directory
dfp.make_dir("data")

# List of files to download
files = ["fordefor2010.tif", "dist_edge.tif", "dist_defor.tif",
         "altitude.tif", "slope.tif", "aspect.tif",
         "dist_road.tif", "dist_town.tif", "dist_river.tif",
         "sapm.tif", "roads.kml", "towns.kml", "rivers.kml", "sapm.kml"]
for i in files:
    cmd = "wget -P data https://zenodo.org/record/259582/files/" + i
    #os.system(cmd)

In our example, fordefor2010.tif is a forest raster at 30m for the year 2010 considering the deforestation on the period 2000-2010 in Madagascar. We can plot this raster and zoom on a region with function .plot.forest() in the deforestprob package to see the deforestation data. The remaining forest appears in green and the deforested areas appear in red.


In [7]:
# ========================================================
# Plot forest
# ========================================================
dfp.plot.forest(input_forest_raster="data/fordefor2010.tif",
                output_file="output/forest.png",
                col=(255, 0, 0, 255),  # rgba color for deforestation
                figsize=(4,4),
                dpi=200,
                zoom=(340000,412000,7420000,7500000))


Build overview
Out[7]:

We can also plot the environmental raster files using function .plot.var() in the deforestprob package.


In [8]:
# ========================================================
# Plot environmental variables
# ========================================================
var_fig = dfp.plot.var(var_dir="data",
                       output_file="output/var.pdf",
                       gridsize=(2,2),
                       figsize=(5,6),
                       dpi=125)
var_fig[0]


Build overview
Build overview
Build overview
Build overview
Build overview
Build overview
Build overview
Build overview
Build overview
Out[8]:

Sampling the observations

We use the function .sample() from the deforestprob module to sample 10,000 points (pixel centers) in the deforested areas and in the remaining forest (20,000 points in total). The input_forest_raster argument defines the path to the forest raster including the deforested pixels (with value=0) and the remaining forest pixels (value=1) after a given period of time. The random seed can be set with argument Seed to reproduce the data of the random sampling.

The .sample() function also extracts information from environmental variables for each sampled point. Sampling is done by block to allow the computation on large study areas (e.g. country or continental scale) with a fine spatial resolution (e.g. 30m). The var_dir argument indicates the directory including all the environmental raster files (they must be GeoTIFF files with extension .tif) that we want to test.

The .sample() function identifies the spatial cell for each sample point (sample point are grouped within a spatial cell). Spatial cells and grouped observations are used to estimate the spatial autocorrelation of the deforestation process. The csize argument define the width (in km) of the square spatial cells. Each spatial cell will be attributed a parameter. To avoid estimating too many parameters, width of the square spatial cells must be sufficiently large. Both points sampling and extraction of environmental data are done by block to avoid memory problems for big datasets.


In [9]:
# ========================================================
# Sample
# ========================================================

# Sample points
dataset = dfp.sample(nsamp=10000, Seed=1234, csize=10,
                     var_dir="data",
                     input_forest_raster="fordefor2010.tif",
                     output_file="output/sample.txt",
                     blk_rows=0)


Sample 2x10000 pixels (deforested vs. forest)
Divide region in 20895 blocks
Compute number of deforested and forest pixels per block
100%
Draw blocks at random
Draw pixels at random in blocks
100%
Compute center of pixel coordinates
Compute number of 10 x 10 km spatial cells
... 12393 cells (153 x 81)
Identify cell number from XY coordinates
Make virtual raster with variables as raster bands
Extract raster values for selected pixels
100%
Export results to file output/sample.txt

A pandas DataFrame is returned by the function (each row being an observation) and the sampled observations are also written to a text file defined by the argument output_file.


In [10]:
# To import data as pandas DataFrame if necessary
dataset = pd.read_table("output/sample.txt",
                         delimiter=",")

# Print the first five rows
print(dataset.head(5))


   altitude  aspect  dist_defor  dist_edge  dist_river  dist_road  dist_town  \
0      13.0   129.0     16658.0       30.0       150.0     6737.0     3231.0   
1      56.0   135.0      1950.0       30.0      6750.0      750.0     1142.0   
2      81.0   248.0       175.0       95.0      7500.0     2758.0     3221.0   
3     123.0   252.0       120.0       90.0      1082.0     7107.0     7107.0   
4     277.0    17.0        90.0       60.0       912.0     5743.0     6696.0   

   fordefor2010  sapm  slope         X          Y   cell  
0           0.0   0.0    2.0  969105.0  8657115.0  229.0  
1           0.0   0.0    0.0  974115.0  8643255.0  310.0  
2           0.0   1.0    3.0  975975.0  8641245.0  391.0  
3           0.0   0.0    3.0  943275.0  8630595.0  469.0  
4           0.0   1.0   12.0  975315.0  8628855.0  472.0  

Sampled observations can be plotted using function .plot.obs() from the deforestprob module. Dark red dots indicate deforestation observations and dark green dots indicate forest observations.


In [11]:
# Plot sample points
dfp.plot.obs(sample=dataset,
             name_forest_var="fordefor2010",
             input_forest_raster="data/fordefor2010.tif",
             output_file="output/obs.png",
             zoom=(340000,412000,7420000,7500000),
             s=5,dpi=150)


Out[11]:

Descriptive statistics

Before modelling the deforestation, it is important to look at the relationship between environmental variables and deforestation. Using formulas from the patsy Python module, we can specify the relationships that we want to look at. In the example below, we plot the relationships between some continuous environmental variables and the probability of deforestation using function .plot.correlation() from the deforestprob package. Note that -1 must be set at the end of the formula.


In [12]:
# ========================================================
# Descriptive statistics
# ========================================================

# To import data as pandas DataFrame if necessary
# dataset = pd.read_table("output/sample.txt",
#                         delimiter=",")

# Model formulas
formula_1 = "fordefor2010 ~ dist_road + dist_town + dist_defor + \
dist_river + dist_edge + altitude + slope + aspect - 1"
# Standardized variables (mean=0, std=1)
formula_2 = "fordefor2010 ~ scale(dist_road) + scale(dist_town) + \
scale(dist_defor) + scale(dist_river) + scale(dist_edge) + \
scale(altitude) + scale(slope) + scale(aspect) - 1"
formulas = (formula_1, formula_2)

# List to store figures
corr_fig = []

# Remove NA from data-set (otherwise scale() doesn't work)
dataset = dataset.dropna(axis=0)

# Loop on formulas
for f in range(len(formulas)):
    # Output file
    of = "output/correlation_" + str(f) + ".pdf"
    # Data
    y, data = dmatrices(formulas[f], data=dataset,
                        return_type="dataframe")
    
    # Plots
    corr_fig.append(dfp.plot.correlation(y=y, data=data, 
                                         plots_per_page=3,
                                         figsize=(7,8),
                                         dpi=100,
                                         output_file=of))

The function .correlation() returns a serie of graphics that can be analyzed to choose the right relationship for each continuous variable (linear or polynomial for example).


In [13]:
# Correlation plots
corr_fig[1][0]


Out[13]:

In this example, we can see that a linear model should be sufficient to represent the relationship between the probability of deforestation and the standardized distance to the nearest road or town. On the contrary, it could be interesting to fit a polynomial model for the the standardized distance to previous deforestation (dist_defor variable) for which the relationship seems non-linear. Several models can be fitted and compared to see if a second-order or third-order polynomial relationship is justified.

Spatial deforestation model

We propose to use the Binomial iCAR model (Vieilledent et al. 2014) to estimate the deforestation probability of a pixel given a set of environmental variables. The Binomial iCAR model is a linear Binomial logistic regression model including an intrinsic Conditional Autoregressive (iCAR) process to account for the spatial autocorrelation of the observations. Parameter inference is done in a hierarchical Bayesian framework. The .model_binomial_iCAR() function from the deforestprob module includes a Metropolis-within-Gibbs algorithm written in pure C code to reduce computation time.

Figure caption: Parameter inference is done in a hierarchical Bayesian framework, which relies on the Bayes' theorem named after Reverend Thomas Bayes. Each parameter has a prior and an approximated posterior probability distribution from which we can compute the mean, standard deviation, credible intervals at 95%, etc.

For the deforestation process it is very important to take into account the spatial autocorrelation of the process with spatial random effects. Indeed, the selected fixed environmental variables are not able to fully explain the spatial variability of the deforestation process, especially when working at large geographical scales, such as the national or continental scale. Spatial random effects allow estimating a higher/lower probability of deforestation in a particular region (associated to unmeasurable or unknow factors) that is different from the mean probability of deforestation derived from the environmental factors included in the model. The Binomial iCAR model can be discribed as follow:

Ecological process

\begin{equation} y_i \sim \mathcal{B}inomial(\theta_i,t_i) \\ \text{logit}(\theta_i) = X_i \beta + \rho_{j(i)} \end{equation}

$y_i$: random variable for the deforestation process (0 if no deforestation, 1 if deforestation)

$\theta_i$: probability of deforestation

$t_i$: number of trials (always 1 in our example)

$X_i$: vector of values for environmental explicative variables

$\beta$: vector of fixed effect parameters

$\rho_j$: spatial random effect

$j(i)$: index of the spatial entity for observation $i$.

Spatial autocorrelation

The spatial autocorrelation is modelled with an intrinsic conditional autoregressive (iCAR) process:

\begin{equation} \rho_j \sim \mathcal{N}ormal(\mu_j,V_{\rho} / n_j) \end{equation}

$\mu_j$: mean of $\rho_{j'}$ in the neighborhood of $j$.

$V_{\rho}$: variance of the spatial random effects.

$n_j$: number of neighbors for spatial entity $j$.

Figure caption: Representation of the neighborhood for the intrinsic conditional autoregressive (iCAR) process. Target spatial cell $j$ has 8 neighbors in this case. Several observations (black points, equivalent to pixel centers in our case) can be located in each spatial cell. Deforestation probability in one spatial cell $j$ depends on deforestation probability in neighboring cells.

Before running the model, we add a column indicating the number of trials for each observation (1 in our case as we are considering a Bernoulli process). We then remove any observation with non-available data (NA) from the data-set. We also compute the number of neighbors (nneigh) and the neighbor identifiers (adj) for each spatial cell using function .cellneigh from the deforestprob module.

A model formula must also be defined to specify the explicative variables we want to include in the model. The formula allows specifying some variable transformations (such as standardization in our case). See the patsy module for more information. In our model, we included the following variables: location inside a protected area, altitude, distance to past deforestation (with a degree two polynomial), distance to forest edge, distance to nearest road and distance to nearest town. The formula must end with the name of the variable indicating the spatial cell for each observation (cell in our case).


In [14]:
# ========================================================
# hSDM model
# ========================================================

# Set number of trials to one
dataset["trial"] = 1

# Remove observations with NA
dataset = dataset.dropna(axis=0)

# Spatial cells for spatial-autocorrelation
nneigh, adj = dfp.cellneigh(raster="data/fordefor2010.tif", csize=10, rank=1)

# Formula
formula = "I(1-fordefor2010) + trial ~ C(sapm) + scale(altitude) +  scale(slope) + \
scale(dist_defor) + np.power(scale(dist_defor),2) + \
scale(dist_edge) + \
scale(dist_road) + scale(dist_town) + cell"


Compute number of 10 x 10 km spatial cells
... 12393 cells (153 x 81)
Identify adjacent cells and compute number of neighbors

In [15]:
# Model
mod_binomial_iCAR = dfp.model_binomial_iCAR(
    # Observations
    suitability_formula=formula, data=dataset,
    # Spatial structure
    n_neighbors=nneigh, neighbors=adj,
    # Chains
    burnin=1000, mcmc=1000, thin=1,
    # Starting values
    beta_start=-99)


Using estimates from classic logistic regression as starting values for betas

Results of the model

Once the model has been fitted, we can print a summary of the model showing the parameter estimates. The 95% credible intervals obtained from the posterior distribution of each parameter, except distance to nearest town (dist_town), do not include zero, indicating that parameters are significantly different from zero. The variance of the spatial random effects (Vrho) is given together with the deviance value, which can be used to compare different statistical models (lower deviance is better). Looking at the parameter estimates, we can see that the deforestation probability is much lower inside protected areas and that deforestation probability decreases with altitude, slope, distance to past deforestation, forest edge, roads and towns. Parameter values are then coherent regarding the deforestation process and easy to interpret.


In [16]:
# Summary
print(mod_binomial_iCAR)


Binomial logistic regression with iCAR process
  Model: I(1 - fordefor2010) + trial ~ 1 + C(sapm) + scale(altitude) + scale(slope) + scale(dist_defor) + np.power(scale(dist_defor), 2) + scale(dist_edge) + scale(dist_road) + scale(dist_town) + cell
  Posteriors:
                                     Mean        Std     CI_low    CI_high
                     Intercept     -0.309     0.0788      -0.45      -0.17
                C(sapm)[T.1.0]     -0.632     0.0603     -0.751     -0.511
               scale(altitude)     -0.483     0.0685     -0.605     -0.374
                  scale(slope)    -0.0673     0.0301     -0.127    -0.0125
             scale(dist_defor)      -0.75     0.0491     -0.828      -0.67
np.power(scale(dist_defor), 2)     0.0561    0.00546      0.046     0.0655
              scale(dist_edge)     -0.524     0.0472     -0.607     -0.424
              scale(dist_road)     -0.113     0.0559     -0.215     -0.013
              scale(dist_town)    -0.0928     0.0459     -0.177    0.00632
                          Vrho       7.23       0.33       6.68        7.9
                      Deviance   1.96e+04       60.8   1.95e+04   1.97e+04

To check for the convergence of the Markov chain Monte Carlo (MCMC), we can plot the traces and the posterior distributions of the estimated parameters using method .plot() associated to the hSDM_binomial_iCAR class defined in the deforestprob module. This method returns the figures showing the traces and posterior distributions.


In [17]:
# Plot
traces_fig = mod_binomial_iCAR.plot(output_file="output/mcmc.pdf",
                                    plots_per_page=3,
                                    figsize=(9,6),
                                    dpi=100)
traces_fig[0]


Traces and posteriors will be plotted in output/mcmc.pdf
Out[17]:

Predicting the spatial probability of deforestation

We use the model to predict the spatial probability of deforestation at the national scale for Madagascar. Before, doing so, we smooth the spatial random effects which have been estimated at a coarse resolution (10km in our example). To do so, we use the function .resample_rho() from the deforestprob module to resample the results at a finer resolution using a bilinear interpolation. The function writes a raster file to the disk with a resolution of the raster specified in the argument input_raster of the function (1km in our case). The function .resample_rho() returns a figure of the spatial random effects that can be plotted.


In [44]:
# ========================================================
# Resampling spatial random effects
# ========================================================

# Spatial random effects
rho = mod_binomial_iCAR.rho

# Resample
dfp.resample_rho(rho=rho, input_raster="data/fordefor2010.tif",
                 output_file="output/rho.tif",
                 csize_orig=10, csize_new=1,
                 figsize=(5,5),
                 dpi=150)


Write spatial random effect data to disk
Compute statistics
Build overview
Resampling spatial random effects to file output/rho.tif
Make figure
Out[44]:

The .predict_hSDM() function of the deforestprob module can be used to predict the spatial probability of deforestation from an hSDM_binomial_iCAR model (i.e. an object of class hSDM_binomial_iCAR). The function writes a raster of predictions to the disk and returns a figure with the predictions that can be plotted. The prediction is done by block to avoid memory problems for big datasets. Functions will return NA for pixels with no forest or for pixels with missing environmental variables.


In [45]:
# ========================================================
# Predicting spatial probability of deforestation
# ========================================================

# Compute predictions
fig_pred = dfp.predict(mod_binomial_iCAR, var_dir="data",
            input_cell_raster="output/rho.tif",
            input_forest_raster="data/fordefor2010.tif",
            output_file="output/pred_binomial_iCAR.tif",
            blk_rows=128,
            dpi=200)


Make virtual raster with variables as raster bands
Divide region in 398 blocks
Create a raster file on disk for projections
Predict deforestation probability by block
100%
Compute statistics
Build overview
Make figure

We can then plot the spatial probability of deforestation.


In [46]:
# Plot the spatial probability of deforestation.
fig_pred.set_size_inches(10,12)
fig_pred


Out[46]:

Predicting the future forest cover

Given the spatial probability of deforestation and the number of hectares that should be deforested, we can predict the future forest cover using function .deforest() from the deforestprob package. The number of hectares are converted into number of pixels to be deforested. Pixels with the highest probability of deforestation are deforested first. The function computes a probability threshold above which pixels are deforested.

In our example, we consider an annual deforestation of roughly 100,000 ha for Madagascar. Considering the period 2010-2050, this would correspond to 4 Mha of deforestation. This number can of course be more precise and refined considering various deforestation scenarios (demographic growth, economic development, etc.).


In [2]:
# ========================================================
# Predicting forest cover
# ========================================================

forest_cover = dfp.deforest(input_raster="output/pred_binomial_iCAR.tif",
                            hectares=4000000,
                            output_file="output/forest_cover_2050.tif",
                            blk_rows=128)


Divide region in 398 blocks
Compute the total number of forest pixels
100%
Compute the histogram of values
100%
Identify threshold
Minimize error on deforested hectares
Create a raster file on disk for forest cover
Write raster of forest cover
100%
Compute statistics
Build overview
Make figure

Because deforestation probability is scaled on the interval [1, 65535] including 65535 values, there might be a difference between the deforested hectares set as input and the true deforested surfaces. The magnitude of the difference depends on the distribution of the deforestation probabilities. Usually, the distribution of the deforestation probabilities should be asymmetric with low frequencies for higher probabilities of deforestation.


In [3]:
# Distribution of the deforestation probabilities
frequences = forest_cover["statistics"][0]
threshold = forest_cover["statistics"][1]
freq = plt.figure()
plt.subplot(111)
plt.plot(frequences, "bo")
plt.title("Frequencies of deforestation probabilities")
plt.axvline(x=threshold, color='k', linestyle='--')
freq


Out[3]:

As a consequence, the error should be negligible compared to the number of hectares set as input.


In [4]:
# Error regarding deforested hectares
future_forest_error = forest_cover["statistics"][2]
future_forest_hectares = forest_cover["statistics"][3]
error_perc = 100*future_forest_error/future_forest_hectares
print("Error (ha): %.2f, %.3f%%" % (future_forest_error, error_perc))


Error (ha): 58.10, 0.001%

We can plot the predicted future forest cover in 2050. The red areas represent the deforestation on the period 2010-2050. The green areas represent the remaining forest in 2050. Most of the remaining forest in 2050 are inside the protected areas or located in remote areas, at high altitudes and far from roads and big cities (for example in the Tsaratanana mountain region and around the Masoala peninsula, north-east Madagascar).


In [18]:
# Plot future forest cover
dfp.plot.forest(input_forest_raster="output/forest_cover_2050.tif",
                output_file="output/forest_cover_2050.png",
                col=(128, 128, 128, 255),  # rgba color for deforestation
                figsize=(4,4),
                dpi=200,
                zoom=(340000,412000,7420000,7500000))


Out[18]:

Model validation

Deviance explained

We can compute the percentage of deviance explained by the model with or without accounting for spatial autocorrelation. To do so, we need to compute the deviance of the full (or saturated) and null model. The full model assumes a perfect fit of the data and has a deviance equals to zero. The null model is a simple mean-only model ($\text{logit}(\theta_i)=\mu$). The deviance of the null model and of the model without spatial autocorrelation can be computed using a classical Binomial regression, without hierarchical structure. We will use the statsmodels Python module to do so.


In [18]:
# Additional module
import statsmodels.api as sm
# Full model
deviance_full = 0
# Null model
formula_null = "I(1-fordefor2010) ~ 1"
Y, X = dmatrices(formula_null, data=dataset, NA_action="drop", return_type="dataframe")
mod_null = sm.GLM(Y, X, family = sm.families.Binomial())
fit_null = mod_null.fit()
deviance_null = fit_null.deviance
# Model with no spatial random effects
formula_nsre = "I(1-fordefor2010) ~ C(sapm) + scale(altitude) +  scale(slope) + \
scale(dist_defor) + np.power(scale(dist_defor),2) + scale(dist_edge) + scale(dist_road) + scale(dist_town)"
Y, X = dmatrices(formula_nsre, data=dataset, NA_action="drop", return_type="dataframe")
mod_nsre = sm.GLM(Y, X, family = sm.families.Binomial())
fit_nsre = mod_nsre.fit()
deviance_nsre = fit_nsre.deviance
# Model with iCAR process
deviance_icar = mod_binomial_iCAR.Deviance
# Dataframe
mod = np.array(["null", "nsre", "icar", "full"])
dev = np.array([deviance_null, deviance_nsre, deviance_icar, deviance_full])
perc = 100*(1-dev/deviance_null)
print(pd.DataFrame.from_items([("model", mod),
                               ("deviance", np.rint(dev).astype(int)),
                               ("percentage", np.rint(perc).astype(int))]))


  model  deviance  percentage
0  null     27637           0
1  nsre     25537           8
2  icar     19581          29
3  full         0         100

We see that the model without spatial random effects only explains 8% of the deviance while the model including spatial random effects explains 29% of the deviance. These percentage might appear rather low, but one should remember that we are trying to predict the deforestation at the pixel scale with a precision of 30 meters.

Predictions vs. observations

Another way of assessing model performance is to compute some accuracy indices based on a confusion matrix. We select the following accuracy indices: the Overall Accuracy (OA), the Specificity (Spe), the Sensitivity (Sen), Cohen's Kappa (K), the True Skill Statistics (TSS) and the Figure Of Merit (FOM). These indices are classical indices used to estimate the accuracy of deforestation models (see for example Dezécache et al. 2017, Vieilledent et al. 2013 and Pontius et al. 2011).

Internal validation

We first compute these indices on the sampling data-set. Here we are interested in seeing if our model is predicting well the location of the deforestation (relative deforestation probability), not the intensity of deforestation (absolute deforestation probability). We then transform the predicted probability of deforestation into binary values using the deforestation probability "observed" in the sampling data-set. This last value should be closed to 0.5 as we sampled as many forest and non-forest pixels, at the exception of the pixels with no-data value for environmental variables. The same proportion of deforested pixels will be applied to the predictions selecting the pixels with the highest predicted deforestation probabilities.


In [79]:
# Predictions
dataset["pred"] = 0
dataset["theta_pred"] = mod_binomial_iCAR.theta_pred
# Proportion of deforested pixels
nobs = dataset["fordefor2010"].size  # inferior to 20000 as there were NaN
ndefor = sum(dataset["fordefor2010"] == 0)  # 0 for deforestation in fordefor2010
proba_defor = float(ndefor)/nobs  # not exactly 0.5
# Probability threshold to transform probability into binary values
proba_thresh = np.percentile(dataset["theta_pred"], 100*(1-proba_defor))  # ! must be (1-proba_defor)
dataset.loc[dataset["theta_pred"] >= proba_thresh, "pred"] = 1
# We check that the proportion of deforested pixel is the same for observations/predictions
ndefor_pred = sum(dataset["pred"] == 1)
proba_defor_pred = float(ndefor_pred)/nobs
proba_defor_pred == proba_defor


Out[79]:
True

We use function accuracy_indices() from the deforestprob package to compute the accuracy indices.


In [80]:
# Accuracy indices
pred = dataset["pred"].tolist()
obs = (1-dataset["fordefor2010"]).tolist()
internal_validation = dfp.accuracy_indices(pred,obs)
print(internal_validation)


{'K': 0.58, 'OA': 0.79, 'Spe': 0.79, 'Sen': 0.79, 'FOM': 0.65, 'TSS': 0.58}

The accuracy indices show that the model has good predictive abilities. Their value is relatively high (close to 1). Once we know the number of pixels to be deforested, the model correctly identifies the pixels that should be deforested first.

External validation

We can also compute these indices using independant observations. To do so, we will compare the predicted deforestation on the period 2010-2014 with the true observed deforestation on the same period. The forest cover for the year 2014 for Madagascar can be downloaded from the BioSceneMada website.


In [47]:
# Download forest cover for 2014
cmd = "wget -P data http://bioscenemada.cirad.fr/FileTransfer/for2014.tif"
os.system(cmd)


Out[47]:
0

We compute the observed forest-cover change between 2010 and 2014 using gdal_calc.py.


In [70]:
# Forest-cover change between 2010 and 2014
os.system("gdal_translate -a_nodata 99 -co 'COMPRESS=LZW' -co 'PREDICTOR=2' \
                          data/fordefor2010.tif output/fordefor2010_.tif") # Set nodata different from 255
os.system("gdal_translate -a_nodata 99 -co 'COMPRESS=LZW' -co 'PREDICTOR=2' \
                          data/for2014.tif output/for2014_.tif")
os.system("gdal_calc.py --overwrite -A output/fordefor2010_.tif -B output/for2014_.tif \
                --outfile=output/fcc_2010_2014_obs.tif --type=Byte \
                --calc='255-254*(A==1)*(B==1)-255*(A==1)*(B==255)' --co 'COMPRESS=LZW' --co 'PREDICTOR=2' \
                --NoDataValue=255")


Out[70]:
0

We then need to compute the number of hectares that have been deforested between 2010 and 2014. We use gdalinfo to do so.


In [81]:
# Compute deforestation between 2010 and 2014
out_2010 = os.popen("gdalinfo -stats -hist -noct -norat -nomd data/for2014.tif").read()
out_2014 = os.popen("gdalinfo -stats -hist -noct -norat -nomd data/fordefor2010.tif").read()
#print(out_2010)
#print(out_2014)
d0014 = (103445946-98994821)*30*30/10000 # deforestation in ha
print(d0014)


400601

400601 hectares have been deforested between 2010 and 2014.


In [61]:
# Predicting forest cover in 2014
fcc_2010_2014_prd = dfp.deforest(input_raster="output/pred_binomial_iCAR.tif",
                                 hectares=d0014,
                                 output_file="output/fcc_2010_2014_pred.tif",
                                 blk_rows=128)


Divide region in 398 blocks
Compute the total number of forest pixels
100%
Compute the histogram of values
100%
Identify threshold
Minimize error on deforested hectares
Create a raster file on disk for forest cover
Write raster of forest cover
100%
Compute statistics
Build overview
Make figure

Finally, we compute the accuracy indices comparing the predicted and observed forest-cover change map between 2010 and 2014 using function validation() from the deforestprob package.


In [86]:
# Computing accuracy indices
external_validation = dfp.validation(pred="output/fcc_2010_2014_pred.tif",
                                     obs="output/fcc_2010_2014_obs.tif",
                                     blk_rows=128)


Divide region in 398 blocks
Compute the confusion matrix
100%
             obs0       obs1
pred0  94581745.0  3971324.0
pred1   3984113.0   466892.0
Compute accuracy indices

In [87]:
print(external_validation)


{'K': 0.06, 'OA': 0.92, 'Spe': 0.96, 'Sen': 0.11, 'FOM': 0.06, 'TSS': 0.06}

References

Dezécache, C.; Salles, J.-M.; Vieilledent, G. & Hérault, B. 2017. Moving forward socio-economically focused models of deforestation. Global Change Biology. in press. doi: 10.1111/gcb.13611

Hansen, M. C.; Potapov, P. V.; Moore, R.; Hancher, M.; Turubanova, S. A.; Tyukavina, A.; Thau, D.; Stehman, S. V.; Goetz, S. J.; Loveland, T. R.; Kommareddy, A.; Egorov, A.; Chini, L.; Justice, C. O. & Townshend, J. R. G. 2013. High-Resolution Global Maps of 21st-Century Forest Cover Change. Science. 342:850-853. doi: 10.1126/science.1244693

Harper, G. J.; Steininger, M. K.; Tucker, C. J.; Juhn, D. & Hawkins, F. 2007. Fifty years of deforestation and forest fragmentation in Madagascar. Environmental Conservation. 34:325-333. doi: 10.1017/S0376892907004262

Pontius R. G. Jr., J. & Millones, M. 2011. Death to Kappa: birth of quantity disagreement and allocation disagreement for accuracy assessment. International Journal of Remote Sensing. 32:4407-4429. doi: 10.1080/01431161.2011.552923

Vieilledent, G.; Merow, C.; Guélat, J.; Latimer, A. M.; Kéry, M.; Gelfand, A. E.; Wilson, A. M.; Mortier, F. & Silander Jr., J. A. 2014. hSDM CRAN release v1.4 for hierarchical Bayesian species distribution models. Zenodo. doi: 10.5281/zenodo.48470

Vieilledent, G.; Grinand, C. & Vaudry, R. 2013. Forecasting deforestation and carbon emissions in tropical developing countries facing demographic expansion: a case study in Madagascar. Ecology and Evolution. 3:1702-1716. doi: 10.1002/ece3.550