Based on discussions with Bastian and various colleagues.
The spatial analysis was done outside of this notebook. In a nutshell, the spatial component dealt with the question of how much cumulative marine pressure there is in each unit of a hypothetical biogeographic classification (described below). The analysis was structured so that aggregation happens at a later stage: if thresholds need to change (very likely, given the explorative nature of the exercise), this requires minimal effort and no re-run of the spatial analysis, which is time-consuming and prone to error.
Nodata in the result (introduced when converting rasters to numpy) was also removed at that stage, saving the effort of having to remove it manually here.
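As an illustration only, the nodata stripping during that external step could have looked roughly like the sketch below; reading the band's nodata value and masking it together with zeros is an assumption, not the exact code used.
In [ ]:
# illustrative sketch only: drop nodata (and zero) cells when reading a raster band with GDAL
# assumes the band defines a nodata value; not the exact code used in the spatial analysis
from osgeo import gdal

ds = gdal.Open('global_cumul_impact_2013_all_layers.tif')
band = ds.GetRasterBand(1)
arr = band.ReadAsArray()
nodata = band.GetNoDataValue()
valid = arr[(arr != nodata) & (arr != 0)] if nodata is not None else arr[arr != 0]
valid.size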
In [2]:
# load default libraries
import os, sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# make sure gdal is correctly installed
from osgeo import gdal
import gc
%matplotlib inline
It is necessary to load the original raster in order to calculate its quantiles. These are used to define thresholds for exploring the extent of marine wilderness areas.
In [3]:
def raster2array(rasterfn):
    raster = gdal.Open(rasterfn)
    band = raster.GetRasterBand(1)
    return band.ReadAsArray()
In [4]:
g_array = raster2array('global_cumul_impact_2013_all_layers.tif')
In [5]:
g_array_f = g_array.flatten()
In [6]:
(g_array_f == 0).sum()
Out[6]:
In [7]:
print('The total number of non-zero values in the raw raster dataset:', g_array_f.size - (g_array_f == 0).sum())
## strictly, equality of floats should be tested with np.isclose; the result is the same here,
## so the simpler comparison is used:
## np.isclose(g_array_f, 0.0).sum()
The number of non-zero values differs notably from ESRI's calculation, which stands at 414,347,791, roughly 300,000 fewer than calculated here. This suggests ESRI may use a larger tolerance, i.e. a larger cut-off below which a value is regarded as zero.
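Purely to illustrate the tolerance point, the cell below (a sketch with arbitrary atol values, not ESRI's actual setting) shows how a larger absolute tolerance in np.isclose reduces the count of values treated as non-zero.
In [ ]:
# illustration only: a larger absolute tolerance treats more near-zero pixels as zero,
# reducing the non-zero count (the atol values are arbitrary; ESRI's tolerance is unknown)
for atol in (1e-12, 1e-6, 1e-3):
    n_zero = np.isclose(g_array_f, 0.0, atol=atol).sum()
    print(atol, g_array_f.size - n_zero)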
Now, get the quantiles. These thresholds are subject to change; for the time being, arbitrary values of 1%, 3%, 5% and 10% are used.
In [8]:
## the percentile function applied to the sliced array, i.e., those with values greater than 0
quantiles = [np.percentile(g_array_f[~(g_array_f == 0)], quantile) for quantile in [1,3,5,10]]
In [9]:
quantiles
Out[9]:
In [10]:
print('\n'.join(['Threshold cut-off value: '+ str(threshold) for threshold in quantiles]))
The hypothetical biogeographical classification of the marine environment within EEZs is a combination of MEOW (Marine Ecoregions of the World), its visual representation extended to 200 nautical miles (hereafter MEOW visual), and the world's pelagic provinces. The spatial data were prepared so that, moving outwards from the coastline, disjoint polygons represent: MEOW (up to 200 m depth, inner/red), MEOW visual overlapping the pelagic provinces (middle/green), and pelagic provinces that do not overlap MEOW visual (outer/blue). This is purely a spatial aggregation based on the above data and the World Vector Shoreline EEZ. See below for an example.
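The overlay itself was prepared outside this notebook; the geopandas sketch below is only a rough illustration of the three disjoint rings, and the file names and exact overlay order are assumptions.
In [ ]:
# rough sketch of the overlay logic (done outside this notebook); file names are assumptions
import geopandas as gpd

eez = gpd.read_file('eez.shp')                  # World Vector Shoreline EEZ
meow = gpd.read_file('meow.shp')                # MEOW, to 200 m depth
meow_visual = gpd.read_file('meow_visual.shp')  # MEOW visual, to 200 nautical miles
pelagic = gpd.read_file('pelagic.shp')          # pelagic provinces

inner = gpd.overlay(meow, eez, how='intersection')                       # inner/red
middle = gpd.overlay(gpd.overlay(meow_visual, pelagic, how='intersection'),
                     meow, how='difference')                             # middle/green
outer = gpd.overlay(pelagic, meow_visual, how='difference')              # outer/blue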
Load the input_data table, which describes the intersection between the marine pressure layer and the marine ecoregion/pelagic province classification. The input_attr table relates each OBJECTID to its attributes (Province, Ecoregion, Realm and category), while input_data relates each OBJECTID to raster pixel values:
- OBJECTID (one) - pixel value (many)
- OBJECTID (many) - attr: Province, Ecoregion, Realm and category (one)
Each pixel is 934.478 m in both height and width, giving each pixel an area of about 0.873 $km^2$.
In [11]:
# calculate cell size in square kilometres
cell_size = 934.478*934.478/1000000
print(cell_size)
In [12]:
# the OBJECTID - ras_val table; it is very large and will take a long time to load
input_data = pd.read_csv('result.csv')
# print fields
input_data.columns
Out[12]:
In [13]:
input_data.ras_val.min()
Out[13]:
In [14]:
# the attribute table containing information about province etc.
input_attr = pd.read_csv('attr.csv')
# print fields
input_attr.columns
Out[14]:
In [15]:
# total count of pixels per OBJECTID, i.e. base
result_count = input_data.groupby('OBJECTID').count().reset_index()
Here I create four result tables, each containing only the pixels that meet the criterion specified by one of the thresholds above.
In [16]:
# keep only pixels within the lowest 1, 3, 5 and 10 percent of cumulative impact (the least impacted marine areas)
result_1, result_3, result_5, result_10 = \
[input_data[input_data.ras_val <= threshold].groupby('OBJECTID').count().reset_index() for threshold in quantiles]
The next step is to join the input_attr table with the filtered pixel counts. Replace the result_10 table if a different threshold is used.
In [17]:
# join base to the attribute
attr_merge = pd.merge(input_attr, result_count, on = 'OBJECTID')
# join result to the above table
attr_merge_10 = pd.merge(attr_merge, result_10, how = 'left', on ='OBJECTID', suffixes = ('_base', '_result'))
# fill ras_val_result's NaN with 0, and the province fields with 'None'. This should ideally happen earlier
attr_merge_10['ras_val_result'].fillna(0, inplace=True)
attr_merge_10['PROVINCE'].fillna('None', inplace=True)
attr_merge_10['PROVINCE_P'].fillna('None', inplace=True)
# apply an aggregate function to each sub-dataframe resulting from the grouping
def apply_func(group):
    overlap = group['ras_val_result'].sum()*cell_size  # in sqkm
    base = group['ras_val_base'].sum()*cell_size
    per = overlap/base
    # multiple columns can be returned as a result, if returned as a pd.Series
    return pd.Series([overlap, per, base], index=['less_than_threshold', 'per_ltt', 'base'])

# code reuse: parameterised by threshold and grouping fields
def calculate_wilderness_marine(threshold, groups):
    """<threshold to consider wilderness value>, <a python list such as ['PROVINCE', 'PROVINCE_P', attr fields]>"""
    # filter input data according to the threshold and count pixels per OBJECTID
    input_data_filtered = input_data[input_data.ras_val <= threshold].groupby('OBJECTID').count().reset_index()
    # base merge
    base_merge = pd.merge(input_attr, result_count, on='OBJECTID')
    # merge the two above
    result = pd.merge(base_merge, input_data_filtered, how='left', on='OBJECTID', suffixes=('_base', '_result'))
    # solve the nodata issue
    result['ras_val_result'].fillna(0, inplace=True)
    result['PROVINCE'].fillna('None', inplace=True)
    result['PROVINCE_P'].fillna('None', inplace=True)
    return result.groupby(groups).apply(apply_func).reset_index()
Once all tables are joined (full attributes with pixel counts), the attributes can be used to specify groupings.
In [18]:
# use 10% as threshold
calculate_wilderness_marine(quantiles[-1], ['PROVINCE', 'PROVINCE_P', 'category']).head(20)
Out[18]:
Further aggregation could be applied here, if needed.
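For instance, the output could be rolled up by category alone; the aggregation below is only an example, not a required step.
In [ ]:
# example of a further aggregation: total wilderness and base area per category
res = calculate_wilderness_marine(quantiles[-1], ['PROVINCE', 'PROVINCE_P', 'category'])
res.groupby('category')[['less_than_threshold', 'base']].sum()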
The World Heritage Convention currently operates only within areas under national jurisdiction, so the high seas/ABNJ are not considered. It is therefore sensible to reduce the area of interest to the extent of EEZs and adjust the wilderness threshold values accordingly.
Excluding Antarctica, where significant areas of wilderness exist, lowers the bar for an area to be considered wilderness: the cumulative marine pressure threshold becomes higher, and more areas would be 'eligible' as wilderness.
In [19]:
# check data integrity
input_data.OBJECTID.unique().size
Out[19]:
In [20]:
# total number of pixels in the result data (within EEZ)
input_data.ras_val.size
Out[20]:
In [21]:
# it should not contain 0, which indicates nodata in the raster; nodata was removed during the spatial analysis
input_data.ras_val.min()
Out[21]:
In [22]:
# percentage of EEZ water in relation to the entire ocean
input_data.ras_val.size/g_array_f[~(g_array_f==0)].size
Out[22]:
In [23]:
# all input_data are non-zero (zero indicates land and nodata)
input_data[~(input_data.ras_val == 0)].ras_val.count() == input_data.ras_val.count()
Out[23]:
In [24]:
# get threshold for 10%
new_threshold = np.percentile(input_data.ras_val, 10)
old_threshold = np.percentile(g_array_f[~(g_array_f == 0)], 10)
Use the new threshold (based on EEZ) and the function defined in the previous section to output wilderness statistics by MEOW province and by the full grouping (PROVINCE, PROVINCE_P, category).
Other combinations or thresholds can be used if required (see the example after the export cell below).
In [25]:
# export wilderness distribution by province or other groupings
calculate_wilderness_marine(new_threshold, ['PROVINCE']).to_csv('export_meow_province.csv')
calculate_wilderness_marine(new_threshold, ['PROVINCE', 'PROVINCE_P', 'category']).to_csv('export_province_full.csv')
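As an example of a different combination, the cell below (illustrative only; the 5% figure is arbitrary) uses a stricter 5% cut-off within the EEZ and groups by pelagic province.
In [ ]:
# illustrative only: a stricter 5% threshold within EEZ, grouped by pelagic province
threshold_5 = np.percentile(input_data.ras_val, 5)
calculate_wilderness_marine(threshold_5, ['PROVINCE_P']).head()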
The distribution map of wilderness within EEZs, using the new threshold.
Distribution of the percentage of wilderness (less than threshold, ltt) by groups.
In [ ]:
import seaborn as sns
In [ ]:
# small multiples: distribution of percentage of less than threshold (ltt)
g = sns.FacetGrid(calculate_wilderness_marine(new_threshold, ['PROVINCE', 'PROVINCE_P', 'category']), col="category")
g.map(plt.hist, 'per_ltt', bins=50, log=True)
In [ ]:
# MEOW province (200m and 200 nautical combined)
sns.distplot(calculate_wilderness_marine(new_threshold, ['PROVINCE']).per_ltt)
In [ ]:
# pelagic province
sns.distplot(calculate_wilderness_marine(new_threshold, ['PROVINCE_P']).per_ltt)
From the graphs, it is obvious that most provinces/pelagic provinces have a very low percentage of marine wilderness area within them.
In [27]:
# load data
wh47 = pd.read_csv('wh47.csv')
wh_attr = pd.read_csv('wh_attr.csv')
print(wh47.columns, wh_attr.columns)
In [28]:
# check thresholds, use new threshold
print('Old threshold: {0}\nNew threshold: {1}'.format(old_threshold, new_threshold))
In [29]:
# get WH statistics
wh_n_base = (wh47.groupby('wdpaid').ras_val.count()*cell_size).reset_index() # all marine area
wh_n = (wh47[wh47.ras_val<new_threshold].groupby('wdpaid').ras_val.count()*cell_size).reset_index() # marine wild
# merge in order to calculate percentage (% of marine wilderness in marine area of WH sites)
a = pd.merge(wh_n_base, wh_n, on='wdpaid', suffixes=('_all', '_wild'))
a = pd.merge(wh_attr, a, how='inner', on='wdpaid')
a['per'] = a.ras_val_wild/a.ras_val_all
# export save
a.to_csv('export_wh_wilderness.csv')
In [ ]:
# distribution of WH wilderness percentage
sns.distplot(a.per)
In [ ]:
sns.distplot(a.ras_val_wild)
del a
In [35]:
input_attr.columns, wh_attr.columns
Out[35]:
In [36]:
int_wh = pd.read_csv('wh_base_intersect.csv')
int_wh_attr = pd.read_csv('wh_base_intersect_attr.csv')
In [37]:
int_wh.columns, int_wh_attr.columns
Out[37]:
In [41]:
int_wh_attr[['wdpaid', 'en_name', 'gis_area', 'PROVINCE_P', 'PROVINCE', 'category']].to_csv('wh_biogeo_intersect.csv')
In [34]:
# filter pixels that meet the new threshold (from EEZ)
int_wh_filter = int_wh[int_wh.ras_val < new_threshold]
# group value based on OBJECTID
int_wh_filter_group = int_wh_filter.groupby('OBJECTID_12').count().reset_index()
# attr join
int_result = pd.merge(int_wh_attr, int_wh_filter_group, on='OBJECTID_12')
# wilderness area (km2) inside each WH site within EEZ
int_result.groupby(['wdpaid', 'en_name']).ras_val.sum()*cell_size
Out[34]:
Contrary to what one might expect, wilderness in WH sites calculated from the intersection differs slightly from that obtained by using the WH boundary directly to cut out the marine cumulative impact data. This is due to boundary mismatches: the intersection of WH and EEZ (with biogeography attributes) removes all land area, where the marine pressure layer may still have mapped pixels (see the highlighted pixels below, in the Galapagos).
Conversely, due to the nature of the intersection (clipping in the strict sense), adjacent geometries sharing a long boundary might pick up the same pixel from the base raster twice. Manual checking suggests this occurs rarely and should not be a problem in most cases, although it could become one where a shared boundary is very long and complicated.
To address this issue in the future, one could revert to the old approach of using an aggregated boundary for the result, but every change would then mean a complete re-run. I would still prefer the fine-grained approach, whose benefits far outweigh the shortcomings: do the spatial work once at the finest scale and keep everything else non-spatial. Sub-pixel calculation is perhaps needed to determine whether an overlap should be counted or left out.
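A sub-pixel test of that kind could look roughly like the sketch below, which keeps a pixel only when more than half of its area falls inside the polygon; the function, its inputs and the 50% rule are all assumptions for illustration.
In [ ]:
# rough sketch of a sub-pixel overlap test (names and the 50% rule are assumptions)
from shapely.geometry import box

def keep_pixel(pixel_x, pixel_y, polygon, pixel_size=934.478):
    # build the pixel footprint and keep it only if >50% of its area lies inside the polygon
    pixel = box(pixel_x, pixel_y, pixel_x + pixel_size, pixel_y + pixel_size)
    return pixel.intersection(polygon).area / pixel.area > 0.5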
Furthermore, the mismatch issue is compounded by spatial data quality. See below the Natural System of Wrangel Island, where the blue part is the overlap between the WH boundary and the biogeography layer.
The only sensible way to deal with this is to use the WH-only calculation for WH statistics themselves (i.e. how much wilderness is in the WH system), while using the WH intersection result for relating WH sites to the biogeography.
Below is an in-depth investigation; it is not part of the gap analysis.
In [42]:
# calculate total WH marine area, no filter applied
# group value based on OBJECTID
int_wh_group = int_wh.groupby('OBJECTID_12').count().reset_index()
# base
G_base = (pd.merge(int_wh_attr, int_wh_group, on='OBJECTID_12').groupby(['wdpaid', 'en_name']).ras_val.sum()*cell_size).reset_index()
G_wh = (int_result.groupby(['wdpaid', 'en_name']).ras_val.sum()*cell_size).reset_index()
In [43]:
G_base.columns, G_wh.columns
Out[43]:
In [ ]:
G_result = pd.merge(G_base, G_wh, how='left', on=('wdpaid', 'en_name'))
G_result.fillna(0, inplace=True)
G_result.columns = ['wdpaid', 'en_name', 'marine_area', 'marine_wild_area']
G_result['per'] = G_result.marine_wild_area/G_result.marine_area
# G_result.to_csv('export_wh_per_.csv')
G_result
In [ ]:
wh47.columns, int_wh.columns
In [ ]:
wh47_int = pd.merge(int_wh, int_wh_attr, on='OBJECTID_12')
wh47_int.columns
In [ ]:
# compare differences from the two methods
a = wh47.groupby('wdpaid').ras_val.count().reset_index()
b = wh47_int.groupby('wdpaid').ras_val.count().reset_index()
c = pd.merge(a, b, on='wdpaid', suffixes=('_wh', '_int'))
c['per'] = abs(c.ras_val_wh - c.ras_val_int)/c.ras_val_wh
# c
del a, b, c
At first glance there are considerable percentage differences between the two methods of calculating marine areas within WH sites; however, at the site scale, apart from Wrangel Island, the differences are negligible.
In [ ]:
# the data to be used
## wh intersection
# filter pixels that meet the new threshold (from EEZ)
int_wh_filter = int_wh[int_wh.ras_val < new_threshold]
# group value based on OBJECTID
int_wh_filter_group = int_wh_filter.groupby('OBJECTID_12').count().reset_index()
# attr join
int_result = pd.merge(int_wh_attr, int_wh_filter_group, on='OBJECTID_12')
int_result.columns
In [ ]:
# get unique WDPAIDs for each province
int_result.groupby('PROVINCE').wdpaid.unique()
In [ ]:
# get province MEOW (200m + 200nm)
province = calculate_wilderness_marine(new_threshold, ['PROVINCE'])
# provinces with WH sites; nunique() returns the number of unique WDPAIDs
province_wh_number = pd.merge(province, int_result.groupby('PROVINCE').wdpaid.nunique().reset_index(), on='PROVINCE', how='left')
The above says nothing about wilderness. By linking it with the wilderness values per province it could help identify priority provinces, but it does not address how much of the wilderness is covered by WH sites: a well 'represented' province may have little of its vast wilderness enjoying WH status, and may therefore still present a gap from the point of view of marine wilderness.
In [ ]:
# WH area that are wilderness area within provinces
province_wh_wilderness = (int_result.groupby('PROVINCE').ras_val.sum() * cell_size).reset_index()
# get province attributes and join
a = pd.merge(province, province_wh_wilderness, on='PROVINCE', how = 'left')
# fill all NAs with 0
a.fillna(0,inplace=True)
# calculate percentage of province wilderness covered by WH
a['per_wilderness_covered_by_WH'] = a.ras_val/a.less_than_threshold
a.columns = ['PROVINCE', 'wilderness_area', 'per_wilderness_area', 'total_area', 'wh_wilderness_area', a.columns[-1]]
# ======== now get number of WH sites per Province into one single dataframe ==========
## num of WH sites
b = int_result.groupby('PROVINCE').wdpaid.nunique().reset_index()
b.columns = ['PROVINCE', 'num_wh']
## merge
a = pd.merge(a, b, how='left', on='PROVINCE')
a.fillna(0, inplace=True)
# a.sort_values('num_wh')
a.to_csv('export_gap_meow_province.csv')
# clear temp variable in case of polluting the global name space
del a