Based on the discussion with Bastian.
The spatial analysis was done outside of this notebook. In a nutshell, the spatial component addressed the question of how much cumulative marine pressure there is in each unit (see below for such a hypothetical biogeographic classification). The analysis was set up so that the aggregation happens at a later stage: if thresholds need to change (very likely, given the explorative nature of such an exercise), this requires minimal effort and no re-run of the spatial analyses, which are time-consuming and prone to error.
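A minimal sketch of this "aggregate late" design, using a hypothetical per-pixel table (the column names here are illustrative, not the actual schema): the expensive spatial step produces the per-pixel table once, and re-thresholding is then a cheap pandas operation.

```python
import pandas as pd

# hypothetical per-pixel table produced once by the spatial analysis:
# each row is one raster pixel with its unit id and pressure value
pixels = pd.DataFrame({
    'unit_id':  [1, 1, 1, 2, 2],
    'pressure': [0.2, 0.5, 1.1, 0.3, 0.9],
})

def aggregate(pixels, threshold):
    """Re-aggregate per-unit pixel counts for a new threshold -- no spatial re-run."""
    below = pixels[pixels.pressure <= threshold]
    return below.groupby('unit_id').size()

# changing the threshold only re-runs this cheap step
print(aggregate(pixels, 0.5).to_dict())  # {1: 2, 2: 1}
```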
In [1]:
# load default libraries
import os, sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# make sure gdal is correctly installed
from osgeo import gdal
import gc
%matplotlib inline
It is necessary to load the original raster in order to calculate its quantiles.
In [2]:
def raster2array(rasterfn):
    """Read the first band of a raster file into a NumPy array."""
    raster = gdal.Open(rasterfn)
    band = raster.GetRasterBand(1)
    return band.ReadAsArray()
In [3]:
g_array = raster2array('global_cumul_impact_2013_all_layers.tif')
In [4]:
g_array_f = g_array.flatten()
In [5]:
print('The total number of non-zero values in the raw raster dataset:', g_array_f.size - (g_array_f == 0).sum())
## strictly, np.isclose should be used when testing equality of float dtypes, i.e.
## (np.isclose(g_array_f, 0.0)).sum()
## but the result is the same here, so the simpler comparison is used
The number of non-zero values is notably different from Esri's calculation, which stands at 414,347,791, about 300,000 fewer non-zero values than the figure computed here. This suggests Esri may be using a larger tolerance value, i.e. a looser definition of what is small enough to be regarded as zero.
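The effect of such a tolerance can be illustrated with `np.isclose` (the values below are invented, and the `atol` shown is just an example, not Esri's actual setting): a larger absolute tolerance absorbs more near-zero pixels into the "zero" count.

```python
import numpy as np

# illustrative values: exact zeros, tiny near-zero residuals, real data
vals = np.array([0.0, 1e-12, 1e-6, 0.5, 2.0])

# exact comparison counts only true zeros
exact_zeros = (vals == 0).sum()

# a tolerance-based comparison also treats tiny residuals as zero
loose_zeros = np.isclose(vals, 0.0, atol=1e-8).sum()

print(exact_zeros, loose_zeros)  # 1 2
```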
Now, get the quantiles. These thresholds are subject to change; for the time being, arbitrary values of 1%, 3%, 5% and 10% are used.
In [6]:
## the percentile function applied to the sliced array, i.e., those with values greater than 0
quantiles = [np.percentile(g_array_f[~(g_array_f == 0)], quantile) for quantile in [1,3,5,10]]
In [7]:
quantiles
Out[7]:
The hypothetical biogeographical classification of the marine environment within EEZs is described as a combination of MEOW (Marine Ecoregions of the World), its visual representation (hereafter called MEOW visual) up to 200 nautical miles, and the world's pelagic provinces. The spatial data was prepared such that, from the coastline outwards, disjoint polygons represent: MEOW (up to 200 m depth, inner/red), MEOW visual overlapping with pelagic provinces (middle/green), and pelagic provinces that do not overlap with MEOW visual (outer/blue). This is purely a spatial aggregation based on the above data and the World Vector Shoreline EEZ. See below for an example.
Load the input_data table, which describes the intersection between the marine pressure layer and the marine ecoregion/pelagic province classification. The input_attr table contains information on the relationship between OBJECTID and each raster pixel value:

- OBJECTID (one) - pixel value (many)
- OBJECTID (many) - attr: Province, Ecoregion, Realm and category (one)

Each pixel is 934.478 m in both height and width, giving an area of 0.873 $km^2$ per pixel.
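The two relations above can be illustrated with toy tables (the values are invented; the real `result.csv` and `attr.csv` follow this shape): many pixel rows per OBJECTID in the data table, one attribute row per OBJECTID in the attribute table, joined by a merge.

```python
import pandas as pd

# toy input_data: one row per pixel, many pixels per OBJECTID
toy_data = pd.DataFrame({
    'OBJECTID': [1, 1, 2, 2, 2],
    'ras_val':  [0.4, 1.2, 0.8, 0.1, 2.3],
})

# toy input_attr: one row per OBJECTID, carrying the classification attributes
toy_attr = pd.DataFrame({
    'OBJECTID': [1, 2],
    'PROVINCE': ['A', 'B'],
    'category': ['meow200m', 'pelagic'],
})

# pixel counts per OBJECTID, then joined to the attributes
counts = toy_data.groupby('OBJECTID').size().rename('n_pixels').reset_index()
merged = pd.merge(toy_attr, counts, on='OBJECTID')
print(merged)
```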
In [8]:
# calculate cell size in sq km
cell_size = 934.478*934.478/1000000
print(cell_size)
In [9]:
# the OBJECTID - ras_val table
input_data = pd.read_csv('result.csv')
In [10]:
# the attribute table containing information about province etc
input_attr = pd.read_csv('attr.csv')
In [11]:
print('\n'.join(['Threshold cut-off value: '+ str(threshold) for threshold in quantiles]))
In [12]:
# total count of pixels per OBJECTID
result_count = input_data.groupby('OBJECTID').count().reset_index()
Here four result tables are created, each containing only the pixels that meet the criterion specified by a given threshold.
In [13]:
# filter result only in the top 1, 3, 5, 10 percentile (of least impacted marine areas)
result_1, result_3, result_5, result_10 = \
[input_data[input_data.ras_val <= threshold].groupby('OBJECTID').count().reset_index() for threshold in quantiles]
The next step is to apply the categorisations in the input_attr table. Replace result_10 with another result table if a different threshold is used.
In [14]:
# join base to the attribute
attr_merge = pd.merge(input_attr, result_count, on = 'OBJECTID')
In [15]:
# join result to the above table
attr_merge_10 = pd.merge(attr_merge, result_10, how = 'left', on ='OBJECTID', suffixes = ('_base', '_result'))
In [16]:
# fill ras_val_result's NaN with 0, and the two province fields with 'None'. This should happen earlier
attr_merge_10['ras_val_result'].fillna(0, inplace=True)
attr_merge_10['PROVINCE'].fillna('None', inplace=True)
attr_merge_10['PROVINCE_P'].fillna('None', inplace=True)
In [17]:
# apply an aggregate function to each sub dataframe
def apply_func(group):
    overlap = group['ras_val_result'].sum()*cell_size  # in sq km
    base = group['ras_val_base'].sum()*cell_size
    per = overlap/base
    # returning a pd.Series allows multiple result columns
    return pd.Series([overlap, per, base], index=['less_than_threshold', 'per_ltt', 'base'])
In [18]:
# final dataframe
result_agg_10 = attr_merge_10.groupby(['PROVINCE', 'PROVINCE_P', 'category']).apply(apply_func).reset_index()
print(result_agg_10.head(20))
Further aggregation could be applied here.
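One caveat when aggregating further (sketched here with a toy stand-in for `result_agg_10`, whose column names follow the real output): `per_ltt` should be recomputed from the summed areas, as the notebook's `apply_func` does, rather than averaged directly, since a plain mean of `per_ltt` would weight small and large units equally.

```python
import pandas as pd

# toy stand-in for result_agg_10 (values invented)
toy = pd.DataFrame({
    'PROVINCE': ['A', 'A', 'B'],
    'less_than_threshold': [10.0, 30.0, 5.0],
    'base': [100.0, 50.0, 200.0],
})

# roll up by summing the areas, then recompute the percentage
rolled = toy.groupby('PROVINCE')[['less_than_threshold', 'base']].sum()
rolled['per_ltt'] = rolled.less_than_threshold / rolled.base
print(rolled)
```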
In [19]:
import seaborn as sns
In [20]:
g = sns.FacetGrid(result_agg_10, col="category")
g.map(plt.hist, 'per_ltt', bins=50, log=True)
Out[20]:
In [21]:
# MEOW province (200m and 200 nautical combined)
result_agg_10_province = attr_merge_10.groupby(['PROVINCE']).apply(apply_func).reset_index()
sns.distplot(result_agg_10_province.per_ltt)
Out[21]:
In [22]:
result_agg_10_province.sort_values('per_ltt', ascending=False).head(20)
Out[22]:
In [23]:
# pelagic province
result_agg_10_pelagic = attr_merge_10.groupby(['PROVINCE_P']).apply(apply_func).reset_index()
sns.distplot(result_agg_10_pelagic.per_ltt)
Out[23]:
In [24]:
result_agg_10_pelagic.sort_values('per_ltt', ascending=False).head(20)
Out[24]:
In [25]:
# MEOW province (200m only)
result_agg_10_m200 = attr_merge_10[attr_merge_10.category == 'meow200m'].groupby(['PROVINCE']).apply(apply_func).reset_index()
sns.distplot(result_agg_10_m200.per_ltt)
Out[25]:
In [26]:
result_agg_10_m200.sort_values('per_ltt', ascending = False).head(20)
Out[26]:
In [52]:
input_data.OBJECTID.unique().size
Out[52]:
In [42]:
# no zeros in the result data
input_data.ras_val.size
Out[42]:
In [43]:
input_data.ras_val.min()
Out[43]:
In [49]:
# proportion of EEZ water relative to the entire ocean
input_data.ras_val.size/g_array_f[~(g_array_f==0)].size
Out[49]:
In [38]:
# all input_data are non-zero (zero indicates land and no_data)
input_data[~(input_data.ras_val == 0)].ras_val.count() == input_data.ras_val.count()
Out[38]:
In [32]:
# get 10% threshold value
np.percentile(input_data.ras_val, 10)
Out[32]:
In [50]:
sns.distplot(g_array_f[~(g_array_f==0)])
Out[50]:
In [51]:
sns.distplot(input_data.ras_val)
Out[51]:
It appears that the high seas and EEZ waters share similar distributions of cumulative marine pressure.