In [1]:
# load libraries
import os, sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
Now import the data in various formats for analysis. First, check the folder for data. Commands starting with ! indicate system commands.
In [2]:
!dir
In [3]:
# CCV analysis results
amp = pd.read_excel('amp.xlsx')
bird = pd.read_excel('bird.xlsx')
coral = pd.read_excel('coral.xlsx')
In [4]:
# check data of different structures
# the intersection result was processed in ArcMap, essentially: wdpaid, id_no, per_overlap, + sp attr and wh attr
# essentially, POS invalid rows removed, duplicates dissolved and percentage overlap calculated
# for more information, consult Evernote notes
result = pd.read_csv('result_final.csv', encoding='latin1')
sis = pd.read_csv('sis.csv', sep='|')
rl = pd.read_csv('rl_attr.csv')
In [5]:
# no duplicates for bird and amp
(amp.Fullname.size == amp.index.size) and (bird.Fullname.size == bird.index.size)
Out[5]:
In [6]:
# coral data has duplicates
print('unique coral names:', coral.Fullname.unique().size)
print('total rows:', coral.index.size)
In [7]:
# deduplicate: keep one row per species name (roughly half the rows)
coral_unique = coral.groupby('Fullname').first().reset_index()
coral_unique.index.size
Out[7]:
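A rough equivalent using drop_duplicates, as a sketch (note groupby().first() differs subtly: it takes the first non-null value per column within each group, rather than the first row):
coral_unique_alt = coral.drop_duplicates('Fullname')
coral_unique_alt.index.size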
In [8]:
result.per.hist(bins=50)
Out[8]:
Note: areakm2_x is the actual overlap between a PA-species pair, while areakm2_y refers to the size of the WH site.
In [9]:
result.head(5)
Out[9]:
In [10]:
result.dtypes
Out[10]:
The climate vulnerability analysis was done using an older version of the Red List (2009 methodology), while the current species range distribution data is the latest 2015-4 version. It is therefore likely that taxonomies have changed. In some cases the changes may be simple, for example renaming a genus or another higher taxon, while others may be extremely convoluted, involving splitting one previously recognised species into multiple ones, or, vice versa, merging previously distinct species. It is very difficult to reconcile such differences.
Here, I explore such mismatches between birds, amphibians and warm water reef-building corals.
In [11]:
print('Total birds in SIS:', (sis['class'] == 'AVES').sum())
# birds SIS
sis_bird = sis[sis['class']=='AVES']
In [12]:
print('unique birds in CCV analysis:', bird.Fullname.unique().size)
print('unique birds in SIS:', sis_bird.friendly_name.unique().size)
The bird numbers differ; the same check applies to the other taxa groups.
In [13]:
sis_amp = sis[sis['class']=='AMPHIBIA']
print('Birds in CCV but not in SIS:', np.setdiff1d(bird.Fullname.unique(), sis_bird.friendly_name.unique()).size)
print('Total birds in CCV:', bird.index.size)
print('Amphibians in CCV but not in SIS:', np.setdiff1d(amp.Fullname.unique(), sis_amp.friendly_name.unique()).size)
print('Total amphibians in CCV:', amp.index.size)
print('Corals in CCV but not in SIS:', np.setdiff1d(coral_unique.Fullname.unique(), sis.friendly_name.unique()).size)
print('Total corals in CCV:', coral_unique.index.size)
Given the relatively low number and proportion of species missing due to taxonomic changes, the results are unlikely to change significantly. For the purpose of deriving a globally consistent picture, we simply exclude these species.
Note that the datatype of the ID column in the result table is different (int vs float), which affects comparisons. This needs to be explicitly addressed, by rounding.
In [14]:
result.ix[176646].id_no
Out[14]:
result.id_no.astype('int64') will not work, as it simply truncates the fractional part. Instead, the round(0) method should be used.
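A tiny illustration of the difference, using toy values rather than the actual data:
s = pd.Series([176646.0000001, 176645.9999999])
print(s.astype('int64'))  # truncates: 176646, 176645
print(s.round(0))         # rounds: 176646.0, 176646.0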
In [15]:
result['id_no_int'] = result.id_no.round(0)
The effect is easily illustrated: the number of unmatched IDs drops from 1581 to 36.
In [16]:
(result.id_no.isin(sis.taxonid)).sum()
Out[16]:
In [17]:
dif_spatial = np.setdiff1d(result.id_no, sis.taxonid)
dif_spatial.size
Out[17]:
In [18]:
dif_spatial = np.setdiff1d(result.id_no_int, sis.taxonid)
dif_spatial.size
Out[18]:
Let's take a look at the difference, i.e., IDs present in the intersection result but not in the SIS.
In [19]:
dif_sis = result[result.id_no.isin(dif_spatial)][['id_no', 'class_name', 'binomial']]
In [20]:
dif_sis['class_name'].unique()
Out[20]:
Now it is clear that the odd missing species do not impact our analysis, which only looks at birds, amphibians and corals. In other words, the IDs of interest in the result table all have corresponding SIS entries. Given that the result table already contains all SIS information, there is no need to join the SIS table. What needs to be addressed now is comparing the result table directly with the climate vulnerability analysis.
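A quick sanity check of this claim, as a sketch (the class_name labels below are assumed spellings; adjust to match the actual values in the table):
classes_of_interest = ['AMPHIBIA', 'AVES', 'ANTHOZOA']  # assumed labels for amphibians, birds, corals
subset = result[result.class_name.isin(classes_of_interest)]
print(subset.id_no_int.isin(sis.taxonid).all())  # expect True if all relevant IDs have SIS entries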
In [21]:
dif_amp_result = np.setdiff1d(amp.Fullname, result.binomial)
dif_amp_sis = np.setdiff1d(amp.Fullname, sis.friendly_name)
dif_amp_result.size, dif_amp_sis.size
Out[21]:
In [22]:
# get amp_result using CCV, SIS and partialoverlap_result
amp_result = pd.merge(amp, sis_amp, left_on = 'Fullname', right_on='friendly_name')
amp_result = pd.merge(amp_result, result, left_on='taxonid', right_on='id_no_int')
In [23]:
amp_result.index
Out[23]:
In [24]:
amp.index
Out[24]:
In [25]:
# what happens if a join is made directly without going through SIS and ID_NO
test_amp = pd.merge(amp, result, left_on='Fullname', right_on='binomial')
In [26]:
test_amp.index
Out[26]:
In [27]:
# the difference
test_dif = np.setdiff1d(test_amp.id_no_int.unique(), amp_result.id_no_int.unique())
test_dif.size
Out[27]:
There might be something peculiar happening, as in theory the numbers should match. The fact that there are 11 mismatching names suggests that some names in SIS still mismatch. See below the missing species using the CCV-SIS join.
In [28]:
test_amp[test_amp.id_no_int.isin(test_dif)]
Out[28]:
As a result, use a direct join between the RL intersection result and the CCV tables.
In [29]:
## remove test variables
del amp_result, test_amp
There are several tasks outstanding: the wdpaid, id_no look-up table is first filtered, and it can be further refined by variables such as IUCN Red List category, CCV traits etc. As a sketch:
def f(input_result_table, cut_off_value):
    new_table = input_result_table[input_result_table.per >= cut_off_value]
    # function to produce outputs based on the filtered result table
    result = process_result(new_table)
    return result
Given the EOO (extent of occurrence) nature of the Red List species distribution polygons, an overlap between a species and a WH site does not necessarily mean that the species is present. This is made worse if only part of the distribution polygon intersects with the WH site; this could be a digitisation error or simply an inaccuracy due to a scale mismatch. Such spurious 'species within WH sites' must therefore be removed. Quantiles of the percentage overlap can be a useful way to look at this.
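To get a feel for the spread before choosing a cut-off, the quantiles of the percentage overlap can be inspected directly, for example:
result.per.quantile([0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99])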
In [30]:
# direct joins between the CCV tables and the intersection result, per taxon
result_amp = pd.merge(amp, result, left_on='Fullname', right_on='binomial')
result_bird = pd.merge(bird, result, left_on='Fullname', right_on='binomial')
result_coral = pd.merge(coral_unique, result, left_on='Fullname', right_on='binomial')
In [31]:
all_per_abc = pd.concat([result_amp.per, result_bird.per, result_coral.per])
all_per_abc.hist(bins=50)
Out[31]:
The large majority of overlaps are fairly large, suggesting WH sites are generally entirely covered by species ranges. This is not surprising, as most species distribution polygons are a few orders of magnitude bigger than WH sites. Yet it is possible that range-restricted and/or threatened species may have much smaller ranges.
In [32]:
all_per_abc.quantile(0.5)
Out[32]:
For illustration purposes, note that the difference between choosing a cut-off value of 10% and 15% is not significant (~2%, or ~2000 fewer PA-species pairs). This should have minimal effects on the result.
In [33]:
(all_per_abc>0.15).sum(), (all_per_abc>0.1).sum(), (all_per_abc>0.05).sum()
Out[33]:
In [34]:
((all_per_abc>0.1).sum() - (all_per_abc>0.15).sum())/all_per_abc.index.size
Out[34]:
However, if we look at amphibians, birds and corals separately, it is easily noticeable that corals have a significant number of low percentage overlaps. This is because marine World Heritage sites tend to be extensive compared to coral distributions, which are coastal. Thus it is important to also examine the absolute overlap for corals.
In [35]:
result_amp.per.hist(bins=50)
Out[35]:
In [36]:
result_bird.per.hist(bins=50)
Out[36]:
In [37]:
result_coral.per.hist(bins=50)
Out[37]:
In [38]:
result_amp.areakm2_x.hist(bins =50, log=True)
Out[38]:
In [39]:
result_bird.areakm2_x.hist(bins =50, log=True)
Out[39]:
In [40]:
result_coral.areakm2_x.hist(bins =50, log=True)
Out[40]:
The histograms show all three taxa have a bimodal distribution, with a significant number of values near the two poles (very large or very small overlaps).
In [41]:
result_coral.sort_values('per', ascending=True)[['en_name', 'binomial', 'areakm2_x', 'per']].head(20)
Out[41]:
In [42]:
result_coral[['en_name', 'binomial', 'areakm2_x', 'per']][result_coral.en_name.isin(['Henderson Island'])].head()
Out[42]:
The threshold value can be reasonably justified by using both the percentage overlap and the actual overlap. Obviously, if the entire WH site is covered by a species range, the species should be counted. If the percentage overlap is very small, it could be either due to an inaccurate boundary, in which case the species should not be counted, or a genuine overlap if the WH site is considerably larger. Therefore, by adding an additional test on the absolute overlap value in $km^2$, omissions due to small percentage overlaps can be reduced.
Let's test the effect of using various empirical values of per and abskm2 to remove artificial overlaps caused by boundary inaccuracies. This represents an optimistic estimate of the number of species in WH sites (reinforced by the EOO nature of the Red List).
In [43]:
def test_params(per=None, abkm2=None):
    # print (filtered, total) row counts for amp, bird and coral
    if per is not None and abkm2 is not None:  # both conditions are applied, joined by OR
        result = [(result_taxa[(result_taxa.per > per) | (result_taxa.areakm2_x > abkm2)].index.size, result_taxa.index.size)
                  for result_taxa in [result_amp, result_bird, result_coral]]
    elif per is not None:  # only per
        result = [(result_taxa[result_taxa.per > per].index.size, result_taxa.index.size)
                  for result_taxa in [result_amp, result_bird, result_coral]]
    elif abkm2 is not None:  # only absolute area
        result = [(result_taxa[result_taxa.areakm2_x > abkm2].index.size, result_taxa.index.size)
                  for result_taxa in [result_amp, result_bird, result_coral]]
    else:  # neither given
        return None
    print(per, abkm2, result)
# # check the number of rows after applying criteria
# for each in zip(['after', 'before'], *test_params(0.15, 1)):
# print(each)
# per, abkm2, (after and before) for amp, bird and coral
print('---only per---')
test_params(0.05)
test_params(0.15)
test_params(0.25)
test_params(0.50)
print('---only abs km2---')
test_params(abkm2=1)
test_params(abkm2=5)
test_params(abkm2=10)
print('--- per or abs km2---')
test_params(0.15, 1)
test_params(0.15, 2)
test_params(0.25, 1)
test_params(0.25, 2)
test_params(0.25, 5)
The result corroborates the distributions: the percentage cut-off is sensitive and changing it has a considerable impact on the number of filtered rows, while a small increase in the absolute area in $km^2$ has little effect. Notably, an increase of per from 15% to 25% makes little difference.
The threshold values chosen here may be a rather insignificant factor compared to the assumption made by the EOO intersection, i.e., even if a species overlaps 100% with a given WH site, it may still be absent. However, this approach remains a popular method for estimating global species conservation and offers a consistent view across comprehensively assessed species, in the absence of true area of occupancy (AOO) data.
I use 15% and 1 $km^2$ as cut-off values.
To be policy relevant at the site level, it is imperative to obtain climate vulnerability information for individual World Heritage sites. This analysis illustrates how many amphibian, bird and coral species are affected by climate change in each World Heritage site, using both the total number of climate vulnerable species and their proportion. Further aggregations can be used to reveal each component within sensitivity, low adaptability and exposure.
In [44]:
# get filtered result
result_amp_f, result_bird_f, result_coral_f = [result_taxa[(result_taxa.per>.15)|(result_taxa.areakm2_x>1)] for \
result_taxa in [result_amp, result_bird, result_coral]]
# get species numbers
print('Filtered', 'Total-intersect', 'Total')
print(result_amp_f.Fullname.unique().size, result_amp.Fullname.unique().size, amp.Fullname.unique().size)
print(result_bird_f.Fullname.unique().size, result_bird.Fullname.unique().size, bird.Fullname.unique().size)
print(result_coral_f.Fullname.unique().size, result_coral.Fullname.unique().size, coral.Fullname.unique().size)
Lambda functions are passed as arguments to agg to compute several aggregations at once.
In [45]:
## agg_dict passed to do multiple aggregation at the same time
## x is a group of FINAL_SCOREs
agg_dict = {'FINAL_SCORE':\
{'H_vul_per': lambda x: (x=='H').sum()/x.size, \
'total_H_vul': lambda x: (x=='H').sum(),\
'total_vul':len}}
## amp
amp_v = result_amp_f.groupby(['wdpaid', 'en_name']).agg(agg_dict).reset_index()
amp_v.columns = amp_v.columns.droplevel()
# birds
bird_v = result_bird_f.groupby(['wdpaid', 'en_name']).agg(agg_dict).reset_index()
bird_v.columns = bird_v.columns.droplevel()
# corals
coral_v = result_coral_f.groupby(['wdpaid', 'en_name']).agg(agg_dict).reset_index()
coral_v.columns = coral_v.columns.droplevel()
In [46]:
bird_v.head()
Out[46]:
It is necessary to add back the missing labels for the two grouping columns.
In [47]:
# rename column names. One cannot do: amp_v.columns[1] = newcolumn_name
new_columns = amp_v.columns.values
new_columns[:2] = ['wdpaid', 'en_name']
amp_v.columns = new_columns
bird_v.columns = new_columns
coral_v.columns = new_columns
Graphs to show both the percentage (the number of species with FINAL_SCORE == 'H' against all species) and the total number of highly vulnerable species.
In [48]:
# percentage of H distribution
amp_v.H_vul_per.hist(bins=50)
Out[48]:
In [49]:
# total number of H distribution
amp_v.total_H_vul.hist(bins=50)
Out[49]:
In [50]:
# percentage of H distribution
bird_v.H_vul_per.hist(bins=50)
Out[50]:
In [51]:
# total number of H distribution
bird_v.total_H_vul.hist(bins=50)
Out[51]:
In [52]:
# percentage of H distribution
coral_v.H_vul_per.hist(bins=50)
Out[52]:
In [53]:
# total number of H distribution
coral_v.total_H_vul.hist(bins=50)
Out[53]:
In [54]:
print('Top WHs with high number of CV birds\n', bird_v.sort_values('total_H_vul', ascending=False).head(10))
print('==================')
print('Top WHs with high percentage of CV birds\n', bird_v.sort_values('H_vul_per', ascending=False).head(10))
In [55]:
print('Top WHs with high number of CV amphibians\n', amp_v.sort_values('total_H_vul', ascending=False).head(10))
print('==================')
print('Top WHs with high percentage of CV amphibian\n', amp_v.sort_values('H_vul_per', ascending=False).head(10))
In [56]:
amp_v.quantile(0.9)
Out[56]:
In [57]:
amp_v.quantile(0.1)
Out[57]:
In [58]:
bird_v.quantile(0.9)
Out[58]:
In [59]:
bird_v.quantile(0.1)
Out[59]:
For communication purposes, it may be convenient to have a single indicator/index of whether a given WH site is considered climate vulnerable for a taxon (in this case amphibians). The question then becomes how such a cut-off value can be reasonably calculated and meaningfully applied. One way is to, again, use quantile cut-offs for the total number of highly vulnerable species and their proportion. However, any ranking created using quantiles only compares WH sites among themselves; ideally this should be derived using a global metric. This is an important caveat. One also needs to look at the other taxa before making a decision, as they should be consistent.
Questions: top 10, top 50, top 200?
WH sites with a high proportion seem to have a low total number of CV amphibians, so using proportion alone may not paint a full picture. It is conceivable that a combination of both number and proportion is needed to classify/rank WH sites, as in the sketch below.
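A minimal sketch of such a combined flag, using quantile-based cut-offs (hypothetical thresholds; the notebook derives them properly at the very end):
total_cut = amp_v.total_H_vul.quantile(0.75)
per_cut = amp_v.H_vul_per.quantile(0.75)
amp_v_flag = (amp_v.total_H_vul > total_cut) & (amp_v.H_vul_per > per_cut)
amp_v_flag.sum()  # number of sites flagged as climate vulnerable for amphibians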
While the above methods work well, producing similar stats for other columns would need considerable manual coding. In this study, depending on the taxon of interest, there may be as many as 20 repetitions (a total of ~60 if all three taxa are considered). Therefore it is necessary to find a way in which such aggregations can be calculated simply and more scalably.
In [60]:
# get all fields and pick those we need for the stats
print(result_amp_f.columns)
print(result_bird_f.columns)
print(result_coral_f.columns)
In [61]:
# Selected columns, based on the df.columns for each taxon
## amphibian
a_columns = ['SUSC_A_Habitats', 'SUSC_A_aquatic larvae',
'SUSC_B_Temperature Range', 'SUSC_B_Precipitation Range',
'SUSC_C_explosive breeder', 'SUSC_D_disease', 'SENSITIVITY',
'ADAPT_A_barriers', 'ADAPT_A_dispersal_distance',
'ADAPT_C_Slow_Gen_Turnpover', 'LOW_ADAPTABILITY',
'EXP_Sea Level',
'EXP_MeanTemperature', 'EXP_MeanRainfall', 'EXP_AADTemperature',
'EXP_AADRainfall', 'EXPOSURE', 'FINAL_SCORE']
## birds
b_columns = ['FINAL_SCORE', '__hab_specialisation',
'__microhabitat', '__ForestDependence', '__TemperatureRange',
'__PrecipRange', '__Species dependence', '__PopnSize',
'__EffectivePopnSize', 'SENSITIVITY', '__Dispersal distance limited',
'__Dispersal barriers', '__Low genetic diversity', '__clutch size',
'__Gen_Length', 'LOW_ADAPTABILITY', '__EXP_Sea Inundation',
'__EXP_MeanTemperature', '__EXP_AADTemperature', '__EXP_MeanPrecip',
'__EXP_AADPrecip', 'EXPOSURE']
## corals
c_columns = ['FINAL_SCORE', '_Few habitats', '_narrow depth rng',
'_Larval vulnerabilityH', '_ShallowH', '_PastMortalityH', '_zooxH',
'_Rare', 'SENSITIVITY', '_Short dispersal distance',
'_dispersal barriers', '_ColonyMaxAge', '_SlowGrowthRate',
'LOW_ADAPTABILITY', '_ProportionArag<3', '_MeanDHM', 'EXPOSURE']
In [62]:
# for each column, count the occurrences of H, L and U, plus the percentage of H
def get_hlu(x):
return pd.Series({'H': (x=='H').sum(), 'L': (x=='L').sum(), 'U': (x=='U').sum(), 'per_H': (x=='H').sum()/x.size})
# apply get_hlu to each df
def apply_func(df, select_columns):
# for each column of the df apply a function
return df[select_columns].apply(get_hlu).T
# ======= GENERIC VERSION ======
# get unique value of the series and count how many occurences of each unique value
def generic_count(x):
# x is a pd.Series object
unique_values = x.unique()
return pd.Series({value: (x==value).sum() for value in unique_values})
def apply_func_mk2(df, select_columns):
    return df[select_columns].apply(generic_count).T
# ======= about 'apply' =======
## dfgroupby.apply: takes a df as argument(for each sub df defined by groupby criteria)
## dataframe.apply: takes a series as argument (for each of the columns in the df)
## series.apply: takes a value as argument (for each value of the series)
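A toy illustration of the three apply flavours described above (df_toy is a hypothetical frame, not part of the analysis):
df_toy = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 2, 3]})
print(df_toy.groupby('g').apply(lambda sub: sub.v.sum()))  # each sub-DataFrame defined by 'g'
print(df_toy.apply(lambda col: col.iloc[0]))               # each column as a Series
print(df_toy.v.apply(lambda value: value * 10))            # each scalar value of the Series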
In [63]:
## split into df for each unique wdpaid-en_name combinations
## for each df apply the logic
amp_vv = result_amp_f.groupby(['wdpaid', 'en_name']).apply(apply_func, a_columns).reset_index()
bird_vv = result_bird_f.groupby(['wdpaid', 'en_name']).apply(apply_func, b_columns).reset_index()
coral_vv = result_coral_f.groupby(['wdpaid', 'en_name']).apply(apply_func, c_columns).reset_index()
# amp_vvv = groups.apply(apply_func_mk2).reset_index()
In [64]:
## spot checks
bird_vv[bird_vv.wdpaid == 191]
Out[64]:
In [65]:
bird_vv[bird_vv.wdpaid == 2008]
Out[65]:
In [66]:
result_amp_f[result_amp_f.wdpaid == 191]
Out[66]:
In [67]:
amp_vv[amp_vv.wdpaid == 191]
coral_vv[coral_vv.wdpaid == 191]
Out[67]:
Having more CCV species, in both number and proportion, indicates WH sites whose species are more susceptible to adverse impacts of climate change.
Questions and statements that can be made?
In [68]:
# dump for database
amp_vv.to_csv('agg_amp.csv')
bird_vv.to_csv('agg_bird.csv')
coral_vv.to_csv('agg_coral.csv')
result_amp_f.to_csv('wh_amp.csv')
result_bird_f.to_csv('wh_bird.csv')
result_coral_f.to_csv('wh_coral.csv')
This section looks at individual taxa within the entire network. A comparison is then made to examine whether or not the WH network houses species that are more vulnerable to climate change. Because the result contains many duplicate species rows (different sites hosting the same species), the first step is to get a unique list of species across all WH sites.
In [69]:
# combine species in WH result and all results, by taxa
## get a unique set of species within each taxon
amp_selected = result_amp_f.groupby('id_no_int').first().reset_index()
bird_selected = result_bird_f.groupby('id_no_int').first().reset_index()
coral_selected = result_coral_f.groupby('id_no_int').first().reset_index()
## function to aggregate result and apply hlu function to all columns
def get_agg_result(unique_taxon_result, taxon_columns):
return unique_taxon_result[taxon_columns].apply(get_hlu).T
amp_total = pd.merge(get_agg_result(amp_selected, a_columns), \
get_agg_result(amp, a_columns),
left_index = True, right_index=True, suffixes=('_wh', '_all'))
bird_total = pd.merge(get_agg_result(bird_selected, b_columns), \
get_agg_result(bird, b_columns),
left_index = True, right_index=True, suffixes=('_wh', '_all'))
coral_total = pd.merge(get_agg_result(coral_selected, c_columns), \
get_agg_result(coral, c_columns),
left_index = True, right_index=True, suffixes=('_wh', '_all'))
In [70]:
amp_total
Out[70]:
In [71]:
a = amp_total[['H_wh', 'L_wh', 'U_wh']].sum(1)/amp_total[['H_all', 'L_all', 'U_all']].sum(1)
b = pd.concat([amp_total.H_wh/amp_total.H_all, a], 1)
b.columns = ['H_per', 'total_per']
b
# (amp_total.H_wh + amp_total.L_wh + amp_total.U_wh)/(amp_total.H_all + amp_total.L_all + amp_total.U_all)
Out[71]:
It appears that a smaller proportion of amphibians in WH sites are climate vulnerable.
In [72]:
bird_total
Out[72]:
In [73]:
coral_total
Out[73]:
Amphibians and birds in WH sites have a consistently lower proportion of CCV species compared to all amphibians (14.9% vs 22%) and all birds (19.3% vs 23.5%). Lower proportions are also observed for sensitivity, low adaptability and exposure.
Question: what does it mean? what statements can be realistically made?
Corals show a consistent trend in proportion, both for overall vulnerability and for sensitivity, low adaptability and exposure (SAE).
Test statistics for significance
per region (IUCN/UNESCO regions?) / biodiversity sites
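As a first pass at the significance question flagged above, a chi-squared test of independence could be sketched as follows (assuming scipy is available; the 2x2 table splits species into WH vs non-WH and highly vulnerable vs not, with counts taken from amp_total):
from scipy.stats import chi2_contingency
row = amp_total.loc['FINAL_SCORE']
table = [[row.H_wh, row.L_wh + row.U_wh],
         [row.H_all - row.H_wh, (row.L_all + row.U_all) - (row.L_wh + row.U_wh)]]
chi2, p, dof, expected = chi2_contingency(table)
print(p)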
In [74]:
amp_total
Out[74]:
Is it possible that the differences observed simply reflect the fact that the WH network happens to draw a sample from the pool of all species, and that the difference is due to random sampling and nothing else? In other words, the difference is not because WH sites capture species with specific traits, but random sampling.
In [75]:
print('amp:', amp.index.size, 'in WH', amp_selected.index.size)
print('bird:', bird.index.size, 'in WH', bird_selected.index.size)
print('coral:', coral_unique.index.size, 'in WH', coral_selected.index.size)
In [76]:
amp_selected.head()
Out[76]:
If I draw repeated random samples from the pool, calculate the H_per value for each and then plot them...
In [151]:
# get a random sample of 2013 repeatedly and calculate H_per
# ===== explanation
# ## get a random sample without replacement
# a = amp[amp.SVP_ID.isin(np.random.choice(amp.SVP_ID, 2013, replace=False))]
# ## apply HLU and get the final score
# a[a_columns].apply(get_hlu).T.ix['FINAL_SCORE'].per_H
test_a = [amp[amp.SVP_ID.isin(np.random.choice(amp.SVP_ID, 2013, replace=True))][a_columns].apply(get_hlu).T.ix['FINAL_SCORE'].per_H\
for i in range(1000)]
In [152]:
sns.distplot(test_a)
Out[152]:
In [153]:
amp_total.ix['FINAL_SCORE'].per_H_wh
Out[153]:
It is quite clear the WH result is an outlier, i.e., significantly different from the empirical distribution.
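To quantify this, an empirical one-sided p-value can be sketched from the simulated values (assuming, as the plot suggests, that the WH proportion sits below the null distribution):
observed = amp_total.ix['FINAL_SCORE'].per_H_wh
p_emp = (np.array(test_a) <= observed).mean()
print(observed, p_emp)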
Try birds and corals
In [154]:
test_b = [bird[bird.SVP_ID.isin(np.random.choice(bird.SVP_ID, 6914, replace=True))][b_columns].apply(get_hlu).T.ix['FINAL_SCORE'].per_H\
for i in range(1000)]
In [155]:
sns.distplot(test_b)
Out[155]:
In [156]:
bird_total.ix['FINAL_SCORE'].per_H_wh
Out[156]:
A significant result for birds too.
In [157]:
test_c = [coral_unique[coral_unique.Fullname.isin(np.random.choice(coral_unique.Fullname, 724, replace=True))][c_columns].apply(get_hlu).T.ix['FINAL_SCORE'].per_H\
for i in range(1000)]
In [158]:
sns.distplot(test_c)
Out[158]:
In [159]:
coral_total.ix['FINAL_SCORE'].per_H_wh
Out[159]:
In [235]:
# a function to plot all columns using stacked horizontal barchart
def plot_comparison(wh_species, all_species, columns):
my_colors = ['#b2182b', '#ffffff', '#e0e0e0']
col = len(columns)
fig, axes = plt.subplots(col, figsize=(15, 2*col)) # width and height
fig.subplots_adjust(hspace = 0.5)
# get agg counts of HLU
a = get_agg_result(wh_species, columns)
b = get_agg_result(all_species, columns)
for i, column in enumerate(columns):
data = pd.concat([a[['H', 'L', 'U']].loc[column],
b[['H', 'L', 'U']].loc[column]], 1).T
data.index = ['Species in WH sites', 'All species']
c = data.plot.barh(ax=axes[i], stacked=True, color=my_colors)
axes[i].set_xlim([0, all_species.index.size*1.2])
axes[i].set_title(column)
# annotation
for bar, y_axis in zip([a,b], range(2)):
c.annotate('{0:.0f}'.format(bar.loc[column].H) + ', ' + '{0:.0f}'.format(bar.loc[column].H + bar.loc[column].L +bar.loc[column].U) + \
', ' + '{0:.1%}'.format(bar.loc[column].per_H),
(bar.loc[column].sum() + 10, y_axis))
return fig
# a function to plot all columns using stacked horizontal barchart
def plot_comparison_sle(wh_species, all_species, columns = ['FINAL_SCORE', 'SENSITIVITY', 'LOW_ADAPTABILITY', 'EXPOSURE']):
my_colors = ['#b2182b', '#ffffff', '#e0e0e0']
col = len(columns)
fig, axes = plt.subplots(col, figsize=(15, 2*col)) # width and height
fig.subplots_adjust(hspace = 0.5)
# get agg counts of HLU
a = get_agg_result(wh_species, columns)
b = get_agg_result(all_species, columns)
for i, column in enumerate(columns):
data = pd.concat([a[['H', 'L', 'U']].loc[column],
b[['H', 'L', 'U']].loc[column]], 1).T
data.index = ['Species in WH sites', 'All species']
c = data.plot.barh(ax=axes[i], stacked=True, color=my_colors)
axes[i].set_xlim([0, all_species.index.size*1.2])
axes[i].set_title(column)
# annotation
for bar, y_axis in zip([a,b], range(2)):
c.annotate('{0:.0f}'.format(bar.loc[column].H) + ', ' + '{0:.0f}'.format(bar.loc[column].H + bar.loc[column].L +bar.loc[column].U) + \
', ' + '{0:.1%}'.format(bar.loc[column].per_H),
(bar.loc[column].sum() + 10, y_axis))
return fig
Detailed comparisons across traits and exposures for amphibians
In [260]:
bird[['FINAL_SCORE', 'SENSITIVITY', 'LOW_ADAPTABILITY', 'EXPOSURE']]
Out[260]:
In [253]:
# amp
plot_comparison(amp_selected, amp, a_columns);
In [268]:
a = plot_comparison_sle(amp_selected, amp);
a.savefig('wh_network_amp.png', dpi=100, bbox_inches='tight')
Detailed comparisons across traits and exposures for birds
In [238]:
# bird
plot_comparison(bird_selected, bird, b_columns);
In [267]:
a = plot_comparison_sle(bird_selected, bird);
a.savefig('wh_network_bird.png', dpi=100, bbox_inches='tight')
[!Insignificant findings] Detailed comparisons across traits and exposures for corals
In [240]:
# coral
plot_comparison(coral_selected, coral_unique, c_columns);
In [91]:
type(amp_selected.crit)
Out[91]:
In [168]:
# try testing regular expression and `pd.series.str.match`
pattern = r'.*x.*' # crit having character `x` indicates a biodiversity WH site
test_matching = pd.concat([amp_selected.crit, amp_selected.crit.str.match(pattern)], 1)
# get a filter for biodiversity sites
amp_wh_bio_filter = amp_selected.crit.str.match(pattern)
bird_wh_bio_filter = bird_selected.crit.str.match(pattern)
coral_wh_bio_filter = coral_selected.crit.str.match(pattern)
# reduction in the number of WH sites
print('all WH', 'only bWH', 'non bWH')
print(amp_selected.wdpaid.unique().size, amp_selected[amp_wh_bio_filter].wdpaid.unique().size, amp_selected[~amp_wh_bio_filter].wdpaid.unique().size,)
print(bird_selected.wdpaid.unique().size, bird_selected[bird_wh_bio_filter].wdpaid.unique().size, bird_selected[~bird_wh_bio_filter].wdpaid.unique().size)
print(coral_selected.wdpaid.unique().size, coral_selected[coral_wh_bio_filter].wdpaid.unique().size)
In [163]:
# compare the biodiversity subset with the rest of the WH sites
amp_wh_bio = pd.merge(amp_selected[amp_wh_bio_filter][a_columns].apply(get_hlu).T, \
amp_selected[~amp_wh_bio_filter][a_columns].apply(get_hlu).T, \
left_index = True, right_index=True, suffixes=('_bio', '_wh'))
bird_wh_bio = pd.merge(bird_selected[bird_wh_bio_filter][b_columns].apply(get_hlu).T, \
bird_selected[~bird_wh_bio_filter][b_columns].apply(get_hlu).T, \
left_index = True, right_index=True, suffixes=('_bio', '_wh'))
coral_wh_bio = pd.merge(coral_selected[coral_wh_bio_filter][c_columns].apply(get_hlu).T, \
coral_selected[c_columns].apply(get_hlu).T, \
left_index = True, right_index=True, suffixes=('_bio', '_wh'))
In [241]:
def plot_comparison_sle_bio(wh_species, all_species, columns = ['FINAL_SCORE', 'SENSITIVITY', 'LOW_ADAPTABILITY', 'EXPOSURE']):
"""<non-bio>, <bio WH sites>"""
my_colors = ['#b2182b', '#ffffff', '#e0e0e0']
col = len(columns)
fig, axes = plt.subplots(col, figsize=(15, 2*col)) # width and height
fig.subplots_adjust(hspace = 0.5)
# get agg counts of HLU
a = get_agg_result(wh_species, columns)
b = get_agg_result(all_species, columns)
for i, column in enumerate(columns):
data = pd.concat([a[['H', 'L', 'U']].loc[column],
b[['H', 'L', 'U']].loc[column]], 1).T
data.index = ['Other natural WH sites', 'Biodiversity WH sites']
c = data.plot.barh(ax=axes[i], stacked=True, color=my_colors)
axes[i].set_xlim([0, all_species.index.size*1.2])
axes[i].set_title(column)
# annotation
for bar, y_axis in zip([a,b], range(2)):
c.annotate('{0:.0f}'.format(bar.loc[column].H) + ', ' + '{0:.0f}'.format(bar.loc[column].H + bar.loc[column].L +bar.loc[column].U) + \
', ' + '{0:.1%}'.format(bar.loc[column].per_H),
(bar.loc[column].sum() + 10, y_axis))
return fig
In [266]:
a = plot_comparison_sle_bio(amp_selected[~amp_wh_bio_filter], amp_selected[amp_wh_bio_filter]);
a.savefig('network_bio_amp.png', dpi=100, bbox_inches='tight')
In [243]:
amp_wh_bio
Out[243]:
In [264]:
a = plot_comparison_sle_bio(bird_selected[~bird_wh_bio_filter], bird_selected[bird_wh_bio_filter]);
a.savefig('network_bio_bird.png', dpi=100, bbox_inches='tight')
In [245]:
bird_wh_bio
Out[245]:
It appears that separating biodiversity sites from the other natural sites does not paint a different picture; the percentage differences are minute.
In [97]:
# load regions
region = pd.read_csv('region.csv')
In [98]:
region.unesco_reg.unique()
Out[98]:
In [99]:
# convenient variables
regions_list = ['_africa', '_asia', '_euna', '_lac', '_arab']
regions_name_list = ['Africa', 'Asia and the Pacific', 'Europe and North America', 'Latin America and the Caribbean', 'Arab States']
In [100]:
# africa_ids = region[region.unesco_reg == 'Africa'].wdpaid
# asia_ids = region[region.unesco_reg == 'Asia and the Pacific'].wdpaid
# euna_ids = region[region.unesco_reg == 'Europe and North America'].wdpaid
# lac_ids = region[region.unesco_reg == 'Latin America and the Caribbean'].wdpaid
# arab_ids = region[region.unesco_reg == 'Arab States'].wdpaid
# more concise version using list comprehension
africa_ids, asia_ids, euna_ids, lac_ids, arab_ids = [region[region.unesco_reg == region_name].wdpaid for region_name in \
regions_name_list]
In [101]:
print([len(x) for x in [africa_ids, asia_ids, euna_ids, lac_ids, arab_ids]])
In [102]:
# get a function that returns the result of aggregates
def get_agg_result_by_wdpaidlist(taxon_result, taxon_columns, filtered_wdpaids):
# get a unique list of species for sites, based on the filtered wdpaids
unique_taxon_result = taxon_result[taxon_result.wdpaid.isin(filtered_wdpaids)].groupby('id_no_int').first().reset_index()
# apply the standard aggregation methods
return unique_taxon_result[taxon_columns].apply(get_hlu).T
In [103]:
# amp in wh sites by region
amp_africa, amp_asia, amp_euna, amp_lac, amp_arab = \
[get_agg_result_by_wdpaidlist(result_amp_f, a_columns, a_region) \
for a_region in [africa_ids, asia_ids, euna_ids, lac_ids, arab_ids]]
amp_regions = [amp_africa, amp_asia, amp_euna, amp_lac, amp_arab]
# bird in wh sites by region
bird_africa, bird_asia, bird_euna, bird_lac, bird_arab = \
[get_agg_result_by_wdpaidlist(result_bird_f, b_columns, a_region) \
for a_region in [africa_ids, asia_ids, euna_ids, lac_ids, arab_ids]]
bird_regions = [bird_africa, bird_asia, bird_euna, bird_lac, bird_arab]
# coral in wh sites by region
coral_africa, coral_asia, coral_euna, coral_lac, coral_arab = \
[get_agg_result_by_wdpaidlist(result_coral_f, c_columns, a_region) \
for a_region in [africa_ids, asia_ids, euna_ids, lac_ids, arab_ids]]
coral_regions = [coral_africa, coral_asia, coral_euna, coral_lac, coral_arab]
In [104]:
# join all the regional tables together
from functools import reduce
# use the reduce function to merge any given number of dfs; a tuple is used here to carry suffix information
# the first element is the df, the second a dummy value used in the reduce process
amp_region_final = reduce(lambda left, right: (pd.merge(left[0], right[0], left_index=True, right_index=True, \
suffixes = ('', right[1])), 0), \
zip(amp_regions, regions_list),
(amp_selected[a_columns].apply(get_hlu).T, 0))[0]
bird_region_final = reduce(lambda left, right: (pd.merge(left[0], right[0], left_index=True, right_index=True, \
suffixes = ('', right[1])), 0), \
zip(bird_regions, regions_list),
(bird_selected[b_columns].apply(get_hlu).T, 0))[0]
coral_region_final = reduce(lambda left, right: (pd.merge(left[0], right[0], left_index=True, right_index=True, \
suffixes = ('', right[1])), 0), \
zip(coral_regions, regions_list),
(coral_selected[c_columns].apply(get_hlu).T, 0))[0]
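For reference, an arguably more readable equivalent of the reduce-based merge, as a sketch using a plain loop and DataFrame.join (rsuffix labels the overlapping regional columns):
amp_region_final_alt = amp_selected[a_columns].apply(get_hlu).T
for region_df, suffix in zip(amp_regions, regions_list):
    amp_region_final_alt = amp_region_final_alt.join(region_df, rsuffix=suffix)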
In [105]:
amp_region_final[['per_H' + each for each in ['', '_africa', '_asia', '_euna', '_lac', '_arab']]]
Out[105]:
In [106]:
amp_region_final.loc['FINAL_SCORE']
Out[106]:
In [107]:
bird_region_final.loc['FINAL_SCORE']
Out[107]:
In [108]:
coral_region_final.loc['FINAL_SCORE']
Out[108]:
It is very difficult to spot patterns in a table full of numbers like the one above. Graphs may present a much better view of the underlying differences.
In [109]:
# amp
# a = pd.concat([each[['H', 'L', 'U']][-1:] for each in amp_regions])
## a better way to slice columns and rows; loc is label-based indexing
## .loc returns a series, concatenate on axis 1, transpose and then replace index, so that they'll show up as labels
# get df for all regions, with HLU
a = pd.concat([each[['H', 'L', 'U']].loc['FINAL_SCORE'] for each in amp_regions], 1).T
a.index = regions_name_list
b = a.plot.barh(stacked=True, figsize=(15,5))
# for each bar, annotate num of 'H' and percentage of 'H'
for region_name, region_data, y_axis in zip(regions_name_list, amp_regions, range(5)):
b.annotate('{0:.0f}'.format(a.loc[region_name].H) + ', ' + '{0:.0f}'.format(a.loc[region_name].sum()) + ', ' + '{0:.2%}'.format(region_data.loc['FINAL_SCORE'].per_H),
(a.loc[region_name].sum() + 10, y_axis)) # add a buf-space for x
In [110]:
## find amphibians species in arab states WH sites
pd.set_option('display.max_columns', 60)
example = result_amp_f[result_amp_f.wdpaid.isin(arab_ids)]
example[example.FINAL_SCORE=='H']
Out[110]:
The Latin America and the Caribbean region contains the highest number of climate vulnerable amphibians (178 out of a total of 723, 24.62%), compared to about 10% for WH sites in other regions. Notably, at the other end of the spectrum, WH sites in the Arab States region host very few amphibians and only one of them is climate vulnerable (Discoglossus pictus in Ichkeul National Park, based on the IUCN Red List data).
In [111]:
# bird
a = pd.concat([each[['H', 'L', 'U']].loc['FINAL_SCORE'] for each in bird_regions], 1).T
a.index = regions_name_list
b = a.plot.barh(stacked=True, figsize=(15,5))
# # testing annotation, note y axis is from bottom up. unit based on data
# for each bar, annotate num of 'H' and percentage of 'H'
for region_name, region_data, y_axis in zip(regions_name_list, bird_regions, range(5)):
b.annotate('{0:.0f}'.format(a.loc[region_name].H) + ', ' + '{0:.2%}'.format(region_data.loc['FINAL_SCORE'].per_H),
(a.loc[region_name].sum() + 10, y_axis)) # add a buf-space for x
Similarly, from the point of view of climate vulnerable birds in the global WH network, the Latin America and the Caribbean region has the largest number of birds, a quarter of which are climate vulnerable, more than in other regions.
In [112]:
# coral
a = pd.concat([each[['H', 'L', 'U']].loc['FINAL_SCORE'] for each in coral_regions], 1).T
a.index = regions_name_list
b = a.plot.barh(stacked=True, figsize=(15,5))
# for each bar, annotate num of 'H' and percentage of 'H'
for region_name, region_data, y_axis in zip(regions_name_list, coral_regions, range(5)):
b.annotate('{0:.0f}'.format(a.loc[region_name].H) + ', ' + '{0:.2%}'.format(region_data.loc['FINAL_SCORE'].per_H),
(a.loc[region_name].sum() + 10, y_axis)) # add a buf-space for x
The above regional comparisons across amphibians, birds and corals indicate a large variation in the number and proportion of climate vulnerable species across the World Heritage network.
In [113]:
# enumerate to get index number, which is needed in the plot
for i, column in enumerate(a_columns):
print(i, column)
In [246]:
# a function to plot all columns using stacked horizontal barchart
def plot_all_attributes(taxon_regions, region_namelist, columns):
my_colors = ['#b2182b', '#ffffff', '#e0e0e0']
# create a figure with correct number of axes
col = len(columns)
fig, axes = plt.subplots(nrows = col, figsize=(15, 5*col)) # width and height
fig.subplots_adjust(hspace = 0.3)
# for each exes, plot a column result for all regions, for comparisons
for i, column in enumerate(columns):
# a df of all regions HLUs
a = pd.concat([each[['H', 'L', 'U']].loc[column] for each in taxon_regions], 1).T
a.index = region_namelist
# so it also works for col=1. If col=1, axes is not an array and thus does not support indexing
if col>1:
b = a.plot.barh(ax=axes[i], stacked=True, color=my_colors)
axes[i].set_title(column)
else:
b = a.plot.barh(ax=axes, stacked=True, color=my_colors)
axes.set_title(column)
# add num of H, total num and percentage as annotation
for region_name, region_data, y_axis in zip(region_namelist, taxon_regions, range(len(region_namelist))):
b.annotate('{0:.0f}'.format(a.loc[region_name].H) + ', ' + '{0:.0f}'.format(a.loc[region_name].H + a.loc[region_name].L + a.loc[region_name].U) + \
', ' + '{0:.2%}'.format(region_data.loc[column].per_H),
(a.loc[region_name].sum() + 10, y_axis))
return fig
In [263]:
# need to add `;` to suppress the automatic behaviour that plots the same graph twice
a = plot_all_attributes(amp_regions, regions_name_list, ['FINAL_SCORE', 'SENSITIVITY', 'LOW_ADAPTABILITY', 'EXPOSURE']);
a.savefig('region_amp.png', dpi=100, bbox_inches='tight')
Detailed regional comparisons across traits and exposures for amphibians
In [248]:
# need to add `;` to suppress the automatic behaviour that plots the same graph twice
plot_all_attributes(amp_regions, regions_name_list, a_columns);
In [249]:
# test for single column and it also works now
plot_all_attributes(amp_regions, regions_name_list, ['FINAL_SCORE']);
In [269]:
a = plot_all_attributes(bird_regions, regions_name_list, ['FINAL_SCORE', 'SENSITIVITY', 'LOW_ADAPTABILITY', 'EXPOSURE']);
a.savefig('region_bird.png', dpi=100, bbox_inches='tight')
Detailed regional comparisons across traits and exposures for birds
In [251]:
plot_all_attributes(bird_regions, regions_name_list, b_columns);
In [252]:
plot_all_attributes(coral_regions, regions_name_list, c_columns);
Sensitivity seems rather odd: do all coral species across all regions have high sensitivity?
In [119]:
(coral_unique.SENSITIVITY == 'H').sum(), coral_unique.index.size
Out[119]:
In [120]:
# threshold value according to the number of H species
thres_total = [np.percentile(each[each.level_2 == 'FINAL_SCORE'].H, 75) for each in [amp_vv, bird_vv, coral_vv]]
In [121]:
# threshold value according to the percentage of H in relation to total HLU
thres_per = [np.percentile(each[each.level_2 == 'FINAL_SCORE'].per_H, 75) for each in [amp_vv, bird_vv, coral_vv]]
In [122]:
thres_per, thres_total
Out[122]:
In [123]:
# the number of sites as 'highly vulnerable' for amp
((amp_vv[amp_vv.level_2=='FINAL_SCORE'].H > thres_total[0]) & (amp_vv[amp_vv.level_2=='FINAL_SCORE'].per_H > thres_per[0])).sum()
Out[123]:
In [124]:
(amp_vv[amp_vv.level_2=='FINAL_SCORE'].H > thres_total[0]).sum(), (amp_vv[amp_vv.level_2=='FINAL_SCORE'].per_H > thres_per[0]).sum()
Out[124]:
In [125]:
# amp, bird, coral number of cv sites
[((each[each.level_2=='FINAL_SCORE'].H > thres_total[i]) & (each[each.level_2=='FINAL_SCORE'].per_H > thres_per[i])).sum() for i, each in enumerate([amp_vv, bird_vv, coral_vv])]
Out[125]:
In [ ]: