This notebook presents further evaluations of the criticality results for Bangladesh's multimodal freight transport network. It examines the (dis)similarities between the metrics: which pairs of criticality metrics overlap (highlight the same set of links in the network as critical) and which are complementary (highlight different sets of links as critical). The Kolmogorov-Smirnov distance and correlation coefficients are used for this purpose.
This notebook contains five analyses.
In [1]:
%matplotlib inline
from __future__ import division

import os
import sys

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.colors as colors
import seaborn as sns

#make the project root importable
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

#modules developed for this project
from transport_network_modeling import network_visualization as net_v
from transport_network_modeling import criticality as crit
In [2]:
#import criticality results
crit_df_loc = r'./criticality_results/result_interdiction_1107noz2_v03.csv'
crit_df = pd.read_csv(crit_df_loc)
In [3]:
#remove one strange value in m2_02
crit_df['m2_02'] = crit_df.m2_02.apply(lambda val: 0 if val < 1.39e-10 else val)
In [4]:
#record the metric names
metric_names = {'m01_01' : ['Change in unweighted daily accessibility',
'Topological', 'Accessibility', 'Network-wide'],
'm01_02' : ['Change in number of nodes accessible within daily reach',
'Topological', 'Accessibility', 'Network-wide'],
'm02_01' : ['Change in unweighted total travel cost',
'Topological', 'Total travel cost', 'Network-wide'],
'm02_02' : ['Change in network average efficiency',
'Topological', 'Total travel cost', 'Network-wide'],
'm03_01' : ['Unweighted link betweenness centrality',
'Topological', 'Total travel cost', 'Localized'],
'm03_02' : ['Change in region-based unweighted total travel cost',
'Topological', 'Total travel cost', 'Localized'],
'm04_01' : ['Minimum link cut centrality',
'Topological', 'Connectivity', 'Network-wide'],
'm04_02' : ['OD k-connectivity',
'Topological', 'Connectivity', 'Network-wide'],
'm05_01' : ['Nearby alternative links (simplified)',
'Topological', 'Connectivity', 'Localized'],
'm06_01' : ['Change in weighted accessibility',
'System-based', 'Accessibility', 'Network-wide'],
'm07_01' : ['Change in weighted total travel cost',
'System-based', 'Total travel cost', 'Network-wide'],
'm07_02' : ['Change in expected user exposure',
'System-based', 'Total travel cost', 'Network-wide'],
'm07_03' : ['Change in worst-case user exposure',
'System-based', 'Total travel cost', 'Network-wide'],
'm08_01' : ['Traffic flow data',
'System-based', 'Total travel cost', 'Localized'],
'm08_02' : ['Weighted link betweenness centrality',
'System-based', 'Total travel cost', 'Localized'],
'm08_03' : ['Volume over capacity',
'System-based', 'Total travel cost', 'Localized'],
'm09_01' : ['Unsatisfied demand',
'System-based', 'Connectivity', 'Network-wide'],
'm10' : ['Exposure to disaster',
'System-based', 'Connectivity', 'Localized']}
In [5]:
# Show the metrics and their associated code
metric_names_df = pd.DataFrame()
metric_names_df['Code'] = metric_names.keys()
metric_names_df['Description'] = [x[0] for x in metric_names.values()]
metric_names_df['Layer I (Paradigm)'] = [x[1] for x in metric_names.values()]
metric_names_df['Layer II (Functionality)'] = [x[2] for x in metric_names.values()]
metric_names_df['Layer III (Aggregation)'] = [x[3] for x in metric_names.values()]
metric_names_df.sort_values(by='Code', inplace=True)
metric_names_df.index = range(len(metric_names_df))
print('Metrics Code and Description')
metric_names_df
Out[5]:
Before analysing the (dis)similarities between the metrics, it is interesting to observe the distribution pattern of the criticality scores from each metric. In this way, we can see whether a few links concentrate the high criticality scores, or whether the scores are normally distributed across all links in the network.
This section explores the distribution pattern of the criticality scores from the top 100 links in each criticality metric.
In [6]:
#subset the result dataframe to only the relevant columns
crit_df2 = crit_df[['osmid','m1_01', 'm1_02', 'm2_01', 'm2_02', 'm3_01', 'm3_02', 'm4_01', 'm4_02', 'm5_01', 'm6_01',
'm7_01', 'm7_02', 'm7_03', 'm8_01', 'm8_02', 'm8_03', 'm9_01', 'm10']]
#rename the metrics, adding '0' before the number
crit_df2.columns = ['osmid','m01_01', 'm01_02', 'm02_01', 'm02_02', 'm03_01', 'm03_02', 'm04_01', 'm04_02', 'm05_01', 'm06_01',
'm07_01', 'm07_02', 'm07_03', 'm08_01', 'm08_02', 'm08_03', 'm09_01', 'm10']
#record the name of each metric
all_metric = sorted(metric_names.keys())
In [7]:
#alter the m5_01 criticality scores so that they are consistent with the formula described in the report
crit_df2['m05_01'] = crit_df2.m05_01.apply(lambda x: 1/x if x > 0 else 2)
In [8]:
# Visualize the distribution of each metric's scores
print("Individual metric scores' distribution")
sns.set_style('white')
fig1 = plt.figure(figsize=(24, 60))
n = 100
for num, metric in enumerate(all_metric):
    new_df = crit_df2[[metric, 'osmid']]
    new_df = new_df.loc[new_df[metric] != 0]
    topn_list = []
    #take only the top 100 links
    try:
        topn_list.extend(list(new_df.sort_values(metric, ascending=False).osmid[:n]))
    except:
        topn_list.extend(list(new_df.sort_values(metric).osmid))
    new_df = new_df.loc[new_df['osmid'].isin(topn_list)]
    #add_subplot returns the axis directly, so exec() is unnecessary
    ax = fig1.add_subplot(12, 5, num + 1)
    b = sns.distplot(new_df[metric], kde=False, rug=False, ax=ax)
    b.set_xlabel(metric, fontsize=24)
fig1.tight_layout()
plt.show()
In order to select the appropriate correlation technique, the distribution of each metric pair should be observed. If both metrics follow a normal distribution, the Pearson correlation coefficient can be used; otherwise, the Spearman rank correlation coefficient should be used.
This section visualizes the comparisons of criticality scores distributions between all metrics.
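The decision rule above can be sketched with SciPy's Shapiro-Wilk normality test. This is a minimal illustration, not part of the project modules: `pick_coefficient` and the sample data are assumptions made here for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for a metric's score distribution: heavy right tail,
# like most criticality metrics in this notebook (illustrative data only)
skewed_scores = rng.lognormal(size=200)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
_, p_skewed = stats.shapiro(skewed_scores)

def pick_coefficient(p_value, alpha=0.05):
    """If normality is rejected, fall back to the Spearman rank coefficient."""
    return 'pearson' if p_value > alpha else 'spearman'

print(pick_coefficient(p_skewed))   # lognormal data: normality rejected -> 'spearman'
```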
First, create a DataFrame of union of top 100 critical links from all metrics
In [9]:
n = 100
topn_list = []
for metric in all_metric:
    new_data = crit_df2.loc[crit_df2[metric] != 0]
    try:
        topn_list.extend(list(new_data.sort_values(metric, ascending=False).osmid[:n]))
    except:
        topn_list.extend(list(new_data.sort_values(metric).osmid))
topn_list = list(set(topn_list))
#.ix is deprecated; select all columns except osmid with .loc
data2 = crit_df2.loc[:, crit_df2.columns != 'osmid']
crit_df2 = crit_df2.loc[crit_df2['osmid'].isin(topn_list)]
Next, visualize the data
In [10]:
net_v.overlap_distribution(crit_df=crit_df2, all_metric=all_metric)
As seen above, most of the metric pairs are not normally distributed. Therefore, the Spearman rank correlation coefficient should be used.
First, create a DataFrame of union of top 100 critical links from all metrics
In [11]:
n = 100
topn_list = []
for metric in all_metric:
    new_data = crit_df2.loc[crit_df2[metric] != 0]
    try:
        topn_list.extend(list(new_data.sort_values(metric, ascending=False).osmid[:n]))
    except:
        topn_list.extend(list(new_data.sort_values(metric).osmid))
topn_list = list(set(topn_list))
#.ix is deprecated; select all columns except osmid with .loc
data2 = crit_df2.loc[:, crit_df2.columns != 'osmid']
crit_df2 = crit_df2.loc[crit_df2['osmid'].isin(topn_list)]
Then calculate the K-S distance
In [12]:
#normalize between 0 and 1,
#because the ks_2samp function assumes the two datasets share a similar scale of values
for metric in all_metric:
    minval = crit_df2[metric].min()
    maxval = crit_df2[metric].max()
    rang = maxval - minval
    crit_df2[metric] = crit_df2[metric].apply(lambda val: (val - minval) / rang)
In [13]:
ks_df = pd.DataFrame(np.nan, index=data2.columns, columns=data2.columns)
for index, rows1 in ks_df.iterrows():
    for value, rows2 in rows1.items():
        D, p = crit.correlate_metrics_ks(df=crit_df2, m_a=index, m_b=value)
        #set_value is deprecated; .at is the scalar setter
        ks_df.at[index, value] = D
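`crit.correlate_metrics_ks` is a project helper; presumably it wraps the two-sample Kolmogorov-Smirnov test from `scipy.stats.ks_2samp` on two score columns. A minimal sketch under that assumption (the `ks_distance` helper and the toy data are illustrative, not the project's implementation):

```python
import pandas as pd
from scipy import stats

def ks_distance(df, m_a, m_b):
    """K-S statistic D between the score distributions of two metrics.

    Assumes both columns are already normalized to [0, 1], as done above,
    since the two-sample K-S test compares raw value distributions.
    """
    D, p = stats.ks_2samp(df[m_a].dropna(), df[m_b].dropna())
    return D, p

# Toy data: identical columns give D = 0; disjoint ranges give D = 1
toy = pd.DataFrame({'a': [0.1, 0.2, 0.3, 0.4],
                    'b': [0.1, 0.2, 0.3, 0.4],
                    'c': [0.6, 0.7, 0.8, 0.9]})
D_same, _ = ks_distance(toy, 'a', 'b')   # D_same == 0.0
D_diff, _ = ks_distance(toy, 'a', 'c')   # D_diff == 1.0
```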
Finally, visualize it in a heatmap
In [14]:
net_v.correlation_plot(ks_df, title='K-S Distance between metrics')
In order to observe the (dis)similarities between metrics, Spearman rank correlation coefficients are used. If two metrics have a high correlation coefficient, they can be considered overlapping, as they highlight the same transport segments as critical; one of them is then a candidate for elimination.
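The Spearman coefficient compares rankings rather than raw values, which is why it suits the non-normal score distributions seen above. A minimal sketch using `scipy.stats.spearmanr` (the `spearman_between` helper and toy data are illustrative assumptions, not the project's `crit.correlate_metrics_spearman`):

```python
import pandas as pd
from scipy import stats

def spearman_between(df, m_a, m_b):
    """Spearman rank correlation between two metric columns."""
    r, p = stats.spearmanr(df[m_a], df[m_b])
    return r, p

# Monotonically related scores correlate perfectly in rank,
# even when the relationship is nonlinear
toy = pd.DataFrame({'m_x': [1, 2, 3, 4, 5],
                    'm_y': [1, 4, 9, 16, 25],   # x squared: same ranking
                    'm_z': [5, 4, 3, 2, 1]})    # reversed ranking
r_xy, _ = spearman_between(toy, 'm_x', 'm_y')   # r_xy ~  1.0
r_xz, _ = spearman_between(toy, 'm_x', 'm_z')   # r_xz ~ -1.0
```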
First, create a DataFrame of union of top 100 critical links from all metrics
In [15]:
n = 100
topn_list = []
for metric in all_metric:
    new_data = crit_df2.loc[crit_df2[metric] != 0]
    try:
        topn_list.extend(list(new_data.sort_values(metric, ascending=False).osmid[:n]))
    except:
        topn_list.extend(list(new_data.sort_values(metric).osmid))
topn_list = list(set(topn_list))
#.ix is deprecated; select all columns except osmid with .loc
data2 = crit_df2.loc[:, crit_df2.columns != 'osmid']
crit_df2 = crit_df2.loc[crit_df2['osmid'].isin(topn_list)]
Then calculate the Spearman rank correlation coefficients between all metrics
In [16]:
spearmanr_df = pd.DataFrame(np.nan, index=data2.columns, columns=data2.columns)
for index, rows1 in spearmanr_df.iterrows():
    for value, rows2 in rows1.items():
        r, p, n = crit.correlate_metrics_spearman(df=crit_df2, m_a=index, m_b=value)
        #set_value is deprecated; .at is the scalar setter
        spearmanr_df.at[index, value] = r
Lastly, visualize it in a heatmap
In [17]:
net_v.correlation_plot(spearmanr_df, title='Spearman Rank Correlation')
The heatmap above shows that the following metrics have high correlation coefficients with the other metrics in general; each is listed with its most correlated counterpart(s):
- m01_01 - m01_02
- m01_02 - m01_01
- m02_01 - m02_02
- m02_02 - m02_01
- m03_01 - m04_02
- m04_02 - m03_01
- m06_01 - m07_01
- m07_01 - m06_01 and m02_01
- m07_02 - m06_01 and m02_01
- m07_03 - m06_01
- m08_02 - m03_01
Therefore we reduce the set of highly correlated metrics to only:
- m01_01
- m02_01
- m04_02
- m06_01
- m08_02
This means we leave out:
- m01_02
- m02_02
- m03_01
- m07_01
- m07_02
- m07_03
In [18]:
n = 100
topn_list = []
for metric in all_metric:
    new_data = crit_df2.loc[crit_df2[metric] != 0]
    try:
        topn_list.extend(list(new_data.sort_values(metric, ascending=False).osmid[:n]))
    except:
        topn_list.extend(list(new_data.sort_values(metric).osmid))
topn_list = list(set(topn_list))
#.ix is deprecated; select all columns except osmid with .loc
data2 = crit_df2.loc[:, crit_df2.columns != 'osmid']
#keep only the reduced (less correlated) metric set
data2 = data2[['m01_01', 'm02_01', 'm03_02', 'm04_01', 'm04_02', 'm05_01', 'm06_01',
               'm08_01', 'm08_02', 'm08_03', 'm09_01', 'm10']]
crit_df2 = crit_df2.loc[crit_df2['osmid'].isin(topn_list)]
spearmanr_df = pd.DataFrame(np.nan, index=data2.columns, columns=data2.columns)
pearsonr_df = pd.DataFrame(np.nan, index=data2.columns, columns=data2.columns)
for index, rows1 in spearmanr_df.iterrows():
    for value, rows2 in rows1.items():
        r, p, n = crit.correlate_metrics_spearman(df=crit_df2, m_a=index, m_b=value)
        spearmanr_df.at[index, value] = r
In [19]:
net_v.correlation_plot(spearmanr_df, title='Spearman Rank Correlation')