Assessing the quality of crowdsourced data in CollabMap from their provenance
The CollabMap dataset is provided in the collabmap/depgraphs.csv
file; each row corresponds to a building, route, or route set created in the application:
id
: the identifier of the data entity (i.e. building/route/route set).
trust_value
: the beta trust value calculated from the votes for the data entity.
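The beta trust value itself is not computed in this notebook. For reference, a common formulation (e.g. Jøsang and Ismail's beta reputation model) derives it from the numbers of positive and negative votes a data entity received; a minimal sketch, which may differ from the exact computation used by CollabMap:

def beta_trust(positive_votes, negative_votes):
    # Expected value of the Beta(p + 1, n + 1) distribution, a standard
    # definition of beta trust; CollabMap's exact formula may differ.
    p, n = positive_votes, negative_votes
    return (p + 1) / (p + n + 2)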
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv("collabmap/depgraphs.csv", index_col='id')
df.head()
Out[2]:
In [3]:
df.describe()
Out[3]:
In [4]:
trust_threshold = 0.75
df['label'] = df.apply(lambda row: 'Trusted' if row.trust_value >= trust_threshold else 'Uncertain', axis=1)
df.head() # The new label column is the last column below
Out[4]:
Having used the trust value to label all the data entities, we remove the trust_value
column from the data frame.
In [5]:
# We will not use trust value from now on
df.drop('trust_value', axis=1, inplace=True)
df.shape # the dataframe now has 23 columns (22 metrics + label)
Out[5]:
In [6]:
df_buildings = df.filter(like="Building", axis=0)
df_routes = df.filter(regex=r"^Route\d", axis=0)
df_routesets = df.filter(like="RouteSet", axis=0)
df_buildings.shape, df_routes.shape, df_routesets.shape # The number of data points in each dataset
Out[6]:
This section examines the class balance of each of the three datasets and balances them using the SMOTE oversampling method.
In [7]:
from analytics import balance_smote
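The balance_smote function is provided by the accompanying analytics module. As a rough guide, a minimal sketch of what it might do, assuming it wraps the SMOTE implementation from the imbalanced-learn package (the actual implementation may differ):

import pandas as pd
from imblearn.over_sampling import SMOTE

def balance_smote(df):
    # Oversample the minority class so both labels are equally represented
    X = df.drop('label', axis=1)   # the 22 network metrics
    y = df.label                   # 'Trusted' / 'Uncertain'
    X_res, y_res = SMOTE().fit_resample(X, y)
    balanced = pd.DataFrame(X_res, columns=X.columns)
    balanced['label'] = y_res
    return balanced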
In [8]:
df_buildings.label.value_counts()
Out[8]:
Balancing the building dataset:
In [9]:
df_buildings = balance_smote(df_buildings)
In [10]:
df_routes.label.value_counts()
Out[10]:
Balancing the route dataset:
In [11]:
df_routes = balance_smote(df_routes)
In [12]:
df_routesets.label.value_counts()
Out[12]:
Balancing the route set dataset:
In [13]:
df_routesets = balance_smote(df_routesets)
We now run the cross validation tests on the three balanced datasets (df_buildings, df_routes, and df_routesets) using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance). Please refer to Cross Validation Code.ipynb for a detailed description of the cross validation code.
In [14]:
from analytics import test_classification
We test the classification of buildings, collecting the individual accuracy scores in results
and the importance of every feature in each test in importances
(both are pandas DataFrames). These two tables will also be used to collect data from testing the classification of routes and route sets later.
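As a rough sketch of its interface, test_classification might be structured along the following lines, assuming it repeatedly cross-validates a decision tree on each feature set; the feature_sets partition below is hypothetical and stands in for the actual generic/provenance metric lists defined in the analytics module:

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def test_classification(df, n_iterations=10):
    X_all, y = df.drop('label', axis=1), df.label
    # hypothetical: only the 'combined' set is shown here; the real code
    # also tests the 'generic' and 'provenance' metric subsets
    feature_sets = {'combined': list(X_all.columns)}
    acc_rows, imp_rows = [], []
    for name, features in feature_sets.items():
        X = X_all[features]
        for _ in range(n_iterations):
            clf = DecisionTreeClassifier()
            for score in cross_val_score(clf, X, y, cv=10):
                acc_rows.append({'Metrics': name, 'Accuracy': score})
            clf.fit(X, y)
            imp_rows.append(dict(zip(features, clf.feature_importances_)))
    return pd.DataFrame(acc_rows), pd.DataFrame(imp_rows)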
In [15]:
# Cross validation test on building classification
res, imps = test_classification(df_buildings)
# adding the Data Type column
res['Data Type'] = 'Building'
imps['Data Type'] = 'Building'
# storing the results and importance of features
results = res
importances = imps
# showing a few newest rows
results.tail()
Out[15]:
In [16]:
# Cross validation test on route classification
res, imps = test_classification(df_routes)
# adding the Data Type column
res['Data Type'] = 'Route'
imps['Data Type'] = 'Route'
# storing the results and importance of features
results = pd.concat([results, res], ignore_index=True)
importances = pd.concat([importances, imps], ignore_index=True)
# showing a few newest rows
results.tail()
Out[16]:
In [17]:
# Cross validation test on route set classification
res, imps = test_classification(df_routesets)
# adding the Data Type column
res['Data Type'] = 'Route Set'
imps['Data Type'] = 'Route Set'
# storing the results and importance of features
results = pd.concat([results, res], ignore_index=True)
importances = pd.concat([importances, imps], ignore_index=True)
# showing a few newest rows
results.tail()
Out[17]:
In [18]:
results.to_pickle("collabmap/results.pkl")
importances.to_pickle("collabmap/importances.pkl")
Next time, we can reload the results as follows:
In [19]:
import pandas as pd
results = pd.read_pickle("collabmap/results.pkl")
importances = pd.read_pickle("collabmap/importances.pkl")
results.shape, importances.shape # showing the shape of the data (for checking)
Out[19]:
In [20]:
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("paper", font_scale=1.4)
Converting the accuracy scores from [0, 1] to percentages, i.e. [0, 100]:
In [21]:
results.Accuracy = results.Accuracy * 100
results.head()
Out[21]:
In [22]:
from matplotlib.font_manager import FontProperties
fontP = FontProperties()
fontP.set_size(12)
In [23]:
pal = sns.light_palette("seagreen", n_colors=3, reverse=True)
plot = sns.barplot(x="Data Type", y="Accuracy", hue='Metrics', palette=pal, errwidth=1, capsize=0.02, data=results)
plot.set_ylim(80, 100)
plot.legend(loc='upper center', bbox_to_anchor=(0.5, 1.0), ncol=3, prop=fontP)
plot.set_ylabel('Accuracy (%)')
Out[23]:
Saving the chart above to Fig4.eps
to be included in the paper:
In [24]:
plot.figure.savefig("figures/Fig4.eps")
In this section, we explore the relevance of each feature in classifying the data quality of CollabMap buildings, routes, and route sets. To do so, we analyse the feature importance values provided by the decision tree training above, which are stored in the importances
data frame.
In [25]:
import numpy as np
In [26]:
# Rename the columns with Math notation for consistency with the metrics symbols in the paper
feature_name_maths_mapping = {
"entities": "$n_e$", "agents": "$n_{ag}$", "activities": "$n_a$", "nodes": "$n$", "edges": "$e$",
"diameter": "$d$", "assortativity": "$r$", "acc": "$\\mathsf{ACC}$",
"acc_e": "$\\mathsf{ACC}_e$", "acc_a": "$\\mathsf{ACC}_a$", "acc_ag": "$\\mathsf{ACC}_{ag}$",
"mfd_e_e": "$\\mathrm{mfd}_{e \\rightarrow e}$", "mfd_e_a": "$\\mathrm{mfd}_{e \\rightarrow a}$",
"mfd_e_ag": "$\\mathrm{mfd}_{e \\rightarrow ag}$", "mfd_a_e": "$\\mathrm{mfd}_{a \\rightarrow e}$",
"mfd_a_a": "$\\mathrm{mfd}_{a \\rightarrow a}$", "mfd_a_ag": "$\\mathrm{mfd}_{a \\rightarrow ag}$",
"mfd_ag_e": "$\\mathrm{mfd}_{ag \\rightarrow e}$", "mfd_ag_a": "$\\mathrm{mfd}_{ag \\rightarrow a}$",
"mfd_ag_ag": "$\\mathrm{mfd}_{ag \\rightarrow ag}$", "mfd_der": "$\\mathrm{mfd}_\\mathit{der}$", "powerlaw_alpha": "$\\alpha$"
}
importances.rename(columns=feature_name_maths_mapping, inplace=True)
In [27]:
grouped = importances.groupby("Data Type")  # Grouping the importance values by data type
In [28]:
sns.set_context("talk")
grouped.boxplot(figsize=(16, 5), layout=(1, 3), rot=90)
Out[28]:
The charts above show us the relevance of each feature in classifying the quality of CollabMap buildings, routes, and route sets. Next, we find the three most relevant features for each data type to report in the paper.
In [29]:
# Calculate the mean importance of each feature for each data type
imp_means = grouped.mean()
In [30]:
pd.DataFrame(
    {row_name: row.sort_values(ascending=False)[:3].index.to_numpy()
     for row_name, row in imp_means.iterrows()}
)
Out[30]:
The table above shows the most important metrics as reported by the decision tree classifiers during their training for each dataset.
In [31]:
from analytics import cv_test
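cv_test is the lower-level routine underlying the cross validation tests above; a minimal sketch of its presumed behaviour, assuming it cross-validates a decision tree on the selected features and labels the results with test_id (the actual implementation is in the analytics module):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cv_test(X, y, test_id, cv=10):
    # Cross-validate a decision tree on the selected features and report
    # the mean accuracy under the given test_id
    clf = DecisionTreeClassifier()
    scores = cross_val_score(clf, X, y, cv=cv)
    print(f"{test_id}: mean accuracy {scores.mean():.1%}")
    clf.fit(X, y)
    res = pd.DataFrame({'Accuracy': scores, 'Data Type': test_id})
    imps = pd.DataFrame([dict(zip(X.columns, clf.feature_importances_))])
    return res, imps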
In [32]:
res, imps = cv_test(df_buildings[['assortativity', 'acc', 'activities']], df_buildings.label, test_id="Building")
In [33]:
res, imps = cv_test(df_routes[['acc', 'diameter', 'mfd_der']], df_routes.label, test_id="Route")
In [34]:
res, imps = cv_test(df_routesets[['assortativity', 'acc_e', 'entities']], df_routesets.label, test_id="Routeset")
As shown above, using only three metrics from each dataset, we can still achieve a high level of accuracy from the classifiers: 90%, 95%, and 95% for buildings, routes, and route sets, respectively. This performance is very close to that achieved when all the metrics were used (90%, 97%, and 96%, respectively). It shows that, in certain applications, we can select a smaller set of metrics for classification based on the relevance analysis above, reducing the computational cost without significantly affecting classification performance.