Identifying owners of provenance documents from their provenance network metrics
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv("provstore/data.csv")
df.head()
Out[2]:
In [3]:
df.describe()
Out[3]:
In [4]:
# The number of each label in the dataset
df.label.value_counts()
Out[4]:
In [5]:
from analytics import balance_smote, test_classification
Balancing the data
With an unbalanced dataset like the one above, the trained classifier will typically be skewed towards the majority labels. To mitigate this, we balance the dataset using the SMOTE oversampling method.
In [6]:
df = balance_smote(df)
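The implementation of balance_smote is not shown in this notebook. As a rough illustration of the idea behind SMOTE only, the sketch below creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The function name smote_oversample, its parameters, and the toy data are hypothetical and are not taken from the analytics module.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # A random position on the segment from `base` to `neighbour`
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: four minority-class samples with two features each
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_oversample(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay within the region already occupied by that class rather than simply duplicating existing rows.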
Cross Validation tests: We now run the cross validation tests on the dataset (df) using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance). Please refer to Cross Validation Code.ipynb for the detailed description of the cross validation code.
In [7]:
results, importances = test_classification(df)
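The actual procedure is documented in Cross Validation Code.ipynb. Purely to illustrate the shape of such a test, the sketch below runs stratified k-fold cross validation once per feature subset and reports the mean accuracy for each. Everything in it is an assumption made for illustration: the nearest-centroid classifier is a stand-in (the classifier used by test_classification is not stated here), the toy data is random, and the assignment of columns to the generic and provenance groups is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to folds so each fold has a similar label mix."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_label.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def fit_centroids(X, y):
    """Stand-in classifier: the mean feature vector of each label."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] += 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(
        centroids,
        key=lambda lab: sum((c - v) ** 2 for c, v in zip(centroids[lab], x)),
    )


def cross_val_accuracy(X, y, columns, n_folds=5):
    """Mean accuracy over n_folds using only the given feature columns."""
    Xs = [[row[c] for c in columns] for row in X]
    scores = []
    for test_idx in stratified_folds(y, n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        centroids = fit_centroids([Xs[i] for i in train_idx],
                                  [y[i] for i in train_idx])
        hits = sum(predict(centroids, Xs[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Hypothetical toy data: two owners, three features per document
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(20)]
X += [[rng.gauss(5, 1) for _ in range(3)] for _ in range(20)]
y = ["owner_a"] * 20 + ["owner_b"] * 20

# Hypothetical grouping of columns into the three test configurations
feature_sets = {"combined": [0, 1, 2], "generic": [0], "provenance": [1, 2]}
toy_results = {name: cross_val_accuracy(X, y, cols)
               for name, cols in feature_sets.items()}
```

Comparing the three entries of toy_results mirrors the comparison made in this notebook: it shows how much accuracy is retained when only one family of metrics is available.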
Result: The outputs above are the accuracy of the classifier in identifying the owner of a provenance document from ProvStore using all provenance network metrics (i.e. combined), only generic network metrics, and only provenance-specific network metrics.
The individual accuracy scores are stored in results and the importance of every feature in each test in importances (both are pandas DataFrame objects).
In [8]:
results.to_pickle("provstore/results.pkl")
importances.to_pickle("provstore/importances.pkl")
Next time, we can reload the results as follows:
In [9]:
import pandas as pd
results = pd.read_pickle("provstore/results.pkl")
importances = pd.read_pickle("provstore/importances.pkl")
results.shape, importances.shape # showing the shape of the data (for checking)
Out[9]:
In [10]:
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("talk")
In [11]:
# Rename the columns with Math notation for consistency with the metrics symbols in the paper
feature_name_maths_mapping = {
"entities": "$n_e$", "agents": "$n_{ag}$", "activities": "$n_a$", "nodes": "$n$", "edges": "$e$",
"diameter": "$d$", "assortativity": "$r$", "acc": "$\\mathsf{ACC}$",
"acc_e": "$\\mathsf{ACC}_e$", "acc_a": "$\\mathsf{ACC}_a$", "acc_ag": "$\\mathsf{ACC}_{ag}$",
"mfd_e_e": "$\\mathrm{mfd}_{e \\rightarrow e}$", "mfd_e_a": "$\\mathrm{mfd}_{e \\rightarrow a}$",
"mfd_e_ag": "$\\mathrm{mfd}_{e \\rightarrow ag}$", "mfd_a_e": "$\\mathrm{mfd}_{a \\rightarrow e}$",
"mfd_a_a": "$\\mathrm{mfd}_{a \\rightarrow a}$", "mfd_a_ag": "$\\mathrm{mfd}_{a \\rightarrow ag}$",
"mfd_ag_e": "$\\mathrm{mfd}_{ag \\rightarrow e}$", "mfd_ag_a": "$\\mathrm{mfd}_{ag \\rightarrow a}$",
"mfd_ag_ag": "$\\mathrm{mfd}_{ag \\rightarrow ag}$", "mfd_der": "$\\mathrm{mfd}_\\mathit{der}$", "powerlaw_alpha": "$\\alpha$"
}
importances.rename(columns=feature_name_maths_mapping, inplace=True)
In [12]:
plot = sns.barplot(data=importances)
for i in plot.get_xticklabels():
    i.set_rotation(90)
From the above chart, the three most important features for this application are: $n_e$, $\mathrm{mfd}_{e \rightarrow ag}$, and $\mathsf{ACC}$.
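The ranking read off the chart can also be checked numerically. The sketch below assumes that importances holds one row per test run and one column per feature, which is consistent with how it is renamed and plotted above but is not confirmed by the notebook. The toy values are hypothetical and are not the results of this experiment.

```python
import pandas as pd

# Hypothetical importances: one row per run, one column per feature
toy_importances = pd.DataFrame({
    "entities": [0.30, 0.28, 0.32],
    "mfd_e_ag": [0.20, 0.22, 0.21],
    "acc": [0.15, 0.14, 0.16],
    "diameter": [0.05, 0.06, 0.04],
})

# Average each feature's importance over the runs and keep the three largest
top3 = toy_importances.mean().sort_values(ascending=False).head(3)
top3_names = list(top3.index)
```

Applied to the real importances DataFrame, the same expression would list the features with the highest mean importance, which can then be compared against the bars in the chart.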