We used the same cross-validation test procedure for the three applications described in the paper. This document explains the code in analytics.py used in those tests; the tests carried out in each application are documented separately.
In our experiments, we first test our trained classifiers using all 22 provenance network metrics as defined in the paper. We then repeat the test using only the generic network metrics (6) and only the provenance-specific network metrics (16). Comparing the performance from all three tests will help verify whether the provenance-specific network metrics bring added benefits to the classification application being discussed.
The lists of metrics (combined, generic, and provenance) are defined below.
In [1]:
# The 'combined' list has all the 22 metrics
feature_names_combined = (
    'entities', 'agents', 'activities',             # PROV types (for nodes)
    'nodes', 'edges', 'diameter', 'assortativity',  # standard metrics
    'acc', 'acc_e', 'acc_a', 'acc_ag',              # average clustering coefficients
    'mfd_e_e', 'mfd_e_a', 'mfd_e_ag',               # MFDs
    'mfd_a_e', 'mfd_a_a', 'mfd_a_ag',
    'mfd_ag_e', 'mfd_ag_a', 'mfd_ag_ag',
    'mfd_der',                                      # MFD derivations
    'powerlaw_alpha'                                # Power Law
)
# The 'generic' list has 6 generic network metrics (that do not take provenance information into account)
feature_names_generic = (
    'nodes', 'edges', 'diameter', 'assortativity',  # standard metrics
    'acc',
    'powerlaw_alpha'  # Power Law
)
# The 'provenance' list has 16 provenance-specific network metrics
feature_names_provenance = (
    'entities', 'agents', 'activities',   # PROV types (for nodes)
    'acc_e', 'acc_a', 'acc_ag',           # average clustering coefficients
    'mfd_e_e', 'mfd_e_a', 'mfd_e_ag',     # MFDs
    'mfd_a_e', 'mfd_a_a', 'mfd_a_ag',
    'mfd_ag_e', 'mfd_ag_a', 'mfd_ag_ag',
    'mfd_der',                            # MFD derivations
)
# The utility of the above three sets of metrics will be assessed in our experiments to
# understand whether provenance type information helps us improve data classification performance
feature_name_lists = (
    ('combined', feature_names_combined),
    ('generic', feature_names_generic),
    ('provenance', feature_names_provenance)
)
This section defines the data balancing function, which over-samples the data using the SMOTE algorithm (see SMOTE: Synthetic Minority Over-sampling Technique). It takes a dataframe where each row contains a label (in the label column) and the feature vector corresponding to that label. It returns a new dataframe of the same format, with rows added by the SMOTE over-sampling process.
In [2]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from collections import Counter

def balance_smote(df):
    X = df.drop('label', axis=1)
    Y = df.label
    print('Original data shapes:', X.shape, Y.shape)
    smoX, smoY = X, Y
    c = Counter(smoY)
    while min(c.values()) < max(c.values()):  # check if all classes are balanced; if not, apply SMOTE (again)
        smote = SMOTE(ratio="auto", kind='regular')
        smoX, smoY = smote.fit_sample(smoX, smoY)
        c = Counter(smoY)
    print('Balanced data shapes:', smoX.shape, smoY.shape)
    df_balanced = pd.DataFrame(smoX, columns=X.columns)
    df_balanced['label'] = smoY
    return df_balanced
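As a quick illustration (not part of the original experiments), balance_smote can be exercised on a small made-up dataframe. The data below is arbitrary and only shows the expected input and output formats; it assumes the same imbalanced-learn release as used above (which provides SMOTE's ratio/kind parameters and fit_sample).
# Illustration only: an artificially imbalanced dataframe (22 rows of class 'A', 8 of class 'B')
df_toy = pd.DataFrame({
    'f1': list(range(30)),
    'f2': [v * 0.1 for v in range(30)],
    'label': ['A'] * 22 + ['B'] * 8,
})
df_toy_balanced = balance_smote(df_toy)
print(Counter(df_toy_balanced.label))  # both classes should now have 22 rows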
The t_confidence_interval function below calculates the 95% confidence interval for a given list of values.
In [3]:
import numpy as np
from scipy import stats

def t_confidence_interval(an_array, alpha=0.95):
    s = np.std(an_array)
    n = len(an_array)
    # the interval is centred at 0 (default loc), i.e. it is the ± margin of error around the mean
    return stats.t.interval(alpha=alpha, df=(n - 1), scale=(s / np.sqrt(n)))
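A quick sanity check with made-up numbers (illustration only): the second element of the returned interval is the ± margin of error, which is how cv_test below reports its accuracy.
# Illustration only: 95% confidence interval for a made-up list of accuracy scores
sample_scores = [0.81, 0.84, 0.86, 0.83, 0.85, 0.82, 0.88, 0.84, 0.83, 0.85]
low, high = t_confidence_interval(sample_scores)
print('Mean: %.4f ±%.4f' % (np.mean(sample_scores), high))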
The following cv_test function carries out the cross validation test n_iterations times and returns the accuracy scores and the importance scores (for each feature). The cross validation steps are as follows:
1. Shuffle the data and split it into 10 stratified folds.
2. For each fold, train a decision tree classifier clf using the training set.
3. Test clf on the test set, recording its accuracy and feature importances.
In [4]:
from sklearn import model_selection, tree

def cv_test(X, Y, n_iterations=1000, test_id=""):
    accuracies = []
    importances = []
    while len(accuracies) < n_iterations:
        skf = model_selection.StratifiedKFold(n_splits=10, shuffle=True)
        for train, test in skf.split(X, Y):
            clf = tree.DecisionTreeClassifier()
            clf.fit(X.iloc[train], Y.iloc[train])
            accuracies.append(clf.score(X.iloc[test], Y.iloc[test]))
            importances.append(clf.feature_importances_)
    print("Accuracy: %.2f%% ±%.4f <-- %s" % (np.mean(accuracies) * 100, t_confidence_interval(accuracies)[1] * 100, test_id))
    return accuracies, importances
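As an illustration (not part of the original experiments), cv_test can be exercised on scikit-learn's Iris dataset to show the expected inputs, a feature DataFrame X and a label Series Y; only a few iterations are run here to keep the check quick.
# Illustration only: a quick run of cv_test on the Iris dataset
from sklearn import datasets

iris = datasets.load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
Y_iris = pd.Series(iris.target)
iris_accuracies, iris_importances = cv_test(X_iris, Y_iris, n_iterations=10, test_id='iris sanity check')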
Experiments: Having defined the cross validation method above, we now run it on the dataset (df) using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance).
In [5]:
def test_classification(df, n_iterations=1000):
    results = pd.DataFrame()
    imps = pd.DataFrame()
    Y = df.label
    for feature_list_name, feature_names in feature_name_lists:
        X = df[list(feature_names)]
        accuracies, importances = cv_test(X, Y, n_iterations, test_id=feature_list_name)
        rs = pd.DataFrame(
            {
                'Metrics': feature_list_name,
                'Accuracy': accuracies
            }
        )
        results = results.append(rs, ignore_index=True)
        if feature_list_name == "combined":  # we are interested in the relevance of all features (i.e. 'combined')
            imps = pd.DataFrame(importances, columns=feature_names)
    return results, imps
In summary, the test_classification() function above takes a DataFrame with a special label column holding the labels for the intended classification. It runs the cross validation test three times: (1) with the combined metrics, (2) with only the generic metrics, and (3) with only the provenance-specific metrics. The accuracy measures from those tests (1,000 values from each) are collated in the returned results DataFrame. The importance measures of all the 22 metrics, calculated in test (1), are also collated and returned in the imps DataFrame.
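To illustrate the end-to-end flow (again, not part of the original experiments), the sketch below runs test_classification on a randomly generated DataFrame that has the 22 metric columns and a label column. The values are pure noise, so the reported accuracy is meaningless here; the code only demonstrates the expected input format and how the returned DataFrames can be summarised. It assumes the same pandas release as above (which still provides DataFrame.append).
# Illustration only: random data with the expected columns, just to exercise the code path
rng = np.random.RandomState(0)
df_demo = pd.DataFrame({name: rng.rand(60) for name in feature_names_combined})
df_demo['label'] = ['A'] * 20 + ['B'] * 20 + ['C'] * 20

results_demo, imps_demo = test_classification(df_demo, n_iterations=20)

# Summarise the accuracy per metric set and the most relevant features
print(results_demo.groupby('Metrics').Accuracy.mean())
print(imps_demo.mean().sort_values(ascending=False).head())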