Identifying instructions from chat messages in the Radiation Response Game
The datasets from this application are provided in the folder rrg. Each CSV file, depgraphs-$k$.csv with $k = 1 \ldots 18$, is a table whose rows correspond to individual chat messages in RRG. The label column holds the manual classification of each message (e.g., instruction, information, requests, etc.).
In [1]:
import pandas as pd
In [2]:
filepath = lambda k: "rrg/depgraphs-%d.csv" % k
In [3]:
# An example of reading the data file
df = pd.read_csv(filepath(5), index_col=0)
df.head()
Out[3]:
In [4]:
label = lambda l: 'other' if l != 'instruction' else l
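As a quick check of this mapping (a small usage sketch; the example labels are taken from the list above), instruction is kept as-is and every other label is collapsed into other, turning the task into a binary classification:
# 'instruction' is kept; any other manual label is collapsed into 'other'
assert label('instruction') == 'instruction'
assert label('information') == 'other'
assert label('requests') == 'other'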
In [5]:
df.label = df.label.apply(label).astype('category')
df.head()
Out[5]:
In [6]:
# Examine the balance of the dataset
df.label.value_counts()
Out[6]:
Since both labels have roughly the same number of data points, we decide not to balance the RRG datasets.
We now run the cross validation tests on the 18 datasets ($k = 1 \ldots 18$) using all the features (combined), only the generic network metrics (generic), and only the provenance-specific network metrics (provenance). For each dataset, we run the cross validation tests and store the accuracy scores into results and the feature importance into importances.
Please refer to Cross Validation Code.ipynb for the detailed description of the cross validation code.
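For a rough idea of what each test involves, the sketch below is a simplified illustration only, not the actual implementation in the analytics module: it assumes a list of feature columns is given and shows a single iteration that trains a decision tree classifier on a random train/test split, recording its accuracy and feature importances.
# A simplified sketch of one cross validation iteration (for illustration only):
# train a decision tree on a random train/test split of the chosen metric columns,
# then record its accuracy and the importance it assigns to each feature.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def one_iteration(df, feature_columns):
    X, y = df[feature_columns], df.label
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    importances = dict(zip(feature_columns, clf.feature_importances_))
    return accuracy, importances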
In [7]:
from analytics import test_classification
In [8]:
results = pd.DataFrame()
importances = pd.DataFrame()
for k in range(1, 19):
    df = pd.read_csv(filepath(k), index_col=0)
    df.label = df.label.apply(label).astype('category')
    res, imps = test_classification(df, n_iterations=1000, test_id=str(k))
    res['$k$'] = k
    imps['$k$'] = k
    # storing the results and importance of features
    results = pd.concat([results, res], ignore_index=True)
    importances = pd.concat([importances, imps], ignore_index=True)
Optionally, we can save the test results so that they can be reloaded later without rerunning the tests:
In [9]:
results.to_pickle("rrg/results.pkl")
importances.to_pickle("rrg/importances.pkl")
Next time, we can reload the results as follows:
In [10]:
import pandas as pd
results = pd.read_pickle("rrg/results.pkl")
importances = pd.read_pickle("rrg/importances.pkl")
results.shape, importances.shape
Out[10]:
In [11]:
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("paper", font_scale=1.4)
For this application, with so many configurations to chart, it is difficult to determine which configuration yields the best accuracy from a figure. Instead, we determine this from the data: we group the performance of all classifiers by the set of metrics they used and the $k$ value, then calculate the mean accuracy of each group.
In [12]:
results['Accuracy'] = results['Accuracy'] * 100 # converting accuracy values to percent
In [13]:
# define a function to calculate the mean and its confidence interval from a group of values
import scipy.stats as st
def calc_means_ci(group):
    mean = group.mean()
    ci_low, ci_high = st.t.interval(0.95, group.size - 1, loc=mean, scale=st.sem(group))
    return pd.Series({
        'mean': mean,
        'ci_low': ci_low,
        'ci_high': ci_high
    })
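The interval computed by st.t.interval above is the standard $t$-based 95% confidence interval for the mean of a group of $n$ accuracy values:
$$\bar{x} \pm t_{0.975,\,n-1} \cdot \frac{s}{\sqrt{n}}$$
where $\bar{x}$ is the group mean, $s$ the sample standard deviation (so $s/\sqrt{n}$ is the standard error returned by st.sem), and $t_{0.975,\,n-1}$ the 97.5th percentile of the $t$-distribution with $n - 1$ degrees of freedom.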
In [14]:
accuracy_by_metrics_k = results.groupby(["Metrics", "$k$"]) # grouping results by metrics sets and k
In [15]:
# Calculate the means and the confidence intervals over the grouped data (using the calc_means_ci function above)
results_means_ci = accuracy_by_metrics_k.Accuracy.apply(calc_means_ci).unstack()
results_means_ci = results_means_ci[['mean', 'ci_low', 'ci_high']] # reorder the column
results_means_ci
Out[15]:
Next, we sort the mean accuracy values within each set of metrics and find the $k$ value that yields the highest accuracy for each set (i.e. combined, generic, and provenance).
In [16]:
# Looking at only the means in each set of metrics
results_means_ci['mean'].unstack()
Out[16]:
In [17]:
# Finding the highest accuracy value in each row (i.e. each set of metrics)
highest_accuracy_configurations = [
    (row_name, row.idxmax())  # the index (i.e. k value) of the highest accuracy in this metrics set
    for row_name, row in results_means_ci['mean'].unstack().iterrows()
]
highest_accuracy_configurations
Out[17]:
In [18]:
results_means_ci.loc[highest_accuracy_configurations, :]
Out[18]:
The results above show that the generic metrics with $k = 13$ yield the highest accuracy level: 85.24%. Using all the metrics or only the provenance-specific metrics with $k = 11$ yields comparable levels of accuracy (within the confidence interval of the highest accuracy).
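As a quick check of this claim, the sketch below (our own filter over the results_means_ci table computed earlier, for illustration) lists the configurations whose mean accuracy falls within the confidence interval of the best configuration:
# Find the best configuration and list every configuration whose mean accuracy
# lies within its 95% confidence interval, i.e. with comparable accuracy.
best = results_means_ci.loc[results_means_ci['mean'].idxmax()]
results_means_ci[results_means_ci['mean'].between(best['ci_low'], best['ci_high'])]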
For a visual comparison of all the configurations tested, we chart their accuracy next.
In [19]:
pal = sns.light_palette("seagreen", n_colors=3, reverse=True)
plot = sns.barplot(x="$k$", y="Accuracy", hue='Metrics', palette=pal, errwidth=1, capsize=0.04, data=results)
plot.figure.set_size_inches((10, 4))
plot.legend(loc='upper center', bbox_to_anchor=(0.5, 1.02), ncol=3)
plot.set_ylabel('Accuracy (%)')
plot.set_ylim(50, 95)
# drawing a line at the highest accuracy for visual comparison between configurations
highest_accuracy = results_means_ci['mean'].max()
plot.axes.plot([0, 17], [highest_accuracy, highest_accuracy], 'g')
highest_accuracy
Out[19]:
The chart shows that the configurations that yield the highest accuracy are: $k = 11$ with combined, generic, or provenance metrics; $k = 13$ with generic metrics; and $k = 15$ with combined metrics. The accuracy level seems to decrease for $k > 15$.
Saving the chart above to Fig6.eps to be included in the paper:
In [20]:
plot.figure.savefig("figures/Fig6.eps")
In [21]:
# Rename the columns with Math notation for consistency with the metrics symbols in the paper
feature_name_maths_mapping = {
"entities": "$n_e$", "agents": "$n_{ag}$", "activities": "$n_a$", "nodes": "$n$", "edges": "$e$",
"diameter": "$d$", "assortativity": "$r$", "acc": "$\\mathsf{ACC}$",
"acc_e": "$\\mathsf{ACC}_e$", "acc_a": "$\\mathsf{ACC}_a$", "acc_ag": "$\\mathsf{ACC}_{ag}$",
"mfd_e_e": "$\\mathrm{mfd}_{e \\rightarrow e}$", "mfd_e_a": "$\\mathrm{mfd}_{e \\rightarrow a}$",
"mfd_e_ag": "$\\mathrm{mfd}_{e \\rightarrow ag}$", "mfd_a_e": "$\\mathrm{mfd}_{a \\rightarrow e}$",
"mfd_a_a": "$\\mathrm{mfd}_{a \\rightarrow a}$", "mfd_a_ag": "$\\mathrm{mfd}_{a \\rightarrow ag}$",
"mfd_ag_e": "$\\mathrm{mfd}_{ag \\rightarrow e}$", "mfd_ag_a": "$\\mathrm{mfd}_{ag \\rightarrow a}$",
"mfd_ag_ag": "$\\mathrm{mfd}_{ag \\rightarrow ag}$", "mfd_der": "$\\mathrm{mfd}_\\mathit{der}$", "powerlaw_alpha": "$\\alpha$"
}
importances.rename(columns=feature_name_maths_mapping, inplace=True)
In [22]:
grouped = importances.groupby("$k$")  # Grouping the importance values by k
In [23]:
# Calculate the mean importance of each feature for each data type
imp_means = grouped.mean()
In [24]:
three_most_relevant_metrics = pd.DataFrame(
    {row_name: row.nlargest(3).index.to_numpy()  # the three metrics with the highest importance in each row
     for row_name, row in imp_means.iterrows()}
)
three_most_relevant_metrics
Out[24]:
The table above shows the most important metrics as reported by the decision tree classifiers during their training for each value of $k$.
Apart from $k = 1$, whose performance is no better than the random baseline, we count the occurrences of the most relevant metrics in cases where $k \geq 2$ to find the most common metrics in the table above.
In [25]:
metrics_occurrences = three_most_relevant_metrics.loc[:,2:].apply(pd.value_counts, axis=1).fillna(0) # excluding k = 1
metrics_occurrences
Out[25]:
In [26]:
# sorting the sum of the metrics occurrences
pd.DataFrame(metrics_occurrences.sum().sort_values(ascending=False), columns=['occurrences'])
Out[26]:
As shown above, the number of edges $e$, the number of entities $n_e$, and the maximum finite distance between entities and activities $\mathrm{mfd}_{e \rightarrow a}$ are the most common metrics in the table of the most relevant metrics.
In this extra experiment, we run the same experiment as above but on the full dependency graphs of messages (similar to the experiments in Application 2), i.e. without restricting a dependency graph to $k$ edges away from a message entity. The provenance network metrics of those dependency graphs are provided in rrg/depgraphs.csv, which has the same format as the other CSV files provided in this application.
In [27]:
# Reading the data
df = pd.read_csv("rrg/depgraphs.csv", index_col=0)
# Generate the label for classification
df.label = df.label.apply(label).astype('category')
In [28]:
res, imps = test_classification(df, n_iterations=1000)
Results: The above accuracy levels are very low (compared with the 50% baseline accuracy of random selection between two labels), indicating that the provenance network metrics of full dependency graphs of RRG messages do not correlate well with the nature of the messages.
The reason for this is that an RRG provenance graph captures all the activities in a game, which are all connected. As an RRG provenance graph evolves linearly along the lifeline of a game, the size of a message's dependency graph varies greatly depending on when in the game the message was sent; messages sent at the beginning of a game have significantly more (potential) dependants than those sent later. This is shown in the histograms of the number of nodes and edges below. As a result, their network metrics are similarly noisy and are not a good predictor of the message type.
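To quantify this variation before plotting the histograms, a quick look at the summary statistics of the graph sizes (a simple check on the same dataframe) shows how widely the number of nodes and edges spread across messages:
# Summary statistics (count, mean, std, quartiles) of the dependency graph sizes
df[['nodes', 'edges']].describe()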
In [29]:
df.nodes.hist()
Out[29]:
In [30]:
df.edges.hist()
Out[30]: