The purpose of this notebook is to provide a time analysis of the various versions of CAPEC. The difference report, available under the Difference Report tab of the release download page, gives a brief list of the Attack Patterns, Categories and Views that changed from the previous version of CAPEC to the new one. It does not, however, show what changed inside each entry. The goal of the time analysis is to provide that information.
The difference report between CAPEC v2.9 and CAPEC v2.10 can be found here
In [1]:
from IPython.display import Image
Image(filename='../Difference_Report/img/difference_report_sample.PNG')
Out[1]:
On the CAPEC index website, where the Release Downloads and XML images are provided, the data is organised as a table. Each row can therefore be parsed to obtain its fifth element, the Difference Report, which gives us the hyperlinks of the Difference Reports to parse.
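As a minimal sketch of this step, the fifth cell of each row can be read with the standard-library `html.parser`. The column headings, file names and the `sample_index` markup below are invented for illustration; the real index page may nest tables or use different markup, in which case the cell counting would need adjusting.

```python
from html.parser import HTMLParser


class DiffReportLinkParser(HTMLParser):
    """Collect the hyperlinks found in the fifth cell of every table row."""

    def __init__(self, column=5):
        super().__init__()
        self.column = column      # 1-based position of the Difference Report cell
        self.cell_index = 0       # position of the current cell within its row
        self.in_target = False    # True while inside the target cell
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell_index = 0
        elif tag in ("td", "th"):
            self.cell_index += 1
            self.in_target = self.cell_index == self.column
        elif tag == "a" and self.in_target:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_target = False


def difference_report_links(index_html):
    parser = DiffReportLinkParser()
    parser.feed(index_html)
    return parser.links


# Hypothetical stand-in for the index page; names and paths are assumptions.
sample_index = """
<table>
  <tr><th>Version</th><th>Date</th><th>XML</th><th>Schema</th><th>Difference Report</th></tr>
  <tr><td>2.10</td><td>-</td><td><a href="capec_v2.10.xml">XML</a></td><td>-</td>
      <td><a href="diff_reports/v2.9_v2.10.html">Report</a></td></tr>
  <tr><td>2.9</td><td>-</td><td><a href="capec_v2.9.xml">XML</a></td><td>-</td>
      <td><a href="diff_reports/v2.8_v2.9.html">Report</a></td></tr>
</table>
"""
links = difference_report_links(sample_index)
print(links)
```

Counting cells per row keeps the sketch independent of any particular class or id attribute, at the cost of breaking if the column order on the page changes.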
Similarly inside every Difference Report, the various sub headings are represented using HTML Tables. Therefore, we can easily parse each row to obtain each Attack Pattern or Category or View that has been changed, modified or added.
For the time analysis, however, we will primarily need the ones that have changed.
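The per-table parsing described above could look roughly like the following sketch, which groups table rows under the most recent heading. It assumes each sub-heading is an `h1` to `h4` element followed by its table; the second heading name and the sample rows are illustrative assumptions, not copied from a real report.

```python
from html.parser import HTMLParser


class SectionTableParser(HTMLParser):
    """Group table rows under the text of the most recent heading."""

    HEADINGS = {"h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.sections = {}     # heading text -> list of rows (lists of cell strings)
        self.heading = None
        self.capture = None    # "heading" or "cell" while text is being collected
        self.buffer = []
        self.row = None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.capture, self.buffer = "heading", []
        elif tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.capture, self.buffer = "cell", []

    def handle_data(self, data):
        if self.capture:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        text = " ".join("".join(self.buffer).split())   # normalise whitespace
        if tag in self.HEADINGS and self.capture == "heading":
            self.heading = text
            self.sections.setdefault(text, [])
            self.capture = None
        elif tag in ("td", "th") and self.capture == "cell":
            self.row.append(text)
            self.capture = None
        elif tag == "tr" and self.row is not None:
            if self.heading is not None and self.row:
                self.sections[self.heading].append(self.row)
            self.row = None


# Hypothetical report fragment; the markup layout is an assumption.
sample_report = """
<h3>Existing Views Modified with Enhanced Material</h3>
<table><tr><td>1000</td><td>Mechanisms of Attack</td></tr></table>
<h3>Existing Attack Patterns Modified with Enhanced Material</h3>
<table>
  <tr><td>66</td><td>SQL Injection</td></tr>
  <tr><td>7</td><td>Blind SQL Injection</td></tr>
</table>
"""
report = SectionTableParser()
report.feed(sample_report)
modified_views = report.sections["Existing Views Modified with Enhanced Material"]
print(modified_views)
```

Keeping only the sections whose heading mentions modification then yields the entries the time analysis needs.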
Our goal in building the time analysis is to write a script that parses each difference report and exports a CSV file listing the CAPEC Attack Patterns, Categories and Views that have changed across those versions.
The script must first scrape the CAPEC index HTML file and obtain the hyperlinks of the various difference reports. It must then create a CSV file that leaves the first column empty and names the remaining columns after the CAPEC versions, from the latest to the first.
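A small sketch of the header row follows. The version list is a hypothetical subset, and an in-memory buffer stands in for the real output file. Versions are sorted numerically because a plain string sort would place "2.10" before "2.9".

```python
import csv
import io


def version_key(version):
    """Turn '2.10' into (2, 10) so versions sort numerically, not alphabetically."""
    return tuple(int(part) for part in version.split("."))


def build_header(versions):
    """Empty first column, then one column per version, newest first."""
    return [""] + sorted(versions, key=version_key, reverse=True)


versions = ["2.8", "2.10", "2.9"]    # hypothetical subset of CAPEC versions
header = build_header(versions)

buffer = io.StringIO()               # stand-in for an open CSV file
csv.writer(buffer).writerow(header)
print(buffer.getvalue().strip())     # ,2.10,2.9,2.8
```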
The script must then parse every individual report, and provide three different sub-headings - Views, Categories and Attack Patterns.
Under Views, the script must then parse the "Existing Views Modified with Enhanced Material" table and obtain the entries under it. Depending on the difference report it is parsing, the script must identify the two XML versions for which it needs to provide the time analysis. The script must then parse those two XMLs, compare each field with its counterpart, and display both versions of every field that has changed.
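The field-by-field comparison might be sketched as below with `xml.etree.ElementTree`. The element names (`Catalog`, `View`, `Objective`, `Note`) and the sample text are simplified assumptions; the real CAPEC XML uses namespaces and a richer schema, and repeated child tags would need a list rather than a single value per tag.

```python
import xml.etree.ElementTree as ET


def entry_fields(xml_text, entry_tag, entry_id):
    """Return {child tag: text} for the entry with the given ID, or {} if absent."""
    root = ET.fromstring(xml_text)
    for entry in root.iter(entry_tag):
        if entry.get("ID") == entry_id:
            return {
                child.tag: " ".join("".join(child.itertext()).split())
                for child in entry
            }
    return {}


def changed_fields(old_xml, new_xml, entry_tag, entry_id):
    """Map each differing field to its (old, new) pair; None marks a missing field."""
    old = entry_fields(old_xml, entry_tag, entry_id)
    new = entry_fields(new_xml, entry_tag, entry_id)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Hypothetical, heavily simplified stand-ins for two CAPEC releases.
old_xml = """<Catalog><View ID="1000">
  <Name>Mechanisms of Attack</Name>
  <Objective>Old objective text.</Objective>
</View></Catalog>"""
new_xml = """<Catalog><View ID="1000">
  <Name>Mechanisms of Attack</Name>
  <Objective>New objective text.</Objective>
  <Note>Added in the newer version.</Note>
</View></Catalog>"""

diff = changed_fields(old_xml, new_xml, "View", "1000")
print(diff)
```

Unchanged fields such as `Name` are dropped, so only the two versions of each changed field remain for the CSV.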
While parsing the same table in the other Difference Reports, the script must check whether the View has already been added to the CSV and, if so, append the new entry to its existing row under the two corresponding CAPEC versions.
The same must be done for Categories and Attack Patterns as well.
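One way to handle the "already added" check is to keep a dictionary keyed by entry type and ID, so a second report touching the same entry extends its row instead of creating a new one. This is only a sketch; the function name, key layout and sample texts are assumptions.

```python
def record_change(rows, kind, capec_id, old_version, new_version, old_text, new_text):
    """Store both versions of a changed field under their version columns."""
    row = rows.setdefault((kind, capec_id), {})   # reuse the row if the entry was seen before
    row.setdefault(old_version, []).append(old_text)
    row.setdefault(new_version, []).append(new_text)
    return rows


rows = {}
# Two hypothetical difference reports that both touch View 1000.
record_change(rows, "View", "1000", "2.9", "2.10", "Objective: old", "Objective: new")
record_change(rows, "View", "1000", "2.8", "2.9", "Name: older", "Name: old")
record_change(rows, "Category", "100", "2.9", "2.10", "Summary: old", "Summary: new")

for key, row in rows.items():
    print(key, row)
```

Each inner dictionary maps a version column to the texts to place in that cell, so writing the CSV reduces to looking up every header version in every row.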
In [2]:
from IPython.display import Image
Image(filename='../Difference_Report/img/difference_report_samplecsv.PNG')
Out[2]:
While a complex CSV documenting the various changes would provide a good starting point, it would be difficult to read, let alone draw conclusions from.
To visualize the differences, we could use the FoamTree JavaScript file that we used for the CAPEC visualization. This visualization would contain the Attack Patterns that were modified and, inside each Attack Pattern, the CAPEC versions in which that particular CAPEC ID was modified, along with the details of the changes made.
It would look like the following images at Level 1 and Level 2:
In [3]:
from IPython.display import Image
Image(filename='../Difference_Report/img/difference_report_samplevisualization1.PNG')
Out[3]:
In [4]:
from IPython.display import Image
Image(filename='../Difference_Report/img/difference_report_samplevisualization2.PNG')
Out[4]:
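The two-level layout shown above could be prepared in Python as nested groups and serialised to JSON for the visualization. The `groups` / `label` / `weight` layout below is an assumption about the input the FoamTree file expects, and the change details are invented for illustration.

```python
import json


def to_foamtree_groups(changes):
    """Level 1: one group per CAPEC ID. Level 2: one group per version with its change detail."""
    groups = []
    for capec_id, by_version in sorted(changes.items()):
        children = [
            {"label": f"{version}: {detail}", "weight": 1}
            for version, detail in by_version.items()
        ]
        groups.append({"label": capec_id, "weight": len(children), "groups": children})
    return {"groups": groups}


# Hypothetical change log: CAPEC ID -> {version: summary of the change}.
changes = {
    "CAPEC-66": {"2.9": "Description updated", "2.10": "Related weaknesses updated"},
    "CAPEC-7": {"2.10": "Description updated"},
}
data_object = to_foamtree_groups(changes)
print(json.dumps(data_object, indent=2))
```

Weighting each Attack Pattern by its number of modified versions makes frequently changed entries appear larger at Level 1.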
The Difference Reports were not always in their current format. They have gone through many changes and have been streamlined over time. Although the early Difference Reports use a different schema, they can still be parsed.
The following image shows that the old Difference Report does not have separate HTML tables like the new ones; instead, all the entries sit under a single table tag. Each entry is, however, wrapped in its own paragraph tag, which allows us to scrape those tags.
In [5]:
from IPython.display import Image
Image(filename='../Difference_Report/img/difference_report_schema_old.PNG')
Out[5]:
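For the old schema shown above, a sketch of the scraping step could collect the text of every paragraph tag found inside a table. The sample markup and entry texts are assumptions; a paragraph that is never closed is flushed when the next one opens, since `html.parser` does not close tags automatically.

```python
from html.parser import HTMLParser


class OldReportParser(HTMLParser):
    """Collect the text of each <p> that appears inside a <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0
        self.buffer = None     # list of text chunks while inside a <p>, else None
        self.entries = []

    def _flush(self):
        if self.buffer is not None:
            text = " ".join("".join(self.buffer).split())
            if text:
                self.entries.append(text)
            self.buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "p" and self.table_depth:
            self._flush()      # tolerate a previous <p> that was never closed
            self.buffer = []

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag == "table" and self.table_depth:
            self._flush()
            self.table_depth -= 1


# Hypothetical fragment in the old single-table layout.
old_report = """
<p>Text outside the table is ignored.</p>
<table><tr><td>
  <p>CAPEC-66: SQL Injection</p>
  <p>CAPEC-7: Blind SQL Injection
  <p>CAPEC-1000: Mechanisms of Attack</p>
</td></tr></table>
"""
old_parser = OldReportParser()
old_parser.feed(old_report)
print(old_parser.entries)
```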
The new Difference Reports have separate tables for each of the sub-headings in the report itself as displayed here.
In [6]:
from IPython.display import Image
Image(filename='../Difference_Report/img/difference_report_schema_new.PNG')
Out[6]:
Time is an important factor in updating the corpus. It allows us to learn how the various categories have evolved and helps us anticipate or predict future changes. It also allows us to turn these changes into meaningful information, such as which categories have been removed or updated with existing knowledge.
The difference reports document these changes and are crucial to understanding the evolution of the CAPEC Attack Patterns. By analysing these changes, we can understand how the representations and hierarchies around categories have been affected over time. By looking at multiple versions of the same category, we can see which information has stayed fairly constant throughout the changes and which information has been left out. With this information we can predict, to a certain extent, which parameters of a Category or an Attack Pattern are unlikely to change.
Finding the unchanged vocabulary helps us build the corpus with terms that identify a category or an attack pattern over a longer period of time. It also helps ensure that any exploit that is run against the CAPEC corpus is run against the most consistent CAPEC vocabulary, which should increase the efficiency and accuracy of the similarity quotient.
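One simple reading of "unchanged vocabulary" is the set of terms present in every version of an entry's text. The sketch below uses a plain set intersection over lowercase word tokens; the tokenisation rule and the sample descriptions are assumptions, and a real pipeline would likely add stop-word removal or stemming.

```python
import re


def tokens(text):
    """Lowercase alphanumeric word tokens of a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def stable_vocabulary(descriptions):
    """Terms that appear in every version of the description."""
    token_sets = [tokens(text) for text in descriptions]
    return set.intersection(*token_sets) if token_sets else set()


# Hypothetical descriptions of one attack pattern in two CAPEC versions.
descriptions = [
    "An attacker injects SQL through user input.",
    "SQL is injected by an attacker via crafted input.",
]
stable = stable_vocabulary(descriptions)
print(sorted(stable))
```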