In [19]:

    
#output_notebook function must be imported and called in order to output Bokeh figures within the notebook.  
from bokeh.io import output_notebook
output_notebook()

Introduction

The purpose of this notebook is to document research and analysis done on the Common Attack Pattern Enumeration and Classification (CAPEC) for the ultimate goal of creating a corpus for PERCEIVE.

CAPEC presents its data in two formats: the CAPEC website and the CAPEC XML file. The CAPEC 2.9 XML file used here and its accompanying XML Schema documentation are both available for download on the CAPEC website under "Release Downloads." The website is easier to navigate and to make sense of than the XML file: its interface for the Views lets you explore the hierarchical relationships by expanding and collapsing them with the (+) and (-) buttons. For this reason, we use the website to gather general information, but rely on the XML file, which contains the bulk of the information in a convenient machine-readable form, to gather the important information.

After initial examination of the file, we found that the XML contained four root nodes: Views, Categories, Attack Patterns, and Environments. Each of these root nodes contained subnodes, which we refer to as individual Entries. These Entries have identification numbers and contain numerous subnodes of their own, which we call Fields. These Fields contain the organized information regarding the Entry and are the main focus of our investigation.
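
The nesting described above (root nodes, then Entries, then Fields) can be checked with a short traversal. The following is a minimal sketch on a toy document, not the real CAPEC file: the element names, attributes, and IDs in `toy_xml` are illustrative assumptions that only mimic the three-level layout.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the CAPEC XML; element and attribute names are hypothetical.
toy_xml = """
<Catalog>
  <Views><View ID="1000"><Objective/></View></Views>
  <Categories><Category ID="100"><Description/></Category></Categories>
  <Attack_Patterns>
    <Attack_Pattern ID="1"><Description/><Typical_Severity/></Attack_Pattern>
    <Attack_Pattern ID="2"><Description/></Attack_Pattern>
  </Attack_Patterns>
  <Environments><Environment ID="env-Web"/></Environments>
</Catalog>
"""

toy_root = ET.fromstring(toy_xml)

# Each child of the document root is one of the four root nodes,
# each child of a root node is an Entry, and each child of an Entry is a Field.
summary = {}
for root_node in toy_root:
    entries = list(root_node)
    fields = sorted({field.tag for entry in entries for field in entry})
    summary[root_node.tag] = (len(entries), fields)

for name, (n_entries, fields) in summary.items():
    print(name, n_entries, fields)
```

Running the same loop over the parsed CAPEC file (after namespaces are stripped) would give the Entry count and the set of observed Fields per root node.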

The following image attempts to provide an overview of the XML representation of CAPEC. It visualizes the Root Nodes and the observed Fields used within the entries of those Nodes.

Note that Environments is linked to the Attack Execution Flow Field used by Attack Pattern Entries, because this Field uses information defined in the Environments node. The information in Environments does not appear to be used for anything else. [This should be investigated further] After further analysis of the XML file, we found that the four root nodes have a noticeable hierarchical relationship, which is visualized in the diagram linked below.

Motivation

As mentioned previously, the CAPEC website is significantly easier to navigate than the XML file. However, the website does not document the hierarchical rules explicitly. We therefore inferred the hierarchical relationships by observation and created the following diagram to document them and to solidify our understanding of those rules.

CAPEC entries may have relationships among themselves based on Views, which comprise the highest hierarchical level, as well as relationships to other entries in other levels.

The two Views are Mechanisms of Attack and Domains of Attack. Category Entries have MemberOf relationships to these Views and are separated between them depending on whether they pertain to the mechanisms employed in exploiting a vulnerability or to the domains in which the attacks are perpetrated.

Below Category Entries are the Attack Pattern Entries. It is important to note that there are three types of Attack Pattern Entries: Meta, Standard, and Detailed. These three terms refer to the level of abstraction of the particular Attack Pattern Entry.

Meta Attack Pattern Entries are directly below Category Entries in the hierarchy and have the MemberOf relationship to these Categories. As the Categories are ways of sorting Attack Patterns, a given Meta Attack Pattern Entry will have a MemberOf relationship to two Categories, one for each View. Meta Attack Pattern Entries have Child nodes that can be either Standard or Detailed Attack Patterns. These two abstraction types of Attack Patterns do not have a relationship to the Categories. Standard Attack Pattern Entries may also have their own Child, which will always be a Detailed Attack Pattern.
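
The observed rules can be written down as a small lookup table. This is only a sketch of the hierarchy as described in the paragraphs above; the names `ALLOWED_CHILDREN` and `is_valid_link` are hypothetical and do not come from CAPEC or its schema.

```python
# Levels that may sit directly below each level, per the observations above.
ALLOWED_CHILDREN = {
    "View": {"Category"},
    "Category": {"Meta"},
    "Meta": {"Standard", "Detailed"},
    "Standard": {"Detailed"},
    "Detailed": set(),
}

def is_valid_link(parent_level, child_level):
    """Return True if child_level may sit directly below parent_level."""
    return child_level in ALLOWED_CHILDREN.get(parent_level, set())

print(is_valid_link("Meta", "Detailed"))      # True
print(is_valid_link("Category", "Standard"))  # False
```

A table like this could be used to flag relationships in the XML that contradict the observed hierarchy.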

Given that Views and Categories are primarily methods of organizing Attack Patterns, we are specifically interested in the Attack Patterns and the Fields that they contain. To prepare for extracting information from the text within the Attack Pattern Fields, we must first determine which Fields appear most often, whether the most frequent Fields actually contain the most important and relevant information, and how to extract from the XML the information needed to create a corpus.

Parsing the XML File

As noted in the introduction, we must determine which Fields are the most frequently used among Attack Pattern Entries. The following Python script counts how often each Field, as named in the XML's schema documentation, appears as a child of an Attack Pattern Entry and returns the frequencies in a dictionary.

We encountered a special case in the XML representation: the fields Summary and Attack Execution Flow sit under a container, Description, and would not be counted on their own, despite appearing as distinct fields in the HTML representation. Although every Description contains a Summary, not every Description contains an Attack Execution Flow, so counting only Description would misrepresent the data. The script therefore treats the Description tag as a special case and counts its children instead.

This appears to be a singular case, but in case MITRE repeats this format in the future, keep in mind that these special cases will have to be manually added to the script.
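
One way to make such special cases easier to add is to keep the container tags in a set rather than hard-coding them in the counting logic. The sketch below is an assumption about how that could look, not the script used in this notebook; `CONTAINER_TAGS`, `count_fields`, and the toy XML are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Tags whose children are counted instead of the tag itself.
# Only 'Description' is known from the analysis above; extend as needed.
CONTAINER_TAGS = {"Description"}

def count_fields(entries, container_tags=CONTAINER_TAGS):
    """Count Field tags across entries, expanding container tags one level."""
    counts = Counter()
    for entry in entries:
        for field in entry:
            if field.tag in container_tags:
                counts.update(child.tag for child in field)
            else:
                counts[field.tag] += 1
    return counts

# Toy entries (hypothetical content) to exercise the function.
toy_patterns = ET.fromstring("""
<Attack_Patterns>
  <Attack_Pattern>
    <Description><Summary/><Attack_Execution_Flow/></Description>
    <Payload/>
  </Attack_Pattern>
  <Attack_Pattern>
    <Description><Summary/></Description>
  </Attack_Pattern>
</Attack_Patterns>
""")
toy_counts = count_fields(toy_patterns)
print(toy_counts)
```

Adding a new special case then only requires adding its tag name to `CONTAINER_TAGS`.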



In [20]:

    
#Function to count all nodes that are direct children of Attack Pattern entries.  In the future, if there are other cases
#in which the important fields are children of a direct child, a special case should be added like one has been for the
#"Description" field.
def extract_label(node):
    if node.tag == 'Description':  
        for label in node:
            tag = label.tag
            if tag in frequencies:      # if the tag is already in the dictionary
                frequencies[tag] +=1    # add 1 to the count 
            else:
                frequencies[tag] = 1    # else, create an entry in the dictionary, starting the count at 1
    else:    #for all non-special cases
        if node.tag in frequencies:    # if the tag is already in the dictionary
            frequencies[node.tag] += 1 # add 1 to the count 
        else:
            frequencies[node.tag] = 1  # else, create an entry in the dictionary, starting the count at 1

            
import lxml.etree # lxml's etree is used in place of Python's standard ElementTree.
tree = lxml.etree.parse('capec2.9.xml') # Parses the XML file and stores the resulting ElementTree in 'tree'
root = tree.getroot() # Grabs the root element of the tree and places it into 'root' [Both tree and root can be renamed]

# Remove namespaces from XML.  
for elem in root.iter(): # iter() replaces the deprecated getiterator()
    if not hasattr(elem.tag, 'find'): continue  # Skip comments and processing instructions, whose tag is not a string
    i = elem.tag.find('}') # Finds the index of the '}' that closes the XML namespace within the XML tag
    if i >= 0: 
        elem.tag = elem.tag[i+1:] # Starts the tag a character after the '}'

# Count fields using the previously defined extract_label function
frequencies = {}
for AttackPatternEntry in root[2]: # For each Attack Pattern Entry in the Attack Patterns Table (root[2])
    for Field in AttackPatternEntry: # For each Field in the current Attack Pattern entry
        extract_label(Field)

Plotting the Frequencies

To better visualize the counts returned by parsing the XML file, the following script uses the data stored in the dictionary created previously to plot a histogram.

Histogram of Field Frequencies



In [21]:

    
import numpy as np
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import Range1d

data = {}  # Initialize dictionary 
data['Entries'] = frequencies # Place frequency count dictionary into the data dictionary.  More dictionaries may be added later
                              # to create stacked histograms with multiple sets of data.

df_data = pd.DataFrame(data).sort_values(by='Entries', ascending=True)  # Create dataframe from the data dictionary, sorted by count
series = df_data.loc[:,'Entries'] # Generate data series from the dataframe 

p = figure(width=800, y_range=series.index.tolist(), title="Attack Pattern Histogram") # Create figure 

#Set label details for the two axes
p.xaxis.axis_label = 'Frequency'
p.xaxis.axis_label_text_font_size = '10pt'
p.xaxis.major_label_text_font_size = '8pt'

p.yaxis.axis_label = 'Field'
p.yaxis.axis_label_text_font_size = '10pt'
p.yaxis.major_label_text_font_size = '8pt'


# Output horizontal bar chart.  Recent Bokeh releases provide hbar, which draws one bar per category directly.
p.hbar(y=series.index.tolist(), right=series.tolist(), height=0.4)
show(p)

Examination of Frequent Fields

Now that we know the frequencies of the fields in the Attack Pattern table, our next step toward extracting information from the fields is to determine each field's structure and, subsequently, our intended method of text extraction. For this purpose, we set the line of demarcation at 50 instances and investigated the layout of the fields that occur at least 50 times. There were 24 fields that met this criterion and 7 that did not. The 7 fields that occur fewer than 50 times will likely have to be included in our scope of inquiry later, since rarity could indicate greater rather than lesser importance, but for now we have targeted the most voluminous fields. Our findings are reported in the table below. Type names are subject to change, and some minor differences that have little effect on the extraction method have been omitted for initial grouping purposes.

Type	Description	Fields	Example
General Description	Contains one to a few sentences	Injection Vector, Payload, Payload Activation Impact, Examples-Instances, Probing Techniques	Ability to communicate synchronously or asynchronously with server. Optionally, ability to capture output directly through synchronous communication or other method such as FTP.
Table	Contains rows and columns. Columns are qualities, rows are individual items	Content History, Attack Motivation-Consequences, CIA Impact, Related Attack Patterns, Related Weaknesses, Technical Context
Single Word or Two Word Descriptor	Low, Medium, High, or Very High	Typical Severity	Medium
Labeled Descriptor	Phrase label followed by a level descriptor	Attacker Skills or Knowledge Required	Skill or Knowledge Level: Low
Labeled Descriptor with Potential Explanation	Single word label sometimes followed by an explanation (27 instances of the explanation tag in the XML)	Typical Likelihood of Exploit	Likelihood: Low The nature of these type of attacks involve a coordinated effort between well-funded multiple attackers, and sometimes require physical access to successfully complete an attack. As a result these types of attacks are not launched on a large scale against any potential victim, but are typically highly targeted against victims who are often targeted and may have rather sophisticated cyber defenses already in place.
Unordered List	List using bullets	Attack Prerequisites, Methods of Attack, Purposes, Related Security Principles	• Injection • Protocol Manipulation
Citation	Citation format	References	[R.13.2] [REF-3] "Common Weakness Enumeration (CWE)". CWE-20 - Input Validation. Draft. The MITRE Corporation. 2007. http://cwe.mitre.org/data/definitions/20.html.
Unbulleted List with Qualified Entries	List with no bullets, frequently has entries that start with a type.	Solutions and Mitigations	To mitigate this type of an attack, an organization can monitor incoming packets and look for patterns in the TCP traffic to determine if the network is under an attack. The potential target may implement a rate limit on TCP SYN messages which would provide limited capabilities while under attack. OR Design: Limit program privileges, so if metacharacters or other methods circumvent program input validation routines and shell access is attained then it is not running under a privileged account. chroot jails create a sandbox for the application to execute in, making it more difficult for an attacker to elevate privilege even in the case that a compromise has occurred. Implementation: Implement an audit log that is written to a separate host, in the event of a compromise the audit log may be able to provide evidence and details of the compromise.
Numbered List and Tables	Numbers and contains a table for each Attack Step	Attack Execution Flow
Unbulleted List with Single Table	List items are qualifiers. Last tag contains a table	Target Attack Surface
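
The 50-instance cutoff described before the table can be applied directly to a frequency dictionary. This is a minimal sketch; `split_by_threshold` is a hypothetical helper, and the counts in `example_freqs` are made up for illustration and are not the real CAPEC frequencies.

```python
def split_by_threshold(freqs, threshold=50):
    """Split a {field: count} dict into fields at/above and below threshold."""
    frequent = {f: n for f, n in freqs.items() if n >= threshold}
    rare = {f: n for f, n in freqs.items() if n < threshold}
    return frequent, rare

# Made-up counts for illustration only.
example_freqs = {"Field_A": 504, "Field_B": 50, "Field_C": 12}
frequent, rare = split_by_threshold(example_freqs)
print(sorted(frequent))  # ['Field_A', 'Field_B']
print(sorted(rare))      # ['Field_C']
```

Keeping the rare fields in their own dictionary makes it straightforward to revisit them later.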

Investigating the Content Matter of Fields

Since our ultimate goal is to create a corpus from the content of the fields of the Attack Pattern entries, it is important to explore what type of information each field provides. Below is a table that lists the fields that occur at least 50 times, along with a description of the information CAPEC provides in each. Our descriptions are based on the schema documentation. Most entries in this table are taken directly from the documented description with a few minor changes for readability, while others have been reworded.

Field	Description	CAPEC Example
Content History	Identifies the contributor and contributor's comments. Provides a means of contacting the authors and modifiers for clarification, merging contributions, etc.	CAPEC-1
Summary	Provides a summary description of the attack that includes the attack target and sequence of steps	CAPEC-1
Related Attack Patterns	Contains attack patterns that are dependent on or applied in conjunction with the current attack pattern	CAPEC-1
Typical Severity	Reflects the typical severity of an attack on a scale. Used to capture an overall typical average value for the type of attack, understanding that it will not be completely accurate for all attacks.	CAPEC-1
Attack Prerequisites	Describes the conditions that must exist or functionality and characteristics that the target software must have, or behavior it must exhibit for the type of attack to succeed	CAPEC-1
References	Contains one or more references, each of which represents a documentary resource used to develop the definition of the attack pattern. These can provide further reading and insight into the attack pattern	CAPEC-334
Resources Required	Describes the resources (CPU cycles, IP addresses, tools, etc.) needed by an attacker to effectively execute this attack type	CAPEC-1
Solutions and Mitigations	Describes actions or approaches to prevent or mitigate the risk of the attack by improving resilience of the target, reducing the attack surface, or reducing the impact of a successful attack	CAPEC-1
Related Weaknesses	Software weaknesses potentially targeted for exploit by the attack pattern. Specific weaknesses reference CWE.	CAPEC-1
Attack Motivation-Consequences	The specific desired technical results that the attacker is hoping to achieve, which could be leveraged to achieve their end objective	CAPEC-1
Attacker Skills or Knowledge Required	Level of skills or specific knowledge required by an attacker to execute the attack type	CAPEC-1
Injection Vector	The mechanism and format of an input-driven attack of the pattern's type. Considers the attack's grammar, the system's accepted syntax, position of fields, and acceptable ranges of data	CAPEC-10
Payload	Describes code, configuration, or other data to be executed or activated as part of this type of injection-based attack.	CAPEC-10
Typical Likelihood of Exploit	Estimated likelihood of a successful attack, sometimes accompanied by an explanation of the estimate.	CAPEC-1/CAPEC-101
Payload Activation Impact	Describes the impact that the activation of the attack payload for an injection-based attack of this type would typically have on confidentiality, integrity, or availability of the target software	CAPEC-10
Examples-Instances	An example instance details an explanatory example or demonstrative exploit instance of the attack. Used to help the reader understand the nature, context, and variability of the attack in practical/concrete terms	CAPEC-1
Technical Context	The technical context (architectural paradigms, frameworks, platforms, and languages) for which the pattern is applicable	CAPEC-1
Methods of Attack	The defined vectors identifying the mechanisms used in the attack. Can help define applicable attack surface for the attack	CAPEC-1
Purposes	Intended purpose behind the attack pattern relative to a list of attack objectives. Used to capture pattern composability and assist with normalization and classification in the catalog	CAPEC-1
CIA Impact	Typical relative impact of the pattern on Confidentiality, Integrity, and Availability of the targeted software	CAPEC-1
Attack Execution Flow	Composed of Attack Phases, which segment the attack steps into "Explore," "Experiment," and "Exploit."	CAPEC-1
Related Security Principles	Security rules or practices that impede the attack pattern. Defined as rules and standards for good behavior	CAPEC-1
Probing Techniques	Describes methods used to probe and reconnoiter potential vulnerabilities and/or prepare for attack	CAPEC-1
Target Attack Surface	The locations where the attacker interacts with the target system	CAPEC-285
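
Since the corpus is built from the text inside these fields, a first extraction step for the simpler field types could be to flatten a field's nested text into one string. The sketch below assumes that approach; `field_text` and the toy element are hypothetical, and structured fields such as tables would need more specific handling.

```python
import xml.etree.ElementTree as ET

def field_text(field):
    """Join all text inside a Field element, collapsing runs of whitespace."""
    return " ".join(" ".join(field.itertext()).split())

# Toy field with made-up content, for illustration only.
toy_field = ET.fromstring(
    "<Field><Text>Attacker sends\n   crafted input.</Text>"
    "<Text>Server responds.</Text></Field>"
)
print(field_text(toy_field))  # Attacker sends crafted input. Server responds.
```

The same helper works on lxml elements, which also provide `itertext()`.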

Field Groupings

Chronological-based grouping

Undecided - documentation/context-related?

Content History
References
Examples-Instances
Summary
Related Weaknesses
Related Attack Patterns

Prior to Attack

Ungrouped

Related Security Principles

Intent - Fields deal with what the attacker hopes to gain by executing the attack, that is, why they would use the particular attack.

Attack Motivation-Consequences
Purposes

Requirements or Preparatory Steps - Fields deal with the measures taken to prepare for the attack. Includes the necessary skills and resources as well as prep-work. Relevant to matters before the attack.

Resources Required
Attacker Skills or Knowledge Required
Probing Techniques
Typical Likelihood of Exploit

During Attack

Attack Execution Mechanisms and Location -- Fields deal with steps taken during the attack; how and where the attack is executed

Attack Execution Flow
Payload
Methods of Attack
Injection Vector
Target Attack Surface
Technical Context

After Attack

Impact of Successful Attack -- Fields deal with the aftermath/effects of the attack

Payload Activation Impact
Typical Severity
CIA Impact

Mitigation -- Fields deal with steps that can be taken to mitigate the aftermath

Solutions and Mitigations

These first groupings were constructed based on when each field's content comes into play over the course of an attack that uses the pattern. Each group gives details on a different part of the story. However, it may be necessary to consider the actual text used in the different fields and adjust the groupings accordingly. The "groups" below are field pairings based on similar text content (the same words used) between different fields, and they note the specific CAPEC entries used as a basis for the pairings.

Similar text content

Ungrouped

Technical Context
Typical Severity
Content History
Related Attack Patterns
References
Summary (Will co-occur with everything)

Attack Execution Flow
& Methods of Attack - 1
& Solutions and Mitigations - 71
& Attack Prerequisites - 16

Attack Prerequisites
& Solutions and Mitigations
& Methods of Attack - 3

Attacker Skills or Knowledge Required
& Solutions and Mitigations - 51
& Payload Activation Impact - 51
& Attack Motivation-Consequences - 52

CIA Impact
& Payload Activation Impact - 3
& Attack Motivation-Consequences - 3

Examples-Instances
& Methods of Attack - 53
& Resources Required - 4+5

Methods of Attack
& Solutions and Mitigations - 51
& Injection Vector - 51
& Payload - 3

Payload
& Activation Zone - 3
& Attack Motivation-Consequences - 3
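
One possible way to quantify "same words used" between two fields is the Jaccard similarity of their word sets. The pairings above were made by inspection, so the following is only a sketch of how they could be checked; `word_set`, `jaccard`, and the sample strings are hypothetical.

```python
import re

def word_set(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(text_a, text_b):
    """Jaccard similarity of two texts: shared words / all distinct words."""
    a, b = word_set(text_a), word_set(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Made-up field texts, for illustration only.
score = jaccard("validate all user input", "validate input on the server")
print(round(score, 3))  # 0.286
```

A score near 1 means two fields use nearly the same vocabulary in a given entry; a score near 0 means they share few words.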

Co-occurrence between Fields



In [22]:

    
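#Counts co-occurrence between field pairs - this code counts co-occurrence between all pairs of fields.  It currently also counts
#co-occurrence between each field and itself, which can be ignored for now, but should be adjusted in the future.
#Note: only direct children of each entry are compared, so fields nested under Description (e.g. Summary) keep a count of 0.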

co_occur = {}

for keys in frequencies:        #for each field in the frequencies dict
    co_occur[keys] = {}      #create a new dict inside the co-occurrence dict
    for key in frequencies:  #create a key in each individual field dict 
        co_occur[keys][key] = 0  #and set the values to 0

for fields in co_occur:           #for each dict in co_occur
    for test in co_occur[fields]:  #search each individual field in that dict
        for entry in root[2]:  #for each individual entry in the attack patterns table in the XML
            for column in entry: #search each field in that entry
                if column.tag == fields: #if there is a field tag matching the current dict
                    for other in entry: #for each field in the same entry
                        if other.tag == test: #if there is also a field tag matching the current field query
                            co_occur[fields][test] +=1 #add 1 to the count
#print(co_occur)
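
The self-pairing noted in the comments above could be avoided by counting each unordered pair of distinct fields once per entry. The following is a sketch of that alternative on toy data, not a replacement tested against the CAPEC file; `pair_cooccurrence` and the toy XML are hypothetical, and unlike the loop above it counts an entry once even when a field is repeated within it.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET

def pair_cooccurrence(entries):
    """Count, per unordered field pair, the entries that contain both fields."""
    pairs = Counter()
    for entry in entries:
        tags = sorted({field.tag for field in entry})  # unique tags per entry
        pairs.update(combinations(tags, 2))            # distinct pairs only
    return pairs

# Toy entries (hypothetical content); the first repeats a field on purpose.
toy_entries = ET.fromstring("""
<Attack_Patterns>
  <Attack_Pattern><Payload/><Injection_Vector/><Payload/></Attack_Pattern>
  <Attack_Pattern><Payload/><Injection_Vector/><References/></Attack_Pattern>
</Attack_Patterns>
""")
toy_pairs = pair_cooccurrence(toy_entries)
print(toy_pairs[("Injection_Vector", "Payload")])  # 2
```

Because each entry is visited once, this also avoids rescanning the whole table for every pair of fields.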