In [8]:
import lxml.etree
import csv
import os
import pandas as pd
The purpose of this notebook is to build a field parser and extract the content of various fields in the CAPEC v2.11 XML file so that the field content can be directly analyzed and stored in a database. The raw XML file can be downloaded at http://capec.mitre.org/data/archive/capec_v2.11.zip. Building on the CAPEC Introduction notebook, this notebook focuses on the detailed structure under the attack patterns table and on how the parser functions extract the various field formats.
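If the XML file is not already in the working directory, a small download-and-extract step can be run first. This is only a convenience sketch: it assumes the archive is still reachable at the URL above and that it contains capec_v2.11.xml at the top level.
import os
import urllib.request
import zipfile
# Fetch and unzip the CAPEC archive only if the XML file is not already present.
# (Assumes the download URL is reachable and the archive holds capec_v2.11.xml.)
if not os.path.exists('capec_v2.11.xml'):
    urllib.request.urlretrieve('http://capec.mitre.org/data/archive/capec_v2.11.zip',
                               'capec_v2.11.zip')
    with zipfile.ZipFile('capec_v2.11.zip') as archive:
        archive.extractall('.')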
In [4]:
tree = lxml.etree.parse('capec_v2.11.xml')
root = tree.getroot()
# Remove namespaces from XML tags.
for elem in root.getiterator():
    # skip comments and processing instructions, whose tag is not a string
    if not hasattr(elem.tag, 'find'):
        continue
    # find the index of the closing '}' of the namespace part of the tag
    i = elem.tag.find('}')
    if i >= 0:
        # keep only the local name, i.e. everything after the '}'
        elem.tag = elem.tag[i+1:]
Although the fields have been categorized by format in the CAPEC Introduction Notebook, that notebook focuses on how field information is displayed on the website, not on how it is laid out in the raw XML file, so it is of little help when designing a parser.
The following table shows how the fields are grouped when designing the parser functions, based on whether any field content is stored as an XML element attribute and whether the field content can be concatenated and written as one row. Since it is difficult to give these three groups descriptive names, we simply call them A, B, and C. All fields listed in the CAPEC Field Example column can be parsed by the parser functions in this notebook, except Technical_Context and Target_Attack_Surface.
Format | Number of Levels | Content can be concatenated | Information stored as attribute | CAPEC Field Example | Available Parser Function |
---|---|---|---|---|---|
A | 1-4 | Yes | No | Typical_Severity, Typical_Likelihood_of_Exploit, Methods_of_Attack, Resources_Required, Purposes, CIA_Impact, Payload, Activation_Zone, Summary, Attack_Prerequisites, Relevant_Security_Requirements, Related_Security_Principles, Related_Guidelines, Solutions_and_Mitigations, Probing_Techniques, Indicators-Warnings_of_Attack, Payload_Activation_Impact, Technical_Context* | field_parser_with_concatenation |
B | 3-5 | No | No | Attacker_Skills_or_Knowledge_Required, Attack_Motivation-Consequences, Examples-Instances | field_parser_without_concatenation |
C | 3-4 | No | Yes | References, Content_History, Related_Weaknesses, Related_Attack_Patterns, Target_Attack_Surface* | field_parser_without_concatenation |
We will discuss the field structure and the table details below.
1.1 Format A
All fields in Format A share a similar structure; the difference is the number of levels, i.e. the depth at which the content to be parsed sits. Below is the generalized structure for fields that have 4 levels. Probing_Techniques, for example, is a 4-level field in Format A: the content we want is under Entry_Element_Child, so we have to descend four levels of XML elements to reach it. Typical_Severity, on the other hand, has only one level, so its content sits directly under the Target_Field element.
The idea is the same for fields with 2 or 3 levels. Specifically, 2-level fields have Target_Field and Field_Entry, with the content under the Field_Entry element; 3-level fields have Target_Field, Field_Entry, and Entry_Element, with the content under the Entry_Element element.
<Target_Field>
<Field_Entry1>
<Entry_Element>
<Entry_Element_Child>the content function will parse</Entry_Element_Child>
</Entry_Element>
</Field_Entry1>
<Field_Entry2>
<Entry_Element>
<Entry_Element_Child>the content function will parse</Entry_Element_Child>
</Entry_Element>
</Field_Entry2>
...
</Target_Field>
<capec:Probing_Techniques>
<capec:Probing_Technique>
<capec:Description>
<capec:Text>While interacting with a system an attacker would typically investigate for environment variables that can be overwritten. The more a user knows about a system the more likely she will find a vulnerable environment variable.</capec:Text>
</capec:Description>
</capec:Probing_Technique>
<capec:Probing_Technique>
<capec:Description>
<capec:Text>On a web environment, the attacker can read the client side code and search for environment variables that can be overwritten.</capec:Text>
</capec:Description>
</capec:Probing_Technique>
</capec:Probing_Techniques>
<capec:Typical_Severity>High</capec:Typical_Severity>
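To make the level counting concrete, here is a minimal sketch that walks the four levels of Probing_Techniques by hand. It assumes the namespace-stripped root from the cell above and that root[2] is the attack patterns table, as the parser functions below also assume.
# Minimal sketch: manually descending the four levels of Probing_Techniques.
# Assumes the namespace-stripped `root` and that root[2] is the attack patterns table.
attack_pattern_table = root[2]
for field in attack_pattern_table.findall('Attack_Pattern/./Probing_Techniques'):  # level 1: Target_Field
    for probing_technique in field:                                                 # level 2: Field_Entry
        for description in probing_technique:                                       # level 3: Entry_Element
            for text in description:                                                # level 4: Entry_Element_Child
                print(text.text)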
After explaining the number of levels, we now discuss the difference between Format A and Format B, and why the content of Format A fields is concatenated. In the 4-level example shown above, the sentence starting with 'While interacting with a system' and the sentence starting with 'On a web environment' are stored separately under two different Probing_Technique elements, but they come from the same paragraph in the Probing Techniques section of CAPEC-10. Therefore, it makes sense to concatenate the two sentences and output them as a whole.
In summary, when parsing Format A fields with the parser function, the content under different Field_Entry elements, whatever the number of levels, is concatenated and written as one output row. Since the content is merged, each capec_id appears only once in the output CSV file.
Before introducing the parser function, we need a helper that writes the dictionary holding the field content to a CSV file. The function write_dict_to_csv appends the given dictionary to the end of the CSV file; if the file does not exist, it creates the file and writes csv_header as the header row.
In [5]:
def write_dict_to_csv(output_file, csv_header, dict_data):
    '''
    Create a CSV file with headers and write a dictionary;
    if the file already exists, only append the dictionary.
    Args:
        output_file -- name of the output csv file
        csv_header -- the header of the output csv file
        dict_data -- the dictionary that will be written into the CSV file. The number of
                     elements in the dictionary should be equal to or lower than the number
                     of headers of the CSV file.
    Outcome:
        a new csv file with headers and one row that includes the information from the dictionary;
        or an existing CSV file with a new row that includes the information from the dictionary
    '''
    # create the file if it does not exist; if it exists, open it for appending
    with open(output_file, 'a', encoding='UTF-8') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=csv_header, lineterminator='\n')
        # check whether the csv file is empty
        if csv_file.tell() == 0:
            # if empty, write the header and then the dictionary
            writer.writeheader()
            writer.writerow(dict_data)
        else:
            # if not empty, only write the dictionary
            writer.writerow(dict_data)
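As a quick, purely illustrative check of the append behavior (the file name and row values here are made up, not taken from the CAPEC data):
# Illustrative only: hypothetical file name and rows.
demo_header = ['capec_id', 'field', 'typical_severity']
write_dict_to_csv('demo.csv', demo_header,
                  {'capec_id': '1', 'field': 'Typical_Severity', 'typical_severity': 'High'})
# The second call appends a row without rewriting the header.
write_dict_to_csv('demo.csv', demo_header,
                  {'capec_id': '2', 'field': 'Typical_Severity', 'typical_severity': 'Low'})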
Given the target field, the function field_parser_with_concatenation extracts the content within the target field element and writes the capec_id, the field name, and the field content into a CSV file named after the target field.
The function has two parts. The first part generates all possible tags, which become the headers of the output CSV file, by traversing every child element tag under each field entry. This pass matters because once the function has written the header row, it is computationally expensive to edit it later. The second part extracts the content from the nested target field and writes it to the CSV file using write_dict_to_csv.
The following fields have been tested successfully: Typical_Severity, Typical_Likelihood_of_Exploit, Methods_of_Attack, Resources_Required, Purposes, CIA_Impact, Payload, Activation_Zone, Summary, Attack_Prerequisites, Relevant_Security_Requirements, Related_Security_Principles, Related_Guidelines, Solutions_and_Mitigations, Probing_Techniques, Indicators-Warnings_of_Attack, Payload_Activation_Impact.
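One practical consequence of writing the header only once is that csv.DictWriter (used inside write_dict_to_csv) refuses rows containing keys that are not in fieldnames, so every possible column has to be known before the first row goes out. A small standalone illustration, independent of the CAPEC data:
import csv
import io
# Standalone illustration: a row with a key missing from fieldnames raises ValueError.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['capec_id', 'field'], lineterminator='\n')
writer.writeheader()
try:
    writer.writerow({'capec_id': '10', 'field': 'Probing_Techniques',
                     'probing_techniques': 'some parsed text'})
except ValueError as error:
    print(error)  # dict contains fields not in fieldnames: 'probing_techniques'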
In [149]:
def field_parser_with_concatenation(target_field, root):
    '''
    Parse the field from capec_v2.11.xml file and output the information to a csv file.
    Args:
        target_field -- the target field that will be parsed through this function. The format of this arg should be string.
        root -- the root element of the whole parsed tree.
    Outcome:
        a csv file named by the field name. Each row will include the following information:
        - capec_id: The CAPEC identifier
        - field: The name of the target field
        - field content: The text information stored under the target field.
    '''
    # define the path of the target field. Here we select all element nodes whose tag is the target field
    # if the target field is the Summary field, please use the following path instead:
    #target_field_path='Attack_Pattern/Description/'+target_field
    target_field_path='Attack_Pattern/./'+target_field
    # extract the attack pattern table in the XML
    attack_pattern_table = root[2]
    # define the headers
    output_header=['capec_id','field']
    # define the path of the output file
    output_path=target_field+'.csv'
    ### 1. Generate all possible tags (column headers in the csv file) under the target field tree
    # for each target field node
    for field in attack_pattern_table.findall(target_field_path):
        # if there is no content under the target field, go to the next capec_id
        if field.text is None:
            continue
        # extract the content under the field element
        field_content=field.text
        # if there is a field_entry element under target_field // will move to level 2
        if field_content.isspace():
            # for each field entry node under the target field node
            for field_entry in list(field):
                # extract the tag and content information for the field entry
                field_entry_tag=field_entry.tag
                field_entry_content=field_entry.text
                # in case there is an empty element without any content
                if field_entry_content is None:
                    continue
                # if there is no child element under field_entry // stop at level 2
                elif not field_entry_content.isspace():
                    # if the tag is 'Text', replace the tag with the field name
                    if field_entry_tag.lower()=='text':
                        field_entry_tag=target_field
                    # append the tag to the output_header list if it is not already there
                    if field_entry_tag.lower() not in output_header:
                        output_header.append(field_entry_tag.lower())
                # if there is an element under field_entry // will move to level 3
                elif field_entry_content.isspace():
                    # traverse all entry_element nodes under each field entry
                    for entry_element in list(field_entry):
                        # generate the tag and content of each entry_element
                        entry_element_tag=entry_element.tag
                        entry_element_content=entry_element.text
                        # build a distinguishable tag for the content for future use
                        field_element_header=field_entry_tag+'_'+entry_element_tag
                        # append the tag to the output_header list if it is not already there
                        if field_element_header.lower() not in output_header:
                            output_header.append(field_element_header.lower())
                        # if there is no content, move to the next entry element
                        if entry_element_content is None:
                            continue
        # if there is no field_entry element under target_field // stop at level 1
        else:
            if target_field.lower() not in output_header:
                output_header.append(target_field.lower())
    ### 2. Extract the content from the nested target field
    # for each target field node
    for field in attack_pattern_table.findall(target_field_path):
        # if there is no content under the target field, go to the next capec_id
        if field.text is None:
            continue
        # extract capec_id from the attribute of its parent node
        # if the target field is the Summary field, please use the following code to extract capec_id:
        #capec_id=field.getparent().getparent().attrib.get('ID')
        capec_id=field.getparent().attrib.get('ID')
        # the dictionary that will be written to a CSV file
        field_dict=dict()
        field_dict['capec_id']=capec_id
        field_dict['field']=target_field
        field_content=field.text
        # if there is a field_entry element under target_field // will move to level 2
        if field_content.isspace():
            # for each field entry node under the target field node
            for field_entry in list(field):
                # extract the tag and content information for the field entry
                field_entry_tag=field_entry.tag
                field_entry_content=field_entry.text
                # in case there is an empty element without any content
                if field_entry_content is None:
                    continue
                # if there is no node under field_entry // will stop at level 2
                elif not field_entry_content.isspace():
                    if field_entry_tag.lower()=='text':
                        field_entry_tag=target_field
                    # if multiple field entries share the same tag, all content will be concatenated
                    if field_entry_tag.lower() in field_dict:
                        # add the concatenated content into the dictionary
                        field_dict[field_entry_tag.lower()]=field_dict[field_entry_tag.lower()]+';'+field_entry_content.strip()
                    # if not, directly add the field_entry content into the dictionary
                    else:
                        field_dict[field_entry_tag.lower()]=field_entry_content.strip()
                # if there is an element under field_entry // will move to level 3
                elif field_entry_content.isspace():
                    # traverse all entry_element nodes under each field entry
                    for entry_element in list(field_entry):
                        # generate the tag and content of each entry_element
                        entry_element_tag=entry_element.tag
                        entry_element_content=entry_element.text
                        # build a distinguishable tag for the content for future use
                        field_element_header=field_entry_tag+'_'+entry_element_tag
                        # if there is no content, move to the next entry element
                        if entry_element_content is None:
                            continue
                        # if there is no element under entry_element // will stop at level 3
                        if not entry_element_content.isspace():
                            # concatenate all entry element content
                            field_content=field_content.strip()+' '+entry_element_content.strip()
                        # if there is an element under entry_element // will move to level 4
                        else:
                            # traverse all elements under entry_element
                            for entry_element_child in list(entry_element):
                                # extract the content of each element under entry_element
                                entry_element_child_content=entry_element_child.text
                                # concatenate all element content under each entry_element
                                field_content=field_content.strip()+' '+entry_element_child_content.strip()
                        # add the tag and content pair to the output dictionary
                        field_dict[field_element_header.lower()]=field_content.strip()
        # if there is no field_entry element under target_field // will stop at level 1
        else:
            field_dict[target_field.lower()]=field_content.strip()
        # write the dictionary with headers to a CSV file
        write_dict_to_csv(output_path,output_header,field_dict)
In [153]:
#### the target field that can be parsed by this function
## 1 level
severity='Typical_Severity'
## 2 levels:
method_attack='Methods_of_Attack'
resource='Resources_Required'
purpose='Purposes'
impact='CIA_Impact'
payload='Payload'
likelihood_exploit='Typical_Likelihood_of_Exploit'
activation_zone='Activation_Zone'
# Since the Summary field is under Description field,
# please change the target_field path and the way to get capec_id
summary='Summary'
## 3 levels
prerequisite='Attack_Prerequisites'
security_requirement='Relevant_Security_Requirements'
security_principle='Related_Security_Principles'
guideline='Related_Guidelines'
mitigation='Solutions_and_Mitigations'
## 4 levels:
probing='Probing_Techniques'
indicator='Indicators-Warnings_of_Attack'
payload_impact='Payload_Activation_Impact'
# parse the target field
field_parser_with_concatenation(probing,root)
In [154]:
# the output CSV file
field_with_concatenation=pd.read_csv('Probing_Techniques.csv')
field_with_concatenation.head(5)
Out[154]:
Although the fields in Formats B and C have a nested structure similar to the fields in Format A, the difference is that there are multiple bottom-level elements storing different aspects of information, which makes it meaningless to concatenate the parsed content.
Here is the example of the Attacker_Skills_or_Knowledge_Required field under CAPEC-10. Each field_entry element contains two entry_element elements that store the level and the type of the skill or knowledge required for CAPEC-10. It therefore makes no sense to collapse two different attacker skills or knowledge requirements into one row; the CAPEC-10 page leads to the same conclusion.
As a result, when parsing the following example, the parser function separates the two skills and outputs two rows that share the same capec_id but carry different content.
Example for Format B
<capec:Attacker_Skills_or_Knowledge_Required>
<capec:Attacker_Skill_or_Knowledge_Required>
<capec:Skill_or_Knowledge_Level>Low</capec:Skill_or_Knowledge_Level>
<capec:Skill_or_Knowledge_Type>
<capec:Text>An attacker can simply overflow a buffer by inserting a long string into an attacker-modifiable injection vector. The result can be a DoS.</capec:Text>
</capec:Skill_or_Knowledge_Type>
</capec:Attacker_Skill_or_Knowledge_Required>
<capec:Attacker_Skill_or_Knowledge_Required>
<capec:Skill_or_Knowledge_Level>High</capec:Skill_or_Knowledge_Level>
<capec:Skill_or_Knowledge_Type>
<capec:Text>Exploiting a buffer overflow to inject malicious code into the stack of a software system or even the heap can require a higher skill level.</capec:Text>
</capec:Skill_or_Knowledge_Type>
</capec:Attacker_Skill_or_Knowledge_Required>
</capec:Attacker_Skills_or_Knowledge_Required>
Parsing the above example yields two rows that share capec_id 10: one with skill level Low and one with level High, each paired with its own skill-or-knowledge description.
Given the target field, the function field_parser_without_concatenation extracts the content within the target field element and writes the capec_id, the field name, and the field content into a CSV file named after the target field.
Like field_parser_with_concatenation, this function has two parts. The first part generates all possible tags, which become the headers of the output CSV file, by traversing every child element tag under each field entry; this pass matters because once the header row is written it is computationally expensive to edit it later. The second part extracts the content from the nested target field and writes one row per field entry to the CSV file using write_dict_to_csv.
The following fields have been tested successfully: Attacker_Skills_or_Knowledge_Required, Attack_Motivation-Consequences, Examples-Instances.
In [82]:
def field_parser_without_concatenation(target_field, root):
    '''
    Parse the field from capec_v2.11.xml file and output the information to a csv file.
    Args:
        target_field -- the target field that will be parsed through this function. The format of this arg should be string.
        root -- the root element of the whole parsed tree.
    Outcome:
        a csv file named by the field name. Each row will include the following information:
        - capec_id: The CAPEC identifier
        - field: The name of the target field
        - field content: The text information stored under the target field.
    '''
    # define the path of the target field. Here we select all element nodes whose tag is the target field
    target_field_path='Attack_Pattern/./'+target_field
    # extract the attack pattern table in the XML
    attack_pattern_table = root[2]
    # define the headers
    output_header=['capec_id','field']
    # define the path of the output file
    output_path=target_field+'.csv'
    ##### 1. Generate all possible tags (column headers in the csv file) under the target field tree
    # for each target field node
    for field in attack_pattern_table.findall(target_field_path):
        # if there is no content under the target field, go to the next capec_id
        if field.text is None:
            continue
        capec_id=field.getparent().attrib.get('ID')
        # for each field entry, in case there are multiple field entries under the target field node // will move to level 2
        for field_entry in list(field):
            # traverse all entry_element nodes under each field entry // will move to level 3
            for entry_element in list(field_entry):
                # generate the tag and content of each entry_element
                entry_element_tag=entry_element.tag
                entry_element_content=entry_element.text
                # if there is one more element under entry_element // will move to level 4
                if entry_element_content.isspace():
                    # traverse all elements under entry_element
                    for entry_element_child in list(entry_element):
                        # generate the tag and content of each element under entry_element
                        entry_element_child_tag=entry_element_child.tag
                        entry_element_child_content=entry_element_child.text
                        # if there is one more element under entry_element_child // will move to level 5
                        if entry_element_child_content.isspace():
                            # traverse all elements under entry_element_child
                            for entry_element_child_embed in list(entry_element_child):
                                # build a distinguishable tag for the content for future use
                                entry_element_child_embed_tag=entry_element_child_embed.tag
                                field_entry_header=entry_element_child_tag+'_'+entry_element_child_embed_tag
                                # append the tag to the output_header list if it is not already there
                                if field_entry_header.lower() not in output_header:
                                    output_header.append(field_entry_header.lower())
                        # if there is no element under entry_element_child // will stop at level 4
                        else:
                            # build a distinguishable tag for the content for future use
                            field_entry_header=entry_element_tag+'_'+entry_element_child_tag
                            # append the tag to the output_header list if it is not already there
                            if field_entry_header.lower() not in output_header:
                                output_header.append(field_entry_header.lower())
                # if there is no element under entry_element // will stop at level 3
                else:
                    # append the tag to the output_header list if it is not already there
                    if entry_element_tag.lower() not in output_header:
                        output_header.append(entry_element_tag.lower())
    #### 2. Extract the content from the target field
    # for each target field node
    for field in attack_pattern_table.findall(target_field_path):
        # if there is no content under the target field, go to the next capec_id
        if field.text is None:
            continue
        # extract capec_id from the attribute of its parent node
        capec_id=field.getparent().attrib.get('ID')
        # for each field entry, in case there are multiple field entries under the target field node // will move to level 2
        for field_entry in list(field):
            # the dictionary that will be written to a CSV file
            field_entry_dict=dict()
            field_entry_dict['capec_id']=capec_id
            field_entry_dict['field']=target_field
            # traverse all entry_element nodes under each field entry // will move to level 3
            for entry_element in list(field_entry):
                # generate the tag and content of each entry_element
                entry_element_tag=entry_element.tag
                entry_element_content=entry_element.text
                # if there is one more element under entry_element // will move to level 4
                if entry_element_content.isspace():
                    # traverse all elements under each entry_element
                    for entry_element_child in list(entry_element):
                        # generate the tag and content of each element under entry_element
                        entry_element_child_tag=entry_element_child.tag
                        entry_element_child_content=entry_element_child.text
                        # build a distinguishable tag for the content for future use
                        field_entry_header=entry_element_tag+'_'+entry_element_child_tag
                        # if there is one more element under entry_element_child // will move to level 5
                        if entry_element_child_content.isspace():
                            # traverse all elements under each entry_element_child
                            for entry_element_child_embed in list(entry_element_child):
                                # generate the tag and content of each element under entry_element_child
                                entry_element_child_embed_tag=entry_element_child_embed.tag
                                entry_element_child_embed_content=entry_element_child_embed.text
                                # if there is no content, move to the next element
                                if entry_element_child_embed_content is None:
                                    continue
                                # build a distinguishable tag for the content for future use
                                field_entry_header=entry_element_child_tag+'_'+entry_element_child_embed_tag
                                # if multiple elements share the same tag
                                if field_entry_header.lower() in field_entry_dict:
                                    # add the concatenated content into the dictionary
                                    field_entry_dict[field_entry_header.lower()]=field_entry_dict[field_entry_header.lower()]+';'+entry_element_child_embed_content
                                # if not, directly add the content into the dictionary
                                else:
                                    field_entry_dict[field_entry_header.lower()]=entry_element_child_embed_content
                        # if there is no element under entry_element_child // will stop at level 4
                        else:
                            # build a distinguishable tag for the content for future use
                            field_entry_header=entry_element_tag+'_'+entry_element_child_tag
                            # if multiple elements share the same tag
                            if field_entry_header.lower() in field_entry_dict:
                                # add the concatenated content into the dictionary
                                field_entry_dict[field_entry_header.lower()]=field_entry_dict[field_entry_header.lower()]+';'+entry_element_child_content
                            # if not, directly add the entry_element_child content into the dictionary
                            else:
                                field_entry_dict[field_entry_header.lower()]=entry_element_child_content
                # if there is no element under entry_element // will stop at level 3
                else:
                    # if multiple elements share the same tag
                    if entry_element_tag.lower() in field_entry_dict:
                        # add the concatenated content into the dictionary
                        field_entry_dict[entry_element_tag.lower()]=field_entry_dict[entry_element_tag.lower()]+';'+entry_element_content
                    # if not, directly add the entry_element content into the dictionary
                    else:
                        field_entry_dict[entry_element_tag.lower()]=entry_element_content
            # write the dictionary with headers to a CSV file (one row per field entry)
            write_dict_to_csv(output_path,output_header,field_entry_dict)
In [95]:
# the target field that can be parsed by this function
attacker_skill='Attacker_Skills_or_Knowledge_Required'
motivation_outcome='Attack_Motivation-Consequences'
example='Examples-Instances'
# parse the target field
field_parser_without_concatenation(attacker_skill,root)
In [96]:
# the output CSV file
field_without_concatenation=pd.read_csv('Attacker_Skills_or_Knowledge_Required.csv')
field_without_concatenation.head(5)
Out[96]:
The fields in Format C have a structure very similar to the fields in Format B and face the same problem that the content cannot be concatenated. The only difference is that fields in Format C also store information as element attributes.
Here is the example of the Content_History field under CAPEC-10. The content Submission_Source="Internal_CAPEC_Team" and Modification_Source="Internal" is stored as attributes of the Submission and Modification elements, so field_parser_without_concatenation does not capture it. However, if the attribute content can be ignored, field_parser_without_concatenation is able to parse the following Format C fields: References, Content_History, Related_Weaknesses, Related_Attack_Patterns.
<capec:Content_History>
<capec:Submissions>
<capec:Submission Submission_Source="Internal_CAPEC_Team">
<capec:Submitter>CAPEC Content Team</capec:Submitter>
<capec:Submitter_Organization>The MITRE Corporation</capec:Submitter_Organization>
<capec:Submission_Date>2014-06-23</capec:Submission_Date>
</capec:Submission>
</capec:Submissions>
<capec:Modifications>
<capec:Modification Modification_Source="Internal">
<capec:Modifier>CAPEC Content Team</capec:Modifier>
<capec:Modifier_Organization>The MITRE Corporation</capec:Modifier_Organization>
<capec:Modification_Date>2017-01-09</capec:Modification_Date>
<capec:Modification_Comment>Updated Related_Attack_Patterns</capec:Modification_Comment>
</capec:Modification>
</capec:Modifications>
</capec:Content_History>
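If the attribute content were needed as well, the element's attrib mapping could be read alongside its children. The parser functions in this notebook do not do this, but a minimal sketch, assuming the namespace-stripped root and that root[2] is the attack patterns table, might look like:
# Minimal sketch, not part of the parser functions above: read the attributes
# that field_parser_without_concatenation ignores.
# Assumes the namespace-stripped `root` and that root[2] is the attack patterns table.
attack_pattern_table = root[2]
for field in attack_pattern_table.findall('Attack_Pattern/./Content_History'):
    capec_id = field.getparent().attrib.get('ID')
    for submissions in field.findall('Submissions'):
        for submission in submissions.findall('Submission'):
            print(capec_id, submission.attrib.get('Submission_Source'))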
In [106]:
# the target field that can be parsed by this function, if ignoring the attribute content
reference='References'
content_history='Content_History'
weakness='Related_Weaknesses'
attack_pattern='Related_Attack_Patterns'
# parse the target field
field_parser_without_concatenation(content_history,root)
In [108]:
# the output CSV file
field_without_concatenation=pd.read_csv('Content_History.csv')
field_without_concatenation.head(5)
Out[108]: