In [13]:
import lxml.etree
import csv
import os
import pandas as pd
This notebook builds a field parser that extracts the contents of various fields in the CWE 3.0 XML file so that the field content can be analyzed directly and stored in a database. The raw XML file can be downloaded at http://cwe.mitre.org/data/xml/cwec_v3.0.xml.zip. Guided by the CWE Introduction notebook, this notebook focuses on the detailed structure under the Weakness table and on how the parser functions extract two formats of field: fields with no nesting element and fields with a nesting structure.
The overall structure of the CWE XML file is documented in the CWE Introduction notebook, but that notebook is built on version 2.9. The following differences in the Weakness table between version 2.9 and 3.0 can therefore be observed:
In [3]:
tree = lxml.etree.parse('cwec_v3.0.xml')
root = tree.getroot()
# Remove namespaces from XML.
for elem in root.iter():  # iter() replaces the deprecated getiterator()
    if not hasattr(elem.tag, 'find'): continue  # skip comments and processing instructions, whose tag is not a string
    i = elem.tag.find('}')  # index of the '}' that closes the XML namespace within the tag (-1 if there is none)
    if i >= 0:
        elem.tag = elem.tag[i+1:]  # keep only the part of the tag after the '}'
In [4]:
for table in root:
    print(table.tag)
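To see the effect of the namespace removal in isolation, here is a minimal sketch on a tiny inline document. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies; the sample XML, the namespace URI, and the helper name strip_namespaces are invented for illustration and are not part of the CWE file or of this notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the namespace URI is made up for illustration.
SAMPLE_XML = """<Weakness_Catalog xmlns="http://example.org/cwe-sample">
  <Weaknesses>
    <Weakness ID="1004"><Description>Sample text.</Description></Weakness>
  </Weaknesses>
  <Categories/>
</Weakness_Catalog>"""


def strip_namespaces(root):
    """Rewrite every '{uri}Tag' as 'Tag' in place and return the root."""
    for elem in root.iter():
        # Comments and processing instructions do not carry a string tag.
        if not isinstance(elem.tag, str):
            continue
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
    return root


sample_root = ET.fromstring(SAMPLE_XML)
tags_before = [child.tag for child in sample_root]
strip_namespaces(sample_root)
tags_after = [child.tag for child in sample_root]
print(tags_before)  # tags still carry the '{uri}' prefix
print(tags_after)   # bare tag names
```

Once the prefix is gone, paths such as 'Weakness/Description' can be written without repeating the namespace in every step.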
Although there are various kinds of field, in general, there are only three ways to store the field information in the CWE XML file: 1) fields with no nesting element, 2) fields with nesting element, 3) fields with attribute information.
Format | CWE Field Example |
---|---|
Fields with no nesting element | Description, Extended_Description, Likelihood_Of_Exploit, Background_Details |
Fields with nesting element | Potential_Mitigations, Weakness_Ordinalities, Common_Consequences, Alternate_Terms, Modes_Of_Introduction, Affected_Resources, Observed_Examples, Functional_Areas, Content_History, Detection_Methods |
Fields with attribute information | Demonstrative_Examples, Taxonomy_Mappings, Applicable_Platforms, References, Related_Attack_Patterns |
We will discuss the detailed structure of the first two types of field, and how to parse them, below.
Typically, the fields in this format keep all of their information directly under the field element, without any nesting structure or attribute. For example, Description and Extended_Description are fields in this format. There is no further nesting structure under the field element, so the element cannot be expanded (there is no plus sign on its left).
However, when parsing Extended_Description in cwe-1007, there are nested HTML elements under the Extended_Description element. In this case, we remove the HTML tags and concatenate the contents of the separate HTML elements.
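The tag removal and concatenation described above can be sketched as follows. This is only an illustration under stated assumptions: the two XML fragments are invented, the helper name flatten_text is hypothetical, and the standard library's xml.etree.ElementTree stands in for lxml (itertext() behaves the same way in both).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments that mimic the two layouts described above.
PLAIN = "<Extended_Description>Plain text only.</Extended_Description>"
NESTED = """<Extended_Description>
  <p>First paragraph.</p>
  <ul><li>Point one.</li><li>Point two.</li></ul>
</Extended_Description>"""


def flatten_text(field):
    """Join every non-blank text fragment under `field`, dropping the tags."""
    fragments = (text.strip() for text in field.itertext())
    return ' '.join(fragment for fragment in fragments if fragment)


plain_text = flatten_text(ET.fromstring(PLAIN))
nested_text = flatten_text(ET.fromstring(NESTED))
print(plain_text)
print(nested_text)
```

Because itertext() also yields the text of the field element itself, the same helper covers both the general case and the nested-HTML case.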
General case:
HTML elements under Extended_Description:
Before introducing the parser function, we need a function that can write the dictionary that stores the field content to a CSV file. Function write_dict_to_csv will append the given dictionary to the end of the CSV file. If the file does not exist, the function will create a CSV file and take the csv_header as the header of this CSV file.
In [5]:
def write_dict_to_csv(output_file, csv_header, dict_data):
    '''
    Create a CSV file with headers and write a dictionary;
    if the file already exists, only append the dictionary.
    Args:
        output_file -- name of the output CSV file
        csv_header -- the header of the output CSV file.
        dict_data -- the dictionary that will be written into the CSV file. The number of
                     elements in the dictionary should be equal to or lower than the number of
                     headers of the CSV file.
    Outcome:
        a new CSV file with headers and one row that includes the information from the dictionary;
        or an existing CSV file with a new row that includes the information from the dictionary
    '''
    # create the file if it does not exist; if it exists, open it for appending
    with open(output_file, 'a') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=csv_header, lineterminator='\n')
        # check whether the CSV file is empty
        if csv_file.tell() == 0:
            # if empty, write the header before the first row
            writer.writeheader()
        # in both cases, write the dictionary as a new row
        writer.writerow(dict_data)
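The append behavior can be checked with a small, self-contained sketch. The helper name append_row, the temporary file, and the row values are hypothetical; the helper repeats the same logic as write_dict_to_csv so the block runs on its own.

```python
import csv
import os
import tempfile


def append_row(path, header, row):
    """Append `row` to the CSV at `path`, writing the header only for an empty file."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'a', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=header, lineterminator='\n')
        if csv_file.tell() == 0:
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    demo_header = ['cwe_id', 'field', 'description_content']
    append_row(demo_path, demo_header,
               {'cwe_id': '1004', 'field': 'Description', 'description_content': 'first'})
    # A key missing from the row is written as an empty cell.
    append_row(demo_path, demo_header, {'cwe_id': '1007', 'field': 'Description'})
    with open(demo_path, newline='') as csv_file:
        demo_lines = csv_file.read().splitlines()

print(demo_lines)
```

Note that because the file is opened in append mode, running a parser twice on the same field adds the rows a second time; deleting the old CSV file first avoids duplicates.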
Given the target field, function no_nesting_field_parser extracts the contents within the target field element and writes cwe_id and the content into a CSV file named after the target field. Each row in the output CSV file will include the following information:
The following fields have been tested successfully: Description, Extended_Description, Likelihood_Of_Exploit, Background_Details.
In [10]:
def no_nesting_field_parser(target_field, root):
    '''
    Parse a field with no nesting element from the cwec_v3.0.xml file and output the information to a CSV file.
    Args:
        target_field -- the target field that will be parsed through this function. The format of this arg should be string.
        root -- the root element of the whole parsed tree.
    Outcome:
        a CSV file named after the field. Each row will include the following information:
        - cwe_id: The CWE identifier
        - field: The name of the target field
        - (field name)_content: The text information stored under the target field. The header varies depending on field.
          For example, the header will be 'description_content' if parsing 'Description' field
    '''
    # define the path of the target field. Here we select all element nodes whose tag is the target field
    target_field_path = 'Weakness/./' + target_field
    # extract the weakness table in the XML
    weakness_table = root[0]
    # define the headers
    field_header = target_field.lower() + '_content'
    output_header = ['cwe_id', 'field', field_header]
    # define the path of the output file
    output_path = target_field + '.csv'
    # for each target field node
    for field in weakness_table.findall(target_field_path):
        # extract cwe_id from the parent node of the target field node
        cwe_id = field.getparent().attrib.get('ID')
        # extract the content under the target field
        field_entry_content = field.text
        # in case the text sits in nested html tags under the field
        if field_entry_content is None or field_entry_content.isspace():
            # extract the text under every html tag and concatenate it
            # (the original loop kept only the last child and then added a string to an element, which raises TypeError)
            field_entry_content = ' '.join(text.strip() for text in field.itertext() if text.strip())
        # build the dictionary that is used to write
        field_entry_dict = dict()
        field_entry_dict['cwe_id'] = cwe_id
        field_entry_dict['field'] = target_field
        field_entry_dict[field_header] = field_entry_content.strip()
        # write the dictionary with headers to a CSV file
        write_dict_to_csv(output_path, output_header, field_entry_dict)
In [12]:
des='Description'
extended_des='Extended_Description'
likelihood='Likelihood_Of_Exploit'
background='Background_Details'
no_nesting_field_parser(des,root)
After running the code above, a file named 'Description.csv' should be created in the same directory as this notebook. To parse another field, change the name of the target field passed to the function.
In [16]:
no_nesting_field=pd.read_csv('Description.csv')
no_nesting_field.head(5)
Out[16]:
Typically, the fields in this format have a nested structure under the target field element. To understand the nesting structure, we use the Common_Consequences field in cwe-1004 as the example. Under the Common_Consequences element, there are two field entries named 'Consequence', which represent two individual consequences associated with the weakness. Under each Consequence element, there are three entry elements (Scope, Impact, and Note), which hold the contents that our parser is intended to extract.
General Case :
To understand the structure and the variable naming in the coding part, I generalized the structure of the fields in this format. Here is the general format:
<Target_Field>
    <Field_Entry1>
        <Entry_Element1> the content function will parse</Entry_Element1>
        <Entry_Element2> the content function will parse</Entry_Element2>
        <Entry_Element3> the content function will parse</Entry_Element3>
        <Entry_Element4> the content function will parse</Entry_Element4>
        ...
    </Field_Entry1>
    <Field_Entry2>
        <Entry_Element1> the content function will parse</Entry_Element1>
        <Entry_Element2> the content function will parse</Entry_Element2>
        <Entry_Element3> the content function will parse</Entry_Element3>
        <Entry_Element4> the content function will parse</Entry_Element4>
        ...
    </Field_Entry2>
    ...
</Target_Field>
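As a rough illustration of how this general format maps to rows, the sketch below walks a tiny, invented Common_Consequences fragment and builds one dictionary per field entry. The XML values are made up, namespaces are assumed to be removed already, and the standard library's xml.etree.ElementTree is used in place of lxml to keep the block dependency-free.

```python
import xml.etree.ElementTree as ET

# Hypothetical weakness with namespaces already removed; the values are invented.
SAMPLE = """<Weakness ID="1004">
  <Common_Consequences>
    <Consequence>
      <Scope>Confidentiality</Scope>
      <Impact>Read Application Data</Impact>
      <Note>First note.</Note>
    </Consequence>
    <Consequence>
      <Scope>Integrity</Scope>
      <Impact>Gain Privileges or Assume Identity</Impact>
    </Consequence>
  </Common_Consequences>
</Weakness>"""

weakness = ET.fromstring(SAMPLE)
rows = []
for target_field in weakness.findall('Common_Consequences'):
    for field_entry in target_field:               # one row per <Consequence>
        row = {'cwe_id': weakness.get('ID'), 'field': target_field.tag}
        for entry_element in field_entry:          # <Scope>, <Impact>, <Note>
            row[entry_element.tag.lower()] = (entry_element.text or '').strip()
        rows.append(row)

for row in rows:
    print(row)
```

Each field entry becomes one row, and each entry element becomes one column; the second row has no 'note' key because that entry has no Note element.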
Here are two special cases when parsing the nesting fields.
For example, a consequence of a weakness may have only one impact and one note but multiple scopes. In this case, the parser extracts and concatenates the contents that share the same tag under an individual field entry element.
In some entries, the content we aim to extract is stored in HTML elements, such as li, div, ul, and p. In this case, the parser extracts and concatenates the content of the HTML elements under the same entry element, and then takes the tag for that content from the parent element of the HTML elements.
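Both special cases can be sketched on one invented field entry. Note the swap in technique: the parser in this notebook climbs from an HTML element to its parent with lxml's getparent(), whereas this sketch collects the text per direct child with itertext(), which leads to the same grouping and also runs on the standard library's ElementTree. The helper name parse_entry and the sample values are hypothetical; the ';' separator matches the one used in nesting_field_parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical field entry showing both special cases; the values are invented.
SAMPLE = """<Consequence>
  <Scope>Confidentiality</Scope>
  <Scope>Integrity</Scope>
  <Impact>Read Application Data</Impact>
  <Note>
    <p>Intro sentence.</p>
    <ul><li>First item.</li><li>Second item.</li></ul>
  </Note>
</Consequence>"""


def parse_entry(field_entry, separator=';'):
    """Collect text per direct child tag, merging repeats and nested HTML text."""
    merged = {}
    for entry_element in field_entry:
        key = entry_element.tag.lower()
        # itertext() walks into li/p/ul children, so their text lands under the parent's tag.
        pieces = [t.strip() for t in entry_element.itertext() if t.strip()]
        if not pieces:
            continue
        merged.setdefault(key, []).extend(pieces)
    return {key: separator.join(pieces) for key, pieces in merged.items()}


entry = parse_entry(ET.fromstring(SAMPLE))
print(entry)
```

The two Scope elements end up in a single 'scope' value, and the text inside p and li is stored under 'note' rather than under the HTML tag names.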
Given the target field, function nesting_field_parser extracts the content within the target field element and writes cwe_id and the content into a CSV file named after the target field. Each row in the output CSV file will include the following information:
Function nesting_field_parser has two parts. The first part generates all possible tags, which become the headers of the output CSV file, by traversing all child element tags under each field entry. The first part is essential because once the function has written the headers, editing the first row later is expensive: we would have to read the whole original file and rewrite it to a new file. The function excludes all HTML tags, such as li, div, ul, and p, because these tags carry no field meaning and are repetitive. The second part extracts the content from the nesting target field and then writes it to a CSV file by using function write_dict_to_csv.
The following fields have been tested successfully: Potential_Mitigations, Weakness_Ordinalities, Common_Consequences, Alternate_Terms, Modes_Of_Introduction, Affected_Resources, Observed_Examples, Functional_Areas, Content_History, Detection_Methods.
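The reason for collecting the headers before writing can be shown with a small two-pass sketch. The rows below are invented stand-ins for parsed field entries, and the CSV is written to an in-memory buffer instead of a file; only the csv module from the standard library is used.

```python
import csv
import io

# Hypothetical parsed rows; in the notebook these would come from the XML.
parsed_rows = [
    {'cwe_id': '1004', 'field': 'Common_Consequences',
     'scope': 'Confidentiality', 'impact': 'Read Application Data'},
    {'cwe_id': '1007', 'field': 'Common_Consequences',
     'scope': 'Integrity', 'note': 'Only this row has a note.'},
]

# Pass 1: collect the union of keys in first-seen order.
header = ['cwe_id', 'field']
for row in parsed_rows:
    for key in row:
        if key not in header:
            header.append(key)

# Pass 2: write every row against the complete header.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header, lineterminator='\n')
writer.writeheader()
writer.writerows(parsed_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

If the header were fixed after the first row only, csv.DictWriter would raise a ValueError on the second row because 'note' is not among its fieldnames; knowing the full header up front avoids that, and missing keys are simply written as empty cells.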
In [26]:
def nesting_field_parser(target_field, root):
    '''
    Parse a field with nested elements from the cwec_v3.0.xml file and output the information to a CSV file.
    The following fields have been tested successfully:
    - Potential_Mitigations, Weakness_Ordinalities
    - Common_Consequences, Alternate_Terms
    - Modes_Of_Introduction, Affected_Resources
    - Observed_Examples, Functional_Areas
    - Content_History, Detection_Methods
    Args:
        target_field -- the target field that will be parsed through this function. The format of this arg should be string.
        root -- the root element of the parsed tree.
    Outcome:
        a CSV file named after the field. Each row will include the following headers:
        - cwe_id: The CWE identifier
        - field: The name of the target field
        - tags under the field node, excluding all html tags (li, div, ul, and p).
    '''
    # define the path of the target field. Here we select all element nodes whose tag is the target field
    target_field_path = 'Weakness/./' + target_field
    # extract the weakness table in the XML
    weakness_table = root[0]
    # define the headers
    output_header = ['cwe_id', 'field']
    # define the path of the output file
    output_path = target_field + '.csv'
    # html tags that never become column headers
    html_tags = ('li', 'div', 'p', 'ul')
    ### 1. Generate all possible tags (column headers in the CSV file) under the target field tree
    # for each target field node
    for field in weakness_table.findall(target_field_path):
        # for each field entry, in case there are multiple field entries under the target field node
        for field_entry in list(field):
            # traverse all entry_element nodes under each field entry
            for entry_element in field_entry.iter():
                # generate tag and content of each entry_element
                entry_element_tag = entry_element.tag
                entry_element_content = entry_element.text
                # skip nodes without text of their own, such as the field entry node itself,
                # which .iter() returns together with its entry_element nodes
                if entry_element_content is None or entry_element_content.isspace():
                    continue
                # exclude all html tags, such as li, div, ul, p
                if entry_element_tag in html_tags:
                    continue
                # append the tag to the output_header list if it does not exist in the list
                if entry_element_tag.lower() not in output_header:
                    output_header.append(entry_element_tag.lower())
    ### 2. Extract the content from the nesting target field
    # for each target field node
    for field in weakness_table.findall(target_field_path):
        # extract cwe_id from the attribute of its parent node
        cwe_id = field.getparent().attrib.get('ID')
        # for each field entry node under the target field node
        for field_entry in list(field):
            # the dictionary that will be written to a CSV file
            entry_element_dict = dict()
            entry_element_dict['cwe_id'] = cwe_id
            entry_element_dict['field'] = target_field
            # traverse all entry_element nodes under each field entry
            for entry_element in field_entry.iter():
                # generate tag and content of each entry_element
                entry_element_tag = entry_element.tag
                entry_element_content = entry_element.text
                # skip nodes without text of their own, such as the field entry node itself
                if entry_element_content is None or entry_element_content.isspace():
                    continue
                # if the tag is an html tag, such as li, div, p, and ul, replace it with the tag of the nearest ancestor that is a header
                while entry_element_tag.lower() not in output_header:
                    entry_element = entry_element.getparent()
                    entry_element_tag = entry_element.tag.lower()
                # if multiple entry_element nodes use the same tag, concatenate their content
                if entry_element_tag.lower() in entry_element_dict:
                    # add the concatenated content into the dictionary
                    entry_element_dict[entry_element_tag.lower()] = entry_element_dict[entry_element_tag.lower()] + ';' + entry_element_content
                # if not, directly add the entry_element content into the dictionary
                else:
                    entry_element_dict[entry_element_tag.lower()] = entry_element_content
            # write the dictionary with headers to a CSV file
            write_dict_to_csv(output_path, output_header, entry_element_dict)
In [19]:
mitigation="Potential_Mitigations"
consequence='Common_Consequences'
mode='Modes_Of_Introduction'
example='Observed_Examples'
content='Content_History'
weakness='Weakness_Ordinalities'
detection='Detection_Methods'
term='Alternate_Terms'
resources='Affected_Resources'
function_area='Functional_Areas'
nesting_field_parser(consequence, root)
After running the code above, a file named 'Common_Consequences.csv' should be created in the same directory as this notebook. To parse another field, change the name of the target field passed to the function.
In [22]:
nesting_field=pd.read_csv('Common_Consequences.csv')
nesting_field.head(5)
Out[22]:
Typically, the fields in this format store information not only in the element content but also in element attributes. For example, as the screenshot below shows, the attributes of Example_Code under the Demonstrative_Examples field store the nature and the language of the example code. If the information stored in the attributes can be ignored, function nesting_field_parser also works for the fields in this format.
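When the attribute information should be kept, one possible approach is to copy each attribute into its own column next to the element text. The sketch below is not part of this notebook's parsers: the XML fragment is invented, the attribute names Nature and Language are assumed from the description above, and the column naming scheme (tag plus attribute name) is a hypothetical choice. It uses the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the attribute names Nature and Language are assumed here.
SAMPLE = """<Demonstrative_Examples>
  <Demonstrative_Example>
    <Intro_Text>Invented introduction.</Intro_Text>
    <Example_Code Nature="bad" Language="Java">String s = null;</Example_Code>
  </Demonstrative_Example>
</Demonstrative_Examples>"""

records = []
for example_code in ET.fromstring(SAMPLE).iter('Example_Code'):
    record = {'example_code': (example_code.text or '').strip()}
    # Element.attrib is a plain dict of attribute name -> value.
    for name, value in example_code.attrib.items():
        record['example_code_' + name.lower()] = value
    records.append(record)

print(records)
```

Prefixing the attribute columns with the element tag keeps them distinct when several elements in the same field entry carry attributes with the same name.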