In [8]:
import lxml.etree
import csv
import os
import pandas as pd
The purpose of this notebook is to build a field parser and extract the content of various fields in the CAPEC v2.11 XML file so that the field content can be directly analyzed and stored in a database. The raw XML file can be downloaded at http://capec.mitre.org/data/archive/capec_v2.11.zip. Building on the CAPEC Introduction notebook, this notebook focuses on the detailed structure under the attack patterns table and on how the parser functions extract the various field formats.
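If the XML file is not already in the working directory, a small download-and-extract step can be run first. This is only a convenience sketch: it assumes the archive is still reachable at the URL above and that it contains capec_v2.11.xml at the top level.
import os
import urllib.request
import zipfile
# Fetch and unzip the CAPEC archive only if the XML file is not already present.
# (Assumes the download URL is reachable and the archive holds capec_v2.11.xml.)
if not os.path.exists('capec_v2.11.xml'):
    urllib.request.urlretrieve('http://capec.mitre.org/data/archive/capec_v2.11.zip',
                               'capec_v2.11.zip')
    with zipfile.ZipFile('capec_v2.11.zip') as archive:
        archive.extractall('.')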
In [4]:
tree = lxml.etree.parse('capec_v2.11.xml')
root = tree.getroot()
# Remove namespaces from XML tags.
for elem in root.getiterator():
    # skip comments and processing instructions, whose tag is not a string
    if not hasattr(elem.tag, 'find'):
        continue
    # find the index of the closing '}' of the namespace part of the tag
    i = elem.tag.find('}')
    if i >= 0:
        # keep only the local name, i.e. everything after the '}'
        elem.tag = elem.tag[i+1:]
Although the fields have been categorized by format in the CAPEC Introduction Notebook, that notebook focuses on how field information is displayed on the website, not on how it is laid out in the raw XML file, so it is of little help when designing a parser.
The following table shows how the fields are grouped when designing the parser functions, based on whether any field content is stored as an XML element attribute and whether the field content can be concatenated and written as one row. Since it is difficult to give these three groups descriptive names, we simply call them A, B, and C. All fields listed in the CAPEC Field Example column can be parsed by the parser functions in this notebook, except Technical_Context and Target_Attack_Surface.
Format | Number of Levels | Content can be concatenated | Information stored as attribute | CAPEC Field Example | Available Parser Function |
---|---|---|---|---|---|
A | 1-4 | Yes | No | Typical_Severity, Typical_Likelihood_of_Exploit, Methods_of_Attack, Resources_Required, Purposes, CIA_Impact, Payload, Activation_Zone, Summary, Attack_Prerequisites, Relevant_Security_Requirements, Related_Security_Principles, Related_Guidelines, Solutions_and_Mitigations, Probing_Techniques, Indicators-Warnings_of_Attack, Payload_Activation_Impact, Technical_Context* | field_parser_with_concatenation |
B | 3-5 | No | No | Attacker_Skills_or_Knowledge_Required, Attack_Motivation-Consequences, Examples-Instances | field_parser_without_concatenation |
C | 3-4 | No | Yes | References, Content_History, Related_Weaknesses, Related_Attack_Patterns, Target_Attack_Surface* | field_parser_without_concatenation |
We will discuss the field structure and the table details below.
1.1 Format A
All fields in Format A share a similar structure; the difference is the number of levels, i.e. the depth at which the content to be parsed sits. Below is the generalized structure for fields that have 4 levels. Probing_Techniques, for example, is a 4-level field in Format A: the content we want is under Entry_Element_Child, so we have to descend four levels of XML elements to reach it. Typical_Severity, on the other hand, has only one level, so its content sits directly under the Target_Field element.
The idea is the same for fields with 2 or 3 levels. Specifically, 2-level fields have Target_Field and Field_Entry, with the content under the Field_Entry element; 3-level fields have Target_Field, Field_Entry, and Entry_Element, with the content under the Entry_Element element.
<Target_Field>
<Field_Entry1>
<Entry_Element>
<Entry_Element_Child>the content function will parse</Entry_Element_Child>
</Entry_Element>
</Field_Entry1>
<Field_Entry2>
<Entry_Element>
<Entry_Element_Child>the content function will parse</Entry_Element_Child>
</Entry_Element>
</Field_Entry2>
...
</Target_Field>
<capec:Probing_Techniques>
<capec:Probing_Technique>
<capec:Description>
<capec:Text>While interacting with a system an attacker would typically investigate for environment variables that can be overwritten. The more a user knows about a system the more likely she will find a vulnerable environment variable.</capec:Text>
</capec:Description>
</capec:Probing_Technique>
<capec:Probing_Technique>
<capec:Description>
<capec:Text>On a web environment, the attacker can read the client side code and search for environment variables that can be overwritten.</capec:Text>
</capec:Description>
</capec:Probing_Technique>
</capec:Probing_Techniques>
<capec:Typical_Severity>High</capec:Typical_Severity>
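To make the level counting concrete, here is a minimal sketch that walks the four levels of Probing_Techniques by hand. It assumes the namespace-stripped root from the cell above and that root[2] is the attack patterns table, as the parser functions below also assume.
# Minimal sketch: manually descending the four levels of Probing_Techniques.
# Assumes the namespace-stripped `root` and that root[2] is the attack patterns table.
attack_pattern_table = root[2]
for field in attack_pattern_table.findall('Attack_Pattern/./Probing_Techniques'):  # level 1: Target_Field
    for probing_technique in field:                                                 # level 2: Field_Entry
        for description in probing_technique:                                       # level 3: Entry_Element
            for text in description:                                                # level 4: Entry_Element_Child
                print(text.text)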
After explaining the number of levels, we now discuss the difference between Format A and Format B, and why the content of Format A fields is concatenated. In the 4-level example shown above, the sentence starting with 'While interacting with a system' and the sentence starting with 'On a web environment' are stored separately under two different Probing_Technique elements, but they come from the same paragraph in the Probing Techniques section of CAPEC-10. Therefore, it makes sense to concatenate the two sentences and output them as a whole.
In summary, when parsing Format A fields with the parser function, the content under different Field_Entry elements, whatever the number of levels, is concatenated and written as one output row. Since the content is merged, each capec_id appears only once in the output CSV file.
Before introducing the parser function, we need a helper that writes the dictionary holding the field content to a CSV file. The function write_dict_to_csv appends the given dictionary to the end of the CSV file; if the file does not exist, it creates the file and writes csv_header as the header row.
In [5]:
def write_dict_to_csv(output_file, csv_header, dict_data):
    '''
    Create a CSV file with headers and write a dictionary;
    if the file already exists, only append the dictionary.
    Args:
        output_file -- name of the output csv file
        csv_header -- the header of the output csv file
        dict_data -- the dictionary that will be written into the CSV file. The number of
                     elements in the dictionary should be equal to or lower than the number
                     of headers of the CSV file.
    Outcome:
        a new csv file with headers and one row that includes the information from the dictionary;
        or an existing CSV file with a new row that includes the information from the dictionary
    '''
    # create the file if it does not exist; if it exists, open it for appending
    with open(output_file, 'a', encoding='UTF-8') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=csv_header, lineterminator='\n')
        # check whether the csv file is empty
        if csv_file.tell() == 0:
            # if empty, write the header and then the dictionary
            writer.writeheader()
            writer.writerow(dict_data)
        else:
            # if not empty, only write the dictionary
            writer.writerow(dict_data)
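As a quick, purely illustrative check of the append behavior (the file name and row values here are made up, not taken from the CAPEC data):
# Illustrative only: hypothetical file name and rows.
demo_header = ['capec_id', 'field', 'typical_severity']
write_dict_to_csv('demo.csv', demo_header,
                  {'capec_id': '1', 'field': 'Typical_Severity', 'typical_severity': 'High'})
# The second call appends a row without rewriting the header.
write_dict_to_csv('demo.csv', demo_header,
                  {'capec_id': '2', 'field': 'Typical_Severity', 'typical_severity': 'Low'})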
Given the target field, the function field_parser_with_concatenation extracts the content within the target field element and writes the capec_id, the field name, and the field content into a CSV file named after the target field.
The function has two parts. The first part generates all possible tags, which become the headers of the output CSV file, by traversing every child element tag under each field entry. This pass matters because once the function has written the header row, it is computationally expensive to edit it later. The second part extracts the content from the nested target field and writes it to the CSV file using write_dict_to_csv.
The following fields have been tested successfully: Typical_Severity, Typical_Likelihood_of_Exploit, Methods_of_Attack, Resources_Required, Purposes, CIA_Impact, Payload, Activation_Zone, Summary, Attack_Prerequisites, Relevant_Security_Requirements, Related_Security_Principles, Related_Guidelines, Solutions_and_Mitigations, Probing_Techniques, Indicators-Warnings_of_Attack, Payload_Activation_Impact.
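One practical consequence of writing the header only once is that csv.DictWriter (used inside write_dict_to_csv) refuses rows containing keys that are not in fieldnames, so every possible column has to be known before the first row goes out. A small standalone illustration, independent of the CAPEC data:
import csv
import io
# Standalone illustration: a row with a key missing from fieldnames raises ValueError.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['capec_id', 'field'], lineterminator='\n')
writer.writeheader()
try:
    writer.writerow({'capec_id': '10', 'field': 'Probing_Techniques',
                     'probing_techniques': 'some parsed text'})
except ValueError as error:
    print(error)  # dict contains fields not in fieldnames: 'probing_techniques'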
In [149]:
def field_parser_with_concatenation(target_field, root):
    '''
    Parse the field from capec_v2.11.xml file and output the information to a csv file.
    Args:
        target_field -- the target field that will be parsed through this function. The format of this arg should be string.
        root -- the root element of the whole parsed tree.
    Outcome:
        a csv file named by the field name. Each row will include the following information:
        - capec_id: The CAPEC identifier
        - field: The name of the target field
        - field content: The text information stored under the target field.
    '''
    # define the path of the target field. Here we select all element nodes whose tag is the target field
    # if the target field is the Summary field, please use the following path instead:
    #target_field_path='Attack_Pattern/Description/'+target_field
    target_field_path='Attack_Pattern/./'+target_field
    # extract the attack pattern table in the XML
    attack_pattern_table = root[2]
    # define the headers
    output_header=['capec_id','field']
    # define the path of the output file
    output_path=target_field+'.csv'
    ### 1. Generate all possible tags (column headers in the csv file) under the target field tree
    # for each target field node
    for field in attack_pattern_table.findall(target_field_path):
        # if there is no content under the target field, go to the next capec_id
        if field.text is None:
            continue
        # extract the content under the field element
        field_content=field.text
        # if there is a field_entry element under target_field // will move to level 2
        if field_content.isspace():
            # for each field entry node under the target field node
            for field_entry in list(field):
                # extract the tag and content information for the field entry
                field_entry_tag=field_entry.tag
                field_entry_content=field_entry.text
                # in case there is an empty element without any content
                if field_entry_content is None:
                    continue
                # if there is no child element under field_entry // stop at level 2
                elif not field_entry_content.isspace():
                    # if the tag is 'Text', replace the tag with the field name
                    if field_entry_tag.lower()=='text':
                        field_entry_tag=target_field
                    # append the tag to the output_header list if it is not already there
                    if field_entry_tag.lower() not in output_header:
                        output_header.append(field_entry_tag.lower())
                # if there is an element under field_entry // will move to level 3
                elif field_entry_content.isspace():
                    # traverse all entry_element nodes under each field entry
                    for entry_element in list(field_entry):
                        # generate the tag and content of each entry_element
                        entry_element_tag=entry_element.tag
                        entry_element_content=entry_element.text
                        # build a distinguishable tag for the content for future use
                        field_element_header=field_entry_tag+'_'+entry_element_tag
                        # append the tag to the output_header list if it is not already there
                        if field_element_header.lower() not in output_header:
                            output_header.append(field_element_header.lower())
                        # if there is no content, move to the next entry element
                        if entry_element_content is None:
                            continue
        # if there is no field_entry element under target_field // stop at level 1
        else:
            if target_field.lower() not in output_header:
                output_header.append(target_field.lower())
    ### 2. Extract the content from the nested target field
    # for each target field node
    for field in attack_pattern_table.findall(target_field_path):
        # if there is no content under the target field, go to the next capec_id
        if field.text is None:
            continue
        # extract capec_id from the attribute of its parent node
        # if the target field is the Summary field, please use the following code to extract capec_id:
        #capec_id=field.getparent().getparent().attrib.get('ID')
        capec_id=field.getparent().attrib.get('ID')
        # the dictionary that will be written to a CSV file
        field_dict=dict()
        field_dict['capec_id']=capec_id
        field_dict['field']=target_field
        field_content=field.text
        # if there is a field_entry element under target_field // will move to level 2
        if field_content.isspace():
            # for each field entry node under the target field node
            for field_entry in list(field):
                # extract the tag and content information for the field entry
                field_entry_tag=field_entry.tag
                field_entry_content=field_entry.text
                # in case there is an empty element without any content
                if field_entry_content is None:
                    continue
                # if there is no node under field_entry // will stop at level 2
                elif not field_entry_content.isspace():
                    if field_entry_tag.lower()=='text':
                        field_entry_tag=target_field
                    # if multiple field entries share the same tag, all content will be concatenated
                    if field_entry_tag.lower() in field_dict:
                        # add the concatenated content into the dictionary
                        field_dict[field_entry_tag.lower()]=field_dict[field_entry_tag.lower()]+';'+field_entry_content.strip()
                    # if not, directly add the field_entry content into the dictionary
                    else:
                        field_dict[field_entry_tag.lower()]=field_entry_content.strip()
                # if there is an element under field_entry // will move to level 3
                elif field_entry_content.isspace():
                    # traverse all entry_element nodes under each field entry
                    for entry_element in list(field_entry):
                        # generate the tag and content of each entry_element
                        entry_element_tag=entry_element.tag
                        entry_element_content=entry_element.text
                        # build a distinguishable tag for the content for future use
                        field_element_header=field_entry_tag+'_'+entry_element_tag
                        # if there is no content, move to the next entry element
                        if entry_element_content is None:
                            continue
                        # if there is no element under entry_element // will stop at level 3
                        if not entry_element_content.isspace():
                            # concatenate all entry element content
                            field_content=field_content.strip()+' '+entry_element_content.strip()
                        # if there is an element under entry_element // will move to level 4
                        else:
                            # traverse all elements under entry_element
                            for entry_element_child in list(entry_element):
                                # extract the content of each element under entry_element
                                entry_element_child_content=entry_element_child.text
                                # concatenate all element content under each entry_element
                                field_content=field_content.strip()+' '+entry_element_child_content.strip()
                        # add the tag and content pair to the output dictionary
                        field_dict[field_element_header.lower()]=field_content.strip()
        # if there is no field_entry element under target_field // will stop at level 1
        else:
            field_dict[target_field.lower()]=field_content.strip()
        # write the dictionary with headers to a CSV file
        write_dict_to_csv(output_path,output_header,field_dict)
In [153]:
#### the target field that can be parsed by this function
## 1 level
severity='Typical_Severity'
## 2 levels:
method_attack='Methods_of_Attack'
resource='Resources_Required'
purpose='Purposes'
impact='CIA_Impact'
payload='Payload'
likelihood_exploit='Typical_Likelihood_of_Exploit'
activation_zone='Activation_Zone'
# Since the Summary field is under Description field,
# please change the target_field path and the way to get capec_id
summary='Summary'
## 3 levels
prerequisite='Attack_Prerequisites'
security_requirement='Relevant_Security_Requirements'
security_principle='Related_Security_Principles'
guideline='Related_Guidelines'
mitigation='Solutions_and_Mitigations'
## 4 levels:
probing='Probing_Techniques'
indicator='Indicators-Warnings_of_Attack'
payload_impact='Payload_Activation_Impact'
# parse the target field
field_parser_with_concatenation(probing,root)
In [154]:
# the output CSV file
field_with_concatenation=pd.read_csv('Probing_Techniques.csv')
field_with_concatenation.head(5)
Out[154]:
Although the fields in Formats B and C have a nested structure similar to the fields in Format A, the difference is that there are multiple bottom-level elements storing different aspects of information, which makes it meaningless to concatenate the parsed content.
Here is the example of the Attacker_Skills_or_Knowledge_Required field under CAPEC-10. Each field_entry element contains two entry_element elements that store the level and the type of the skill or knowledge required for CAPEC-10. It therefore makes no sense to collapse two different attacker skills or knowledge requirements into one row; the CAPEC-10 page leads to the same conclusion.
As a result, when parsing the following example, the parser function separates the two skills and outputs two rows that share the same capec_id but carry different content.
Example for Format B
<capec:Attacker_Skills_or_Knowledge_Required>
<capec:Attacker_Skill_or_Knowledge_Required>
<capec:Skill_or_Knowledge_Level>Low</capec:Skill_or_Knowledge_Level>
<capec:Skill_or_Knowledge_Type>
<capec:Text>An attacker can simply overflow a buffer by inserting a long string into an attacker-modifiable injection vector. The result can be a DoS.</capec:Text>
</capec:Skill_or_Knowledge_Type>
</capec:Attacker_Skill_or_Knowledge_Required>
<capec:Attacker_Skill_or_Knowledge_Required>
<capec:Skill_or_Knowledge_Level>High</capec:Skill_or_Knowledge_Level>
<capec:Skill_or_Knowledge_Type>
<capec:Text>Exploiting a buffer overflow to inject malicious code into the stack of a software system or even the heap can require a higher skill level.</capec:Text>
</capec:Skill_or_Knowledge_Type>
</capec:Attacker_Skill_or_Knowledge_Required>
</capec:Attacker_Skills_or_Knowledge_Required>
Parsing the above example yields two rows that share capec_id 10: one with skill level Low and one with level High, each paired with its own skill-or-knowledge description.
Given the target field, the function field_parser_without_concatenation extracts the content within the target field element and writes the capec_id, the field name, and the field content into a CSV file named after the target field.
Like field_parser_with_concatenation, this function has two parts. The first part generates all possible tags, which become the headers of the output CSV file, by traversing every child element tag under each field entry; this pass matters because once the header row is written it is computationally expensive to edit it later. The second part extracts the content from the nested target field and writes one row per field entry to the CSV file using write_dict_to_csv.
The following fields have been tested successfully: Attacker_Skills_or_Knowledge_Required, Attack_Motivation-Consequences, Examples-Instances.
In [82]:
def field_parser_without_concatenation(target_field, root):
    '''
    Parse the field from capec_v2.11.xml file and output the information to a csv file.
    Args:
        target_field -- the target field that will be parsed through this function. The format of this arg should be string.
        root -- the root element of the whole parsed tree.
    Outcome:
        a csv file named by the field name. Each row will include the following information:
        - capec_id: The CAPEC identifier
        - field: The name of the target field
        - field content: The text information stored under the target field.
    '''
    # define the path of the target field. Here we select all element nodes whose tag is the target field
    target_field_path='Attack_Pattern/./'+target_field
    # extract the attack pattern table in the XML
    attack_pattern_table = root[2]
    # define the headers
    output_header=['capec_id','field']
    # define the path of the output file
    output_path=target_field+'.csv'
    ##### 1. Generate all possible tags (column headers in the csv file) under the target field tree
    # for each target field node
    for field in attack_pattern_table.findall(target_field_path):
        # if there is no content under the target field, go to the next capec_id
        if field.text is None:
            continue
        capec_id=field.getparent().attrib.get('ID')
        # for each field entry, in case there are multiple field entries under the target field node // will move to level 2
        for field_entry in list(field):
            # traverse all entry_element nodes under each field entry // will move to level 3
            for entry_element in list(field_entry):
                # generate the tag and content of each entry_element
                entry_element_tag=entry_element.tag
                entry_element_content=entry_element.text
                # if there is one more element under entry_element // will move to level 4
                if entry_element_content.isspace():
                    # traverse all elements under entry_element
                    for entry_element_child in list(entry_element):
                        # generate the tag and content of each element under entry_element
                        entry_element_child_tag=entry_element_child.tag
                        entry_element_child_content=entry_element_child.text
                        # if there is one more element under entry_element_child // will move to level 5
                        if entry_element_child_content.isspace():
                            # traverse all elements under entry_element_child
                            for entry_element_child_embed in list(entry_element_child):
                                # build a distinguishable tag for the content for future use
                                entry_element_child_embed_tag=entry_element_child_embed.tag
                                field_entry_header=entry_element_child_tag+'_'+entry_element_child_embed_tag
                                # append the tag to the output_header list if it is not already there
                                if field_entry_header.lower() not in output_header:
                                    output_header.append(field_entry_header.lower())
                        # if there is no element under entry_element_child // will stop at level 4
                        else:
                            # build a distinguishable tag for the content for future use
                            field_entry_header=entry_element_tag+'_'+entry_element_child_tag
                            # append the tag to the output_header list if it is not already there
                            if field_entry_header.lower() not in output_header:
                                output_header.append(field_entry_header.lower())
                # if there is no element under entry_element // will stop at level 3
                else:
                    # append the tag to the output_header list if it is not already there
                    if entry_element_tag.lower() not in output_header:
                        output_header.append(entry_element_tag.lower())
    #### 2. Extract the content from the target field
    # for each target field node
    for field in attack_pattern_table.findall(target_field_path):
        # if there is no content under the target field, go to the next capec_id
        if field.text is None:
            continue
        # extract capec_id from the attribute of its parent node
        capec_id=field.getparent().attrib.get('ID')
        # for each field entry, in case there are multiple field entries under the target field node // will move to level 2
        for field_entry in list(field):
            # the dictionary that will be written to a CSV file
            field_entry_dict=dict()
            field_entry_dict['capec_id']=capec_id
            field_entry_dict['field']=target_field
            # traverse all entry_element nodes under each field entry // will move to level 3
            for entry_element in list(field_entry):
                # generate the tag and content of each entry_element
                entry_element_tag=entry_element.tag
                entry_element_content=entry_element.text
                # if there is one more element under entry_element // will move to level 4
                if entry_element_content.isspace():
                    # traverse all elements under each entry_element
                    for entry_element_child in list(entry_element):
                        # generate the tag and content of each element under entry_element
                        entry_element_child_tag=entry_element_child.tag
                        entry_element_child_content=entry_element_child.text
                        # build a distinguishable tag for the content for future use
                        field_entry_header=entry_element_tag+'_'+entry_element_child_tag
                        # if there is one more element under entry_element_child // will move to level 5
                        if entry_element_child_content.isspace():
                            # traverse all elements under each entry_element_child
                            for entry_element_child_embed in list(entry_element_child):
                                # generate the tag and content of each element under entry_element_child
                                entry_element_child_embed_tag=entry_element_child_embed.tag
                                entry_element_child_embed_content=entry_element_child_embed.text
                                # if there is no content, move to the next element
                                if entry_element_child_embed_content is None:
                                    continue
                                # build a distinguishable tag for the content for future use
                                field_entry_header=entry_element_child_tag+'_'+entry_element_child_embed_tag
                                # if multiple elements share the same tag
                                if field_entry_header.lower() in field_entry_dict:
                                    # add the concatenated content into the dictionary
                                    field_entry_dict[field_entry_header.lower()]=field_entry_dict[field_entry_header.lower()]+';'+entry_element_child_embed_content
                                # if not, directly add the content into the dictionary
                                else:
                                    field_entry_dict[field_entry_header.lower()]=entry_element_child_embed_content
                        # if there is no element under entry_element_child // will stop at level 4
                        else:
                            # build a distinguishable tag for the content for future use
                            field_entry_header=entry_element_tag+'_'+entry_element_child_tag
                            # if multiple elements share the same tag
                            if field_entry_header.lower() in field_entry_dict:
                                # add the concatenated content into the dictionary
                                field_entry_dict[field_entry_header.lower()]=field_entry_dict[field_entry_header.lower()]+';'+entry_element_child_content
                            # if not, directly add the entry_element_child content into the dictionary
                            else:
                                field_entry_dict[field_entry_header.lower()]=entry_element_child_content
                # if there is no element under entry_element // will stop at level 3
                else:
                    # if multiple elements share the same tag
                    if entry_element_tag.lower() in field_entry_dict:
                        # add the concatenated content into the dictionary
                        field_entry_dict[entry_element_tag.lower()]=field_entry_dict[entry_element_tag.lower()]+';'+entry_element_content
                    # if not, directly add the entry_element content into the dictionary
                    else:
                        field_entry_dict[entry_element_tag.lower()]=entry_element_content
            # write the dictionary with headers to a CSV file (one row per field entry)
            write_dict_to_csv(output_path,output_header,field_entry_dict)
In [95]:
# the target field that can be parsed by this function
attacker_skill='Attacker_Skills_or_Knowledge_Required'
motivation_outcome='Attack_Motivation-Consequences'
example='Examples-Instances'
# parse the target field
field_parser_without_concatenation(attacker_skill,root)
In [96]:
# the output CSV file
field_without_concatenation=pd.read_csv('Attacker_Skills_or_Knowledge_Required.csv')
field_without_concatenation.head(5)
Out[96]:
The fields in Format C have a structure very similar to the fields in Format B and face the same problem that the content cannot be concatenated. The only difference is that fields in Format C also store information as element attributes.
Here is the example of the Content_History field under CAPEC-10. The content Submission_Source="Internal_CAPEC_Team" and Modification_Source="Internal" is stored as attributes of the Submission and Modification elements, so field_parser_without_concatenation does not capture it. However, if the attribute content can be ignored, field_parser_without_concatenation is able to parse the following Format C fields: References, Content_History, Related_Weaknesses, Related_Attack_Patterns.
<capec:Content_History>
<capec:Submissions>
<capec:Submission Submission_Source="Internal_CAPEC_Team">
<capec:Submitter>CAPEC Content Team</capec:Submitter>
<capec:Submitter_Organization>The MITRE Corporation</capec:Submitter_Organization>
<capec:Submission_Date>2014-06-23</capec:Submission_Date>
</capec:Submission>
</capec:Submissions>
<capec:Modifications>
<capec:Modification Modification_Source="Internal">
<capec:Modifier>CAPEC Content Team</capec:Modifier>
<capec:Modifier_Organization>The MITRE Corporation</capec:Modifier_Organization>
<capec:Modification_Date>2017-01-09</capec:Modification_Date>
<capec:Modification_Comment>Updated Related_Attack_Patterns</capec:Modification_Comment>
</capec:Modification>
</capec:Modifications>
</capec:Content_History>
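If the attribute content were needed as well, the element's attrib mapping could be read alongside its children. The parser functions in this notebook do not do this, but a minimal sketch, assuming the namespace-stripped root and that root[2] is the attack patterns table, might look like:
# Minimal sketch, not part of the parser functions above: read the attributes
# that field_parser_without_concatenation ignores.
# Assumes the namespace-stripped `root` and that root[2] is the attack patterns table.
attack_pattern_table = root[2]
for field in attack_pattern_table.findall('Attack_Pattern/./Content_History'):
    capec_id = field.getparent().attrib.get('ID')
    for submissions in field.findall('Submissions'):
        for submission in submissions.findall('Submission'):
            print(capec_id, submission.attrib.get('Submission_Source'))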
In [106]:
# the target field that can be parsed by this function, if ignoring the attribute content
reference='References'
content_history='Content_History'
weakness='Related_Weaknesses'
attack_pattern='Related_Attack_Patterns'
# parse the target field
field_parser_without_concatenation(content_history,root)
In [108]:
# the output CSV file
field_without_concatenation=pd.read_csv('Content_History.csv')
field_without_concatenation.head(5)
Out[108]: