1. Introduction

The number and severity of cyber-attacks has been increasing at an alarming rate. Currently software security reacts to attacks, trying to minimize damage after it has already occurred. The inherit delay in a reactive approach will inevitably result in larger-than-necessary losses. To prevent or minimize such losses, a proactive approach to security is needed. This proactive approach is the basis of PERCEIVE. By being able to identify potential attacks before they can cause harm, we will be able to finally turn the tide and stop being on the defensive.

Many weaknesses in software are known, and are publicly indexed, but this information is not reaching decision makers. MITRE’s databases—CVE for vulnerabilities, CWE for weaknesses, and CAPEC for attack patterns—hold these past concepts.

While the data in these databases was rich, it was loosely structured and high in volume. My portion of this project was to understand the structure and contents of the CWE database so that we could create a more easily understood and analyzable corpus. A corpus is a collection of writing that through machine learning, will be used to compare known concepts (CWEs) to emerging concepts in hacker discussion groups. To aid in determining similarity between our indexed concepts and emerging concepts, we need to understand the specificity, subject, time and purpose of each concept that we index. By analyzing emerging concepts in this manner, we can direct the attention of software developers, managers, and decision-makers so that they can proactively fix the security weaknesses in an economic and time efficient way.

In order to create this corpus, we must understand the ways in which the aformentioned indexes are organized. This purpose of this notebook is to record and document the way that the CWE database functions. The version of CWE used for the data analytics in this notebook in 2.9.

2. Structure

The data in CWE is organized in two seperate ways. The first part of this section will deal with the structure and formating of the XML section. XML is language designed to carry information and XML uses fields to hold the information. Manually reading XML data and trying to comprehend how it is strucutured is definitely made easier by using an external program such as XML explorer. The objective of this notebook was to determine which fields would be useful for the corpus. The raw XML data can be downloaded at https://cwe.mitre.org/data/. The version used for this notebook is 2.9.



In [1]:

    
import lxml.etree
tree = lxml.etree.parse('cwec_v2.9.xml')
root = tree.getroot()
for table in root: 
    print (table.tag)









    



Views
Categories
Weaknesses
Compound_Elements

There are four main tables in the XML. They are Views, Categories, Weaknesses, and Coumpound_Elements. In the XML the contents of these main categories is messy, however, on the website these main tables follow a strict hierarchy. The four main fields contain the entries related to that category. The meaning of the categories will be explained in a later section. For example, the Weakness main table contains all weakness while the Views main table contains all of the views. The individual entries contain the ID number of that entry, the name, and the status. The status refers to whether the entry is a draft, incomplete, etc. Each entry contains the various fields that are contained in said entry. A field is a container that contains a specific type of information.

Below is a graphical representation of the XML structure. The important thing to note is that there are hundreds of entries under the main table. The graphic only shows one in order to maximize clarity. The number of entries per main table varies depending on the main table. The Categories and Weaknesses main table contain the most entries while Views and Compound_Elements contain the fewest.

Visualizing the XML data is a headache, however the data for the website is significantly more structured. The four main tables from the XML return, however this time they are organized in a strict hierarchy. There are two Views and they are always at the top of the hierarchy. Every Category, Weakness, and Compound_Element is a "MemberOf" one or both of these two views. Weaknesses are further broken down into three levels. Weakness Class is the top level. Weakness Classes contain descriptions of weaknesses that are very abstract and is the most general of the three Weaknesses. Weakness Bases are the middle ground and contain details on detection and prevention. Weakness Variants are the most specific and are typically limited to a specific language or technology. Categories simply contain entries that share common characteristics. Compound Elements can either be composites or chains as of CWE 2.9. However it is stated that this can change as necessary.

The hierarcy functions as follows. The two Views are the apex of the hierarchy. Every Weakness or Category directly contained within one or both Views is refered as a "MemberOf". This is unique. All Weaknesses Classes and Compound Elements are contained within these views, however not all of them are directly contained within them. For those contained in other Weaknesses they are considered to be the "ChildOf" while their holder is considered the "ParentOf". Any member of the hierachy can be a "ParentOf" any member that is below it or of the same rank. Conversely any member of the hierarcy can be a "ChildOf" any member that is above it or the same rank. The exception to this rule is that a Category can be the "ParentOf" a Weakness Class yet is the "ChildOf" a Weakness Base at least once. See CWE ID 60 .

3. Counting the Weakness Fields

Each weakness contains multiple fields. These fields in turn contain information regarding that weakness. The type of information contained within these fields depends on the type of field. Each type of field has a predetermined format and a specificed topic and type of content. We need to determine which fields will be most useful for the program and direct the programs attention to those fields. The fields that will be useful not only have to contain useful information, but also occur frequently enough to be used regularly. The goal of the scripts below is to determine which fields occur frequently enough to be used.

The scripts below record all of the fields in the weaknesses.



In [2]:

    
weakness_table = root[2]
for row in weakness_table[0]: 
    print (row.tag)









    



Description
Relationships
Weakness_Ordinalities
Applicable_Platforms
Time_of_Introduction
Common_Consequences
Potential_Mitigations
Causal_Nature
Demonstrative_Examples
Taxonomy_Mappings
Content_History



In [3]:

    
for row in weakness_table[20]: 
    print (row.tag)









    



Description
Relationships
Relationship_Notes
Weakness_Ordinalities
Applicable_Platforms
Alternate_Terms
Terminology_Notes
Time_of_Introduction
Likelihood_of_Exploit
Common_Consequences
Detection_Methods
Potential_Mitigations
Causal_Nature
Demonstrative_Examples
Observed_Examples
Functional_Areas
Affected_Resources
References
Taxonomy_Mappings
White_Box_Definitions
Related_Attack_Patterns
Content_History

Now that we have a list of all of the fields, we can count to see how many times they are used.



In [4]:

    
histogram = {}
for row in weakness_table: 
    for column in row: 
        if column.tag not in histogram: 
            histogram[column.tag] = 0
        else:
            histogram[column.tag] += 1
print (histogram)









    



{'Description': 718, 'Relationships': 705, 'Weakness_Ordinalities': 130, 'Applicable_Platforms': 556, 'Time_of_Introduction': 664, 'Common_Consequences': 701, 'Potential_Mitigations': 523, 'Causal_Nature': 74, 'Demonstrative_Examples': 385, 'Taxonomy_Mappings': 597, 'Content_History': 718, 'Relationship_Notes': 122, 'Maintenance_Notes': 86, 'Background_Details': 41, 'Modes_of_Introduction': 32, 'Other_Notes': 23, 'References': 281, 'Related_Attack_Patterns': 206, 'Observed_Examples': 357, 'Theoretical_Notes': 26, 'Affected_Resources': 50, 'Research_Gaps': 74, 'Alternate_Terms': 65, 'Terminology_Notes': 26, 'Likelihood_of_Exploit': 184, 'Detection_Methods': 76, 'Functional_Areas': 27, 'White_Box_Definitions': 29, 'Enabling_Factors_for_Exploitation': 22, 'Relevant_Properties': 15}

Finally in order to make it easier to read, we create a histogram.



In [5]:

    
import numpy as np
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import Range1d
from bokeh.io import output_notebook
output_notebook() #So plots

data = {}
data['Entries'] = histogram

df_data = pd.DataFrame(data).sort_values(by='Entries', ascending=True)
series = df_data.loc[:,'Entries']

p = figure(width=800, y_range=series.index.tolist(), title="Weaknesses Histogram")

p.xaxis.axis_label = 'Frequency'
p.xaxis.axis_label_text_font_size = '10pt'
p.xaxis.major_label_text_font_size = '8pt'

p.yaxis.axis_label = 'Field'
p.yaxis.axis_label_text_font_size = '10pt'
p.yaxis.major_label_text_font_size = '8pt'

j = 1
for k,v in series.iteritems():
  
  #Print fields, values, orders
  #print (k,v,j) 
  p.rect(x=v/2, y=j, width=abs(v), height=0.4,
    width_units="data", height_units="data")
  j += 1









    





    
        
        Loading BokehJS ...



In [6]:

    
show(p)

4. Examination of Frequent Fields

Now that we know the frequencies of fields in the weakness table, our next step for extracting the information from the fields is to determine each field's structure and subsequently, our intended method of text extraction. Our findings are reported in the table below. Type names are subject to change and some minimal differences that have little effect on varying extraction methods have been omitted for initial grouping purposes.

Type	Description	Fields	Example
General Description	Contains one to a few sentences	Description, Extended Description, Background Details	The software uses a cookie to store sensitive information, but the cookie is not marked with the HttpOnly flag.
General Description with one to a few tables	Contains one to a few sentences and a table to list details	Common Consequences, Relationships, Mode Of Introduction
Subtitles with Qualified Entries	List with subtitles and one to a few sentences	Potential Mitigations, Notes, Applicable Platforms, Alternate Terms, Detection Methods
Description with one to a few code blocks	Contains one to a few sentences and a few code blocks	Demonstrative Examples
Table	Contains rows and columns. Columns are qualities, rows are individual items	Content History， Observed Examples, Weakness Ordinalities, Related Attack Patterns
Citation	Citation format	References	[REF-2] OWASP. "HttpOnly". https://www.owasp.org/index.php/HttpOnly.
Single Word or Two Word Descriptor	Low, Medium, High, or Very High	Likelihood Of Exploit	Medium

5. Investigating the Content Matter of Fields

Below is a table that lists the fields as well as a description of what information CWE provides inside those fields. Our descriptions are based off of the schema documentation. Most entries in this table are taken directly from the documented description, with a few minor changes to increase understandability, while others have been reworded for effectiveness.

(Here, we only include the fields that will be extracted through CWE Field parser)

Field	Description	CWE Example
Description	A brief description of the Weakness. Typically one to two sentances. Can also contain an Extended Description which goes into further detail. It should be noted that sometimes the extended description is shorter than the regular description but this is rarely the case	CWE-11
Potential Mitigations	Describes how to prevent exploit at various steps in the development cycle. Typically a single sentence per phase and there can be a variable amount of phases	CWE-11
Common_Consequences	A single sentence. There is also the scope which is a one or two word term as will as the technical impact which varies in format.	CWE-11
Demonstrative_Examples	Contains code pertaining to the weakness, but more importantly it can contain extremely detailed descriptions. The length is highly variable but the term density is high. The code is contained within an HTML tag however the notes by whoever wrote the article can be strucutred in a multitude of ways.	CWE-11
Relationships	Lists other fields fields and relationship to said fields.	CWE-11
Related_Attack_Patterns	Lists CAPEC entries that relate to topic.	CWE-1007
Observed_Examples	References CVE entry. See CWE ID number 141	CWE-1007
Taxonomy_Mappings	Alternative way to organize and understand the data.	CWE-11
Content_History	Shows original sumission date along with dates of modifications as well as other relevent data.	CWE-11
Application_Platforms	Gives the languages that the weakness effects.	CWE-11
References	References source(s) used.	CWE-11
Likelyhood_of_Exploit	Gives the chance that the weakness would be taken advantage of using. Typically rated from low to high..	CWE-1007
Weakness_Ordinalities	Ask Carlos about this one.	CWE-1007

6. Analyzing the Fields

Now that we have the frequencies of the fields we can determine which fields will be most useful in creating our corpus. For this purpose the arbitrary number of 100 was chosen. If a field occurs more than 100 times it will be detailed below with an example. The observations are based on limited research and analysis so they are not perfect, but they reasonably describe the rules and regulations of each field.

Currently Useful Fields

These fields have been determined to be useful for developing the corpus. Each field contains relevent content that is structured and plentiful. The field that is most likely to cause difficulties is the "Demonstrative_Examples" field due to the fact that is has the least structured field of the four currently useful fields.

Currently Useful Fields: Description, Common_Consequences, Potential_Mitigations, Demonstrative_Examples

Fields That Might Be Useful Later

These fields have been determined to be useful at a later date. The following fields mostly deal with the entries relationships with other fields. Being able to incorperate relationships into the corpus will definitely improve the accuracy, but that functionality will be added at a later date.

Here are the fields that might be userful later: Relationships, Time_of_Introduction, Related_Attack_Patterns, Observed_Examples, Taxonomy_Mappings

Currently Discarded Fields

These fields have been determined to not be useful at the current time. Most of these fields do not contain enough information for example "Likelyhood_of_Exploit" or contain information that would not be useful to the corpus like "References."

Here are the currently discarded fields: Content_History ,Application_Platforms, References, Likelyhood_of_Exploit, Weakness_Ordinalities, Relationship_Notes

7. Future Goals

Now that we know which fields will be useful, we need to compare them in different ways. Four methods of comparison have been determined so far. They are specificity, subject, time, and purpose.

The first way that we can compare indexes is by looking at their specificity. Entries can range from focusing on the specifics of a given weakness to broad concepts. If we were to compare two documents, we could use their specificity to see how similar the two documents are to one another. In MITRE’s CAPEC especially, specificity is strictly organized. These predetermined tier of specificity are a good litmus test to see if the program is accurately detecting specificity based on content. Being able to accurately measure specificity will ultimately make sure the similarity scores given to documents is more accurate.

The second way that we can compare indexes is by looking at their subject. Conveniently for the indexes, the subject can be easily found by simply looking at the title. Comparing subjects is a quick way to determine how similar the indexes are to one another. By knowing that two subjects are related, the program will be able to more accurately categorize threats. Subject is also an important category for when the program moves to the email phase. Emails conveniently have a line for subject and using this line to its full advantage is key in developing this program.

Time is the third way we can compare indexes. Languages are constantly evolving. As such looking at the way that people talk about a certain subject over time is an important dimension in developing a will rounded corpus for the program. For the indexes, looking at the way an article is changed version to version lets us get a more accurate mapping of how weaknesses are explored and understood. In emails this takes a more organic nature. While the indexes simply record the changes, the emails are the catalyst that drive these changes. By mapping out the way that topics evolve over time, we can create a more complete corpus and as such create a more accurate similarity score.

The final way that we can compare indexes is by comparing similar fields. In MITRE fields have a specific rule set about the way that they are written and their contents. By comparing the same field in multiple indexes with one another we are able to slice the data in another way giving increasing the depth of the corpus and hopefully increasing the accuracy of the program.

8. Conclusion

PERCEIVE is still in its infancy. As such, it is still a work in progress. However a lot was learned in the time that I worked on it. Having to reverse engineer the way that the indexes were formatted and organized was time consuming. However, with our new found understanding we will be better able to create the corpus in the future. Also our knowledge of how to slice up the indexes and emails has grown. For a program, reading the text is easy, however being able to understand what is written is a different matter. Being able to divide up the writings like we have done might seem natural, but being able to implement it into a learning program will prove to be difficult.