NVD Introduction

Introduction

Vulnerability data are available from three different sources: CVE Mitre, NVD, and CVE Details; the data are created and annotated by these sources in that order.

Launched in 1999, when most information security tools used their own databases with their own names for security vulnerabilities, the Common Vulnerabilities and Exposures (CVE) list by Mitre manually documents known vulnerabilities for public use.

Each vulnerability contains a description, is uniquely identified by a CVE ID, and may also include fields specifying the vulnerable software, versions, and vendors affected by it. If a set of vulnerabilities are similar but occur in different software, they can have different CVE IDs while sharing the same weakness ID (CWE ID).

When created by CVE Mitre, each vulnerability may or may not be annotated with a weakness ID (CWE ID); when available, CWE IDs can serve to group similar vulnerabilities conceptually and to observe how they have been 'instantiated' in different software, versions, or vendors.

CVE Mitre’s vulnerabilities are then annotated with severity scores, fix information, and impact ratings in the National Vulnerability Database (NVD), and made available for download as XML feeds.

CVE Details was created to provide a user-friendly interface to NVD's XML feeds. For instance, using vulnerabilities' CWE IDs and keyword matching, it defines 13 vulnerability types to facilitate browsing vulnerabilities. Since CVE Details warns about inconsistencies in the NVD XML feeds (e.g. the same vendor's software having different names) and about entries irrelevant to our purposes (i.e. reserved, duplicate, and removed entries), we downloaded all software vulnerabilities to date from the three sources to define our vulnerability dataset and ensure consistency.
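The filtering alluded to above can be approximated on the description text alone. The following is a minimal sketch, assuming the conventional "** RESERVED **" and "** REJECT **" markers that CVE/NVD place at the start of the summaries of such entries; it is illustrative only and not the exact procedure used to build the dataset.

def is_relevant(summary):
    #skip reserved and rejected/removed entries, assuming the conventional
    #markers at the start of their summary text (assumption, not verified here)
    if summary is None:
        return False
    return not summary.strip().startswith(("** RESERVED **", "** REJECT **"))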

Motivation

NVD data has about 60,000 vulnerability entries (CVE Mitre's RESERVED entries not included), but it is more important to understand the weaknesses they are associated with (CWE IDs). By evaluating the CWE ID coverage over the given CVE IDs across the timeline, we can gain insights about active months or years (in terms of number of entries) and its correlation to CWE coverage (if any).
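As a tiny worked example of the metric (the numbers are made up): if a given month has 100 CVE entries and 40 of them carry a CWE ID, its CWE coverage for that month is 40/100 = 0.40.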

Method

Parsing XML files to CSV

The data from the NVD website is available for each year from 2002 up to 2017 in XML format. To perform further analysis, we only require a few fields (CVE ID, CWE ID, timestamp), so we parse just those tags from XML to CSV. The code below demonstrates the conversion and extraction for the years 2002 to 2017 (Feb); the converted files are written to the same folder as the notebook, named by year. The range of years or the destination folder can be changed by altering the path and the numbers mentioned in the code. The method below uses ElementTree for XML parsing. If you choose to skip this step and access the CSV files directly, you can find them on our Google Drive.
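To make the namespace-qualified tag names in the parsing code easier to follow, here is a minimal sketch of what an entry in the NVD feed looks like and how ElementTree expands its tags. The entry content is made up for illustration; the real feeds contain many more fields per entry.

import xml.etree.ElementTree as ET

#illustrative entry only; real NVD feeds contain many more fields per entry
sample = """
<nvd xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4">
  <entry>
    <vuln:cve-id>CVE-2017-0001</vuln:cve-id>
    <vuln:cwe id="CWE-79"/>
    <vuln:published-datetime>2017-01-01T00:00:00.000-05:00</vuln:published-datetime>
    <vuln:summary>Example summary text.</vuln:summary>
  </entry>
</nvd>
"""

root = ET.fromstring(sample)
for entry in root:
    for child in entry:
        #ElementTree expands the vuln: prefix, so tags look like
        #'{http://scap.nist.gov/schema/vulnerability/0.4}cve-id'
        print(child.tag, child.text, child.attrib)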


In [3]:
#Parsing using ElementTree
import xml.etree.ElementTree as ET
import csv

In [4]:
#getting all XML files in a sorted python list (sorted so the output file names line up with the years)
import glob
x = sorted(glob.glob("raw_data/*.xml"))

In [6]:
#extracting fields and writing them into a CSV file
yearRange = 2002
#namespace prefix used by the NVD vulnerability schema
NS = '{http://scap.nist.gov/schema/vulnerability/0.4}'
for y in x:
    CVE_tree = ET.parse(y)
    CVE_root = CVE_tree.getroot()
    f = open(str(yearRange) + ".csv", 'w')
    yearRange = yearRange + 1
    CVE_count = 0
    for entry in CVE_root:
        cve_id = ""
        cwe_id = ""
        published_date = ""
        for child in entry:
            #print(child.tag) will help you identify further child nodes
            if child.tag == NS + 'cve-id':
                cve_id = child.text
            if child.tag == NS + 'cwe':
                cwe_id = child.attrib['id']
            if child.tag == NS + 'published-datetime':
                #the published date is the timestamp column used in the analysis below
                published_date = child.text
        #Don't write a header if you will be using the merged database
        #f.write("CVE ID,CWE ID,Timestamp\n")
        vuln = '{o1},{o2},{o3}\n'.format(o1=cve_id, o2=cwe_id, o3=published_date)
        f.write(vuln)
        CVE_count = CVE_count + 1
    #print(CVE_count)
    f.close()
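As a quick sanity check (not part of the original pipeline), one of the generated per-year files can be read back to confirm that the three columns were written as expected; the year 2002 is just an example.

import csv

with open("2002.csv") as f:
    reader = csv.reader(f)
    #print the first five rows: CVE ID, CWE ID, timestamp
    for i, row in enumerate(reader):
        print(row)
        if i == 4:
            break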

In [7]:
#merge all years into one file
#only perform this if data is required for all years
fout = open("Merged_2002-17.csv", "w")  #'w' so re-running does not append duplicates
#write the header once
Head = "CVE ID,CWE ID,Timestamp\n"
fout.write(Head)
#first file:
for line in open("2002.csv"):
    fout.write(line)
#now the rest (the per-year files carry no header, so nothing to skip):
for num in range(2003, 2018):
    f = open(str(num) + ".csv")
    for line in f:
        fout.write(line)
    f.close()
fout.close()
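An alternative to the manual merge above, assuming pandas is available and the per-year files are named 2002.csv through 2017.csv, is to concatenate them directly. This sketch is equivalent in outcome but not part of the original pipeline (note it writes lowercase column names).

import glob
import pandas as pd

#read every per-year CSV (they have no header) and stack them into one frame
frames = [pd.read_csv(p, names=['cve_id', 'cwe_id', 'timestamp'])
          for p in sorted(glob.glob("20??.csv"))]
merged = pd.concat(frames, ignore_index=True)
merged.to_csv("Merged_2002-17.csv", index=False)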

Histograms by Month


In [8]:
#using pandas
import pandas as pd
import csv

In [9]:
#We first load all file paths
import glob
nvd_filepaths = sorted(glob.glob("data/*.csv"))  #sorted so the frames stay in year order
#Then we prepare a list that will contain all the tables stored at these file paths
nvd_dataframes = []
for nvd_filepath in nvd_filepaths:
    #the CSVs do not contain headers, so column names are added here. TO-DO: add headers to the CSVs when they are generated.
    nvd_dataframes.append(pd.read_csv(nvd_filepath,names=['cve_id', 'cwe_id','timestamp']))

For the sake of clarity, let's consider just one of the dataframes in the list of dataframes to show how the coverage percentage is computed.


In [10]:
#Choose the first dataframe at position 0
nvd_df = nvd_dataframes[0]
#Parse the timestamp field turning it into a datetimeindex object, and then access the month attribute
nvd_df['month'] = pd.DatetimeIndex(nvd_df['timestamp']).month

#Now that we have a month column, we can 'group by' the month column and count the non-null values.
nvd_df = nvd_df.groupby(by=['month'])[['cve_id', 'cwe_id']].count()
nvd_df


Out[10]:
       cve_id  cwe_id
month                
1         674     182
2         458     110
3         608     145
4         637      98
5         572      90
6         566      94
7         571      98
8         529     152
9         446     401
10        573     478
11        426     370
12        450     392

In [11]:
#All that is left is to divide, row-wise, the number of cwe_ids by the number of cve_ids.
#Since cve_id is never null, its count effectively equals the number of rows for the given month.
#cwe_id, instead, can be null, and is only counted when it is present.
#Dividing one by the other gives us the cwe_coverage we want for the timeseries.
nvd_df['cwe_coverage'] = nvd_df['cwe_id']/nvd_df['cve_id']
nvd_df


Out[11]:
       cve_id  cwe_id  cwe_coverage
month                              
1         674     182      0.270030
2         458     110      0.240175
3         608     145      0.238487
4         637      98      0.153846
5         572      90      0.157343
6         566      94      0.166078
7         571      98      0.171629
8         529     152      0.287335
9         446     401      0.899103
10        573     478      0.834206
11        426     370      0.868545
12        450     392      0.871111

Let's define a function with the code above, so that given a dataframe, it generates the table above, with cve_id, cwe_id, and cwe_coverage.


In [12]:
def calculate_cwe_coverage(nvd_df):
    #Parse the timestamp field turning it into a datetimeindex object, and then access the month attribute
    nvd_df['month'] = pd.DatetimeIndex(nvd_df['timestamp']).month
    #Now that we have a month column, we can 'group by' the table by the month column. 
    nvd_df = nvd_df.groupby(by=['month'])[['cve_id', 'cwe_id']].count()
    nvd_df['cwe_coverage'] = nvd_df['cwe_id']/nvd_df['cve_id']
    return nvd_df

Now we generate the cwe coverage table for all our dataframes in our list.


In [13]:
cwe_coverage_dfs = []
for nvd_df in nvd_dataframes: 
    cwe_coverage_dfs.append(calculate_cwe_coverage(nvd_df))
#print the cwe coverage tables for all datasets
print(cwe_coverage_dfs)


[       cve_id  cwe_id  cwe_coverage
month                              
1         674     182      0.270030
2         458     110      0.240175
3         608     145      0.238487
4         637      98      0.153846
5         572      90      0.157343
6         566      94      0.166078
7         571      98      0.171629
8         529     152      0.287335
9         446     401      0.899103
10        573     478      0.834206
11        426     370      0.868545
12        450     392      0.871111,        cve_id  cwe_id  cwe_coverage
month                              
1         640     547      0.854688
2         844     800      0.947867
3         745     673      0.903356
4         663     582      0.877828
5         423     393      0.929078
6         458     434      0.947598
7         561     501      0.893048
8         622     580      0.932476
9         566     497      0.878092
10        527     458      0.869070
11        450     424      0.942222
12        535     499      0.932710,        cve_id  cwe_id  cwe_coverage
month                              
1         358     312      0.871508
2         388     355      0.914948
3         449     391      0.870824
4         436     339      0.777523
5         376     339      0.901596
6         473     427      0.902748
7         457     386      0.844639
8         336     276      0.821429
9         483     394      0.815735
10        366     276      0.754098
11        312     258      0.826923
12        454     390      0.859031,        cve_id  cwe_id  cwe_coverage
month                              
1         422     288      0.682464
2         347     271      0.780980
3         437     385      0.881007
4         445     329      0.739326
5         398     368      0.924623
6         463     413      0.892009
7         302     223      0.738411
8         446     350      0.784753
9         398     268      0.673367
10        547     353      0.645338
11        380     334      0.878947
12        368     316      0.858696,        cve_id  cwe_id  cwe_coverage
month                              
1         355     294      0.828169
2         432     375      0.868056
3         386     342      0.886010
4         324     224      0.691358
5         380     330      0.868421
6         375     315      0.840000
7         365     235      0.643836
8         362     324      0.895028
9         417     379      0.908873
10        445     298      0.669663
11        238     203      0.852941
12        362     312      0.861878,        cve_id  cwe_id  cwe_coverage
month                              
1         401     279      0.695761
2         369     308      0.834688
3         432     338      0.782407
4         279     248      0.888889
5         384     264      0.687500
6         383     309      0.806789
7         487     345      0.708419
8         613     537      0.876020
9         568     426      0.750000
10        567     375      0.661376
11        428     388      0.906542
12        289     229      0.792388,        cve_id  cwe_id  cwe_coverage
month                              
1         545     354      0.649541
2         438     345      0.787671
3         530     456      0.860377
4         544     307      0.564338
5         484     447      0.923554
6         400     314      0.785000
7         478     324      0.677824
8         371     330      0.889488
9         465     417      0.896774
10        563     385      0.683837
11        415     371      0.893976
12        484     433      0.894628,        cve_id  cwe_id  cwe_coverage
month                              
1         701     492      0.701854
2         386     347      0.898964
3         407     352      0.864865
4         523     372      0.711281
5         420     370      0.880952
6         382     334      0.874346
7         679     509      0.749632
8         437     372      0.851259
9        1107    1055      0.953026
10       1393    1201      0.862168
11        478     438      0.916318
12        572     501      0.875874,        cve_id  cwe_id  cwe_coverage
month                              
1         550     416      0.756364
2         488     403      0.825820
3         414     352      0.850242
4         622     447      0.718650
5         448     378      0.843750
6         497     449      0.903421
7         641     423      0.659906
8         591     513      0.868020
9         514     461      0.896887
10        749     551      0.735648
11        371     327      0.881402
12        582     457      0.785223,        cve_id  cwe_id  cwe_coverage
month                              
1        1093     756      0.691674
2         534     375      0.702247
3         299     246      0.822742
4         534     379      0.709738
5         525     411      0.782857
6         472     354      0.750000
7         671     436      0.649776
8         350     332      0.948571
9         570     550      0.964912
10        652     573      0.878834
11        378     370      0.978836
12        505     491      0.972277]

We now have all the information needed to plot our timeseries.

CWE Coverage Timeseries


In [14]:
#imports for the timeseries plot
import numpy as np
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import Range1d
from bokeh.io import output_notebook
import matplotlib.pyplot as plot
from datetime import datetime
output_notebook()


Loading BokehJS ...

In [15]:
color_map = {
    '2007': 'red',
    '2008': 'green',
    '2009': 'yellow',
    '2010': 'violet',
    '2011': 'indigo',
    '2012': 'brown',
    '2013': 'black',
    '2014': 'blue',
    '2015': 'orange',
    '2016': 'olive',
    '2017': 'navy',
}

def create_multi_line(coverage_dfs):
    vul_plot = plot.subplot(111)
    coverage_by_year = {}
    year = 2006
    for frame in coverage_dfs:
        year += 1
        coverage_by_year[year] = frame['cwe_coverage']
        #one line per year, coloured via color_map
        vul_plot.plot(coverage_by_year[year], label=year, color=color_map.get(str(year)))
    vul_plot.set_ylabel('cwe coverage')
    vul_plot.set_xlabel('month')
    vul_plot.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    vul_plot.set_title("CWE coverage per month")
    vul_plot.set_autoscaley_on(False)
    vul_plot.set_ylim([0, 1])
    vul_plot.set_autoscalex_on(False)
    vul_plot.set_xlim([1, 12])
    vul_plot.xaxis.set_ticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
    plot.show()

create_multi_line(cwe_coverage_dfs)
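Since bokeh is imported above but the drawing is done with matplotlib, here is a hedged sketch of how the same coverage lines could be rendered with bokeh's plotting interface instead; the legend_label argument assumes a reasonably recent bokeh version, and the year-to-frame mapping is the same assumption made in create_multi_line.

from bokeh.plotting import figure, show

p = figure(title="CWE coverage per month", x_axis_label='month',
           y_axis_label='cwe coverage', y_range=(0, 1))
year = 2006
for frame in cwe_coverage_dfs:
    year += 1
    #one line per year, reusing the color_map defined above
    p.line(frame.index, frame['cwe_coverage'],
           legend_label=str(year), color=color_map.get(str(year), 'gray'))
show(p)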