Dataset Comparision


In [1]:
#import packages
import pandas as pd
import glob
import csv
from xml.etree.ElementTree import ElementTree
import re

Motivation

The functions below are reused from the nvd_introduction notebook and cve_mitre_introdcution notebook. These functions define fileloading functionality of csv and xml files. There are seperate function defenitions for both as they differ in method. While running this notebook please create a seperate folder (called data) to hold specifically data files. You may add or delete files from this folder depending on what is required for the results at that point. To count the number of vulnerabilities(in this case CVE ID), the count functions will tackle respective files.


In [2]:
#function to load a csv file
#accepts folderpath and headerlist as parameter to load the data files
def file_csv(folderpath,addheader,headerlist):
    #this reads all files under that folder
    filepaths = glob.glob(folderpath+"data/*.csv")
    #we prepare a list, that will contain all the tables that exist in these file paths
    dataframe = []   
    for filepath in filepaths:
        #if header is required to be added: as in NVD
        if addheader is True:
            dataframe.append(pd.read_csv(filepath,names=headerlist))
        else:
            dataframe.append(pd.read_csv(filepath))    
    return pd.concat(dataframe);

In [3]:
#function to load a xml file
#accepts folderpath as parameter to load the data files
def file_xml(folderpath):
    filepaths = glob.glob(folderpath+"data/*.xml")
    count = 0;
    #uses ElementTree to parse the tree structure
    for filepath in filepaths:
        CVE_tree = ElementTree()
        CVE_tree.parse(filepath)
        CVE_root= CVE_tree.getroot()       
        count = count+ countrow_xml(CVE_root,'{http://www.icasi.org/CVRF/schema/vuln/1.1}')
    return count

The blocks below contain functions to parse and count all entries in each loaded database. Since XML files have a table structure we will have seperate functions to count through each.


In [4]:
#counts number of rows under cve_id header
def countrows_csv(dataframe):
    count = 0
    entries = []
#consider only unique CVE entries    
    for element in dataframe['cve_id']:
        if element not in entries:
            entries.append(element)
            count = count + 1
            
    return count

In [5]:
#counts number of rows in CVE tag
def countrow_xml(CVE_root,root_string):
    cve_id =[] ;
    description=[];
    cell=0
    entries=[]
    for entry in CVE_root:       
        for child in entry:
            if (child.tag == root_string+'CVE'):
                if child.tag not in entries:
                    cve_id.append(child.text);                
           
        cell+=1
    return len(cve_id)

In [8]:
#all entries added to a dictionary
data={}
#calling functions
nvd_dataframe=file_csv('NVD/',True,['cve_id', 'cwe_id','timestamp'])
details_dataframe=file_csv('CVE_Details/',False,None)
data["nvd"] = countrows_csv(nvd_dataframe)
data["cve_details"]= countrows_csv(details_dataframe)
data['cve_mitre']=file_xml('CVE_Mitre/')

In [9]:
#visualization of the count
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()
plot_data={}
plot_data['Entries'] = data
#saving in dictionary for sorting and visualising
df_data = pd.DataFrame(plot_data).sort_values(by='Entries', ascending=True)
series = df_data.loc[:,'Entries']

p = figure(width=800, y_range=series.index.tolist(), title="Number of Vulnerabilities in each dataset")

p.xaxis.axis_label = 'Number of vulnerabilities/rows'
p.xaxis.axis_label_text_font_size = '10pt'
p.xaxis.major_label_text_font_size = '10pt'

p.yaxis.axis_label = 'Name of the dataset'
p.yaxis.axis_label_text_font_size = '14pt'
p.yaxis.major_label_text_font_size = '12pt'

j = 1
for k,v in series.iteritems():
  
  #Print fields, values, orders
  #print (k,v,j) 
  p.rect(x=v/2, y=j, width=abs(v), height=0.4,
    width_units="data", height_units="data")
  j += 1
show(p)


Loading BokehJS ...