Vulnerabilities data are available in three different sources: CVE Mitre, NVD and CVE Details, being created and annotated through the data sources in this respective order. Launched in 1999 when most information security tools used their own databases with their own names for security vulnerabilities, the Common Vulnerabilities and Exposures (CVE) by Mitre documents known vulnerabilities manually for public usage.
Each vulnerability contains a description, is uniquely identified by a CVE ID, and may also include fields specifying the vulnerable software, version and vendors affected by it. If a set of vulnerabilities are similar, but occur for different software, they can have different CVE-IDs, and contain the same weakness ID (CWE ID). When created by CVE Mitre, each vulnerability may or not be annotated with a weakness ID(CWE ID),but when available they can serve to group similar vulnerabilities conceptually,and observe how they have been ‘instantiated’ in different software, version or vendor.
CVE Mitre’s vulnerabilities are then annotated with severity scores, fix information, and impact ratings in the National Vulnerability Database(NVD),and made available for download as XML feeds. CVE Details was created to provide a user-friendly interface to NVD’s XML feeds. For instance, using vulnerabilities’ CWE IDs and keyword matching, it defines 13 vulnerability types to facilitate browsing vulnerabilities. Since CVE Details warns about inconsistencies in NVD XML Feeds (e.g.same vendor’s software having different names), and irrelevant entries to our purposes (i.e. reserved, duplicates and removed entries), we downloaded all software vulnerabilities to date from the three sources to define our vulnerability dataset and ensure consistency.
The CVE Mitre database has information about the reference (or the source) of the vulnerability. There are various sources, and the database provides information about the Url it is reported from the description of the attack (with an ID associated). It is important to identify the right sources of vulnerabilities and this notebook aims to help choose the sources and filter the chosen ones into a new file.
The files provided by the CVE Mitre website are in CVRF(XML)format and can be found http://cve.mitre.org/data/downloads/index.html . The XML schema is built such that it encapsulates tables within a table. We will parse through the tree to reach the required child node and perform pattern matching using regular expressions. This will enable us to extract the right fields and write it onto a file(file1). The other unfiltered sources are writen into another file(file2), from where they can be fetched if felt they are to be considered.
In [1]:
#import Element tree for parsing xml
from xml.etree.ElementTree import ElementTree
import csv
import re
import glob
#parsing the tree and fetching root node
table_root = "{http://www.icasi.org/CVRF/schema/vuln/1.1}"
The data is very complex and filled with different types of references. On initial evaluation, the data was found to not be completely described by one type of regular expression. The data has over 60 different types of references and each of them vary in its composition. The regex has been formed in a way that it identifies only the IDs of reference types and does not confuse the occurance of that word in other sentences(or descriptions of references). This process of filtering and verification for accuracy is semi automated.
In [2]:
#creating variables for regex search
BID_regex = "BID:(\d+)"
SECTRACK_regex = "SECTRACK:(\d+)"
MS_regex = "MS:[A-Z]*[0-9]*-[0-9]*"
REDHAT_regex = "REDHAT:RHSA-[0-9]*:[0-9]*"
GENTOO_regex = "GENTOO:GLSA-[0-9]*-[0-9]*"
DEBIAN_regex = "DEBIAN:DSA-(\d+)"
SECUNIA_regex = "SECUNIA:(\d+)"
TURBO_regex = "TURBO:TLSA-[0-9]*-[0-9]*"
AIXAPAR_regex = "AIXAPAR:[A-Z]*[0-9]*"
ALLAIRE_regex = "ALLAIRE:[A-Z]*[0-9]*-[0-9]*"
AUSCERT_regex = "AUSCERT:[A-Z]*-[0-9]*.[0-9]*"
BEA_regex = "BEA:BEA[0-9]*-[0-9]*.[0-9]*"
CIAC_regex = "CIAC:[A-Z]*-[0-9]*"
CONECTIVA_regex = "^CONECTIVA"
OSVDB_regex = "OSVDB:(\d+)"
CERT_regex = "^CERT:"
CERT_VN_regex = "^CERT-VN"
APPLE_regex = "^APPLE"
CALDERA_regex = "^CALDERA"
EXPLOIT_regex = "^EXPLOIT"
BUGTRAQ_regex = "^BUGTRAQ"
CISCO_regex = "^CISCO"
FEDORA_regex = "^FEDORA"
FULLDISC_regex = "^FULLDISC"
MISC_regex = "^MISC"
OVAL_regex = "^OVAL"
IBM_regex = "^IBM"
OPENBSD_regex = "^OPENBSD"
VIM_regex = "^VIM"
SUN_regex = "^SUN"
NAI_regex = "^NAI"
CONFIRM_regex = "^CONFIRM"
#creating headers for file
header_file1= ["CVE ID","BID Description","BID Url","SECTRACK Description",
"SECTRACK Url","MS Description","MS Url","REDHAT Description",
"REDHAT Url","DEBIAN Description","DEBIAN Url","GENTOO Description",
"GENTOO Url","SECUNIA Description","SECUNIA Url","TURBO Description",
"TURBO Url","AIXAPAR Description","AIXAPAR Url","ALLAIRE Description",
"ALLAIRE Url","APPLE Description","APPLE Url","ATSTAKE Description",
"ATSTAKE Url","AUSCERT Description","AUSCERT Url","BEA Description",
"BEA Url","CALDERA Description","CALDERA Url","CERT Description","CERT Url",
"CIAC Description","CIAC Url","CONECTIVA Description",
"CONECTIVA Url","CONFIRM Description","CONFIRM Url","OSVDB Description",
"OSVDB Url","BUGTRAQ Description", "BUGTRAQ Url","CISCO Description",
"CISCO Url","BINDVIEW Description", "BINDVIEW Url","EXPLOIT Description",
"EXPLOIT Url","FEDORA Description", "FEDORA Url","FULLDISC Description",
"FULLDISC Url","MILLWORM Description", "MILLWORM Url","MISC Description",
"MISC Url","MLIST Description", "MLIST Url","SUSE Description",
"SUSE Url","XF Description", "XF Url","UBUNTU Description", "UBUNTU Url",
"VUPEN Description", "VUPEN Url","SREASON Description", "SREASON Url",
"OVAL Description", "OVAL Url","SGI Description", "SGI Url","CHECKPOINT Description",
"CHECKPOINT Url","MANDRAKE Description", "MANDRAKE Url","MANDRIVA Description",
"MANDRIVA Url","COMPAQ Description", "COMPAQ Url","FREEBSD Description", "FREEBSD Url",
"HP Description", "HP Url","IBM Description", "IBM Url","IDEFENSE Description",
"IDEFENSE Url","IMMUNIX Description", "IMMUNIX Url","ISS Description", "ISS Url",
"JVN Description", "JVN Url","L0PHT Description","L0PHT Url","OPENBSD Description",
"OPENBSD Url","SUNALERT Description","SUNALERT Url","TRUSTIX Description",
"TRUSTIX Url","SLACKWARE Description","SLACKWARE Url","NETBSD Description","NETBSD Url",
"VIM Description","VIM Url","VULNWATCH Description","VULNWATCH Url","CERT_VN Description","CERT_VN Url",
"MSKB Description","MSKB Url","NAI Description","NAI Url","SUN Description","SUN Url"]
header_file2= ["CVE ID","Reference Description","Reference Url"]
On evaluation, the data was found to have plenty of Unicode errors. This module performs the read write operation into files are filteration. The unicode errors can be either printed or found in a file called "UnicodeErrors.txt"
In [3]:
#write into file currently holding references
def write_file(filename,data,header):
with open(filename , 'w') as file:
writer = csv.DictWriter(file, fieldnames = header)
writer.writeheader()
for value in data:
#print(value);
try:
writer.writerow(value)
except UnicodeEncodeError:
writer.writerow({k:v.encode('utf8') for k,v in value.items()})
File1 holds all accurately filtered entries and File2 contain the other entries. The function counter helps know the frequency of occurances of various reference types. The function coverage_counter helps us identify the coverage of a type over CVE IDs.
In [4]:
def counter(reference,references_counter):
global total
if not reference in count_reference_all:
count_reference_all[reference]=1
else:
count_reference_all[reference]+=1
if not reference in references_counter:
total +=1
references_counter[reference]=1
if not reference in count_reference:
count_reference[reference]=1
else:
count_reference[reference]+=1
The module reference_sort performs filtering to decide what entry goes to what file. Python does not allow for switch cases, so we use if and elif. Here we have performed regex matching to seperate out the different types.
The function matches each of the references obtained for a CVE-ID one by one in the if and elif loop and keeps a track of the number of references for each of the reference types. If the reference type does not match with any of the reference types listed below, it will be classified as other reference.
In [5]:
def reference_sort(data):
#data[1] holds references
if data[1] is not None:
references_counter={}
#print (data[1])
#print (count_reference)
for child in data[1].findall(table_root+"Reference"):
file1 = {}
file2 = {}
#re.search(regex,text)
if re.search(BID_regex,child.find(table_root + "Description").text):
counter("BID",references_counter)
file1["CVE ID"] = data[0].text
file1["BID Url"] = child.find(table_root + "URL").text
file1["BID Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(SECTRACK_regex,child.find(table_root + "Description").text):
counter("SECTRACK",references_counter)
file1["CVE ID"] = data[0].text
file1["SECTRACK Url"] = child.find(table_root + "URL").text
file1["SECTRACK Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(MS_regex,child.find(table_root + "Description").text):
counter("MS",references_counter)
file1["CVE ID"] = data[0].text
file1["MS Url"] = child.find(table_root + "URL").text
file1["MS Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "DEBIAN:" in child.find(table_root + "Description").text:
counter("DEBIAN",references_counter)
file1["CVE ID"] = data[0].text
file1["DEBIAN Url"] = child.find(table_root + "URL").text
file1["DEBIAN Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(REDHAT_regex,child.find(table_root + "Description").text):
counter("REDHAT",references_counter)
file1["CVE ID"] = data[0].text
file1["REDHAT Url"] = child.find(table_root + "URL").text
file1["REDHAT Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(AIXAPAR_regex,child.find(table_root + "Description").text):
counter("AIXAPAR",references_counter)
file1["CVE ID"] = data[0].text
file1["AIXAPAR Url"] = child.find(table_root + "URL").text
file1["AIXAPAR Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "GENTOO:" in child.find(table_root + "Description").text:
counter("GENTOO",references_counter)
file1["CVE ID"] = data[0].text
file1["GENTOO Url"] = child.find(table_root + "URL").text
file1["GENTOO Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(SECUNIA_regex,child.find(table_root + "Description").text):
counter("SECUNIA",references_counter)
file1["CVE ID"] = data[0].text
file1["SECUNIA Url"] = child.find(table_root + "URL").text
file1["SECUNIA Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "TURBO:" in child.find(table_root + "Description").text:
counter("TURBO",references_counter)
file1["CVE ID"] = data[0].text
file1["TURBO Url"] = child.find(table_root + "URL").text
file1["TURBO Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(ALLAIRE_regex,child.find(table_root + "Description").text):
counter("ALLAIRE",references_counter)
file1["CVE ID"] = data[0].text
file1["ALLAIRE Url"] = child.find(table_root + "URL").text
file1["ALLAIRE Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(APPLE_regex,child.find(table_root + "Description").text):
counter("APPLE",references_counter)
file1["CVE ID"] = data[0].text
file1["APPLE Url"] = child.find(table_root + "URL").text
file1["APPLE Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "ATSTAKE:" in child.find(table_root + "Description").text:
counter("ATSTAKE",references_counter)
file1["CVE ID"] = data[0].text
file1["ATSTAKE Url"] = child.find(table_root + "URL").text
file1["ATSTAKE Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(AUSCERT_regex,child.find(table_root + "Description").text):
counter("AUSCERT",references_counter)
file1["CVE ID"] = data[0].text
file1["AUSCERT Url"] = child.find(table_root + "URL").text
file1["AUSCERT Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(CALDERA_regex,child.find(table_root + "Description").text):
counter("CALDERA",references_counter)
file1["CVE ID"] = data[0].text
file1["CALDERA Url"] = child.find(table_root + "URL").text
file1["CALDERA Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(BEA_regex,child.find(table_root + "Description").text):
counter("BEA",references_counter)
file1["CVE ID"] = data[0].text
file1["BEA Url"] = child.find(table_root + "URL").text
file1["BEA Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(CIAC_regex,child.find(table_root + "Description").text):
counter("CIAC",references_counter)
file1["CVE ID"] = data[0].text
file1["CIAC Url"] = child.find(table_root + "URL").text
file1["CIAC Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(CONECTIVA_regex,child.find(table_root + "Description").text):
counter("CONECTIVA",references_counter)
file1["CVE ID"] = data[0].text
file1["CONECTIVA Url"] = child.find(table_root + "URL").text
file1["CONECTIVA Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(CERT_VN_regex,child.find(table_root + "Description").text):
counter("CERT-VN",references_counter)
file1["CVE ID"] = data[0].text
file1["CERT_VN Url"] = child.find(table_root + "URL").text
file1["CERT_VN Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(CERT_regex,child.find(table_root + "Description").text):
counter("CERT",references_counter)
file1["CVE ID"] = data[0].text
file1["CERT Url"] = child.find(table_root + "URL").text
file1["CERT Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "BINDVIEW:" in child.find(table_root + "Description").text:
counter("BINDVIEW",references_counter)
file1["CVE ID"] = data[0].text
file1["BINDVIEW Url"] = child.find(table_root + "URL").text
file1["BINDVIEW Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "CHECKPOINT:" in child.find(table_root + "Description").text:
counter("CHECKPOINT",references_counter)
file1["CVE ID"] = data[0].text
file1["CHECKPOINT Url"] = child.find(table_root + "URL").text
file1["CHECKPOINT Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(CONFIRM_regex,child.find(table_root + "Description").text):
counter("CONFIRM",references_counter)
file1["CVE ID"] = data[0].text
file1["CONFIRM Url"] = child.find(table_root + "URL").text
file1["CONFIRM Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(OSVDB_regex,child.find(table_root + "Description").text):
counter("OSVDB",references_counter)
file1["CVE ID"] = data[0].text
file1["OSVDB Url"] = child.find(table_root + "URL").text
file1["OSVDB Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(BUGTRAQ_regex,child.find(table_root + "Description").text):
counter("BUGTRAQ",references_counter)
file1["CVE ID"] = data[0].text
file1["BUGTRAQ Url"] = child.find(table_root + "URL").text
file1["BUGTRAQ Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(CISCO_regex,child.find(table_root + "Description").text):
counter("CISCO",references_counter)
file1["CVE ID"] = data[0].text
file1["CISCO Url"] = child.find(table_root + "URL").text
file1["CISCO Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(EXPLOIT_regex,child.find(table_root + "Description").text):
counter("EXPLOIT",references_counter)
file1["CVE ID"] = data[0].text
file1["EXPLOIT Url"] = child.find(table_root + "URL").text
file1["EXPLOIT Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(FEDORA_regex,child.find(table_root + "Description").text):
counter("FEDORA",references_counter)
file1["CVE ID"] = data[0].text
file1["FEDORA Url"] = child.find(table_root + "URL").text
file1["FEDORA Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(FULLDISC_regex,child.find(table_root + "Description").text):
counter("FULLDISC",references_counter)
file1["CVE ID"] = data[0].text
file1["FULLDISC Url"] = child.find(table_root + "URL").text
file1["FULLDISC Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "MILW0RM:" in child.find(table_root + "Description").text:
counter("MILLWORM",references_counter)
file1["CVE ID"] = data[0].text
file1["MILLWORM Url"] = child.find(table_root + "URL").text
file1["MILLWORM Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(MISC_regex,child.find(table_root + "Description").text):
counter("MISC",references_counter)
file1["CVE ID"] = data[0].text
file1["MISC Url"] = child.find(table_root + "URL").text
file1["MISC Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "MLIST:" in child.find(table_root + "Description").text:
counter("MLIST",references_counter)
file1["CVE ID"] = data[0].text
file1["MLIST Url"] = child.find(table_root + "URL").text
file1["MLIST Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(OVAL_regex,child.find(table_root + "Description").text):
counter("OVAL",references_counter)
file1["CVE ID"] = data[0].text
file1["OVAL Url"] = child.find(table_root + "URL").text
file1["OVAL Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "SGI:" in child.find(table_root + "Description").text:
counter("SGI",references_counter)
file1["CVE ID"] = data[0].text
file1["SGI Url"] = child.find(table_root + "URL").text
file1["SGI Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "SREASON:" in child.find(table_root + "Description").text:
counter("SREASON",references_counter)
file1["CVE ID"] = data[0].text
file1["SREASON Url"] = child.find(table_root + "URL").text
file1["SREASON Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "SUSE:" in child.find(table_root + "Description").text:
counter("SUSE",references_counter)
file1["CVE ID"] = data[0].text
file1["SUSE Url"] = child.find(table_root + "URL").text
file1["SUSE Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "UBUNTU:" in child.find(table_root + "Description").text:
counter("UBUNTU",references_counter)
file1["CVE ID"] = data[0].text
file1["UBUNTU Url"] = child.find(table_root + "URL").text
file1["UBUNTU Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "VUPEN:" in child.find(table_root + "Description").text:
counter("VUPEN",references_counter)
file1["CVE ID"] = data[0].text
file1["VUPEN Url"] = child.find(table_root + "URL").text
file1["VUPEN Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "XF:" in child.find(table_root + "Description").text:
counter("XF",references_counter)
file1["CVE ID"] = data[0].text
file1["XF Url"] = child.find(table_root + "URL").text
file1["XF Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "COMPAQ" in child.find(table_root + "Description").text:
counter("COMPAQ",references_counter)
file1["CVE ID"] = data[0].text
file1["COMPAQ Url"] = child.find(table_root + "URL").text
file1["COMPAQ Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "FREEBSD" in child.find(table_root + "Description").text:
counter("FREEBSD",references_counter)
file1["CVE ID"] = data[0].text
file1["FREEBSD Url"] = child.find(table_root + "URL").text
file1["FREEBSD Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "HP:" in child.find(table_root + "Description").text:
counter("HP",references_counter)
file1["CVE ID"] = data[0].text
file1["HP Url"] = child.find(table_root + "URL").text
file1["HP Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(IBM_regex,child.find(table_root + "Description").text):
counter("IBM",references_counter)
file1["CVE ID"] = data[0].text
file1["IBM Url"] = child.find(table_root + "URL").text
file1["IBM Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "IDEFENSE" in child.find(table_root + "Description").text:
counter("IDEFENSE",references_counter)
file1["CVE ID"] = data[0].text
file1["IDEFENSE Url"] = child.find(table_root + "URL").text
file1["IDEFENSE Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "IMMUNIX" in child.find(table_root + "Description").text:
counter("IMMUNIX",references_counter)
file1["CVE ID"] = data[0].text
file1["IMMUNIX Url"] = child.find(table_root + "URL").text
file1["IMMUNIX Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "ISS" in child.find(table_root + "Description").text:
counter("ISS",references_counter)
file1["CVE ID"] = data[0].text
file1["ISS Url"] = child.find(table_root + "URL").text
file1["ISS Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "JVN" in child.find(table_root + "Description").text:
counter("JVN",references_counter)
file1["CVE ID"] = data[0].text
file1["JVN Url"] = child.find(table_root + "URL").text
file1["JVN Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "L0PHT" in child.find(table_root + "Description").text:
counter("L0PHT",references_counter)
file1["CVE ID"] = data[0].text
file1["L0PHT Url"] = child.find(table_root + "URL").text
file1["L0PHT Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "MANDRAKE" in child.find(table_root + "Description").text:
counter("MANDRAKE",references_counter)
file1["CVE ID"] = data[0].text
file1["MANDRAKE Url"] = child.find(table_root + "URL").text
file1["MANDRAKE Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "MANDRIVA" in child.find(table_root + "Description").text:
counter("MANDRIVA",references_counter)
file1["CVE ID"] = data[0].text
file1["MANDRIVA Url"] = child.find(table_root + "URL").text
file1["MANDRIVA Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(OPENBSD_regex,child.find(table_root + "Description").text):
counter("OPENBSD",references_counter)
file1["CVE ID"] = data[0].text
file1["OPENBSD Url"] = child.find(table_root + "URL").text
file1["OPENBSD Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "SUNALERT" in child.find(table_root + "Description").text:
counter("SUNALERT",references_counter)
file1["CVE ID"] = data[0].text
file1["SUNALERT Url"] = child.find(table_root + "URL").text
file1["SUNALERT Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "SLACKWARE" in child.find(table_root + "Description").text:
counter("SLACKWARE",references_counter)
file1["CVE ID"] = data[0].text
file1["SLACKWARE Url"] = child.find(table_root + "URL").text
file1["SLACKWARE Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "VULNWATCH" in child.find(table_root + "Description").text:
counter("VULNWATCH",references_counter)
file1["CVE ID"] = data[0].text
file1["VULNWATCH Url"] = child.find(table_root + "URL").text
file1["VULNWATCH Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(VIM_regex,child.find(table_root + "Description").text):
counter("VIM",references_counter)
file1["CVE ID"] = data[0].text
file1["VIM Url"] = child.find(table_root + "URL").text
file1["VIM Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "NETBSD" in child.find(table_root + "Description").text:
counter("NETBSD",references_counter)
file1["CVE ID"] = data[0].text
file1["NETBSD Url"] = child.find(table_root + "URL").text
file1["NETBSD Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "MSKB" in child.find(table_root + "Description").text:
counter("MSKB",references_counter)
file1["CVE ID"] = data[0].text
file1["MSKB Url"] = child.find(table_root + "URL").text
file1["MSKB Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif "TRUSTIX" in child.find(table_root + "Description").text:
counter("TRUSTIX",references_counter)
file1["CVE ID"] = data[0].text
file1["TRUSTIX Url"] = child.find(table_root + "URL").text
file1["TRUSTIX Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(NAI_regex,child.find(table_root + "Description").text):
counter("NAI",references_counter)
file1["CVE ID"] = data[0].text
file1["NAI Url"] = child.find(table_root + "URL").text
file1["NAI Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
elif re.search(SUN_regex,child.find(table_root + "Description").text):
counter("SUN",references_counter)
file1["CVE ID"] = data[0].text
file1["SUN Url"] = child.find(table_root + "URL").text
file1["SUN Description"] = child.find(table_root + "Description").text
file1_data.append(file1)
else:
file2["CVE ID"] = data[0].text
file2["Reference Description"] = child.find(table_root + "Description").text
file2["Reference Url"] = child.find(table_root + "URL").text
counter("OTHER",references_counter)
file2_data.append(file2)
In [6]:
def plot_bar_dict(ref_dict, plot_tit, plot_xlab, plot_ylab):
data = {}
data['Entries'] = ref_dict
#saving in dictionary for sorting and visualising
df_data = pd.DataFrame(data).sort_values(by='Entries', ascending=False)
df_data = df_data[:25]
df_data= df_data.sort_values('Entries')
series = df_data.loc[:,'Entries']
#print series
p = figure(width=800, y_range=series.index.tolist(), title=plot_tit)
p.xaxis.axis_label = plot_xlab
p.xaxis.axis_label_text_font_size = '10pt'
p.xaxis.major_label_text_font_size = '10pt'
p.yaxis.axis_label = plot_ylab
p.yaxis.axis_label_text_font_size = '10pt'
p.yaxis.major_label_text_font_size = '6pt'
j = 1
for k,v in series.iteritems():
#Print fields, values, or references_counter_cve100={}
#print (k,v,j)
p.rect(x=v/2.0, y=j, width=abs(v), height=0.2, width_units="data", height_units="data")
j += 1
show(p)
In [7]:
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show
import datetime as dt
def plot_fd_ts_year(year_ip):
df = pd.DataFrame.from_dict(references_counter_fd_ts, orient="index")
df = df.fillna(0)
df.to_csv("check.csv", sep=',', encoding='utf-8')
df.index = pd.to_datetime(df.index, format='%Y%m%d')
df.index.name = 'Date'
df.columns = ['FullDiscCount']
df['Date_col']=df.index
df = df[df['Date_col'].dt.year == year_ip]
df.sort_index(inplace=True)
source = ColumnDataSource(df)
p = figure(x_axis_type="datetime", plot_width=800, plot_height=400, title="Time Series Plot for Full Disclosure as reference")
p.line('Date_col', 'FullDiscCount', source=source)
p.circle('Date_col', 'FullDiscCount', source=source, fill_color="white", size=6)
p.xaxis.axis_label = "Date"
p.xaxis.axis_label_text_font_size = '10pt'
p.xaxis.major_label_text_font_size = '10pt'
p.yaxis.axis_label = 'Number of FULLDISC references'
p.yaxis.axis_label_text_font_size = '10pt'
p.yaxis.major_label_text_font_size = '6pt'
show(p)
def counter_fd_ts(reference):
if not reference in references_counter_fd_ts:
references_counter_fd_ts[reference]=1
else:
references_counter_fd_ts[reference]+=1
def reference_sort_fd_ts(data):
if data[1] is not None:
for child in data[1].findall(table_root+"Reference"):
file6 = {}
if re.search(FULLDISC_regex,child.find(table_root + "Description").text):
counter_fd_ts(str(child.find(table_root + "Description").text[9:18]))
file6["CVE ID"] = data[0].text
file6["FULLDISC Description"] = child.find(table_root + "Description").text
file6["FULLDISC date"] = child.find(table_root + "Description").text[9:18]
file6["FULLDISC Url"] = child.find(table_root + "URL").text
file6_data.append(file6)
In [8]:
def counter_cve100(reference):
if not reference in references_counter_cve100:
references_counter_cve100[reference]=1
else:
references_counter_cve100[reference]+=1
def reference_sort_cve100(data):
#data[1] holds references
if data[1] is not None:
for child in data[1].findall(table_root+"Reference"):
file5 = {}
if re.search(FULLDISC_regex,child.find(table_root + "Description").text):
counter_cve100(data[0].text)
file5["CVE ID"] = data[0].text
file5["FULLDISC Url"] = child.find(table_root + "URL").text
file5["FULLDISC Description"] = child.find(table_root + "Description").text
In [9]:
import collections
references_counter_cve_uqmonths = collections.defaultdict(dict)
def counter_cve_uqmonths(reference, month):
if not month in references_counter_cve_uqmonths[reference]:
references_counter_cve_uqmonths[reference][month]=1
else:
references_counter_cve_uqmonths[reference][month]+=1
def reference_sort_cve_uqmonths(data):
#data[1] holds references
if data[1] is not None:
for child in data[1].findall(table_root+"Reference"):
file5 = {}
if re.search(FULLDISC_regex,child.find(table_root + "Description").text):
counter_cve_uqmonths(data[0].text, str(child.find(table_root + "Description").text[13:15]))
file5["CVE ID"] = data[0].text
file5["FULLDISC Url"] = child.find(table_root + "URL").text
file5["FULLDISC Description"] = child.find(table_root + "Description").text
The function below prints the number of vulnerabilities present in a CVE tree that is obtained from the CVE_XML file parsed in the next module.
In [10]:
#verify write operation into file and perform reference sort
def module_runner(cve_Tree):
print("Vulnerability data count:" + str(len(CVE_tree.findall(table_root+"Vulnerability"))));
v_counter = 0
for vul in CVE_tree.findall(table_root+"Vulnerability"):
#print ("Vulnerability index: " + str(v_counter));
v_counter +=1
reference_sort((vul.find(table_root + "CVE"),vul.find(table_root + "References")))
reference_sort_fd_ts((vul.find(table_root + "CVE"),vul.find(table_root + "References")))
reference_sort_cve100((vul.find(table_root + "CVE"),vul.find(table_root + "References")))
reference_sort_cve_uqmonths((vul.find(table_root + "CVE"),vul.find(table_root + "References")))
The module below performs parsing of all files under a particular folder called data. This allows for flexibility to choose what files we want to filter(either particular years or cumulatively). This module also records values and counts of vulnerability entries, filtered and unfiltered data.
The XML files to be parsed for the same are available for download on the official website for CVE.
In [11]:
#creating list to hold data for file.write into file
file1_data = []
file2_data = []
file6_data = []
count_reference ={}
count_reference_all={}
references_counter_fd_ts={}
references_counter_cve100={}
total = 0
#call module runner to perform parsing
for filename in glob.glob("data/*.xml"):
CVE_tree = ElementTree()
CVE_tree.parse(filename)
module_runner(CVE_tree)
print ("Filtered count: "+ str(len(file1_data)));
print ("Unfiltered count: "+ str(len(file2_data)));
percentage = ((len(file1_data)*1.0/(len(file1_data)+len(file2_data)))*100)
print ("percentage filtered:"+ str(percentage));
write_file('file1.csv', file1_data, header_file1)
write_file('file2.csv', file2_data, header_file2)
In [12]:
#calculating total to formulate percentages to calculate coverage
for key,value in count_reference.items():
value = (value*1.0/total)*100
value = float("{0:.2f}".format(value))
count_reference[key]=value
The plots below shows the number of occurrences of particular type of reference in the all the XML files parsed from the data folder:
In the first plot, the Y-axis in the plot is the type of the reference while the X-axis gives us the number of occurrences of that particular type.
In the second plot, the Y-axis in the plot is the type of the reference while the X-axis gives us the percentage coverage for that particular type.
In [13]:
#creating a histogram for cwe ID types by creating a dictionary
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()
data = {}
data['Entries'] = count_reference_all
#saving in dictionary for sorting and visualising
df_data = pd.DataFrame(data).sort_values(by='Entries', ascending=True)
series = df_data.loc[:,'Entries']
p = figure(width=800, y_range=series.index.tolist(), title="Occurances of reference types")
p.xaxis.axis_label = 'Number of references of a given type in CVE Mitre CVRF'
p.xaxis.axis_label_text_font_size = '10pt'
p.xaxis.major_label_text_font_size = '10pt'
p.yaxis.axis_label = 'Name of the reference type'
p.yaxis.axis_label_text_font_size = '10pt'
p.yaxis.major_label_text_font_size = '6pt'
j = 1
for k,v in series.iteritems():
#Print fields, values, orders
#print (k,v,j)
p.rect(x=v/2, y=j, width=abs(v), height=0.4,
width_units="data", height_units="data")
j += 1
show(p)
In [14]:
#creating a histogram to illustrate coverage in percentage of a particular reference type
data = {}
data['Entries'] = count_reference
#saving in dictionary for sorting and visualising
df_data = pd.DataFrame(data).sort_values(by='Entries', ascending=True)
series = df_data.loc[:,'Entries']
p = figure(width=800, y_range=series.index.tolist(), title="CVE-ID's Reference Coverage Percentage")
p.xaxis.axis_label = 'Percentage coverage of references of a given type in CVE Mitre CVRF'
p.xaxis.axis_label_text_font_size = '10pt'
p.xaxis.major_label_text_font_size = '10pt'
p.yaxis.axis_label = 'Name of the reference type'
p.yaxis.axis_label_text_font_size = '10pt'
p.yaxis.major_label_text_font_size = '6pt'
j = 1
for k,v in series.iteritems():
#Print fields, values, orders
#print (k,v,j)
p.rect(x=v/2, y=j, width=abs(v), height=0.4,
width_units="data", height_units="data")
j += 1
show(p)
In the next section, we focus on the CVE file for the year '2013'.
The modules below focus only on the CVE-IDs with reference type as 'FULL DISCLOSURE'. The reference type is listed in the reference section for a CVE-ID as 'FULLDISC:YYYYMMDD', where YYYYMMDD is the date of the thread in the full disclosure mailing list.
The plot shown below is a time series plot that shows the number of times a particular Full Disclosure conversation was listed as a reference for all the CVE-IDs in the XML file.
In [15]:
input_year=2013
plot_fd_ts_year(input_year)
The plot below helps in narrowing down the analysis by providing a smaller set of CVE-IDs that needs to be focused on. The CVE-IDs with more number of FULLDISC references would be of more interest compared to the ones with a single FULLDISC reference.
The plot displays the top 25 CVE-IDs that have the most number of full disclosure references. As shown in the plot, CVE-2013-1808 has 5 number of full disclosure references.
In [16]:
plot_bar_dict(references_counter_cve100,"CVE-ID's with Full Disclosure as reference","Number of FULLDISC references","CVE-ID")
The plot shown below displays the number of unique months of FULLDISC references for a CVE-ID.
This gives us an idea about the time span during which the full disclosure mailing lists had discussions related to a particular CVE-ID. As all the FULLDISC references are related to the CVE-ID, the content of the mailing list replies can be analyzed to check how the discussions have varied over a period of time.
In [17]:
cve_uqmonths_dict={}
for k, v in references_counter_cve_uqmonths.iteritems():
cve_uqmonths_dict[k]=len(v)
plot_bar_dict(cve_uqmonths_dict,"Histogram of unique number of months in FULLDISC references per CVE-ID","Number of Unique months in FULLDISC references","CVE-ID")
The two plots below gives us insights regarding all the CVE-IDs which were discussed more number of times in a particular month on the full disclosure mailing list.
The first plot displays the average number of FULLDISC references that belong to the same month for a particular CVE-ID whereas the second plot displays the maximum number of FULLDISC references in a month for a particular CVE-ID
In [18]:
cve_months_avg_dict={}
cve_months_max_dict={}
for key, val in references_counter_cve_uqmonths.iteritems():
num_months=len(val)
tot_fulldiscref=0
average_fulldiscref=0
max_fulldiscref=0
for key2, val2 in val.iteritems():
tot_fulldiscref = tot_fulldiscref + val2
if val2>max_fulldiscref:
max_fulldiscref = val2
cve_months_avg_dict[key] = (tot_fulldiscref*1.0/num_months)
cve_months_max_dict[key] = max_fulldiscref
plot_bar_dict(cve_months_avg_dict,"Histogram of avg. number of references per month","Average number of FULLDISC references per month","CVE-ID")
plot_bar_dict(cve_months_max_dict,"Histogram of max. number of references per month","Maximum number of FULLDISC references per month","CVE-ID")