Filtering and Ranking Security Bug Reports

This application builds models that predict security bug reports. The repository includes bug reports from five projects, along with code, experiments, and results. This document explains how to use the application from the command line. We start with some information about the bug reports and then proceed as follows:

generating data matrices for experiments;
calculating tf-idf as a measure of the popularity of security-related keywords;
predicting whether bug reports are security or non-security reports; and
ranking bug reports.

Users can clone the farsec repo and run the examples in order, using the commands listed under each "Run:" line. The user must also create an output directory in the same location as farsec-v1.jar. All data and results are written to the output directory.

About Bug Reports

Bug Report Sources

Characteristics About Bug Reports

Project	Domain	Start ID	End ID	Start ID Date	End ID Date	BRs	SBRs	SBRs(%)
Chromium	Web browser called Chrome.	2	46313	Aug 30 2008	Jun 11 2010	44885	191	0.4
Wicket	Component-based web application framework for the Java programming language.	12	5753	Oct 20 2006	Nov 9 2014	1000	10	1
Ambari	Hadoop management web UI backed by its RESTful APIs.	12	6793	Sep 26 2011	Aug 8 2014	1000	29	3
Camel	A rule-based routing and mediation engine.	72	6767	Jul 8 2007	Sep 18 2013	1000	32	3
Derby	A relational database management system.	5	6742	Sep 28 2004	Sep 17 2014	1000	88	9
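
The SBRs(%) column appears to be the share of security bug reports among all bug reports, rounded. The following Python sketch recomputes it from the BRs and SBRs columns; the function name sbr_percentage is illustrative and not part of the application.

```python
# Counts copied from the table above: project -> (BRs, SBRs).
counts = {
    "Chromium": (44885, 191),
    "Wicket": (1000, 10),
    "Ambari": (1000, 29),
    "Camel": (1000, 32),
    "Derby": (1000, 88),
}


def sbr_percentage(total_reports, security_reports):
    """Return the percentage of bug reports that are security bug reports."""
    return 100.0 * security_reports / total_reports


for project, (brs, sbrs) in counts.items():
    print(f"{project}: {sbr_percentage(brs, sbrs):.2f}%")
```

The unrounded values make the class imbalance explicit: in every project, fewer than one report in ten is a security bug report.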

Bug Reports

For each project, the data is partitioned into train and test sets. These are further partitioned into security bug reports (SBRs) and non-security bug reports (NSBRs). Each bug report is converted into a file and stored in the appropriate directory (e.g., -sbr-new, -nsbr-new, -sbr-old and -nsbr-old). An example of the result is shown in /resources/data1/wicket/
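
To check a project's partitions after cloning, a short Python sketch such as the one below can count the files in each directory. It assumes only that the directory names end in the four suffixes listed above; the function name count_reports and the exact directory prefixes are assumptions, not something the application defines.

```python
from pathlib import Path

# Directory-name suffixes taken from the description above.
SUFFIXES = ("-sbr-new", "-nsbr-new", "-sbr-old", "-nsbr-old")


def count_reports(project_dir):
    """Count files in each partition directory under project_dir.

    A directory is matched when its name ends in one of SUFFIXES.
    Returns a dict mapping suffix -> number of files found.
    """
    totals = {suffix: 0 for suffix in SUFFIXES}
    for entry in Path(project_dir).iterdir():
        if not entry.is_dir():
            continue
        for suffix in SUFFIXES:
            if entry.name.endswith(suffix):
                totals[suffix] += sum(1 for f in entry.iterdir() if f.is_file())
    return totals
```

For example, count_reports("resources/data1/wicket") could be called from the repository root, assuming the example data lives at that relative path.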

Data Matrices

Generate Data Matrices for Experiments

Given the bug reports for each project, we find security-related keywords. We then use these keywords to generate train and test sets from the bug reports. We generate these matrices for two types of prediction experiments: the first is within-project prediction (WPP), and the second is transfer prediction (TPP), which pairs a target project with a different source project.
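
As a rough illustration of what a keyword-based data matrix looks like, the Python sketch below turns bug report text into rows of keyword counts followed by a class label. The tokenization, the use of raw counts, and the names tokenize and build_matrix are all assumptions made for this example; the matrix format produced by farsec-v1.jar may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_matrix(reports, keywords):
    """Return one row per report: keyword counts followed by the label.

    reports  -- list of (text, label) pairs, label 1 = SBR, 0 = NSBR
    keywords -- ordered list of keywords used as feature columns
    """
    rows = []
    for text, label in reports:
        counts = Counter(tokenize(text))
        rows.append([counts[k] for k in keywords] + [label])
    return rows


# Hypothetical toy data, not taken from any of the five projects.
reports = [
    ("Password leaked in log; password shown in plain text", 1),
    ("Button misaligned on settings page", 0),
]
keywords = ["password", "leaked", "page"]
matrix = build_matrix(reports, keywords)
print(matrix)
```

With -n 100, the matrix would have 100 keyword columns instead of the three used here.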

Example: WPP for wicket

Options:

-o output directory name (mkdir)
-p project name
-n number of keywords/features

Run:

java -jar farsec-v1.jar -o wicket-data1 -k data1 -p wicket -n 100 --wpp

Example: TPP for wicket (target) with ambari (source)

Options:

-o output directory name (mkdir)
-c source project name
-p project name
-n number of keywords/features

Run:

java -jar farsec-v1.jar -o wicket-data1 -k data1 -c ambari -p wicket -n 100 --tpp

Generate Data Matrices for Experiments with Filtering

Filter the WPP and TPP training data sets. The filtered versions are denoted WPPx and TPPx, respectively.

Example: WPPx and TPPx for wicket

Options:

-o output directory name (mkdir)
-c source project name
-p project name
-n number of keywords/features

Run:

java -jar farsec-v1.jar -o wicket-data1 -c ambari -p wicket --wppx
java -jar farsec-v1.jar -o wicket-data1 -c ambari -p wicket --tppx

Calculate Tf-idf

Generate tf-idf Files

We use tf-idf as a proxy for the popularity of specific keywords present in different sources. These sources are security bug reports and non-security bug reports, before and after filtering. The result is saved in a CSV file.
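
For readers unfamiliar with tf-idf, the Python sketch below computes the textbook variant: term frequency within a document multiplied by the log of the inverse document frequency. The exact weighting used by farsec-v1.jar is not specified in this document, so treat the formula, the function name tf_idf, and the toy documents as assumptions.

```python
import math
from collections import Counter


def tf_idf(documents, keyword):
    """Return one tf-idf score per document for the given keyword.

    documents -- list of token lists
    tf  = occurrences of keyword / number of tokens in the document
    idf = log(number of documents / number of documents containing keyword)
    """
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if keyword in doc)
    if doc_freq == 0:
        return [0.0] * n_docs
    idf = math.log(n_docs / doc_freq)
    scores = []
    for doc in documents:
        tf = Counter(doc)[keyword] / len(doc) if doc else 0.0
        scores.append(tf * idf)
    return scores


# Hypothetical tokenized bug reports.
docs = [
    ["xss", "attack", "xss", "page"],
    ["page", "layout"],
    ["page", "render", "slow", "fix"],
]
scores_xss = tf_idf(docs, "xss")
scores_page = tf_idf(docs, "page")
print(scores_xss, scores_page)
```

In this toy example "page" occurs in every document, so its idf is zero and it scores zero everywhere, while "xss" scores above zero only in the first document.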

Example for wicket

Options:

-o output directory name (mkdir)
-p project name
-n number of keywords/features

Run:

java -jar farsec-v1.jar -o wicket-data1 -p wicket -n 100 --tfidf

Make Predictions

We build prediction models using the following machine learning algorithms:

Logistic Regression
Naive Bayes
Random Forest
K-Nearest Neighbor
Multilayer Perceptron

Bug reports in the test set are then predicted as security or non-security reports.

Options:

-o output directory name (mkdir)
-c source project name
-p project name
-n number of keywords/features

Run:

java -jar farsec-v1.jar -o wicket-data1 -c ambari -p wicket -s wpp --predict
java -jar farsec-v1.jar -o wicket-data1 -c ambari -p wicket -s wppx --predict
java -jar farsec-v1.jar -o wicket-data1 -c ambari -p wicket -s tpp --predict
java -jar farsec-v1.jar -o wicket-data1 -c ambari -p wicket -s tppx --predict
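
The four prediction runs differ only in the value passed to -s, so they can be generated in a loop. The Python sketch below assembles the same command lines shown above using only the flags that appear in them; the helper name predict_command is illustrative, and the script prints the commands rather than running them.

```python
def predict_command(output_dir, source, target, setting, jar="farsec-v1.jar"):
    """Build the argument list for one --predict run."""
    return ["java", "-jar", jar, "-o", output_dir, "-c", source,
            "-p", target, "-s", setting, "--predict"]


# The four values of -s used in the commands above.
settings = ["wpp", "wppx", "tpp", "tppx"]
commands = [predict_command("wicket-data1", "ambari", "wicket", s)
            for s in settings]

for cmd in commands:
    print(" ".join(cmd))
```

To execute a command instead of printing it, each list can be passed to subprocess.run(cmd, check=True) from the directory that contains farsec-v1.jar.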