Fuzzing campaigns can take time. Unfortunately, operating systems contain thousands of different binaries that should be tested, and there is usually not enough time to analyze them all. Given a very large number of programs and inputs to fuzz (a.k.a. testcases), in this simple tutorial we detail how VDiscover is trained to predict which testcases can be mutated using zzuf to uncover bugs. Such bugs can induce interesting unexpected behaviors like crashes, memory leaks and infinite loops, among others. After the training, VDiscover is tested on completely new programs and detects with reasonable accuracy whether the fuzzer will discover a buggy behavior or not.
A repository including all the code and the trained predictor is available here.
Before starting our fuzzing campaigns, it is very important to define a controlled environment to perform our experiments. We use Vagrant to configure and access the virtual machines in which we run the fuzzing campaigns. For our experiments, we started from a preinstalled Ubuntu image. In particular, we selected the 32-bit version of Ubuntu, since the x86-64 support of VDiscover is not ready yet. After installing Vagrant on the host, an Ubuntu 14.04 virtual machine can be easily fetched by executing:
vagrant init ubuntu/trusty32
vagrant up --provider virtualbox
Now, our brand-new virtual machine is accessible using ssh:
vagrant ssh
A full upgrade is recommended before starting, since we want to find bugs in the latest versions of the installed software:
# apt-get update
# apt-get upgrade
Since the programs included in the image are only the essential ones, we will probably want to install more packages. Obviously, we will need zzuf to perform the fuzzing (another option is to compile it from its repository). We are also interested in obtaining more programs to analyze, focusing on the ones that parse and process different types of files:
# apt-get install zzuf poppler-utils vorbis-tools imagemagick graphviz ...
To discover interesting bugs, we first need to define where the executable files to analyze are located. For instance, we can grab binaries to fuzz from the usual directories (a short enumeration sketch follows the list):
/usr/bin/
/usr/sbin/
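As a minimal sketch (not part of the original scripts; `candidate_binaries` is a hypothetical helper), the executables can be enumerated like this:

```python
import os

def candidate_binaries(dirs=("/usr/bin", "/usr/sbin")):
    """Yield every executable regular file found in the given directories."""
    for d in dirs:
        for entry in os.scandir(d):
            if entry.is_file() and os.access(entry.path, os.X_OK):
                yield entry.path
```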
Also, we need a procedure to extract command-line testcases such that, given a suitable input file, a random mutation can uncover buggy behaviors. The objective of such a procedure is to collect a large number of testcases to train VDiscover. Therefore, we need an end-to-end, fully automatic approach. We utilized several tools and resources:
manfuzzer
To obtain random command lines to execute programs, we used manfuzzer. This Python script produces and executes command lines for fuzzing based on the flags found by running -h, -H, --help and the man page of a program. Our fork of manfuzzer was created to allow the generation of command-line testcases using an input file. After selecting a candidate testcase, it automatically checks whether the program tries to interact with the input file; otherwise, it discards the command-line testcase.
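As an illustration of that first step (a hypothetical re-implementation, not manfuzzer's actual code), a program's candidate flags can be scraped from its help output:

```python
import re
import subprocess

def scrape_flags(program):
    """Run a program with the usual help flags and collect every
    option-like token printed, mimicking how manfuzzer gathers
    candidate flags to combine into random command lines."""
    flags = set()
    for help_arg in ("-h", "-H", "--help"):
        try:
            out = subprocess.run([program, help_arg],
                                 capture_output=True, timeout=5)
        except (subprocess.TimeoutExpired, OSError):
            continue
        text = (out.stdout + out.stderr).decode(errors="replace")
        flags.update(re.findall(r"--?[A-Za-z][A-Za-z0-9-]*", text))
    return flags
```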
input files to mutate
After selecting a large number of testcases, we want to mutate different types of input files in order to uncover interesting bugs. Finding a large variety of files to mutate can be a challenging task, but fortunately the fuzzing project provides a basic but nice collection of them.
zzuf
In the last step of this procedure, we used zzuf, a popular multi-purpose fuzzer, to detect crashes, aborts and other interesting unexpected behaviors. Since the fuzzing campaigns produced by zzuf are based on random mutations, it can require thousands of tries to uncover an interesting bug. Therefore, in order to obtain noticeable results, we fuzz every testcase up to 100,000 times, mutating between 1% and 50% of the input file. If zzuf detects a crash, an abort, a timeout or an out-of-memory condition, the testcase is flagged as buggy and the fuzzing is stopped.
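A sketch of how such a campaign can be driven (the zzuf flags used here are standard, but the limits and the stderr check are assumptions; this is not the exact script we ran):

```python
import subprocess

def fuzz_testcase(cmdline, tries=100000):
    """Fuzz one command-line testcase with zzuf: one run per seed in
    the range, mutating between 1% and 50% of the input bits. Returns
    True if zzuf reported a crash, hang or out-of-memory child."""
    zzuf = ["zzuf",
            "-s", "0:%d" % tries,  # seed range: up to `tries` runs
            "-r", "0.01:0.5",      # mutation ratio between 1% and 50%
            "-M", "512",           # kill children using over 512 MB
            "-U", "10",            # kill children running over 10 s
            "-S"]                  # let crashing signals through
    proc = subprocess.run(zzuf + cmdline, capture_output=True)
    # zzuf stops at the first crash by default and reports it on
    # stderr, e.g. "zzuf[s=42,r=0.25]: signal 11 (SIGSEGV)"; matching
    # that report is an assumption that may need adapting.
    return b"zzuf[" in proc.stderr

# e.g. fuzz_testcase(["pdftotext", "input.pdf"])
```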
Once we have extracted a large number of testcases to fuzz, we need to actually run this procedure to collect enough data to train and test VDiscover. In our experiments, more than 64,000 testcases were selected, but unfortunately the exploration of such a large number of files would take weeks. Instead, a sample of almost 3,000 testcases was analyzed to train and test our tool.
After several days of fuzzing, spanning millions of program executions, we collected and analyzed a large number of testcases in our dataset, which contains the command lines executed and their outcomes after fuzzing the input files (in other words, whether each testcase turned out to be buggy or robust). The results of the fuzzing campaigns, in terms of the percentage of testcases found buggy (26%) or robust (74%), are summarized here:
Now we can start the training phase.
To obtain a prediction for every testcase, we extracted dynamic features using the same procedure and parameters defined in the VDiscover technical report. This feature extraction requires only one execution of the original testcase, which is clearly faster than fuzzing it up to 100,000 times. So, if we manage to obtain a reasonable prediction using these features, the savings in terms of execution time will be remarkable.
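VDiscover's actual features are subsampled sequences of C standard library calls with abstracted arguments (see the technical report); the sketch below is only a simplified stand-in conveying the one-execution idea, tracing the original testcase with ltrace and counting the library calls it makes:

```python
import collections
import subprocess

def bag_of_calls(cmdline, timeout=60):
    """Run the original (unmutated) testcase once under ltrace and
    reduce the trace to counts of C library functions called. A toy
    approximation of VDiscover's dynamic feature extraction."""
    proc = subprocess.run(["ltrace", "-f"] + cmdline,
                          capture_output=True, timeout=timeout)
    counts = collections.Counter()
    for line in proc.stderr.decode(errors="replace").splitlines():
        # ltrace lines look like: [pid 1234] fopen("f", "r") = 0x...
        head = line.split("(", 1)[0].split()
        if head and head[-1].isidentifier():
            counts[head[-1]] += 1
    return counts
```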
After extracting the features from the testcases, we are ready to start learning from them and predicting which ones we have to fuzz. To perform an evaluation, it is extremely important to separate our dataset into two disjoint subsets: a training set, used to fit the predictor, and a test set, used only to evaluate it.
This is a standard procedure in Machine Learning to obtain a sensible measure of the prediction accuracy. Nevertheless, there is a significant detail that can lead to misleading results: there are several testcases for the same program, and zzuf can usually find a crash in more than one of them. VDiscover would then learn to "find" more bugs simply by picking testcases from the same programs it saw during training. Such a strategy would yield no new bugs and is clearly not what we want.
In order to test our predictor in a more realistic setting, we carefully split the dataset into training and test sets, keeping all the testcases of every program either in the training set or in the test set. As a result of this decision, every time we evaluate our predictions we are analyzing previously unseen binaries.
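A minimal sketch of this program-aware split, using scikit-learn's GroupShuffleSplit on toy stand-in data (our actual pipeline may differ):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins: X holds feature vectors, y the buggy (1) / robust (0)
# outcomes, and programs the binary each testcase belongs to.
X = np.random.rand(8, 4)
y = np.array([1, 1, 0, 0, 1, 0, 0, 1])
programs = np.array(["/usr/bin/wget"] * 3
                    + ["/usr/bin/pdftotext"] * 3
                    + ["/usr/bin/ogginfo"] * 2)

splitter = GroupShuffleSplit(n_splits=20, test_size=0.3, random_state=0)
for train_idx, test_idx in splitter.split(X, y, groups=programs):
    # every program lands entirely in train or entirely in test, so
    # the predictor is always evaluated on previously unseen binaries
    assert set(programs[train_idx]).isdisjoint(programs[test_idx])
```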
After running 20 independent experiments (i.e., reshuffling the training and test subsets each time), the average recall scores are approximately 92% on robust testcases and 42% on buggy ones.
As these results show, VDiscover is quite effective at detecting testcases that uncover no bugs, but much less so at spotting the interesting ones. Despite this imbalance, our tool can be very useful to save resources by discarding testcases that are probably robust to fuzzing.
In fact, we can estimate the reduction in the effort needed to discover new buggy testcases. Recalling the percentage of testcases found buggy (26%) and robust (74%), we can compute the percentage of all testcases that our tool flags as potentially buggy using a weighted average:
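Concretely, the flagged fraction is the sum of true positives (buggy testcases correctly flagged) and false positives (robust testcases wrongly flagged), using the recall scores above:

$$\underbrace{0.42 \times 26\%}_{\text{true positives}} + \underbrace{(1 - 0.92) \times 74\%}_{\text{false positives}} = 10.92\% + 5.92\% = 16.84\%$$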
which we can visualize here:
Consequently, by analyzing only the 16.84% of our test set flagged by VDiscover, we can detect 42% of the buggy testcases. Without the help of our tool, a fuzzing campaign would select testcases to mutate at random, and would thus need to analyze 42% of the testcases to detect 42% of the buggy ones. Therefore, in terms of our experimental results, we can detect the same number of buggy testcases roughly 2.5 times faster ($\approx$ 42%/16.84%).
Finally, we can look at a particular example of how VDiscover predicts. In this simple experiment, we selected only the wget testcases for testing, while the rest of the dataset was used for training. It is interesting to note that there is no program similar to wget in the training set. We summarize the results in the following table, where 1 marks a buggy testcase and 0 a robust one:
| testcase | predicted | expected |
|---|---|---|
| /usr/bin/wget:21598 | 1 | 1 |
| /usr/bin/wget:21558 | 1 | 1 |
| /usr/bin/wget:21600 | 1 | 1 |
| /usr/bin/wget:21288 | 0 | 0 |
| /usr/bin/wget:21329 | 1 | 0 |
| /usr/bin/wget:21442 | 1 | 1 |
| /usr/bin/wget:21295 | 0 | 0 |
| /usr/bin/wget:21512 | 1 | 1 |
| /usr/bin/wget:21515 | 1 | 1 |
| /usr/bin/wget:21552 | 1 | 1 |
| /usr/bin/wget:21584 | 1 | 1 |
As you can see, VDiscover misclassifies only one testcase (a single false positive, /usr/bin/wget:21329). Anyone interested in this dataset can download the trained model (created using the complete dataset) from here.