In [1]:
# limit GPU usage, if any, to this GPU
%env CUDA_VISIBLE_DEVICES=1


env: CUDA_VISIBLE_DEVICES=1

In [2]:
import numpy as np
from classifier import common
import os
labels = common.fetch_samples()

from sklearn.model_selection import train_test_split
np.random.seed(123)
y_train, y_test, sha256_train, sha256_test = train_test_split(
    list(labels.values()), list(labels.keys()), test_size=1000)

End-to-end deep learning for malware

So, let's move to "real" end-to-end deep learning, because deep learning does everything better, right? Before we do, there are a few things to note.

  1. Images have a notion of pixel intensity with a natural ordering: black < gray < white. But binaries are made of bytes, some of which represent instructions and some ASCII or Unicode text (depending on placement and context in the file), and byte values have no real logical ordering. So we'll let the model learn its own mapping from byte value to "color", where you can specify how many dimensions make up each "color". The layer that does this is called an embedding layer.
  2. Even hefty GPUs may have a hard time holding lots of embedded binary files in contiguous memory. For this reason, we'll set a cap on the maximum number of bytes we'll read in (hint: the median size of malware on VirusTotal is close to 1 MB), and furthermore, break the file into chunks that the memory manager can conveniently place into contiguous memory on the GPU. (Even though you may have 4 GB of memory on your GPU, you may not be able to hold a 1 MB file with its embedded/"colored" bytes in contiguous memory.) There's a sketch of this capping and chunking right after this list.
  3. Speaking of memory usage, we can further limit memory consumption via the batch size.

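To make the capping and chunking concrete, here's a minimal, hypothetical helper (it is not the project's common.get_file_data, which does the real loading) that caps a file at max_file_length bytes, zero-pads short files, and reshapes the result into (file_chunks, file_chunk_size):

import numpy as np

def bytes_to_chunks(path, max_file_length=2**18, file_chunks=8):
    # illustrative only: cap, pad, and chunk a raw binary for the embedding layer
    file_chunk_size = max_file_length // file_chunks
    with open(path, 'rb') as f:
        raw = f.read(max_file_length)                  # read at most max_file_length bytes
    x = np.frombuffer(raw, dtype=np.uint8)             # raw bytes -> integers in [0, 255]
    x = np.pad(x, (0, max_file_length - len(x)), mode='constant')  # zero-pad short files
    return x.reshape((file_chunks, file_chunk_size))   # chunked layout for the GPU

Note that zero-padding reuses the legitimate byte value 0x00 as the padding value; the embedding layer simply has to cope with that (a loader could instead reserve a distinct padding token).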
But, think of it! End-to-end deep learning for static malware detection. No PE parsing required! No feature engineering required! No work required! Right?

You can find code that defines the end-to-end model architecture at classifier/endtoend.py.
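If you'd rather not open that file just yet, here's one plausible way a model with create_model's signature could be wired up. To be clear, this is a sketch under assumptions, not the actual architecture in endtoend.py: it flattens the chunks back out and convolves the embedded bytes directly, and it hard-codes a small MLP head rather than honoring n_mlp_layers.

from keras.models import Model
from keras.layers import Input, Reshape, Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense

def sketch_model(input_shape, byte_embedding_size=2, input_dropout=0.05,
                 hidden_dropout=0.05, kernel_size=16, n_filters_per_layer=(64, 256, 1024)):
    file_chunks, file_chunk_size = input_shape
    inp = Input(shape=input_shape)                         # integer byte values, chunked
    x = Reshape((file_chunks * file_chunk_size,))(inp)     # flatten chunks for this sketch
    x = Embedding(input_dim=256, output_dim=byte_embedding_size)(x)  # learn byte -> "color"
    x = Dropout(input_dropout)(x)
    for n_filters in n_filters_per_layer:
        # strided convolutions shrink the (very long) sequence while widening the filters
        x = Conv1D(n_filters, kernel_size, strides=kernel_size, activation='relu')(x)
        x = Dropout(hidden_dropout)(x)
    x = GlobalMaxPooling1D()(x)                            # one fixed-length vector per file
    x = Dense(64, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)                # P(malware)
    model = Model(inp, out)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model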


In [3]:
# for this demo, we'll slurp in only the first 256K (2**18) bytes of each file
# for a nice GPU like a Titan X, you should be able to squeeze in > 2MB ...
# ...but warning! that makes training more difficult...a larger haystack in which to find needles
max_file_length = int(2**18) # powers of 2 FTW
file_chunks = 8  # break file into this many chunks
file_chunk_size = max_file_length // file_chunks

batch_size = 8

In [ ]:
# Note: this is a very long-running cell, and in the saved notebook output below
# it may appear that the output is truncated before training completes

# let's train this puppy
from classifier import endtoend
import math
from keras.callbacks import LearningRateScheduler, EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

# create_model(input_shape, byte_embedding_size=2, input_dropout=0.05, hidden_dropout=0.05, kernel_size=16, n_filters_per_layer=[64,256,1024], n_mlp_layers=2 )
model_e2e = endtoend.create_model(input_shape=(file_chunks, file_chunk_size))
train_generator = common.generator(list(zip(sha256_train, y_train)), batch_size, file_chunks, file_chunk_size)
test_generator = common.generator(list(zip(sha256_test, y_test)), 1, file_chunks, file_chunk_size)
training_history = model_e2e.fit_generator(train_generator,
                        steps_per_epoch=math.ceil(len(sha256_train) / batch_size),
                        epochs=100,
                        callbacks=[
                            EarlyStopping( patience=10 ),
                            ModelCheckpoint( 'endtoend.h5', save_best_only=True),
                            ReduceLROnPlateau( patience=5)],
                        validation_data=test_generator,
                        validation_steps=len(sha256_test))


Using TensorFlow backend.
Epoch 1/100
12375/12375 [==============================] - 3953s - loss: 0.7266 - acc: 0.7699 - val_loss: 0.5830 - val_acc: 0.8290
Epoch 2/100
12375/12375 [==============================] - 3875s - loss: 0.6587 - acc: 0.8010 - val_loss: 0.5958 - val_acc: 0.8500
Epoch 3/100
12375/12375 [==============================] - 3812s - loss: 0.6289 - acc: 0.8124 - val_loss: 0.6909 - val_acc: 0.7580
Epoch 4/100
12375/12375 [==============================] - 3784s - loss: 0.6077 - acc: 0.8188 - val_loss: 0.5648 - val_acc: 0.8560
Epoch 5/100
12375/12375 [==============================] - 3787s - loss: 0.5892 - acc: 0.8265 - val_loss: 0.5707 - val_acc: 0.7880
Epoch 6/100
12375/12375 [==============================] - 3787s - loss: 0.5719 - acc: 0.8288 - val_loss: 0.5401 - val_acc: 0.8450
Epoch 7/100
12375/12375 [==============================] - 3801s - loss: 0.5582 - acc: 0.8316 - val_loss: 0.5857 - val_acc: 0.8120
Epoch 8/100
12375/12375 [==============================] - 3908s - loss: 0.5420 - acc: 0.8368 - val_loss: 0.4487 - val_acc: 0.8690
Epoch 9/100
12375/12375 [==============================] - 3912s - loss: 0.5269 - acc: 0.8395 - val_loss: 0.4956 - val_acc: 0.8430
Epoch 10/100
12375/12375 [==============================] - 3806s - loss: 0.5169 - acc: 0.8392 - val_loss: 0.4832 - val_acc: 0.8840
Epoch 11/100
12375/12375 [==============================] - 3786s - loss: 0.5047 - acc: 0.8443 - val_loss: 0.6340 - val_acc: 0.6940
Epoch 12/100
12375/12375 [==============================] - 3789s - loss: 0.4930 - acc: 0.8466 - val_loss: 0.4439 - val_acc: 0.8690
Epoch 13/100
12375/12375 [==============================] - 3787s - loss: 0.4837 - acc: 0.8476 - val_loss: 0.4131 - val_acc: 0.8760
Epoch 14/100
12375/12375 [==============================] - 3791s - loss: 0.4722 - acc: 0.8503 - val_loss: 0.7371 - val_acc: 0.7790
Epoch 15/100
12375/12375 [==============================] - 3785s - loss: 0.4612 - acc: 0.8523 - val_loss: 0.5839 - val_acc: 0.8230
Epoch 16/100
 8288/12375 [===================>..........] - ETA: 1241s - loss: 0.4561 - acc: 0.8543

Notice that the output above is truncated, because the Jupyter notebook client couldn't muster the patience to wait for all the output coming from the kernel. Got bored. Moved along. (shakes fist) Millennials!


In [4]:
from keras.models import load_model
# we'll load the "best" model (lowest validation loss) that was saved by our
# ModelCheckpoint callback; in this run that's the penultimate model, and it's
# not dramatically better than the model we already have in hand
model_e2e = load_model('endtoend.h5')
y_pred = []
for sha256, lab in zip(sha256_test, y_test):
    y_pred.append(
        model_e2e.predict_on_batch(
            np.asarray([common.get_file_data(sha256, lab, max_file_length)]).reshape(
                (-1, file_chunks, file_chunk_size))
        )
    )
common.summarize_performance(np.asarray(y_pred).flatten(), y_test, "End-to-end convnet")


Using TensorFlow backend.
** End-to-end convnet **
ROC AUC = 0.949816612210904
threshold=0.9198635220527649: 0.5020661157024794 TP rate @ 0.009689922480620155 FP rate
confusion matrix @ threshold:
[[511   5]
 [242 242]]
accuracy @ threshold = 0.753
Out[4]:
(0.94981661221090397,
 0.91986352,
 0.0096899224806201549,
 0.50206611570247939,
 array([[511,   5],
        [242, 242]]),
 0.753)

Uggh, really?

Wow, not really that good at all. Looks like my fancy end-to-end model is having a hard time learning from these data.
I guess I need to make my model even more special?
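As an aside, common.summarize_performance is doing the metric bookkeeping above. If you wanted to reproduce numbers in that style yourself, a minimal scikit-learn sketch (hypothetical, not the project's implementation) might look like this:

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, accuracy_score

def summarize(y_true, y_pred, max_fpr=0.01):
    # hypothetical stand-in for common.summarize_performance
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred).flatten()
    fpr, tpr, thresholds = roc_curve(y_true, y_pred)
    mask = fpr <= max_fpr                    # operating points with an acceptably low FP rate
    i = np.argmax(tpr[mask])                 # best TP rate among those points
    threshold = thresholds[mask][i]
    y_hat = (y_pred >= threshold).astype(int)
    print("ROC AUC =", roc_auc_score(y_true, y_pred))
    print("threshold=%g: %g TP rate @ %g FP rate" % (threshold, tpr[mask][i], fpr[mask][i]))
    print("confusion matrix @ threshold:")
    print(confusion_matrix(y_true, y_hat))
    print("accuracy @ threshold =", accuracy_score(y_true, y_hat))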


In [ ]: