Determining whether a Javascript sample is malicious is not computable (https://en.wikipedia.org/wiki/Computable_function) : we are looking for an algorithm that takes a program (which can be seen as an arbitrary Turing Machine) as an input and whose output is a property of the execution of that program.

If you are unfamiliar with the theory of computability and want ot get an intuitive sense of this, imagine writing a JS sample that non trivially never terminates. A simple while(1){} would not do the trick because it can be trivially proven (without executing it) that it never terminates.

A program terminating depending on the answer to some complex mathematical problem (e.g. finding whether a big number is prime) can not be proven to terminate short of actually solving the problem, the best method for doing so being to actually execute the program.

Therefore, the best way to now if this program will terminate is to execute it, which may never ends. That is why deciding a property about the execution of that program is not computable in the general case.

This does not deter us from trying though, because in practice a program that does not terminate in a few seconds will be interrupted by the browser, and is therefore neither malicious nor begnin, it is non-fonctional. The goal here is to devise some indicator of malignity of a JS sample without even executing it (who wants to execute malicious code ?).

Related works

\cite{likarish2009obfuscated}. Bonne intro, bon blabla, mais ils ont créé un détecteur d'obfuscation plus qu'autre chose. On utilise quand même leur features.

On se limite aux features qu'on peut calculer sans même parser le JS (ne fut-ce que parce qu'on est pas à l'abri d'une attaque sur le parser.

Code


In [48]:
import glob
import string
import re
import numpy as np

# Loading the data
data = []
for js_file in glob.glob('Javascript/*/*'):
    new = {}
    new['name'] = js_file.split('/')[-1]
    new['code'] = open(js_file,'r').read()
    if new['name'][-2:] == 'js':
        if new['name'][-6:] == 'min.js':
            new['nature'] = 'Minified'
            new['color'] = 'b'
        else:
            new['nature'] = 'Normal'
            new['color'] = 'g'
    elif new['name'][-3:] == 'out':
        new['nature'] = 'Malicious'
        new['color'] = 'r'
    data.append(new)

Features


In [49]:
def length(code):
    return len(code)

def nb_lines(code):
    return len(code.split('\n'))

def avg_char_per_line(code):
    return length(code)/nb_lines(code)

def nb_strings(code):
    '''Ugly approximation, no simple way out of this short of actually parsing the JS.'''
    return len(code.split("'"))+len(code.split('"'))

def nb_non_printable(code):
    '''\cite{likarish2009obfuscated} use unicode symbol, but we are more general'''
    return len([x for x in code if not x in string.printable])

hex_octal_re = re.compile('([^A-F0-9]0[0-7]+|0x[A-F0-9]+)')
def hex_or_octal(code):
    '''Ugly as hell, but we dont want to parse'''
    return len(list(hex_octal_re.finditer(code)))

def max_nesting_level(code):
    l = 0
    max_l = 0
    for c in code:
        if c in '({[':
            l+=1
            max_l = l if l > max_l else max_l
        elif c in ')}]':
            l-=1
    return max_l

features = [length, nb_lines, avg_char_per_line, nb_strings, nb_non_printable, hex_or_octal, max_nesting_level]

In [51]:
X = np.array([[f(x['code']) for f in features] for x in data])
X[:30]


Out[51]:
array([[  9.54297000e+05,   2.61810000e+04,   3.64499828e+01,
          9.52700000e+03,   1.21000000e+02,   7.90000000e+01,
          1.70000000e+01],
       [  1.25496000e+05,   2.51000000e+02,   4.99984064e+02,
          2.94000000e+03,   0.00000000e+00,   9.00000000e+00,
          1.70000000e+01],
       [  1.05066000e+05,   2.77900000e+03,   3.78071249e+01,
          4.36000000e+02,   0.00000000e+00,   2.00000000e+00,
          1.00000000e+01],
       [  3.35277000e+05,   9.50400000e+03,   3.52774621e+01,
          1.82300000e+03,   1.11900000e+03,   8.60000000e+01,
          1.40000000e+01],
       [  1.51125000e+05,   5.00000000e+00,   3.02250000e+04,
          1.81500000e+03,   0.00000000e+00,   2.40000000e+01,
          1.30000000e+01],
       [  1.22301700e+06,   9.00000000e+00,   1.35890778e+05,
          2.43140000e+04,   3.00000000e+00,   4.30000000e+01,
          2.60000000e+01],
       [  6.29481000e+05,   1.85660000e+04,   3.39050415e+01,
          6.88600000e+03,   0.00000000e+00,   8.00000000e+00,
          3.10000000e+01],
       [  1.70589900e+06,   5.28740000e+04,   3.22634754e+01,
          1.80480000e+04,   5.00000000e+00,   3.00000000e+00,
          1.70000000e+01],
       [  3.81920000e+05,   1.20000000e+01,   3.18266667e+04,
          1.10760000e+04,   2.00000000e+00,   0.00000000e+00,
          1.30000000e+01],
       [  1.62251200e+06,   5.08620000e+04,   3.19002792e+01,
          1.64790000e+04,   5.00000000e+00,   3.00000000e+00,
          1.50000000e+01],
       [  2.47387000e+05,   9.20600000e+03,   2.68723658e+01,
          2.34100000e+03,   0.00000000e+00,   1.10000000e+01,
          1.80000000e+01],
       [  5.24870000e+04,   1.50300000e+03,   3.49214904e+01,
          8.57000000e+02,   0.00000000e+00,   1.00000000e+00,
          9.00000000e+00],
       [  2.02262000e+05,   7.83000000e+03,   2.58316731e+01,
          4.61700000e+03,   0.00000000e+00,   4.00000000e+01,
          1.70000000e+01],
       [  1.55376000e+05,   6.06900000e+03,   2.56015818e+01,
          2.33500000e+03,   0.00000000e+00,   8.00000000e+00,
          1.10000000e+01],
       [  1.66494300e+06,   5.24610000e+04,   3.17367759e+01,
          1.49700000e+04,   1.20000000e+01,   1.65000000e+02,
          1.80000000e+01],
       [  4.40113000e+05,   3.30000000e+01,   1.33367576e+04,
          7.64100000e+03,   1.00000000e+00,   4.70000000e+01,
          3.50000000e+01],
       [  4.35265000e+05,   1.15300000e+03,   3.77506505e+02,
          9.56700000e+03,   0.00000000e+00,   3.20000000e+01,
          1.00000000e+01],
       [  8.47519000e+05,   2.99380000e+04,   2.83091389e+01,
          1.06890000e+04,   0.00000000e+00,   4.00000000e+01,
          1.10000000e+01],
       [  1.67880000e+04,   2.30000000e+01,   7.29913043e+02,
          1.58200000e+03,   0.00000000e+00,   6.00000000e+00,
          4.00000000e+00],
       [  1.55720000e+04,   3.88700000e+03,   4.00617443e+00,
          1.60000000e+01,   0.00000000e+00,   1.00000000e+00,
          3.00000000e+00],
       [  6.25000000e+03,   1.90000000e+01,   3.28947368e+02,
          1.40000000e+01,   0.00000000e+00,   2.60000000e+01,
          2.00000000e+00],
       [  9.12060000e+04,   2.10000000e+01,   4.34314286e+03,
          4.20000000e+01,   0.00000000e+00,   0.00000000e+00,
          3.00000000e+00],
       [  2.03000000e+02,   5.00000000e+00,   4.06000000e+01,
          4.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          2.00000000e+00],
       [  8.96860000e+04,   2.60000000e+01,   3.44946154e+03,
          1.98000000e+04,   0.00000000e+00,   1.80000000e+01,
          3.00000000e+00],
       [  1.12690000e+04,   2.50000000e+01,   4.50760000e+02,
          1.80000000e+01,   0.00000000e+00,   0.00000000e+00,
          3.00000000e+00],
       [  5.29200000e+03,   1.70000000e+01,   3.11294118e+02,
          1.20000000e+01,   0.00000000e+00,   2.80000000e+01,
          1.00000000e+00],
       [  1.19302000e+05,   1.10000000e+01,   1.08456364e+04,
          1.54320000e+04,   0.00000000e+00,   0.00000000e+00,
          3.00000000e+00],
       [  1.64300000e+03,   5.60000000e+01,   2.93392857e+01,
          4.00000000e+01,   0.00000000e+00,   0.00000000e+00,
          3.00000000e+00],
       [  2.03000000e+02,   5.00000000e+00,   4.06000000e+01,
          4.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          2.00000000e+00],
       [  2.03000000e+02,   5.00000000e+00,   4.06000000e+01,
          4.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          2.00000000e+00]])

In [57]:
#http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#example-manifold-plot-compare-methods-py
from sklearn import manifold
%matplotlib inline
import matplotlib.pylab as plt

n_neighbors = 10
n_components = 2
#Y = manifold.Isomap(n_neighbors, n_components).fit_transform(X)
#Y = manifold.LocallyLinearEmbedding(n_neighbors, n_components,
#                                        eigen_solver='auto').fit_transform(X)
Y = manifold.MDS(n_components, max_iter=100, n_init=1).fit_transform(X)
#Y = manifold.SpectralEmbedding(n_components=n_components,
#                                n_neighbors=n_neighbors).fit_transform(X)
#Y = manifold.TSNE(n_components=n_components, init='pca', random_state=0).fit_transform(X)
plt.scatter(Y[:, 0], Y[:, 1], c=[x['color'] for x in data], alpha=0.2)
for label, x, y in zip([x['name'] for x in data], Y[:, 0], Y[:, 1]):
    if '.js' in label and not ('min.' in lab2el):
        plt.annotate(label ,xy=[x,y])
plt.savefig('toto.pdf')


<matplotlib.figure.Figure at 0x1096316a0>

In [42]:
[[x['name'],hex_or_octal(x['code'])] for x in data[:3]]


Out[42]:
[['angular.js', 79], ['angular.min.js', 9], ['Core.Web.js', 2]]

In [43]:
[x for x in data[-3]['code'] if not x in string.printable]


Out[43]:
[]

On arrive assez facilement à différencier les fichiers JS extrêmement gentils de ceux un peu moins sympa. Le problème est qu'il y a des fichiers pas sympa et des fichiers innocents mélangés au même endroit.

    On va voir de quoi il retourne en enlevant de l'analyse les fichiers trop sympas et facile à détecter, ce qui nous laissera nous concentrer sur les fichiers problématiques.

    Pour ça on crée le répertoire 'hard_js' et le notebook correspondant.