Browser Fingerprint Exercise

Is my network traffic lying to me? Most malware authors don't seem to spend much effort trying to blend into network traffic. I'm pretty sure the reason is that they don't need to. By identifying legitimate HTTP requests based on the structure of browser requests, we may be able to identify malicious traffic more easily. This notebook focuses on some ways to gather legitimate browser requests, understand them, and use that data to find non-legitimate requests.

Disclaimer: This exercise in particular is super 'rough'; especially towards the end, we just kinda threw stuff at the wall to see what would stick. Please take the material in the notebook as purely experimental :).

See: https://github.com/ClickSecurity/data_hacking/issues/9

All Code and IPython Notebooks for this talk http://clicksecurity.github.io/data_hacking

Tools:

  • Bro Network Security Monitor (http://www.bro.org)
    Bro provides a comprehensive platform for network traffic analysis. Well grounded in more than 15 years of research, Bro has successfully bridged the traditional gap between academia and operations since its inception.

  • IPython: A mad scientist notebook! (http://ipython.org)
    • What did you do?
    • How did you do it?
    • Can I reproduce it?
    • Easy to share:
      • [NB Viewer](http://nbviewer.ipython.org)
      • [Reddit IPython](http://www.reddit.com/r/ipython)

  • Pandas: Python Data Analysis Library (http://pandas.pydata.org)
    • A fast and efficient DataFrame object
    • Great set of IO Tools
    • Fantastic handling of missing data
    • Flexible reshaping and pivoting
    • Slicing, indexing, and subsetting

Contributions:

  • Wireshark101 (Laura Chappell) http://wiresharkbook.com/101_supplements/wireshark101files.zip
  • Contagio Malware Dump - CrimeWare PCAPs (http://contagiodump.blogspot.com/2013/04/collection-of-pcap-files-from-malware.html)

Approach:

  • Exploration and Understanding
  • Some Simple Statistics
  • Similarity Generation on Sparse Data
  • Hierarchical Clustering
  • Automatic Regular Expression Generation


In [1]:
import pandas as pd
pd.__version__


Out[1]:
'0.13.1'

In [41]:
import numpy as np
np.__version__


Out[41]:
'1.8.0'

In [42]:
# Just some plotting defaults
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 14.0
plt.rcParams['figure.figsize'] = 12.0, 8.0

Read in the Data

We have a nice Bro log reader that takes any Bro log output and gives you a generator of dictionaries (an efficient Python data structure). The output can be used directly or passed straight to a pandas DataFrame constructor.


In [55]:
# Create a BRO log file reader and pull from the logfile
import bro_log_reader
bro_log = bro_log_reader.BroLogReader()
headers = bro_log.read_log('data/http_headers.log')
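
For readers who don't have the bro_log_reader module at hand, here is a minimal sketch of the "generator of dictionaries" idea. It is not the module's actual implementation: read_bro_log is a hypothetical helper, the field names are borrowed from the dataframe columns shown further down, and the sample rows are made up. It only assumes the usual Bro ASCII log layout (tab-separated rows, with a `#fields` metadata line naming the columns).

```python
import io

def read_bro_log(handle):
    """Yield one dict per data row of a tab-separated Bro-style log."""
    fields = None
    for line in handle:
        line = line.rstrip('\n')
        if line.startswith('#fields'):
            # The '#fields' metadata line names the columns
            fields = line.split('\t')[1:]
        elif line and not line.startswith('#') and fields:
            # Data row: pair each value with its column name
            yield dict(zip(fields, line.split('\t')))

# Made-up sample log content, for illustration only
sample = io.StringIO(
    u'#separator \\x09\n'
    u'#fields\tts\torigin\tuseragent\n'
    u'1333128777.382264\tclient\tMozilla/4.0\n'
    u'1333128777.382264\tserver\tNA\n'
    u'#close\t2012-03-30-17-33-00\n')
rows = list(read_bro_log(sample))
# rows is a list of two dicts keyed by 'ts', 'origin' and 'useragent'
```

Because the reader is a generator, rows are produced lazily, which is why it can be handed straight to a DataFrame constructor or consumed one row at a time.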

In [56]:
# Nice, so lets look at some of the outputs by tossing them into a pandas dataframe
dataframe = pd.DataFrame(headers)

In [57]:
# What do we have
print 'Number of Rows: %d   Columns:%d' % (dataframe.shape[0], dataframe.shape[1])
dataframe.head()


Number of Rows: 4576   Columns:4
Out[57]:
header_events_json origin ts useragent
0 [{"ACCEPT":"*\/*"},{"ACCEPT-LANGUAGE":"en-US"}... client 2012-03-30 17:32:57.382264 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ...
1 [{"CACHE-CONTROL":"no-cache"},{"DATE":"Fri, 30... server 2012-03-30 17:32:57.382264 NA
2 [{"ACCEPT":"*\/*"},{"ACCEPT-LANGUAGE":"en-US"}... client 2012-03-30 17:32:57.382264 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ...
3 [{"CACHE-CONTROL":"no-cache"},{"DATE":"Fri, 30... server 2012-03-30 17:32:57.382264 NA
4 [{"ACCEPT":"*\/*"},{"ACCEPT-LANGUAGE":"en-US"}... client 2012-03-30 17:32:57.382264 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ...

5 rows × 4 columns

Process the Data

We're going to use some nice functionality in the Pandas dataframe to process our data:

- We have both client and server events, just keep the client events for this exercise
- Transform the complicated user-agent string into something more manageable (short-agent)
- Generate a 'feature vector' from the header keys

In [58]:
# Okay so we're only interested in client header requests for this exercise
dataframe = dataframe[dataframe['origin']=='client']

In [59]:
# Okay we also want to process the header events (that are in a JSON blob) 
# into a header feature vector (just pulling 'keys' not values).
import json
def make_header_features(json_header_info_series):
    header_features = []
    for header_info in json_header_info_series:
        try:
            header_list = json.loads(unicode(header_info, 'utf8'))
            features = [item.keys()[0] for item in header_list]
        # There are some lines w/no features
        except Exception as e:
            features = ''
        header_features.append(features)
    return header_features

# Create a nicely formatted feature vector and a string representation
dataframe['feature_vector'] = make_header_features(dataframe['header_events_json'])
dataframe['features'] = dataframe['feature_vector'].map(lambda x: ':'.join(x))
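
To make the transformation concrete, here is a small standalone sketch of what happens to a single row. The JSON blob is a made-up sample in the same shape as header_events_json (a list of one-key objects), and header_keys is a hypothetical helper name, not part of the notebook's code.

```python
import json

def header_keys(header_events_json):
    """Return the header names, in request order, from one JSON blob."""
    header_list = json.loads(header_events_json)
    # Each element is a single-key dict such as {"ACCEPT": "*/*"}
    return [list(item.keys())[0] for item in header_list]

# Made-up sample row, for illustration only
sample = r'[{"ACCEPT":"*\/*"},{"ACCEPT-LANGUAGE":"en-US"},{"USER-AGENT":"Mozilla/4.0"},{"HOST":"example.com"}]'
feature_vector = header_keys(sample)
features = ':'.join(feature_vector)
# feature_vector == ['ACCEPT', 'ACCEPT-LANGUAGE', 'USER-AGENT', 'HOST']
# features == 'ACCEPT:ACCEPT-LANGUAGE:USER-AGENT:HOST'
```

Only the header names and their order survive; the header values are deliberately thrown away.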

Short Agents?

So what are we doing with the user-agent strings, how are we doing it and why are we doing it?

The user-agent strings are verbose, with lots of information and variety based on agent versions, platforms, layout engines, linked DLLs, builds, etc. The logic behind short-agent strings is to capture the essence of what the agent IS. For instance, this user-agent string becomes this short-agent:

- User-agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.3; SRC 2.7.1 E1; MS-RTC LM 8; BRI/2; BOIE8;ENUSMSNIP)

- short-agent: mozilla/4.0:msie:7.0:windows:nt:5.1:trident/4.0:n1.0:n1.1:n2.0:n3.0:n3.5 

In [60]:
# Making shorter agent names based on information from
# http://msdn.microsoft.com/library/ms537503.aspx
import re
def replace_stuff(m):
    return 'n'+m.group(1) if '.net clr' in m.group() else ''
def short_agent_names(useragent_series, resolution=12):
    short_agent_list = []
    # Note: the dot between the .NET CLR version digits is escaped so it only matches a literal '.'
    excludes = re.compile(r',|;|\(|\)|compatible|\.net clr ([0-9]\.[0-9])[^;]*;|khtml,|like')
    for useragent in useragent_series:
        processed_user_agent = re.sub(excludes, replace_stuff, useragent.lower()).strip()
        short_agent = ':'.join(processed_user_agent.split()[:resolution])
        short_agent_list.append(short_agent)
    return short_agent_list
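
As a quick check of the logic, the sketch below applies the same substitution to one user-agent string. It is a condensed, single-string variant (shorten is a hypothetical name) that uses the notebook's regular expression with the version dot escaped; the user-agent is a shortened form of the example given earlier.

```python
import re

EXCLUDES = re.compile(r',|;|\(|\)|compatible|\.net clr ([0-9]\.[0-9])[^;]*;|khtml,|like')

def shorten(useragent, resolution=12):
    """Collapse one user-agent string into a ':'-joined short agent."""
    def replace_stuff(m):
        # Keep only 'n<major.minor>' for .NET CLR tokens, drop everything else that matched
        return 'n' + m.group(1) if '.net clr' in m.group() else ''
    cleaned = EXCLUDES.sub(replace_stuff, useragent.lower()).strip()
    return ':'.join(cleaned.split()[:resolution])

ua = ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; '
      '.NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.50727; '
      '.NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.3)')
short = shorten(ua)
# short == 'mozilla/4.0:msie:7.0:windows:nt:5.1:trident/4.0:n1.0:n1.1:n2.0:n3.0:n3.5'
```

The resolution argument caps the number of tokens kept, so trailing add-on tokens such as InfoPath.3 fall off the end at the default of 12.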

In [61]:
# Generate shorter agent names
dataframe['short_agent'] = short_agent_names(dataframe['useragent'])
# Remove any 'na' agents
dataframe = dataframe.replace('na',np.nan)
dataframe = dataframe.dropna()

Let's look at the Data

We're going to use some nice functionality in the Pandas dataframe to look at our processed data:

- We can use groupby on the dataframe to see the different header request keys for various agents
- Count the number of distinct header sequence permutations per agent (MSIE vs. everything else)
- Flip the groupby around to see which agents share a given header sequence

In [62]:
# Okay lets exercise some of the pandas dataframe functionality
dataframe['count'] = 1
agent_group_df = dataframe.groupby(['short_agent','features']).sum()
agent_group_df.head(20)


Out[62]:
count
short_agent features
memeo:autobackup:/4.60.0.7923:/platform=1 ACCEPT-LANGUAGE:ACCEPT:USER-AGENT:HOST:CONNECTION 2
microsoft-cryptoapi/6.1 CACHE-CONTROL:CONNECTION:ACCEPT:IF-MODIFIED-SINCE:IF-NONE-MATCH:USER-AGENT:HOST 2
CACHE-CONTROL:CONNECTION:ACCEPT:IF-MODIFIED-SINCE:USER-AGENT:HOST 3
CONNECTION:ACCEPT:IF-MODIFIED-SINCE:IF-NONE-MATCH:USER-AGENT:HOST 2
CONNECTION:ACCEPT:USER-AGENT:HOST 3
mozilla/4.0 USER-AGENT:HOST 3
USER-AGENT:HOST:IF-MODIFIED-SINCE:IF-NONE-MATCH:CONNECTION 3
mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322 ACCEPT:ACCEPT-LANGUAGE:XXXXXXXXXXXXXXX:USER-AGENT:HOST:CONNECTION 1
mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0 ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION 1
ACCEPT:ACCEPT-LANGUAGE:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE 4
mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION 2
ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION 15
ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE 12
ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:CONTENT-TYPE:CONTENT-LENGTH:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:CACHE-CONTROL:COOKIE 1
ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:CONTENT-TYPE:X-VERIFY:CONTENT-LENGTH:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:CACHE-CONTROL 1
ACCEPT:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION 2
ACCEPT:ACCEPT-LANGUAGE:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION 5
ACCEPT:ACCEPT-LANGUAGE:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE 6
ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:COOKIE:CONNECTION:HOST 1
ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION 77

20 rows × 1 columns


In [63]:
# Now lets get the number of different header sequence permutations per agent
agent_counts = agent_group_df.count(level=0)

# Looks like MSIE agents have a higher number of permutations than all the other stuff
# So we 'groupby' a conditional statement (do you have msie in your agent string)
agent_types = agent_counts.groupby(by=lambda x: 'msie' if 'msie' in x else 'other')
agent_types.head(20)


Out[63]:
count
short_agent
other memeo:autobackup:/4.60.0.7923:/platform=1 1
microsoft-cryptoapi/6.1 4
mozilla/4.0 2
msie mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322 1
mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0 2
mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 12
mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 17
mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0 4
mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06 3
other mozilla/5.0:windows:nt:6.1:wow64:rv:12.0:gecko/20100101:firefox/12.0 2
mozilla/5.0:windows:nt:6.1:wow64:rv:14.0:gecko/20100101:firefox/14.0.1 5
mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0 25
mozilla/5.0:windows:u:windows:nt:6.1:en-us:rv:1.9.2.18:gecko/20110614:firefox/3.6.18 4
nis/19.9.0.9:mid/{grpgbpdjzi9qbdsno/32ukaorrc}:sid/fhkuuaaaaaa 1
nis/19.9.0.9:mid/{grpgbpdjzi9qbdsno/32ukaorrc}:sid/fhkuuaaaaaa:lue/1.8.2.10:windows6.1sp1.0x64enu 1
shasta 1
shockwave:flash 6

17 rows × 1 columns
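
The two groupby steps above boil down to simple bookkeeping, which may be easier to follow in plain Python. This is only an illustrative sketch on made-up (short_agent, features) pairs, not a replacement for the pandas calls.

```python
from collections import Counter, defaultdict

# Made-up (short_agent, features) pairs, for illustration only
rows = [
    ('mozilla/4.0', 'USER-AGENT:HOST'),
    ('mozilla/4.0', 'USER-AGENT:HOST'),
    ('mozilla/4.0', 'USER-AGENT:HOST:CONNECTION'),
    ('shasta', 'HOST:USER-AGENT'),
    ('mozilla/4.0:msie:8.0', 'ACCEPT:USER-AGENT:HOST'),
]

# Step 1: how often does each (agent, header sequence) pair occur?
pair_counts = Counter(rows)

# Step 2: how many distinct header sequences does each agent use?
permutations = defaultdict(set)
for agent, features in rows:
    permutations[agent].add(features)
permutations_per_agent = {agent: len(seqs) for agent, seqs in permutations.items()}

# Step 3: split those counts into 'msie' and 'other' agents
by_type = defaultdict(list)
for agent, n in permutations_per_agent.items():
    by_type['msie' if 'msie' in agent else 'other'].append(n)
```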


In [64]:
# Get some quick descriptive stats and plot it!
fig, ax = plt.subplots(subplot_kw={'axisbg':'#EEEEE5'})
ax.grid(color='lightgrey', linestyle='solid')
agent_types.boxplot(False)


Out[64]:
[Figure: box plots of the number of header sequence permutations per agent, one box for the 'msie' group and one for the 'other' group]

In [65]:
# Now lets flip the group by around
features = dataframe[['short_agent','features','count']].groupby(['features','short_agent']).sum()
print features.shape
features.head(20)


(91, 1)
Out[65]:
count
features short_agent
ACCEPT-LANGUAGE:ACCEPT:USER-AGENT:HOST:CONNECTION memeo:autobackup:/4.60.0.7923:/platform=1 2
ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 2
mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 5
mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0 1
mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06 3
ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 9
ACCEPT:ACCEPT-ENCODING:USER-AGENT:IF-MODIFIED-SINCE:HOST:CONNECTION:COOKIE mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 8
ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0 1
mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 15
ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 12
ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:IF-MODIFIED-SINCE:HOST:CONNECTION mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 1
ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:CONTENT-TYPE:CONTENT-LENGTH:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:CACHE-CONTROL:COOKIE mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 1
ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:CONTENT-TYPE:X-VERIFY:CONTENT-LENGTH:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:CACHE-CONTROL mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 1
ACCEPT:ACCEPT-LANGUAGE:REFERER:X-SVN-REV:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 1
ACCEPT:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 2
mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06 2
ACCEPT:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION:COOKIE mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 5
mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0 2
ACCEPT:ACCEPT-LANGUAGE:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 5
ACCEPT:ACCEPT-LANGUAGE:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0 4

20 rows × 1 columns

Okay...so after looking at the data we have a better 'feel' for it... but now what?

Perhaps we should try a different approach.

  1. Compute similarities between all header sequences using LSH: Unlike conventional hash functions, the goal of LSH (Locality Sensitive Hashing) is to maximize the probability of a "collision" between similar items rather than to avoid collisions.
  2. Use those similarities as the basis of a Hierarchical Clustering Algorithm: Single-linkage clustering is one of several methods for agglomerative hierarchical clustering.
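
To give a feel for step 1, here is a rough MinHash-plus-banding sketch. It is not the lsh_sims implementation: the function names are hypothetical, MD5 with a seed prefix is just a convenient stand-in for a family of hash functions, and the defaults simply mirror the num_hashes/lsh_bands/lsh_rows values used later in this notebook. It assumes non-empty token lists.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def minhash_signature(tokens, num_hashes=20):
    """One min-hash value per seeded hash function over the token set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(('%d:%s' % (seed, tok)).encode('utf8')).hexdigest(), 16)
            for tok in set(tokens)))
    return signature

def lsh_candidate_pairs(sequences, num_hashes=20, bands=20, rows=1):
    """Return index pairs whose signatures collide in at least one band."""
    buckets = defaultdict(set)
    for idx, seq in enumerate(sequences):
        sig = minhash_signature(seq, num_hashes)
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

seqs = [
    ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE'],
    ['COOKIE', 'ACCEPT', 'USER-AGENT', 'HOST'],   # same set, different order
    ['X-ONE', 'X-TWO'],                           # nothing in common
]
pairs = lsh_candidate_pairs(seqs)
# (0, 1) is a candidate pair; (0, 2) is not
```

Only the candidate pairs then need an exact distance computation, which is the point of hashing similar items into the same bucket.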

The LSH Sims python class has two distance metrics

1) Jaccard Index: a set based distance metric (overlaps in sets of elements)

2) Levenshtein Distance: based on the edit distance of the elements (so order matters).
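
As a point of comparison for the set-based metric, the sketch below computes a plain Jaccard index on header-key lists. The jaccard function here is a generic textbook version written for illustration, not the one inside the LSH Sims class.

```python
def jaccard(x, y):
    """Size of the intersection divided by size of the union of two token sets."""
    sx, sy = set(x), set(y)
    if not sx and not sy:
        return 1.0  # convention: two empty sets are identical
    return len(sx & sy) / float(len(sx | sy))

a = ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE']
b = ['ACCEPT', 'USER-AGENT', 'HOST']
c = ['ACCEPT', 'USER-AGENT', 'DORSEYS-MOM']
d = ['COOKIE', 'ACCEPT', 'USER-AGENT', 'HOST']

# jaccard(a, b) == 0.75, jaccard(b, c) == 0.5, jaccard(a, d) == 1.0
```

Note that a and d score a perfect 1.0 because Jaccard ignores order, which is exactly the information an order-sensitive metric keeps.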

What the F&*# is a Levenshtein!?

"The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other." Levenshtein Distance (WikiPedia)

In this case we are using Levenshtein not on individual letters in strings but tokens in sequences.

Examples:

a = ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE']
b = ['ACCEPT', 'USER-AGENT', 'HOST']
c = ['ACCEPT', 'USER-AGENT', 'DORSEYS-MOM']
d = ['COOKIE', 'ACCEPT', 'USER-AGENT', 'HOST']

levenshtein(a,b) = 1.0
levenshtein(b,c) = 1.0
levenshtein(a,d) = 2.0

In [66]:
# Let's look at a few examples of Levenshtein distance
import data_hacking.lsh_sims as lsh_sims
lsh = lsh_sims.LSHSimilarities([])
a = ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE']
b = ['ACCEPT', 'USER-AGENT', 'HOST']
c = ['ACCEPT', 'USER-AGENT', 'DORSEYS-MOM']
d = ['COOKIE', 'ACCEPT', 'USER-AGENT', 'HOST']

print 'Levenshtein: %s -- %s   ( %f )' % (a, b, lsh.levenshtein(a, b))
print 'Levenshtein: %s -- %s   ( %f )' % (b, c, lsh.levenshtein(b, c))
print 'Levenshtein: %s -- %s   ( %f )' % (a, d, lsh.levenshtein(a, d))


Levenshtein: ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE'] -- ['ACCEPT', 'USER-AGENT', 'HOST']   ( 1.000000 )
Levenshtein: ['ACCEPT', 'USER-AGENT', 'HOST'] -- ['ACCEPT', 'USER-AGENT', 'DORSEYS-MOM']   ( 1.000000 )
Levenshtein: ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE'] -- ['COOKIE', 'ACCEPT', 'USER-AGENT', 'HOST']   ( 2.000000 )
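
For reference, a token-level Levenshtein distance can be written as the classic dynamic program, comparing whole header names instead of characters. This is a generic sketch (token_levenshtein is a hypothetical name) and makes no claim about how lsh_sims implements it or about its 'tapered' variants.

```python
def token_levenshtein(seq_a, seq_b):
    """Minimum number of token insertions, deletions or substitutions."""
    previous = list(range(len(seq_b) + 1))
    for i, tok_a in enumerate(seq_a, start=1):
        current = [i]
        for j, tok_b in enumerate(seq_b, start=1):
            cost = 0 if tok_a == tok_b else 1
            current.append(min(previous[j] + 1,          # delete tok_a
                               current[j - 1] + 1,       # insert tok_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

a = ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE']
b = ['ACCEPT', 'USER-AGENT', 'HOST']
c = ['ACCEPT', 'USER-AGENT', 'DORSEYS-MOM']
d = ['COOKIE', 'ACCEPT', 'USER-AGENT', 'HOST']

# token_levenshtein(a, b) == 1, token_levenshtein(b, c) == 1, token_levenshtein(a, d) == 2
```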

In [67]:
# Lets compute levenshtein distance between the header sequences for each agent
params = {'num_hashes':20, 'lsh_bands':20, 'lsh_rows':1, 'drop_duplicates':True}

agent_distances = {}
agent_groups = dataframe.groupby(['short_agent'])
for name, group in agent_groups:
    lsh = lsh_sims.LSHSimilarities(group['feature_vector'], mh_params=params)
    distances = lsh.batch_compute_similarities(distance_metric='levenshtein_tapered', threshold=10)
    distances.sort() 
    agent_distances[name] = distances

In [68]:
# For one agent show the top 5 closest (levenshtein) header sequences
agent = 'mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0'
distances = agent_distances[agent]

print '\nAgent: %s' % agent
print 'Distances:'
features = agent_groups.get_group(agent)['feature_vector']
for distance in distances[:5]:
    print '\n%s\n%s' % (features.iloc[distance[1]], features.iloc[distance[2]])


Agent: mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0
Distances:

[u'ACCEPT', u'ACCEPT-LANGUAGE', u'X-FLASH-VERSION', u'ACCEPT-ENCODING', u'USER-AGENT', u'HOST', u'CONNECTION', u'COOKIE']
[u'ACCEPT', u'ACCEPT-LANGUAGE', u'REFERER', u'X-FLASH-VERSION', u'ACCEPT-ENCODING', u'USER-AGENT', u'HOST', u'CONNECTION']

Hierarchical Clustering

Now we can use those similarities as the basis of a Hierarchical Clustering Algorithm. Single-linkage clustering is one of several methods for agglomerative hierarchical clustering. The image on the right is an example of how this works.

We're using a bottom-up method (the image is flipped :): you simply sort the similarities and start building your tree from the bottom. If B and C are the most similar you link them, then D/E, and so on until the tree is complete. The devil is definitely in the details of the implementation, so luckily we have a Python class that does it for us.
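
The bottom-up idea can be sketched in a few lines with a union-find structure: sort the pairwise similarities, then keep linking the most similar pair that is not already in the same cluster. This is a simplified illustration, not the hcluster class: single_linkage and min_sim are hypothetical names, the similarity values are made up, and it returns flat clusters plus the merge order rather than a full tree.

```python
def single_linkage(n_items, similarities, min_sim):
    """similarities: list of (sim, i, j). Merge most-similar pairs first."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the cluster root (with path compression)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for sim, i, j in sorted(similarities, reverse=True):
        if sim < min_sim:
            break  # everything left is too dissimilar to link
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i
            merges.append((i, j, sim))

    clusters = {}
    for item in range(n_items):
        clusters.setdefault(find(item), []).append(item)
    return merges, sorted(clusters.values())

# Items 0..4 stand for A..E; similarity values are made up
sims = [(0.9, 1, 2), (0.8, 3, 4), (0.5, 0, 1), (0.1, 2, 3)]
merges, clusters = single_linkage(5, sims, min_sim=0.2)
# merges   == [(1, 2, 0.9), (3, 4, 0.8), (0, 1, 0.5)]
# clusters == [[0, 1, 2], [3, 4]]
```

B and C are linked first, then D and E, then A joins the B/C cluster; the weakest link (C to D) falls below the cutoff, so two clusters remain.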


In [69]:
# mpld3 is a cool python module for using D3 as a back end to matplotlib
# go to https://github.com/jakevdp/mpld3 and behold the awesome.

# Note we're commenting this out so that nbviewer works correctly,
# but feel free to uncomment it if you download the notebook and play
# with it yourself.
'''
try:
    import mpld3
    mpld3.enable_notebook(d3_url="/files/d3/d3.v3.js")
except ImportError:
    print 'Info: Could not load mpld3 module. No worries stuff will still work fine...'
'''


Out[69]:
'\ntry:\n    import mpld3\n    mpld3.enable_notebook(d3_url="/files/d3/d3.v3.js")\nexcept ImportError:\n    print \'Info: Could not load mpld3 module. No worries stuff will still work fine...\'\n'

In [70]:
# Compute a hierarchical clustering from the header similarities for each agent
import data_hacking.hcluster as hcluster
agent_h_graphs = {}
groups = dict(list(agent_groups))
for name, group in groups.iteritems():
    lsh = lsh_sims.LSHSimilarities(group['feature_vector'], mh_params=params)
    distances = lsh.batch_compute_similarities(distance_metric='l_tapered_sim', threshold=0)
    h_clustering = hcluster.HCluster(group['feature_vector'])
    h_clustering.set_sim_method(lsh.l_sim)
    h_graph, root = h_clustering.sims_to_hcluster(distances, agg_sim=.2)
    agent_h_graphs[name] = {'graph':h_graph, 'root':root}


<<<< WTF Error: Looks like an empty graph >>>>>
Graph 0 nodes 0 edges
<<<< WTF Error: Looks like an empty graph >>>>>
Graph 0 nodes 0 edges
<<<< WTF Error: Looks like an empty graph >>>>>
Graph 0 nodes 0 edges
<<<< WTF Error: Looks like an empty graph >>>>>
Graph 0 nodes 0 edges
<<<< WTF Error: Looks like an empty graph >>>>>
Graph 0 nodes 0 edges

In [71]:
# Plot a couple of agents
import networkx as nx

def plot_h_tree(graph, layout='neato'):
    pos = nx.graphviz_layout(graph, prog=layout)
    labels = {node[0]:node[1]['label'] for node in graph.nodes(data=True)}
    nx.draw_networkx(graph, pos, node_size=800, alpha=.7, node_color=[.6,.4,.6], labels=labels)
    edge_labels=dict([((u,v,),str(d['weight'])[:4]) for u,v,d in graph.edges(data=True)])
    nx.draw_networkx_edge_labels(graph,pos,edge_labels=edge_labels)

In [72]:
# MSIE 8
msie_8 = 'mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5'
plot_h_tree(agent_h_graphs[msie_8]['graph'])



In [73]:
msie_9 = 'mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0'
plot_h_tree(agent_h_graphs[msie_9]['graph'])



In [74]:
flash = 'shockwave:flash'
plot_h_tree(agent_h_graphs[flash]['graph'])



In [75]:
import collections
def subtree_labels(g, root):
    labels = nx.get_node_attributes(g,'label')
    sub_labels = collections.defaultdict(list)
    leaves = [k for k,v in g.out_degree().iteritems() if v == 0]
    for leaf in leaves:
        sub_labels[g.predecessors(leaf)[0]].append(labels[leaf])
    return sub_labels

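import pprint
# NOTE: good_test_agent is never assigned in this notebook as shown; set it to one
# of the short_agent keys in agent_h_graphs (for example msie_8) before running this cell.
g = agent_h_graphs[good_test_agent]['graph']
root = agent_h_graphs[good_test_agent]['root']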

In [76]:
# Hmph, well just for fun we made a RE Morpher class; you simply keep adding
# strings to it and it figures out the RE that matches all the strings.
# It's very hack-tastic so a better way to auto-generate regular expressions 
# will be a fun task for some contributor :)
import re
import re_morpher

# Lets experiment a bit
a = [u'HOST', u'CONNECTION', u'ACCEPT', u'USER-AGENT', u'ACCEPT-ENCODING']
# An earlier variant of b; it was immediately overwritten by the assignment below
# b = [u'HOST', u'CONNECTION', u'AUTHORIZATION', u'ACCEPT', u'USER-AGENT', u'ACCEPT-ENCODING']
b = [u'HOST', u'CONNECTION', u'AUTHORIZATION', u'ACCEPT', u'USER-AGENT', u'DORSEYS-MOM']

my_re_morpher = re_morpher.REMorpher()
my_re_morpher.add_sequence(a)
print my_re_morpher.get_re_pattern()
my_re_morpher.add_sequence(b)
print my_re_morpher.get_re_pattern()


^HOSTCONNECTIONACCEPTUSER-AGENTACCEPT-ENCODING$
^HOSTCONNECTION(AUTHORIZATION)?ACCEPTUSER-AGENT(DORSEYS-MOM)?(ACCEPT-ENCODING)?$
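
A quick way to see what the second pattern does is to test it with Python's re module against the two sequences that produced it. The pattern string is copied from the output above; the matches helper and the extra test sequences are illustrative additions.

```python
import re

# Pattern copied from the RE Morpher output above
pattern = re.compile(
    r'^HOSTCONNECTION(AUTHORIZATION)?ACCEPTUSER-AGENT(DORSEYS-MOM)?(ACCEPT-ENCODING)?$')

def matches(tokens):
    """Join the header names with no separator and test them against the pattern."""
    return pattern.match(''.join(tokens)) is not None

a = ['HOST', 'CONNECTION', 'ACCEPT', 'USER-AGENT', 'ACCEPT-ENCODING']
b = ['HOST', 'CONNECTION', 'AUTHORIZATION', 'ACCEPT', 'USER-AGENT', 'DORSEYS-MOM']
shorter = ['HOST', 'CONNECTION', 'ACCEPT', 'USER-AGENT']
reordered = ['CONNECTION', 'HOST', 'ACCEPT', 'USER-AGENT', 'ACCEPT-ENCODING']

# matches(a) and matches(b) are True (the two training sequences)
# matches(shorter) is also True: the pattern generalizes beyond what it was fed
# matches(reordered) is False: header order still matters
```

The optional groups mean the pattern accepts some sequences it never saw, which is worth keeping in mind when reading the per-agent patterns.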

In [87]:
# Alright now try it out on our agents header sequences
import collections
agent_res = collections.defaultdict(list)
for agent, graph_info in agent_h_graphs.iteritems():
#for agent, graph_info in zip(good_test_agent,agent_h_graphs[good_test_agent]):
    graph = graph_info['graph']
    root = graph_info['root']
    if graph:
        # Get the re patterns for this agent
        for sub_key, feature_list in subtree_labels(graph,root).iteritems():
            for f in feature_list:
                my_re_morpher.add_sequence(f.split(':'))

            # Append to my re list
            agent_res[agent].append(my_re_morpher.get_re_pattern())
            my_re_morpher.reset_re()
    
# Print out the agent sets just to get an idea
for agent, graph_info in agent_h_graphs.iteritems():
    print '\n%s' % agent
    for my_re in agent_res[agent]:
        print '\t%s' % my_re


mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5
	^(ACCEPT)?(ACCEPT-LANGUAGE)?(REFERER)?(X-SVN-REV)?(ACCEPT)?(ACCEPT-ENCODING)?(ACCEPT-LANGUAGE)?(USER-AGENT)?(X-FLASH-VERSION)?(ACCEPT-ENCODING)?(ACCEPT-LANGUAGE)?(USER-AGENT)?(CONTENT-TYPE)?(ACCEPT-ENCODING)?(IF-MODIFIED-SINCE)?(IF-NONE-MATCH)?(HOST)?(CONTENT-LENGTH)?(CONNECTION)?(COOKIE)?(IF-NONE-MATCH)?$
	^ACCEPTACCEPT-ENCODINGUSER-AGENT(IF-MODIFIED-SINCE)?HOSTCONNECTION(COOKIE)?$
	^X-REQUESTED-WITHACCEPT-LANGUAGEREFERERACCEPTCONTENT-TYPEACCEPT-ENCODINGUSER-AGENTIF-MODIFIED-SINCEIF-NONE-MATCHHOSTCONNECTIONCOOKIE$

mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06
	^ACCEPT(ACCEPT-ENCODING)?(ACCEPT-LANGUAGE)?USER-AGENT(ACCEPT-ENCODING)?HOSTCONNECTION$

mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5
	^ACCEPT(REFERER)?ACCEPT-LANGUAGEUSER-AGENTACCEPT-ENCODING(COOKIE)?(CONNECTION)?HOST(CONNECTION)?(COOKIE)?$
	^(X-REQUESTED-WITH)?(ACCEPT)?(ACCEPT-LANGUAGE)?(REFERER)?(ACCEPT)?(CONTENT-LENGTH)?ACCEPT-ENCODINGUSER-AGENTHOSTCONNECTION(COOKIE)?(CACHE-CONTROL)?$
	^ACCEPTACCEPT-LANGUAGE(REFERER)?X-FLASH-VERSION(CONTENT-TYPE)?(CONTENT-LENGTH)?ACCEPT-ENCODINGUSER-AGENTHOSTCONNECTION(CACHE-CONTROL)?(COOKIE)?$

mozilla/5.0:windows:nt:6.1:wow64:rv:14.0:gecko/20100101:firefox/14.0.1
	^HOSTUSER-AGENTACCEPTACCEPT-LANGUAGEACCEPT-ENCODINGCONNECTION(X-REQUESTED-WITH)?(X-YAHOO-MSGR-USER-AGENT)?(REFERER)?(COOKIE)?(IF-MODIFIED-SINCE)?(IF-NONE-MATCH)?$

mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0
	^ACCEPTACCEPT-LANGUAGE(REFERER)?X-FLASH-VERSIONACCEPT-ENCODINGUSER-AGENTHOSTCONNECTION(COOKIE)?$

mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322

shasta

mozilla/4.0
	^USER-AGENTHOST(IF-MODIFIED-SINCE)?(IF-NONE-MATCH)?(CONNECTION)?$

mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0
	^ACCEPT(REFERER)?ACCEPT-LANGUAGEUSER-AGENTACCEPT-ENCODINGHOSTCONNECTION(COOKIE)?$
	^ACCEPTACCEPT-ENCODINGUSER-AGENTHOSTCONNECTION$

mozilla/5.0:windows:nt:6.1:wow64:rv:12.0:gecko/20100101:firefox/12.0
	^HOSTUSER-AGENTACCEPTACCEPT-LANGUAGEACCEPT-ENCODINGCONNECTION(REFERER)?$

memeo:autobackup:/4.60.0.7923:/platform=1

mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0
	^HOSTUSER-AGENTACCEPTACCEPT-LANGUAGEACCEPT-ENCODINGCONNECTION(REFERER)?(ORIGIN)?(COOKIE)?(IF-MODIFIED-SINCE)?(CONTENT-TYPE)?(X-REQUESTED-WITH)?(REFERER)?(CONTENT-LENGTH)?(COOKIE)?(PRAGMA)?(IF-NONE-MATCH)?(CACHE-CONTROL)?$

nis/19.9.0.9:mid/{grpgbpdjzi9qbdsno/32ukaorrc}:sid/fhkuuaaaaaa:lue/1.8.2.10:windows6.1sp1.0x64enu

shockwave:flash
	^CONTENT-TYPEUSER-AGENTHOSTCONTENT-LENGTHCONNECTIONCACHE-CONTROL(COOKIE)?$
	^(REFERER)?X-FLASH-VERSIONUSER-AGENTHOST(CACHE-CONTROL)?(CONNECTION)?$

nis/19.9.0.9:mid/{grpgbpdjzi9qbdsno/32ukaorrc}:sid/fhkuuaaaaaa

microsoft-cryptoapi/6.1
	^(CACHE-CONTROL)?CONNECTIONACCEPT(IF-MODIFIED-SINCE)?(IF-NONE-MATCH)?USER-AGENTHOST$

mozilla/5.0:windows:u:windows:nt:6.1:en-us:rv:1.9.2.18:gecko/20110614:firefox/3.6.18
	^HOSTUSER-AGENTACCEPTACCEPT-LANGUAGEACCEPT-ENCODINGACCEPT-CHARSETKEEP-ALIVECONNECTION(REFERER)?(COOKIE)?(IF-MODIFIED-SINCE)?$

Validation and Evaluation

 - Make sure the regular expressions match all the agents/features in the training set.
 - Test the expressions against data/PCAPs that have known bad/sneaky agents.

Well, by definition the regular expressions are supposed to match the training set, so the first evaluation is more of a sanity check. For the second test we find 'matching' agents in the PCAP file and test their header sequences.

NOTE: This work is fairly embryonic right now and this section in particular needs more formality around it. Also, as always, we need a super huge set of training data to get broader coverage of more agents and all of their permutations.


In [78]:
# An evaluation method for our auto-magically-generated RE expressions
import re
def evaluate_agents(agent_list, feature_list):
    print 'Evaluating %d requests' % len(agent_list)
    for agent, features in zip(agent_list, feature_list):
        my_res = [re.compile(my_re) for my_re in agent_res[agent]]
        match = any([my_re.match(features.replace(':','')) for my_re in my_res])
        if not match:
            print '\nAlert: No Match on Agent(%s) Sequence(%s)' % (agent,features)
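
The same check can be written so that it returns the misses instead of printing them, which makes it easier to count or inspect them. This is a sketch under assumptions: unmatched is a hypothetical helper, agent_patterns stands in for agent_res, and the single pattern is the one printed earlier for the np06 agent. Unlike the cell above, an agent with no patterns at all is reported as a miss here.

```python
import re

def unmatched(agent_patterns, requests):
    """Return (agent, features) pairs that match none of the agent's patterns."""
    misses = []
    for agent, features in requests:
        compiled = [re.compile(p) for p in agent_patterns.get(agent, [])]
        joined = features.replace(':', '')
        if not any(c.match(joined) for c in compiled):
            misses.append((agent, features))
    return misses

np06 = 'mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06'
agent_patterns = {
    np06: ['^ACCEPT(ACCEPT-ENCODING)?(ACCEPT-LANGUAGE)?USER-AGENT(ACCEPT-ENCODING)?HOSTCONNECTION$'],
}
requests = [
    (np06, 'ACCEPT:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION'),
    (np06, 'ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION'),
]
misses = unmatched(agent_patterns, requests)
# Only the second request is a miss: the np06 pattern has no slot for REFERER
```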

In [85]:
# Evaluation against the training set (there should be no alerts)
t_agents = [(len(agent_res[agent])>0) for agent in dataframe['short_agent']] # Degenerate case where no H-Tree was built
training_agents = dataframe[t_agents]
evaluate_agents(training_agents['short_agent'], training_agents['features'])


Evaluating 2268 requests

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

[The identical alert above is repeated 30 more times in the original output, 31 in total.]

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION:COOKIE:IF-MODIFIED-SINCE:IF-NONE-MATCH)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION:COOKIE:IF-MODIFIED-SINCE:IF-NONE-MATCH)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION:COOKIE:IF-MODIFIED-SINCE:IF-NONE-MATCH)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION:COOKIE:IF-MODIFIED-SINCE:IF-NONE-MATCH)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:COOKIE:CONNECTION:HOST)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:CONTENT-TYPE:ACCEPT-ENCODING:HOST:CONTENT-LENGTH:CONNECTION:CACHE-CONTROL)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:14.0:gecko/20100101:firefox/14.0.1) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-TYPE:X-YAHOO-MSGR-USER-AGENT:REFERER:COOKIE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:REFERER:ORIGIN:RANGE:IF-RANGE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:REFERER:ORIGIN:RANGE:IF-RANGE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5) Sequence(ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:CONTENT-TYPE:X-VERIFY:CONTENT-LENGTH:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:CACHE-CONTROL)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE)

Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:COOKIE:REFERER:CONTENT-TYPE:CONTENT-LENGTH)

In [80]:
# Read in header logs from the Contagio dump's PCAP samples for evaluation testing
bro_log = bro_log_reader.BroLogReader()
contagio_headers = bro_log.read_log('data/contagio.headers.txt')
contagio_df = pd.DataFrame(contagio_headers)
contagio_df.head()


Out[80]:
header_events_json origin ts useragent
0 [{"ACCEPT":"application\/octet-stream"},{"CONT... client 2013-08-10 23:26:48.150406 Alina v5.3
1 [{"DATE":"Sun, 11 Aug 2013 05:25:27 GMT"},{"SE... server 2013-08-10 23:26:48.150406 NA
2 [{"ACCEPT":"application\/octet-stream"},{"CONT... client 2013-08-10 23:28:40.198085 Alina v5.3
3 [{"DATE":"Sun, 11 Aug 2013 05:27:19 GMT"},{"SE... server 2013-08-10 23:28:40.198085 NA
4 [{"ACCEPT":"application\/octet-stream"},{"CONT... client 2013-08-10 23:32:41.074339 Alina v5.3

5 rows × 4 columns


In [81]:
# A bit of processing on the raw data to prepare it for evaluation
# (.copy() avoids pandas' SettingWithCopyWarning when adding columns below)
contagio_df = contagio_df[contagio_df['origin']=='client'].copy()
contagio_df['short_agent'] = short_agent_names(contagio_df['useragent'])
contagio_df['feature_vector'] = make_header_features(contagio_df['header_events_json'])
contagio_df['features'] = contagio_df['feature_vector'].map(lambda x: ':'.join(x))
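
The `features` column is a colon-joined sequence of header names in the order the client sent them. `make_header_features` is defined earlier in the notebook; the sketch below is not that function, only a rough illustration of the idea under the assumption that `header_events_json` holds a JSON list of one-item dicts. The helper name `header_name_sequence` and the sample string are hypothetical.

```python
import json

def header_name_sequence(header_events_json):
    """Return the header names of one request, in the order they were seen.

    Assumes the input is a JSON list of one-item dicts, e.g.
    [{"HOST": "..."}, {"USER-AGENT": "..."}]. Hypothetical helper.
    """
    events = json.loads(header_events_json)
    # Each event contributes its single key; order is preserved by the list
    return [name.upper() for event in events for name in event]

# Made-up sample in the same shape as the header_events_json column
sample = r'[{"HOST":"example.test"},{"USER-AGENT":"demo\/1.0"},{"ACCEPT":"*\/*"}]'
print(':'.join(header_name_sequence(sample)))
```

The header order is the useful part here: two clients that claim the same User-Agent but emit headers in a different order will produce different feature strings.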

In [82]:
# Let's look at the overlap of agents from our training set and the contagio set
trained_agents = set(dataframe['short_agent'].unique())
evil_agents = set(contagio_df['short_agent'].unique())
evil_agents = evil_agents.intersection(trained_agents)
contagio_subset = contagio_df[contagio_df['short_agent'].isin(evil_agents)]
evil_agents

# Only a few agents overlap our training data, but that's okay:
# it's still a reasonable set of header requests to test against.


Out[82]:
{'microsoft-cryptoapi/6.1',
 'mozilla/4.0',
 'mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322'}

In [83]:
# Let's see how the Contagio CrimeWare PCAP requests measure up against our dataset of computed regexes
evaluate_agents(contagio_subset['short_agent'],contagio_subset['features'])


Evaluating 33 requests

Alert: No Match on Agent(mozilla/4.0) Sequence(CACHE-CONTROL:CONNECTION:PRAGMA:CONTENT-TYPE:USER-AGENT:CONTENT-LENGTH:HOST)

Alert: No Match on Agent(mozilla/4.0) Sequence(CACHE-CONTROL:CONNECTION:PRAGMA:CONTENT-TYPE:USER-AGENT:CONTENT-LENGTH:HOST)

Alert: No Match on Agent(mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322) Sequence(HOST:KEEP-ALIVE:CONNECTION:USER-AGENT)

Alert: No Match on Agent(mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322) Sequence(HOST:KEEP-ALIVE:CONNECTION:USER-AGENT)

Alert: No Match on Agent(mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322) Sequence(HOST:KEEP-ALIVE:CONNECTION:USER-AGENT)

Alert: No Match on Agent(mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322) Sequence(HOST:KEEP-ALIVE:CONNECTION:USER-AGENT)

Alert: No Match on Agent(mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322) Sequence(HOST:KEEP-ALIVE:CONNECTION:USER-AGENT)

Alert: No Match on Agent(mozilla/4.0) Sequence(HOST:USER-AGENT:CONTENT-TYPE:CONTENT-LENGTH:CONNECTION)
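
The alert output repeats the same agent/sequence pair many times, so it can be easier to read as a tally. Here is a minimal sketch, assuming the printed alerts have been captured as text. The `summarize_alerts` helper and the `captured` string are hypothetical and not part of the notebook; the sample lines are copied from the Contagio output above.

```python
import re
from collections import Counter

# Matches the alert lines printed above
ALERT_RE = re.compile(
    r'Alert: No Match on Agent\((?P<agent>.*?)\) Sequence\((?P<sequence>.*?)\)'
)

def summarize_alerts(output_text):
    """Count how often each (agent, header sequence) pair raised an alert."""
    counts = Counter()
    for line in output_text.splitlines():
        match = ALERT_RE.search(line)
        if match:
            counts[(match.group('agent'), match.group('sequence'))] += 1
    return counts

# A few alert lines copied from the output above
captured = """
Alert: No Match on Agent(mozilla/4.0) Sequence(CACHE-CONTROL:CONNECTION:PRAGMA:CONTENT-TYPE:USER-AGENT:CONTENT-LENGTH:HOST)
Alert: No Match on Agent(mozilla/4.0) Sequence(CACHE-CONTROL:CONNECTION:PRAGMA:CONTENT-TYPE:USER-AGENT:CONTENT-LENGTH:HOST)
Alert: No Match on Agent(mozilla/4.0) Sequence(HOST:USER-AGENT:CONTENT-TYPE:CONTENT-LENGTH:CONNECTION)
"""

alert_counts = summarize_alerts(captured)
for (agent, sequence), count in alert_counts.most_common():
    print(count, agent, sequence)
```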

Conclusions

We read in some Bro log data, did a bit of processing, and then applied some neat analytics. The analytics in this notebook covered these topics, and the reader is encouraged to visit the respective pages:

In general the material in this notebook represents fairly embryonic work.

Please take the results as a 'work in progress' at this point...

Time for the fancy monkeys on a spinning rock in the middle of nowhere to go have some beer!

Papers on the Automatic Generation of Regular Expressions

Bartoli, Davanzo, De Lorenzo, Mauri, Medvet, Sorio, Automatic Generation of Regular Expressions from Examples with Genetic Programming, ACM Genetic and Evolutionary Computation Conference (GECCO), 2012, Philadelphia (US)

De Lorenzo, Medvet, Bartoli, Automatic String Replace by Examples, ACM Genetic and Evolutionary Computation Conference (GECCO), 2013, Amsterdam (Netherlands). The string-replace functionality described in this paper is based on an extension of the work showcased on the authors' web app; it is currently not exposed on the web.

There's also a neat IPython notebook on generating regular expressions

xkcd 1313: Something is Wrong on the Internet!

![](http://imgs.xkcd.com/comics/regex_golf.png)

Given two Python collections of strings, the IPython notebook searches for a regex that matches every string in the first collection and none in the second. It does this with a set cover technique: it picks components that each match some of the wanted strings and none of the unwanted ones, then OR's the components together. Please see: http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb for more info.
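
To make the set cover idea concrete, here is a simplified greedy sketch. It is not the code from that notebook: the function names, the candidate components (short substrings plus anchored whole strings) and the tie-breaking rule are all assumptions chosen for illustration. The sample header sequences are made up in the same style as the feature strings used in this notebook.

```python
import re

def candidate_parts(winners, losers, max_len=4):
    """Regex components drawn from the winners that match no loser."""
    # Anchored whole strings guarantee every winner can be covered
    parts = {'^' + re.escape(w) + '$' for w in winners}
    for w in winners:
        for i in range(len(w)):
            for j in range(i + 1, min(i + max_len, len(w)) + 1):
                parts.add(re.escape(w[i:j]))
    return {p for p in parts if not any(re.search(p, l) for l in losers)}

def greedy_regex(winners, losers):
    """Greedy set cover: OR together parts until every winner is matched."""
    uncovered = set(winners)
    if uncovered & set(losers):
        raise ValueError('a string cannot be both a winner and a loser')
    parts = candidate_parts(uncovered, losers)
    chosen = []
    while uncovered:
        # Prefer the part covering the most uncovered winners, then the
        # shortest; the final key element only makes ties deterministic
        best = max(parts, key=lambda p: (
            sum(1 for w in uncovered if re.search(p, w)), -len(p), p))
        chosen.append(best)
        uncovered = {w for w in uncovered if not re.search(best, w)}
    return '|'.join(chosen)

# Made-up header sequences: 'winners' we want to match, 'losers' we must not
winners = {'HOST:USER-AGENT:ACCEPT', 'HOST:USER-AGENT:ACCEPT:COOKIE'}
losers = {'ACCEPT:REFERER:USER-AGENT:HOST', 'USER-AGENT:HOST'}
pattern = greedy_regex(winners, losers)
print(pattern)
```

Greedy selection does not promise the shortest possible regex, but it does guarantee the two properties that matter for this exercise: every wanted sequence matches and no unwanted sequence does.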