A lightweight and flexible approach to task management, execution, and pipelining.
Open source tools have three primary components that are critical for adoption:
### How easy is it to get the system up and running?
### How easy is it to use?
import zerorpc
c = zerorpc.Client()
c.connect("tcp://127.0.0.1:4242")
with open('evil.pcap', 'rb') as f:
    md5 = c.store_sample(f.read(), 'evil.pcap', 'pcap')
print c.work_request('pcap_meta', md5)
{'pcap_meta': {'encoding': 'binary',
               'file_size': 54339570,
               'file_type': 'tcpdump (little-endian) - version 2.4 (Ethernet, 65535)',
               'filename': 'evil.pcap',
               'import_time': '2014-02-08T22:15:50.282000Z',
               'md5': 'bba97e16d7f92240196dc0caef9c457a',
               'mime_type': 'application/vnd.tcpdump.pcap'}}
Run the workbench server (for this demo we'll just start a local one):
$ workbench_server
In [35]:
# Let's start to interact with workbench. Please note there is NO workbench-specific client;
# just use the ZeroRPC Python, Node.js, or CLI interfaces.
import zerorpc
c = zerorpc.Client()
c.connect("tcp://127.0.0.1:4242")
Out[35]:
Workbench is often confusing for new users (we're trying to work on that). Please see our GitHub repository https://github.com/SuperCowPowers/workbench for the latest documentation and notebook examples (the notebook examples can really help). New users can start by typing **c.help()** after they connect to workbench.
In [36]:
# I forgot what stuff I can do with workbench
print c.help()
In [37]:
print c.help_basic()
In [38]:
# STEP 1:
# Okay get the list of commands from workbench
print c.help_commands()
In [39]:
# STEP 2:
# Let's get the information on a specific command, 'store_sample'
print c.help_command('store_sample')
In [40]:
# STEP 3:
# Now let's get information about the dynamically loaded workers (your site may have many more!)
# Next to each worker name is the list of dependencies that worker has declared
print c.help_workers()
In [41]:
# STEP 4:
# Let's get the information about the meta worker
print c.help_worker('meta')
In [42]:
# STEP 5:
# Okay, when we load up a file, we get the md5 back
filename = '../data/pe/bad/0cb9aa6fb9c4aa3afad7a303e21ac0f3'
with open(filename, 'rb') as f:
    my_md5 = c.store_sample(f.read(), filename, 'exe')
print my_md5
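As a quick aside, here's our own sanity-check sketch (not part of the original demo): assuming store_sample returns the md5 of the raw bytes, a locally computed hash should match.

# Sanity-check sketch (our addition): assumes store_sample returns the md5 of the raw bytes
import hashlib
with open(filename, 'rb') as f:
    local_md5 = hashlib.md5(f.read()).hexdigest()
print local_md5 == my_md5  # expect True under that assumption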
In [43]:
# STEP 6:
# Run a worker on my sample
output = c.work_request('meta', my_md5)
output
Out[43]:
In [44]:
# Let's see what view_pe does
print c.help_worker('view_pe')
In [45]:
# Okay, let's give it a try
c.work_request('view_pe', my_md5)
Out[45]:
In [46]:
# Okay, that worker needed the output of pe_features and pe_indicators.
# So what happened? The worker has a dependency list, and workbench
# recursively satisfies that dependency list. This is powerful: when we're
# interested in one particular analysis we just want to get the darn thing
# without having to worry about a bunch of details (a conceptual sketch of
# this resolution follows this cell). Well, let's do this for a bunch of files!
import os
file_list = [os.path.join('../data/pe/bad', child) for child in os.listdir('../data/pe/bad')]
working_set = []
for filename in file_list:
    with open(filename, 'rb') as f:
        md5 = c.store_sample(f.read(), filename, 'exe')
    results = c.work_request('pe_classifier', md5)
    working_set.append(md5)
    print 'Results: %s' % (results)
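To make that recursive dependency resolution concrete, here's a conceptual sketch (our own illustration, NOT workbench's actual code): 'workers' maps worker names to objects with a dependencies list and an execute() method, and 'datastore' stands in for the MongoDB backend.

# Conceptual sketch of recursive dependency resolution (illustrative only)
def resolve(worker_name, md5, workers, datastore):
    # If results for this (worker, md5) pair are already stored, reuse them
    cached = datastore.get((worker_name, md5))
    if cached is not None:
        return cached

    # Recursively satisfy the worker's declared dependencies first
    worker = workers[worker_name]
    input_data = {}
    for dep in worker.dependencies:
        input_data[dep] = resolve(dep, md5, workers, datastore)

    # Run the worker, push the results into the datastore, then pull them back out
    datastore[(worker_name, md5)] = worker.execute(input_data)
    return datastore[(worker_name, md5)]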
In [47]:
# We just ran the classifier on 50 files, and you'll note that we ONLY got back the
# information we asked for. On a large number of files (100k or greater), if you don't
# have a granular system, something this easy WILL NOT BE POSSIBLE! (dramatic enough?)
# So let's look at the features going into the classifier (btw, the classifier is currently a TOY EXAMPLE)
c.work_request('pe_features', md5)
Out[47]:
In [48]:
c.work_request('pe_indicators', md5)
Out[48]:
On another note, did we just waste some time there? Did workbench have to recompute the features? No. Everything done by workbench is pushed into the MongoDB backend, and if the work results for that md5 are already in the datastore, a very lightweight call is made to get the results. In fact, results are never directly returned: the worker pushes into Mongo, and then we pull the results out and hand them to the client. That way we *ensure* that the bits in the datastore and the bits that you get are the exact same 'gold bits' (seems like overkill, but it's important).
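A quick way to see this caching in action (our own sketch) is to time two identical requests. Note that at this point in the demo pe_features has likely already run for this md5 (view_pe depends on it), so both calls may hit the cache; on a fresh sample the first call would pay the compute cost.

# Timing sketch (our addition): repeated requests for the same md5 hit the datastore
import time
start = time.time()
c.work_request('pe_features', my_md5)  # computes only if results aren't already stored
print 'first call:  %.2f seconds' % (time.time() - start)
start = time.time()
c.work_request('pe_features', my_md5)  # served straight from the datastore
print 'second call: %.2f seconds' % (time.time() - start)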
In [49]:
# Another example: I want to look at strings for different types of files (not just PE files).
# So we can load up a few PDFs (the PEs are already in the datastore)
file_list = [os.path.join('../data/pdf/bad', child) for child in os.listdir('../data/pdf/bad')]
for filename in file_list:
    with open(filename, 'rb') as f:
        md5 = c.store_sample(f.read(), filename, 'pdf')
    working_set.append(md5)
In [50]:
# Now we run the strings worker on them all
for md5 in working_set:
    result = c.work_request('strings', md5)
    print 'results: %s' % (result['strings']['string_list'][:5])  # strings output is large, so just show the first 5
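As a small follow-on (our own sketch, reusing the same calls shown above), you can tally how many strings each sample produced:

# Tally extracted string counts per sample (sketch; reuses working_set from above)
for md5 in working_set:
    result = c.work_request('strings', md5)
    print '%s: %d strings' % (md5, len(result['strings']['string_list']))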
Views exemplify the true power of workbench. They are meta-workers in the broadest sense: they can call any set of workers (and other views, which are just workers, of course). All of the previous notebook code focused on demonstrating the level of control and granularity you can use with workbench; the view example here is for those who don't care about granularity and really just want a big 'GO' button.
Views can also be precise or general (the example below shows the latter):
- Customer billing View
- Sample volume over time View
- All samples that use communications calls View
- DO_EVERYTHING_BECAUSE_I_WANT_TO_PUNCH_GRANULARITY_IN_THE_NUTS! View
So let's look at the last kind. It's called 'view', and like many of the other workers it's 20 lines of code.
But it's deceptively simple: if you think about what must be happening below, over a dozen workers are getting orchestrated and run only when it makes sense for that MIME type. So with a few 'pull' calls the recursive dependency chains are invoked; work is done if and when it's needed, and the whole thing is fantastically elegant and efficient. If your mind isn't a little bit blown by what happens below, then you might not be paying attention.
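For a rough feel of what a view-style worker looks like, here's a minimal sketch (our own illustration, not the actual 'view' source; it assumes the dependencies-list/execute() worker convention, and the field names are illustrative):

# Minimal view-style worker sketch: declare dependencies, let workbench satisfy
# them recursively, then aggregate/subset the combined results in execute()
class ViewSketch(object):
    ''' Aggregate a few pieces of meta data into one small view (illustrative) '''
    dependencies = ['meta']

    def execute(self, input_data):
        meta = input_data['meta']
        return {'md5': meta['md5'],
                'file_type': meta['file_type'],
                'file_size': meta['file_size']}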
In [51]:
# Tag each file's type based on its parent directory, and grab all the file paths recursively
def tag_type(path):
    types = ['bro', 'json', 'log', 'pcap', 'pdf', 'exe', 'swf', 'zip']
    for try_type in types:
        if try_type in os.path.dirname(path):
            return try_type

file_list = []
for p, d, f_list in os.walk('../data'):
    file_list += [os.path.join(p, f) for f in f_list]
In [54]:
# We're going to load in all the files, which include PE files, PCAPs, PDFs, and ZIPs, and run 'view' on them.
# Note: this takes a while :)
import pprint
results = []
for filename in file_list:
    with open(filename, 'rb') as f:
        md5 = c.store_sample(f.read(), os.path.basename(filename), tag_type(filename))
    results.append(c.work_request('view', md5))
pprint.pprint(results[:5])
In [55]:
# Okay, so views can either aggregate results from multiple workers or they
# can subset to just what you want (webpage presentation, for instance)
results = c.batch_work_request('view_customer')
print results
In [56]:
# At this granularity it opens up a new world
import pandas as pd
df = pd.DataFrame(results)
df.head(10)
Out[56]:
In [57]:
# Let's look at the file submission types broken down by customer
df['count'] = 1
df.groupby(['customer','type_tag']).sum()
Out[57]:
In [58]:
# Plotting defaults
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 12.0
plt.rcParams['figure.figsize'] = 18.0, 8.0
In [59]:
# Plot box plots based on customer (PDFs)
df[df['type_tag']=='pdf'].boxplot('length','customer')
plt.xlabel('Customer')
plt.ylabel('File Size')
plt.title('File Length (PDF) by Customer')
plt.suptitle('')
Out[59]:
In [60]:
# Plot box plots based on customer (PEs)
df[df['type_tag']=='exe'].boxplot('length','customer')
plt.xlabel('Customer')
plt.ylabel('File Size')
plt.title('File Length (PE) by Customer')
plt.suptitle('')
Out[60]:
In [65]:
# Okay, now let's do some plots on the file metadata
results = c.batch_work_request('meta_deep')
In [66]:
df_meta = pd.DataFrame(results)
df_meta.head()
Out[66]:
In [67]:
# Plot entropy box plots based on file type
df_meta.boxplot('entropy','type_tag')
plt.xlabel('Mime Type')
plt.ylabel('Entropy')
Out[67]:
In [68]:
# Plot customer submissions based on file type
group_df = df[['customer','type_tag']]
group_df['submissions'] = 1
group_df = group_df.groupby(['customer','type_tag']).sum().unstack()
group_df.head()
Out[68]:
In [80]:
# Plot customer submissions as a stacked bar chart, broken down by file type
my_colors = [(x/9.0, .8, 1.0-x/9.0) for x in range(10)]  # Why the heck doesn't matplotlib have better categorical cmaps?
group_df['submissions'].plot(kind='bar', stacked=True, color=my_colors)
plt.xlabel('Customer')
plt.ylabel('Submissions')
Out[80]:
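As an aside on that categorical-color gripe, one workaround (our own sketch, not part of the original demo) is to sample one of the qualitative colormaps matplotlib does ship, such as Paired:

# Sample 10 evenly spaced colors from the qualitative 'Paired' colormap
from matplotlib import cm
my_colors = [cm.Paired(i / 10.0) for i in range(10)]
group_df['submissions'].plot(kind='bar', stacked=True, color=my_colors)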