Like most people, I was wondering: "What is my laptop doing? Has it become a botnet? Should I stop downloading a bunch of weird stuff all the time? Will I ever move out of my Mom's basement?"... but I digress.
This notebook is an exploration of my laptop's network usage using Workbench https://github.com/SuperCowPowers/workbench.git
We wanted to get a 'gist' of the network activity happening from a particular capture point, in this case a laptop, but the capture point could be anywhere. Obviously there are super great tools that already perform exploration and analysis of PCAPs: Wireshark, ChopShop, Scapy, blah, foo, etc. Here we're leveraging Bro IDS to generate our starting-point data and then hopping off in various directions from there. The work here should be viewed as complementary to these other tools :)
Run the workbench server (from somewhere; for this demo we're just going to start a local one):
$ workbench_server
In [20]:
# Let's start interacting with workbench. Please note there is NO specific client to workbench,
# just use the ZeroRPC Python, Node.js, or CLI interfaces.
import zerorpc
c = zerorpc.Client(timeout=120)
c.connect("tcp://127.0.0.1:4242")
Out[20]:
In [21]:
# I forgot what stuff I can do with workbench
print c.help()
Let's look at the PCAPs that are being tossed into workbench. A script in the utils directory called pcap_streamer.py will 'stream' PCAPs into workbench off of a live network interface. We can use the 'get_sample_window' call to have workbench give us the most recent window of streaming PCAPs (50 MB worth in this example).
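Just for reference, here's a rough sketch of what a streamer like pcap_streamer.py is doing under the hood: grab small PCAP slices and push each one into workbench. This is purely illustrative; the directory, filenames, and the store_sample() argument order are assumptions on my part, so check the actual script and the workbench API if you want the real thing.
# Illustrative sketch only, NOT the actual pcap_streamer.py
import glob
import zerorpc

c = zerorpc.Client(timeout=120)
c.connect("tcp://127.0.0.1:4242")

# Assume something like 'tcpdump -G 30 -w pcap_slices/slice_%s.pcap' is rotating capture files into this directory
for pcap_file in glob.glob('pcap_slices/*.pcap'):
    with open(pcap_file, 'rb') as f:
        raw_bytes = f.read()
    md5 = c.store_sample(raw_bytes, pcap_file, 'pcap')  # argument order is an assumption, verify against the workbench API
    print 'Stored %s as %s' % (pcap_file, md5)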
In [34]:
# Grab a window of PCAPs from workbench (the last 50 MegaBytes worth in this case)
pcap_md5s = c.get_sample_window('pcap', 50)
print 'Number of PCAPs: %d' % len(pcap_md5s)
In [35]:
# Workbench lets you store sample sets
pcap_set = c.store_sample_set(pcap_md5s)
In [36]:
# Now give us an HTTP graph of all the activity within that window of PCAPs.
# Workbench also has DNS and CONN graphs, but for now we're just interested in HTTP.
c.work_request('pcap_http_graph', pcap_set)
Out[36]:
The HTTP graph has quite a bit of info, but you can see that we've conducted a shortest-path search from all nodes of type 'origin' (any node originating HTTP communications) to any node of type 'file'. In other words, we're particularly interested in all of the various files that got downloaded through our network tap in the last few minutes.
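To make the origin-to-file shortest-path idea a bit more concrete, here's a tiny hedged sketch using NetworkX. This is not workbench's pcap_http_graph code, just the same style of query run against a toy graph with 'origin', 'host', and 'file' node types.
# Toy illustration of the origin --> file shortest-path query (not workbench's actual graph code)
import networkx as nx

G = nx.DiGraph()
G.add_node('192.168.1.10', type='origin')   # a node originating HTTP communications
G.add_node('example.com', type='host')
G.add_node('deadbeef.exe', type='file')     # a file pulled down over HTTP
G.add_edge('192.168.1.10', 'example.com')
G.add_edge('example.com', 'deadbeef.exe')

# Shortest paths from every 'origin' node to every reachable 'file' node
origins = [n for n, d in G.nodes(data=True) if d['type'] == 'origin']
files = [n for n, d in G.nodes(data=True) if d['type'] == 'file']
for o in origins:
    for f in files:
        if nx.has_path(G, o, f):
            print nx.shortest_path(G, o, f)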
In [108]:
# We can also ask workbench for a Python dictionary of all the info from this (50 MB) window of PCAPs,
# because sometimes visualizations are useful and sometimes organized data is useful.
output = c.work_request('view_pcap_details', pcap_set)['view_pcap_details']
output
Out[108]:
In [109]:
# Critical Code: Transition from Bro logs to Pandas Dataframes
# This one line of code populates dataframes from the Bro logs,
# streaming client/server generators, zero-copy, efficient, awesome...
import pandas as pd
dataframes = {name:pd.DataFrame(c.stream_sample(bro_log)) for name, bro_log in output['bro_logs'].iteritems()}
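If you want to peek at what stream_sample is handing back before wrapping it in a DataFrame, you can pull a few rows straight off the stream. This assumes (as the dict comprehension above does) that stream_sample yields one dict per Bro log entry.
# Peek at the first few rows coming off the stream_sample generator
# (assumes each row is a dict, which is what pd.DataFrame() above relies on)
import itertools
http_log_md5 = output['bro_logs']['http_log']
for row in itertools.islice(c.stream_sample(http_log_md5), 3):
    print row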
We're going to use some nice functionality in the Pandas DataFrame to look at our network data; specifically, we're going to group by origin, host, host IP, and MIME type. The last column represents the aggregated sum of response_body_len.
This type of operation is really just scratching the surface of what dataframes can do, so being able to populate a dataframe quickly and efficiently is super awesome.
In [110]:
# Group by origin, host, and responder IP, and show the total response bytes for each response MIME type
group_host = dataframes['http_log'].groupby(['id.orig_h','host','id.resp_h','resp_mime_types'])[['response_body_len']].sum()
group_host.head(100)
Out[110]:
In [111]:
# Now group by host, responder IP, MIME type, and URI to see the individual requests
group_host = dataframes['http_log'].groupby(['host','id.resp_h','resp_mime_types','uri'])[['response_body_len']].sum()
group_host.head(50)
Out[111]:
In [112]:
# Take a look at the Bro weird log
dataframes['weird_log'].head(20)
Out[112]:
In [113]:
# Convert the 'ts' field to an official datetime object
dataframes['http_log']['time'] = pd.to_datetime(dataframes['http_log']['ts'],unit='s')
dataframes['http_log']['time'].head()
Out[113]:
In [114]:
# Explore pivoting and resampling
response_bytes = dataframes['http_log'][['time','resp_mime_types','response_body_len']].copy()  # copy so we don't modify the original frame
response_bytes['response_body_len'] = response_bytes['response_body_len'].astype(int)
print response_bytes.head()
pivot = pd.pivot_table(response_bytes, rows='time', values='response_body_len', cols=['resp_mime_types'], aggfunc=sum)
sampled_bytes = pivot.resample('1Min', how='sum')
sampled_bytes.head()
Out[114]:
In [115]:
# Plotting defaults
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 12.0
plt.rcParams['figure.figsize'] = 12.0, 8.0
In [116]:
# Let's plot it!
sampled_bytes.plot()
Out[116]: