PCAP to Pandas Dataframe

This notebook demonstrates a particularily kewl feature of workbench. Quickly and efficiently going from raw data to a Pandas Dataframe.

Here we're using the workbench server to look at a specific case captured by ThreatGlass. The exploited website for this exercise is gold-xxx.net ThreatGlass_Info.

Tools in this Notebook:

More Info:

Lets start up the workbench server...

Run the workbench server (from somewhere, for the demo we're just going to start a local one)

$ workbench_server

In [5]:
# Lets start to interact with workbench, please note there is NO specific client to workbench,
# Just use the ZeroRPC Python, Node.js, or CLI interfaces.
import zerorpc
c = zerorpc.Client(timeout=120)
c.connect("tcp://127.0.0.1:4242")


Out[5]:
[None]

Read in the Data

The data is pulled from [ThreatGlass](http://www.threatglass.com/), the exploited website for this exercise is gold-xxx.net [ThreatGlass_Info](http://www.threatglass.com/malicious_urls/141deabbc8741175d9f51559cf4ef3dd?process_date=2014-05-29).


In [6]:
# Load in the PCAP file
with open('../data/pcap/gold_xxx.pcap','rb') as f:
    pcap_md5 = c.store_sample(f.read(), 'gold_xxx', 'pcap')

In [10]:
# We can also ask workbench for a python dictionary of all the info from this PCAP,
# because sometimes visualization are useful and sometimes organized data is useful.
output = c.work_request('view_pcap_details', pcap_md5)['view_pcap_details']
output


Out[10]:
{'bro_logs': {'conn_log': 'e6b210abca2299821a31e4448260f5da',
  'dhcp_log': 'cf081f397ae93aaeada91cb68ac86168',
  'dns_log': '3af86fbc1bd125c83160d1f5d0cafa39',
  'files_log': '468dca88929d2ed54b26ff81fe1b1700',
  'http_log': 'b76384cf1c5179bccf64f8320882af94',
  'packet_filter_log': '87676f840e783c2dc537efe51acfc075',
  'weird_log': 'a8d45ed8b0ddb0b9e115d05fe9f65dea'},
 'extracted_files': [{'entropy': 6.845206689967475,
   'file_size': 219136,
   'file_type': 'PE32 executable (GUI) Intel 80386, for MS Windows',
   'md5': '17d786f9a3ac2b54cf29122cd58bdabe',
   'sha256': 'cc5bd99f15d2b2c3153ca132245b2780fac08e66c8ed0dc096919b81beb886b5',
   'ssdeep': '3072:ddZuptT5MSMLp30xUiteu55Cva5xmSnaCOnQe+kAiE7jtMH4jIT9m26zD2FzXunl:dneTSjaxjeu50va5xm2jtcUQR2'},
  {'entropy': 6.845206689967475,
   'file_size': 219136,
   'file_type': 'PE32 executable (GUI) Intel 80386, for MS Windows',
   'md5': '17d786f9a3ac2b54cf29122cd58bdabe',
   'sha256': 'cc5bd99f15d2b2c3153ca132245b2780fac08e66c8ed0dc096919b81beb886b5',
   'ssdeep': '3072:ddZuptT5MSMLp30xUiteu55Cva5xmSnaCOnQe+kAiE7jtMH4jIT9m26zD2FzXunl:dneTSjaxjeu50va5xm2jtcUQR2'},
  {'entropy': 6.672123050634468,
   'file_size': 149,
   'file_type': 'data',
   'md5': '016d482b6cf4fda57240b539e1468794',
   'sha256': 'b5b1c65f25374362658c6c8ffbdf15fa61f7c09890eb30bd9e835906b2fdd9c6',
   'ssdeep': '3:WH9TISb/x9doI+YBU4eqezcW73YOJKVsv/dXN5bgZMlR/73bU4ZEMGOB9:6ISbvV+2bHYn/d95bg2zbDEI7'},
  {'entropy': 3.7725899061892,
   'file_size': 1150,
   'file_type': 'MS Windows icon resource - 1 icon',
   'md5': '8f8e6f2edc6d89b9632d7fa73ee4f5ea',
   'sha256': 'af0db6b81e131069926379a610d770973d24b2be92a99cb5c1a30b5fc4b13e5e',
   'ssdeep': '12:jG28R+X1Zxr94pUfHudkLaJaJaJaJaaJaJaJgEhKqEYax5adqwtmUUlR55n:i28AxrVfHsZUUUXUUVhz1l8UUj'},
  {'entropy': 6.672123050634468,
   'file_size': 149,
   'file_type': 'data',
   'md5': '016d482b6cf4fda57240b539e1468794',
   'sha256': 'b5b1c65f25374362658c6c8ffbdf15fa61f7c09890eb30bd9e835906b2fdd9c6',
   'ssdeep': '3:WH9TISb/x9doI+YBU4eqezcW73YOJKVsv/dXN5bgZMlR/73bU4ZEMGOB9:6ISbvV+2bHYn/d95bg2zbDEI7'},
  {'entropy': 6.845206689967475,
   'file_size': 219136,
   'file_type': 'PE32 executable (GUI) Intel 80386, for MS Windows',
   'md5': '17d786f9a3ac2b54cf29122cd58bdabe',
   'sha256': 'cc5bd99f15d2b2c3153ca132245b2780fac08e66c8ed0dc096919b81beb886b5',
   'ssdeep': '3072:ddZuptT5MSMLp30xUiteu55Cva5xmSnaCOnQe+kAiE7jtMH4jIT9m26zD2FzXunl:dneTSjaxjeu50va5xm2jtcUQR2'},
  {'entropy': 6.845206689967475,
   'file_size': 219136,
   'file_type': 'PE32 executable (GUI) Intel 80386, for MS Windows',
   'md5': '17d786f9a3ac2b54cf29122cd58bdabe',
   'sha256': 'cc5bd99f15d2b2c3153ca132245b2780fac08e66c8ed0dc096919b81beb886b5',
   'ssdeep': '3072:ddZuptT5MSMLp30xUiteu55Cva5xmSnaCOnQe+kAiE7jtMH4jIT9m26zD2FzXunl:dneTSjaxjeu50va5xm2jtcUQR2'},
  {'entropy': 6.845206689967475,
   'file_size': 219136,
   'file_type': 'PE32 executable (GUI) Intel 80386, for MS Windows',
   'md5': '17d786f9a3ac2b54cf29122cd58bdabe',
   'sha256': 'cc5bd99f15d2b2c3153ca132245b2780fac08e66c8ed0dc096919b81beb886b5',
   'ssdeep': '3072:ddZuptT5MSMLp30xUiteu55Cva5xmSnaCOnQe+kAiE7jtMH4jIT9m26zD2FzXunl:dneTSjaxjeu50va5xm2jtcUQR2'},
  {'entropy': 7.306977991950132,
   'file_size': 12761,
   'file_type': 'PDF document, version 1.6',
   'md5': 'b85cb7cee9e145ac4dfb7e8f1870e360',
   'sha256': 'af1f3785ea4a2be08cb13c6b32a3cf71bbe24f50530b591ddb3be9d363d2e6a3',
   'ssdeep': '384:SW3+jGeEBnZgazaw0eUqfgij6aUi0f+xOfESUxDnDKT:fW+h0eho5YxOIFKT'},
  {'entropy': 7.855433934299751,
   'file_size': 10629,
   'file_type': 'Java Jar file data (zip)',
   'md5': 'd4c56b4d0ba9fbbd1028b83401eec133',
   'sha256': '155527017e6dbeec01c1b0a99f9db9dcf7a0b4ea902fb6586b463b2836e481bc',
   'ssdeep': '192:rh3cJzBpxUybGZXwnYzG2/HCtLVl+Wg/+czIPKGgGmnzmIlbA:t3KLFbGdqU6tCWdc8PKUwmf'},
  {'entropy': 6.845206689967475,
   'file_size': 219136,
   'file_type': 'PE32 executable (GUI) Intel 80386, for MS Windows',
   'md5': '17d786f9a3ac2b54cf29122cd58bdabe',
   'sha256': 'cc5bd99f15d2b2c3153ca132245b2780fac08e66c8ed0dc096919b81beb886b5',
   'ssdeep': '3072:ddZuptT5MSMLp30xUiteu55Cva5xmSnaCOnQe+kAiE7jtMH4jIT9m26zD2FzXunl:dneTSjaxjeu50va5xm2jtcUQR2'},
  {'entropy': 4.869185904780487,
   'file_size': 7692,
   'file_type': 'assembler source, ASCII text',
   'md5': '739993fe99fcb74d283f7faa1617984b',
   'sha256': '2882a10573b80d8c6cfe2137160f993523082dd20948cbbfcb2458128d0a3043',
   'ssdeep': '192:/2TFrmh1FVFlrfVWIXXoJAaF7Fe7cFJeS1NM8M8D2f:uTFryFVFlr9WcaF7FeoF0mHm'},
  {'entropy': 6.570728034441538,
   'file_size': 156160,
   'file_type': 'PE32 executable (GUI) Intel 80386, for MS Windows',
   'md5': 'c37f0a7b0249c91a43e749ad3660fb55',
   'sha256': 'e0124e97a99c40b6f48760ab90d2ed009f9b4c661f3e98c86c848b876588c47d',
   'ssdeep': '3072:Dd/1aVzIF8P3h3cJUitGoDp+XSB5mKAKYOnQe+FSDHslun81L:DraVzTP3GJjGoDYXSB5mkHet'},
  {'entropy': 7.855433934299751,
   'file_size': 10629,
   'file_type': 'Java Jar file data (zip)',
   'md5': 'd4c56b4d0ba9fbbd1028b83401eec133',
   'sha256': '155527017e6dbeec01c1b0a99f9db9dcf7a0b4ea902fb6586b463b2836e481bc',
   'ssdeep': '192:rh3cJzBpxUybGZXwnYzG2/HCtLVl+Wg/+czIPKGgGmnzmIlbA:t3KLFbGdqU6tCWdc8PKUwmf'},
  {'entropy': 6.570728034441538,
   'file_size': 156160,
   'file_type': 'PE32 executable (GUI) Intel 80386, for MS Windows',
   'md5': 'c37f0a7b0249c91a43e749ad3660fb55',
   'sha256': 'e0124e97a99c40b6f48760ab90d2ed009f9b4c661f3e98c86c848b876588c47d',
   'ssdeep': '3072:Dd/1aVzIF8P3h3cJUitGoDp+XSB5mKAKYOnQe+FSDHslun81L:DraVzTP3GJjGoDYXSB5mkHet'},
  {'entropy': 7.855433934299751,
   'file_size': 10629,
   'file_type': 'Java Jar file data (zip)',
   'md5': 'd4c56b4d0ba9fbbd1028b83401eec133',
   'sha256': '155527017e6dbeec01c1b0a99f9db9dcf7a0b4ea902fb6586b463b2836e481bc',
   'ssdeep': '192:rh3cJzBpxUybGZXwnYzG2/HCtLVl+Wg/+czIPKGgGmnzmIlbA:t3KLFbGdqU6tCWdc8PKUwmf'}],
 'md5': 'c8e58ff22b9a8e48838373fbb1692bdd'}

If the next line of code doesn't blow your mind, you aren't paying attention!

Thanks to ZeroRPC all of the bro logs are streamed from server to client with NETWORK STREAMING GENERATORS, those highly efficient generators are zero-copy and stream data directly into Pandas Dataframes.

For more on client/server generators and client-contructed/server-executed generator pipelines see our super spiffy Generator Pipelines notebook.


In [11]:
# Critical Code: Transition from Bro logs to Pandas Dataframes
# This one line of code populates dataframes from the Bro logs, 
# streaming client/server generators, zero-copy, efficient, awesome...
import pandas as pd
dataframes = {name:pd.DataFrame(c.stream_sample(bro_log, None)) for name, bro_log in output['bro_logs'].iteritems()}

Lets look at the Data

We're going to use some nice functionality in the Pandas dataframe to look at our network data, specifically we're going to group by host, host-ip, mime_type and uri. The last column represents the aggregated sum of response_body_len.

This type of operation is really just scratching the surface when it comes to dataframes, so quickly and efficiently populating a dataframe is super awesome.


In [20]:
# Look at DNS logs
dataframes['dns_log'][['query','answers','qtype_name']].head(10)


Out[20]:
query answers qtype_name
0 gold-xxx.net 178.208.85.60 A
1 counter.yadro.ru 88.212.196.105,88.212.196.122,88.212.196.123,8... A
2 picsee.net 62.75.207.72 A
3 counter.rambler.ru 81.19.88.95,81.19.88.96,81.19.88.102,81.19.88.... A
4 freeroomhostelz.com 91.194.254.195 A
5 57d9bf9co3qbc.paroonic.ru 80.72.37.112 A
6 323210841-1.paroonic.ru 80.72.37.112 A
7 domainsfullkolls.biz 94.242.216.61 A
8 update.microsoft.com update.microsoft.com.nsatc.net,157.56.96.60,15... A
9 update.microsoft.com update.microsoft.com.nsatc.net,65.55.138.114,1... A

10 rows × 3 columns


In [21]:
# Look at Conn logs
dataframes['conn_log'].head(10)


Out[21]:
conn_state duration history id.orig_h id.orig_p id.resp_h id.resp_p local_orig missed_bytes orig_bytes orig_ip_bytes orig_pkts proto resp_bytes resp_ip_bytes resp_pkts service ts tunnel_parents uid
0 RSTO 0.631127 ShADadR 192.168.39.10 1036 178.208.85.60 80 - 0 587 955 9 tcp 6459 6743 7 http 1.401401e+09 (empty) CBdyLA2FydtnrvzeBi
1 RSTO 0.150052 ShR 192.168.39.10 1037 178.208.85.60 80 - 0 0 88 2 tcp 0 44 1 - 1.401401e+09 (empty) C0gD932Pau36gzSy93
2 RSTO 0.151529 ShR 192.168.39.10 1040 178.208.85.60 80 - 0 0 88 2 tcp 0 44 1 - 1.401401e+09 (empty) CXTGqY2XNUSXRynhFg
3 RSTO 0.152817 ShR 192.168.39.10 1039 178.208.85.60 80 - 0 0 88 2 tcp 0 44 1 - 1.401401e+09 (empty) CNLjfJ1CSChTSRPRO1
4 RSTO 0.154018 ShR 192.168.39.10 1038 178.208.85.60 80 - 0 0 88 2 tcp 0 44 1 - 1.401401e+09 (empty) CWUpVH2eZ0g3HgXe1k
5 RSTO 0.15395 ShR 192.168.39.10 1041 178.208.85.60 80 - 0 0 88 2 tcp 0 44 1 - 1.401401e+09 (empty) C7WF992bGceJEID9k5
6 SF 0.802815 ShADdfFa 192.168.39.10 1051 88.212.196.105 80 - 0 653 901 6 tcp 1621 1829 5 http 1.401401e+09 (empty) C05Lh44T4Od4nxKuPc
7 SF 0.009575 Dd 192.168.39.10 1048 4.2.2.3 53 - 0 28 56 1 udp 44 72 1 dns 1.401401e+09 (empty) CoK5EL2wWO4GXS26D7
8 RSTO 6.797592 ShADadfR 192.168.39.10 1055 91.194.254.195 80 - 0 326 574 6 tcp 801 969 4 http 1.401401e+09 (empty) CiDYwq4jpcz4IGmX0g
9 RSTO 8.367047 ShADdfR 192.168.39.10 1054 81.19.88.95 80 - 0 799 1287 12 tcp 7618 7946 8 http 1.401401e+09 (empty) CDR6w61LKf7XpsHC36

10 rows × 20 columns


In [24]:
# Simple Stats with Pandas Dataframe
dataframes['conn_log'][['missed_bytes','orig_ip_bytes','resp_ip_bytes','resp_pkts']].describe()


Out[24]:
missed_bytes orig_ip_bytes resp_ip_bytes resp_pkts
count 65 65.000000 65.000000 65.000000
mean 0 1221.769231 39791.953846 17.107692
std 0 1685.608669 74050.375527 27.989464
min 0 48.000000 0.000000 0.000000
25% 0 63.000000 79.000000 1.000000
50% 0 464.000000 365.000000 4.000000
75% 0 1600.000000 51482.000000 18.000000
max 0 5312.000000 223664.000000 106.000000

8 rows × 4 columns


In [64]:
# Simple Filtering with Pandas Dataframe
not_80_df = dataframes['conn_log'][dataframes['conn_log']['id.resp_p'] != 80]
not_80_df.head(10)


Out[64]:
conn_state duration history id.orig_h id.orig_p id.resp_h id.resp_p local_orig missed_bytes orig_bytes orig_ip_bytes orig_pkts proto resp_bytes resp_ip_bytes resp_pkts service ts tunnel_parents uid
7 SF 0.009575 Dd 192.168.39.10 1048 4.2.2.3 53 - 0 28 56 1 udp 44 72 1 dns 1.401401e+09 (empty) CoK5EL2wWO4GXS26D7 ...
11 SF 11.55683 Dd 192.168.39.10 1035 4.2.2.3 53 - 0 221 389 6 udp 669 837 6 dns 1.401401e+09 (empty) CBWgV42cYiaGqu9Z25 ...
12 S0 0.000911 D 0.0.0.0 68 255.255.255.255 67 - 0 626 682 2 udp 0 0 0 dhcp 1.401401e+09 (empty) CKYJGa15i6s1Rpq0Y ...
13 SF 2.110374 dD 192.168.39.10 68 192.168.39.1 67 - 0 314 342 1 udp 900 984 3 dhcp 1.401401e+09 (empty) C6zKAr3iKNDLKufq4h ...
15 SF 0.027165 Dd 192.168.39.10 1035 4.2.2.3 53 - 0 38 66 1 udp 54 82 1 dns 1.401401e+09 (empty) C50uAqkg2gpAF0eH5 ...
26 S0 6.024315 D 192.168.39.10 1058 239.255.255.250 1900 - 0 399 483 3 udp 0 0 0 - 1.401401e+09 (empty) CAirC04eZatVhqLFvc ...
41 SF 0.008703 Dd 192.168.39.10 1035 4.2.2.3 53 - 0 38 66 1 udp 114 142 1 dns 1.401401e+09 (empty) Cjnp9B3tRdimStImlc ...
42 SF 0.009521 Dd 192.168.39.10 1048 4.2.2.3 53 - 0 38 66 1 udp 114 142 1 dns 1.401401e+09 (empty) Cl5qe21lfeB8wXEZg4 ...
43 SF 0.021553 Dd 192.168.39.10 1066 8.8.4.4 53 - 0 35 63 1 udp 51 79 1 dns 1.401401e+09 (empty) Cn9DTl3yp91dxD8hFl ...
44 SF 0.001862 Dd 192.168.39.10 1068 8.8.4.4 53 - 0 35 63 1 udp 51 79 1 dns 1.401401e+09 (empty) Cy7sm63NXzJ1XuxTpj ...

10 rows × 21 columns


In [66]:
# Now we group by host and show the different response mime types for each host
group_host = dataframes['http_log'].groupby(['host','id.resp_h','resp_mime_types','uri'])[['response_body_len']].sum()
group_host


Out[66]:
response_body_len
host id.resp_h resp_mime_types uri
323210841-1.paroonic.ru 80.72.37.112 application/jar /1401379560.jar 31887
application/pdf /1401379560.pdf 12761
application/x-dosexec /f/1401379560/5/x00cf6b534e520804090407000700080150050f0304045106565601;1;5 156160
/f/1401379560/6 156160
text/html /1401379560.htm 21138
57d9bf9co3qbc.paroonic.ru 80.72.37.112 text/html / 89366
counter.rambler.ru 81.19.88.95 image/gif /top100.scn?2148353&rn=2061653628&v=0.3i&bs=780x427&ce=1&rf&en=windows-1251&pt=Download porn free clips, sex porn video&cd=24-bit&sr=800x600&la=en-us&ja=1&acn=Mozilla&an=Microsoft Internet Explorer&pl=Win32&tz=240&fv=10.0 r32&sv&le=1 49
text/plain /top100.jcn?2148353 6853
counter.yadro.ru 88.212.196.105 image/gif /hit?q;t28.6;r;s800*600*24;uhttp://gold-xxx.net/;0.6712898022427673 763
text/html /hit?t28.6;r;s800*600*24;uhttp://gold-xxx.net/;0.6712898022427673 32
freeroomhostelz.com 91.194.254.195 text/html / 413
gold-xxx.net 178.208.85.60 - /css/engine.css 0
image/gif /templates/xxibeka/images/all_bg.png 2247
/templates/xxibeka/images/col-mid-r.gif 107
/templates/xxibeka/images/col-top-r.gif 1122
/templates/xxibeka/images/news.gif 1416
/templates/xxibeka/images/search.gif 254
image/jpeg /templates/xxibeka/images/content.jpg 1631
/templates/xxibeka/images/header.jpg 3630
/templates/xxibeka/images/logo.jpg 47378
/uploads/posts/2014-05/1401395470_eavlolprh2nywzf.jpeg 34600
/uploads/posts/2014-05/1401395923_8z4umchk5zwstxw.jpeg 34866
/uploads/posts/2014-05/1401396326_clybb9irt17pygm.jpeg 31199
/uploads/posts/2014-05/1401396909_n8xfse70ugli1cz.jpeg 51466
/uploads/posts/2014-05/1401397123_avgnhrqeylibgyk.jpeg 49333
/uploads/posts/2014-05/1401397541_qbdkzupvm18lgre.jpeg 36246
/uploads/posts/2014-05/1401397822_03ifpwcqtqmz65r.jpeg 37594
/uploads/posts/2014-05/1401398230_zkihngphpwvsuxf.jpeg 36792
/uploads/posts/2014-05/1401400046_jpfy4mfyskolthi.jpeg 38434
/uploads/posts/2014-05/1401400094_aiax6go3xlpkr8d.jpeg 44885
image/png /templates/xxibeka/images/nav.png 184
/templates/xxibeka/images/send.png 3826
image/x-icon /favicon.ico 1150
text/html / 30992
/css/engine.css 290
/css/site.css 288
/css/style.css 289
/engine/classes/js/jquery.js 94840
/engine/classes/js/js_edit.js 11012
/templates/xxibeka/css/template_css.css 314
/templates/xxibeka/images/h1bg.gif 309
/templates/xxibeka/images/m-bottom.gife 314
/templates/xxibeka/images/m-center.gife 314
/templates/xxibeka/images/m-top.gife 311
text/plain /engine/classes/js/dialog.js 47054
/engine/classes/js/dle_ajax.js 4931
/engine/classes/js/effects.js 13628
/engine/classes/js/menu.js 2992
/templates/xxibeka/css/site.css 595
/templates/xxibeka/css/style.css 13241
text/x-asm /templates/xxibeka/css/engine.css 7692
newsbrontima.com 192.64.115.91 - /5jeno9e6lbsffl 0
text/plain /epb4y7viha3 14
online-serial.net 178.208.83.15 application/x-dosexec /lok2.exe 438272
picsee.net 62.75.207.72 image/jpeg /upload/2014-05-30/258f383e1712.jpeg 61906
/upload/2014-05-30/552a444e0929.jpeg 25267
/upload/2014-05-30/8db84be028a0.jpeg 62012
/upload/2014-05-30/a906f803f653.jpeg 71508
/upload/2014-05-30/ffb0f9075f26.jpeg 65423
t2bot.ru 178.208.83.55 application/x-dosexec /lok2.exe 876544
...

62 rows × 1 columns


In [75]:
# Plotting defaults
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 12.0
plt.rcParams['figure.figsize'] = 15.0, 8.0

In [82]:
# Plot hosts and mime-types
plot_df = dataframes['http_log'].groupby(['host','resp_mime_types'])[['response_body_len']].sum().unstack()
plot_df['response_body_len'].plot(kind='bar', stacked=True)
plt.xlabel('Domain')
plt.ylabel('Response Bytes')
plt.xticks(rotation=45, ha='right')


Out[82]:
(array([  0.5,   1.5,   2.5,   3.5,   4.5,   5.5,   6.5,   7.5,   8.5,
          9.5,  10.5]), <a list of 11 Text xticklabel objects>)

Wrap Up

Well for this short notebook we used 6-7 lines of python to go from PCAP file to a Pandas Dataframe and then we did a bunch of kewl stuff using the Dataframe. We hope this exercise showed some neato functionality using Workbench, we encourage you to check out the GitHub repository and our other notebooks: