Visual GPU Log Analytics Part I: CPU Baseline in Python Pandas

Graphistry is great -- Graphistry plus RAPIDS/BlazingDB is even better!

This tutorial series visually analyzes Zeek/Bro network connection logs using different compute engines.

Part I Contents:

Time a full ETL & visual analysis flow using CPU-based Python Pandas and Graphistry:

  1. Load data
  2. Analyze data
  3. Visualize data

In [1]:
#!pip install graphistry -q


import pandas as pd

import graphistry
#graphistry.register(key='MY_KEY', protocol='https', server='graphistry.site.com')
graphistry.__version__


Out[1]:
'0.9.64'

1. Load data


In [2]:
%%time
!curl https://www.secrepo.com/maccdc2012/conn.log.gz | gzip -d > conn.log
  
!head -n 3 conn.log


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  523M  100  523M    0     0  18.8M      0  0:00:27  0:00:27 --:--:-- 6100k
1331901000.000000	CCUIP21wTjqkj8ZqX5	192.168.202.79	50463	192.168.229.251	80	tcp	-	-	-	-	SH	-	0	Fa	1	52	1	52	(empty)
1331901000.000000	Csssjd3tX0yOTPDpng	192.168.202.79	46117	192.168.229.254	443	tcp	-	-	-	-	SF	-	0	dDafFr	3	382	9	994	(empty)
1331901000.000000	CHEt7z3AzG4gyCNgci	192.168.202.79	50465	192.168.229.251	80	tcp	http	0.010000	166	214	SF	-	0	ShADfFa	4	382	3	382	(empty)
CPU times: user 884 ms, sys: 261 ms, total: 1.15 s
Wall time: 45.4 s

In [3]:
# OPTIONAL: For slow devices, work on a ~5% subset (every 20th line); see the pandas-only sketch after this cell
#!awk 'NR % 20 == 0' < conn.log > conn-5pc.log
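An equivalent pandas-only option (a sketch, assuming the same every-20th-row sampling) is a skiprows callable passed to the read_csv call in the next cell:

def skip_unless_every_20th(i):
    return i % 20 != 0   # True = skip this row; keeps roughly 5% of the file

# Pass skiprows=skip_unless_every_20th to the pd.read_csv call below.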

In [4]:
df = pd.read_csv("./conn.log", sep="\t", header=None, 
                 names=["time", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto", "service",
                        "duration", "orig_bytes", "resp_bytes", "conn_state", "local_orig", "missed_bytes",
                        "history", "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes", "tunnel_parents"], 
                 na_values=['-'], index_col=False)
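Zeek's time field is Unix epoch seconds; for quick spot checks (optional, not used by the rest of the flow), it can be converted to datetimes:

# Optional: human-readable timestamps for inspection only
pd.to_datetime(df['time'], unit='s').head(3)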

In [7]:
df.sample(3)


Out[7]:
time uid id.orig_h id.orig_p id.resp_h id.resp_p proto service duration orig_bytes resp_bytes conn_state local_orig missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents
10127897 1.331923e+09 Cv78PO2bZXJQxhLbZe 192.168.202.110 42526 192.168.23.102 2536 tcp NaN NaN NaN NaN S0 NaN 0 S 1 48 0 0 (empty)
18350474 1.332000e+09 CqdYUa2rK5yxuRR2A 192.168.202.144 59204 192.168.25.103 8800 tcp NaN NaN NaN NaN S0 NaN 0 S 1 44 0 0 (empty)
9733712 1.331922e+09 CS6JX63EdG82I2AEN9 192.168.202.110 52576 192.168.21.100 1321 tcp NaN 0.01 0.0 0.0 REJ NaN 0 Sr 1 48 1 40 (empty)

2. Analyze data

Summarize network activity between every communicating src/dst IP pair, split by connection state.


In [8]:
# Per (src IP, dst IP, connection state): counts, distinct ports & sessions, and duration/byte stats
df_summary = df\
.assign(
    sum_bytes=df.apply(lambda row: row['orig_bytes'] + row['resp_bytes'], axis=1))\
.groupby(['id.orig_h', 'id.resp_h', 'conn_state'])\
.agg({
    'time': ['min', 'max', 'size'],
    'id.resp_p':  ['nunique'],
    'uid': ['nunique'],
    'duration':   ['min', 'max', 'mean'],
    'orig_bytes': ['min', 'max', 'sum', 'mean'],
    'resp_bytes': ['min', 'max', 'sum', 'mean'],
    'sum_bytes':  ['min', 'max', 'sum', 'mean']
}).reset_index()
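Aside: the row-wise apply above is the slowest step in this cell. If you only want the result (rather than the CPU-baseline timing), a vectorized column sum is an equivalent sketch:

# Vectorized equivalent of the row-wise lambda: NaN wherever either side is missing
sum_bytes = df['orig_bytes'] + df['resp_bytes']

Substituting this expression into the assign call above yields the same df_summary.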

In [10]:
# Flatten the MultiIndex columns produced by .agg(), e.g. ('time', 'min') -> 'time min'
df_summary.columns = [' '.join(col).strip() for col in df_summary.columns.values]
df_summary = df_summary\
.rename(columns={'time size': 'count'})\
.assign(
    conn_state_uid=df_summary.apply(lambda row: row['id.orig_h'] + '_' + row['id.resp_h'] + '_' + row['conn_state'], axis=1))
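Similarly, the conn_state_uid key can be built with vectorized string concatenation instead of a row-wise apply (an equivalent sketch):

# Vectorized equivalent of the apply above; the grouping keys are never NaN here
conn_state_uid = df_summary['id.orig_h'] + '_' + df_summary['id.resp_h'] + '_' + df_summary['conn_state']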

In [11]:
print('# rows', len(df_summary))
df_summary.sample(3)


# rows 79829
Out[11]:
id.orig_h id.resp_h conn_state time min time max count id.resp_p nunique uid nunique duration min duration max ... orig_bytes mean resp_bytes min resp_bytes max resp_bytes sum resp_bytes mean sum_bytes min sum_bytes max sum_bytes sum sum_bytes mean conn_state_uid
13480 192.168.202.110 192.168.229.25 S0 1.331909e+09 1.331922e+09 32 28 32 NaN NaN ... NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN 192.168.202.110_192.168.229.25_S0
48440 192.168.202.44 192.168.25.100 REJ 1.332000e+09 1.332015e+09 5 1 5 NaN NaN ... NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN 192.168.202.44_192.168.25.100_REJ
51894 192.168.202.79 172.16.2.102 OTH 1.331920e+09 1.331920e+09 4 3 4 0.1 1.13 ... 12.0 0.0 0.0 0.0 0.0 0.0 24.0 24.0 12.0 192.168.202.79_172.16.2.102_OTH

3 rows × 24 columns
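Before visualizing, an optional way to eyeball the heaviest talkers is to sort the flattened summary, for example by total bytes (a sketch using the 'sum_bytes sum' column produced above):

# Top src/dst/conn_state rows by total bytes exchanged
df_summary.sort_values('sum_bytes sum', ascending=False).head(5)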

3. Visualize data

  • Nodes:
    • IPs
    • Sized larger when more sessions (split by connection state) involve them
  • Edges:
    • src_ip -> dest_ip, split by connection state

In [12]:
hg = graphistry.hypergraph(
    df_summary,
    ['id.orig_h', 'id.resp_h'],
    direct=True,
    opts={
        'CATEGORIES': {
            'ip': ['id.orig_h', 'id.resp_h']
        }
    })


# links 79829
# events 79829
# attrib entities 5556
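The hypergraph() call returns a dict of results. Besides the plottable 'graph', it includes the generated node and edge tables; the key names below are assumptions based on the printed summary above, so check hg.keys() on your version:

# Inspect the generated tables (key names assumed; verify with hg.keys())
print(hg.keys())        # typically: 'entities', 'events', 'edges', 'nodes', 'graph'
hg['edges'].sample(3)   # one edge per summarized src -> dst connection-state row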

In [13]:
hg['graph'].plot()


Out[13]:
(interactive Graphistry visualization renders here)