Visual GPU Log Analytics Part I: CPU Baseline in Python Pandas

Graphistry is great -- Graphistry plus RAPIDS/BlazingDB is even better!

This tutorial series visually analyzes Zeek/Bro network connection logs using different compute engines.

Part I Contents:

Time a full ETL & visual analysis flow using CPU-based Python Pandas and Graphistry:

  1. Load data
  2. Analyze data
  3. Visualize data

In [1]:
#!pip install graphistry -q


import pandas as pd

import graphistry
#graphistry.register(key='MY_KEY', protocol='https', server='graphistry.site.com')
graphistry.__version__


Out[1]:
'0.9.64'

1. Load data


In [2]:
%%time
!curl https://www.secrepo.com/maccdc2012/conn.log.gz | gzip -d > conn.log
  
!head -n 3 conn.log


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  523M  100  523M    0     0  18.8M      0  0:00:27  0:00:27 --:--:-- 6100k
1331901000.000000	CCUIP21wTjqkj8ZqX5	192.168.202.79	50463	192.168.229.251	80	tcp	-	-	-	-	SH	-	0	Fa	1	52	1	52	(empty)
1331901000.000000	Csssjd3tX0yOTPDpng	192.168.202.79	46117	192.168.229.254	443	tcp	-	-	-	-	SF	-	0	dDafFr	3	382	9	994	(empty)
1331901000.000000	CHEt7z3AzG4gyCNgci	192.168.202.79	50465	192.168.229.251	80	tcp	http	0.010000	166	214	SF	-	0	ShADfFa	4	382	3	382	(empty)
CPU times: user 884 ms, sys: 261 ms, total: 1.15 s
Wall time: 45.4 s

In [3]:
# OPTIONAL: For slow devices, work on a ~5% subset (every 20th line); see the pandas-only sketch after this cell
#!awk 'NR % 20 == 0' < conn.log > conn-5pc.log
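An equivalent pandas-only option (a sketch, assuming the same every-20th-row sampling) is a skiprows callable passed to the read_csv call in the next cell:

def skip_unless_every_20th(i):
    return i % 20 != 0   # True = skip this row; keeps roughly 5% of the file

# Pass skiprows=skip_unless_every_20th to the pd.read_csv call below.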

In [4]:
df = pd.read_csv("./conn.log", sep="\t", header=None, 
                 names=["time", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto", "service",
                        "duration", "orig_bytes", "resp_bytes", "conn_state", "local_orig", "missed_bytes",
                        "history", "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes", "tunnel_parents"], 
                 na_values=['-'], index_col=False)
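Zeek's time field is Unix epoch seconds; for quick spot checks (optional, not used by the rest of the flow), it can be converted to datetimes:

# Optional: human-readable timestamps for inspection only
pd.to_datetime(df['time'], unit='s').head(3)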

In [7]:
df.sample(3)


Out[7]:
time uid id.orig_h id.orig_p id.resp_h id.resp_p proto service duration orig_bytes resp_bytes conn_state local_orig missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents
10127897 1.331923e+09 Cv78PO2bZXJQxhLbZe 192.168.202.110 42526 192.168.23.102 2536 tcp NaN NaN NaN NaN S0 NaN 0 S 1 48 0 0 (empty)
18350474 1.332000e+09 CqdYUa2rK5yxuRR2A 192.168.202.144 59204 192.168.25.103 8800 tcp NaN NaN NaN NaN S0 NaN 0 S 1 44 0 0 (empty)
9733712 1.331922e+09 CS6JX63EdG82I2AEN9 192.168.202.110 52576 192.168.21.100 1321 tcp NaN 0.01 0.0 0.0 REJ NaN 0 Sr 1 48 1 40 (empty)

2. Analyze data

Summarize network activity between every communicating src/dst IP pair, split by connection state.


In [8]:
# Per (src IP, dst IP, connection state): counts, distinct ports & sessions, and duration/byte stats
df_summary = df\
.assign(
    sum_bytes=df.apply(lambda row: row['orig_bytes'] + row['resp_bytes'], axis=1))\
.groupby(['id.orig_h', 'id.resp_h', 'conn_state'])\
.agg({
    'time': ['min', 'max', 'size'],
    'id.resp_p':  ['nunique'],
    'uid': ['nunique'],
    'duration':   ['min', 'max', 'mean'],
    'orig_bytes': ['min', 'max', 'sum', 'mean'],
    'resp_bytes': ['min', 'max', 'sum', 'mean'],
    'sum_bytes':  ['min', 'max', 'sum', 'mean']
}).reset_index()
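Aside: the row-wise apply above is the slowest step in this cell. If you only want the result (rather than the CPU-baseline timing), a vectorized column sum is an equivalent sketch:

# Vectorized equivalent of the row-wise lambda: NaN wherever either side is missing
sum_bytes = df['orig_bytes'] + df['resp_bytes']

Substituting this expression into the assign call above yields the same df_summary.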

In [10]:
# Flatten the MultiIndex columns produced by .agg(), e.g. ('time', 'min') -> 'time min'
df_summary.columns = [' '.join(col).strip() for col in df_summary.columns.values]
df_summary = df_summary\
.rename(columns={'time size': 'count'})\
.assign(
    conn_state_uid=df_summary.apply(lambda row: row['id.orig_h'] + '_' + row['id.resp_h'] + '_' + row['conn_state'], axis=1))
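Similarly, the conn_state_uid key can be built with vectorized string concatenation instead of a row-wise apply (an equivalent sketch):

# Vectorized equivalent of the apply above; the grouping keys are never NaN here
conn_state_uid = df_summary['id.orig_h'] + '_' + df_summary['id.resp_h'] + '_' + df_summary['conn_state']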

In [11]:
print('# rows', len(df_summary))
df_summary.sample(3)


# rows 79829
Out[11]:
id.orig_h id.resp_h conn_state time min time max count id.resp_p nunique uid nunique duration min duration max ... orig_bytes mean resp_bytes min resp_bytes max resp_bytes sum resp_bytes mean sum_bytes min sum_bytes max sum_bytes sum sum_bytes mean conn_state_uid
13480 192.168.202.110 192.168.229.25 S0 1.331909e+09 1.331922e+09 32 28 32 NaN NaN ... NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN 192.168.202.110_192.168.229.25_S0
48440 192.168.202.44 192.168.25.100 REJ 1.332000e+09 1.332015e+09 5 1 5 NaN NaN ... NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN 192.168.202.44_192.168.25.100_REJ
51894 192.168.202.79 172.16.2.102 OTH 1.331920e+09 1.331920e+09 4 3 4 0.1 1.13 ... 12.0 0.0 0.0 0.0 0.0 0.0 24.0 24.0 12.0 192.168.202.79_172.16.2.102_OTH

3 rows × 24 columns
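Before visualizing, an optional way to eyeball the heaviest talkers is to sort the flattened summary, for example by total bytes (a sketch using the 'sum_bytes sum' column produced above):

# Top src/dst/conn_state rows by total bytes exchanged
df_summary.sort_values('sum_bytes sum', ascending=False).head(5)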

3. Visualize data

  • Nodes:
    • IPs
    • Sized larger when more sessions (split by connection state) involve them
  • Edges:
    • src_ip -> dest_ip, split by connection state

In [12]:
hg = graphistry.hypergraph(
    df_summary,
    ['id.orig_h', 'id.resp_h'],
    direct=True,
    opts={
        'CATEGORIES': {
            'ip': ['id.orig_h', 'id.resp_h']
        }
    })


# links 79829
# events 79829
# attrib entities 5556
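The hypergraph() call returns a dict of results. Besides the plottable 'graph', it includes the generated node and edge tables; the key names below are assumptions based on the printed summary above, so check hg.keys() on your version:

# Inspect the generated tables (key names assumed; verify with hg.keys())
print(hg.keys())        # typically: 'entities', 'events', 'edges', 'nodes', 'graph'
hg['edges'].sample(3)   # one edge per summarized src -> dst connection-state row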

In [13]:
hg['graph'].plot()


Out[13]:
(interactive Graphistry visualization renders here)