Graphistry is great -- Graphistry and RAPIDS/BlazingDB are better!
This tutorial series visually analyzes Zeek/Bro network connection logs using different compute engines:
Part I Contents:
Time a full ETL and visual analysis flow using CPU-based Python pandas and Graphistry:
In [1]:
#!pip install graphistry -q
import pandas as pd
import graphistry
#graphistry.register(key='MY_KEY', protocol='https', server='graphistry.site.com')
graphistry.__version__
Out[1]:
In [2]:
%%time
!curl https://www.secrepo.com/maccdc2012/conn.log.gz | gzip -d > conn.log
!head -n 3 conn.log
In [3]:
# OPTIONAL: For slow devices, work on a subset
#!awk 'NR % 20 == 0' < conn.log > conn-5pc.log
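For readers without `awk`, the optional subsetting step above can be sketched in plain Python. The helper name `sample_every_nth` is hypothetical and not part of the original notebook; it keeps every n-th line, counting from 1, which mirrors `NR % 20 == 0`.

```python
import itertools

def sample_every_nth(lines, n=20):
    """Yield every n-th line (1-based), like awk 'NR % n == 0'."""
    # islice starts at index n-1 (the n-th line) and then steps by n.
    return itertools.islice(lines, n - 1, None, n)

# Small in-memory demonstration with made-up rows.
demo_lines = [f"row{i}\n" for i in range(1, 101)]
subset = list(sample_every_nth(demo_lines, 20))

# Applied to the downloaded log, it would look like this (file names as in the awk command):
# with open("conn.log") as src, open("conn-5pc.log", "w") as dst:
#     dst.writelines(sample_every_nth(src, 20))
```

Because `islice` consumes the input lazily, the full log never has to fit in memory.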
In [4]:
df = pd.read_csv("./conn.log", sep="\t", header=None,
                 names=["time", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto", "service",
                        "duration", "orig_bytes", "resp_bytes", "conn_state", "local_orig", "missed_bytes",
                        "history", "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes", "tunnel_parents"],
                 na_values=['-'], index_col=False)
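To see what `na_values=['-']` does in the load above, here is a minimal sketch on two made-up rows with a reduced, illustrative set of columns (the sample values are invented, not taken from the real log). It also assumes the `time` column holds epoch seconds, as Zeek timestamps normally do.

```python
import io
import pandas as pd

# Two invented, tab-separated rows; "-" marks an unset field.
sample = (
    "1331901000.0\tC1\t192.168.0.1\t192.168.0.2\ttcp\t-\t-\n"
    "1331901001.5\tC2\t192.168.0.3\t192.168.0.2\tudp\t120\t64\n"
)
cols = ["time", "uid", "id.orig_h", "id.resp_h", "proto", "orig_bytes", "resp_bytes"]

demo = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=cols,
                   na_values=["-"], index_col=False)

# "-" was parsed as NaN, so the byte columns are numeric instead of strings.
print(demo[["orig_bytes", "resp_bytes"]].dtypes)

# Assumption: "time" is epoch seconds, so it can be converted for readability.
demo["timestamp"] = pd.to_datetime(demo["time"], unit="s")
print(demo[["time", "timestamp"]])
```

Without `na_values`, the `-` placeholders would force those columns to the `object` dtype and break the numeric aggregations later on.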
In [7]:
df.sample(3)
Out[7]:
In [8]:
df_summary = df\
    .assign(
        sum_bytes=df.apply(lambda row: row['orig_bytes'] + row['resp_bytes'], axis=1))\
    .groupby(['id.orig_h', 'id.resp_h', 'conn_state'])\
    .agg({
        'time': ['min', 'max', 'size'],
        'id.resp_p': ['nunique'],
        'uid': ['nunique'],
        'duration': ['min', 'max', 'mean'],
        'orig_bytes': ['min', 'max', 'sum', 'mean'],
        'resp_bytes': ['min', 'max', 'sum', 'mean'],
        'sum_bytes': ['min', 'max', 'sum', 'mean']
    }).reset_index()
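The `sum_bytes` column above is computed with a row-wise `apply`, which calls a Python function once per row. As a hedged aside, the same values can usually be obtained with a vectorized column addition. The sketch below checks the equivalence on a tiny made-up frame, including rows with missing values. It is an alternative for comparison, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Invented values; NaN stands in for fields that were "-" in the log.
demo = pd.DataFrame({
    "orig_bytes": [10.0, np.nan, 5.0],
    "resp_bytes": [1.0, 2.0, np.nan],
})

# Row-wise version, as in the notebook cell.
row_wise = demo.apply(lambda row: row["orig_bytes"] + row["resp_bytes"], axis=1)

# Vectorized version: adds whole columns at once; NaN propagates the same way.
vectorized = demo["orig_bytes"] + demo["resp_bytes"]

print(row_wise.equals(vectorized))
```

Swapping in the vectorized form would change the CPU timings measured in this part, so keep the original if the goal is to reproduce them.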
In [10]:
df_summary.columns = [' '.join(col).strip() for col in df_summary.columns.values]
df_summary = df_summary\
    .rename(columns={'time size': 'count'})\
    .assign(
        conn_state_uid=df_summary.apply(lambda row: row['id.orig_h'] + '_' + row['id.resp_h'] + '_' + row['conn_state'], axis=1))
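The cell above flattens the two-level column index produced by `.agg` and then builds a string key per row. The following sketch replays both steps on a small invented frame so the resulting column names are visible. It builds the key with vectorized string concatenation instead of `apply`, which is a swapped-in technique rather than what the notebook does.

```python
import pandas as pd

# Invented connections: two from the same pair with the same state, one different.
demo = pd.DataFrame({
    "id.orig_h": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "id.resp_h": ["10.0.0.9", "10.0.0.9", "10.0.0.9"],
    "conn_state": ["SF", "SF", "S0"],
    "time": [1.0, 2.0, 3.0],
})

summary = demo\
    .groupby(["id.orig_h", "id.resp_h", "conn_state"])\
    .agg({"time": ["min", "size"]})\
    .reset_index()

# Columns are tuples such as ('time', 'min') and ('id.orig_h', '');
# joining and stripping gives 'time min' and 'id.orig_h'.
summary.columns = [" ".join(col).strip() for col in summary.columns.values]
summary = summary.rename(columns={"time size": "count"})

# Vectorized string concatenation in place of a row-wise apply.
summary["conn_state_uid"] = (
    summary["id.orig_h"] + "_" + summary["id.resp_h"] + "_" + summary["conn_state"]
)
print(summary)
```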
In [11]:
print('# rows', len(df_summary))
df_summary.sample(3)
Out[11]:
In [12]:
hg = graphistry.hypergraph(
    df_summary,
    ['id.orig_h', 'id.resp_h'],
    direct=True,
    opts={
        'CATEGORIES': {
            'ip': ['id.orig_h', 'id.resp_h']
        }
    })
In [13]:
hg['graph'].plot()
Out[13]: